diff --git a/.github/ISSUE_TEMPLATE/---feature-request-.md b/.github/ISSUE_TEMPLATE/---feature-request-.md
new file mode 100644
index 0000000000000000000000000000000000000000..57708855dce4fcc81e719c59082a8a42415eba47
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/---feature-request-.md
@@ -0,0 +1,27 @@
+---
+name: 建议(Feature request)
+about: 您可以提出您的建议。 You could use this template for reporting a suggestion issue.
+
+---
+
+欢迎您对PaddlePaddle提出建议,非常感谢您对PaddlePaddle的贡献!
+在留下您的建议时,辛苦您同步提供如下信息:
+- 版本、环境信息
+1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1
+2)CPU/GPU:您是否使用GPU进行训练,如是,请提供您的CUDA和cuDNN版本号
+3)系统环境:请您描述系统类型、版本,例如Mac OS 10.14
+- 复现信息:如为报错,请给出复现环境、复现步骤
+- 建议描述:请您详细描述,您认为需优化的功能
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issues on GitHub in case a similar issue was submitted or resolved before.
+Please make sure that this is a feature request.
+**System information**
+-PaddlePaddle version (e.g. 1.1) or CommitID
+-CPU: including CPU MKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/cuDNN version
+-OS Platform (e.g. Mac OS 10.14)
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe the feature and the current behavior/state.**
+**Any other info.**
diff --git a/.github/ISSUE_TEMPLATE/---inference-issue-.md b/.github/ISSUE_TEMPLATE/---inference-issue-.md
new file mode 100644
index 0000000000000000000000000000000000000000..37bdc8889e2722dda964ba82c2ac36cef5e60110
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/---inference-issue-.md
@@ -0,0 +1,40 @@
+---
+name: 预测(Inference Issue)
+about: 您可以提问预测中报错、应用等问题。 You could use this template for reporting an inference issue.
+
+---
+
+为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
+
+如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息:
+- 标题:简洁、精准描述您的问题,例如“最新预测库的API文档在哪儿”
+- 版本、环境信息:
+    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号(如1.1)或CommitID
+    2)CPU:预测若用CPU,请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库使用情况
+    3)GPU:预测若用GPU,请提供GPU型号、CUDA和CUDNN版本号
+    4)系统环境:请您描述系统类型、版本(如Mac OS 10.14),Python版本
+-预测信息
+    1)C++预测:请您提供预测库安装包的版本信息,及其中的version.txt文件
+    2)CMake包含路径的完整命令
+    3)API信息(如调用请提供)
+    4)预测库来源:官网下载/特殊环境(如BCLOUD编译)
+- 复现信息:如为报错,请给出复现环境、复现步骤
+- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issues on GitHub in case a similar issue was submitted or resolved before.
+If there is no solution, please make sure that this is an inference issue including the following details:
+**System information**
+-PaddlePaddle version (e.g. 1.1) or CommitID
+-CPU: including CPU MKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/cuDNN version
+-OS Platform (e.g. Mac OS 10.14)
+-Python version
+-CMake command
+-C++ version.txt
+-API information
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe your current behavior**
+**Code to reproduce the issue**
+**Other info / logs**
diff --git a/.github/ISSUE_TEMPLATE/---installation-issue-.md b/.github/ISSUE_TEMPLATE/---installation-issue-.md
new file mode 100644
index 0000000000000000000000000000000000000000..ce4ba589324673baa4aa39760bcdcd66ecfdd36b
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/---installation-issue-.md
@@ -0,0 +1,40 @@
+---
+name: 安装(Installation Issue)
+about: 您可以提问安装、编译出现报错等问题。 You could use this template for reporting an installation
+  issue.
+
+---
+
+为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
+
+建立issue时,为快速解决问题,请您根据使用情况给出如下信息:
+- 标题:请包含关键词“安装错误”/“编译错误”,例如“Mac编译错误”
+- 版本、环境信息:
+    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号(如1.1)或CommitID
+    2)CPU:请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库的使用情况
+    3)GPU:请提供GPU型号,CUDA和CUDNN版本号
+    4)系统环境:请说明系统类型、版本(如Mac OS 10.14)、Python版本
+- 安装方式信息:
+1)pip安装/docker安装
+2)本地编译:请提供cmake命令,编译命令
+3)docker编译:请提供docker镜像,编译命令
+  特殊环境请注明:如离线安装等
+- 复现信息:如为报错,请给出复现环境、复现步骤
+- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issues on GitHub in case a similar issue was submitted or resolved before.
+If there is no solution, please make sure that this is an installation issue including the following details:
+**System information**
+-PaddlePaddle version (e.g. 1.1) or CommitID
+-CPU: including CPU MKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/cuDNN version
+-OS Platform (e.g. Mac OS 10.14)
+-Python version
+- Install method: pip install/install with docker/build from source (without docker)/build within docker
+- Other special cases that you think may be related to this problem, e.g. offline install, special internet condition
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe your current behavior**
+**Code to reproduce the issue**
+**Other info / logs**
diff --git a/.github/ISSUE_TEMPLATE/---model-issue-.md b/.github/ISSUE_TEMPLATE/---model-issue-.md
new file mode 100644
index 0000000000000000000000000000000000000000..7cb52f37b90262d2704fe32d8093fac19ed74b5f
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/---model-issue-.md
@@ -0,0 +1,36 @@
+---
+name: 模型(Model Issue)
+about: 您可以提问模型、算法、数据集方向的使用报错等问题。You could use this template for reporting a model/
+  algorithm/dataset issue.
+
+---
+
+为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
+
+建立issue时,为快速解决问题,请您根据使用情况给出如下信息:
+- 标题:简洁、精准描述您的问题,例如“ssd 模型前置lstm报错”
+- 版本、环境信息:
+    1)PaddlePaddle版本:请提供PaddlePaddle版本号,例如1.1或CommitID
+    2)CPU:请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库的使用情况
+    3)GPU:请提供GPU型号,CUDA和CUDNN版本号
+    4)系统环境:请说明系统类型、版本(例如Mac OS 10.14),Python版本
+- 模型信息
+    1)模型名称 2)使用数据集名称 3)使用算法名称 4)模型链接
+- 复现信息:如为报错,请给出复现环境、复现步骤
+- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issues on GitHub; probably a similar issue was submitted or resolved before.
+If there is no solution, please make sure that this is an issue about models including the following details:
+**System information**
+-PaddlePaddle version (e.g. 1.1) or CommitID
+-CPU: including CPU MKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/cuDNN version
+-OS Platform (e.g. Mac OS 10.14)
+-Python version
+-Name of Models&Dataset/details of operator
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe your current behavior**
+**Code to reproduce the issue**
+**Other info / logs**
diff --git a/.github/ISSUE_TEMPLATE/---others-.md b/.github/ISSUE_TEMPLATE/---others-.md
new file mode 100644
index 0000000000000000000000000000000000000000..6a291153e43f51e43646a1c860ec253361b137d6
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/---others-.md
@@ -0,0 +1,33 @@
+---
+name: 其他(Others)
+about: 如上述分类未包含您的问题,可在此提出。 You could use this template for reporting other issues
+
+---
+
+为使您的问题得到快速解决,在建立Issues前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
+
+如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息:
+- 标题:简洁、精准概括您的问题
+- 版本、环境信息:
+    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1或CommitID
+    2)CPU/GPU:如果您使用GPU训练,请提供GPU驱动版本、CUDA和cuDNN版本号
+    3)系统环境:请您描述系统类型、版本,例如Mac OS 10.14
+    4)Python版本号
+    5)显存信息
+- 复现信息:如为报错,请给出复现环境、复现步骤
+- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issues on GitHub in case a similar issue was submitted or resolved before.
+If there is no solution, please provide us with the following details:
+**System information**
+-PaddlePaddle version (e.g. 1.1) or CommitID
+-CPU: including CPU MKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/cuDNN version
+-OS Platform and Distribution (e.g. Mac OS 10.14)
+-Python version
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe your current behavior**
+**Code to reproduce the issue**
+**Other info / logs**
diff --git a/.github/ISSUE_TEMPLATE/---training-issue-.md b/.github/ISSUE_TEMPLATE/---training-issue-.md
new file mode 100644
index 0000000000000000000000000000000000000000..29e8383d9779229328c7c12d04ba6173fd0c8ba1
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/---training-issue-.md
@@ -0,0 +1,38 @@
+---
+name: 训练(Training issue)
+about: 您可以提问训练中报错、应用、出core等问题。 You could use this template for reporting a training
+  issue.
+
+---
+
+为使您的问题得到快速解决,在建立Issues前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
+
+如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息:
+- 标题:简洁、精准概括您的问题,例如“Insufficient Memory xxx”
+- 版本、环境信息:
+    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1或CommitID
+    2)CPU:预测若用CPU,请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库使用情况
+    3)GPU:预测若用GPU,请提供GPU型号、CUDA和CUDNN版本号
+    4)系统环境:请您描述系统类型、版本,例如Mac OS 10.14,Python版本
+- 训练信息
+    1)单机/多机,单卡/多卡
+    2)显存信息
+    3)Operator信息
+- 复现信息:如为报错,请给出复现环境、复现步骤
+- 问题描述:请详细描述您的问题,同步贴出报错信息、日志、可复现的代码片段
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issues on GitHub in case a similar issue was submitted or resolved before.
+If there is no solution, please make sure that this is a training issue including the following details:
+**System information**
+-PaddlePaddle version (e.g. 1.1) or CommitID
+-CPU: including CPU MKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/cuDNN version
+-OS Platform (e.g. Mac OS 10.14)
+-Other information: distributed training/information of operator/
+graphics card memory
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe your current behavior**
+**Code to reproduce the issue**
+**Other info / logs**
diff --git a/.gitignore b/.gitignore
index 90138f996cf9cacc3c1cbff0cf2600eefca3f305..369fa1cb919c82caec326d1429c8a2eba3b928d6 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,9 +1,11 @@
+python/paddle/fluid/tests/unittests/reader_reset_test.recordio
 paddle/operators/check_t.save
 paddle/operators/check_tensor.ls
 paddle/operators/tensor.save
 python/paddle/v2/fluid/tests/book/image_classification_resnet.inference.model/
 python/paddle/v2/fluid/tests/book/image_classification_vgg.inference.model/
 python/paddle/v2/fluid/tests/book/label_semantic_roles.inference.model/
+paddle/fluid/operators/distributed/send_recv.proto
 *.DS_Store
 *.vs
 build/
@@ -28,3 +30,5 @@ third_party/
 build_*
 # clion workspace.
 cmake-build-*
+paddle/fluid/operators/distributed/send_recv.proto
+model_test
diff --git a/AUTHORS.md b/AUTHORS.md
index 41b7193677a0208ba2fa82b72862292572dcb6ef..deafa641203ed9d9bd794fe92e4a91e3aaa03f63 100644
--- a/AUTHORS.md
+++ b/AUTHORS.md
@@ -25,6 +25,7 @@
 | kexinzhao | Ke-Xin Zhao |
 | kuke | Yi-Bing Liu |
 | lcy-seso | Ying Cao |
+| cjld | Dun Liang |
 | lipeng-unisound | Peng Li |
 | liuyuan | Yuan Liu |
 | livc | Zhao Li |
@@ -42,7 +43,9 @@
 | QiJune | Jun Qi |
 | qingqing01 | Qing-Qing Dang |
 | reyoung | Yang Yu |
+| Sand3r- | Michal Gallus |
 | Superjom | Chun-Wei Yan |
+| tensor-tang | Jian Tang |
 | tianbingsz | Tian-Bing Xu |
 | tpatejko | Tomasz Patejko |
 | typhoonzero | Yi Wu |
diff --git a/CMakeLists.txt b/CMakeLists.txt
index bb2ba1ea0c416b00f6a8b9b2f7cf7d6a018c10ac..c0287a9a69088318d0c4be790af2ec62b4dd5714 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -26,6 +26,11 @@ message(STATUS "C compiler: ${CMAKE_C_COMPILER}, version: "
         "${CMAKE_C_COMPILER_ID} ${CMAKE_C_COMPILER_VERSION}")
 if(WIN32)
     set(CMAKE_STATIC_LIBRARY_PREFIX lib)
+    add_definitions("/DGOOGLE_GLOG_DLL_DECL=")
+    set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /bigobj /MTd")
+    set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} /bigobj /MT")
+    set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /bigobj /MTd")
+    set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /bigobj /MT")
 endif(WIN32)
 
 if(NOT CMAKE_CROSSCOMPILING)
@@ -41,6 +46,7 @@ option(WITH_GPU "Compile PaddlePaddle with NVIDIA GPU" ${CUDA_F
 option(WITH_AMD_GPU "Compile PaddlePaddle with AMD GPU" OFF)
 option(WITH_AVX "Compile PaddlePaddle with AVX intrinsics" ${AVX_FOUND})
 option(WITH_MKL "Compile PaddlePaddle with MKL support." ${AVX_FOUND})
+option(WITH_NGRAPH "Compile PaddlePaddle with nGraph support." OFF)
 option(WITH_DSO "Compile PaddlePaddle with dynamic linked CUDA" ON)
 option(WITH_TESTING "Compile PaddlePaddle with unit testing" OFF)
 option(WITH_SWIG_PY "Compile PaddlePaddle with inference api" ON)
@@ -62,13 +68,14 @@ option(WITH_DISTRIBUTE "Compile with distributed support" OFF)
 option(USE_EIGEN_FOR_BLAS "Use matrix multiplication in Eigen" OFF)
 option(EIGEN_USE_THREADS "Compile with multi-threaded Eigen" OFF)
 option(WITH_ARM_FP16 "Use half precision support on armv8.2-a cpu" OFF)
-option(WITH_FAST_BUNDLE_TEST "Bundle tests that can be run in a single process together to reduce launch overhead" OFF)
 option(WITH_CONTRIB "Compile the third-party contributation" OFF)
 option(REPLACE_ENFORCE_GLOG "Replace PADDLE_ENFORCE with glog/CHECK for better debug." OFF)
 option(WITH_ANAKIN "Compile with Anakin library" OFF)
+option(ANAKIN_BUILD_FAT_BIN "Build anakin cuda fat-bin lib for all device platforms, ignored when WITH_ANAKIN=OFF" OFF)
+option(ANAKIN_BUILD_CROSS_PLANTFORM "Build anakin lib for any nvidia device platform, ignored when WITH_ANAKIN=OFF" ON)
 option(WITH_GRPC "Use grpc as the default rpc framework" ${WITH_DISTRIBUTE})
 option(WITH_BRPC_RDMA "Use brpc rdma as the rpc protocal" OFF)
-option(WITH_INFERENCE "Compile fluid inference library" ON)
+option(ON_INFER "Turn on inference optimization." OFF)
 option(WITH_INFERENCE_API_TEST "Test fluid inference high-level api interface" OFF)
 option(WITH_SYSTEM_BLAS "Use system blas library" OFF)
 option(PY_VERSION "Compile PaddlePaddle with python3 support" ${PY_VERSION})
@@ -104,6 +111,8 @@ if(ANDROID OR IOS)
         "Disable RDMA when cross-compiling for Android and iOS" FORCE)
     set(WITH_MKL OFF CACHE STRING
         "Disable MKL when cross-compiling for Android and iOS" FORCE)
+    set(WITH_NGRAPH OFF CACHE STRING
+        "Disable nGraph when cross-compiling for Android and iOS" FORCE)
     set(WITH_GOLANG OFF CACHE STRING
         "Disable golang when cross-compiling for Android and iOS" FORCE)
 
@@ -121,12 +130,30 @@ if (APPLE OR WIN32)
         "Disable MKL for building on mac and windows" FORCE)
 endif()
 
+if (WIN32)
+    set(WITH_AVX OFF CACHE STRING
+        "Disable AVX when compiling for Windows" FORCE)
+    set(WITH_DSO OFF CACHE STRING
+        "Disable DSO when compiling for Windows" FORCE)
+    set(WITH_MKL OFF CACHE STRING
+        "Disable MKL when compiling for Windows" FORCE)
+    set(WITH_DISTRIBUTE OFF CACHE STRING
+        "Disable DISTRIBUTE when compiling for Windows" FORCE)
+    set(WITH_C_API OFF CACHE STRING
+        "Disable C_API when compiling for Windows" FORCE)
+    set(WITH_FLUID_ONLY ON CACHE STRING
+        "Enable FLUID_ONLY when compiling for Windows" FORCE)
+endif()
+
 set(THIRD_PARTY_PATH "${CMAKE_BINARY_DIR}/third_party" CACHE STRING
   "A path setting third party libraries download & build directories.")
 
 set(FLUID_INSTALL_DIR "${CMAKE_BINARY_DIR}/fluid_install_dir" CACHE STRING
   "A path setting fluid shared and static libraries")
 
+set(FLUID_INFERENCE_INSTALL_DIR "${CMAKE_BINARY_DIR}/fluid_inference_install_dir" CACHE STRING
+  "A path setting fluid inference shared and static libraries")
+
 if (WITH_C_API AND WITH_PYTHON)
   message(WARNING "It is suggest not embedded a python interpreter in Paddle "
     "when using C-API.
It will give an unpredictable behavior when using a " @@ -169,6 +196,7 @@ include(external/protobuf) # download, build, install protobuf include(external/python) # download, build, install python include(external/openblas) # download, build, install openblas include(external/mkldnn) # download, build, install mkldnn +include(external/ngraph) # download, build, install nGraph include(external/swig) # download, build, install swig include(external/boost) # download boost include(external/any) # download libn::any @@ -177,11 +205,14 @@ include(external/pybind11) # download pybind11 include(external/cares) include(external/cub) include(external/gzstream) - -if (NOT WIN32) -# there is no official support of snappystream, warpctc, nccl, cupti in windows +include(external/rocprim) +include(external/xxhash) # download xxhash +include(external/dlpack) include(external/snappy) # download snappy include(external/snappystream) # download snappystream + +if (NOT WIN32) +# there is no official support of warpctc, nccl, cupti in windows include(external/warpctc) # download, build, install warpctc include(cupti) endif (NOT WIN32) @@ -289,6 +320,14 @@ set(PADDLE_PYTHON_BUILD_DIR "${CMAKE_CURRENT_BINARY_DIR}/python/build") set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "-O3 -g -DNDEBUG") set(CMAKE_C_FLAGS_RELWITHDEBINFO "-O3 -g -DNDEBUG") +if (ON_INFER) + message(STATUS "On inference mode, will take place some specific optimization.") + add_definitions(-DPADDLE_ON_INFERENCE) +else() + #TODO(luotao), combine this warning with `make inference_lib_dist` command. + message(WARNING "On inference mode, will take place some specific optimization. Turn on the ON_INFER flag when building inference_lib only.") +endif() + add_subdirectory(paddle) if(WITH_PYTHON) add_subdirectory(python) diff --git a/Dockerfile b/Dockerfile index 738bba9bc2e1ab19709722fe04f1490b1b13bd4f..84e1edbee91b0c3877ec7c55161af8c3bdf1c0dc 100644 --- a/Dockerfile +++ b/Dockerfile @@ -22,6 +22,29 @@ ENV HOME /root # Add bash enhancements COPY ./paddle/scripts/docker/root/ /root/ +# Prepare packages for Python +RUN apt-get update && \ + apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \ + libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \ + xz-utils tk-dev libffi-dev liblzma-dev + +# Install Python3.6 +RUN mkdir -p /root/python_build/ && wget -q https://www.sqlite.org/2018/sqlite-autoconf-3250300.tar.gz && \ + tar -zxf sqlite-autoconf-3250300.tar.gz && cd sqlite-autoconf-3250300 && \ + ./configure -prefix=/usr/local && make -j8 && make install && cd ../ && rm sqlite-autoconf-3250300.tar.gz && \ + wget -q https://www.python.org/ftp/python/3.6.0/Python-3.6.0.tgz && \ + tar -xzf Python-3.6.0.tgz && cd Python-3.6.0 && \ + CFLAGS="-Wformat" ./configure --prefix=/usr/local/ --enable-shared > /dev/null && \ + make -j8 > /dev/null && make altinstall > /dev/null + +# Install Python3.7 +RUN wget -q https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tgz && \ + tar -xzf Python-3.7.0.tgz && cd Python-3.7.0 && \ + CFLAGS="-Wformat" ./configure --prefix=/usr/local/ --enable-shared > /dev/null && \ + make -j8 > /dev/null && make altinstall > /dev/null + +RUN rm -r /root/python_build + RUN apt-get update && \ apt-get install -y --allow-downgrades patchelf \ python3 python3-dev python3-pip \ @@ -74,30 +97,48 @@ RUN localedef -i en_US -f UTF-8 en_US.UTF-8 RUN pip3 install -U wheel && \ pip3 install -U docopt PyYAML sphinx==1.5.6 && \ pip3 install sphinx-rtd-theme==0.1.9 recommonmark && \ + pip3.6 install -U wheel && \ + 
pip3.6 install -U docopt PyYAML sphinx==1.5.6 && \ + pip3.6 install sphinx-rtd-theme==0.1.9 recommonmark && \ + pip3.7 install -U wheel && \ + pip3.7 install -U docopt PyYAML sphinx==1.5.6 && \ + pip3.7 install sphinx-rtd-theme==0.1.9 recommonmark && \ easy_install -U pip && \ - pip install -U wheel && \ + pip install -U pip setuptools wheel && \ pip install -U docopt PyYAML sphinx==1.5.6 && \ pip install sphinx-rtd-theme==0.1.9 recommonmark -RUN pip3 install pre-commit 'ipython==5.3.0' && \ +RUN pip3 install 'pre-commit==1.10.4' 'ipython==5.3.0' && \ pip3 install 'ipykernel==4.6.0' 'jupyter==1.0.0' && \ pip3 install opencv-python && \ - pip install pre-commit 'ipython==5.3.0' && \ + pip3.6 install 'pre-commit==1.10.4' 'ipython==5.3.0' && \ + pip3.6 install 'ipykernel==4.6.0' 'jupyter==1.0.0' && \ + pip3.6 install opencv-python && \ + pip3.7 install 'pre-commit==1.10.4' 'ipython==5.3.0' && \ + pip3.7 install 'ipykernel==4.6.0' 'jupyter==1.0.0' && \ + pip3.7 install opencv-python && \ + pip install 'pre-commit==1.10.4' 'ipython==5.3.0' && \ pip install 'ipykernel==4.6.0' 'jupyter==1.0.0' && \ pip install opencv-python #For docstring checker RUN pip3 install pylint pytest astroid isort +RUN pip3.6 install pylint pytest astroid isort +RUN pip3.7 install pylint pytest astroid isort RUN pip install pylint pytest astroid isort LinkChecker COPY ./python/requirements.txt /root/ RUN pip3 install -r /root/requirements.txt +RUN pip3.6 install -r /root/requirements.txt +RUN pip3.7 install -r /root/requirements.txt RUN pip install -r /root/requirements.txt # To fix https://github.com/PaddlePaddle/Paddle/issues/1954, we use # the solution in https://urllib3.readthedocs.io/en/latest/user-guide.html#ssl-py2 RUN apt-get install -y libssl-dev libffi-dev RUN pip3 install certifi urllib3[secure] +RUN pip3.6 install certifi urllib3[secure] +RUN pip3.7 install certifi urllib3[secure] RUN pip install certifi urllib3[secure] diff --git a/README.md b/README.md index 46fdef5e376d3f5bf49ef10c62f5b3a6637913c1..56d6c10c642787836abb55cb2974bda0b8d22da4 100644 --- a/README.md +++ b/README.md @@ -2,8 +2,8 @@ [![Build Status](https://travis-ci.org/PaddlePaddle/Paddle.svg?branch=develop)](https://travis-ci.org/PaddlePaddle/Paddle) -[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/index_en.html) -[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/index_cn.html) +[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](http://paddlepaddle.org/documentation/docs/en/1.1/getstarted/index_en.html) +[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](http://paddlepaddle.org/documentation/docs/zh/1.1/beginners_guide/index.html) [![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle.svg)](https://github.com/PaddlePaddle/Paddle/releases) [![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE) @@ -19,7 +19,7 @@ Our vision is to enable deep learning for everyone via PaddlePaddle. Please refer to our [release announcement](https://github.com/PaddlePaddle/Paddle/releases) to track the latest feature of PaddlePaddle. 
-### Latest PaddlePaddle Release: [Fluid 0.15.0](https://github.com/PaddlePaddle/Paddle/tree/v0.15.0) +### Latest PaddlePaddle Release: [Fluid 1.1.0](https://github.com/PaddlePaddle/Paddle/tree/release/1.1) ### Install Latest Stable Release: ``` # Linux CPU @@ -27,9 +27,9 @@ pip install paddlepaddle # Linux GPU cuda9cudnn7 pip install paddlepaddle-gpu # Linux GPU cuda8cudnn7 -pip install paddlepaddle-gpu==0.15.0.post87 +pip install paddlepaddle-gpu==1.1.0.post87 # Linux GPU cuda8cudnn5 -pip install paddlepaddle-gpu==0.15.0.post85 +pip install paddlepaddle-gpu==1.1.0.post85 # For installation on other platform, refer to http://paddlepaddle.org/ ``` @@ -76,26 +76,26 @@ pip install paddlepaddle-gpu==0.15.0.post85 ## Installation -It is recommended to read [this doc](http://paddlepaddle.org/documentation/docs/zh/0.15.0/new_docs/beginners_guide/install/install_doc.html) on our website. +It is recommended to read [this doc](http://paddlepaddle.org/documentation/docs/zh/1.1/beginners_guide/index.html) on our website. ## Documentation -We provide [English](http://paddlepaddle.org/documentation/docs/en/0.15.0/getstarted/index_en.html) and -[Chinese](http://paddlepaddle.org/documentation/docs/zh/0.15.0/new_docs/beginners_guide/index.html) documentation. +We provide [English](http://paddlepaddle.org/documentation/docs/en/1.1/getstarted/index_en.html) and +[Chinese](http://paddlepaddle.org/documentation/docs/zh/1.1/beginners_guide/index.html) documentation. - [Deep Learning 101](https://github.com/PaddlePaddle/book) You might want to start from this online interactive book that can run in a Jupyter Notebook. -- [Distributed Training](http://paddlepaddle.org/documentation/docs/zh/0.15.0/new_docs/user_guides/howto/training/cluster_howto.html) +- [Distributed Training](http://paddlepaddle.org/documentation/docs/zh/1.1/user_guides/howto/training/cluster_howto.html) You can run distributed training jobs on MPI clusters. -- [Python API](http://paddlepaddle.org/documentation/api/zh/0.15.0/fluid.html) +- [Python API](http://paddlepaddle.org/documentation/api/zh/1.1/fluid.html) Our new API enables much shorter programs. -- [How to Contribute](http://paddlepaddle.org/documentation/docs/zh/0.15.0/new_docs/advanced_usage/development/contribute_to_paddle.html) +- [How to Contribute](http://paddlepaddle.org/documentation/docs/zh/1.1/advanced_usage/development/contribute_to_paddle.html) We appreciate your contributions! diff --git a/benchmark/fluid/args.py b/benchmark/fluid/args.py index 9540900b112f54594bbfdbc8d7cd3b6e1f5269dd..ff616ddbb2cb1cb7f348d6d164815823b08b7629 100644 --- a/benchmark/fluid/args.py +++ b/benchmark/fluid/args.py @@ -142,5 +142,10 @@ def parse_args(): choices=['reduce', 'all_reduce'], default='all_reduce', help='Specify the reduce strategy, can be reduce, all_reduce') + parser.add_argument( + '--fuse_broadcast_op', + action='store_true', + help='If set, would fuse multiple broadcast operators into one fused_broadcast operator.' 
+    )
     args = parser.parse_args()
     return args
diff --git a/benchmark/fluid/fluid_benchmark.py b/benchmark/fluid/fluid_benchmark.py
index ddd9fe809853a830ca676cc98f1819f683866def..5f3ce300acc44ad8d2898c27296b866c403f3cc8 100644
--- a/benchmark/fluid/fluid_benchmark.py
+++ b/benchmark/fluid/fluid_benchmark.py
@@ -177,6 +177,7 @@ def train_parallel(train_args, test_args, args, train_prog, test_prog,
     else:
         build_strategy.reduce_strategy = fluid.BuildStrategy(
         ).ReduceStrategy.AllReduce
+    build_strategy.fuse_broadcast_op = args.fuse_broadcast_op
 
     avg_loss = train_args[0]
 
@@ -240,7 +241,6 @@
         if args.use_fake_data or args.use_reader_op:
             try:
-
                 fetch_ret = exe.run(fetch_list)
             except fluid.core.EOFException as eof:
                 break
diff --git a/benchmark/fluid/run.sh b/benchmark/fluid/run.sh
old mode 100644
new mode 100755
diff --git a/cmake/configure.cmake b/cmake/configure.cmake
index e9852f00b1835adec31373f58ac538f9685251e0..4e17ddee73958106d5e2c8c8ea5661acc758518a 100644
--- a/cmake/configure.cmake
+++ b/cmake/configure.cmake
@@ -50,11 +50,7 @@ if(NOT WITH_PROFILER)
 endif(NOT WITH_PROFILER)
 
 if(NOT CMAKE_CROSSCOMPILING)
-    if(WITH_AVX AND AVX512F_FOUND)
-        set(SIMD_FLAG ${AVX512F_FLAG})
-    elseif(WITH_AVX AND AVX2_FOUND)
-        set(SIMD_FLAG ${AVX2_FLAG})
-    elseif(WITH_AVX AND AVX_FOUND)
+    if(WITH_AVX AND AVX_FOUND)
         set(SIMD_FLAG ${AVX_FLAG})
     elseif(SSE3_FOUND)
         set(SIMD_FLAG ${SSE3_FLAG})
@@ -222,3 +218,7 @@ endif(WITH_GRPC)
 if(WITH_BRPC_RDMA)
     add_definitions(-DPADDLE_WITH_BRPC_RDMA)
 endif(WITH_BRPC_RDMA)
+
+if(ON_INFER)
+    add_definitions(-DPADDLE_ON_INFERENCE)
+endif(ON_INFER)
diff --git a/cmake/cuda.cmake b/cmake/cuda.cmake
index f507bb41a1103c093e9569176ee868cfaac6bf7b..414e92eb27f56e0670e1977e67c2f5ca9c6bbcc2 100644
--- a/cmake/cuda.cmake
+++ b/cmake/cuda.cmake
@@ -157,6 +157,9 @@ list(APPEND EXTERNAL_LIBS ${CUDA_LIBRARIES} ${CUDA_rt_LIBRARY})
 if(NOT WITH_DSO)
   # TODO(panyx0718): CUPTI only allows DSO?
   list(APPEND EXTERNAL_LIBS ${CUDNN_LIBRARY} ${CUPTI_LIBRARY} ${CUDA_CUBLAS_LIBRARIES} ${CUDA_curand_LIBRARY} ${NCCL_LIBRARY})
+  if(WIN32)
+    set_property(GLOBAL PROPERTY CUDA_MODULES ${CUDNN_LIBRARY} ${CUDA_CUBLAS_LIBRARIES} ${CUDA_curand_LIBRARY})
+  endif(WIN32)
 endif(NOT WITH_DSO)
 
 # setting nvcc arch flags
@@ -196,10 +199,15 @@ elseif(CMAKE_BUILD_TYPE STREQUAL "MinSizeRel")
     list(APPEND CUDA_NVCC_FLAGS ${CMAKE_CXX_FLAGS_RELEASE})
 endif()
 else(NOT WIN32)
-if(CMAKE_BUILD_TYPE STREQUAL "Release")
+list(APPEND CUDA_NVCC_FLAGS "--compiler-options;/bigobj")
+if(CMAKE_BUILD_TYPE STREQUAL "Debug")
+    list(APPEND CUDA_NVCC_FLAGS "-g -G")
+    # match the cl's _ITERATOR_DEBUG_LEVEL
+    list(APPEND CUDA_NVCC_FLAGS "-D_DEBUG")
+elseif(CMAKE_BUILD_TYPE STREQUAL "Release")
     list(APPEND CUDA_NVCC_FLAGS "-O3 -DNDEBUG")
 else()
-    message(FATAL "Windows only support Release build now. Please set visual studio build type to Release, x64 build.")
+    message(FATAL "Windows only support Release or Debug build now.
Please set visual studio build type to Release/Debug, x64 build.") endif() endif(NOT WIN32) diff --git a/cmake/cudnn.cmake b/cmake/cudnn.cmake index cd51533926de7bb132ab7bfab1686d664a331410..09bec347dbd569203103eccc7dbc0521c291bc0a 100644 --- a/cmake/cudnn.cmake +++ b/cmake/cudnn.cmake @@ -2,7 +2,12 @@ if(NOT WITH_GPU) return() endif() -set(CUDNN_ROOT "/usr" CACHE PATH "CUDNN ROOT") +if(WIN32) + set(CUDNN_ROOT ${CUDA_TOOLKIT_ROOT_DIR}) +else(WIN32) + set(CUDNN_ROOT "/usr" CACHE PATH "CUDNN ROOT") +endif(WIN32) + find_path(CUDNN_INCLUDE_DIR cudnn.h PATHS ${CUDNN_ROOT} ${CUDNN_ROOT}/include $ENV{CUDNN_ROOT} $ENV{CUDNN_ROOT}/include ${CUDA_TOOLKIT_INCLUDE} diff --git a/cmake/external/anakin.cmake b/cmake/external/anakin.cmake index 84354c446e2f54fa13b90fa37221eed90968b251..06fc6061bc98eec8c4c71860333f7d3456952aeb 100644 --- a/cmake/external/anakin.cmake +++ b/cmake/external/anakin.cmake @@ -58,19 +58,21 @@ ExternalProject_Add( -DPROTOBUF_ROOT=${THIRD_PARTY_PATH}/install/protobuf -DMKLML_ROOT=${THIRD_PARTY_PATH}/install/mklml -DENABLE_OP_TIMER=${ANAKIN_ENABLE_OP_TIMER} + -DBUILD_FAT_BIN=${ANAKIN_BUILD_FAT_BIN} + -DBUILD_CROSS_PLANTFORM=${ANAKIN_BUILD_CROSS_PLANTFORM} ${EXTERNAL_OPTIONAL_ARGS} CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${ANAKIN_INSTALL_DIR} ) message(STATUS "Anakin for inference is enabled") message(STATUS "Anakin is set INCLUDE:${ANAKIN_INCLUDE} LIBRARY:${ANAKIN_LIBRARY}") - +add_dependencies(extern_anakin protobuf mklml) add_library(anakin_shared SHARED IMPORTED GLOBAL) set_property(TARGET anakin_shared PROPERTY IMPORTED_LOCATION ${ANAKIN_SHARED_LIB}) -add_dependencies(anakin_shared extern_anakin protobuf mklml) +add_dependencies(anakin_shared extern_anakin) add_library(anakin_saber SHARED IMPORTED GLOBAL) set_property(TARGET anakin_saber PROPERTY IMPORTED_LOCATION ${ANAKIN_SABER_LIB}) -add_dependencies(anakin_saber extern_anakin protobuf mklml) +add_dependencies(anakin_saber extern_anakin) list(APPEND external_project_dependencies anakin_shared anakin_saber) diff --git a/cmake/external/boost.cmake b/cmake/external/boost.cmake index ada61de8eb15ae10288ac54f588e9adf84acee37..5a78a1d1b7dea0d95ae3fa2c9f39679899dd1bcb 100644 --- a/cmake/external/boost.cmake +++ b/cmake/external/boost.cmake @@ -28,34 +28,28 @@ if((NOT DEFINED BOOST_TAR) OR (NOT DEFINED BOOST_URL)) set(BOOST_TAR "boost_1_41_0" CACHE STRING "" FORCE) set(BOOST_URL "http://paddlepaddledeps.cdn.bcebos.com/${BOOST_TAR}.tar.gz" CACHE STRING "" FORCE) endif() -IF (WIN32) - MESSAGE(WARNING, "In windows, boost can not be downloaded automaticlly, please build it manually and put it at " ${THIRD_PARTY_PATH}install/boost) -else() - MESSAGE(STATUS "BOOST_TAR: ${BOOST_TAR}, BOOST_URL: ${BOOST_URL}") -ENDIF(WIN32) + +MESSAGE(STATUS "BOOST_TAR: ${BOOST_TAR}, BOOST_URL: ${BOOST_URL}") set(BOOST_SOURCES_DIR ${THIRD_PARTY_PATH}/boost) set(BOOST_DOWNLOAD_DIR "${BOOST_SOURCES_DIR}/src/${BOOST_PROJECT}") -set(BOOST_INCLUDE_DIR "${BOOST_DOWNLOAD_DIR}/${BOOST_TAR}" CACHE PATH "boost include directory." FORCE) -set_directory_properties(PROPERTIES CLEAN_NO_CUSTOM 1) +set(BOOST_INCLUDE_DIR "${BOOST_DOWNLOAD_DIR}" CACHE PATH "boost include directory." 
FORCE)
+set_directory_properties(PROPERTIES CLEAN_NO_CUSTOM 1)
 
 include_directories(${BOOST_INCLUDE_DIR})
 
-if (NOT WIN32)
 ExternalProject_Add(
     ${BOOST_PROJECT}
     ${EXTERNAL_PROJECT_LOG_ARGS}
     DOWNLOAD_DIR          ${BOOST_DOWNLOAD_DIR}
-    DOWNLOAD_COMMAND      wget --no-check-certificate ${BOOST_URL} -c -q -O ${BOOST_TAR}.tar.gz
-                          && tar zxf ${BOOST_TAR}.tar.gz
+    URL      ${BOOST_URL}
     DOWNLOAD_NO_PROGRESS  1
     PREFIX                ${BOOST_SOURCES_DIR}
     CONFIGURE_COMMAND     ""
     BUILD_COMMAND         ""
     INSTALL_COMMAND       ""
     UPDATE_COMMAND        ""
-)
-endif(NOT WIN32)
+    )
 
 if (${CMAKE_VERSION} VERSION_LESS "3.3.0" OR NOT WIN32)
     set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/boost_dummy.c)
diff --git a/cmake/external/dlpack.cmake b/cmake/external/dlpack.cmake
new file mode 100644
index 0000000000000000000000000000000000000000..94d8fcc66855627d665b8e84a47a2075e7253b03
--- /dev/null
+++ b/cmake/external/dlpack.cmake
@@ -0,0 +1,31 @@
+include(ExternalProject)
+
+set(DLPACK_SOURCE_DIR ${THIRD_PARTY_PATH}/dlpack)
+set(DLPACK_INCLUDE_DIR ${DLPACK_SOURCE_DIR}/src/extern_dlpack/include)
+
+include_directories(${DLPACK_INCLUDE_DIR})
+
+ExternalProject_Add(
+  extern_dlpack
+  ${EXTERNAL_PROJECT_LOG_ARGS}
+  GIT_REPOSITORY "https://github.com/dmlc/dlpack.git"
+  GIT_TAG        "v0.2"
+  PREFIX         ${DLPACK_SOURCE_DIR}
+  UPDATE_COMMAND    ""
+  CONFIGURE_COMMAND ""
+  BUILD_COMMAND     ""
+  INSTALL_COMMAND   ""
+  TEST_COMMAND      ""
+)
+
+if(${CMAKE_VERSION} VERSION_LESS "3.3.0")
+  set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/dlpack_dummy.c)
+  file(WRITE ${dummyfile} "const char *dummy = \"${dummyfile}\";")
+  add_library(dlpack STATIC ${dummyfile})
+else()
+  add_library(dlpack INTERFACE)
+endif()
+
+add_dependencies(dlpack extern_dlpack)
+
+LIST(APPEND external_project_dependencies dlpack)
diff --git a/cmake/external/eigen.cmake b/cmake/external/eigen.cmake
index 573ad5e5f06a93f38f24c6a8af3b45767e93a1a4..6aef97f21244efd09e22781f703553a19a9e1860 100644
--- a/cmake/external/eigen.cmake
+++ b/cmake/external/eigen.cmake
@@ -17,7 +17,7 @@ if(WITH_AMD_GPU)
         extern_eigen3
         ${EXTERNAL_PROJECT_LOG_ARGS}
         GIT_REPOSITORY "https://github.com/sabreshao/hipeigen.git"
-        GIT_TAG 0cba03ff9f8f9f70bbd92ac5857b031aa8fed6f9
+        GIT_TAG 7cb2b6e5a4b4a1efe658abb215cd866c6fb2275e
         PREFIX ${EIGEN_SOURCE_DIR}
         UPDATE_COMMAND ""
         CONFIGURE_COMMAND ""
diff --git a/cmake/external/gflags.cmake b/cmake/external/gflags.cmake
index cf58cc39762351f8b37d073bcd218d249285bf52..4e98e4bf889bc13938931be7f6cb204c83250a5c 100644
--- a/cmake/external/gflags.cmake
+++ b/cmake/external/gflags.cmake
@@ -35,7 +35,12 @@ ExternalProject_Add(
     CMAKE_ARGS      -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}
                     -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}
                     -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS}
+                    -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE}
+                    -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG}
                     -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS}
+                    -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG}
+                    -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE}
+                    -DBUILD_STATIC_LIBS=ON
                     -DCMAKE_INSTALL_PREFIX=${GFLAGS_INSTALL_DIR}
                     -DCMAKE_POSITION_INDEPENDENT_CODE=ON
                     -DBUILD_TESTING=OFF
@@ -48,8 +53,8 @@ ExternalProject_Add(
 IF(WIN32)
   IF(NOT EXISTS "${GFLAGS_INSTALL_DIR}/lib/libgflags.lib")
     add_custom_command(TARGET extern_gflags POST_BUILD
-        COMMAND cmake -E rename ${GFLAGS_INSTALL_DIR}/lib/gflags_static.lib ${GFLAGS_INSTALL_DIR}/lib/libgflags.lib
-    )
+        COMMAND cmake -E copy ${GFLAGS_INSTALL_DIR}/lib/gflags_static.lib ${GFLAGS_INSTALL_DIR}/lib/libgflags.lib
+        )
   ENDIF()
 ENDIF(WIN32)
 ADD_LIBRARY(gflags STATIC IMPORTED GLOBAL)
diff --git a/cmake/external/glog.cmake b/cmake/external/glog.cmake
index
25ef2970ac52f12f961c9c6d3a589fec4c80983f..8cd0455c16bf84909b735102e7fb1089744c4245 100644 --- a/cmake/external/glog.cmake +++ b/cmake/external/glog.cmake @@ -46,7 +46,11 @@ ExternalProject_Add( CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} -DCMAKE_INSTALL_PREFIX=${GLOG_INSTALL_DIR} -DCMAKE_INSTALL_LIBDIR=${GLOG_INSTALL_DIR}/lib -DCMAKE_POSITION_INDEPENDENT_CODE=ON @@ -63,7 +67,7 @@ ExternalProject_Add( IF(WIN32) IF(NOT EXISTS "${GLOG_INSTALL_DIR}/lib/libglog.lib") add_custom_command(TARGET extern_glog POST_BUILD - COMMAND cmake -E rename ${GLOG_INSTALL_DIR}/lib/glog.lib ${GLOG_INSTALL_DIR}/lib/libglog.lib + COMMAND cmake -E copy ${GLOG_INSTALL_DIR}/lib/glog.lib ${GLOG_INSTALL_DIR}/lib/libglog.lib ) ENDIF() ENDIF(WIN32) diff --git a/cmake/external/gtest.cmake b/cmake/external/gtest.cmake index d335298742c73bf1fe44e6a778ab3c142711063d..4fe9c13fb7f2c04ae04e985252996dfa308ac304 100644 --- a/cmake/external/gtest.cmake +++ b/cmake/external/gtest.cmake @@ -50,7 +50,11 @@ IF(WITH_TESTING) CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} -DCMAKE_INSTALL_PREFIX=${GTEST_INSTALL_DIR} -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DBUILD_GMOCK=ON diff --git a/cmake/external/mkldnn.cmake b/cmake/external/mkldnn.cmake index baf253df2755657b01b67c410f63b7d8422d4df3..b280db23b9b27bc658a79d01ea81122d2c987666 100644 --- a/cmake/external/mkldnn.cmake +++ b/cmake/external/mkldnn.cmake @@ -37,7 +37,6 @@ SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE) SET(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_RPATH}" "${MKLDNN_INSTALL_DIR}/lib") INCLUDE_DIRECTORIES(${MKLDNN_INC_DIR}) # For MKLDNN code to include internal headers. -INCLUDE_DIRECTORIES(${THIRD_PARTY_PATH}/install) # For Paddle code to include mkldnn.h IF(${CBLAS_PROVIDER} STREQUAL "MKLML") SET(MKLDNN_DEPENDS ${MKLML_PROJECT}) @@ -45,7 +44,7 @@ IF(${CBLAS_PROVIDER} STREQUAL "MKLML") ELSE() MESSAGE(FATAL_ERROR "Should enable MKLML when build MKLDNN") ENDIF() -SET(MKLDNN_FLAG "-Wno-error=strict-overflow -Wno-error=unused-result") +SET(MKLDNN_FLAG "-Wno-error=strict-overflow -Wno-error=unused-result -Wno-error=array-bounds") SET(MKLDNN_FLAG "${MKLDNN_FLAG} -Wno-unused-result -Wno-unused-value") SET(MKLDNN_CFLAG "${CMAKE_C_FLAGS} ${MKLDNN_FLAG}") SET(MKLDNN_CXXFLAG "${CMAKE_CXX_FLAGS} ${MKLDNN_FLAG}") @@ -54,7 +53,7 @@ ExternalProject_Add( ${EXTERNAL_PROJECT_LOG_ARGS} DEPENDS ${MKLDNN_DEPENDS} GIT_REPOSITORY "https://github.com/01org/mkl-dnn.git" - GIT_TAG "64e03a1939e0d526aa8e9f2e3f7dc0ad8d372944" + GIT_TAG "830a10059a018cd2634d94195140cf2d8790a75a" PREFIX ${MKLDNN_SOURCES_DIR} UPDATE_COMMAND "" CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} diff --git a/cmake/external/ngraph.cmake b/cmake/external/ngraph.cmake new file mode 100644 index 0000000000000000000000000000000000000000..2e335579f32df4f146c8d88e05e684a9a8105e20 --- /dev/null +++ b/cmake/external/ngraph.cmake @@ -0,0 +1,92 @@ +# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +add_library(ngraph INTERFACE) + +IF(WIN32 OR APPLE) + MESSAGE(WARNING + "Windows or Mac is not supported with nGraph in Paddle yet." + "Force WITH_NGRAPH=OFF") + SET(WITH_NGRAPH OFF CACHE STRING "Disable nGraph in Windows and MacOS" FORCE) +ENDIF() + +IF(${WITH_NGRAPH} AND NOT ${WITH_MKLDNN}) + MESSAGE(WARNING + "nGraph needs mkl-dnn to be enabled." + "Force WITH_NGRAPH=OFF") + SET(WITH_NGRAPH OFF CACHE STRING "Disable nGraph if mkl-dnn is disabled" FORCE) +ENDIF() + +IF(NOT ${WITH_NGRAPH}) + return() +ENDIF() + +INCLUDE(ExternalProject) + +SET(NGRAPH_PROJECT "extern_ngraph") +SET(NGRAPH_VERSION "0.9") +SET(NGRAPH_GIT_TAG "f9fd9d4cc318dc59dd4b68448e7fbb5f67a28bd0") +SET(NGRAPH_SOURCES_DIR ${THIRD_PARTY_PATH}/ngraph) +SET(NGRAPH_INSTALL_DIR ${THIRD_PARTY_PATH}/install/ngraph) +SET(NGRAPH_INC_DIR ${NGRAPH_INSTALL_DIR}/include) +SET(NGRAPH_SHARED_LIB_NAME libngraph.so.${NGRAPH_VERSION}) +SET(NGRAPH_CPU_LIB_NAME libcpu_backend.so) +SET(NGRAPH_TBB_LIB_NAME libtbb.so.2) +SET(NGRAPH_GIT_REPO "https://github.com/NervanaSystems/ngraph.git") + +ExternalProject_Add( + ${NGRAPH_PROJECT} + ${EXTERNAL_PROJECT_LOG_ARGS} + DEPENDS ${MKLDNN_PROJECT} ${MKLML_PROJECT} + GIT_REPOSITORY ${NGRAPH_GIT_REPO} + GIT_TAG ${NGRAPH_GIT_TAG} + PREFIX ${NGRAPH_SOURCES_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${NGRAPH_INSTALL_DIR} + CMAKE_ARGS -DNGRAPH_UNIT_TEST_ENABLE=FALSE + CMAKE_ARGS -DNGRAPH_TOOLS_ENABLE=FALSE + CMAKE_ARGS -DNGRAPH_INTERPRETER_ENABLE=FALSE + CMAKE_ARGS -DNGRAPH_DEX_ONLY=TRUE + CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} + CMAKE_ARGS -DMKLDNN_INCLUDE_DIR=${MKLDNN_INC_DIR} + CMAKE_ARGS -DMKLDNN_LIB_DIR=${MKLDNN_INSTALL_DIR}/lib +) + +if(UNIX AND NOT APPLE) + include(GNUInstallDirs) + SET(NGRAPH_LIB_DIR ${NGRAPH_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}) +else() + SET(NGRAPH_LIB_DIR ${NGRAPH_INSTALL_DIR}/lib) +endif() +MESSAGE(STATUS "nGraph lib will be installed at: ${NGRAPH_LIB_DIR}") + +SET(NGRAPH_SHARED_LIB ${NGRAPH_LIB_DIR}/${NGRAPH_SHARED_LIB_NAME}) +SET(NGRAPH_CPU_LIB ${NGRAPH_LIB_DIR}/${NGRAPH_CPU_LIB_NAME}) +SET(NGRAPH_TBB_LIB ${NGRAPH_LIB_DIR}/${NGRAPH_TBB_LIB_NAME}) + +# Workaround for nGraph expecting mklml to be in mkldnn install directory. 
+ExternalProject_Add_Step( + ${NGRAPH_PROJECT} + PrepareMKL + COMMAND ${CMAKE_COMMAND} -E create_symlink ${MKLML_LIB} ${MKLDNN_INSTALL_DIR}/lib/libmklml_intel.so + COMMAND ${CMAKE_COMMAND} -E create_symlink ${MKLML_IOMP_LIB} ${MKLDNN_INSTALL_DIR}/lib/libiomp5.so + DEPENDEES download + DEPENDERS configure +) + +add_dependencies(ngraph ${NGRAPH_PROJECT}) +target_compile_definitions(ngraph INTERFACE -DPADDLE_WITH_NGRAPH) +target_include_directories(ngraph INTERFACE ${NGRAPH_INC_DIR}) +target_link_libraries(ngraph INTERFACE ${NGRAPH_SHARED_LIB}) +LIST(APPEND external_project_dependencies ngraph) diff --git a/cmake/external/openblas.cmake b/cmake/external/openblas.cmake index 755dbd610c40c2d9b85d3017b6f000a869b0f39a..aeb976b840e999a20e8cab11939cbb1f49a27850 100644 --- a/cmake/external/openblas.cmake +++ b/cmake/external/openblas.cmake @@ -17,12 +17,8 @@ IF(USE_EIGEN_FOR_BLAS) ENDIF(USE_EIGEN_FOR_BLAS) INCLUDE(cblas) -# IF(WIN32 AND NOT ${CBLAS_FOUND}) - - IF(NOT ${CBLAS_FOUND}) - INCLUDE(ExternalProject) SET(CBLAS_SOURCES_DIR ${THIRD_PARTY_PATH}/openblas) @@ -34,6 +30,7 @@ IF(NOT ${CBLAS_FOUND}) CACHE FILEPATH "openblas library." FORCE) ADD_DEFINITIONS(-DPADDLE_USE_OPENBLAS) + IF (WIN32) SET(CBLAS_FOUND true) MESSAGE(WARNING, "In windows, openblas only support msvc build, please build it manually and put it at " ${CBLAS_INSTALL_DIR}) diff --git a/cmake/external/protobuf.cmake b/cmake/external/protobuf.cmake index 550b0dada8e90c1e2b33705fd53c065672113b45..e1e619e572b05e83fbe751af2e5391aafc494416 100644 --- a/cmake/external/protobuf.cmake +++ b/cmake/external/protobuf.cmake @@ -30,66 +30,61 @@ UNSET_VAR(PROTOBUF_LITE_LIBRARY) UNSET_VAR(PROTOBUF_LIBRARY) UNSET_VAR(PROTOBUF_INCLUDE_DIR) UNSET_VAR(Protobuf_PROTOC_EXECUTABLE) +function(protobuf_generate_python SRCS) + # shameless copy from https://github.com/Kitware/CMake/blob/master/Modules/FindProtobuf.cmake + if(NOT ARGN) + message(SEND_ERROR "Error: PROTOBUF_GENERATE_PYTHON() called without any proto files") + return() + endif() -if(NOT COMMAND protobuf_generate_python) # before cmake 3.4, protobuf_genrerate_python is not defined. 
- function(protobuf_generate_python SRCS) - # shameless copy from https://github.com/Kitware/CMake/blob/master/Modules/FindProtobuf.cmake - if(NOT ARGN) - message(SEND_ERROR "Error: PROTOBUF_GENERATE_PYTHON() called without any proto files") - return() - endif() - - if(PROTOBUF_GENERATE_CPP_APPEND_PATH) - # Create an include path for each file specified - foreach(FIL ${ARGN}) - get_filename_component(ABS_FIL ${FIL} ABSOLUTE) - get_filename_component(ABS_PATH ${ABS_FIL} PATH) - list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) - if(${_contains_already} EQUAL -1) - list(APPEND _protobuf_include_path -I ${ABS_PATH}) - endif() - endforeach() - else() - set(_protobuf_include_path -I ${CMAKE_CURRENT_SOURCE_DIR}) - endif() - - if(DEFINED PROTOBUF_IMPORT_DIRS AND NOT DEFINED Protobuf_IMPORT_DIRS) - set(Protobuf_IMPORT_DIRS "${PROTOBUF_IMPORT_DIRS}") - endif() - - if(DEFINED Protobuf_IMPORT_DIRS) - foreach(DIR ${Protobuf_IMPORT_DIRS}) - get_filename_component(ABS_PATH ${DIR} ABSOLUTE) - list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) - if(${_contains_already} EQUAL -1) - list(APPEND _protobuf_include_path -I ${ABS_PATH}) - endif() - endforeach() - endif() - - set(${SRCS}) + if(PROTOBUF_GENERATE_CPP_APPEND_PATH) + # Create an include path for each file specified foreach(FIL ${ARGN}) get_filename_component(ABS_FIL ${FIL} ABSOLUTE) - get_filename_component(FIL_WE ${FIL} NAME_WE) - if(NOT PROTOBUF_GENERATE_CPP_APPEND_PATH) - get_filename_component(FIL_DIR ${FIL} DIRECTORY) - if(FIL_DIR) - set(FIL_WE "${FIL_DIR}/${FIL_WE}") - endif() + get_filename_component(ABS_PATH ${ABS_FIL} PATH) + list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) + if(${_contains_already} EQUAL -1) + list(APPEND _protobuf_include_path -I ${ABS_PATH}) endif() + endforeach() + else() + set(_protobuf_include_path -I ${CMAKE_CURRENT_SOURCE_DIR}) + endif() + if(DEFINED PROTOBUF_IMPORT_DIRS AND NOT DEFINED Protobuf_IMPORT_DIRS) + set(Protobuf_IMPORT_DIRS "${PROTOBUF_IMPORT_DIRS}") + endif() - list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py") - add_custom_command( - OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py" - COMMAND ${Protobuf_PROTOC_EXECUTABLE} --python_out ${CMAKE_CURRENT_BINARY_DIR} ${_protobuf_include_path} ${ABS_FIL} - DEPENDS ${ABS_FIL} ${Protobuf_PROTOC_EXECUTABLE} - COMMENT "Running Python protocol buffer compiler on ${FIL}" - VERBATIM ) + if(DEFINED Protobuf_IMPORT_DIRS) + foreach(DIR ${Protobuf_IMPORT_DIRS}) + get_filename_component(ABS_PATH ${DIR} ABSOLUTE) + list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) + if(${_contains_already} EQUAL -1) + list(APPEND _protobuf_include_path -I ${ABS_PATH}) + endif() endforeach() + endif() - set(${SRCS} ${${SRCS}} PARENT_SCOPE) - endfunction() -endif() + set(${SRCS}) + foreach(FIL ${ARGN}) + get_filename_component(ABS_FIL ${FIL} ABSOLUTE) + get_filename_component(FIL_WE ${FIL} NAME_WE) + if(NOT PROTOBUF_GENERATE_CPP_APPEND_PATH) + get_filename_component(FIL_DIR ${FIL} DIRECTORY) + if(FIL_DIR) + set(FIL_WE "${FIL_DIR}/${FIL_WE}") + endif() + endif() + list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py") + add_custom_command( + OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py" + COMMAND ${PROTOBUF_PROTOC_EXECUTABLE} --python_out ${CMAKE_CURRENT_BINARY_DIR} ${_protobuf_include_path} ${ABS_FIL} + DEPENDS ${ABS_FIL} ${PROTOBUF_PROTOC_EXECUTABLE} + COMMENT "Running Python protocol buffer compiler on ${FIL}" + VERBATIM ) + endforeach() + + set(${SRCS} ${${SRCS}} PARENT_SCOPE) +endfunction() 
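The rewritten `protobuf_generate_python()` above keeps the calling convention of CMake's upstream `FindProtobuf` module: the first argument names a variable that receives the paths of the generated `*_pb2.py` files, and the remaining arguments are `.proto` files. A minimal usage sketch follows (illustrative only, not part of this patch; `example.proto` and the target name are hypothetical):

```cmake
# Hypothetical usage of the protobuf_generate_python() helper defined above.
# EXAMPLE_PROTO_PY receives the generated *_pb2.py output paths; example.proto
# is an illustrative input file, not something added by this patch.
protobuf_generate_python(EXAMPLE_PROTO_PY example.proto)

# The helper only registers custom commands, so a target must depend on the
# outputs for them to actually be generated during the build.
add_custom_target(example_proto_py ALL DEPENDS ${EXAMPLE_PROTO_PY})
```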
# Print and set the protobuf library information, # finish this cmake process and exit from this file. @@ -126,6 +121,7 @@ macro(PROMPT_PROTOBUF_LIB) # FIND_Protobuf.cmake uses `Protobuf_PROTOC_EXECUTABLE`. # make `protobuf_generate_cpp` happy. SET(Protobuf_PROTOC_EXECUTABLE ${PROTOBUF_PROTOC_EXECUTABLE}) + FOREACH(dep ${protobuf_DEPS}) ADD_DEPENDENCIES(protobuf ${dep}) ADD_DEPENDENCIES(protobuf_lite ${dep}) @@ -144,7 +140,6 @@ endmacro() set(PROTOBUF_ROOT "" CACHE PATH "Folder contains protobuf") IF (WIN32) SET(PROTOBUF_ROOT ${THIRD_PARTY_PATH}/install/protobuf) - MESSAGE(WARNING, "In windows, protobuf only support msvc build, please build it manually and put it at " ${PROTOBUF_ROOT}) ENDIF(WIN32) if (NOT "${PROTOBUF_ROOT}" STREQUAL "") @@ -192,13 +187,20 @@ FUNCTION(build_protobuf TARGET_NAME BUILD_FOR_HOST) SET(OPTIONAL_ARGS "-DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}" "-DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}" - "-DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS}" "-DCMAKE_C_FLAGS=${CMAKE_C_FLAGS}" + "-DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG}" + "-DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE}" + "-DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS}" + "-DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE}" + "-DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG}" "-Dprotobuf_WITH_ZLIB=ON" "-DZLIB_ROOT:FILEPATH=${ZLIB_ROOT}" ${EXTERNAL_OPTIONAL_ARGS}) SET(OPTIONAL_CACHE_ARGS "-DZLIB_ROOT:STRING=${ZLIB_ROOT}") ENDIF() + IF(WIN32) + SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} "-DCMAKE_GENERATOR_PLATFORM=x64") + ENDIF() SET(PROTOBUF_REPO "https://github.com/google/protobuf.git") SET(PROTOBUF_TAG "9f75c5aa851cd877fb0d93ccc31b8567a6706546") diff --git a/cmake/external/pybind11.cmake b/cmake/external/pybind11.cmake index c885877a2bcd622dc8d1f3ca2821a0bed84561bf..3a10ea945d3d1a3fa88954196905872b2668d5a0 100644 --- a/cmake/external/pybind11.cmake +++ b/cmake/external/pybind11.cmake @@ -26,7 +26,7 @@ ExternalProject_Add( extern_pybind ${EXTERNAL_PROJECT_LOG_ARGS} GIT_REPOSITORY "https://github.com/pybind/pybind11.git" - GIT_TAG "v2.1.1" + GIT_TAG "v2.2.4" PREFIX ${PYBIND_SOURCE_DIR} UPDATE_COMMAND "" CONFIGURE_COMMAND "" diff --git a/cmake/external/python.cmake b/cmake/external/python.cmake index f17b8d46dc2d8ded81ced7de5827d5e7fd5109f0..a3599dd798c07f57ed82e3f25b6bb9fc4f8bdc3a 100644 --- a/cmake/external/python.cmake +++ b/cmake/external/python.cmake @@ -21,6 +21,48 @@ INCLUDE(python_module) FIND_PACKAGE(PythonInterp ${PY_VERSION}) FIND_PACKAGE(PythonLibs ${PY_VERSION}) +if(WIN32) + execute_process(COMMAND "${PYTHON_EXECUTABLE}" "-c" +"from distutils import sysconfig as s;import sys;import struct; +print(sys.prefix); +print(s.get_config_var('LDVERSION') or s.get_config_var('VERSION')); +" + RESULT_VARIABLE _PYTHON_SUCCESS + OUTPUT_VARIABLE _PYTHON_VALUES + ERROR_VARIABLE _PYTHON_ERROR_VALUE) + + if(NOT _PYTHON_SUCCESS MATCHES 0) + set(PYTHONLIBS_FOUND FALSE) + return() + endif() + + # Convert the process output into a list + string(REGEX REPLACE ";" "\\\\;" _PYTHON_VALUES ${_PYTHON_VALUES}) + string(REGEX REPLACE "\n" ";" _PYTHON_VALUES ${_PYTHON_VALUES}) + list(GET _PYTHON_VALUES 0 PYTHON_PREFIX) + list(GET _PYTHON_VALUES 1 PYTHON_LIBRARY_SUFFIX) + + # Make sure all directory separators are '/' + string(REGEX REPLACE "\\\\" "/" PYTHON_PREFIX ${PYTHON_PREFIX}) + + set(PYTHON_LIBRARY + "${PYTHON_PREFIX}/libs/Python${PYTHON_LIBRARY_SUFFIX}.lib") + + # when run in a venv, PYTHON_PREFIX points to it. But the libraries remain in the + # original python installation. They may be found relative to PYTHON_INCLUDE_DIR. 
+ if(NOT EXISTS "${PYTHON_LIBRARY}") + get_filename_component(_PYTHON_ROOT ${PYTHON_INCLUDE_DIR} DIRECTORY) + set(PYTHON_LIBRARY + "${_PYTHON_ROOT}/libs/Python${PYTHON_LIBRARY_SUFFIX}.lib") + endif() + + # raise an error if the python libs are still not found. + if(NOT EXISTS "${PYTHON_LIBRARY}") + message(FATAL_ERROR "Python libraries not found") + endif() + SET(PYTHON_LIBRARIES "${PYTHON_LIBRARY}") +endif(WIN32) + # Fixme: Maybe find a static library. Get SHARED/STATIC by FIND_PACKAGE. ADD_LIBRARY(python SHARED IMPORTED GLOBAL) SET_PROPERTY(TARGET python PROPERTY IMPORTED_LOCATION ${PYTHON_LIBRARIES}) diff --git a/cmake/external/rocprim.cmake b/cmake/external/rocprim.cmake new file mode 100644 index 0000000000000000000000000000000000000000..914c06491890574bcdf4374d8e0fd5498e780113 --- /dev/null +++ b/cmake/external/rocprim.cmake @@ -0,0 +1,44 @@ +if (NOT WITH_AMD_GPU) + return() +endif() + +# rocprim is "ROCm Parallel Primitives" for short. +# It is a header-only library providing HIP and HC parallel primitives +# for developing performant GPU-accelerated code on AMD ROCm platform. + +if("x${HCC_HOME}" STREQUAL "x") + set(HCC_HOME "/opt/rocm/hcc") +endif() + +INCLUDE(ExternalProject) + +SET(ROCPRIM_SOURCE_DIR ${THIRD_PARTY_PATH}/rocprim) +SET(ROCPRIM_INSTALL_DIR ${THIRD_PARTY_PATH}/install/rocprim) +SET(ROCPRIM_INCLUDE_DIR ${ROCPRIM_INSTALL_DIR}/include) + +ExternalProject_Add( + extern_rocprim + GIT_REPOSITORY "https://github.com/ROCmSoftwarePlatform/rocPRIM.git" + GIT_TAG 5bd41b96ab8d8343330fb2c3e1b96775bde3b3fc + PREFIX ${ROCPRIM_SOURCE_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_CXX_COMPILER=${HCC_HOME}/bin/hcc + CMAKE_ARGS -DONLY_INSTALL=ON + CMAKE_ARGS -DBUILD_TEST=OFF + CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${ROCPRIM_INSTALL_DIR} + + INSTALL_DIR ${ROCPRIM_INSTALL_DIR} + ${EXTERNAL_PROJECT_LOG_ARGS} +) + +INCLUDE_DIRECTORIES(${ROCPRIM_INCLUDE_DIR}) + +if (${CMAKE_VERSION} VERSION_LESS "3.3.0") + set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/rocprim_dummy.c) + file(WRITE ${dummyfile} "const char *dummy_rocprim = \"${dummyfile}\";") + add_library(rocprim STATIC ${dummyfile}) +else() + add_library(rocprim INTERFACE) +endif() + +add_dependencies(rocprim extern_rocprim) diff --git a/cmake/external/snappy.cmake b/cmake/external/snappy.cmake index af09ed4d5d6e21cc50aba5198a7e9ea56f49451a..b30403d2d81ce471f39b4d92e24a500fe41eeebb 100644 --- a/cmake/external/snappy.cmake +++ b/cmake/external/snappy.cmake @@ -24,7 +24,11 @@ set(SNAPPY_SOURCES_DIR ${THIRD_PARTY_PATH}/snappy) set(SNAPPY_INSTALL_DIR ${THIRD_PARTY_PATH}/install/snappy) set(SNAPPY_INCLUDE_DIR "${SNAPPY_INSTALL_DIR}/include" CACHE PATH "snappy include directory." 
FORCE) -set(SNAPPY_LIBRARIES "${SNAPPY_INSTALL_DIR}/lib/libsnappy.a") +if (WIN32) + set(SNAPPY_LIBRARIES "${SNAPPY_INSTALL_DIR}/lib/snappy.lib") +else(WIN32) + set(SNAPPY_LIBRARIES "${SNAPPY_INSTALL_DIR}/lib/libsnappy.a") +endif (WIN32) ExternalProject_Add( extern_snappy @@ -34,8 +38,12 @@ ExternalProject_Add( UPDATE_COMMAND "" CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} - -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} -DCMAKE_INSTALL_PREFIX=${SNAPPY_INSTALL_DIR} -DCMAKE_INSTALL_LIBDIR=${SNAPPY_INSTALL_DIR}/lib -DCMAKE_POSITION_INDEPENDENT_CODE=ON diff --git a/cmake/external/snappystream.cmake b/cmake/external/snappystream.cmake index 6df636d7fa8757ade73892bda03a80ba9767472b..1ec79462c14e44f2d0abe6904383ebd91d94d35e 100644 --- a/cmake/external/snappystream.cmake +++ b/cmake/external/snappystream.cmake @@ -18,36 +18,45 @@ ENDIF() include (ExternalProject) -# NOTE: snappy is needed when linking with recordio - set(SNAPPYSTREAM_SOURCES_DIR ${THIRD_PARTY_PATH}/snappy_stream) set(SNAPPYSTREAM_INSTALL_DIR ${THIRD_PARTY_PATH}/install/snappy_stream) set(SNAPPYSTREAM_INCLUDE_DIR "${SNAPPYSTREAM_INSTALL_DIR}/include" CACHE PATH "snappy stream include directory." FORCE) -set(SNAPPYSTREAM_LIBRARIES "${SNAPPYSTREAM_INSTALL_DIR}/lib/libsnappystream.a") - -ExternalProject_Add( - extern_snappystream - GIT_REPOSITORY "https://github.com/hoxnox/snappystream.git" - GIT_TAG "0.2.8" - PREFIX ${SNAPPYSTREAM_SOURCES_DIR} - UPDATE_COMMAND "" - CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} - -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} - -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} - -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} - -DCMAKE_INSTALL_PREFIX=${SNAPPY_INSTALL_DIR} - -DCMAKE_INSTALL_LIBDIR=${SNAPPY_INSTALL_DIR}/lib - -DCMAKE_POSITION_INDEPENDENT_CODE=ON - -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} - -DSNAPPY_ROOT=${SNAPPY_INSTALL_DIR} - ${EXTERNAL_OPTIONAL_ARGS} - CMAKE_CACHE_ARGS - -DCMAKE_INSTALL_PREFIX:PATH=${SNAPPYSTREAM_INSTALL_DIR} - -DCMAKE_INSTALL_LIBDIR:PATH=${SNAPPYSTREAM_INSTALL_DIR}/lib - -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} - DEPENDS snappy -) +if(WIN32) + # Fix me, VS2015 come without VLA support + set(SNAPPYSTREAM_LIBRARIES "${SNAPPYSTREAM_INSTALL_DIR}/lib/snappystream.lib") + MESSAGE(WARNING, "In windows, snappystream has no compile support for windows, + please build it manually and put it at " ${SNAPPYSTREAM_INSTALL_DIR}) +else(WIN32) + set(SNAPPYSTREAM_LIBRARIES "${SNAPPYSTREAM_INSTALL_DIR}/lib/libsnappystream.a") + + ExternalProject_Add( + extern_snappystream + GIT_REPOSITORY "https://github.com/hoxnox/snappystream.git" + GIT_TAG "0.2.8" + PREFIX ${SNAPPYSTREAM_SOURCES_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_INSTALL_PREFIX=${SNAPPY_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR=${SNAPPY_INSTALL_DIR}/lib + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + 
-DSNAPPY_ROOT=${SNAPPY_INSTALL_DIR} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS + -DCMAKE_INSTALL_PREFIX:PATH=${SNAPPYSTREAM_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR:PATH=${SNAPPYSTREAM_INSTALL_DIR}/lib + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + DEPENDS snappy + ) +endif(WIN32) add_library(snappystream STATIC IMPORTED GLOBAL) set_property(TARGET snappystream PROPERTY IMPORTED_LOCATION ${SNAPPYSTREAM_LIBRARIES}) diff --git a/cmake/external/xxhash.cmake b/cmake/external/xxhash.cmake new file mode 100644 index 0000000000000000000000000000000000000000..4c2d64f627401071098e72bfb930fb5d62fa042d --- /dev/null +++ b/cmake/external/xxhash.cmake @@ -0,0 +1,79 @@ +INCLUDE(ExternalProject) + +set(XXHASH_SOURCE_DIR ${THIRD_PARTY_PATH}/xxhash) +set(XXHASH_INSTALL_DIR ${THIRD_PARTY_PATH}/install/xxhash) +set(XXHASH_INCLUDE_DIR "${XXHASH_INSTALL_DIR}/include") + +IF(WITH_STATIC_LIB) + SET(BUILD_CMD make lib) +ELSE() + IF(APPLE) + SET(BUILD_CMD sed -i \"\" "s/-Wstrict-prototypes -Wundef/-Wstrict-prototypes -Wundef -fPIC/g" ${XXHASH_SOURCE_DIR}/src/extern_xxhash/Makefile && make lib) + ELSE(APPLE) + SET(BUILD_CMD sed -i "s/-Wstrict-prototypes -Wundef/-Wstrict-prototypes -Wundef -fPIC/g" ${XXHASH_SOURCE_DIR}/src/extern_xxhash/Makefile && make lib) + ENDIF(APPLE) +ENDIF() + +if(WIN32) + ExternalProject_Add( + extern_xxhash + ${EXTERNAL_PROJECT_LOG_ARGS} + GIT_REPOSITORY "https://github.com/Cyan4973/xxHash" + GIT_TAG "v0.6.5" + PREFIX ${XXHASH_SOURCE_DIR} + DOWNLOAD_NAME "xxhash" + UPDATE_COMMAND "" + BUILD_IN_SOURCE 1 + PATCH_COMMAND + CONFIGURE_COMMAND + ${CMAKE_COMMAND} ${XXHASH_SOURCE_DIR}/src/extern_xxhash/cmake_unofficial + -DCMAKE_INSTALL_PREFIX:PATH=${XXHASH_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE} + -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DBUILD_XXHSUM=OFF + -DCMAKE_GENERATOR_PLATFORM=x64 + -DBUILD_SHARED_LIBS=OFF + ${OPTIONAL_CACHE_ARGS} + TEST_COMMAND "" + ) +else() + ExternalProject_Add( + extern_xxhash + ${EXTERNAL_PROJECT_LOG_ARGS} + GIT_REPOSITORY "https://github.com/Cyan4973/xxHash" + GIT_TAG "v0.6.5" + PREFIX ${XXHASH_SOURCE_DIR} + DOWNLOAD_NAME "xxhash" + UPDATE_COMMAND "" + CONFIGURE_COMMAND "" + BUILD_IN_SOURCE 1 + PATCH_COMMAND + BUILD_COMMAND ${BUILD_CMD} + INSTALL_COMMAND export PREFIX=${XXHASH_INSTALL_DIR}/ && make install + TEST_COMMAND "" + ) +endif() + +if (WIN32) + set(XXHASH_LIBRARIES "${XXHASH_INSTALL_DIR}/lib/xxhash.lib") +else() + set(XXHASH_LIBRARIES "${XXHASH_INSTALL_DIR}/lib/libxxhash.a") +endif () +INCLUDE_DIRECTORIES(${XXHASH_INCLUDE_DIR}) + +add_library(xxhash STATIC IMPORTED GLOBAL) +set_property(TARGET xxhash PROPERTY IMPORTED_LOCATION ${XXHASH_LIBRARIES}) +include_directories(${XXHASH_INCLUDE_DIR}) +add_dependencies(xxhash extern_xxhash) + +LIST(APPEND external_project_dependencies xxhash) + +IF(WITH_C_API) + INSTALL(DIRECTORY ${XXHASH_INCLUDE_DIR} DESTINATION third_party/xxhash) + IF(ANDROID) + INSTALL(FILES ${XXHASH_LIBRARIES} DESTINATION third_party/xxhash/lib/${ANDROID_ABI}) + ELSE() + INSTALL(FILES ${XXHASH_LIBRARIES} DESTINATION third_party/xxhash/lib) + ENDIF() +ENDIF() diff --git a/cmake/flags.cmake b/cmake/flags.cmake index 343e44ab4bc21c1a656048b675062f1b897bbc77..c4472040cef870454c072c1b84a04e1ac592b476 100644 --- a/cmake/flags.cmake +++ b/cmake/flags.cmake @@ -129,6 +129,9 @@ set(COMMON_FLAGS -Wno-error=parentheses-equality # Warnings in pybind11 -Wno-error=ignored-attributes # Warnings in Eigen, gcc 6.3 -Wno-error=terminate # Warning in PADDLE_ENFORCE + -Wno-error=int-in-bool-context 
# Warning in Eigen gcc 7.2
+    -Wimplicit-fallthrough=0 # Warning in tinyformat.h
+    -Wno-error=maybe-uninitialized # Warning in boost gcc 7.2
)

set(GPU_COMMON_FLAGS
diff --git a/cmake/generic.cmake b/cmake/generic.cmake
index 5bf82b4ddf10a646ca540ac4ee2cfd3d3bc6cf58..312fbaa0b3d83c37debe78be82503103eabc0bfa 100644
--- a/cmake/generic.cmake
+++ b/cmake/generic.cmake
@@ -261,6 +261,17 @@ function(cc_library TARGET_NAME)
       add_dependencies(${TARGET_NAME} mklml)
       target_link_libraries(${TARGET_NAME} "-L${MKLML_LIB_DIR} -liomp5 -Wl,--as-needed")
     endif()
+    # remove link to python, see notes at:
+    # https://github.com/pybind/pybind11/blob/master/docs/compiling.rst#building-manually
+    if("${cc_library_DEPS};" MATCHES "python;")
+      list(REMOVE_ITEM cc_library_DEPS python)
+      add_dependencies(${TARGET_NAME} python)
+      if(WIN32)
+        target_link_libraries(${TARGET_NAME} ${PYTHON_LIBRARIES})
+      else()
+        target_link_libraries(${TARGET_NAME} "-Wl,-undefined,dynamic_lookup")
+      endif(WIN32)
+    endif()
     target_link_libraries(${TARGET_NAME} ${cc_library_DEPS})
     add_dependencies(${TARGET_NAME} ${cc_library_DEPS})
   endif()
@@ -281,6 +292,45 @@ function(cc_library TARGET_NAME)
   endif(cc_library_SRCS)
 endfunction(cc_library)

+# The link operation under Windows may exceed the maximum command-line length
+# limit; breaking one long link command into multiple shorter link operations
+# fixes that, e.g.
+# original:
+#     lib /out:target.lib a.lib b.lib c.lib d.lib
+# after:
+#     1. lib /out:dummy_lib_1.lib a.lib b.lib
+#     2. lib /out:dummy_lib_2.lib c.lib d.lib
+#     3. lib /out:target.lib dummy_lib_1.lib dummy_lib_2.lib
+function(sep_library TARGET_NAME)
+  set(options STATIC static SHARED shared)
+  set(oneValueArgs "")
+  set(multiValueArgs SRCS DEPS)
+  cmake_parse_arguments(sep_library "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+  set(dummy_index 1)
+  set(dummy_offset 1)
+  # each dummy target groups at most dummy_limit libraries
+  set(dummy_limit 50)
+  list(LENGTH sep_library_DEPS sep_all_len)
+  foreach(v ${sep_library_DEPS})
+    list(APPEND dummy_list ${v})
+    list(LENGTH dummy_list listlen )
+    if ((${listlen} GREATER ${dummy_limit}) OR (${dummy_offset} EQUAL ${sep_all_len}))
+      message("create dummy library ${TARGET_NAME}_dummy_lib_${dummy_index} for ${TARGET_NAME}")
+      cc_library(${TARGET_NAME}_dummy_lib_${dummy_index} STATIC DEPS ${dummy_list})
+      foreach(i ${dummy_list})
+        list(REMOVE_AT dummy_list 0)
+      endforeach()
+      list(APPEND ${TARGET_NAME}_dummy_list ${TARGET_NAME}_dummy_lib_${dummy_index})
+      MATH(EXPR dummy_index "${dummy_index}+1")
+    endif()
+    MATH(EXPR dummy_offset "${dummy_offset}+1")
+  endforeach()
+  if(${sep_library_SHARED})
+    cc_library(${TARGET_NAME} SHARED SRCS ${sep_library_SRCS} DEPS ${${TARGET_NAME}_dummy_list})
+  else(${sep_library_SHARED})
+    cc_library(${TARGET_NAME} STATIC SRCS ${sep_library_SRCS} DEPS ${${TARGET_NAME}_dummy_list})
+  endif(${sep_library_SHARED})
+endfunction(sep_library)
+
 function(cc_binary TARGET_NAME)
   set(options "")
   set(oneValueArgs "")
@@ -299,8 +349,18 @@ function(cc_test TARGET_NAME)
     set(oneValueArgs "")
     set(multiValueArgs SRCS DEPS ARGS)
     cmake_parse_arguments(cc_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+    if(WIN32)
+      list(APPEND win32_deps shlwapi)
+      if("${cc_test_DEPS};" MATCHES "python;")
+        list(REMOVE_ITEM cc_test_DEPS python)
+        list(APPEND win32_deps ${PYTHON_LIBRARIES})
+      endif()
+    endif(WIN32)
     add_executable(${TARGET_NAME} ${cc_test_SRCS})
     target_link_libraries(${TARGET_NAME} ${cc_test_DEPS} paddle_gtest_main lod_tensor memory gtest gflags glog)
+    if(WIN32)
+
target_link_libraries(${TARGET_NAME} ${win32_deps}) + endif(WIN32) add_dependencies(${TARGET_NAME} ${cc_test_DEPS} paddle_gtest_main lod_tensor memory gtest gflags glog) add_test(NAME ${TARGET_NAME} COMMAND ${TARGET_NAME} ${cc_test_ARGS} @@ -311,6 +371,8 @@ function(cc_test TARGET_NAME) set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_cpu_deterministic=true) set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_init_allocated_mem=true) set_property(TEST ${TARGET_NAME} PROPERTY ENVIRONMENT FLAGS_cudnn_deterministic=true) + # No unit test should exceed 10 minutes. + set_tests_properties(${TARGET_NAME} PROPERTIES TIMEOUT 600) endif() endfunction(cc_test) @@ -399,25 +461,29 @@ function(hip_library TARGET_NAME) else() add_library(${TARGET_NAME} STATIC ${_cmake_options} ${_generated_files} ${_sources}) set_target_properties(${TARGET_NAME} PROPERTIES LINKER_LANGUAGE CXX) - target_link_libraries(${TARGET_NAME} /opt/rocm/hip/lib/libhip_hcc.so /opt/rocm/hip/lib/libhip_device.a) - find_fluid_modules(${TARGET_NAME}) + target_link_libraries(${TARGET_NAME} /opt/rocm/hip/lib/libhip_hcc.so /opt/rocm/hip/lib/libhip_device.a /opt/rocm/rccl/lib/librccl.so /opt/rocm/hiprand/lib/libhiprand.so) + find_fluid_modules(${TARGET_NAME}) endif() - if (hip_library_DEPS) - add_dependencies(${TARGET_NAME} ${hip_library_DEPS}) - target_link_libraries(${TARGET_NAME} ${hip_library_DEPS}) + if("${hip_library_DEPS}" MATCHES "ARCHIVE_START") + # Support linking flags: --whole-archive (Linux) / -force_load (MacOS). + # WARNING: Please don't use ARCHIVE_START&ARCHIVE_END if TARGET_NAME will be linked by other libraries. + target_circle_link_libraries(${TARGET_NAME} ${hip_library_DEPS}) + list(REMOVE_ITEM hip_library_DEPS ARCHIVE_START ARCHIVE_END) + else() + target_link_libraries(${TARGET_NAME} ${hip_library_DEPS}) endif() # cpplint code style foreach(source_file ${hip_library_SRCS}) - string(REGEX REPLACE "\\.[^.]*$" "" source ${source_file}) - if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) - list(APPEND hip_library_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) - endif() + string(REGEX REPLACE "\\.[^.]*$" "" source ${source_file}) + if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) + list(APPEND hip_library_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) + endif() endforeach() else(hip_library_SRCS) if (hip_library_DEPS) - merge_static_libs(${TARGET_NAME} ${hip_library_DEPS}) + merge_static_libs(${TARGET_NAME} ${hip_library_DEPS}) else() - message(FATAL "Please specify source file or library in nv_library.") + message(FATAL "Please specify source file or library in nv_library.") endif() endif(hip_library_SRCS) endif() @@ -624,11 +690,13 @@ function(py_test TARGET_NAME) set(multiValueArgs SRCS DEPS ARGS ENVS) cmake_parse_arguments(py_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) add_test(NAME ${TARGET_NAME} - COMMAND env FLAGS_init_allocated_mem=true FLAGS_cudnn_deterministic=true + COMMAND ${CMAKE_COMMAND} -E env FLAGS_init_allocated_mem=true FLAGS_cudnn_deterministic=true FLAGS_cpu_deterministic=true PYTHONPATH=${PADDLE_BINARY_DIR}/python ${py_test_ENVS} ${PYTHON_EXECUTABLE} -u ${py_test_SRCS} ${py_test_ARGS} WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) + # No unit test should exceed 10 minutes. 
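+  # For reference, a minimal invocation of this macro would look like
+  #     py_test(test_some_layer SRCS test_some_layer.py)
+  # (the test name here is hypothetical). It registers the script with ctest,
+  # runs it through ${PYTHON_EXECUTABLE} with the deterministic-run flags set
+  # above, and is bounded by the timeout applied below.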
+ set_tests_properties(${TARGET_NAME} PROPERTIES TIMEOUT 600) endif() endfunction() diff --git a/cmake/hip.cmake b/cmake/hip.cmake index bfe491bd6b7602959d3dd60bd06c67993593cc9b..4276bc5b08cd88a52bb5782bca87fc37deabd830 100644 --- a/cmake/hip.cmake +++ b/cmake/hip.cmake @@ -3,6 +3,8 @@ if(NOT WITH_AMD_GPU) endif() include_directories("/opt/rocm/include") +include_directories("/opt/rocm/hip/include") +include_directories("/opt/rocm/miopen/include") include_directories("/opt/rocm/hipblas/include") include_directories("/opt/rocm/hiprand/include") include_directories("/opt/rocm/rocrand/include") @@ -11,20 +13,40 @@ include_directories("/opt/rocm/thrust") list(APPEND EXTERNAL_LIBS "-L/opt/rocm/lib/ -lhip_hcc") -set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -fPIC -DPADDLE_WITH_HIP -std=c++14" ) +set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -fPIC -DPADDLE_WITH_HIP -std=c++11" ) if(WITH_DSO) set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_USE_DSO") endif(WITH_DSO) -if(WITH_DOUBLE) - set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_TYPE_DOUBLE") -endif(WITH_DOUBLE) - if(WITH_TESTING) set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_TESTING") endif(WITH_TESTING) +if(WITH_DISTRIBUTE) + set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_DISTRIBUTE") +endif(WITH_DISTRIBUTE) + +if(WITH_GRPC) + set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_GRPC") +endif(WITH_GRPC) + +if(NOT WITH_GOLANG) + set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITHOUT_GOLANG") +endif(NOT WITH_GOLANG) + +if(WITH_MKLDNN) + set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_MKLDNN") +endif(WITH_MKLDNN) + +set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DANY_IMPL_ANY_CAST_MOVEABLE") + +if(NOT WITH_RDMA) + set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_DISABLE_RDMA") +endif(NOT WITH_RDMA) + + + if(CMAKE_BUILD_TYPE STREQUAL "Debug") list(APPEND HIP_HCC_FLAGS ${CMAKE_CXX_FLAGS_DEBUG}) elseif(CMAKE_BUILD_TYPE STREQUAL "RelWithDebInfo") diff --git a/cmake/inference_lib.cmake b/cmake/inference_lib.cmake index a3e682e54ac496e37ed4a33a7b30d9fdca381d9d..7355b67ab1020f58760f23b1a20ca189591db35e 100644 --- a/cmake/inference_lib.cmake +++ b/cmake/inference_lib.cmake @@ -22,184 +22,232 @@ function(copy TARGET) list(LENGTH copy_lib_SRCS copy_lib_SRCS_len) list(LENGTH copy_lib_DSTS copy_lib_DSTS_len) - if(NOT ${copy_lib_SRCS_len} EQUAL ${copy_lib_DSTS_len}) + if (NOT ${copy_lib_SRCS_len} EQUAL ${copy_lib_DSTS_len}) message(FATAL_ERROR "${TARGET} source numbers are not equal to destination numbers") - endif() + endif () math(EXPR len "${copy_lib_SRCS_len} - 1") add_custom_target(${TARGET} DEPENDS ${copy_lib_DEPS}) - foreach(index RANGE ${len}) + foreach (index RANGE ${len}) list(GET copy_lib_SRCS ${index} src) list(GET copy_lib_DSTS ${index} dst) - add_custom_command(TARGET ${TARGET} PRE_BUILD - COMMAND mkdir -p "${dst}" - COMMAND cp -r "${src}" "${dst}" - COMMENT "copying ${src} -> ${dst}") - endforeach() + if (WIN32) + # windows cmd shell will not expand wildcard automatically. + # below expand the files,libs and copy them by rules. 
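+      # For instance, a SRCS entry of ${GLOG_INCLUDE_DIR} pairs with the DSTS
+      # entry at the same index (see e.g. the glog_lib rule further below);
+      # the non-Windows branch can hand each pattern to `cp -r` directly,
+      # whereas here each pattern is first expanded into concrete *.h, *.lib
+      # and *.dll files, which are then copied one by one with
+      # `${CMAKE_COMMAND} -E copy`.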
+ file(GLOB header_files ${src} "*.h") + file(GLOB static_lib_files ${src} "*.lib") + file(GLOB dll_lib_files ${src} "*.dll") + set(src_files ${header_files} ${static_lib_files} ${dll_lib_files}) + + if (NOT "${src_files}" STREQUAL "") + list(REMOVE_DUPLICATES src_files) + endif () + add_custom_command(TARGET ${TARGET} PRE_BUILD + COMMAND ${CMAKE_COMMAND} -E make_directory "${dst}" + ) + foreach (src_file ${src_files}) + add_custom_command(TARGET ${TARGET} PRE_BUILD + COMMAND ${CMAKE_COMMAND} -E copy "${src_file}" "${dst}" + COMMENT "copying ${src_file} -> ${dst}") + endforeach () + else (WIN32) # not windows + add_custom_command(TARGET ${TARGET} PRE_BUILD + COMMAND mkdir -p "${dst}" + COMMAND cp -r "${src}" "${dst}" + COMMENT "copying ${src} -> ${dst}") + endif (WIN32) # not windows + endforeach () endfunction() # third party set(dst_dir "${FLUID_INSTALL_DIR}/third_party/eigen3") copy(eigen3_lib - SRCS ${EIGEN_INCLUDE_DIR}/Eigen/Core ${EIGEN_INCLUDE_DIR}/Eigen/src ${EIGEN_INCLUDE_DIR}/unsupported/Eigen - DSTS ${dst_dir}/Eigen ${dst_dir}/Eigen ${dst_dir}/unsupported - DEPS eigen3 -) + SRCS ${EIGEN_INCLUDE_DIR}/Eigen/Core ${EIGEN_INCLUDE_DIR}/Eigen/src ${EIGEN_INCLUDE_DIR}/unsupported/Eigen + DSTS ${dst_dir}/Eigen ${dst_dir}/Eigen ${dst_dir}/unsupported + DEPS eigen3 + ) set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/gflags") copy(gflags_lib - SRCS ${GFLAGS_INCLUDE_DIR} ${GFLAGS_LIBRARIES} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS gflags -) + SRCS ${GFLAGS_INCLUDE_DIR} ${GFLAGS_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS gflags + ) set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/glog") copy(glog_lib - SRCS ${GLOG_INCLUDE_DIR} ${GLOG_LIBRARIES} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS glog -) + SRCS ${GLOG_INCLUDE_DIR} ${GLOG_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS glog + ) set(dst_dir "${FLUID_INSTALL_DIR}/third_party/boost/") copy(boost_lib - SRCS ${BOOST_INCLUDE_DIR}/boost - DSTS ${dst_dir} - DEPS boost -) - -if(NOT PROTOBUF_FOUND) + SRCS ${BOOST_INCLUDE_DIR}/boost + DSTS ${dst_dir} + DEPS boost + ) + +set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/xxhash") +copy(xxhash_lib + SRCS ${XXHASH_INCLUDE_DIR} ${XXHASH_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS xxhash + ) + +if (NOT PROTOBUF_FOUND) set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/protobuf") copy(protobuf_lib - SRCS ${PROTOBUF_INCLUDE_DIR} ${PROTOBUF_LIBRARY} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS extern_protobuf - ) -endif() + SRCS ${PROTOBUF_INCLUDE_DIR} ${PROTOBUF_LIBRARY} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS extern_protobuf + ) +endif () -if(NOT CBLAS_FOUND) +if (NOT CBLAS_FOUND) set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/openblas") copy(openblas_lib - SRCS ${CBLAS_INSTALL_DIR}/lib ${CBLAS_INSTALL_DIR}/include - DSTS ${dst_dir} ${dst_dir} - DEPS extern_openblas - ) + SRCS ${CBLAS_INSTALL_DIR}/lib ${CBLAS_INSTALL_DIR}/include + DSTS ${dst_dir} ${dst_dir} + DEPS extern_openblas + ) elseif (WITH_MKLML) set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/mklml") copy(mklml_lib - SRCS ${MKLML_LIB} ${MKLML_IOMP_LIB} ${MKLML_INC_DIR} - DSTS ${dst_dir}/lib ${dst_dir}/lib ${dst_dir} - DEPS mklml - ) -endif() - -if(WITH_MKLDNN) - set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/mkldnn") - copy(mkldnn_lib - SRCS ${MKLDNN_INC_DIR} ${MKLDNN_SHARED_LIB} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS mkldnn - ) -endif() + SRCS ${MKLML_LIB} ${MKLML_IOMP_LIB} ${MKLML_INC_DIR} + DSTS ${dst_dir}/lib ${dst_dir}/lib ${dst_dir} + DEPS mklml + ) +endif () + +if (WITH_MKLDNN) + 
set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/mkldnn") + copy(mkldnn_lib + SRCS ${MKLDNN_INC_DIR} ${MKLDNN_SHARED_LIB} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS mkldnn + ) +endif () if (NOT WIN32) -if(NOT MOBILE_INFERENCE AND NOT RPI) - set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/snappy") - copy(snappy_lib - SRCS ${SNAPPY_INCLUDE_DIR} ${SNAPPY_LIBRARIES} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS snappy) - - set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/snappystream") - copy(snappystream_lib - SRCS ${SNAPPYSTREAM_INCLUDE_DIR} ${SNAPPYSTREAM_LIBRARIES} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS snappystream) - - set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/zlib") - copy(zlib_lib - SRCS ${ZLIB_INCLUDE_DIR} ${ZLIB_LIBRARIES} - DSTS ${dst_dir} ${dst_dir}/lib - DEPS zlib) -endif() -endif(NOT WIN32) + if (NOT MOBILE_INFERENCE AND NOT RPI) + set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/snappy") + copy(snappy_lib + SRCS ${SNAPPY_INCLUDE_DIR} ${SNAPPY_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS snappy) + + set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/snappystream") + copy(snappystream_lib + SRCS ${SNAPPYSTREAM_INCLUDE_DIR} ${SNAPPYSTREAM_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS snappystream) + + set(dst_dir "${FLUID_INSTALL_DIR}/third_party/install/zlib") + copy(zlib_lib + SRCS ${ZLIB_INCLUDE_DIR} ${ZLIB_LIBRARIES} + DSTS ${dst_dir} ${dst_dir}/lib + DEPS zlib) + endif () +endif (NOT WIN32) # paddle fluid module set(src_dir "${PADDLE_SOURCE_DIR}/paddle/fluid") set(dst_dir "${FLUID_INSTALL_DIR}/paddle/fluid") set(module "framework") if (NOT WIN32) -set(framework_lib_deps framework_py_proto) -endif(NOT WIN32) + set(framework_lib_deps framework_py_proto) +endif (NOT WIN32) copy(framework_lib DEPS ${framework_lib_deps} - SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/details/*.h ${PADDLE_BINARY_DIR}/paddle/fluid/framework/framework.pb.h - ${src_dir}/${module}/ir/*.h - DSTS ${dst_dir}/${module} ${dst_dir}/${module}/details ${dst_dir}/${module} ${dst_dir}/${module}/ir -) + SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/details/*.h ${PADDLE_BINARY_DIR}/paddle/fluid/framework/framework.pb.h + ${src_dir}/${module}/ir/*.h + DSTS ${dst_dir}/${module} ${dst_dir}/${module}/details ${dst_dir}/${module} ${dst_dir}/${module}/ir + ) set(module "memory") copy(memory_lib - SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/detail/*.h - DSTS ${dst_dir}/${module} ${dst_dir}/${module}/detail -) + SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/detail/*.h ${src_dir}/${module}/allocation/*.h + DSTS ${dst_dir}/${module} ${dst_dir}/${module}/detail ${dst_dir}/${module}/allocation + ) set(inference_deps paddle_fluid_shared paddle_fluid) set(module "inference/api") if (WITH_ANAKIN AND WITH_MKL) copy(anakin_inference_lib DEPS paddle_inference_api inference_anakin_api - SRCS - ${PADDLE_BINARY_DIR}/paddle/fluid/inference/api/libinference_anakin_api* # compiled anakin api - ${ANAKIN_INSTALL_DIR} # anakin release - DSTS ${dst_dir}/inference/anakin ${FLUID_INSTALL_DIR}/third_party/install/anakin) - list(APPEND inference_deps anakin_inference_lib) -endif() + SRCS + ${PADDLE_BINARY_DIR}/paddle/fluid/inference/api/libinference_anakin_api* # compiled anakin api + ${ANAKIN_INSTALL_DIR} # anakin release + DSTS ${FLUID_INSTALL_DIR}/third_party/install/anakin ${FLUID_INSTALL_DIR}/third_party/install/anakin) + list(APPEND inference_deps anakin_inference_lib) +endif () set(module "inference") copy(inference_lib DEPS ${inference_deps} SRCS ${src_dir}/${module}/*.h 
${PADDLE_BINARY_DIR}/paddle/fluid/inference/libpaddle_fluid.*
-       ${src_dir}/${module}/api/paddle_inference_api.h ${src_dir}/${module}/api/demo_ci
+       ${src_dir}/${module}/api/paddle_*.h
        ${PADDLE_BINARY_DIR}/paddle/fluid/inference/api/paddle_inference_pass.h
-       DSTS ${dst_dir}/${module} ${dst_dir}/${module} ${dst_dir}/${module} ${dst_dir}/${module} ${dst_dir}/${module}
-)
+       DSTS ${dst_dir}/${module} ${dst_dir}/${module} ${dst_dir}/${module} ${dst_dir}/${module}
+       )

set(module "platform")
copy(platform_lib DEPS profiler_py_proto
-  SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/dynload/*.h ${src_dir}/${module}/details/*.h
-  DSTS ${dst_dir}/${module} ${dst_dir}/${module}/dynload ${dst_dir}/${module}/details
-)
+        SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/dynload/*.h ${src_dir}/${module}/details/*.h
+        DSTS ${dst_dir}/${module} ${dst_dir}/${module}/dynload ${dst_dir}/${module}/details
+        )

set(module "string")
copy(string_lib
-  SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/tinyformat/*.h
-  DSTS ${dst_dir}/${module} ${dst_dir}/${module}/tinyformat
-)
+        SRCS ${src_dir}/${module}/*.h ${src_dir}/${module}/tinyformat/*.h
+        DSTS ${dst_dir}/${module} ${dst_dir}/${module}/tinyformat
+        )

set(module "pybind")
copy(pybind_lib
-  SRCS ${CMAKE_CURRENT_BINARY_DIR}/paddle/fluid/${module}/pybind.h
-  DSTS ${dst_dir}/${module}
-)
+        SRCS ${CMAKE_CURRENT_BINARY_DIR}/paddle/fluid/${module}/pybind.h
+        DSTS ${dst_dir}/${module}
+        )

# CMakeCache Info
copy(cmake_cache
-  SRCS ${CMAKE_CURRENT_BINARY_DIR}/CMakeCache.txt
-  DSTS ${FLUID_INSTALL_DIR})
+        SRCS ${CMAKE_CURRENT_BINARY_DIR}/CMakeCache.txt
+        DSTS ${FLUID_INSTALL_DIR})

# This command generates a complete fluid library for both train and inference
-add_custom_target(fluid_lib_dist DEPENDS ${fluid_lib_dist_dep})
+add_custom_target(fluid_lib_dist DEPENDS ${fluid_lib_dist_dep})
+
+# The following commands generate an inference-only fluid library.
+# third_party, version.txt and CMakeCache.txt sit at the same relative positions as in ${FLUID_INSTALL_DIR}.
+copy(third_party DEPS fluid_lib_dist
+        SRCS ${FLUID_INSTALL_DIR}/third_party ${FLUID_INSTALL_DIR}/CMakeCache.txt
+        DSTS ${FLUID_INFERENCE_INSTALL_DIR} ${FLUID_INFERENCE_INSTALL_DIR}
+        )
+
+# the inference-only library needs only libpaddle_fluid.so/.a and the paddle_*.h headers
+copy(inference_api_lib DEPS fluid_lib_dist
+        SRCS ${FLUID_INSTALL_DIR}/paddle/fluid/inference/libpaddle_fluid.*
+        ${FLUID_INSTALL_DIR}/paddle/fluid/inference/paddle_*.h
+        DSTS ${FLUID_INFERENCE_INSTALL_DIR}/paddle/lib ${FLUID_INFERENCE_INSTALL_DIR}/paddle/include
+)
+
+add_custom_target(inference_lib_dist DEPENDS third_party inference_api_lib)

# paddle fluid version
-execute_process(
-  COMMAND ${GIT_EXECUTABLE} log --pretty=format:%H -1
-  WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}
-  OUTPUT_VARIABLE PADDLE_GIT_COMMIT)
-set(version_file ${FLUID_INSTALL_DIR}/version.txt)
-file(WRITE ${version_file}
-  "GIT COMMIT ID: ${PADDLE_GIT_COMMIT}\n"
-  "WITH_MKL: ${WITH_MKL}\n"
-  "WITH_GPU: ${WITH_GPU}\n")
-if(WITH_GPU)
-  file(APPEND ${version_file}
-    "CUDA version: ${CUDA_VERSION}\n"
-    "CUDNN version: v${CUDNN_MAJOR_VERSION}\n")
-endif()
+function(version version_file)
+    execute_process(
+            COMMAND ${GIT_EXECUTABLE} log --pretty=format:%H -1
+            WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}
+            OUTPUT_VARIABLE PADDLE_GIT_COMMIT)
+    file(WRITE ${version_file}
+            "GIT COMMIT ID: ${PADDLE_GIT_COMMIT}\n"
+            "WITH_MKL: ${WITH_MKL}\n"
+            "WITH_MKLDNN: ${WITH_MKLDNN}\n"
+            "WITH_GPU: ${WITH_GPU}\n")
+    if (WITH_GPU)
+        file(APPEND ${version_file}
+                "CUDA version: ${CUDA_VERSION}\n"
+                "CUDNN version: v${CUDNN_MAJOR_VERSION}\n")
+    endif ()
+endfunction()
+version(${FLUID_INSTALL_DIR}/version.txt)
+version(${FLUID_INFERENCE_INSTALL_DIR}/version.txt)
diff --git a/cmake/operators.cmake b/cmake/operators.cmake
new file mode 100644
index 0000000000000000000000000000000000000000..89726bf9859e71ee04c2f9380554090845fd44e5
--- /dev/null
+++ b/cmake/operators.cmake
@@ -0,0 +1,218 @@
+set(PART_CUDA_KERNEL_FILES)
+function(op_library TARGET)
+  # op_library is a function to create an op library. The interface is the
+  # same as cc_library's, but it also handles splitting GPU/CPU code and
+  # linking the common libraries that ops need.
+  set(cc_srcs)
+  set(cu_srcs)
+  set(hip_cu_srcs)
+  set(miopen_hip_cc_srcs)
+  set(cu_cc_srcs)
+  set(cudnn_cu_cc_srcs)
+  set(CUDNN_FILE)
+  set(mkldnn_cc_srcs)
+  set(MKLDNN_FILE)
+  set(op_common_deps operator op_registry math_function)
+  set(options "")
+  set(oneValueArgs "")
+  set(multiValueArgs SRCS DEPS)
+  set(pybind_flag 0)
+  cmake_parse_arguments(op_library "${options}" "${oneValueArgs}"
+          "${multiValueArgs}" ${ARGN})
+
+  list(LENGTH op_library_SRCS op_library_SRCS_len)
+  if (${op_library_SRCS_len} EQUAL 0)
+    if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cc)
+      list(APPEND cc_srcs ${TARGET}.cc)
+    endif()
+    if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cu.cc)
+      list(APPEND cu_cc_srcs ${TARGET}.cu.cc)
+    endif()
+    if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cu)
+      list(APPEND cu_srcs ${TARGET}.cu)
+    endif()
+    if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.part.cu)
+      set(PART_CUDA_KERNEL_FILES ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.part.cu
+              ${PART_CUDA_KERNEL_FILES} PARENT_SCOPE)
+      list(APPEND cu_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.part.cu)
+    endif()
+
+    if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.hip.cu)
+      list(APPEND hip_cu_srcs ${TARGET}.hip.cu)
+    endif()
+    string(REPLACE "_op" "_cudnn_op" CUDNN_FILE "${TARGET}")
+    if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${CUDNN_FILE}.cu.cc)
+      list(APPEND cudnn_cu_cc_srcs ${CUDNN_FILE}.cu.cc)
+    endif()
+    if(WITH_AMD_GPU)
+      string(REPLACE "_op" "_miopen_op" MIOPEN_FILE "${TARGET}")
+      if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${MIOPEN_FILE}.hip.cc)
+        list(APPEND miopen_hip_cc_srcs ${MIOPEN_FILE}.hip.cc)
+      endif()
+    endif()
+    if(WITH_MKLDNN)
+      string(REPLACE "_op" "_mkldnn_op" MKLDNN_FILE "${TARGET}")
+      if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${MKLDNN_FILE}.cc)
+        list(APPEND mkldnn_cc_srcs ${MKLDNN_FILE}.cc)
+      endif()
+    endif()
+  else()
+    foreach(src ${op_library_SRCS})
+      if (${src} MATCHES ".*\\.hip.cu$")
+        list(APPEND hip_cu_srcs ${src})
+      elseif (${src} MATCHES ".*\\.cu$")
+        list(APPEND cu_srcs ${src})
+      elseif(${src} MATCHES ".*_cudnn_op.cu.cc$")
+        list(APPEND cudnn_cu_cc_srcs ${src})
+      elseif(WITH_AMD_GPU AND ${src} MATCHES ".*_miopen_op.hip.cc$")
+        list(APPEND miopen_hip_cc_srcs ${src})
+      elseif(WITH_MKLDNN AND ${src} MATCHES ".*_mkldnn_op.cc$")
+        list(APPEND mkldnn_cc_srcs ${src})
+      elseif(${src} MATCHES ".*\\.cu.cc$")
+        list(APPEND cu_cc_srcs ${src})
+      elseif(${src} MATCHES ".*\\.cc$")
+        list(APPEND cc_srcs ${src})
+      else()
+        message(FATAL_ERROR "${TARGET} Source file ${src} should only be .cc or .cu")
+      endif()
+    endforeach()
+  endif()
+
+  list(LENGTH cc_srcs cc_srcs_len)
+  if (${cc_srcs_len} EQUAL 0)
+    message(FATAL_ERROR "The op library ${TARGET} should contain at least one .cc file")
+  endif()
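+  # A sketch of the convention above, with an illustrative op name: calling
+  #     op_library(concat_op)
+  # with no explicit SRCS makes the function pick up whichever of
+  #     concat_op.cc, concat_op.cu.cc, concat_op.cu, concat_op.part.cu,
+  #     concat_op.hip.cu, concat_cudnn_op.cu.cc, concat_miopen_op.hip.cc
+  #     and concat_mkldnn_op.cc
+  # exist next to it, and compiles each one with the matching toolchain.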
+  if (WIN32)
+    # remove ops that Windows cannot build: it has no nccl and no warpctc,
+    # so such ops are skipped there.
+    foreach(windows_unsupport_op "nccl_op" "gen_nccl_id_op" "warpctc_op")
+      if ("${TARGET}" STREQUAL "${windows_unsupport_op}")
+        return()
+      endif()
+    endforeach()
+  endif(WIN32)
+  set(OP_LIBRARY ${TARGET} ${OP_LIBRARY} CACHE INTERNAL "op libs")
+
+  list(LENGTH op_library_DEPS op_library_DEPS_len)
+  if (${op_library_DEPS_len} GREATER 0)
+    set(DEPS_OPS ${TARGET} ${DEPS_OPS} PARENT_SCOPE)
+  endif()
+  if (WITH_GPU)
+    nv_library(${TARGET} SRCS ${cc_srcs} ${cu_cc_srcs} ${cudnn_cu_cc_srcs} ${mkldnn_cc_srcs} ${cu_srcs} DEPS ${op_library_DEPS}
+            ${op_common_deps})
+  elseif (WITH_AMD_GPU)
+    hip_library(${TARGET} SRCS ${cc_srcs} ${hip_cu_srcs} ${miopen_hip_cc_srcs} ${mkldnn_cc_srcs} DEPS ${op_library_DEPS}
+            ${op_common_deps})
+  else()
+    cc_library(${TARGET} SRCS ${cc_srcs} ${mkldnn_cc_srcs} DEPS ${op_library_DEPS}
+            ${op_common_deps})
+  endif()
+
+  # Define operators that don't need pybind here.
+  foreach(manual_pybind_op "compare_op" "logical_op" "nccl_op"
+"tensor_array_read_write_op" "tensorrt_engine_op" "conv_fusion_op"
+"fusion_transpose_flatten_concat_op")
+    if ("${TARGET}" STREQUAL "${manual_pybind_op}")
+      set(pybind_flag 1)
+    endif()
+  endforeach()
+
+  # For the registration of USE_OP, please refer to paddle/fluid/framework/op_registry.h.
+  # Note that it is enough to add just one operator to pybind in a *_op.cc file.
+  # For detailed pybind information, please see the generated paddle/pybind/pybind.h.
+  file(READ ${TARGET}.cc TARGET_CONTENT)
+  string(REGEX MATCH "REGISTER_OPERATOR\\(.*REGISTER_OPERATOR\\(" multi_register "${TARGET_CONTENT}")
+  string(REGEX MATCH "REGISTER_OPERATOR\\([a-z0-9_]*," one_register "${multi_register}")
+  if (one_register STREQUAL "")
+    string(REPLACE "_op" "" TARGET "${TARGET}")
+  else ()
+    string(REPLACE "REGISTER_OPERATOR(" "" TARGET "${one_register}")
+    string(REPLACE "," "" TARGET "${TARGET}")
+  endif()
+
+  # pybind USE_NO_KERNEL_OP
+  # HACK: if REGISTER_OP_CPU_KERNEL is present, the operator must have a kernel
+  string(REGEX MATCH "REGISTER_OP_CPU_KERNEL" regex_result "${TARGET_CONTENT}")
+  string(REPLACE "_op" "" TARGET "${TARGET}")
+  if (${pybind_flag} EQUAL 0 AND regex_result STREQUAL "")
+    file(APPEND ${pybind_file} "USE_NO_KERNEL_OP(${TARGET});\n")
+    set(pybind_flag 1)
+  endif()
+
+  # pybind USE_CPU_ONLY_OP
+  list(LENGTH cu_srcs cu_srcs_len)
+  list(LENGTH cu_cc_srcs cu_cc_srcs_len)
+  list(LENGTH mkldnn_cc_srcs mkldnn_cc_srcs_len)
+  list(LENGTH hip_cu_srcs hip_cu_srcs_len)
+  list(LENGTH miopen_hip_cc_srcs miopen_hip_cc_srcs_len)
+  if (${pybind_flag} EQUAL 0 AND ${mkldnn_cc_srcs_len} EQUAL 0 AND ${cu_srcs_len} EQUAL 0 AND ${cu_cc_srcs_len} EQUAL 0 AND
+      ${hip_cu_srcs_len} EQUAL 0 AND ${miopen_hip_cc_srcs_len} EQUAL 0)
+    file(APPEND ${pybind_file} "USE_CPU_ONLY_OP(${TARGET});\n")
+    set(pybind_flag 1)
+  endif()
+
+  # pybind USE_OP_DEVICE_KERNEL for CUDNN
+  list(LENGTH cudnn_cu_cc_srcs cudnn_cu_cc_srcs_len)
+  if (WITH_GPU AND ${cudnn_cu_cc_srcs_len} GREATER 0)
+    file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${TARGET}, CUDNN);\n")
+  endif()
+
+  # pybind USE_OP_DEVICE_KERNEL for MIOPEN
+  if (WITH_AMD_GPU AND ${miopen_hip_cc_srcs_len} GREATER 0)
+    file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${TARGET}, MIOPEN);\n")
+  endif()
+
+  # pybind USE_OP_DEVICE_KERNEL for MKLDNN
+  if (WITH_MKLDNN AND ${mkldnn_cc_srcs_len} GREATER 0)
+    # Append first implemented MKLDNN activation operator
+    if (${MKLDNN_FILE} STREQUAL "activation_mkldnn_op")
+      file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(relu, MKLDNN);\n")
+    else()
+      file(APPEND ${pybind_file}
"USE_OP_DEVICE_KERNEL(${TARGET}, MKLDNN);\n") + endif() + endif() + + # pybind USE_OP + if (${pybind_flag} EQUAL 0) + # NOTE(*): activation use macro to regist the kernels, set use_op manually. + if(${TARGET} STREQUAL "activation") + file(APPEND ${pybind_file} "USE_OP(relu);\n") + elseif(${TARGET} STREQUAL "fake_dequantize") + file(APPEND ${pybind_file} "USE_OP(fake_dequantize_max_abs);\n") + elseif(${TARGET} STREQUAL "fake_quantize") + file(APPEND ${pybind_file} "USE_OP(fake_quantize_abs_max);\n") + elseif(${TARGET} STREQUAL "tensorrt_engine_op") + message(STATUS "Pybind skips [tensorrt_engine_op], for this OP is only used in inference") + elseif(${TARGET} STREQUAL "fc") + # HACK: fc only have mkldnn and cpu, which would mismatch the cpu only condition + file(APPEND ${pybind_file} "USE_CPU_ONLY_OP(${TARGET});\n") + else() + file(APPEND ${pybind_file} "USE_OP(${TARGET});\n") + endif() + endif() +endfunction() + + +function(register_operators) + set(options "") + set(oneValueArgs "") + set(multiValueArgs EXCLUDES DEPS) + cmake_parse_arguments(register_operators "${options}" "${oneValueArgs}" + "${multiValueArgs}" ${ARGN}) + + file(GLOB OPS RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "*_op.cc") + string(REPLACE "_mkldnn" "" OPS "${OPS}") + string(REPLACE ".cc" "" OPS "${OPS}") + list(REMOVE_DUPLICATES OPS) + list(LENGTH register_operators_DEPS register_operators_DEPS_len) + + foreach(src ${OPS}) + list(FIND register_operators_EXCLUDES ${src} _index) + if (${_index} EQUAL -1) + if (${register_operators_DEPS_len} GREATER 0) + op_library(${src} DEPS ${register_operators_DEPS}) + else() + op_library(${src}) + endif() + endif() + endforeach() +endfunction() diff --git a/cmake/simd.cmake b/cmake/simd.cmake index 3eacf4d86aa0385eddb690d72e85e3384929bb99..86096d4feaace040da416a01872882456c4098fc 100644 --- a/cmake/simd.cmake +++ b/cmake/simd.cmake @@ -57,41 +57,46 @@ int main() return 0; }" SSE3_FOUND) -# Check AVX -set(CMAKE_REQUIRED_FLAGS ${AVX_FLAG}) -set(AVX_FOUND_EXITCODE 1 CACHE STRING "Result from TRY_RUN" FORCE) -CHECK_CXX_SOURCE_RUNS(" -#include -int main() -{ - __m256 a = _mm256_set_ps (-1.0f, 2.0f, -3.0f, 4.0f, -1.0f, 2.0f, -3.0f, 4.0f); - __m256 b = _mm256_set_ps (1.0f, 2.0f, 3.0f, 4.0f, 1.0f, 2.0f, 3.0f, 4.0f); - __m256 result = _mm256_add_ps (a, b); - return 0; -}" AVX_FOUND) +# disable AVX by default on windows +if(NOT WIN32) + # Check AVX + set(CMAKE_REQUIRED_FLAGS ${AVX_FLAG}) + set(AVX_FOUND_EXITCODE 1 CACHE STRING "Result from TRY_RUN" FORCE) + CHECK_CXX_SOURCE_RUNS(" + #include + int main() + { + __m256 a = _mm256_set_ps (-1.0f, 2.0f, -3.0f, 4.0f, -1.0f, 2.0f, -3.0f, 4.0f); + __m256 b = _mm256_set_ps (1.0f, 2.0f, 3.0f, 4.0f, 1.0f, 2.0f, 3.0f, 4.0f); + __m256 result = _mm256_add_ps (a, b); + return 0; + }" AVX_FOUND) -# Check AVX 2 -set(CMAKE_REQUIRED_FLAGS ${AVX2_FLAG}) -set(AVX2_FOUND_EXITCODE 1 CACHE STRING "Result from TRY_RUN" FORCE) -CHECK_CXX_SOURCE_RUNS(" -#include -int main() -{ - __m256i a = _mm256_set_epi32 (-1, 2, -3, 4, -1, 2, -3, 4); - __m256i result = _mm256_abs_epi32 (a); - return 0; -}" AVX2_FOUND) + # Check AVX 2 + set(CMAKE_REQUIRED_FLAGS ${AVX2_FLAG}) + set(AVX2_FOUND_EXITCODE 1 CACHE STRING "Result from TRY_RUN" FORCE) + CHECK_CXX_SOURCE_RUNS(" + #include + int main() + { + __m256i a = _mm256_set_epi32 (-1, 2, -3, 4, -1, 2, -3, 4); + __m256i result = _mm256_abs_epi32 (a); + return 0; + }" AVX2_FOUND) -# Check AVX512F -set(CMAKE_REQUIRED_FLAGS ${AVX512F_FLAG}) -set(AVX512F_FOUND_EXITCODE 1 CACHE STRING "Result from TRY_RUN" FORCE) -CHECK_CXX_SOURCE_RUNS(" -#include 
-int main() -{ - __m512i a = _mm512_undefined_epi32(); - return 0; -}" AVX512F_FOUND) + # Check AVX512F + set(CMAKE_REQUIRED_FLAGS ${AVX512F_FLAG}) + set(AVX512F_FOUND_EXITCODE 1 CACHE STRING "Result from TRY_RUN" FORCE) + CHECK_CXX_SOURCE_RUNS(" + #include + int main() + { + __m512i a = _mm512_set_epi32 (-1, 2, -3, 4, -1, 2, -3, 4, + 13, -5, 6, -7, 9, 2, -6, 3); + __m512i result = _mm512_abs_epi32 (a); + return 0; + }" AVX512F_FOUND) +endif(NOT WIN32) set(CMAKE_REQUIRED_FLAGS ${CMAKE_REQUIRED_FLAGS_RETAINED}) mark_as_advanced(MMX_FOUND SSE2_FOUND SSE3_FOUND AVX_FOUND AVX2_FOUND AVX512F_FOUND) diff --git a/cmake/tensorrt.cmake b/cmake/tensorrt.cmake index fa0e834a1dfd6e60f0ec07945be9a4d84017316f..3dc7171551bfb7aff8d1e75083c98b00378d247f 100644 --- a/cmake/tensorrt.cmake +++ b/cmake/tensorrt.cmake @@ -34,4 +34,5 @@ if(TENSORRT_FOUND) "Current TensorRT version is v${TENSORRT_MAJOR_VERSION}. ") include_directories(${TENSORRT_INCLUDE_DIR}) list(APPEND EXTERNAL_LIBS ${TENSORRT_LIBRARY}) + add_definitions(-DPADDLE_WITH_TENSORRT) endif() diff --git a/doc/v2/dev/contribute_to_paddle_en.md b/doc/v2/dev/contribute_to_paddle_en.md index c97564d93a7f0a753a23cd97d2467d595bd154ff..72723396444c0a6cc0516f6f2379b2d868ba59f7 120000 --- a/doc/v2/dev/contribute_to_paddle_en.md +++ b/doc/v2/dev/contribute_to_paddle_en.md @@ -1 +1 @@ -../../../CONTRIBUTING.md \ No newline at end of file +../../../CONTRIBUTING.md diff --git a/paddle/CMakeLists.txt b/paddle/CMakeLists.txt index 6653244507742b33d9524a7a0e4a5b2b575d358a..6b665a9effba4bef083d007c0c74f2f4c79e647e 100644 --- a/paddle/CMakeLists.txt +++ b/paddle/CMakeLists.txt @@ -24,6 +24,7 @@ if(NOT WITH_FLUID_ONLY) endif() add_subdirectory(testing) +set(PYTHON_TESTS_DIR ${PADDLE_BINARY_DIR}/python/paddle/fluid/tests CACHE INTERNAL "python tests directory") if(NOT MOBILE_INFERENCE AND NOT RPI AND NOT WITH_C_API) add_subdirectory(fluid) endif() diff --git a/paddle/fluid/API.spec b/paddle/fluid/API.spec index 6a37b5ca433a3baa1388dd4f720d782ca53e4e99..0a71f153434cc89b26d97e1e1fcfae6c2bc88566 100644 --- a/paddle/fluid/API.spec +++ b/paddle/fluid/API.spec @@ -26,10 +26,10 @@ paddle.fluid.release_memory ArgSpec(args=['input_program', 'skip_opt_set'], vara paddle.fluid.DistributeTranspilerConfig.__init__ paddle.fluid.ParallelExecutor.__init__ ArgSpec(args=['self', 'use_cuda', 'loss_name', 'main_program', 'share_vars_from', 'exec_strategy', 'build_strategy', 'num_trainers', 'trainer_id', 'scope'], varargs=None, keywords=None, defaults=(None, None, None, None, None, 1, 0, None)) paddle.fluid.ParallelExecutor.run ArgSpec(args=['self', 'fetch_list', 'feed', 'feed_dict', 'return_numpy'], varargs=None, keywords=None, defaults=(None, None, True)) -paddle.fluid.ExecutionStrategy.__init__ __init__(self: paddle.fluid.core.ExecutionStrategy) -> None -paddle.fluid.BuildStrategy.GradientScaleStrategy.__init__ __init__(self: paddle.fluid.core.GradientScaleStrategy, arg0: int) -> None -paddle.fluid.BuildStrategy.ReduceStrategy.__init__ __init__(self: paddle.fluid.core.ReduceStrategy, arg0: int) -> None -paddle.fluid.BuildStrategy.__init__ __init__(self: paddle.fluid.core.BuildStrategy) -> None +paddle.fluid.ExecutionStrategy.__init__ __init__(self: paddle.fluid.core.ParallelExecutor.ExecutionStrategy) -> None +paddle.fluid.BuildStrategy.GradientScaleStrategy.__init__ __init__(self: paddle.fluid.core.ParallelExecutor.BuildStrategy.GradientScaleStrategy, arg0: int) -> None +paddle.fluid.BuildStrategy.ReduceStrategy.__init__ __init__(self: 
paddle.fluid.core.ParallelExecutor.BuildStrategy.ReduceStrategy, arg0: int) -> None +paddle.fluid.BuildStrategy.__init__ __init__(self: paddle.fluid.core.ParallelExecutor.BuildStrategy) -> None paddle.fluid.create_lod_tensor ArgSpec(args=['data', 'recursive_seq_lens', 'place'], varargs=None, keywords=None, defaults=None) paddle.fluid.create_random_int_lodtensor ArgSpec(args=['recursive_seq_lens', 'base_shape', 'place', 'low', 'high'], varargs=None, keywords=None, defaults=None) paddle.fluid.io.save_vars ArgSpec(args=['executor', 'dirname', 'main_program', 'vars', 'predicate', 'filename'], varargs=None, keywords=None, defaults=(None, None, None, None)) @@ -61,14 +61,14 @@ paddle.fluid.layers.cos_sim ArgSpec(args=['X', 'Y'], varargs=None, keywords=None paddle.fluid.layers.cross_entropy ArgSpec(args=['input', 'label', 'soft_label', 'ignore_index'], varargs=None, keywords=None, defaults=(False, -100)) paddle.fluid.layers.square_error_cost ArgSpec(args=['input', 'label'], varargs=None, keywords=None, defaults=None) paddle.fluid.layers.chunk_eval ArgSpec(args=['input', 'label', 'chunk_scheme', 'num_chunk_types', 'excluded_chunk_types'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.sequence_conv ArgSpec(args=['input', 'num_filters', 'filter_size', 'filter_stride', 'padding', 'bias_attr', 'param_attr', 'act'], varargs=None, keywords=None, defaults=(3, 1, None, None, None, None)) +paddle.fluid.layers.sequence_conv ArgSpec(args=['input', 'num_filters', 'filter_size', 'filter_stride', 'padding', 'bias_attr', 'param_attr', 'act', 'name'], varargs=None, keywords=None, defaults=(3, 1, None, None, None, None, None)) paddle.fluid.layers.conv2d ArgSpec(args=['input', 'num_filters', 'filter_size', 'stride', 'padding', 'dilation', 'groups', 'param_attr', 'bias_attr', 'use_cudnn', 'act', 'name'], varargs=None, keywords=None, defaults=(1, 0, 1, None, None, None, True, None, None)) paddle.fluid.layers.conv3d ArgSpec(args=['input', 'num_filters', 'filter_size', 'stride', 'padding', 'dilation', 'groups', 'param_attr', 'bias_attr', 'use_cudnn', 'act', 'name'], varargs=None, keywords=None, defaults=(1, 0, 1, None, None, None, True, None, None)) -paddle.fluid.layers.sequence_pool ArgSpec(args=['input', 'pool_type'], varargs=None, keywords=None, defaults=None) -paddle.fluid.layers.sequence_softmax ArgSpec(args=['input', 'param_attr', 'bias_attr', 'use_cudnn'], varargs=None, keywords=None, defaults=(None, None, False)) -paddle.fluid.layers.softmax ArgSpec(args=['input', 'param_attr', 'bias_attr', 'use_cudnn', 'name'], varargs=None, keywords=None, defaults=(None, None, True, None)) -paddle.fluid.layers.pool2d ArgSpec(args=['input', 'pool_size', 'pool_type', 'pool_stride', 'pool_padding', 'global_pooling', 'use_cudnn', 'ceil_mode', 'name'], varargs=None, keywords=None, defaults=(-1, 'max', 1, 0, False, True, False, None)) -paddle.fluid.layers.pool3d ArgSpec(args=['input', 'pool_size', 'pool_type', 'pool_stride', 'pool_padding', 'global_pooling', 'use_cudnn', 'ceil_mode', 'name'], varargs=None, keywords=None, defaults=(-1, 'max', 1, 0, False, True, False, None)) +paddle.fluid.layers.sequence_pool ArgSpec(args=['input', 'pool_type', 'is_test'], varargs=None, keywords=None, defaults=(False,)) +paddle.fluid.layers.sequence_softmax ArgSpec(args=['input', 'use_cudnn', 'name'], varargs=None, keywords=None, defaults=(False, None)) +paddle.fluid.layers.softmax ArgSpec(args=['input', 'use_cudnn', 'name'], varargs=None, keywords=None, defaults=(True, None)) +paddle.fluid.layers.pool2d ArgSpec(args=['input', 
'pool_size', 'pool_type', 'pool_stride', 'pool_padding', 'global_pooling', 'use_cudnn', 'ceil_mode', 'name', 'exclusive'], varargs=None, keywords=None, defaults=(-1, 'max', 1, 0, False, True, False, None, True)) +paddle.fluid.layers.pool3d ArgSpec(args=['input', 'pool_size', 'pool_type', 'pool_stride', 'pool_padding', 'global_pooling', 'use_cudnn', 'ceil_mode', 'name', 'exclusive'], varargs=None, keywords=None, defaults=(-1, 'max', 1, 0, False, True, False, None, True)) paddle.fluid.layers.batch_norm ArgSpec(args=['input', 'act', 'is_test', 'momentum', 'epsilon', 'param_attr', 'bias_attr', 'data_layout', 'in_place', 'name', 'moving_mean_name', 'moving_variance_name', 'do_model_average_for_mean_and_var', 'fuse_with_relu'], varargs=None, keywords=None, defaults=(None, False, 0.9, 1e-05, None, None, 'NCHW', False, None, None, None, False, False)) paddle.fluid.layers.beam_search_decode ArgSpec(args=['ids', 'scores', 'beam_size', 'end_id', 'name'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.conv2d_transpose ArgSpec(args=['input', 'num_filters', 'output_size', 'filter_size', 'padding', 'stride', 'dilation', 'groups', 'param_attr', 'bias_attr', 'use_cudnn', 'act', 'name'], varargs=None, keywords=None, defaults=(None, None, 0, 1, 1, None, None, None, True, None, None)) @@ -86,28 +86,29 @@ paddle.fluid.layers.reduce_prod ArgSpec(args=['input', 'dim', 'keep_dim', 'name' paddle.fluid.layers.sequence_first_step ArgSpec(args=['input'], varargs=None, keywords=None, defaults=None) paddle.fluid.layers.sequence_last_step ArgSpec(args=['input'], varargs=None, keywords=None, defaults=None) paddle.fluid.layers.sequence_slice ArgSpec(args=['input', 'offset', 'length', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.dropout ArgSpec(args=['x', 'dropout_prob', 'is_test', 'seed', 'name'], varargs=None, keywords=None, defaults=(False, None, None)) +paddle.fluid.layers.dropout ArgSpec(args=['x', 'dropout_prob', 'is_test', 'seed', 'name', 'dropout_implementation'], varargs=None, keywords=None, defaults=(False, None, None, 'downgrade_in_infer')) paddle.fluid.layers.split ArgSpec(args=['input', 'num_or_sections', 'dim', 'name'], varargs=None, keywords=None, defaults=(-1, None)) paddle.fluid.layers.ctc_greedy_decoder ArgSpec(args=['input', 'blank', 'name'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.edit_distance ArgSpec(args=['input', 'label', 'normalized', 'ignored_tokens'], varargs=None, keywords=None, defaults=(True, None)) paddle.fluid.layers.l2_normalize ArgSpec(args=['x', 'axis', 'epsilon', 'name'], varargs=None, keywords=None, defaults=(1e-12, None)) paddle.fluid.layers.matmul ArgSpec(args=['x', 'y', 'transpose_x', 'transpose_y', 'alpha', 'name'], varargs=None, keywords=None, defaults=(False, False, 1.0, None)) paddle.fluid.layers.topk ArgSpec(args=['input', 'k', 'name'], varargs=None, keywords=None, defaults=(None,)) -paddle.fluid.layers.warpctc ArgSpec(args=['input', 'label', 'blank', 'norm_by_times'], varargs=None, keywords=None, defaults=(0, False)) +paddle.fluid.layers.warpctc ArgSpec(args=['input', 'label', 'blank', 'norm_by_times', 'use_cudnn'], varargs=None, keywords=None, defaults=(0, False, False)) paddle.fluid.layers.sequence_reshape ArgSpec(args=['input', 'new_dim'], varargs=None, keywords=None, defaults=None) paddle.fluid.layers.transpose ArgSpec(args=['x', 'perm', 'name'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.im2sequence ArgSpec(args=['input', 'filter_size', 'stride', 'padding', 
'input_image_size', 'out_stride', 'name'], varargs=None, keywords=None, defaults=(1, 1, 0, None, 1, None)) -paddle.fluid.layers.nce ArgSpec(args=['input', 'label', 'num_total_classes', 'sample_weight', 'param_attr', 'bias_attr', 'num_neg_samples'], varargs=None, keywords=None, defaults=(None, None, None, None)) -paddle.fluid.layers.hsigmoid ArgSpec(args=['input', 'label', 'num_classes', 'param_attr', 'bias_attr'], varargs=None, keywords=None, defaults=(None, None)) +paddle.fluid.layers.nce ArgSpec(args=['input', 'label', 'num_total_classes', 'sample_weight', 'param_attr', 'bias_attr', 'num_neg_samples', 'name', 'sampler', 'custom_dist', 'seed'], varargs=None, keywords=None, defaults=(None, None, None, None, None, 'uniform', None, 0)) +paddle.fluid.layers.hsigmoid ArgSpec(args=['input', 'label', 'num_classes', 'param_attr', 'bias_attr', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.layers.beam_search ArgSpec(args=['pre_ids', 'pre_scores', 'ids', 'scores', 'beam_size', 'end_id', 'level', 'name'], varargs=None, keywords=None, defaults=(0, None)) paddle.fluid.layers.row_conv ArgSpec(args=['input', 'future_context_size', 'param_attr', 'act'], varargs=None, keywords=None, defaults=(None, None)) paddle.fluid.layers.multiplex ArgSpec(args=['inputs', 'index'], varargs=None, keywords=None, defaults=None) paddle.fluid.layers.layer_norm ArgSpec(args=['input', 'scale', 'shift', 'begin_norm_axis', 'epsilon', 'param_attr', 'bias_attr', 'act', 'name'], varargs=None, keywords=None, defaults=(True, True, 1, 1e-05, None, None, None, None)) -paddle.fluid.layers.softmax_with_cross_entropy ArgSpec(args=['logits', 'label', 'soft_label', 'ignore_index'], varargs=None, keywords=None, defaults=(False, -100)) +paddle.fluid.layers.group_norm ArgSpec(args=['input', 'groups', 'epsilon', 'param_attr', 'bias_attr', 'act', 'data_layout', 'name'], varargs=None, keywords=None, defaults=(1e-05, None, None, None, 'NCHW', None)) +paddle.fluid.layers.softmax_with_cross_entropy ArgSpec(args=['logits', 'label', 'soft_label', 'ignore_index', 'numeric_stable_mode', 'return_softmax'], varargs=None, keywords=None, defaults=(False, -100, False, False)) paddle.fluid.layers.smooth_l1 ArgSpec(args=['x', 'y', 'inside_weight', 'outside_weight', 'sigma'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.layers.one_hot ArgSpec(args=['input', 'depth'], varargs=None, keywords=None, defaults=None) paddle.fluid.layers.autoincreased_step_counter ArgSpec(args=['counter_name', 'begin', 'step'], varargs=None, keywords=None, defaults=(None, 1, 1)) -paddle.fluid.layers.reshape ArgSpec(args=['x', 'shape', 'actual_shape', 'act', 'inplace', 'name'], varargs=None, keywords=None, defaults=(None, None, True, None)) +paddle.fluid.layers.reshape ArgSpec(args=['x', 'shape', 'actual_shape', 'act', 'inplace', 'name'], varargs=None, keywords=None, defaults=(None, None, False, None)) paddle.fluid.layers.squeeze ArgSpec(args=['input', 'axes', 'name'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.unsqueeze ArgSpec(args=['input', 'axes', 'name'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.lod_reset ArgSpec(args=['x', 'y', 'target_lod'], varargs=None, keywords=None, defaults=(None, None)) @@ -116,16 +117,19 @@ paddle.fluid.layers.pad ArgSpec(args=['x', 'paddings', 'pad_value', 'name'], var paddle.fluid.layers.pad_constant_like ArgSpec(args=['x', 'y', 'pad_value', 'name'], varargs=None, keywords=None, defaults=(0.0, None)) paddle.fluid.layers.label_smooth 
ArgSpec(args=['label', 'prior_dist', 'epsilon', 'dtype', 'name'], varargs=None, keywords=None, defaults=(None, 0.1, 'float32', None)) paddle.fluid.layers.roi_pool ArgSpec(args=['input', 'rois', 'pooled_height', 'pooled_width', 'spatial_scale'], varargs=None, keywords=None, defaults=(1, 1, 1.0)) +paddle.fluid.layers.roi_align ArgSpec(args=['input', 'rois', 'pooled_height', 'pooled_width', 'spatial_scale', 'sampling_ratio', 'name'], varargs=None, keywords=None, defaults=(1, 1, 1.0, -1, None)) paddle.fluid.layers.dice_loss ArgSpec(args=['input', 'label', 'epsilon'], varargs=None, keywords=None, defaults=(1e-05,)) -paddle.fluid.layers.image_resize ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'resample'], varargs=None, keywords=None, defaults=(None, None, None, 'BILINEAR')) +paddle.fluid.layers.image_resize ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'resample', 'actual_shape'], varargs=None, keywords=None, defaults=(None, None, None, 'BILINEAR', None)) paddle.fluid.layers.image_resize_short ArgSpec(args=['input', 'out_short_len', 'resample'], varargs=None, keywords=None, defaults=('BILINEAR',)) -paddle.fluid.layers.resize_bilinear ArgSpec(args=['input', 'out_shape', 'scale', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) +paddle.fluid.layers.resize_bilinear ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'actual_shape'], varargs=None, keywords=None, defaults=(None, None, None, None)) +paddle.fluid.layers.resize_nearest ArgSpec(args=['input', 'out_shape', 'scale', 'name', 'actual_shape'], varargs=None, keywords=None, defaults=(None, None, None, None)) paddle.fluid.layers.gather ArgSpec(args=['input', 'index'], varargs=None, keywords=None, defaults=None) paddle.fluid.layers.scatter ArgSpec(args=['input', 'index', 'updates', 'name'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.sequence_scatter ArgSpec(args=['input', 'index', 'updates', 'name'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.random_crop ArgSpec(args=['x', 'shape', 'seed'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.mean_iou ArgSpec(args=['input', 'label', 'num_classes'], varargs=None, keywords=None, defaults=None) paddle.fluid.layers.relu ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) +paddle.fluid.layers.selu ArgSpec(args=['x', 'scale', 'alpha', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.layers.log ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.crop ArgSpec(args=['x', 'shape', 'offsets', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.layers.rank_loss ArgSpec(args=['label', 'left', 'right', 'name'], varargs=None, keywords=None, defaults=(None,)) @@ -173,6 +177,16 @@ paddle.fluid.layers.mean ArgSpec(args=['x', 'name'], varargs=None, keywords=None paddle.fluid.layers.mul ArgSpec(args=['x', 'y', 'x_num_col_dims', 'y_num_col_dims', 'name'], varargs=None, keywords=None, defaults=(1, 1, None)) paddle.fluid.layers.sigmoid_cross_entropy_with_logits ArgSpec(args=['x', 'label', 'name'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.maxout ArgSpec(args=['x', 'groups', 'name'], varargs=None, keywords=None, defaults=(None,)) +paddle.fluid.layers.space_to_depth ArgSpec(args=['x', 'blocksize', 'name'], varargs=None, keywords=None, defaults=(None,)) +paddle.fluid.layers.affine_grid ArgSpec(args=['theta', 'out_shape', 'name'], varargs=None, keywords=None, defaults=(None,)) 
+paddle.fluid.layers.sequence_reverse ArgSpec(args=['x', 'name'], varargs=None, keywords=None, defaults=(None,)) +paddle.fluid.layers.affine_channel ArgSpec(args=['x', 'scale', 'bias', 'data_layout', 'name'], varargs=None, keywords=None, defaults=(None, None, 'NCHW', None)) +paddle.fluid.layers.similarity_focus ArgSpec(args=['input', 'axis', 'indexes', 'name'], varargs=None, keywords=None, defaults=(None,)) +paddle.fluid.layers.hash ArgSpec(args=['input', 'hash_size', 'num_hash', 'name'], varargs=None, keywords=None, defaults=(1, None)) +paddle.fluid.layers.grid_sampler ArgSpec(args=['x', 'grid', 'name'], varargs=None, keywords=None, defaults=(None,)) +paddle.fluid.layers.log_loss ArgSpec(args=['input', 'label', 'epsilon', 'name'], varargs=None, keywords=None, defaults=(0.0001, None)) +paddle.fluid.layers.add_position_encoding ArgSpec(args=['input', 'alpha', 'beta', 'name'], varargs=None, keywords=None, defaults=(None,)) +paddle.fluid.layers.bilinear_tensor_product ArgSpec(args=['x', 'y', 'size', 'act', 'name', 'param_attr', 'bias_attr'], varargs=None, keywords=None, defaults=(None, None, None, None)) paddle.fluid.layers.data ArgSpec(args=['name', 'shape', 'append_batch_size', 'dtype', 'lod_level', 'type', 'stop_gradient'], varargs=None, keywords=None, defaults=(True, 'float32', 0, VarType.LOD_TENSOR, True)) paddle.fluid.layers.open_files ArgSpec(args=['filenames', 'shapes', 'lod_levels', 'dtypes', 'thread_num', 'buffer_size', 'pass_num', 'is_test'], varargs=None, keywords=None, defaults=(None, None, 1, None)) paddle.fluid.layers.read_file ArgSpec(args=['reader'], varargs=None, keywords=None, defaults=None) @@ -181,6 +195,7 @@ paddle.fluid.layers.batch ArgSpec(args=['reader', 'batch_size'], varargs=None, k paddle.fluid.layers.double_buffer ArgSpec(args=['reader', 'place', 'name'], varargs=None, keywords=None, defaults=(None, None)) paddle.fluid.layers.random_data_generator ArgSpec(args=['low', 'high', 'shapes', 'lod_levels', 'for_parallel'], varargs=None, keywords=None, defaults=(True,)) paddle.fluid.layers.py_reader ArgSpec(args=['capacity', 'shapes', 'dtypes', 'lod_levels', 'name', 'use_double_buffer'], varargs=None, keywords=None, defaults=(None, None, True)) +paddle.fluid.layers.create_py_reader_by_data ArgSpec(args=['capacity', 'feed_list', 'name', 'use_double_buffer'], varargs=None, keywords=None, defaults=(None, True)) paddle.fluid.layers.Preprocessor.__init__ ArgSpec(args=['self', 'reader', 'name'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.Preprocessor.block ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) paddle.fluid.layers.Preprocessor.inputs ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) @@ -190,6 +205,7 @@ paddle.fluid.layers.create_tensor ArgSpec(args=['dtype', 'name', 'persistable'], paddle.fluid.layers.create_parameter ArgSpec(args=['shape', 'dtype', 'name', 'attr', 'is_bias', 'default_initializer'], varargs=None, keywords=None, defaults=(None, None, False, None)) paddle.fluid.layers.create_global_var ArgSpec(args=['shape', 'value', 'dtype', 'persistable', 'force_cpu', 'name'], varargs=None, keywords=None, defaults=(False, False, None)) paddle.fluid.layers.cast ArgSpec(args=['x', 'dtype'], varargs=None, keywords=None, defaults=None) +paddle.fluid.layers.tensor_array_to_tensor ArgSpec(args=['input', 'axis', 'name'], varargs=None, keywords=None, defaults=(1, None)) paddle.fluid.layers.concat ArgSpec(args=['input', 'axis', 'name'], varargs=None, keywords=None, defaults=(0, None)) paddle.fluid.layers.sums 
ArgSpec(args=['input', 'out'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.assign ArgSpec(args=['input', 'output'], varargs=None, keywords=None, defaults=(None,)) @@ -260,6 +276,7 @@ paddle.fluid.layers.hard_shrink ArgSpec(args=['x', 'threshold'], varargs=None, k paddle.fluid.layers.cumsum ArgSpec(args=['x', 'axis', 'exclusive', 'reverse'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.layers.thresholded_relu ArgSpec(args=['x', 'threshold'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.prior_box ArgSpec(args=['input', 'image', 'min_sizes', 'max_sizes', 'aspect_ratios', 'variance', 'flip', 'clip', 'steps', 'offset', 'name', 'min_max_aspect_ratios_order'], varargs=None, keywords=None, defaults=(None, [1.0], [0.1, 0.1, 0.2, 0.2], False, False, [0.0, 0.0], 0.5, None, False)) +paddle.fluid.layers.density_prior_box ArgSpec(args=['input', 'image', 'densities', 'fixed_sizes', 'fixed_ratios', 'variance', 'clip', 'steps', 'offset', 'flatten_to_2d', 'name'], varargs=None, keywords=None, defaults=(None, None, None, [0.1, 0.1, 0.2, 0.2], False, [0.0, 0.0], 0.5, False, None)) paddle.fluid.layers.multi_box_head ArgSpec(args=['inputs', 'image', 'base_size', 'num_classes', 'aspect_ratios', 'min_ratio', 'max_ratio', 'min_sizes', 'max_sizes', 'steps', 'step_w', 'step_h', 'offset', 'variance', 'flip', 'clip', 'kernel_size', 'pad', 'stride', 'name', 'min_max_aspect_ratios_order'], varargs=None, keywords=None, defaults=(None, None, None, None, None, None, None, 0.5, [0.1, 0.1, 0.2, 0.2], True, False, 1, 0, 1, None, False)) paddle.fluid.layers.bipartite_match ArgSpec(args=['dist_matrix', 'match_type', 'dist_threshold', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.layers.target_assign ArgSpec(args=['input', 'matched_indices', 'negative_indices', 'mismatch_value', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) @@ -325,7 +342,7 @@ paddle.fluid.transpiler.RoundRobin.dispatch ArgSpec(args=['self', 'varlist'], va paddle.fluid.transpiler.RoundRobin.reset ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None) paddle.fluid.transpiler.DistributeTranspilerConfig.__init__ paddle.fluid.nets.simple_img_conv_pool ArgSpec(args=['input', 'num_filters', 'filter_size', 'pool_size', 'pool_stride', 'pool_padding', 'pool_type', 'global_pooling', 'conv_stride', 'conv_padding', 'conv_dilation', 'conv_groups', 'param_attr', 'bias_attr', 'act', 'use_cudnn'], varargs=None, keywords=None, defaults=(0, 'max', False, 1, 0, 1, 1, None, None, None, True)) -paddle.fluid.nets.sequence_conv_pool ArgSpec(args=['input', 'num_filters', 'filter_size', 'param_attr', 'act', 'pool_type'], varargs=None, keywords=None, defaults=(None, 'sigmoid', 'max')) +paddle.fluid.nets.sequence_conv_pool ArgSpec(args=['input', 'num_filters', 'filter_size', 'param_attr', 'act', 'pool_type', 'bias_attr'], varargs=None, keywords=None, defaults=(None, 'sigmoid', 'max', None)) paddle.fluid.nets.glu ArgSpec(args=['input', 'dim'], varargs=None, keywords=None, defaults=(-1,)) paddle.fluid.nets.scaled_dot_product_attention ArgSpec(args=['queries', 'keys', 'values', 'num_heads', 'dropout_rate'], varargs=None, keywords=None, defaults=(1, 0.0)) paddle.fluid.nets.img_conv_group ArgSpec(args=['input', 'conv_num_filter', 'pool_size', 'conv_padding', 'conv_filter_size', 'conv_act', 'param_attr', 'conv_with_batchnorm', 'conv_batchnorm_drop_rate', 'pool_stride', 'pool_type', 'use_cudnn'], varargs=None, keywords=None, defaults=(1, 3, None, None, False, 
0.0, 1, 'max', True)) @@ -351,6 +368,8 @@ paddle.fluid.optimizer.ModelAverage.__init__ ArgSpec(args=['self', 'average_wind paddle.fluid.optimizer.ModelAverage.apply ArgSpec(args=[], varargs='args', keywords='kwds', defaults=None) paddle.fluid.optimizer.ModelAverage.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.optimizer.ModelAverage.restore ArgSpec(args=['self', 'executor'], varargs=None, keywords=None, defaults=None) +paddle.fluid.optimizer.LarsMomentumOptimizer.__init__ ArgSpec(args=['self', 'learning_rate', 'momentum', 'lars_coeff', 'lars_weight_decay', 'regularization', 'name'], varargs=None, keywords=None, defaults=(0.001, 0.0005, None, None)) +paddle.fluid.optimizer.LarsMomentumOptimizer.minimize ArgSpec(args=['self', 'loss', 'startup_program', 'parameter_list', 'no_grad_set'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.backward.append_backward ArgSpec(args=['loss', 'parameter_list', 'no_grad_set', 'callbacks'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.regularizer.L1DecayRegularizer.__init__ ArgSpec(args=['self', 'regularization_coeff'], varargs=None, keywords=None, defaults=(0.0,)) paddle.fluid.regularizer.L2DecayRegularizer.__init__ ArgSpec(args=['self', 'regularization_coeff'], varargs=None, keywords=None, defaults=(0.0,)) diff --git a/paddle/fluid/CMakeLists.txt b/paddle/fluid/CMakeLists.txt index 48b36df6499e59fe742766b5f81fd30a9fb8b900..6b526f0103ad3c530c06a68757cf89293f4fb84b 100644 --- a/paddle/fluid/CMakeLists.txt +++ b/paddle/fluid/CMakeLists.txt @@ -3,14 +3,9 @@ add_subdirectory(platform) add_subdirectory(framework) add_subdirectory(operators) add_subdirectory(string) - -if (NOT WIN32) -add_subdirectory(pybind) add_subdirectory(recordio) -endif(NOT WIN32) +add_subdirectory(pybind) -if(WITH_INFERENCE) - # NOTE: please add subdirectory inference at last. - add_subdirectory(inference) - add_subdirectory(train) -endif() +# NOTE: please add subdirectory inference at last. 
+add_subdirectory(inference) +add_subdirectory(train) diff --git a/paddle/fluid/framework/CMakeLists.txt b/paddle/fluid/framework/CMakeLists.txt index 844291140602a7a0aac9d9d40256deaf9d8a4c60..52946c7f11f90490b1af1347f20db236a8fe24af 100644 --- a/paddle/fluid/framework/CMakeLists.txt +++ b/paddle/fluid/framework/CMakeLists.txt @@ -31,9 +31,7 @@ function(windows_symbolic TARGET) endfunction() add_subdirectory(ir) -if (NOT WIN32) add_subdirectory(details) -endif (NOT WIN32) # ddim lib proto_library(framework_proto SRCS framework.proto) @@ -68,11 +66,7 @@ if(WITH_GPU) else() cc_test(mixed_vector_test SRCS mixed_vector_test.cc DEPS place memory device_context tensor) endif() -if (NOT WIN32) - cc_library(lod_tensor SRCS lod_tensor.cc DEPS ddim place tensor framework_proto recordio version) -else() - cc_library(lod_tensor SRCS lod_tensor.cc DEPS ddim place tensor framework_proto version) -endif (NOT WIN32) +cc_library(lod_tensor SRCS lod_tensor.cc DEPS ddim place tensor framework_proto recordio version) cc_test(lod_tensor_test SRCS lod_tensor_test.cc DEPS lod_tensor memory) nv_test(lod_tensor_gpu_test SRCS lod_tensor_test.cu DEPS lod_tensor) @@ -122,13 +116,9 @@ cc_test(op_proto_maker_test SRCS op_proto_maker_test.cc DEPS op_proto_maker) cc_library(op_info SRCS op_info.cc DEPS attribute framework_proto) cc_library(shape_inference SRCS shape_inference.cc DEPS ddim attribute device_context) -if (NOT WIN32) -cc_library(operator SRCS operator.cc DEPS op_info device_context tensor scope glog - shape_inference data_transform lod_tensor profiler) -else() +cc_library(transfer_scope_cache SRCS transfer_scope_cache.cc DEPS scope framework_proto device_context) cc_library(operator SRCS operator.cc DEPS op_info device_context tensor scope glog - shape_inference data_transform lod_tensor) -endif(NOT WIN32) + shape_inference data_transform lod_tensor profiler transfer_scope_cache) cc_test(operator_test SRCS operator_test.cc DEPS operator op_registry device_context) @@ -136,20 +126,32 @@ cc_library(version SRCS version.cc) cc_test(version_test SRCS version_test.cc DEPS version) cc_library(proto_desc SRCS var_desc.cc op_desc.cc block_desc.cc program_desc.cc DEPS shape_inference op_info operator glog version) +cc_library(ngraph_bridge SRCS ngraph_bridge.cc DEPS operator framework_proto) +if(NOT WIN32) +cc_library(ngraph_operator SRCS ngraph_operator.cc DEPS ngraph_bridge operator op_info device_context tensor scope glog + shape_inference data_transform lod_tensor profiler) +endif(NOT WIN32) cc_library(op_registry SRCS op_registry.cc DEPS op_proto_maker op_info operator glog proto_desc) nv_test(op_registry_test SRCS op_registry_test.cc DEPS op_registry) -if (NOT WIN32) py_proto_compile(framework_py_proto SRCS framework.proto) # Generate an empty __init__.py to make framework_py_proto as a valid python module. add_custom_target(framework_py_proto_init ALL COMMAND ${CMAKE_COMMAND} -E touch __init__.py) add_dependencies(framework_py_proto framework_py_proto_init) -add_custom_command(TARGET framework_py_proto POST_BUILD - COMMAND ${CMAKE_COMMAND} -E make_directory ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto - COMMAND cp *.py ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto/ - COMMENT "Copy generated python proto into directory paddle/fluid/proto." 
- WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) +if (NOT WIN32) + add_custom_command(TARGET framework_py_proto POST_BUILD + COMMAND ${CMAKE_COMMAND} -E make_directory ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto + COMMAND cp *.py ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto/ + COMMENT "Copy generated python proto into directory paddle/fluid/proto." + WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) +else(NOT WIN32) + string(REPLACE "/" "\\" proto_dstpath "${PADDLE_BINARY_DIR}/python/paddle/fluid/proto/") + add_custom_command(TARGET framework_py_proto POST_BUILD + COMMAND ${CMAKE_COMMAND} -E make_directory ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto + COMMAND copy /Y *.py ${proto_dstpath} + COMMENT "Copy generated python proto into directory paddle/fluid/proto." + WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) endif(NOT WIN32) cc_library(lod_rank_table SRCS lod_rank_table.cc DEPS lod_tensor) @@ -163,16 +165,18 @@ if(WITH_DISTRIBUTE) set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") set_source_files_properties(executor.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) else() - cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto glog lod_rank_table feed_fetch_method graph_to_program_pass) + if(NOT WIN32) + cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto glog lod_rank_table feed_fetch_method graph_to_program_pass ngraph_operator) + else(NOT WIN32) + cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto glog lod_rank_table feed_fetch_method graph_to_program_pass) + endif(NOT WIN32) cc_test(test_naive_executor SRCS naive_executor_test.cc DEPS naive_executor elementwise_add_op) endif() - -if (NOT WIN32) + cc_library(parallel_executor SRCS parallel_executor.cc DEPS threaded_ssa_graph_executor scope_buffered_ssa_graph_executor graph build_strategy fast_threaded_ssa_graph_executor) -endif() # NOT WIN32 cc_library(prune SRCS prune.cc DEPS framework_proto) cc_test(prune_test SRCS prune_test.cc DEPS op_info prune recurrent_op device_context) @@ -189,3 +193,6 @@ cc_test(tuple_test SRCS tuple_test.cc ) if (NOT WIN32) cc_test(rw_lock_test SRCS rw_lock_test.cc) endif (NOT WIN32) + +cc_library(dlpack_tensor SRCS dlpack_tensor.cc DEPS tensor dlpack) +cc_test(dlpack_tensor_test SRCS dlpack_tensor_test.cc DEPS dlpack_tensor glog) diff --git a/paddle/fluid/framework/attribute.cc b/paddle/fluid/framework/attribute.cc index 0dcecb62dba971b48c4f11c0ef47494be40eeea0..fabf2abfc803b8838edb48aa01ab8896799c97ac 100644 --- a/paddle/fluid/framework/attribute.cc +++ b/paddle/fluid/framework/attribute.cc @@ -64,6 +64,13 @@ Attribute GetAttrValue(const proto::OpDesc::Attr& attr_desc) { case proto::AttrType::LONG: { return attr_desc.l(); } + case proto::AttrType::LONGS: { + std::vector val(attr_desc.longs_size()); + for (int i = 0; i < attr_desc.longs_size(); ++i) { + val[i] = attr_desc.longs(i); + } + return val; + } default: PADDLE_THROW("Unsupport attr type %d", attr_desc.type()); } diff --git a/paddle/fluid/framework/attribute.h b/paddle/fluid/framework/attribute.h index 14ca3e96209ed17f12e87fda8506806514698977..d9c76881b7e98d0b7cd29024b98c8f7720398c66 100644 --- a/paddle/fluid/framework/attribute.h +++ b/paddle/fluid/framework/attribute.h @@ -26,6 +26,113 @@ limitations under the License. 
 */
 namespace paddle {
 namespace framework {
+
+template <typename T>
+struct ExtractAttribute {
+  explicit ExtractAttribute(const std::string& attr_name)
+      : attr_name_(attr_name) {}
+
+  T* operator()(Attribute& attr) const {
+    T* attr_value = nullptr;
+    try {
+      attr_value = &boost::get<T>(attr);
+    } catch (boost::bad_get& bad_get) {
+      PADDLE_THROW("Cannot get attribute %s by type %s, its type is %s",
+                   attr_name_, paddle::platform::demangle(typeid(T).name()),
+                   paddle::platform::demangle(attr.type().name()));
+    }
+    return attr_value;
+  }
+
+  const std::string& attr_name_;
+};
+
+// special handle bool
+// FIXME(yuyang18): Currently we cast bool into int in python binding. It is
+// hard to change the logic there. In another way, we should correct handle
+// if the user set `some_flag=1`.
+//
+// FIX ME anytime if there is a better solution.
+template <>
+struct ExtractAttribute<bool> {
+  explicit ExtractAttribute(const std::string& attr_name)
+      : attr_name_(attr_name) {}
+
+  bool* operator()(Attribute& attr) const {
+    if (attr.type() == typeid(int)) {  // NOLINT
+      int val = boost::get<int>(attr);
+      attr = static_cast<bool>(val);
+    } else if (attr.type() == typeid(float)) {  // NOLINT
+      float val = boost::get<float>(attr);
+      attr = static_cast<bool>(val);
+    }
+    bool* attr_value = nullptr;
+    try {
+      attr_value = &boost::get<bool>(attr);
+    } catch (boost::bad_get& bad_get) {
+      PADDLE_THROW("Cannot get attribute %s by type bool, its type is %s",
+                   attr_name_, paddle::platform::demangle(attr.type().name()));
+    }
+    return attr_value;
+  }
+
+  const std::string& attr_name_;
+};
+
+template <>
+struct ExtractAttribute<int64_t> {
+  explicit ExtractAttribute(const std::string& attr_name)
+      : attr_name_(attr_name) {}
+
+  int64_t* operator()(Attribute& attr) const {
+    if (attr.type() == typeid(int)) {  // NOLINT
+      int val = boost::get<int>(attr);
+      attr = static_cast<int64_t>(val);
+    } else if (attr.type() == typeid(float)) {  // NOLINT
+      int val = boost::get<float>(attr);
+      attr = static_cast<int64_t>(val);
+    }
+    int64_t* attr_value = nullptr;
+    try {
+      attr_value = &boost::get<int64_t>(attr);
+    } catch (boost::bad_get& bad_get) {
+      PADDLE_THROW("Cannot get attribute %s by type int64_t, its type is %s",
+                   attr_name_, paddle::platform::demangle(attr.type().name()));
+    }
+    return attr_value;
+  }
+
+  const std::string& attr_name_;
+};
+
+template <>
+struct ExtractAttribute<std::vector<int64_t>> {
+  explicit ExtractAttribute(const std::string& attr_name)
+      : attr_name_(attr_name) {}
+
+  std::vector<int64_t>* operator()(Attribute& attr) const {
+    if (attr.type() == typeid(std::vector<int>)) {  // NOLINT
+      std::vector<int> val = boost::get<std::vector<int>>(attr);
+      std::vector<int64_t> vec(val.begin(), val.end());
+      attr = vec;
+    } else if (attr.type() == typeid(std::vector<float>)) {  // NOLINT
+      std::vector<float> val = boost::get<std::vector<float>>(attr);
+      std::vector<int64_t> vec(val.begin(), val.end());
+      attr = vec;
+    }
+    std::vector<int64_t>* attr_value = nullptr;
+    try {
+      attr_value = &boost::get<std::vector<int64_t>>(attr);
+    } catch (boost::bad_get& bad_get) {
+      PADDLE_THROW("Cannot get attribute %s by type int64_t, its type is %s",
+                   attr_name_, paddle::platform::demangle(attr.type().name()));
+    }
+    return attr_value;
+  }
+
+  const std::string& attr_name_;
+};
+
 template <typename T>
 inline proto::AttrType AttrTypeID() {
   Attribute tmp = T();
@@ -42,7 +149,11 @@ class AttrReader {
   inline const T& Get(const std::string& name) const {
     PADDLE_ENFORCE(attrs_.count(name) != 0, "%s should be in AttributeMap",
                    name);
-    return boost::get<T>(attrs_.at(name));
+
+    Attribute& attr = const_cast<Attribute&>(attrs_.at(name));
+    ExtractAttribute<T> extract_attr(name);
+    T* attr_value = extract_attr(attr);
+    return *attr_value;
   }

  private:
@@ -82,7 +193,7 @@ class
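The coercions above exist because the Python binding passes flags like some_flag=1 through as int (and sometimes float). A minimal, self-contained analogue of the ExtractAttribute<bool> logic, using std::variant in place of Paddle's boost::variant-backed Attribute; every name below is illustrative, not Paddle's API:

    #include <cstdint>
    #include <iostream>
    #include <variant>

    using Attr = std::variant<int, float, bool, std::int64_t>;

    // Coerce an int/float attribute to bool in place, then read it, mirroring
    // the in-place `attr = static_cast<bool>(val)` rewrite shown above.
    bool ExtractBool(Attr& attr) {
      if (std::holds_alternative<int>(attr)) {
        attr = static_cast<bool>(std::get<int>(attr));
      } else if (std::holds_alternative<float>(attr)) {
        attr = static_cast<bool>(std::get<float>(attr));
      }
      return std::get<bool>(attr);  // throws std::bad_variant_access on mismatch
    }

    int main() {
      Attr a = 1;  // a Python-side `some_flag=1` arrives as an int
      std::cout << std::boolalpha << ExtractBool(a) << "\n";  // prints: true
    }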
DefaultValueSetter { public: explicit DefaultValueSetter(T default_value) : default_value_(default_value) {} - void operator()(T& value) const { value = default_value_; } + void operator()(T& value) const { value = default_value_; } // NOLINT private: T default_value_; @@ -117,84 +228,6 @@ class EnumInContainer { std::unordered_set container_; }; -template -struct ExtractAttribute { - explicit ExtractAttribute(const std::string& attr_name) - : attr_name_(attr_name) {} - - T* operator()(Attribute& attr) const { - T* attr_value = nullptr; - try { - attr_value = &boost::get(attr); - } catch (boost::bad_get& bad_get) { - PADDLE_THROW("Cannot get attribute %s by type %s, its type is %s", - attr_name_, paddle::platform::demangle(typeid(T).name()), - paddle::platform::demangle(attr.type().name())); - } - return attr_value; - } - - const std::string& attr_name_; -}; - -// special handle bool -// FIXME(yuyang18): Currently we cast bool into int in python binding. It is -// hard to change the logic there. In another way, we should correct handle -// if the user set `some_flag=1`. -// -// FIX ME anytime if there is a better solution. -template <> -struct ExtractAttribute { - explicit ExtractAttribute(const std::string& attr_name) - : attr_name_(attr_name) {} - - bool* operator()(Attribute& attr) const { - if (attr.type() == typeid(int)) { // NOLINT - int val = boost::get(attr); - attr = static_cast(val); - } else if (attr.type() == typeid(float)) { // NOLINT - float val = boost::get(attr); - attr = static_cast(val); - } - bool* attr_value = nullptr; - try { - attr_value = &boost::get(attr); - } catch (boost::bad_get& bad_get) { - PADDLE_THROW("Cannot get attribute %s by type bool, its type is %s", - attr_name_, paddle::platform::demangle(attr.type().name())); - } - return attr_value; - } - - const std::string& attr_name_; -}; - -template <> -struct ExtractAttribute { - explicit ExtractAttribute(const std::string& attr_name) - : attr_name_(attr_name) {} - - int64_t* operator()(Attribute& attr) const { - if (attr.type() == typeid(int)) { // NOLINT - int val = boost::get(attr); - attr = static_cast(val); - } else if (attr.type() == typeid(float)) { // NOLINT - int val = boost::get(attr); - attr = static_cast(val); - } - int64_t* attr_value = nullptr; - try { - attr_value = &boost::get(attr); - } catch (boost::bad_get& bad_get) { - PADDLE_THROW("Cannot get attribute %s by type int64_t, its type is %s", - attr_name_, paddle::platform::demangle(attr.type().name())); - } - return attr_value; - } - - const std::string& attr_name_; -}; - // check whether a certain attribute fit its limits // an attribute can have more than one limits template @@ -235,7 +268,7 @@ class TypedAttrChecker { return *this; } - void operator()(AttributeMap& attr_map) const { + void operator()(AttributeMap& attr_map) const { // NOLINT if (!attr_map.count(attr_name_)) { // user do not set this attr PADDLE_ENFORCE(!default_value_setter_.empty(), @@ -271,7 +304,7 @@ class OpAttrChecker { return *(checker.target>()); } - void Check(AttributeMap& attr_map) const { + void Check(AttributeMap& attr_map) const { // NOLINT for (const auto& checker : attr_checkers_) { checker(attr_map); } diff --git a/paddle/fluid/framework/data_device_transform.cc b/paddle/fluid/framework/data_device_transform.cc index fee6ba40047053ed5662fe044eceb0c687bd4db9..57ff061fe5e612495add86df8f82fe7d9f9107dc 100644 --- a/paddle/fluid/framework/data_device_transform.cc +++ b/paddle/fluid/framework/data_device_transform.cc @@ -18,8 +18,8 @@ namespace framework { void 
TransDataDevice(const Tensor &in, const platform::Place &dst_place, Tensor *out) { - VLOG(3) << "DeviceTransform in, src_place " << in.place() - << " dst_place: " << dst_place; + VLOG(30) << "DeviceTransform in, src_place " << in.place() + << " dst_place: " << dst_place; PADDLE_ENFORCE_NE( in.place().which(), dst_place.which(), diff --git a/paddle/fluid/framework/data_device_transform_test.cu b/paddle/fluid/framework/data_device_transform_test.cu index f2c55e533a2747325b1b16fdada37945a8ed3c42..2d2323edc3a6636bec72ea2ae7329ebd4e619348 100644 --- a/paddle/fluid/framework/data_device_transform_test.cu +++ b/paddle/fluid/framework/data_device_transform_test.cu @@ -17,7 +17,7 @@ limitations under the License. */ #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/op_info.h" #include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" #include "paddle/fluid/operators/math/math_function.h" #include "paddle/fluid/platform/device_context.h" #include "paddle/fluid/platform/init.h" @@ -49,10 +49,10 @@ class TestOpWithKernel : public OperatorWithKernel { OpKernelType GetExpectedKernelType( const ExecutionContext& ctx) const override { if (Attr("use_gpu")) { - VLOG(3) << "force use gpu kernel"; + VLOG(30) << "force use gpu kernel"; return OpKernelType(proto::VarType::FP32, platform::CUDAPlace(0)); } else { - VLOG(3) << "use default kernel"; + VLOG(30) << "use default kernel"; return OpKernelType(proto::VarType::FP32, ctx.Input("input")->place()); } @@ -148,7 +148,7 @@ TEST(Operator, CPUtoGPU) { // get output auto* output2 = scope.Var("OUT2"); gpu_op->Run(scope, cuda_place); - VLOG(3) << "after gpu_op run"; + VLOG(30) << "after gpu_op run"; // auto* output2_ptr = output2->Get().data(); paddle::platform::DeviceContextPool& pool = diff --git a/paddle/fluid/framework/details/CMakeLists.txt b/paddle/fluid/framework/details/CMakeLists.txt index e0a3ef5a9c6c53c42ebea1a41cac0d18a77781b2..d6b5ad4570c1d8402dedb8596cc75d9eae5a91c7 100644 --- a/paddle/fluid/framework/details/CMakeLists.txt +++ b/paddle/fluid/framework/details/CMakeLists.txt @@ -1,5 +1,6 @@ cc_library(var_handle SRCS var_handle.cc DEPS place framework_proto node) cc_library(op_handle_base SRCS op_handle_base.cc DEPS var_handle device_context lod_tensor) +cc_library(op_graph_view SRCS op_graph_view.cc DEPS op_handle_base) cc_library(scale_loss_grad_op_handle SRCS scale_loss_grad_op_handle.cc DEPS op_handle_base scope lod_tensor ddim memory) cc_library(fetch_op_handle SRCS fetch_op_handle.cc DEPS op_handle_base scope lod_tensor ddim memory) cc_library(computation_op_handle SRCS computation_op_handle.cc DEPS framework_proto scope place operator op_registry) @@ -16,32 +17,39 @@ if(WITH_GPU) dynload_cuda variable_visitor) nv_library(reduce_op_handle SRCS reduce_op_handle.cc DEPS op_handle_base variable_visitor scope ddim dynload_cuda) nv_library(broadcast_op_handle SRCS broadcast_op_handle.cc DEPS op_handle_base scope ddim memory variable_visitor dynload_cuda) + nv_library(fused_broadcast_op_handle SRCS fused_broadcast_op_handle.cc DEPS broadcast_op_handle) else() cc_library(all_reduce_op_handle SRCS all_reduce_op_handle.cc DEPS op_handle_base scope lod_tensor ddim memory variable_visitor) cc_library(reduce_op_handle SRCS reduce_op_handle.cc DEPS op_handle_base variable_visitor scope ddim) cc_library(broadcast_op_handle SRCS broadcast_op_handle.cc DEPS op_handle_base scope ddim memory variable_visitor) + 
cc_library(fused_broadcast_op_handle SRCS fused_broadcast_op_handle.cc DEPS broadcast_op_handle) endif() cc_library(data_balance_op_handle SRCS data_balance_op_handle.cc DEPS op_handle_base scope lod_tensor) cc_library(gather_op_handle SRCS gather_op_handle.cc DEPS op_handle_base scope ddim memory variable_visitor) cc_library(fuse_vars_op_handle SRCS fuse_vars_op_handle.cc DEPS op_handle_base scope) -if(WITH_GPU) +cc_library(modify_op_lock_and_record_event_pass SRCS modify_op_lock_and_record_event_pass.cc DEPS computation_op_handle op_graph_view multi_devices_helper) + +if (WITH_GPU) cc_library(reference_count_pass SRCS reference_count_pass.cc DEPS computation_op_handle scale_loss_grad_op_handle rpc_op_handle all_reduce_op_handle reduce_op_handle broadcast_op_handle data_balance_op_handle graph graph_helper pass) endif() +cc_library(sequential_execution_pass SRCS sequential_execution_pass.cc DEPS graph graph_helper pass) + cc_library(multi_devices_graph_pass SRCS multi_devices_graph_pass.cc DEPS multi_devices_helper computation_op_handle - scale_loss_grad_op_handle rpc_op_handle all_reduce_op_handle reduce_op_handle broadcast_op_handle data_balance_op_handle) + scale_loss_grad_op_handle rpc_op_handle all_reduce_op_handle reduce_op_handle broadcast_op_handle data_balance_op_handle fused_broadcast_op_handle) -if(WITH_GPU) - cc_library(ssa_graph_executor SRCS ssa_graph_executor.cc DEPS graph framework_proto reference_count_pass) -else() - cc_library(ssa_graph_executor SRCS ssa_graph_executor.cc DEPS graph framework_proto) +set(SSA_GRAPH_EXECUTOR_DEPS graph framework_proto sequential_execution_pass modify_op_lock_and_record_event_pass) +if (WITH_GPU) + list(APPEND SSA_GRAPH_EXECUTOR_DEPS reference_count_pass) endif() +cc_library(ssa_graph_executor SRCS ssa_graph_executor.cc DEPS ${SSA_GRAPH_EXECUTOR_DEPS}) + cc_library(threaded_ssa_graph_executor SRCS threaded_ssa_graph_executor.cc DEPS fetch_op_handle ssa_graph_executor scope simple_threadpool device_context) @@ -54,8 +62,9 @@ cc_library(scope_buffered_ssa_graph_executor SRCS scope_buffered_ssa_graph_execu # device_context reduce_op_handle ) cc_library(fast_threaded_ssa_graph_executor SRCS fast_threaded_ssa_graph_executor.cc DEPS fetch_op_handle ssa_graph_executor scope simple_threadpool device_context) +cc_test(fused_broadcast_op_test SRCS fused_broadcast_op_handle_test.cc DEPS fused_broadcast_op_handle) cc_library(build_strategy SRCS build_strategy.cc DEPS graph_viz_pass multi_devices_graph_pass multi_devices_graph_print_pass multi_devices_graph_check_pass - fuse_elewise_add_act_pass) + fuse_elewise_add_act_pass multi_batch_merge_pass) diff --git a/paddle/fluid/framework/details/all_reduce_op_handle.cc b/paddle/fluid/framework/details/all_reduce_op_handle.cc index 7c5f5bd80a937bf1a1c891155764833d7b21c5c2..a003995ae3f8e111881b4681554aa8eb17b60cc1 100644 --- a/paddle/fluid/framework/details/all_reduce_op_handle.cc +++ b/paddle/fluid/framework/details/all_reduce_op_handle.cc @@ -23,7 +23,7 @@ namespace paddle { namespace framework { namespace details { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) AllReduceOpHandle::AllReduceOpHandle(ir::Node *node, const std::vector &local_scopes, const std::vector &places, @@ -34,7 +34,7 @@ AllReduceOpHandle::AllReduceOpHandle(ir::Node *node, nccl_ctxs_(ctxs) { if (nccl_ctxs_) { for (auto &p : places_) { - this->dev_ctxes_[p] = nccl_ctxs_->DevCtx(p); + this->SetDeviceContext(p, nccl_ctxs_->DevCtx(p)); } } } @@ -46,7 +46,7 @@ AllReduceOpHandle::AllReduceOpHandle(ir::Node *node, 
#endif void AllReduceOpHandle::RunImpl() { - platform::RecordEvent record_event(Name(), dev_ctxes_.begin()->second); + platform::RecordEvent record_event(Name(), dev_ctxes_.cbegin()->second); if (NoDummyInputSize() == 1) { return; // No need to all reduce when GPU count = 1; @@ -74,7 +74,7 @@ void AllReduceOpHandle::RunImpl() { } if (platform::is_gpu_place(lod_tensors[0]->place())) { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) PADDLE_ENFORCE(nccl_ctxs_, "nccl_ctxs should not be nullptr."); int dtype = -1; size_t numel = 0; @@ -127,7 +127,7 @@ void AllReduceOpHandle::RunImpl() { *local_scopes_[i]->FindVar(kLocalExecScopeName)->Get(); auto &p = places_[i]; auto *var = scope.FindVar(out_var_handles[i]->name_); - auto *dev_ctx = dev_ctxes_[p]; + auto *dev_ctx = dev_ctxes_.at(p); RunAndRecordEvent(p, [&trg, var, dev_ctx, p] { auto &tensor_gpu = *var->GetMutable(); diff --git a/paddle/fluid/framework/details/all_reduce_op_handle.h b/paddle/fluid/framework/details/all_reduce_op_handle.h index f6ef3a1367b91b6abf8ce74a91f73056efd0f84e..b449796fcaee73a6b84e0db2b5c76ff94bedcf08 100644 --- a/paddle/fluid/framework/details/all_reduce_op_handle.h +++ b/paddle/fluid/framework/details/all_reduce_op_handle.h @@ -20,7 +20,7 @@ #include "paddle/fluid/framework/details/op_handle_base.h" #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/scope.h" -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) #include "paddle/fluid/platform/nccl_helper.h" #endif @@ -29,7 +29,7 @@ namespace framework { namespace details { struct AllReduceOpHandle : public OpHandleBase { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) AllReduceOpHandle(ir::Node *node, const std::vector &local_scopes, const std::vector &places, const platform::NCCLContextMap *ctxs); @@ -49,7 +49,7 @@ struct AllReduceOpHandle : public OpHandleBase { private: std::vector local_scopes_; std::vector places_; -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) const platform::NCCLContextMap *nccl_ctxs_; #endif }; diff --git a/paddle/fluid/framework/details/broadcast_op_handle.cc b/paddle/fluid/framework/details/broadcast_op_handle.cc index 4fdab5cd94358d08eac7f8b041bf16d09042f0bd..d98df3bbadd391d6df9b0a65a063e43e07a06fbc 100644 --- a/paddle/fluid/framework/details/broadcast_op_handle.cc +++ b/paddle/fluid/framework/details/broadcast_op_handle.cc @@ -48,16 +48,27 @@ void BroadcastOpHandle::RunImpl() { var_scopes.emplace_back(s->FindVar(kLocalExecScopeName)->Get()); } + BroadcastOneVar(*in_var_handle, out_var_handles, var_scopes); +} + +void BroadcastOpHandle::BroadcastOneVar( + const VarHandle &in_var_handle, + const std::vector &out_var_handles, + const std::vector &var_scopes) { auto *in_var = - var_scopes.at(in_var_handle->scope_idx_)->FindVar(in_var_handle->name_); + var_scopes.at(in_var_handle.scope_idx_)->FindVar(in_var_handle.name_); PADDLE_ENFORCE_NOT_NULL(in_var); Tensor &in_tensor = VariableVisitor::GetMutableTensor(in_var); + if (UNLIKELY(!in_tensor.IsInitialized())) { + VLOG(30) << "in var " << in_var_handle.name_ << "not inited, return!"; + return; + } - InitOutputValue(*in_var_handle, out_var_handles); + InitOutputValue(in_var_handle, out_var_handles); if (platform::is_cpu_place(in_tensor.place())) { for (auto *out_var_handle : out_var_handles) { - if (out_var_handle->IsTheSameVar(*in_var_handle)) { + if (out_var_handle->IsTheSameVar(in_var_handle)) { continue; } auto &out_p = out_var_handle->place_; @@ -71,7 +82,7 
@@ void BroadcastOpHandle::RunImpl() { }); } } else { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) VarHandle *out_handle = nullptr; int root_id = boost::get(in_tensor.place()).device; std::vector> broadcast_calls; @@ -114,12 +125,12 @@ void BroadcastOpHandle::RunImpl() { } } - if (!out_handle->IsTheSameVar(*in_var_handle)) { - auto out_var = var_scopes.at(in_var_handle->scope_idx_) + if (!out_handle->IsTheSameVar(in_var_handle)) { + auto out_var = var_scopes.at(in_var_handle.scope_idx_) ->FindVar(out_var_handles[0]->name_); paddle::framework::TensorCopy( - in_tensor, in_var_handle->place_, - *(dev_ctxes_.at(in_var_handle->place_)), + in_tensor, in_var_handle.place_, + *(dev_ctxes_.at(in_var_handle.place_)), &VariableVisitor::GetMutableTensor(out_var)); } }); diff --git a/paddle/fluid/framework/details/broadcast_op_handle.h b/paddle/fluid/framework/details/broadcast_op_handle.h index fe4e733e43417977df324fde808f52b228a27d19..0c75e05f861636565ae855ddd534c1082d40d237 100644 --- a/paddle/fluid/framework/details/broadcast_op_handle.h +++ b/paddle/fluid/framework/details/broadcast_op_handle.h @@ -24,7 +24,7 @@ #include "paddle/fluid/framework/selected_rows.h" #include "paddle/fluid/platform/device_context.h" -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) #include "paddle/fluid/platform/nccl_helper.h" #endif @@ -34,7 +34,7 @@ namespace details { struct BroadcastOpHandle : public OpHandleBase { public: -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) BroadcastOpHandle(ir::Node *node, const std::vector &local_scopes, const std::vector &places, const platform::NCCLContextMap *nccl_ctxs) @@ -44,7 +44,8 @@ struct BroadcastOpHandle : public OpHandleBase { nccl_ctxs_(nccl_ctxs) { if (nccl_ctxs_) { for (auto &p_ctx : nccl_ctxs_->contexts_) { - dev_ctxes_[platform::CUDAPlace(p_ctx.first)] = p_ctx.second.ctx_.get(); + this->SetDeviceContext(platform::CUDAPlace(p_ctx.first), + p_ctx.second.ctx_.get()); } } } @@ -61,10 +62,13 @@ struct BroadcastOpHandle : public OpHandleBase { protected: void RunImpl() override; - private: + void BroadcastOneVar(const VarHandle &in_var_handle, + const std::vector &out_var_handles, + const std::vector &var_scopes); + std::vector local_scopes_; std::vector places_; -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) const platform::NCCLContextMap *nccl_ctxs_; #endif diff --git a/paddle/fluid/framework/details/broadcast_op_handle_test.cc b/paddle/fluid/framework/details/broadcast_op_handle_test.cc index ab7412a19fbd13fa39dbae9af528d158cc9ddbd0..650de5a48de6b1fdab120cdeda563a169fd1a1c1 100644 --- a/paddle/fluid/framework/details/broadcast_op_handle_test.cc +++ b/paddle/fluid/framework/details/broadcast_op_handle_test.cc @@ -12,232 +12,12 @@ // See the License for the specific language governing permissions and // limitations under the License. 
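The repeated switch in the handles above from dev_ctxes_[p] to dev_ctxes_.at(p) (and to SetDeviceContext) is not cosmetic: on a std::map, operator[] default-constructs and inserts a missing entry, silently yielding a null device context, while at() throws. A small stand-alone illustration of the difference:

    #include <iostream>
    #include <map>
    #include <stdexcept>

    int main() {
      std::map<int, const char*> ctxs = {{0, "cuda:0"}};
      // operator[] silently inserts a default (null) entry for a missing key:
      std::cout << (ctxs[1] == nullptr) << " size=" << ctxs.size() << "\n";  // 1 size=2
      try {
        ctxs.at(2);  // at() never inserts; it throws on a missing key
      } catch (const std::out_of_range&) {
        std::cout << "missing device context detected early\n";
      }
    }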
-#include "paddle/fluid/framework/details/broadcast_op_handle.h" -#include "gtest/gtest.h" - -#include "paddle/fluid/platform/device_context.h" +#include "paddle/fluid/framework/details/broadcast_op_handle_test.h" namespace paddle { namespace framework { namespace details { -namespace f = paddle::framework; -namespace p = paddle::platform; - -// test data amount -const f::DDim kDims = {20, 20}; - -struct TestBroadcastOpHandle { - std::vector> ctxs_; - std::vector local_scopes_; - std::vector param_scopes_; - Scope g_scope_; - std::unique_ptr op_handle_; - std::vector> vars_; - std::vector gpu_list_; - bool use_gpu_; -#ifdef PADDLE_WITH_CUDA - std::unique_ptr nccl_ctxs_; -#endif - - void WaitAll() { - for (size_t j = 0; j < ctxs_.size(); ++j) { - ctxs_[j]->Wait(); - } -#ifdef PADDLE_WITH_CUDA - if (nccl_ctxs_) { - nccl_ctxs_->WaitAll(); - } -#endif - } - - void InitCtxOnGpu(bool use_gpu) { - use_gpu_ = use_gpu; - if (use_gpu_) { -#ifdef PADDLE_WITH_CUDA - int count = p::GetCUDADeviceCount(); - if (count <= 1) { - LOG(WARNING) << "Cannot test multi-gpu Broadcast, because the CUDA " - "device count is " - << count; - exit(0); - } - for (int i = 0; i < count; ++i) { - auto p = p::CUDAPlace(i); - gpu_list_.push_back(p); - ctxs_.emplace_back(new p::CUDADeviceContext(p)); - } - nccl_ctxs_.reset(new platform::NCCLContextMap(gpu_list_)); -#else - PADDLE_THROW("CUDA is not support."); -#endif - } else { - int count = 8; - for (int i = 0; i < count; ++i) { - auto p = p::CPUPlace(); - gpu_list_.push_back(p); - ctxs_.emplace_back(new p::CPUDeviceContext(p)); - } -#ifdef PADDLE_WITH_CUDA - nccl_ctxs_.reset(nullptr); -#endif - } - } - - void InitBroadcastOp(size_t input_scope_idx) { - for (size_t j = 0; j < gpu_list_.size(); ++j) { - local_scopes_.push_back(&(g_scope_.NewScope())); - Scope& local_scope = local_scopes_.back()->NewScope(); - *local_scopes_.back() - ->Var(details::kLocalExecScopeName) - ->GetMutable() = &local_scope; - local_scope.Var("out"); - param_scopes_.emplace_back(&local_scope); - } - param_scopes_[input_scope_idx]->Var("input"); - - std::unique_ptr n = - ir::CreateNodeForTest("node0", ir::Node::Type::kOperation); - if (use_gpu_) { -#ifdef PADDLE_WITH_CUDA - op_handle_.reset(new BroadcastOpHandle(n.get(), local_scopes_, gpu_list_, - nccl_ctxs_.get())); -#else - PADDLE_THROW("CUDA is not support."); -#endif - } else { -#ifdef PADDLE_WITH_CUDA - op_handle_.reset(new BroadcastOpHandle(n.get(), local_scopes_, gpu_list_, - nccl_ctxs_.get())); -#else - op_handle_.reset( - new BroadcastOpHandle(n.get(), local_scopes_, gpu_list_)); -#endif - } - - std::unique_ptr v = - ir::CreateNodeForTest("node1", ir::Node::Type::kVariable); - auto* in_var_handle = new VarHandle(v.get(), 1, input_scope_idx, "input", - gpu_list_[input_scope_idx]); - vars_.emplace_back(in_var_handle); - op_handle_->AddInput(in_var_handle); - - // add dummy var - - std::unique_ptr v2 = - ir::CreateNodeForTest("node2", ir::Node::Type::kVariable); - vars_.emplace_back(new DummyVarHandle(v2.get())); - DummyVarHandle* dummy_var_handle = - static_cast(vars_.back().get()); - dummy_var_handle->ClearGeneratedOp(); - op_handle_->AddInput(dummy_var_handle); - - for (size_t j = 0; j < gpu_list_.size(); ++j) { - if (!use_gpu_) { - op_handle_->SetDeviceContext(gpu_list_[j], ctxs_[j].get()); - } - std::unique_ptr v3 = - ir::CreateNodeForTest("node3", ir::Node::Type::kVariable); - VarHandle* out_var_handle = - new VarHandle(v3.get(), 2, j, "out", gpu_list_[j]); - vars_.emplace_back(out_var_handle); - op_handle_->AddOutput(out_var_handle); - 
} - - // add dummy var - std::unique_ptr v4 = - ir::CreateNodeForTest("node4", ir::Node::Type::kVariable); - vars_.emplace_back(new DummyVarHandle(v4.get())); - DummyVarHandle* out_dummy_var_handle = - static_cast(vars_.back().get()); - out_dummy_var_handle->ClearGeneratedOp(); - op_handle_->AddOutput(out_dummy_var_handle); - } - - void TestBroadcastLodTensor(size_t input_scope_idx) { - auto in_var = param_scopes_[input_scope_idx]->FindVar("input"); - PADDLE_ENFORCE_NOT_NULL(in_var); - auto in_lod_tensor = in_var->GetMutable(); - in_lod_tensor->mutable_data(kDims, gpu_list_[input_scope_idx]); - - std::vector send_vector(static_cast(f::product(kDims))); - for (size_t k = 0; k < send_vector.size(); ++k) { - send_vector[k] = k; - } - f::LoD lod{{0, 10, 20}}; - paddle::framework::TensorFromVector( - send_vector, *(ctxs_[input_scope_idx]), in_lod_tensor); - in_lod_tensor->set_lod(lod); - in_lod_tensor->Resize(kDims); - - op_handle_->Run(false); - - WaitAll(); - - p::CPUPlace cpu_place; - for (size_t j = 0; j < gpu_list_.size(); ++j) { - auto out_var = param_scopes_[j]->FindVar("out"); - PADDLE_ENFORCE_NOT_NULL(out_var); - auto out_tensor = out_var->Get(); - PADDLE_ENFORCE_EQ(out_tensor.lod(), lod, "lod is not equal."); - - f::Tensor result_tensor; - f::TensorCopySync(out_tensor, cpu_place, &result_tensor); - float* ct = result_tensor.mutable_data(cpu_place); - - for (int64_t i = 0; i < f::product(kDims); ++i) { - ASSERT_NEAR(ct[i], send_vector[i], 1e-5); - } - } - } - - void TestBroadcastSelectedRows(size_t input_scope_idx) { - auto in_var = param_scopes_[input_scope_idx]->FindVar("input"); - PADDLE_ENFORCE_NOT_NULL(in_var); - auto in_selected_rows = in_var->GetMutable(); - auto value = in_selected_rows->mutable_value(); - value->mutable_data(kDims, gpu_list_[input_scope_idx]); - int height = static_cast(kDims[0]) * 2; - std::vector rows{0, 1, 2, 3, 3, 0, 14, 7, 3, 1, - 2, 4, 6, 3, 1, 1, 1, 1, 3, 7}; - in_selected_rows->set_height(height); - in_selected_rows->set_rows(rows); - - std::vector send_vector(static_cast(f::product(kDims))); - for (size_t k = 0; k < send_vector.size(); ++k) { - send_vector[k] = k; - } - paddle::framework::TensorFromVector( - send_vector, *(ctxs_[input_scope_idx]), value); - - op_handle_->Run(false); - - WaitAll(); - - p::CPUPlace cpu_place; - for (size_t j = 0; j < gpu_list_.size(); ++j) { - auto out_var = param_scopes_[j]->FindVar("out"); - PADDLE_ENFORCE_NOT_NULL(out_var); - auto& out_select_rows = out_var->Get(); - auto rt = out_select_rows.value(); - - PADDLE_ENFORCE_EQ(out_select_rows.height(), height, - "height is not equal."); - for (size_t k = 0; k < out_select_rows.rows().size(); ++k) { - PADDLE_ENFORCE_EQ(out_select_rows.rows()[k], rows[k]); - } - - f::Tensor result_tensor; - f::TensorCopySync(rt, cpu_place, &result_tensor); - float* ct = result_tensor.data(); - - for (int64_t i = 0; i < f::product(kDims); ++i) { - ASSERT_NEAR(ct[i], send_vector[i], 1e-5); - } - } - } -}; - TEST(BroadcastTester, TestCPUBroadcastTestLodTensor) { TestBroadcastOpHandle test_op; size_t input_scope_idx = 0; diff --git a/paddle/fluid/framework/details/broadcast_op_handle_test.h b/paddle/fluid/framework/details/broadcast_op_handle_test.h new file mode 100644 index 0000000000000000000000000000000000000000..df3b3cc9ca012eabc428a7fb4c3af9be5b1c5bd5 --- /dev/null +++ b/paddle/fluid/framework/details/broadcast_op_handle_test.h @@ -0,0 +1,273 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. 
+// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include + +#include "gtest/gtest.h" +#include "paddle/fluid/framework/details/broadcast_op_handle.h" + +#include "paddle/fluid/platform/device_context.h" + +namespace paddle { +namespace framework { +namespace details { + +namespace f = paddle::framework; +namespace p = paddle::platform; + +// test data amount +const f::DDim kDims = {20, 20}; + +struct TestBroadcastOpHandle { + std::vector> ctxs_; + std::vector local_scopes_; + std::vector param_scopes_; + Scope g_scope_; + OpHandleBase* op_handle_; + std::vector vars_; + std::vector> nodes_; + std::vector place_list_; + bool use_gpu_; +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + std::unique_ptr nccl_ctxs_; +#endif + + void WaitAll() { + for (size_t j = 0; j < ctxs_.size(); ++j) { + ctxs_[j]->Wait(); + } +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + if (nccl_ctxs_) { + nccl_ctxs_->WaitAll(); + } +#endif + } + + void InitCtxOnGpu(bool use_gpu) { + use_gpu_ = use_gpu; + if (use_gpu_) { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + int count = p::GetCUDADeviceCount(); + if (count <= 1) { + LOG(WARNING) << "Cannot test multi-gpu Broadcast, because the CUDA " + "device count is " + << count; + exit(0); + } + for (int i = 0; i < count; ++i) { + auto p = p::CUDAPlace(i); + place_list_.push_back(p); + ctxs_.emplace_back(new p::CUDADeviceContext(p)); + } + nccl_ctxs_.reset(new platform::NCCLContextMap(place_list_)); +#else + PADDLE_THROW("CUDA is not support."); +#endif + } else { + int count = 8; + for (int i = 0; i < count; ++i) { + auto p = p::CPUPlace(); + place_list_.push_back(p); + ctxs_.emplace_back(new p::CPUDeviceContext(p)); + } +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + nccl_ctxs_.reset(nullptr); +#endif + } + } + + void InitBroadcastOp(size_t input_scope_idx) { + nodes_.clear(); + for (size_t j = 0; j < place_list_.size(); ++j) { + local_scopes_.push_back(&(g_scope_.NewScope())); + Scope& local_scope = local_scopes_.back()->NewScope(); + *local_scopes_.back() + ->Var(details::kLocalExecScopeName) + ->GetMutable() = &local_scope; + local_scope.Var("out"); + param_scopes_.emplace_back(&local_scope); + } + param_scopes_[input_scope_idx]->Var("input"); + + nodes_.emplace_back( + ir::CreateNodeForTest("node0", ir::Node::Type::kOperation)); + if (use_gpu_) { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + op_handle_ = new BroadcastOpHandle(nodes_.back().get(), local_scopes_, + place_list_, nccl_ctxs_.get()); +#else + PADDLE_THROW("CUDA is not support."); +#endif + } else { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + op_handle_ = new BroadcastOpHandle(nodes_.back().get(), local_scopes_, + place_list_, nccl_ctxs_.get()); +#else + op_handle_ = new BroadcastOpHandle(nodes_.back().get(), local_scopes_, + place_list_); +#endif + } + + nodes_.emplace_back( + ir::CreateNodeForTest("node1", ir::Node::Type::kVariable)); + auto* in_var_handle = new VarHandle(nodes_.back().get(), 1, input_scope_idx, + "input", 
place_list_[input_scope_idx]); + vars_.emplace_back(in_var_handle); + op_handle_->AddInput(in_var_handle); + + // add dummy var + + nodes_.emplace_back( + ir::CreateNodeForTest("node2", ir::Node::Type::kVariable)); + vars_.emplace_back(new DummyVarHandle(nodes_.back().get())); + DummyVarHandle* dummy_var_handle = + static_cast(vars_.back()); + dummy_var_handle->ClearGeneratedOp(); + op_handle_->AddInput(dummy_var_handle); + + for (size_t j = 0; j < place_list_.size(); ++j) { + if (!use_gpu_) { + op_handle_->SetDeviceContext(place_list_[j], ctxs_[j].get()); + } + nodes_.emplace_back( + ir::CreateNodeForTest("node3", ir::Node::Type::kVariable)); + VarHandle* out_var_handle = + new VarHandle(nodes_.back().get(), 2, j, "out", place_list_[j]); + vars_.emplace_back(out_var_handle); + op_handle_->AddOutput(out_var_handle); + } + + // add dummy var + nodes_.emplace_back( + ir::CreateNodeForTest("node4", ir::Node::Type::kVariable)); + vars_.emplace_back(new DummyVarHandle(nodes_.back().get())); + DummyVarHandle* out_dummy_var_handle = + static_cast(vars_.back()); + out_dummy_var_handle->ClearGeneratedOp(); + op_handle_->AddOutput(out_dummy_var_handle); + } + + std::vector InitLoDTensor(const std::string& varname, + size_t input_scope_idx, const f::LoD& lod, + float val_scalar = 0.0) { + auto var = param_scopes_[input_scope_idx]->FindVar(varname); + + PADDLE_ENFORCE_NOT_NULL(var); + auto lod_tensor = var->GetMutable(); + std::vector send_vector(static_cast(f::product(kDims))); + for (size_t k = 0; k < send_vector.size(); ++k) { + send_vector[k] = k + val_scalar; + } + paddle::framework::TensorFromVector( + send_vector, *(ctxs_[input_scope_idx]), lod_tensor); + lod_tensor->set_lod(lod); + lod_tensor->Resize(kDims); + return send_vector; + } + + std::vector InitSelectedRows(const std::string& varname, + size_t input_scope_idx, + const std::vector& rows, + int height, float value_scalar = 0.0) { + std::vector send_vector(static_cast(f::product(kDims))); + for (size_t k = 0; k < send_vector.size(); ++k) { + send_vector[k] = k + value_scalar; + } + + auto var = param_scopes_[input_scope_idx]->FindVar(varname); + PADDLE_ENFORCE_NOT_NULL(var); + auto selected_rows = var->GetMutable(); + auto value = selected_rows->mutable_value(); + value->mutable_data(kDims, place_list_[input_scope_idx]); + selected_rows->set_height(height); + selected_rows->set_rows(rows); + + paddle::framework::TensorFromVector( + send_vector, *(ctxs_[input_scope_idx]), value); + + return send_vector; + } + + void SelectedRowsEqual(const std::string& varname, int input_scope_idx, + const std::vector& send_vector, + const std::vector& rows, int height) { + auto var = param_scopes_[input_scope_idx]->FindVar(varname); + PADDLE_ENFORCE_NOT_NULL(var); + auto& selected_rows = var->Get(); + auto rt = selected_rows.value(); + PADDLE_ENFORCE_EQ(selected_rows.height(), height, "height is not equal."); + + for (size_t k = 0; k < selected_rows.rows().size(); ++k) { + PADDLE_ENFORCE_EQ(selected_rows.rows()[k], rows[k]); + } + + p::CPUPlace cpu_place; + f::Tensor result_tensor; + f::TensorCopySync(rt, cpu_place, &result_tensor); + float* ct = result_tensor.data(); + + for (int64_t i = 0; i < f::product(kDims); ++i) { + ASSERT_NEAR(ct[i], send_vector[i], 1e-5); + } + } + + void LoDTensorEqual(const std::string& varname, + const std::vector& send_vec, const f::LoD& lod, + framework::Scope* scope) { + p::CPUPlace cpu_place; + auto var = scope->FindVar(varname); + PADDLE_ENFORCE_NOT_NULL(var); + auto tensor = var->Get(); + PADDLE_ENFORCE_EQ(tensor.lod(), 
lod, "lod is not equal."); + f::Tensor result_tensor; + f::TensorCopySync(tensor, cpu_place, &result_tensor); + float* ct = result_tensor.mutable_data(cpu_place); + for (int64_t k = 0; k < f::product(kDims); ++k) { + ASSERT_NEAR(ct[k], send_vec[k], 1e-5); + } + } + + void TestBroadcastLodTensor(size_t input_scope_idx) { + f::LoD lod{{0, 10, 20}}; + auto send_vector = InitLoDTensor("input", input_scope_idx, lod); + + op_handle_->Run(false); + + WaitAll(); + for (size_t j = 0; j < place_list_.size(); ++j) { + LoDTensorEqual("out", send_vector, lod, param_scopes_[j]); + } + } + + void TestBroadcastSelectedRows(size_t input_scope_idx) { + std::vector rows{0, 1, 2, 3, 3, 0, 14, 7, 3, 1, + 2, 4, 6, 3, 1, 1, 1, 1, 3, 7}; + int height = static_cast(kDims[0] * 2); + auto send_vector = InitSelectedRows("input", input_scope_idx, rows, height); + + op_handle_->Run(false); + + WaitAll(); + for (size_t j = 0; j < place_list_.size(); ++j) { + SelectedRowsEqual("out", input_scope_idx, send_vector, rows, height); + } + } +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/build_strategy.cc b/paddle/fluid/framework/details/build_strategy.cc index 6a6b497fa897e3882995688bf36704b1d77ea962..70baced0ada33c23ba05cd2722e607edf847585a 100644 --- a/paddle/fluid/framework/details/build_strategy.cc +++ b/paddle/fluid/framework/details/build_strategy.cc @@ -16,6 +16,7 @@ limitations under the License. */ #include "paddle/fluid/framework/details/multi_devices_graph_check_pass.h" #include "paddle/fluid/framework/details/multi_devices_graph_print_pass.h" +#include "paddle/fluid/framework/details/sequential_execution_pass.h" #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/graph_viz_pass.h" @@ -27,6 +28,10 @@ class ParallelExecutorPassBuilder : public ir::PassBuilder { public: explicit ParallelExecutorPassBuilder(const BuildStrategy &strategy) : ir::PassBuilder(), strategy_(strategy) { + if (strategy_.enable_sequential_execution_) { + AppendPass("sequential_execution_pass"); + } + // Add a graph viz pass to record a graph. if (!strategy_.debug_graphviz_path_.empty()) { auto viz_pass = AppendPass("graph_viz_pass"); @@ -64,15 +69,25 @@ class ParallelExecutorPassBuilder : public ir::PassBuilder { // Verify that the graph is correct for multi-device executor. AppendPass("multi_devices_check_pass"); + + if (strategy_.remove_unnecessary_lock_) { + AppendPass("modify_op_lock_and_record_event_pass"); + } } private: BuildStrategy strategy_; }; -std::shared_ptr BuildStrategy::CreatePassesFromStrategy() - const { +std::shared_ptr BuildStrategy::CreatePassesFromStrategy( + bool finalize_strategy) const { + if (is_finalized_) { + return pass_builder_; + } pass_builder_.reset(new ParallelExecutorPassBuilder(*this)); + if (finalize_strategy) { + is_finalized_ = true; + } return pass_builder_; } @@ -81,15 +96,13 @@ std::unique_ptr BuildStrategy::Apply( const std::string &loss_var_name, const std::unordered_set ¶m_names, const std::vector &local_scopes, -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) const bool use_cuda, platform::NCCLContextMap *nccl_ctxs) const { #else const bool use_cuda) const { #endif - // Create a default one if not initialized by user. - if (!pass_builder_) { - CreatePassesFromStrategy(); - } + // Create a default one if not finalized by user. 
+ CreatePassesFromStrategy(false); std::unique_ptr graph(new ir::Graph(main_program)); @@ -105,11 +118,16 @@ std::unique_ptr BuildStrategy::Apply( pass->Erase("local_scopes"); pass->SetNotOwned>("local_scopes", &local_scopes); -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) platform::NCCLContextMap *nctx = use_cuda ? nccl_ctxs : nullptr; pass->Erase("nccl_ctxs"); pass->SetNotOwned("nccl_ctxs", nctx); #endif + } else if (pass->Type() == "sequential_execution_pass") { + pass->Erase(kAllOpDescs); + pass->Set>( + kAllOpDescs, + new std::vector(main_program.Block(0).AllOps())); } graph = pass->Apply(std::move(graph)); } @@ -121,6 +139,9 @@ std::unique_ptr BuildStrategy::Apply( USE_PASS(fuse_elewise_add_act_pass); USE_PASS(graph_viz_pass); +USE_PASS(multi_batch_merge_pass); USE_PASS(multi_devices_pass); USE_PASS(multi_devices_check_pass); USE_PASS(multi_devices_print_pass); +USE_PASS(sequential_execution_pass); +USE_PASS(modify_op_lock_and_record_event_pass); diff --git a/paddle/fluid/framework/details/build_strategy.h b/paddle/fluid/framework/details/build_strategy.h index 02c4bea16916d58a6d0fce8918f8fceb9ff9356e..3236c35efdbf1175c3d06e531fc551f202ae17ad 100644 --- a/paddle/fluid/framework/details/build_strategy.h +++ b/paddle/fluid/framework/details/build_strategy.h @@ -23,7 +23,7 @@ #include "paddle/fluid/platform/device_context.h" #include "paddle/fluid/platform/enforce.h" -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) #include "paddle/fluid/platform/nccl_helper.h" #endif @@ -69,12 +69,26 @@ struct BuildStrategy { bool enable_data_balance_{false}; + bool enable_sequential_execution_{false}; + + bool fuse_broadcast_op_{false}; + + bool remove_unnecessary_lock_{false}; + + // NOTE: + // Before you add new options, think if it's a general strategy that works + // with other strategy. If not, the strategy should be created through + // CreatePassesFromStrategy and the pass can be managed separately. + // User normally doesn't need to call this API. // The PassBuilder allows for more customized insert, remove of passes // from python side. // A new PassBuilder is created based on configs defined above and // passes are owned by the PassBuilder. - std::shared_ptr CreatePassesFromStrategy() const; + std::shared_ptr CreatePassesFromStrategy( + bool finalize_strategy) const; + + bool IsFinalized() const { return is_finalized_; } // Apply the passes built by the pass_builder_. The passes will be // applied to the Program and output an ir::Graph. 
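CreatePassesFromStrategy now takes a finalize_strategy flag: Apply() calls it with false so repeated runs reuse an already-finalized builder, while a caller that wants to freeze the strategy passes true. A sketch of that finalize-once contract, with illustrative stand-in types rather than Paddle's actual classes:

    #include <memory>

    struct PassBuilder {};  // stand-in for ir::PassBuilder

    struct Strategy {
      std::shared_ptr<PassBuilder> CreatePasses(bool finalize) const {
        if (is_finalized_) return builder_;          // frozen: reuse as-is
        builder_ = std::make_shared<PassBuilder>();  // rebuild from options
        if (finalize) is_finalized_ = true;          // Apply() passes false;
        return builder_;                             // freezing callers pass true
      }

     private:
      mutable bool is_finalized_ = false;
      mutable std::shared_ptr<PassBuilder> builder_;
    };

    int main() {
      Strategy s;
      auto a = s.CreatePasses(false);  // rebuilt
      auto b = s.CreatePasses(true);   // rebuilt, then frozen
      auto c = s.CreatePasses(false);  // reused: c == b
      return (b == c && a != b) ? 0 : 1;
    }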
@@ -84,13 +98,14 @@ struct BuildStrategy { const std::string &loss_var_name, const std::unordered_set ¶m_names, const std::vector &local_scopes, -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) const bool use_cuda, platform::NCCLContextMap *nccl_ctxs) const; #else const bool use_cuda) const; #endif private: + mutable bool is_finalized_ = false; mutable std::shared_ptr pass_builder_; }; diff --git a/paddle/fluid/framework/details/computation_op_handle.cc b/paddle/fluid/framework/details/computation_op_handle.cc index b6282debdb4eb6b1f29c39e54ac4f3e2296838da..7ad1e40c600c6e70cea822fac777ff20163078e6 100644 --- a/paddle/fluid/framework/details/computation_op_handle.cc +++ b/paddle/fluid/framework/details/computation_op_handle.cc @@ -29,15 +29,21 @@ ComputationOpHandle::ComputationOpHandle(ir::Node *node, Scope *scope, void ComputationOpHandle::RunImpl() { WaitInputVarGenerated(place_); - this->RunAndRecordEvent([this] { + auto run_func = [this]() { op_->Run(*scope_->FindVar(kLocalExecScopeName)->Get(), place_); - }); + }; + + if (is_lock_and_record_event_free_) { + run_func(); + } else { + this->RunAndRecordEvent(run_func); + } } bool ComputationOpHandle::NeedWait(VarHandleBase *in_var) { bool need_wait = in_var && in_var->GeneratedOp() && - in_var->GeneratedOp()->DeviceContext(place_) != dev_ctxes_[place_]; + in_var->GeneratedOp()->DeviceContext(place_) != dev_ctxes_.at(place_); return need_wait; } diff --git a/paddle/fluid/framework/details/computation_op_handle.h b/paddle/fluid/framework/details/computation_op_handle.h index e98f1ab148db083ac63a1afd43e334fbfae62539..662a91d6b4dfcfed563fdf2e46c22f83f90b40af 100644 --- a/paddle/fluid/framework/details/computation_op_handle.h +++ b/paddle/fluid/framework/details/computation_op_handle.h @@ -36,6 +36,8 @@ struct ComputationOpHandle : public OpHandleBase { const platform::Place &GetPlace() const { return place_; } + void SetLockAndRecordEventFree(bool b) { is_lock_and_record_event_free_ = b; } + protected: void RunImpl() override; @@ -45,6 +47,7 @@ struct ComputationOpHandle : public OpHandleBase { std::unique_ptr op_; Scope *scope_; platform::Place place_; + bool is_lock_and_record_event_free_{false}; }; } // namespace details } // namespace framework diff --git a/paddle/fluid/framework/details/data_balance_op_handle.cc b/paddle/fluid/framework/details/data_balance_op_handle.cc index 525d24322442ef4dd6e8c24212af61c908959b87..cc562c7b102cea80e18cbd2c054c34415a7442c9 100644 --- a/paddle/fluid/framework/details/data_balance_op_handle.cc +++ b/paddle/fluid/framework/details/data_balance_op_handle.cc @@ -20,7 +20,7 @@ namespace paddle { namespace framework { namespace details { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) DataBalanceOpHandle::DataBalanceOpHandle( ir::Node *node, const std::vector &local_scopes, const std::vector &places, @@ -28,7 +28,7 @@ DataBalanceOpHandle::DataBalanceOpHandle( : OpHandleBase(node), local_scopes_(local_scopes), places_(places) { if (ctxs) { for (auto &p : places_) { - this->dev_ctxes_[p] = ctxs->DevCtx(p); + this->SetDeviceContext(p, ctxs->DevCtx(p)); } } } @@ -89,8 +89,8 @@ void DataBalanceOpHandle::RunImpl() { PADDLE_ENFORCE_GT(places_.size(), 1, "Data balance can only be enabled when the number of " "places to run larger than 1."); - auto in_var_handles = DynamicCast(inputs_); - auto out_var_handles = DynamicCast(outputs_); + auto in_var_handles = DynamicCast(this->Inputs()); + auto out_var_handles = DynamicCast(this->Outputs()); 
PADDLE_ENFORCE(in_var_handles.size() % places_.size() == 0); PADDLE_ENFORCE_EQ( in_var_handles.size(), out_var_handles.size(), diff --git a/paddle/fluid/framework/details/data_balance_op_handle.h b/paddle/fluid/framework/details/data_balance_op_handle.h index 0462fb6ec713eb977f420a9cb485c0273e782496..2db18a1a7203f85aac6338576f2e68c7b37d7c69 100644 --- a/paddle/fluid/framework/details/data_balance_op_handle.h +++ b/paddle/fluid/framework/details/data_balance_op_handle.h @@ -19,7 +19,7 @@ #include "paddle/fluid/framework/details/op_handle_base.h" #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/scope.h" -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) #include "paddle/fluid/platform/nccl_helper.h" #endif @@ -29,7 +29,7 @@ namespace details { struct DataBalanceOpHandle : public OpHandleBase { public: -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) DataBalanceOpHandle(ir::Node *node, const std::vector &local_scopes, const std::vector &places, const platform::NCCLContextMap *ctxs); diff --git a/paddle/fluid/framework/details/exception_holder.h b/paddle/fluid/framework/details/exception_holder.h index c97b364de1ecae21e97351196389615187932b5e..1b1afce04ebbf803f543f839eadc26c522cc89ef 100644 --- a/paddle/fluid/framework/details/exception_holder.h +++ b/paddle/fluid/framework/details/exception_holder.h @@ -30,6 +30,8 @@ class ExceptionHolder { Catch(exp); } catch (platform::EnforceNotMet exp) { Catch(exp); + } catch (std::exception& ex) { + LOG(FATAL) << "std::exception caught, " << ex.what(); } catch (...) { LOG(FATAL) << "Unknown exception caught"; } diff --git a/paddle/fluid/framework/details/execution_strategy.h b/paddle/fluid/framework/details/execution_strategy.h index 5183be878eb49cccc68603c3fdd8023be5578036..15c496130c2b6c7643ff96661be09e5ac4870344 100644 --- a/paddle/fluid/framework/details/execution_strategy.h +++ b/paddle/fluid/framework/details/execution_strategy.h @@ -13,6 +13,7 @@ // limitations under the License. 
#pragma once +#include // for size_t namespace paddle { namespace framework { @@ -26,6 +27,7 @@ struct ExecutionStrategy { bool allow_op_delay_{false}; size_t num_iteration_per_drop_scope_{100}; ExecutorType type_{kDefault}; + bool dry_run_{false}; }; } // namespace details diff --git a/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc b/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc index 6e22fedf1c39428528c00cce4c9a4460dfb95cb3..949510e03705a4a0900f1c7b8758a8f7308aa44b 100644 --- a/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc +++ b/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc @@ -16,6 +16,7 @@ #include #include "paddle/fluid/framework/details/fetch_op_handle.h" #include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/ir/graph_helper.h" namespace paddle { namespace framework { @@ -29,16 +30,14 @@ FastThreadedSSAGraphExecutor::FastThreadedSSAGraphExecutor( local_scopes_(local_scopes), places_(places), graph_(std::move(graph)), - pool_(strategy.num_threads_ + - 1), // add one more thread for generate op_deps + pool_(strategy.num_threads_), + prepare_pool_(1), // add one more thread for generate op_deps fetch_ctxs_(places) { - auto &ops = graph_->Get("ops"); - - for (auto &op : ops) { + for (auto &op : ir::FilterByNodeWrapper(*graph_)) { int dep = static_cast(op->NotReadyInputSize()); - op_deps_.emplace(op.get(), dep); + op_deps_.emplace(op, dep); if (dep == 0) { - bootstrap_ops_.emplace_back(op.get()); + bootstrap_ops_.emplace_back(op); } } @@ -54,13 +53,13 @@ FeedFetchList FastThreadedSSAGraphExecutor::Run( paddle::framework::FeedFetchList fetches; fetches.resize(fetch_tensors.size()); std::unordered_map> fetched_vars; - std::vector> fetch_ops; + std::vector fetch_ops; for (auto &fetch_var_name : fetch_tensors) { for (auto &var_map : graph_->Get("vars")) { auto it = var_map.find(fetch_var_name); if (it != var_map.end()) { - fetched_vars[fetch_var_name].push_back(it->second.rbegin()->get()); + fetched_vars[fetch_var_name].push_back(*it->second.rbegin()); } } } @@ -92,13 +91,13 @@ FeedFetchList FastThreadedSSAGraphExecutor::Run( size_t num_complete = 0; remaining_ = 0; - BlockingQueue complete_q; + auto complete_q = std::make_shared>(); for (auto op : bootstrap_ops_) { - RunOpAsync(op_deps.get(), op, &complete_q); + RunOpAsync(op_deps.get(), op, complete_q); } while (num_complete != op_deps->size()) { - size_t num_comp = complete_q.Pop(); + size_t num_comp = complete_q->Pop(); if (num_comp == -1UL) { int remaining = 0; while (true) { @@ -107,10 +106,13 @@ FeedFetchList FastThreadedSSAGraphExecutor::Run( break; } for (int i = 0; i < remaining; ++i) { - complete_q.Pop(); + complete_q->Pop(); } } - exception_.ReThrow(); + if (exception_.IsCaught()) { + ClearFetchOp(graph_.get(), &fetch_ops); + exception_.ReThrow(); + } } num_complete += num_comp; } @@ -120,14 +122,17 @@ FeedFetchList FastThreadedSSAGraphExecutor::Run( } void FastThreadedSSAGraphExecutor::RunOpAsync( std::unordered_map> *op_deps, - OpHandleBase *op, BlockingQueue *complete_q) { + OpHandleBase *op, + const std::shared_ptr> &complete_q) { ++remaining_; this->pool_.enqueue([=] { OpHandleBase *op_to_run = op; size_t complete = 0; while (op_to_run != nullptr) { try { - op_to_run->Run(strategy_.use_cuda_); + if (LIKELY(!strategy_.dry_run_)) { + op_to_run->Run(strategy_.use_cuda_); + } ++complete; } catch (...) 
{ exception_.Catch(std::current_exception()); @@ -144,7 +149,7 @@ void FastThreadedSSAGraphExecutor::RunOpAsync( if (op_to_run == nullptr) { op_to_run = pending_op; } else { - this->RunOpAsync(op_deps, pending_op, complete_q); + RunOpAsync(op_deps, pending_op, complete_q); } } } @@ -155,9 +160,8 @@ void FastThreadedSSAGraphExecutor::RunOpAsync( }); } void FastThreadedSSAGraphExecutor::PrepareAtomicOpDeps() { - atomic_op_deps_ = pool_.enqueue([&] { - std::unordered_map> *op_deps = - new std::unordered_map>; + atomic_op_deps_ = prepare_pool_.enqueue([&] { + auto *op_deps = new std::unordered_map>; for (auto &pair : op_deps_) { (*op_deps)[pair.first] = pair.second; } diff --git a/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h b/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h index dad3a231cba6402f57ba654a9ac5fb520b9c8f04..c3a8b85423403992e3a12ceb0a1acbae82d25dfa 100644 --- a/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h +++ b/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h @@ -13,9 +13,9 @@ // limitations under the License. #pragma once +#include #include #include -#include "ThreadPool.h" #include "paddle/fluid/framework/blocking_queue.h" #include "paddle/fluid/framework/details/exception_holder.h" #include "paddle/fluid/framework/details/execution_strategy.h" @@ -46,11 +46,13 @@ class FastThreadedSSAGraphExecutor : public SSAGraphExecutor { std::vector bootstrap_ops_; ::ThreadPool pool_; + ::ThreadPool prepare_pool_; platform::DeviceContextPool fetch_ctxs_; std::atomic remaining_; void RunOpAsync(std::unordered_map> *op_deps, - OpHandleBase *op, BlockingQueue *complete_q); + OpHandleBase *op, + const std::shared_ptr> &complete_q); void PrepareAtomicOpDeps(); diff --git a/paddle/fluid/framework/details/fetch_op_handle.cc b/paddle/fluid/framework/details/fetch_op_handle.cc index fe18b2060c5cd7e157374da53c5a985f70545ab7..648adae06facb504042d8286f6eab5d98e99c015 100644 --- a/paddle/fluid/framework/details/fetch_op_handle.cc +++ b/paddle/fluid/framework/details/fetch_op_handle.cc @@ -28,11 +28,7 @@ FetchOpHandle::FetchOpHandle(ir::Node *node, FeedFetchList *data, size_t offset, offset_(offset), local_scopes_(local_scopes) {} -FetchOpHandle::~FetchOpHandle() { - for (auto *input_var : inputs_) { - input_var->RemoveOutput(this, this->Node()); - } -} +FetchOpHandle::~FetchOpHandle() {} void FetchOpHandle::RecordWaitEventOnCtx(platform::DeviceContext *waited_ctx) { PADDLE_THROW("Nobody should wait FetchOp. Unexpceted Error"); diff --git a/paddle/fluid/framework/details/fused_broadcast_op_handle.cc b/paddle/fluid/framework/details/fused_broadcast_op_handle.cc new file mode 100644 index 0000000000000000000000000000000000000000..51dfa2d0711f49aaefab0af3549283dbf77eee4a --- /dev/null +++ b/paddle/fluid/framework/details/fused_broadcast_op_handle.cc @@ -0,0 +1,55 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
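FastThreadedSSAGraphExecutor above keeps a per-op count of not-ready inputs, bootstraps from the ops whose count is zero, and has RunOpAsync decrement each successor's count as ops finish, chaining the ones that become ready. A single-threaded sketch of that dependency-count scheduling, with illustrative types:

    #include <queue>
    #include <unordered_map>
    #include <vector>

    struct Op { std::vector<Op*> successors; };

    void RunAll(std::unordered_map<Op*, int> deps, std::vector<Op*> bootstrap) {
      std::queue<Op*> ready;
      for (auto* op : bootstrap) ready.push(op);  // ops with zero pending inputs
      while (!ready.empty()) {
        Op* op = ready.front(); ready.pop();
        // ... the op would execute here ...
        for (auto* next : op->successors) {
          if (--deps[next] == 0) ready.push(next);  // last input became ready
        }
      }
    }

    int main() {
      Op c, b{{&c}}, a{{&b, &c}};
      RunAll({{&b, 1}, {&c, 2}}, {&a});  // runs a, then b, then c
    }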
+
+#include "paddle/fluid/framework/details/fused_broadcast_op_handle.h"
+#include "paddle/fluid/framework/details/container_cast.h"
+#include "paddle/fluid/framework/details/variable_visitor.h"
+#include "paddle/fluid/platform/profiler.h"
+
+namespace paddle {
+namespace framework {
+namespace details {
+
+void FusedBroadcastOpHandle::RunImpl() {
+  platform::RecordEvent record_event(Name(), dev_ctxes_.begin()->second);
+
+  if (places_.size() == 1UL) return;
+
+  auto in_var_handles = DynamicCast<VarHandle>(inputs_);
+  auto out_var_handles = DynamicCast<VarHandle>(outputs_);
+
+  WaitInputVarGenerated();
+
+  std::vector<const Scope *> var_scopes;
+  for (auto *s : local_scopes_) {
+    var_scopes.emplace_back(s->FindVar(kLocalExecScopeName)->Get<Scope *>());
+  }
+
+  size_t place_num = places_.size();
+  PADDLE_ENFORCE_EQ(in_var_handles.size() * place_num, out_var_handles.size());
+
+  for (size_t i = 0; i < in_var_handles.size(); ++i) {
+    BroadcastOneVar(
+        *in_var_handles[i],
+        std::vector<VarHandle *>(out_var_handles.begin() + i * place_num,
+                                 out_var_handles.begin() + (i + 1) * place_num),
+        var_scopes);
+  }
+}
+
+std::string FusedBroadcastOpHandle::Name() const { return "fused_broadcast"; }
+
+}  // namespace details
+}  // namespace framework
+}  // namespace paddle
diff --git a/paddle/fluid/framework/details/fused_broadcast_op_handle.h b/paddle/fluid/framework/details/fused_broadcast_op_handle.h
new file mode 100644
index 0000000000000000000000000000000000000000..e43d545c9c0d062a8814a2f5d27f9ca20290b9dd
--- /dev/null
+++ b/paddle/fluid/framework/details/fused_broadcast_op_handle.h
@@ -0,0 +1,57 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
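The invariant FusedBroadcastOpHandle::RunImpl enforces above is that outputs arrive flattened: for N broadcast variables and P places there must be exactly N*P output handles, and variable i owns the contiguous slice [i*P, (i+1)*P). A standalone toy of that slicing, with plain ints standing in for VarHandle pointers:

// Toy model of the flattened-output layout used by FusedBroadcastOpHandle.
#include <cassert>
#include <iostream>
#include <vector>

int main() {
  const size_t num_vars = 2, num_places = 3;
  std::vector<int> outputs;  // stand-in for the VarHandle* vector
  for (size_t i = 0; i < num_vars * num_places; ++i) {
    outputs.push_back(static_cast<int>(i));
  }

  assert(outputs.size() == num_vars * num_places);  // the PADDLE_ENFORCE_EQ
  for (size_t i = 0; i < num_vars; ++i) {
    // variable i broadcasts to the slice [i * num_places, (i + 1) * num_places)
    std::vector<int> per_var(outputs.begin() + i * num_places,
                             outputs.begin() + (i + 1) * num_places);
    std::cout << "var " << i << " broadcasts to " << per_var.size()
              << " places\n";
  }
}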
+
+#pragma once
+
+#include <map>
+#include <string>
+#include <vector>
+
+#include "paddle/fluid/framework/details/broadcast_op_handle.h"
+#include "paddle/fluid/framework/details/multi_devices_helper.h"
+#include "paddle/fluid/framework/lod_tensor.h"
+#include "paddle/fluid/framework/scope.h"
+#include "paddle/fluid/framework/selected_rows.h"
+#include "paddle/fluid/platform/device_context.h"
+
+#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32)
+#include "paddle/fluid/platform/nccl_helper.h"
+#endif
+
+namespace paddle {
+namespace framework {
+namespace details {
+
+struct FusedBroadcastOpHandle : public BroadcastOpHandle {
+ public:
+#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32)
+  FusedBroadcastOpHandle(ir::Node *node,
+                         const std::vector<Scope *> local_scopes,
+                         const std::vector<platform::Place> &places,
+                         const platform::NCCLContextMap *nccl_ctx)
+      : BroadcastOpHandle(node, local_scopes, places, nccl_ctx) {}
+#else
+  FusedBroadcastOpHandle(ir::Node* node, const std::vector<Scope*> local_scopes,
+                         const std::vector<platform::Place>& places)
+      : BroadcastOpHandle(node, local_scopes, places) {}
+#endif
+  std::string Name() const override;
+
+ protected:
+  void RunImpl() override;
+};
+
+}  // namespace details
+}  // namespace framework
+}  // namespace paddle
diff --git a/paddle/fluid/framework/details/fused_broadcast_op_handle_test.cc b/paddle/fluid/framework/details/fused_broadcast_op_handle_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..be0d941c4f9c2fe8fbb1da8ec2c11868112fcf9b
--- /dev/null
+++ b/paddle/fluid/framework/details/fused_broadcast_op_handle_test.cc
@@ -0,0 +1,167 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
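One thing to watch in the test that follows: variable names are built with expressions like "out_var" + j, where j is a size_t. In C++ that is pointer arithmetic on the string literal, not concatenation, so "out_var" + 1 yields "ut_var". The test builds both the scope variables and the var handles with the same expression, so the names still agree with each other, but the intended spelling is std::to_string. A quick demonstration:

// Why "out_var" + j is suspicious: it offsets the char pointer instead of
// appending the number.
#include <iostream>
#include <string>

int main() {
  size_t j = 1;
  const char *shifted = "out_var" + j;                    // pointer arithmetic
  std::string appended = "out_var" + std::to_string(j);   // real concatenation
  std::cout << shifted << "\n";   // prints "ut_var"
  std::cout << appended << "\n";  // prints "out_var1"
}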
+ +#include "paddle/fluid/framework/details/fused_broadcast_op_handle.h" +#include "gtest/gtest.h" +#include "paddle/fluid/framework/details/broadcast_op_handle_test.h" + +namespace paddle { +namespace framework { +namespace details { + +struct TestFusedBroadcastOpHandle : TestBroadcastOpHandle { + std::vector out_varnames_; + std::vector> nodes_; + + void InitFusedBroadcastOp(std::vector input_scope_idxes) { + nodes_.clear(); + // initialize scope and var + for (size_t i = 0; i < place_list_.size(); ++i) { + local_scopes_.push_back(&(g_scope_.NewScope())); + Scope& local_scope = local_scopes_.back()->NewScope(); + *local_scopes_.back() + ->Var(details::kLocalExecScopeName) + ->GetMutable() = &local_scope; + for (size_t j = 0; j < input_scope_idxes.size(); ++j) { + local_scope.Var("out_var" + j); + if (i == j) local_scope.Var("in_var" + j); + } + param_scopes_.emplace_back(&local_scope); + } + + // create op handle node + nodes_.emplace_back( + ir::CreateNodeForTest("fused_broadcast", ir::Node::Type::kOperation)); + if (use_gpu_) { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + op_handle_ = new FusedBroadcastOpHandle( + nodes_.back().get(), local_scopes_, place_list_, nccl_ctxs_.get()); +#else + PADDLE_THROW("CUDA is not supported."); +#endif + } else { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + op_handle_ = new FusedBroadcastOpHandle( + nodes_.back().get(), local_scopes_, place_list_, nccl_ctxs_.get()); +#else + op_handle_ = new FusedBroadcastOpHandle(nodes_.back().get(), + local_scopes_, place_list_); +#endif + } + + for (size_t i = 0; i < input_scope_idxes.size(); ++i) { + // add input var handle + nodes_.emplace_back( + ir::CreateNodeForTest("in_node" + i, ir::Node::Type::kVariable)); + VarHandle* in_var_handle = + new VarHandle(nodes_.back().get(), 1, input_scope_idxes[i], + "in_var" + i, place_list_[input_scope_idxes[i]]); + vars_.emplace_back(in_var_handle); + op_handle_->AddInput(in_var_handle); + + // add output var handle + for (size_t j = 0; j < place_list_.size(); ++j) { + nodes_.emplace_back( + ir::CreateNodeForTest("out_node" + i, ir::Node::Type::kVariable)); + VarHandle* out_var_handle = new VarHandle( + nodes_.back().get(), 2, j, "out_var" + i, place_list_[j]); + vars_.emplace_back(out_var_handle); + op_handle_->AddOutput(out_var_handle); + } + } + } + + void TestFusedBroadcastLoDTensor(std::vector input_scope_idxes) { + std::vector> send_vec; + f::LoD lod{{0, 10, 20}}; + for (size_t i = 0; i < input_scope_idxes.size(); ++i) { + const std::string varname("in_var" + i); + float val_scalar = static_cast(i); + send_vec.push_back( + InitLoDTensor(varname, input_scope_idxes[i], lod, val_scalar)); + } + + op_handle_->Run(false); + + WaitAll(); + for (size_t i = 0; i < input_scope_idxes.size(); ++i) { + const std::string& varname("out_var" + i); + for (size_t j = 0; j < place_list_.size(); ++j) { + LoDTensorEqual(varname, send_vec[i], lod, param_scopes_[j]); + } + } + } + + void TestFusedBroadcastSelectedRows(std::vector input_scope_idxes) { + std::vector> send_vector; + std::vector rows{0, 1, 2, 3, 3, 0, 14, 7, 3, 1, + 2, 4, 6, 3, 1, 1, 1, 1, 3, 7}; + int height = static_cast(kDims[0] * 2); + for (size_t i = 0; i < input_scope_idxes.size(); ++i) { + const std::string varname("in_var" + i); + float val_scalar = static_cast(i); + send_vector.push_back(InitSelectedRows(varname, input_scope_idxes[i], + rows, height, val_scalar)); + } + + op_handle_->Run(false); + + WaitAll(); + for (size_t i = 0; i < input_scope_idxes.size(); ++i) { + const std::string& 
varname("out_var" + i); + for (size_t j = 0; j < place_list_.size(); ++j) { + SelectedRowsEqual(varname, input_scope_idxes[i], send_vector[i], rows, + height); + } + } + } +}; + +TEST(FusedBroadcastTester, CPULodTensor) { + TestFusedBroadcastOpHandle test_op; + std::vector input_scope_idxes = {0, 1}; + test_op.InitCtxOnGpu(false); + test_op.InitFusedBroadcastOp(input_scope_idxes); + test_op.TestFusedBroadcastLoDTensor(input_scope_idxes); +} + +TEST(FusedBroadcastTester, CPUSelectedRows) { + TestFusedBroadcastOpHandle test_op; + std::vector input_scope_idxes = {0, 1}; + test_op.InitCtxOnGpu(false); + test_op.InitFusedBroadcastOp(input_scope_idxes); + test_op.TestFusedBroadcastSelectedRows(input_scope_idxes); +} + +#ifdef PADDLE_WITH_CUDA +TEST(FusedBroadcastTester, GPULodTensor) { + TestFusedBroadcastOpHandle test_op; + std::vector input_scope_idxes = {0, 1}; + test_op.InitCtxOnGpu(true); + test_op.InitFusedBroadcastOp(input_scope_idxes); + test_op.TestFusedBroadcastLoDTensor(input_scope_idxes); +} + +TEST(FusedBroadcastTester, GPUSelectedRows) { + TestFusedBroadcastOpHandle test_op; + std::vector input_scope_idxes = {0, 1}; + test_op.InitCtxOnGpu(true); + test_op.InitFusedBroadcastOp(input_scope_idxes); + test_op.TestFusedBroadcastSelectedRows(input_scope_idxes); +} +#endif + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/gather_op_handle.cc b/paddle/fluid/framework/details/gather_op_handle.cc index 9aae19fc73de4387186da47c55710c94d53f1b88..ca4633c5a8f22fc9f7319b06aa766f9fe37dc68c 100644 --- a/paddle/fluid/framework/details/gather_op_handle.cc +++ b/paddle/fluid/framework/details/gather_op_handle.cc @@ -36,7 +36,7 @@ void GatherOpHandle::RunImpl() { VarHandle *out_var_handle; { - auto out_var_handles = DynamicCast(outputs_); + auto out_var_handles = DynamicCast(this->Outputs()); PADDLE_ENFORCE_EQ(out_var_handles.size(), 1, "The number of output should be one."); out_var_handle = out_var_handles.front(); @@ -99,7 +99,7 @@ void GatherOpHandle::RunImpl() { Tensor *out_tensor = out_value->mutable_value(); // copy - auto dev_ctx = dev_ctxes_[out_var_handle->place_]; + auto dev_ctx = dev_ctxes_.at(out_var_handle->place_); RunAndRecordEvent(out_var_handle->place_, [in_tensors, out_tensor, &dev_ctx, t_out_p] { int s = 0, e = 0; diff --git a/paddle/fluid/framework/details/gather_op_handle_test.cc b/paddle/fluid/framework/details/gather_op_handle_test.cc index ed67e88ff6a7fe9efd93e5dfd4d7bdf4c43aac2e..e8cb7feb8bea92a7486b8a9d84ba4b9e2b93dbfb 100644 --- a/paddle/fluid/framework/details/gather_op_handle_test.cc +++ b/paddle/fluid/framework/details/gather_op_handle_test.cc @@ -31,9 +31,10 @@ struct TestGatherOpHandle { std::vector local_scopes_; std::vector param_scopes_; Scope g_scope_; - std::unique_ptr op_handle_; - std::vector> vars_; + OpHandleBase* op_handle_; + std::vector vars_; std::vector gpu_list_; + std::vector> nodes_; void WaitAll() { for (size_t j = 0; j < ctxs_.size(); ++j) { @@ -70,7 +71,7 @@ struct TestGatherOpHandle { } void InitGatherOp(size_t input_scope_idx) { - std::vector> nodes; + nodes_.clear(); for (size_t j = 0; j < gpu_list_.size(); ++j) { local_scopes_.push_back(&(g_scope_.NewScope())); Scope& local_scope = local_scopes_.back()->NewScope(); @@ -82,44 +83,45 @@ struct TestGatherOpHandle { } param_scopes_[input_scope_idx]->Var("out"); - nodes.emplace_back( + nodes_.emplace_back( ir::CreateNodeForTest("node", ir::Node::Type::kOperation).release()); - op_handle_.reset( - new 
GatherOpHandle(nodes.back().get(), local_scopes_, gpu_list_)); + op_handle_ = + new GatherOpHandle(nodes_.back().get(), local_scopes_, gpu_list_); // add input for (size_t j = 0; j < gpu_list_.size(); ++j) { op_handle_->SetDeviceContext(gpu_list_[j], ctxs_[j].get()); - nodes.emplace_back( + nodes_.emplace_back( ir::CreateNodeForTest("node1", ir::Node::Type::kVariable).release()); auto* in_var_handle = - new VarHandle(nodes.back().get(), 1, j, "input", gpu_list_[j]); + new VarHandle(nodes_.back().get(), 1, j, "input", gpu_list_[j]); vars_.emplace_back(in_var_handle); op_handle_->AddInput(in_var_handle); } // add dummy var - nodes.emplace_back( + nodes_.emplace_back( ir::CreateNodeForTest("node2", ir::Node::Type::kVariable).release()); - vars_.emplace_back(new DummyVarHandle(nodes.back().get())); + vars_.emplace_back(new DummyVarHandle(nodes_.back().get())); DummyVarHandle* in_dummy_var_handle = - static_cast(vars_.back().get()); + static_cast(vars_.back()); in_dummy_var_handle->ClearGeneratedOp(); op_handle_->AddInput(in_dummy_var_handle); // add output - nodes.emplace_back( + nodes_.emplace_back( ir::CreateNodeForTest("node3", ir::Node::Type::kVariable).release()); - auto* out_var_handle = new VarHandle(nodes.back().get(), 2, input_scope_idx, - "out", gpu_list_[input_scope_idx]); + auto* out_var_handle = + new VarHandle(nodes_.back().get(), 2, input_scope_idx, "out", + gpu_list_[input_scope_idx]); vars_.emplace_back(out_var_handle); op_handle_->AddOutput(out_var_handle); // add dummy var - nodes.emplace_back( + nodes_.emplace_back( ir::CreateNodeForTest("node4", ir::Node::Type::kVariable).release()); - vars_.emplace_back(new DummyVarHandle(nodes.back().get())); + vars_.emplace_back(new DummyVarHandle(nodes_.back().get())); DummyVarHandle* dummy_var_handle = - static_cast(vars_.back().get()); + static_cast(vars_.back()); op_handle_->AddOutput(dummy_var_handle); } diff --git a/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.cc b/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..bf3f3637b551a8a8084e6e4f1ca6a94b65361f17 --- /dev/null +++ b/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.cc @@ -0,0 +1,60 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
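The new modify_op_lock_and_record_event_pass.cc below marks a GPU ComputationOpHandle as "lock and record-event free" exactly when every op pending on its outputs is itself a ComputationOpHandle on the same place, so the CUDA event handshake between them can be skipped. A standalone model of that predicate, with toy types in place of the real op handles and OpGraphView:

// Standalone model of IsLockAndRecordEventFreeComputationOpHandle: an op may
// skip the device lock + event record only if every consumer of its outputs
// is a computation op running on the same place.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct ToyOp {
  std::string place;          // e.g. "gpu:0"
  bool is_computation{true};  // stand-in for the dynamic_cast check
};

// pending[i] lists the indices of ops that consume op i's outputs.
static bool IsEventFree(
    const std::vector<ToyOp> &ops,
    const std::unordered_map<int, std::vector<int>> &pending, int i) {
  if (ops[i].place.rfind("gpu", 0) != 0) return false;  // GPU ops only
  auto it = pending.find(i);
  if (it == pending.end()) return true;
  for (int next : it->second) {
    if (!ops[next].is_computation || ops[next].place != ops[i].place) {
      return false;
    }
  }
  return true;
}

int main() {
  std::vector<ToyOp> ops = {{"gpu:0"}, {"gpu:0"}, {"gpu:1"}};
  std::unordered_map<int, std::vector<int>> pending = {{0, {1}}, {1, {2}}};
  std::cout << IsEventFree(ops, pending, 0) << "\n";  // 1: consumer on gpu:0
  std::cout << IsEventFree(ops, pending, 1) << "\n";  // 0: consumer on gpu:1
}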
+
+#include "paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.h"
+#include "paddle/fluid/framework/details/computation_op_handle.h"
+#include "paddle/fluid/framework/details/multi_devices_helper.h"
+#include "paddle/fluid/framework/details/op_graph_view.h"
+#include "paddle/fluid/framework/ir/graph_helper.h"
+
+namespace paddle {
+namespace framework {
+namespace details {
+
+static bool IsLockAndRecordEventFreeComputationOpHandle(
+    ComputationOpHandle *op, const OpGraphView &graph_view) {
+  if (!platform::is_gpu_place(op->GetPlace())) return false;
+  for (auto &pending_op : graph_view.PendingOps(op)) {
+    auto *tmp = dynamic_cast<ComputationOpHandle *>(pending_op);
+    if (tmp == nullptr || !(tmp->GetPlace() == op->GetPlace())) {
+      return false;
+    }
+  }
+  return true;
+}
+
+std::unique_ptr<ir::Graph> ModifyOpLockAndRecordEventPass::ApplyImpl(
+    std::unique_ptr<ir::Graph> ir_graph) const {
+  auto all_ops = ir::FilterByNodeWrapper<OpHandleBase>(*ir_graph);
+  OpGraphView graph_view(all_ops);
+  for (auto &op : all_ops) {
+    auto *compute_op = dynamic_cast<ComputationOpHandle *>(op);
+    if (compute_op == nullptr) continue;
+    bool is_lock_and_record_event_free =
+        IsLockAndRecordEventFreeComputationOpHandle(compute_op, graph_view);
+    compute_op->SetLockAndRecordEventFree(is_lock_and_record_event_free);
+    if (is_lock_and_record_event_free) {
+      VLOG(100) << "Set is_lock_and_record_event_free be true in op "
+                << compute_op->DebugString();
+    }
+  }
+  return ir_graph;
+}
+
+}  // namespace details
+}  // namespace framework
+}  // namespace paddle
+
+REGISTER_PASS(modify_op_lock_and_record_event_pass,
+              paddle::framework::details::ModifyOpLockAndRecordEventPass);
diff --git a/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.h b/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.h
new file mode 100644
index 0000000000000000000000000000000000000000..b54e1b318be95e1e0abf6830f8c918895df02718
--- /dev/null
+++ b/paddle/fluid/framework/details/modify_op_lock_and_record_event_pass.h
@@ -0,0 +1,32 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
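Further below, the multi_devices_graph_pass.cc hunks stop storing the variable-to-device map as a graph attribute (kShardedVarDevice is deleted from multi_devices_helper.h) and instead thread a local std::unordered_map<std::string, int> through the builder methods; GetVarDeviceID also gains a fallback that retries with the base name when a variable carries framework::kNewGradSuffix. A standalone sketch of that lookup; the suffix is spelled "@NEWGRAD@" here only as an assumption for the demo, the real constant is defined by the framework:

// Standalone sketch of the suffix-fallback lookup in GetVarDeviceID.
#include <iostream>
#include <string>
#include <unordered_map>

static const char kNewGradSuffix[] = "@NEWGRAD@";  // assumed spelling, demo only

static int GetVarDeviceID(
    const std::unordered_map<std::string, int> &sharded_var_device,
    const std::string &varname) {
  auto got = sharded_var_device.find(varname);
  if (got == sharded_var_device.end()) {
    auto pos = varname.find(kNewGradSuffix);
    if (pos != std::string::npos) {
      // retry with the base name, stripped of the gradient suffix
      got = sharded_var_device.find(varname.substr(0, pos));
    }
  }
  return got == sharded_var_device.end() ? -1 : got->second;
}

int main() {
  std::unordered_map<std::string, int> sharded = {{"fc_0.w_0@GRAD", 1}};
  std::cout << GetVarDeviceID(sharded, "fc_0.w_0@GRAD") << "\n";            // 1
  std::cout << GetVarDeviceID(sharded, "fc_0.w_0@GRAD@NEWGRAD@0") << "\n";  // 1
  std::cout << GetVarDeviceID(sharded, "unknown") << "\n";                  // -1
}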
+ +#pragma once + +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace details { + +class ModifyOpLockAndRecordEventPass : public ir::Pass { + protected: + std::unique_ptr ApplyImpl( + std::unique_ptr graph) const override; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/multi_devices_graph_check_pass.cc b/paddle/fluid/framework/details/multi_devices_graph_check_pass.cc index c9c255864a2477ed29873f8521acce37fa928c06..c8ea18804630fea4ada98062256730dbf4c24860 100644 --- a/paddle/fluid/framework/details/multi_devices_graph_check_pass.cc +++ b/paddle/fluid/framework/details/multi_devices_graph_check_pass.cc @@ -15,6 +15,7 @@ #include "paddle/fluid/framework/details/multi_devices_graph_check_pass.h" #include #include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_helper.h" namespace paddle { namespace framework { @@ -36,20 +37,20 @@ bool SSAGraghBuilderWithChecker::IsValidGraph(const ir::Graph *graph) const { for (auto &var_map : graph->Get(kGraphVars)) { for (auto &name_pair : var_map) { for (auto &version_pair : name_pair.second) { - insert_pending_var(version_pair.get()); + insert_pending_var(version_pair); } } } for (auto &var : graph->Get(kGraphDepVars)) { - insert_pending_var(var.get()); + insert_pending_var(var); } - for (auto &op : graph->Get(kGraphOps)) { + for (OpHandleBase *op : ir::FilterByNodeWrapper(*graph)) { if (op->Inputs().empty()) { - ready_ops.insert(op.get()); + ready_ops.insert(op); } else { - pending_ops.insert({op.get(), op.get()->NoDupInputSize()}); + pending_ops.insert({op, op->NoDupInputSize()}); } } @@ -89,6 +90,4 @@ bool SSAGraghBuilderWithChecker::IsValidGraph(const ir::Graph *graph) const { REGISTER_PASS(multi_devices_check_pass, paddle::framework::details::SSAGraghBuilderWithChecker) .RequireGraphAttr(paddle::framework::details::kGraphVars) - .RequireGraphAttr(paddle::framework::details::kGraphDepVars) - .RequireGraphAttr(paddle::framework::details::kGraphOps) - .RequireGraphAttr(paddle::framework::details::kShardedVarDevice); + .RequireGraphAttr(paddle::framework::details::kGraphDepVars); diff --git a/paddle/fluid/framework/details/multi_devices_graph_pass.cc b/paddle/fluid/framework/details/multi_devices_graph_pass.cc index 134fcee826715672a6e021e9bf694bb771ebb830..26666212ae8c4dc5ce9a45b5c51bab1f9ff1a8ab 100644 --- a/paddle/fluid/framework/details/multi_devices_graph_pass.cc +++ b/paddle/fluid/framework/details/multi_devices_graph_pass.cc @@ -21,6 +21,7 @@ #include "paddle/fluid/framework/details/broadcast_op_handle.h" #include "paddle/fluid/framework/details/computation_op_handle.h" #include "paddle/fluid/framework/details/data_balance_op_handle.h" +#include "paddle/fluid/framework/details/fused_broadcast_op_handle.h" #include "paddle/fluid/framework/details/multi_devices_graph_pass.h" #include "paddle/fluid/framework/details/reduce_op_handle.h" #include "paddle/fluid/framework/details/rpc_op_handle.h" @@ -33,7 +34,14 @@ namespace paddle { namespace framework { namespace details { + namespace { +// TODO(panyx0718): Clean this up as well. +// all operators. NOTE that even we use a vector here, the operators is +// unordered. 
+typedef std::vector GraphOps; +const char kGraphOps[] = "ops"; + void PolishGraphToSupportDataHazards(ir::Graph *graph) { for (auto &var_map : graph->Get(kGraphVars)) { for (auto &name_pair : var_map) { @@ -91,7 +99,7 @@ VarHandle *CreateOrGetLatestVarHandle(ir::Graph *graph, ir::Node *node, } var_holder.emplace_back(var); } else { - var = var_holder.rbegin()->get(); + var = *var_holder.rbegin(); } return var; } @@ -134,7 +142,7 @@ void MultiDevSSAGraphBuilder::Init() const { places_ = Get>(kPlaces); local_scopes_ = Get>(kLocalScopes); strategy_ = Get(kStrategy); -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) nccl_ctxs_ = &Get("nccl_ctxs"); #endif @@ -153,7 +161,7 @@ void MultiDevSSAGraphBuilder::CreateOpHandleIOs(ir::Graph *result, ir::Node *node, size_t place_id) const { auto p = places_[place_id]; - auto *op_handle = result->Get(kGraphOps).back().get(); + auto *op_handle = result->Get(kGraphOps).back(); op_handle->SetDeviceContext(p, platform::DeviceContextPool::Instance().Get(p)); @@ -252,9 +260,9 @@ std::vector SortOpsAndDelayOptimizeOp(const ir::Graph &graph) { std::vector sorted_ret; for (size_t i = 0; i < ret.size(); ++i) { if (i < last_backward) { - if (boost::get(ret[i]->Op()->GetAttr( - OpProtoAndCheckerMaker::OpRoleAttrName())) == - static_cast(OpRole::kOptimize)) { + if (static_cast(boost::get(ret[i]->Op()->GetAttr( + OpProtoAndCheckerMaker::OpRoleAttrName())) & + static_cast(OpRole::kOptimize))) { optimize_ops.push_back(ret[i]); } else { sorted_ret.push_back(ret[i]); @@ -302,7 +310,6 @@ std::unique_ptr MultiDevSSAGraphBuilder::ApplyImpl( result.Set(kGraphVars, new GraphVars(places_.size())); result.Set(kGraphDepVars, new GraphDepVars); result.Set(kGraphOps, new GraphOps); - result.Set(kShardedVarDevice, new ShardedVarDevice); // find send/recv vars so that we can place the distributed training // related op in the place 0 @@ -316,11 +323,13 @@ std::unique_ptr MultiDevSSAGraphBuilder::ApplyImpl( bool is_forwarding = true; bool is_dist_train = false; + std::unordered_map sharded_var_device; + for (ir::Node *node : sorted_ops) { if (boost::get( node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleAttrName())) == static_cast(OpRole::kRPC)) { - int op_dev_id = CreateRPCOp(&result, node); + int op_dev_id = CreateRPCOp(&result, node, &sharded_var_device); PADDLE_ENFORCE(op_dev_id != -1, "Can not schedule the RPC operator to the right place."); if (node->Op()->Type() == "recv") { @@ -336,7 +345,7 @@ std::unique_ptr MultiDevSSAGraphBuilder::ApplyImpl( } else if (boost::get(node->Op()->GetAttr( OpProtoAndCheckerMaker::OpRoleAttrName())) == static_cast(OpRole::kDist)) { - int op_dev_id = CreateDistTrainOp(&result, node); + int op_dev_id = CreateDistTrainOp(&result, node, &sharded_var_device); if (node->Op()->Type() == "concat") { auto origin_param_name = node->Op()->OutputArgumentNames()[0]; bcast_var_name_set[op_dev_id].emplace(origin_param_name); @@ -347,7 +356,7 @@ std::unique_ptr MultiDevSSAGraphBuilder::ApplyImpl( BuildStrategy::GradientScaleStrategy::kCustomized) { // TODO(paddle-dev): Why is there no input for this op_handle? auto loss_grad_name = node->Op()->OutputArgumentNames()[0]; - CreateScaleLossGradOp(&result, loss_grad_name); + CreateScaleLossGradOp(&result, loss_grad_name, node->outputs[0]); } // This assumes the backward generating code will ensure IsScaleLossOp // is true only for the op that scale the final scalar loss. @@ -355,12 +364,11 @@ std::unique_ptr MultiDevSSAGraphBuilder::ApplyImpl( // the block. 
is_forwarding = false; } else { - int op_dev_id = GetOpDeviceID(result, node); + int op_dev_id = GetOpDeviceID(result, node, sharded_var_device); if (op_dev_id != -1) { // This op only runs on one specific device. CreateComputationalOp(&result, node, op_dev_id); for (ir::Node *n : node->outputs) { - graph->Get(kShardedVarDevice) - .emplace(n->Name(), op_dev_id); + sharded_var_device.emplace(n->Name(), op_dev_id); } } else { // This op runs on all devices, and its output may have parameter's @@ -391,14 +399,13 @@ std::unique_ptr MultiDevSSAGraphBuilder::ApplyImpl( for (size_t i = 0; i < backward_vars.size(); i += 2) { auto &p_name = backward_vars[i]; auto &g_name = backward_vars[i + 1]; - VLOG(10) << "Bcast " << g_name << " for parameter " << p_name; + VLOG(100) << "Bcast " << g_name << " for parameter " << p_name; switch (strategy_.reduce_) { case BuildStrategy::ReduceStrategy::kReduce: cur_device_id = GetAppropriateDeviceID({g_name}); CreateReduceOp(&result, g_name, cur_device_id); - graph->Get(kShardedVarDevice) - .emplace(g_name, cur_device_id); + sharded_var_device.emplace(g_name, cur_device_id); if (!is_dist_train) { bcast_var_name_set[cur_device_id].emplace(p_name); } @@ -424,7 +431,7 @@ std::unique_ptr MultiDevSSAGraphBuilder::ApplyImpl( } } bool use_gpu = false; -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) use_gpu = nccl_ctxs_ != nullptr; #endif @@ -436,10 +443,14 @@ std::unique_ptr MultiDevSSAGraphBuilder::ApplyImpl( if ((use_gpu && strategy_.reduce_ == BuildStrategy::ReduceStrategy::kReduce) || is_dist_train) { - for (size_t dev_id = 0; dev_id < bcast_var_name_set.size(); ++dev_id) { - auto &to_bcast_set = bcast_var_name_set[dev_id]; - for (auto &bcast_name : to_bcast_set) { - CreateBroadcastOp(&result, bcast_name, dev_id); + if (strategy_.fuse_broadcast_op_) { + CreateFusedBroadcastOp(&result, bcast_var_name_set); + } else { + for (size_t dev_id = 0; dev_id < bcast_var_name_set.size(); ++dev_id) { + auto &to_bcast_set = bcast_var_name_set[dev_id]; + for (auto &bcast_name : to_bcast_set) { + CreateBroadcastOp(&result, bcast_name, dev_id); + } } } } @@ -453,7 +464,7 @@ std::unique_ptr MultiDevSSAGraphBuilder::ApplyImpl( * Only variables should be the leaves of graph. 
*/ AddOutputToLeafOps(&result); - PADDLE_ENFORCE(!ir::HasCircle(result)); + result.Erase(kGraphOps); return graph; } @@ -467,7 +478,7 @@ bool MultiDevSSAGraphBuilder::IsSparseGradient(const std::string &og) const { void MultiDevSSAGraphBuilder::SetCommunicationContext( OpHandleBase *op_handle, const platform::Place &p) const { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) if (nccl_ctxs_ == nullptr) { op_handle->SetDeviceContext(p, platform::DeviceContextPool::Instance().Get(p)); @@ -481,7 +492,7 @@ void MultiDevSSAGraphBuilder::SetCommunicationContext( void MultiDevSSAGraphBuilder::CreateBroadcastOp(ir::Graph *result, const std::string &p_name, size_t src_dev_id) const { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) auto *op_handle = new BroadcastOpHandle( result->CreateEmptyNode("broadcast", ir::Node::Type::kOperation), local_scopes_, places_, nccl_ctxs_); @@ -493,7 +504,7 @@ void MultiDevSSAGraphBuilder::CreateBroadcastOp(ir::Graph *result, result->Get(kGraphOps).emplace_back(op_handle); auto *in = - result->Get(kGraphVars).at(src_dev_id).at(p_name).back().get(); + result->Get(kGraphVars).at(src_dev_id).at(p_name).back(); op_handle->AddInput(in); for (size_t i = 0; i < places_.size(); ++i) { @@ -508,6 +519,44 @@ void MultiDevSSAGraphBuilder::CreateBroadcastOp(ir::Graph *result, } } +void MultiDevSSAGraphBuilder::CreateFusedBroadcastOp( + ir::Graph *result, + const std::vector> &bcast_varnames) const { +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) + auto *op_handle = new FusedBroadcastOpHandle( + result->CreateEmptyNode("fused_broadcast", ir::Node::Type::kOperation), + local_scopes_, places_, nccl_ctxs_); +#else + auto *op_handle = new FusedBroadcastOpHandle( + result->CreateEmptyNode("fused_broadcast", ir::Node::Type::kOperation), + local_scopes_, places_); +#endif + result->Get(kGraphOps).emplace_back(op_handle); + + for (size_t i = 0; i < places_.size(); ++i) { + auto &p = places_[i]; + SetCommunicationContext(op_handle, p); + } + + for (size_t dev_id = 0; dev_id < bcast_varnames.size(); ++dev_id) { + for (auto &p_name : bcast_varnames[dev_id]) { + auto *in = + result->Get(kGraphVars).at(dev_id).at(p_name).back(); + op_handle->AddInput(in); + for (size_t out_dev_id = 0; out_dev_id < places_.size(); ++out_dev_id) { + auto &p = places_[out_dev_id]; + auto &vars = + result->Get(kGraphVars).at(out_dev_id).at(p_name); + auto *out_var = new VarHandle( + result->CreateEmptyNode(p_name, ir::Node::Type::kVariable), + vars.size(), out_dev_id, p_name, p); + vars.emplace_back(out_var); + op_handle->AddOutput(out_var); + } + } + } +} + void MultiDevSSAGraphBuilder::CreateComputationalOp(ir::Graph *result, ir::Node *node, int dev_id) const { @@ -519,7 +568,7 @@ void MultiDevSSAGraphBuilder::CreateComputationalOp(ir::Graph *result, void MultiDevSSAGraphBuilder::InsertAllReduceOp(ir::Graph *result, const std::string &og) const { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) result->Get(kGraphOps).emplace_back(new AllReduceOpHandle( result->CreateEmptyNode("allreduce", ir::Node::Type::kOperation), local_scopes_, places_, nccl_ctxs_)); @@ -528,7 +577,7 @@ void MultiDevSSAGraphBuilder::InsertAllReduceOp(ir::Graph *result, result->CreateEmptyNode("allreduce", ir::Node::Type::kOperation), local_scopes_, places_)); #endif - auto *op_handle = result->Get(kGraphOps).back().get(); + auto *op_handle = result->Get(kGraphOps).back(); for (size_t i = 0; i < places_.size(); ++i) { auto &p = places_[i]; @@ -536,7 +585,7 @@ 
void MultiDevSSAGraphBuilder::InsertAllReduceOp(ir::Graph *result, auto &vars = result->Get(kGraphVars)[i][og]; PADDLE_ENFORCE(!vars.empty()); auto &prev_grad = vars.back(); - op_handle->AddInput(prev_grad.get()); + op_handle->AddInput(prev_grad); auto var = new VarHandle(result->CreateEmptyNode(og, ir::Node::Type::kVariable), @@ -548,7 +597,7 @@ void MultiDevSSAGraphBuilder::InsertAllReduceOp(ir::Graph *result, void MultiDevSSAGraphBuilder::InsertDataBalanceOp( ir::Graph *result, const std::vector &datas) const { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) result->Get(kGraphOps).emplace_back(new DataBalanceOpHandle( result->CreateEmptyNode("data_balance", ir::Node::Type::kOperation), local_scopes_, places_, nccl_ctxs_)); @@ -557,14 +606,14 @@ void MultiDevSSAGraphBuilder::InsertDataBalanceOp( result->CreateEmptyNode("data_balance", ir::Node::Type::kOperation), local_scopes_, places_)); #endif - auto *op_handle = result->Get(kGraphOps).back().get(); + auto *op_handle = result->Get(kGraphOps).back(); for (size_t i = 0; i < places_.size(); ++i) { auto &p = places_[i]; SetCommunicationContext(op_handle, p); for (const std::string &d_name : datas) { auto &vars = result->Get(kGraphVars)[i][d_name]; PADDLE_ENFORCE(!vars.empty()); - op_handle->AddInput(vars.back().get()); + op_handle->AddInput(vars.back()); auto var = new VarHandle( result->CreateEmptyNode(d_name, ir::Node::Type::kVariable), vars.size(), i, d_name, p); @@ -574,8 +623,9 @@ void MultiDevSSAGraphBuilder::InsertDataBalanceOp( } } -int MultiDevSSAGraphBuilder::GetOpDeviceID(const ir::Graph &graph, - ir::Node *node) const { +int MultiDevSSAGraphBuilder::GetOpDeviceID( + const ir::Graph &graph, ir::Node *node, + const std::unordered_map &sharded_var_device) const { if (strategy_.reduce_ != BuildStrategy::ReduceStrategy::kReduce) { return -1; } @@ -588,21 +638,28 @@ int MultiDevSSAGraphBuilder::GetOpDeviceID(const ir::Graph &graph, node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleVarAttrName())); PADDLE_ENFORCE_EQ(param_grad.size(), 2U); - int dev_id = GetVarDeviceID(graph, param_grad[1]); + int dev_id = GetVarDeviceID(graph, param_grad[1], sharded_var_device); PADDLE_ENFORCE_NE(dev_id, -1, "dev_id should not be -1.[%s, %s, %s]", node->Op()->Type(), param_grad[0], param_grad[1]); return dev_id; } -int MultiDevSSAGraphBuilder::GetVarDeviceID(const ir::Graph &graph, - const std::string &varname) const { - auto &sharded_var_device = graph.Get(kShardedVarDevice); +int MultiDevSSAGraphBuilder::GetVarDeviceID( + const ir::Graph &graph, const std::string &varname, + const std::unordered_map &sharded_var_device) const { auto got = sharded_var_device.find(varname); + if (got == sharded_var_device.end()) { + auto pos = varname.find(framework::kNewGradSuffix); + if (pos != std::string::npos) { + got = sharded_var_device.find(varname.substr(0, pos)); + } + } return got == sharded_var_device.end() ? 
-1 : got->second; } void MultiDevSSAGraphBuilder::CreateScaleLossGradOp( - ir::Graph *result, const std::string &loss_grad_name) const { + ir::Graph *result, const std::string &loss_grad_name, + ir::Node *out_var_node) const { for (size_t i = 0; i < places_.size(); ++i) { // Insert ScaleCost OpHandle auto *dev_ctx = platform::DeviceContextPool::Instance().Get(places_[i]); @@ -617,10 +674,8 @@ void MultiDevSSAGraphBuilder::CreateScaleLossGradOp( // loss->pending_ops_.emplace_back(op_handle); // op_handle->inputs_.emplace_back(loss); - CreateOpOutput( - result, op_handle, - result->CreateEmptyNode(loss_grad_name, ir::Node::Type::kVariable), - places_[i], i); + CreateOpOutput(result, op_handle, + result->CreateVarNode(out_var_node->Var()), places_[i], i); } } @@ -639,7 +694,7 @@ void MultiDevSSAGraphBuilder::CreateComputationalOps(ir::Graph *result, VarHandle *MultiDevSSAGraphBuilder::CreateReduceOp(ir::Graph *result, const std::string &og, int dst_dev_id) const { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) result->Get(kGraphOps).emplace_back(new ReduceOpHandle( result->CreateEmptyNode("reduce", ir::Node::Type::kOperation), local_scopes_, places_, nccl_ctxs_)); @@ -648,7 +703,7 @@ VarHandle *MultiDevSSAGraphBuilder::CreateReduceOp(ir::Graph *result, result->CreateEmptyNode("reduce", ir::Node::Type::kOperation), local_scopes_, places_)); #endif - auto *op_handle = result->Get(kGraphOps).back().get(); + auto *op_handle = result->Get(kGraphOps).back(); for (size_t i = 0; i < places_.size(); ++i) { auto &p = places_[i]; @@ -656,7 +711,7 @@ VarHandle *MultiDevSSAGraphBuilder::CreateReduceOp(ir::Graph *result, auto &vars = result->Get(kGraphVars)[i][og]; PADDLE_ENFORCE(!vars.empty()); auto &prev_grad = vars.back(); - op_handle->AddInput(prev_grad.get()); + op_handle->AddInput(prev_grad); } auto &vars = result->Get(kGraphVars)[dst_dev_id][og]; auto var = @@ -667,8 +722,9 @@ VarHandle *MultiDevSSAGraphBuilder::CreateReduceOp(ir::Graph *result, return var; } -int MultiDevSSAGraphBuilder::CreateDistTrainOp(ir::Graph *result, - ir::Node *node) const { +int MultiDevSSAGraphBuilder::CreateDistTrainOp( + ir::Graph *result, ir::Node *node, + std::unordered_map *sharded_var_device) const { int op_dev_id = -1; std::vector input_var_names; std::vector output_var_names; @@ -680,25 +736,25 @@ int MultiDevSSAGraphBuilder::CreateDistTrainOp(ir::Graph *result, } if (node->Op()->Type() == "split_byref" || - node->Op()->Type() == "split_selected_rows") { + node->Op()->Type() == "split_selected_rows" || + node->Op()->Type() == "split_ids") { // TODO(paddle-dev): getting the first var is not safe. 
- op_dev_id = GetVarDeviceID(*result, input_var_names[0]); + op_dev_id = + GetVarDeviceID(*result, input_var_names[0], *sharded_var_device); if (strategy_.reduce_ == BuildStrategy::ReduceStrategy::kAllReduce) { op_dev_id = GetAppropriateDeviceID(input_var_names); for (auto &varname : input_var_names) { - result->Get(kShardedVarDevice) - .emplace(varname, op_dev_id); + sharded_var_device->emplace(varname, op_dev_id); } } for (auto &varname : output_var_names) { - result->Get(kShardedVarDevice) - .emplace(varname, op_dev_id); + sharded_var_device->emplace(varname, op_dev_id); } } else if (node->Op()->Type() == "concat") { - op_dev_id = GetVarDeviceID(*result, input_var_names[0]); + op_dev_id = + GetVarDeviceID(*result, input_var_names[0], *sharded_var_device); for (auto &varname : output_var_names) { - result->Get(kShardedVarDevice) - .emplace(varname, op_dev_id); + sharded_var_device->emplace(varname, op_dev_id); } } else { LOG(ERROR) << "got unexpected dist op: " << node->Op()->Type(); @@ -716,14 +772,14 @@ int MultiDevSSAGraphBuilder::CreateDistTrainOp(ir::Graph *result, } void SetOpInputsAllPlaces(ir::Graph *result, ir::Node *node, int num_places) { - auto *op_handle = result->Get(kGraphOps).back().get(); + auto *op_handle = result->Get(kGraphOps).back(); for (ir::Node *input : node->inputs) { VarHandle *var = nullptr; for (int place_offset = 0; place_offset < num_places; ++place_offset) { auto &var_holders = result->Get(kGraphVars)[place_offset]; auto &var_holder = var_holders[input->Name()]; if (!var_holder.empty()) { - var = var_holder.rbegin()->get(); + var = *var_holder.rbegin(); op_handle->AddInput(var); } } @@ -731,12 +787,14 @@ void SetOpInputsAllPlaces(ir::Graph *result, ir::Node *node, int num_places) { } // Create RPC related op handles that connects its in ops and out ops. -int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, - ir::Node *node) const { +int MultiDevSSAGraphBuilder::CreateRPCOp( + ir::Graph *result, ir::Node *node, + std::unordered_map *sharded_var_device) const { int op_dev_id = -1; if (node->Op()->Type() == "send") { // TODO(paddle-dev): getting the first var is not safe. 
- op_dev_id = GetVarDeviceID(*result, node->inputs[0]->Name()); + op_dev_id = + GetVarDeviceID(*result, node->inputs[0]->Name(), *sharded_var_device); PADDLE_ENFORCE(!ir::IsControlDepVar(*node->inputs[0]), "This hack no longer holds, please fix."); // the variable name which contains .block means it was splited by @@ -751,14 +809,12 @@ int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleVarAttrName())); PADDLE_ENFORCE_EQ(send_param_grad.size(), 2U); op_dev_id = GetAppropriateDeviceID({send_param_grad[1]}); - VLOG(10) << "send grad " << input_var_names[0] << " origin " - << send_param_grad[1] << " place: " << op_dev_id; + VLOG(100) << "send grad " << input_var_names[0] << " origin " + << send_param_grad[1] << " place: " << op_dev_id; for (auto &varname : input_var_names) { - result->Get(kShardedVarDevice) - .emplace(varname, op_dev_id); + sharded_var_device->emplace(varname, op_dev_id); } - result->Get(kShardedVarDevice) - .emplace(send_param_grad[1], op_dev_id); + sharded_var_device->emplace(send_param_grad[1], op_dev_id); } } else if (node->Op()->Type() == "recv") { std::vector output_var_names; @@ -768,16 +824,16 @@ int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, auto recv_param_grad = boost::get>( node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleVarAttrName())); if (recv_param_grad.size() == 2U) { - op_dev_id = GetVarDeviceID(*result, recv_param_grad[1]); - VLOG(10) << "recv param " << recv_param_grad[0] - << " get grad place: " << recv_param_grad[1] - << " place: " << op_dev_id; + op_dev_id = + GetVarDeviceID(*result, recv_param_grad[1], *sharded_var_device); + VLOG(100) << "recv param " << recv_param_grad[0] + << " get grad place: " << recv_param_grad[1] + << " place: " << op_dev_id; } else { op_dev_id = GetAppropriateDeviceID(output_var_names); } for (auto &varname : output_var_names) { - result->Get(kShardedVarDevice) - .emplace(varname, op_dev_id); + sharded_var_device->emplace(varname, op_dev_id); } } else { // send_barrier, fetch_barrier will run on place 0; @@ -796,7 +852,7 @@ int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, // send_barrier, recv, fetch_barrier's inputs are deps var, get them from // all places auto p = places_[op_dev_id]; - auto *op_handle = result->Get(kGraphOps).back().get(); + auto *op_handle = result->Get(kGraphOps).back(); op_handle->SetDeviceContext(p, platform::DeviceContextPool::Instance().Get(p)); @@ -804,7 +860,8 @@ int MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result, for (ir::Node *output : node->outputs) { int outvar_dev_id = op_dev_id; if (node->Op()->Type() == "fetch_barrier") { - outvar_dev_id = GetVarDeviceID(*result, output->Name()); + outvar_dev_id = + GetVarDeviceID(*result, output->Name(), *sharded_var_device); PADDLE_ENFORCE_NE(outvar_dev_id, -1); } p = places_[outvar_dev_id]; diff --git a/paddle/fluid/framework/details/multi_devices_graph_pass.h b/paddle/fluid/framework/details/multi_devices_graph_pass.h index cdf9f13cde608b546d17a1e53e0f6acea9e12566..8e462aec7dc7ce45cad592b89de0b6edde8c9146 100644 --- a/paddle/fluid/framework/details/multi_devices_graph_pass.h +++ b/paddle/fluid/framework/details/multi_devices_graph_pass.h @@ -40,16 +40,22 @@ class MultiDevSSAGraphBuilder : public ir::Pass { size_t device_id) const; void Init() const; -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) mutable platform::NCCLContextMap *nccl_ctxs_; #endif - int GetVarDeviceID(const ir::Graph &graph, const std::string &varname) const; + int 
GetVarDeviceID( + const ir::Graph &graph, const std::string &varname, + const std::unordered_map &sharded_var_device) const; bool IsScaleLossOp(ir::Node *node) const; - int CreateRPCOp(ir::Graph *result, ir::Node *node) const; - int CreateDistTrainOp(ir::Graph *result, ir::Node *node) const; + int CreateRPCOp( + ir::Graph *result, ir::Node *node, + std::unordered_map *sharded_var_device) const; + int CreateDistTrainOp( + ir::Graph *result, ir::Node *node, + std::unordered_map *sharded_var_device) const; std::vector FindDistTrainSendVars( const std::vector &nodes) const; @@ -61,14 +67,17 @@ class MultiDevSSAGraphBuilder : public ir::Pass { size_t num_places) const; void CreateScaleLossGradOp(ir::Graph *result, - const std::string &loss_grad_name) const; + const std::string &loss_grad_name, + ir::Node *out_var_node) const; VarHandle *CreateReduceOp(ir::Graph *result, const std::string &og, int dst_dev_id) const; void CreateComputationalOp(ir::Graph *result, ir::Node *node, int dev_id) const; - int GetOpDeviceID(const ir::Graph &graph, ir::Node *node) const; + int GetOpDeviceID( + const ir::Graph &graph, ir::Node *node, + const std::unordered_map &sharded_var_device) const; void InsertAllReduceOp(ir::Graph *result, const std::string &og) const; @@ -78,6 +87,10 @@ class MultiDevSSAGraphBuilder : public ir::Pass { void CreateBroadcastOp(ir::Graph *result, const std::string &p_name, size_t src_dev_id) const; + void CreateFusedBroadcastOp( + ir::Graph *result, + const std::vector> &bcast_varnames) const; + bool IsSparseGradient(const std::string &og) const; size_t GetAppropriateDeviceID( diff --git a/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc b/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc index 361c91dc78c08a2cbf84ee88211d389c1e2312e5..8f92f0948d7d397ab0f20c01eae9e313f739adec 100644 --- a/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc +++ b/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc @@ -15,6 +15,7 @@ #include "paddle/fluid/framework/details/multi_devices_graph_print_pass.h" #include #include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_helper.h" namespace paddle { namespace framework { @@ -62,7 +63,7 @@ void GraphvizSSAGraphPrinter::Print(const ir::Graph &graph, }); size_t op_id = 0; - for (auto &op : graph.Get(kGraphOps)) { + for (auto &op : ir::FilterByNodeWrapper(graph)) { std::string op_name = "op_" + std::to_string(op_id++); sout << op_name << " [label=\"" << op->Name() << "\", shape=rect]" << std::endl; diff --git a/paddle/fluid/framework/details/multi_devices_helper.h b/paddle/fluid/framework/details/multi_devices_helper.h index 175c5a9950be69d7bf6ae9e386af762007a18a51..1a2b75fbc0c28984ce5cf00e0a2ce0f804349bb1 100644 --- a/paddle/fluid/framework/details/multi_devices_helper.h +++ b/paddle/fluid/framework/details/multi_devices_helper.h @@ -35,23 +35,14 @@ namespace details { // The outside vector is the device vector. Each element of this vector is a // map from variable name to variables. The variables, who have the same name, // will have a differsent version. The offset in the -// `std::vector>` is the version of varaibles. -typedef std::vector< - std::unordered_map>>> +// `std::vector` is the version of varaibles. +typedef std::vector>> GraphVars; const char kGraphVars[] = "vars"; // aux variables to represent dependency. Useful to resolve data hazard. 
-typedef std::unordered_set> GraphDepVars; +typedef std::unordered_set GraphDepVars; const char kGraphDepVars[] = "dep_vars"; - -// all operators. NOTE that even we use a vector here, the operators is -// unordered. -typedef std::vector> GraphOps; -const char kGraphOps[] = "ops"; - -typedef std::unordered_map ShardedVarDevice; -const char kShardedVarDevice[] = "sharded_var_device"; } // namespace details } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/details/op_graph_view.cc b/paddle/fluid/framework/details/op_graph_view.cc new file mode 100644 index 0000000000000000000000000000000000000000..4838c4198ff35ba3fb562f3a7c0563ee60179e3b --- /dev/null +++ b/paddle/fluid/framework/details/op_graph_view.cc @@ -0,0 +1,66 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/op_graph_view.h" +#include +#include + +namespace paddle { +namespace framework { +namespace details { + +OpGraphView::OpGraphView(const std::vector &ops) { Build(ops); } + +void OpGraphView::Build(const std::vector &ops) { + for (auto &op : ops) { + preceding_ops_[op]; + pending_ops_[op]; + for (auto &var : op->Outputs()) { + for (auto &pending_op : var->PendingOps()) { + preceding_ops_[pending_op].insert(op); + pending_ops_[op].insert(pending_op); + } + } + } + PADDLE_ENFORCE( + preceding_ops_.size() == ops.size() && pending_ops_.size() == ops.size(), + "There are duplicate ops in graph."); +} + +std::unordered_set OpGraphView::AllOps() const { + std::unordered_set ret; + for (auto &pair : preceding_ops_) { + ret.insert(pair.first); + } + return ret; +} + +bool OpGraphView::HasOp(OpHandleBase *op) const { + return preceding_ops_.count(op) != 0; +} + +void OpGraphView::EnforceHasOp(OpHandleBase *op) const { + PADDLE_ENFORCE(HasOp(op), "Cannot find op %s in OpGraphView", + op == nullptr ? "nullptr" : op->DebugString()); +} + +const std::unordered_set &OpGraphView::PendingOps( + OpHandleBase *op) const { + EnforceHasOp(op); + return pending_ops_.at(op); +} + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/op_graph_view.h b/paddle/fluid/framework/details/op_graph_view.h new file mode 100644 index 0000000000000000000000000000000000000000..afb3e8e59461eeba10d7027fc70b89cc170c1805 --- /dev/null +++ b/paddle/fluid/framework/details/op_graph_view.h @@ -0,0 +1,49 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include "paddle/fluid/framework/details/op_handle_base.h" + +namespace paddle { +namespace framework { +namespace details { + +class OpGraphView { + public: + explicit OpGraphView(const std::vector &ops); + + std::unordered_set AllOps() const; + + const std::unordered_set &PendingOps(OpHandleBase *op) const; + + bool HasOp(OpHandleBase *op) const; + + private: + void Build(const std::vector &ops); + void EnforceHasOp(OpHandleBase *op) const; + + std::unordered_map> + preceding_ops_; + std::unordered_map> + pending_ops_; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/op_handle_base.cc b/paddle/fluid/framework/details/op_handle_base.cc index 3812f0abf1b7069525c4420054c61c01c908acfe..4822627ac3b65972f41d9a23d9fe3dba3de3f97d 100644 --- a/paddle/fluid/framework/details/op_handle_base.cc +++ b/paddle/fluid/framework/details/op_handle_base.cc @@ -103,7 +103,7 @@ void OpHandleBase::WaitInputVarGenerated() { void OpHandleBase::WaitInputVarGenerated(const platform::Place &place) { for (auto *in : inputs_) { if (NeedWait(in)) { - in->GeneratedOp()->RecordWaitEventOnCtx(dev_ctxes_[place]); + in->GeneratedOp()->RecordWaitEventOnCtx(dev_ctxes_.at(place)); } } } diff --git a/paddle/fluid/framework/details/op_handle_base.h b/paddle/fluid/framework/details/op_handle_base.h index d09b94a3fd32952985a37cf4246c7640d2db4f56..ba12ca3c61c05b3e856fffa8353d4ec5bf79bc39 100644 --- a/paddle/fluid/framework/details/op_handle_base.h +++ b/paddle/fluid/framework/details/op_handle_base.h @@ -31,7 +31,10 @@ constexpr char kLocalExecScopeName[] = "@LCOAL_SCOPE@"; // It's responsible for populating necessary fields of ir::Node. class OpHandleBase { public: - explicit OpHandleBase(ir::Node *node) : node_(node) {} + // Owned by `node`. No need to be deleted explicitly. + explicit OpHandleBase(ir::Node *node) : node_(node) { + node_->WrappedBy(this); + } virtual ~OpHandleBase(); diff --git a/paddle/fluid/framework/details/reduce_op_handle.cc b/paddle/fluid/framework/details/reduce_op_handle.cc index 7fc06f234d42a992328c0b6164f17945d8075c28..c9f1107aeab5a21d46e828308cfcb2dde827cba6 100644 --- a/paddle/fluid/framework/details/reduce_op_handle.cc +++ b/paddle/fluid/framework/details/reduce_op_handle.cc @@ -27,7 +27,7 @@ namespace framework { namespace details { void ReduceOpHandle::RunImpl() { - platform::RecordEvent record_event(Name(), dev_ctxes_.begin()->second); + platform::RecordEvent record_event(Name(), dev_ctxes_.cbegin()->second); if (places_.size() == 1) return; // the input and output may have dummy var. 
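OpGraphView, added in full above, is a read-only index over the op graph: for each op it records which ops precede it and which pend on its outputs, and the enforce in Build() doubles as a duplicate-op check, since a repeated pointer would collapse two map entries and shrink the maps below ops.size(). A standalone toy of the same construction:

// Toy version of OpGraphView::Build: index preceding/pending ops from the
// output edges, then verify the maps cover every op exactly once.
#include <cassert>
#include <iostream>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct ToyOp {
  std::vector<ToyOp *> consumers;  // ops pending on this op's outputs
};

int main() {
  ToyOp a, b, c;
  a.consumers = {&b, &c};
  b.consumers = {&c};
  std::vector<ToyOp *> ops = {&a, &b, &c};

  std::unordered_map<ToyOp *, std::unordered_set<ToyOp *>> preceding, pending;
  for (auto *op : ops) {
    preceding[op];  // ensure every op has an entry, as OpGraphView does
    pending[op];
    for (auto *next : op->consumers) {
      preceding[next].insert(op);
      pending[op].insert(next);
    }
  }
  // Mirrors the PADDLE_ENFORCE in Build(): duplicates would shrink the maps.
  assert(preceding.size() == ops.size() && pending.size() == ops.size());
  std::cout << "c has " << preceding[&c].size() << " preceding ops\n";  // 2
}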
@@ -125,7 +125,7 @@ void ReduceOpHandle::RunImpl() { } }); } else if (paddle::platform::is_gpu_place(lod_tensors[0]->place())) { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) auto pre_in = pre_in_var->Get(); VariableVisitor::ShareDimsAndLoD(*pre_in_var, out_var); VariableVisitor::GetMutableTensor(out_var).mutable_data( diff --git a/paddle/fluid/framework/details/reduce_op_handle.h b/paddle/fluid/framework/details/reduce_op_handle.h index a6289b055f97b7b0e57928358d84117b33cf2df8..846839029ca65be1bdeac2f6ea497db07a01b6cf 100644 --- a/paddle/fluid/framework/details/reduce_op_handle.h +++ b/paddle/fluid/framework/details/reduce_op_handle.h @@ -23,7 +23,7 @@ #include "paddle/fluid/framework/scope.h" #include "paddle/fluid/framework/selected_rows.h" #include "paddle/fluid/platform/device_context.h" -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) #include "paddle/fluid/platform/nccl_helper.h" #endif @@ -35,7 +35,7 @@ struct ReduceOpHandle : public OpHandleBase { std::vector local_scopes_; std::vector places_; -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) const platform::NCCLContextMap *nccl_ctxs_; ReduceOpHandle(ir::Node *node, const std::vector &local_scopes, const std::vector &places, @@ -46,7 +46,8 @@ struct ReduceOpHandle : public OpHandleBase { nccl_ctxs_(nccl_ctxs) { if (nccl_ctxs_) { for (auto &p_ctx : nccl_ctxs_->contexts_) { - dev_ctxes_[platform::CUDAPlace(p_ctx.first)] = p_ctx.second.ctx_.get(); + this->SetDeviceContext(platform::CUDAPlace(p_ctx.first), + p_ctx.second.ctx_.get()); } } } diff --git a/paddle/fluid/framework/details/reduce_op_handle_test.cc b/paddle/fluid/framework/details/reduce_op_handle_test.cc index 3a9a58412391b188c5e804b41fa47b3607a36bd1..6cee4770e64354cf8a719b0e11b1816b345dd8bd 100644 --- a/paddle/fluid/framework/details/reduce_op_handle_test.cc +++ b/paddle/fluid/framework/details/reduce_op_handle_test.cc @@ -30,12 +30,12 @@ struct TestReduceOpHandle { Scope g_scope_; std::vector local_scopes_; std::vector param_scopes_; - std::unique_ptr op_handle_; - std::vector> vars_; + OpHandleBase *op_handle_; + std::vector vars_; std::vector gpu_list_; std::vector> ctxs_; -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) std::unique_ptr nccl_ctxs_; #endif @@ -43,7 +43,7 @@ struct TestReduceOpHandle { for (size_t j = 0; j < ctxs_.size(); ++j) { ctxs_[j]->Wait(); } -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) if (nccl_ctxs_) { nccl_ctxs_->WaitAll(); } @@ -53,7 +53,7 @@ struct TestReduceOpHandle { void InitCtxOnGpu(bool use_gpu) { use_gpu_ = use_gpu; if (use_gpu) { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) int count = p::GetCUDADeviceCount(); if (count <= 1) { LOG(WARNING) << "Cannot test multi-gpu Broadcast, because the CUDA " @@ -77,7 +77,7 @@ struct TestReduceOpHandle { gpu_list_.push_back(p); ctxs_.emplace_back(new p::CPUDeviceContext(p)); } -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) nccl_ctxs_.reset(nullptr); #endif } @@ -99,14 +99,14 @@ struct TestReduceOpHandle { nodes.emplace_back(new ir::Node("node")); if (use_gpu_) { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) op_handle_.reset(new ReduceOpHandle(nodes.back().get(), local_scopes_, gpu_list_, nccl_ctxs_.get())); #else PADDLE_THROW("CUDA is not support."); #endif } else { -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) && !defined(_WIN32) op_handle_.reset(new 
ReduceOpHandle(nodes.back().get(), local_scopes_, gpu_list_, nccl_ctxs_.get())); #else diff --git a/paddle/fluid/framework/details/reference_count_op_handle.h b/paddle/fluid/framework/details/reference_count_op_handle.h index fc479a4c4a1e7d5c824d3c202e0cccf743dd52c9..cc4ccfbdfc720284e683a8f3f59a4aa57a3a9eb1 100644 --- a/paddle/fluid/framework/details/reference_count_op_handle.h +++ b/paddle/fluid/framework/details/reference_count_op_handle.h @@ -51,7 +51,7 @@ class ReferenceCountOpHandle : public OpHandleBase { dev_ctx_ = static_cast( platform::DeviceContextPool::Instance().Get(place)); if (IsStreamGarabageCollector()) { - PADDLE_ENFORCE(cudaSetDevice(place.device)); + platform::SetDeviceId(place.device); PADDLE_ENFORCE(cudaEventCreateWithFlags(&event_, cudaEventDisableTiming)); } @@ -61,7 +61,7 @@ class ReferenceCountOpHandle : public OpHandleBase { ~ReferenceCountOpHandle() { if (IsStreamGarabageCollector()) { auto gpu_place = boost::get(dev_ctx_->GetPlace()); - PADDLE_ENFORCE(cudaSetDevice(gpu_place.device)); + platform::SetDeviceId(gpu_place.device); PADDLE_ENFORCE(cudaEventDestroy(event_)); } } diff --git a/paddle/fluid/framework/details/reference_count_pass.cc b/paddle/fluid/framework/details/reference_count_pass.cc index 2d1f688d64ece3322e253b0c070264b9eb73d678..28443cc886e4c3f5db707d6d8fe9971618d8c2f7 100644 --- a/paddle/fluid/framework/details/reference_count_pass.cc +++ b/paddle/fluid/framework/details/reference_count_pass.cc @@ -19,6 +19,7 @@ #include "paddle/fluid/framework/details/computation_op_handle.h" #include "paddle/fluid/framework/details/multi_devices_helper.h" #include "paddle/fluid/framework/details/reference_count_pass.h" +#include "paddle/fluid/framework/ir/graph_helper.h" namespace paddle { namespace framework { @@ -43,6 +44,23 @@ static ComputationOpHandle *FindNextComputationOpHandle(VarHandle *var_in) { return nullptr; } +static void AddDependencyBetween(OpHandleBase *in, OpHandleBase *out, + ir::Graph *graph) { + auto it = std::find_if( + in->Outputs().begin(), in->Outputs().end(), [](VarHandleBase *var) { + return dynamic_cast(var) != nullptr; + }); + + if (it != in->Outputs().end()) { + out->AddInput(*it); + } else { + auto *dep_var = new DummyVarHandle(graph->CreateControlDepVar()); + graph->Get(kGraphDepVars).emplace(dep_var); + in->AddOutput(dep_var); + out->AddInput(dep_var); + } +} + std::unique_ptr ReferenceCountPass::ApplyImpl( std::unique_ptr graph) const { auto &ref_cnts = Get(kGlobalReferenceCount); @@ -54,14 +72,13 @@ std::unique_ptr ReferenceCountPass::ApplyImpl( // Step 2: Find all variables in non-computation ops which refers to variables // in computation ops std::unordered_set names; - std::unordered_map> + std::unordered_map compute_ref_cnt_map; auto get_ref_cnts_from_compute_op = [&]( - const std::unique_ptr &op, - const std::vector &vars) { + OpHandleBase *op, const std::vector &vars) { std::vector var_names_in_op; - auto *compute_op = dynamic_cast(op.get()); + auto *compute_op = dynamic_cast(op); if (compute_op == nullptr || !platform::is_gpu_place(compute_op->GetPlace())) return var_names_in_op; @@ -104,9 +121,8 @@ std::unique_ptr ReferenceCountPass::ApplyImpl( }; auto update_ref_cnts_from_non_compute_op = [&]( - const std::unique_ptr &op, - const std::vector &vars) { - if (dynamic_cast(op.get()) != nullptr) return; + OpHandleBase *op, const std::vector &vars) { + if (dynamic_cast(op) != nullptr) return; for (VarHandleBase *var_handle_base : vars) { auto *var_handle = dynamic_cast(var_handle_base); if (var_handle == nullptr || 
!var_handle->Node()->IsVar()) continue; @@ -124,8 +140,8 @@ std::unique_ptr ReferenceCountPass::ApplyImpl( if (next_compute_op != nullptr) { if (compute_ref_cnt_map.count(next_compute_op)) { compute_ref_cnt_map[next_compute_op]->AddVar(var_name); - VLOG(5) << "Add reference count of " << var_name << " to Operator " - << next_compute_op->Name(); + VLOG(50) << "Add reference count of " << var_name << " to Operator " + << next_compute_op->Name(); } else { // Create new reference_count_op_handle ir::Node *ref_cnt_node = graph->CreateEmptyNode( @@ -133,40 +149,30 @@ std::unique_ptr ReferenceCountPass::ApplyImpl( auto *ref_cnt_handle = new ReferenceCountOpHandle( ref_cnt_node, next_compute_op->GetScope(), place, {var_name}, gcs[place.device].get(), cur_ref_cnts[place.device].get()); - if (next_compute_op->Outputs().empty()) { - auto *dep_var = new DummyVarHandle(graph->CreateControlDepVar()); - next_compute_op->AddOutput(dep_var); - graph->Get(kGraphDepVars).emplace(dep_var); - } - ref_cnt_handle->AddInput(next_compute_op->Outputs().front()); - compute_ref_cnt_map[next_compute_op].reset(ref_cnt_handle); + AddDependencyBetween(next_compute_op, ref_cnt_handle, graph.get()); + compute_ref_cnt_map[next_compute_op] = ref_cnt_handle; } } } } }; - auto &all_ops = graph->Get(kGraphOps); + auto all_ops = ir::FilterByNodeWrapper(*graph); for (auto &op : all_ops) { auto in_var_names = get_ref_cnts_from_compute_op(op, op->Inputs()); auto out_var_names = get_ref_cnts_from_compute_op(op, op->Outputs()); if (in_var_names.empty() && out_var_names.empty()) continue; in_var_names.insert(in_var_names.end(), out_var_names.begin(), out_var_names.end()); - auto *compute_op = dynamic_cast(op.get()); + auto *compute_op = dynamic_cast(op); auto place = boost::get(compute_op->GetPlace()); ir::Node *ref_cnt_node = graph->CreateEmptyNode("reference_count", ir::Node::Type::kOperation); auto *ref_cnt_handle = new ReferenceCountOpHandle( ref_cnt_node, compute_op->GetScope(), place, in_var_names, gcs[place.device].get(), cur_ref_cnts[place.device].get()); - if (compute_op->Outputs().empty()) { - auto *dep_var = new DummyVarHandle(graph->CreateControlDepVar()); - compute_op->AddOutput(dep_var); - graph->Get(kGraphDepVars).emplace(dep_var); - } - ref_cnt_handle->AddInput(compute_op->Outputs().front()); - compute_ref_cnt_map[compute_op].reset(ref_cnt_handle); + AddDependencyBetween(compute_op, ref_cnt_handle, graph.get()); + compute_ref_cnt_map[compute_op] = ref_cnt_handle; } for (auto &op : all_ops) { @@ -174,11 +180,11 @@ std::unique_ptr ReferenceCountPass::ApplyImpl( update_ref_cnts_from_non_compute_op(op, op->Outputs()); } - std::vector> new_all_ops; + std::vector new_all_ops; new_all_ops.reserve(compute_ref_cnt_map.size() + all_ops.size()); for (auto &op : all_ops) { new_all_ops.emplace_back(std::move(op)); - auto it = compute_ref_cnt_map.find(new_all_ops.back().get()); + auto it = compute_ref_cnt_map.find(new_all_ops.back()); if (it != compute_ref_cnt_map.end()) { // Add LeafNode to ReferenceCountOpHandle auto *dummy_leaf = new DummyVarHandle(graph->CreateControlDepVar()); diff --git a/paddle/fluid/framework/details/rpc_op_handle.cc b/paddle/fluid/framework/details/rpc_op_handle.cc index f44b374edb29228dff5a8bf003d945291f166d49..dfa6c1ade1a024bb9087144d0e96fa5b0417f06a 100644 --- a/paddle/fluid/framework/details/rpc_op_handle.cc +++ b/paddle/fluid/framework/details/rpc_op_handle.cc @@ -29,22 +29,19 @@ RPCOpHandle::RPCOpHandle(ir::Node *node, const framework::OpDesc &op_desc, place_(place) {} void RPCOpHandle::RunImpl() { - 
// TODO(wuyi): need further analysis whether wait VarDummyHandle. - // Wait input done for (auto *in : inputs_) { auto &p = static_cast(in)->place_; - // FIXME(Yancey1989): need a better solution instead of use DebugString() - if (ir::IsControlDepVar(*in->Node())) { // HACK + if (ir::IsControlDepVar(*in->Node())) { continue; } if (in->GeneratedOp()) { - in->GeneratedOp()->RecordWaitEventOnCtx(dev_ctxes_[p]); + in->GeneratedOp()->RecordWaitEventOnCtx(dev_ctxes_.at(p)); } } - auto &tmp_scope = local_scope_->FindVar(kLocalExecScopeName)->Get(); - // FIXME(wuyi): can not use RunAndRecordEvent here, for it will cause dead - // lock. - op_->Run(*tmp_scope, place_); + this->RunAndRecordEvent([this] { + op_->Run(*local_scope_->FindVar(kLocalExecScopeName)->Get(), + place_); + }); } std::string RPCOpHandle::Name() const { return name_; } diff --git a/paddle/fluid/framework/details/scale_loss_grad_op_handle.cc b/paddle/fluid/framework/details/scale_loss_grad_op_handle.cc index ba243979b34aa1f683de707525403becaf0a1c00..6ab6cb2332b0af3fa16b986f115513ee098fae4f 100644 --- a/paddle/fluid/framework/details/scale_loss_grad_op_handle.cc +++ b/paddle/fluid/framework/details/scale_loss_grad_op_handle.cc @@ -27,7 +27,7 @@ ScaleLossGradOpHandle::ScaleLossGradOpHandle(ir::Node *node, size_t num_dev, coeff_(static_cast(1.0 / num_dev)), scope_(scope), place_(place) { - dev_ctxes_[place_] = dev_ctx; + this->SetDeviceContext(place_, dev_ctx); } ScaleLossGradOpHandle::~ScaleLossGradOpHandle() {} @@ -46,12 +46,12 @@ void ScaleLossGradOpHandle::RunImpl() { } else { #ifdef PADDLE_WITH_CUDA this->RunAndRecordEvent([&] { - auto stream = - static_cast(this->dev_ctxes_[place_]) - ->stream(); + auto stream = static_cast( + this->dev_ctxes_.at(place_)) + ->stream(); memory::Copy(boost::get(place_), tmp, platform::CPUPlace(), &coeff_, sizeof(float), stream); - VLOG(10) << place_ << "RUN Scale loss grad op"; + VLOG(100) << place_ << "RUN Scale loss grad op"; }); #endif } diff --git a/paddle/fluid/framework/details/sequential_execution_pass.cc b/paddle/fluid/framework/details/sequential_execution_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..f78a47bb78e6f1d81db6abed11a7762f21dd2226 --- /dev/null +++ b/paddle/fluid/framework/details/sequential_execution_pass.cc @@ -0,0 +1,109 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/details/sequential_execution_pass.h" +#include +#include +#include +#include +#include "paddle/fluid/framework/op_proto_maker.h" + +namespace paddle { +namespace framework { +namespace details { + +static bool IsSameOpDesc(OpDesc *op1, OpDesc *op2) { + return op1->Type() == op2->Type() && op1->Inputs() == op2->Inputs() && + op1->Outputs() == op2->Outputs(); +} + +std::unique_ptr SequentialExecutionPass::ApplyImpl( + std::unique_ptr graph) const { + // FIXME(zjl): Insert dependencies between some distributed ops may cause + // the multi_devices_graph_pass fails. 
So we skip these ops here. + // Indeed, maybe we should not insert dependencies between these ops + // casually, which may cause deadlock easily. + // We should add more skipped distributed ops when found errors in + // multi_devices_graph_pass + static std::unordered_set skip_dist_ops{ + "send", "recv", "send_barrier", "fetch_barrier"}; + + auto &ops = Get>(kAllOpDescs); + std::vector op_node_list; + op_node_list.reserve(ops.size()); + + std::unordered_map op_deps; + std::unordered_map> pending_ops; + std::unordered_set ready_ops; + + for (ir::Node *node : graph->Nodes()) { + if (!node->IsOp()) continue; + std::unordered_set preceding_ops; + for (auto *in : node->inputs) { + PADDLE_ENFORCE(in->IsVar(), + "Preceding Node of Op Nodes must be Var Node"); + if (in->inputs.empty()) continue; + PADDLE_ENFORCE(in->inputs.size() == 1 && in->inputs[0]->IsOp(), + "Preceding Op Node of Var Node must be unique"); + preceding_ops.insert(in->inputs[0]); + pending_ops[in->inputs[0]].insert(node); + } + op_deps[node] = preceding_ops.size(); + if (preceding_ops.empty()) { + ready_ops.insert(node); + } + } + + for (auto *op_desc : ops) { + ir::Node *found_node = nullptr; + for (auto *node : ready_ops) { + if (IsSameOpDesc(op_desc, node->Op())) { + PADDLE_ENFORCE(found_node == nullptr, + "Found multiple op_desc in graph: %s", op_desc->Type()); + found_node = node; + } + } + + PADDLE_ENFORCE_NOT_NULL(found_node, "Cannot find op_desc in graph: %s", + op_desc->Type()); + for (auto *pending_op : pending_ops[found_node]) { + if (--op_deps.at(pending_op) == 0) { + ready_ops.insert(pending_op); + } + } + ready_ops.erase(found_node); + if (skip_dist_ops.count(op_desc->Type()) == 0) { + op_node_list.push_back(found_node); + } + } + + for (size_t i = 1; i < op_node_list.size(); ++i) { + auto *dep_var = graph->CreateControlDepVar(); + op_node_list[i]->inputs.push_back(dep_var); + op_node_list[i - 1]->outputs.push_back(dep_var); + dep_var->outputs.push_back(op_node_list[i]); + dep_var->inputs.push_back(op_node_list[i - 1]); + VLOG(100) << "Add dependencies between " << op_node_list[i - 1]->Name() + << " and " << op_node_list[i]->Name(); + } + return graph; +} + +} // namespace details +} // namespace framework +} // namespace paddle + +REGISTER_PASS(sequential_execution_pass, + paddle::framework::details::SequentialExecutionPass) + .RequirePassAttr(paddle::framework::details::kAllOpDescs); diff --git a/paddle/fluid/framework/details/sequential_execution_pass.h b/paddle/fluid/framework/details/sequential_execution_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..a04c08bc2eb3bae797d648b30a22a5fee7ba0eaa --- /dev/null +++ b/paddle/fluid/framework/details/sequential_execution_pass.h @@ -0,0 +1,34 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
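The heart of the new pass is its final loop: after matching every OpDesc to its graph node in program order, consecutive ops are chained through fresh control-dependency variables so the executor can only run them one at a time. A minimal standalone sketch of that chaining step, using a toy Node struct in place of ir::Node:

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-in for ir::Node; the real pass uses
// graph->CreateControlDepVar() and the Node::inputs/outputs vectors.
struct Node {
  std::string name;
  std::vector<Node*> inputs, outputs;
};

int main() {
  std::vector<std::unique_ptr<Node>> owned;
  auto make = [&](const std::string& n) {
    owned.emplace_back(new Node{n, {}, {}});
    return owned.back().get();
  };

  std::vector<Node*> op_node_list{make("op0"), make("op1"), make("op2")};

  // Chain consecutive ops through fresh control-dependency variables,
  // mirroring the loop at the end of SequentialExecutionPass::ApplyImpl.
  for (size_t i = 1; i < op_node_list.size(); ++i) {
    Node* dep_var = make("ctrl_dep_" + std::to_string(i));
    op_node_list[i]->inputs.push_back(dep_var);
    op_node_list[i - 1]->outputs.push_back(dep_var);
    dep_var->outputs.push_back(op_node_list[i]);
    dep_var->inputs.push_back(op_node_list[i - 1]);
  }

  // op1 now cannot become ready before op0 finishes, and so on.
  for (Node* op : op_node_list) {
    std::cout << op->name << " has " << op->inputs.size()
              << " extra control input(s)\n";
  }
}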
+ +#pragma once + +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace details { + +constexpr char kAllOpDescs[] = "all_op_descs"; + +class SequentialExecutionPass : public ir::Pass { + protected: + std::unique_ptr ApplyImpl( + std::unique_ptr graph) const override; +}; + +} // namespace details +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/details/ssa_graph_executor.cc b/paddle/fluid/framework/details/ssa_graph_executor.cc index 780da5478ff34ecd7096d0ef62b72bf1088dd221..af2cbd5c876fdd7c27cd679f7e9412d1b0604ecc 100644 --- a/paddle/fluid/framework/details/ssa_graph_executor.cc +++ b/paddle/fluid/framework/details/ssa_graph_executor.cc @@ -19,14 +19,16 @@ namespace framework { namespace details { SSAGraphExecutor::~SSAGraphExecutor() {} -void ClearFetchOp(ir::Graph* graph, - std::vector>* fetch_ops) { +void ClearFetchOp(ir::Graph* graph, std::vector* fetch_ops) { if (fetch_ops->empty()) return; for (auto& op : *fetch_ops) { for (auto& out_var : op->Node()->outputs) { graph->RemoveNode(out_var); } + for (auto& in_var : op->Inputs()) { + in_var->RemoveOutput(op, op->Node()); + } graph->RemoveNode(op->Node()); } fetch_ops->clear(); diff --git a/paddle/fluid/framework/details/ssa_graph_executor.h b/paddle/fluid/framework/details/ssa_graph_executor.h index d5cf7737d565c523995e6685b73c57e5a6f0197b..860eaa25b58e4579ad792ff18618de3b90707e8d 100644 --- a/paddle/fluid/framework/details/ssa_graph_executor.h +++ b/paddle/fluid/framework/details/ssa_graph_executor.h @@ -38,8 +38,7 @@ class SSAGraphExecutor { virtual FeedFetchList Run(const std::vector& fetch_tensors) = 0; }; -void ClearFetchOp(ir::Graph* graph, - std::vector>* fetch_ops); +void ClearFetchOp(ir::Graph* graph, std::vector* fetch_ops); } // namespace details } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc b/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc index 31beef3ae829d72570ee7c879dac71ed600cd216..f781f02a076594b5a70fd4863ebf273e88607dfd 100644 --- a/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc +++ b/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc @@ -15,6 +15,7 @@ #include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h" #include "paddle/fluid/framework/details/multi_devices_helper.h" +#include "paddle/fluid/framework/ir/graph_helper.h" #include "paddle/fluid/platform/profiler.h" namespace paddle { @@ -39,7 +40,7 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( new platform::RecordEvent("ThreadedSSAGraphExecutorPrepare", nullptr)); std::unordered_map pending_ops; std::unordered_set pending_vars; - BlockingQueue ready_vars; + auto ready_vars = std::make_shared>(); std::unordered_set ready_ops; // For ops (e.g. 
nccl_all_reduce) that need to coordinate multiple // streams from multiple GPUs, it's faster to buffer them and schedule @@ -51,34 +52,34 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( for (auto &var_map : graph_->Get(details::kGraphVars)) { for (auto &name_pair : var_map) { for (auto &version_pair : name_pair.second) { - InsertPendingVar(&pending_vars, &ready_vars, version_pair.get()); + InsertPendingVar(&pending_vars, ready_vars.get(), version_pair); } } } for (auto &var : graph_->Get(details::kGraphDepVars)) { - InsertPendingVar(&pending_vars, &ready_vars, var.get()); + InsertPendingVar(&pending_vars, ready_vars.get(), var); } - for (auto &op : graph_->Get(details::kGraphOps)) { + for (auto &op : ir::FilterByNodeWrapper(*graph_)) { if (op->Inputs().empty()) { // Special case, Op has no input. - ready_ops.insert(op.get()); + ready_ops.insert(op); } else { - InsertPendingOp(&pending_ops, op.get()); + InsertPendingOp(&pending_ops, op); } } // Step 2. Insert FetchOps - std::vector> fetch_ops; - std::unordered_set> fetch_dependencies; + std::vector fetch_ops; + std::unordered_set fetch_dependencies; FeedFetchList fetch_data(fetch_tensors.size()); InsertFetchOps(fetch_tensors, &fetch_ops, &fetch_dependencies, &pending_ops, - &pending_vars, &ready_vars, &fetch_data); + &pending_vars, ready_vars.get(), &fetch_data); auto run_all_ops = [&](std::unordered_set &set) { for (auto *op : set) { running_ops_++; - RunOp(&ready_vars, op); + RunOp(ready_vars, op); } set.clear(); }; @@ -87,7 +88,6 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( run_op_futures_.clear(); exception_holder_.Clear(); event.reset(nullptr); - // Step 3. Execution while (!pending_vars.empty()) { // 1. Run All Ready ops @@ -103,13 +103,14 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( // 2. Find ready variable bool timeout; - auto cur_ready_vars = ready_vars.PopAll(1, &timeout); + auto cur_ready_vars = ready_vars->PopAll(1, &timeout); if (timeout) { if (exception_holder_.IsCaught()) { for (auto &run_op_future : run_op_futures_) { run_op_future.wait(); } + ClearFetchOp(graph_.get(), &fetch_ops); exception_holder_.ReThrow(); } else { continue; @@ -133,7 +134,6 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( } } PADDLE_ENFORCE(ready_ops.empty()); - // Wait FetchOps. 
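A likely motivation for turning ready_vars into a std::shared_ptr (see the RunOp signature change below) is lifetime: the lambda scheduled for each op captures the queue, and a shared_ptr keeps it alive even if a worker thread outlives the Run() scope. A hedged sketch of the capture pattern, with a trivial ToyQueue standing in for BlockingQueue<VarHandleBase*>:

#include <future>
#include <iostream>
#include <memory>
#include <mutex>
#include <vector>

// Trivial stand-in for details::BlockingQueue<VarHandleBase*>.
template <typename T>
class ToyQueue {
 public:
  void Extend(const std::vector<T>& items) {
    std::lock_guard<std::mutex> guard(mu_);
    data_.insert(data_.end(), items.begin(), items.end());
  }
  size_t Size() {
    std::lock_guard<std::mutex> guard(mu_);
    return data_.size();
  }

 private:
  std::mutex mu_;
  std::vector<T> data_;
};

int main() {
  auto ready = std::make_shared<ToyQueue<int>>();

  // Capturing the shared_ptr by value keeps the queue alive for as long
  // as any scheduled task still references it, even past the caller.
  auto fut = std::async(std::launch::async,
                        [ready] { ready->Extend({1, 2, 3}); });
  fut.wait();
  std::cout << "ready vars: " << ready->Size() << "\n";
}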
ClearFetchOp(graph_.get(), &fetch_ops); @@ -142,8 +142,8 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( void ThreadedSSAGraphExecutor::InsertFetchOps( const std::vector &fetch_tensors, - std::vector> *fetch_ops, - std::unordered_set> *fetch_dependencies, + std::vector *fetch_ops, + std::unordered_set *fetch_dependencies, std::unordered_map *pending_ops, std::unordered_set *pending_vars, BlockingQueue *ready_vars, FeedFetchList *fetch_data) { @@ -153,7 +153,7 @@ void ThreadedSSAGraphExecutor::InsertFetchOps( for (auto &var_map : graph_->Get(details::kGraphVars)) { auto it = var_map.find(fetch_var_name); if (it != var_map.end()) { - fetched_vars[fetch_var_name].push_back(it->second.rbegin()->get()); + fetched_vars[fetch_var_name].push_back(*it->second.rbegin()); } } } @@ -206,17 +206,20 @@ void ThreadedSSAGraphExecutor::InsertPendingVar( } void ThreadedSSAGraphExecutor::RunOp( - BlockingQueue *ready_var_q, details::OpHandleBase *op) { + const std::shared_ptr> &ready_var_q, + details::OpHandleBase *op) { auto op_run = [ready_var_q, op, this] { try { - if (VLOG_IS_ON(10)) { - VLOG(10) << op << " " << op->Name() << " : " << op->DebugString(); + if (VLOG_IS_ON(100)) { + VLOG(100) << op << " " << op->Name() << " : " << op->DebugString(); + } + if (LIKELY(!strategy_.dry_run_)) { + op->Run(strategy_.use_cuda_); } - op->Run(strategy_.use_cuda_); - VLOG(10) << op << " " << op->Name() << " Done "; + VLOG(100) << op << " " << op->Name() << " Done "; running_ops_--; ready_var_q->Extend(op->Outputs()); - VLOG(10) << op << " " << op->Name() << "Signal posted"; + VLOG(100) << op << " " << op->Name() << "Signal posted"; } catch (...) { exception_holder_.Catch(std::current_exception()); } diff --git a/paddle/fluid/framework/details/threaded_ssa_graph_executor.h b/paddle/fluid/framework/details/threaded_ssa_graph_executor.h index 512f8a4ca5a9b82a395dde11722b8db44ea5ec27..24da56c09e3e0f3894d58e5af8838c98e3e1e67c 100644 --- a/paddle/fluid/framework/details/threaded_ssa_graph_executor.h +++ b/paddle/fluid/framework/details/threaded_ssa_graph_executor.h @@ -48,10 +48,10 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor { // Use topological sort algorithm FeedFetchList Run(const std::vector &fetch_tensors) override; - ~ThreadedSSAGraphExecutor() {} + ~ThreadedSSAGraphExecutor() final = default; private: - void RunOp(BlockingQueue *ready_var_q, + void RunOp(const std::shared_ptr> &ready_var_q, details::OpHandleBase *op); private: @@ -70,13 +70,13 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor { BlockingQueue *ready_vars, VarHandleBase *var) const; - void InsertFetchOps( - const std::vector &fetch_tensors, - std::vector> *fetch_ops, - std::unordered_set> *fetch_dependencies, - std::unordered_map *pending_ops, - std::unordered_set *pending_vars, - BlockingQueue *ready_vars, FeedFetchList *fetch_data); + void InsertFetchOps(const std::vector &fetch_tensors, + std::vector *fetch_ops, + std::unordered_set *fetch_dependencies, + std::unordered_map *pending_ops, + std::unordered_set *pending_vars, + BlockingQueue *ready_vars, + FeedFetchList *fetch_data); private: ExecutionStrategy strategy_; diff --git a/paddle/fluid/framework/details/var_handle.cc b/paddle/fluid/framework/details/var_handle.cc index 5457870e9ff5d7cf67c9c7076b9aae94eeada779..30da029ca2a90e7faa6288557ff2f1aeb21cc1c6 100644 --- a/paddle/fluid/framework/details/var_handle.cc +++ b/paddle/fluid/framework/details/var_handle.cc @@ -20,6 +20,8 @@ namespace details { VarHandleBase::~VarHandleBase() {} +VarHandle::~VarHandle() { VLOG(4) << 
"deleting var handle " << DebugString(); } + std::string VarHandle::DebugString() const { std::stringstream ss; ss << name_ << ":" << place_; @@ -27,6 +29,10 @@ std::string VarHandle::DebugString() const { } std::string DummyVarHandle::DebugString() const { return node_->Name(); } + +DummyVarHandle::~DummyVarHandle() { + VLOG(4) << "deleting dummy var handle " << DebugString(); +} } // namespace details } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/details/var_handle.h b/paddle/fluid/framework/details/var_handle.h index d8c2bc40b9458a1d5a7dd8a32277d04f69295f09..3b007d7b1a52df765a2dbd41939f8f865123cb43 100644 --- a/paddle/fluid/framework/details/var_handle.h +++ b/paddle/fluid/framework/details/var_handle.h @@ -35,7 +35,10 @@ class OpHandleBase; // A variable can only be generated by a single operator. i.e. // This is a single assignment graph. struct VarHandleBase { - explicit VarHandleBase(ir::Node* node) : node_(node) {} + // Owned by `node`. No need to be deleted explicitly. + explicit VarHandleBase(ir::Node* node) : node_(node) { + node_->WrappedBy(this); + } virtual ~VarHandleBase(); @@ -49,6 +52,8 @@ struct VarHandleBase { void AddOutput(OpHandleBase* out, ir::Node* node) { if (pending_ops_.find(out) == pending_ops_.end()) { + PADDLE_ENFORCE(out != nullptr, "The output of %s should not be nullptr", + this->Node()->Name()); pending_ops_.insert(out); node_->outputs.push_back(node); } @@ -92,6 +97,8 @@ struct VarHandleBase { struct VarHandle : public VarHandleBase { explicit VarHandle(ir::Node* node) : VarHandleBase(node) {} + virtual ~VarHandle(); + std::string DebugString() const override; VarHandle(ir::Node* node, size_t version, size_t scope_index, @@ -119,6 +126,8 @@ struct VarHandle : public VarHandleBase { struct DummyVarHandle : public VarHandleBase { explicit DummyVarHandle(ir::Node* node) : VarHandleBase(node) {} + virtual ~DummyVarHandle(); + std::string DebugString() const override; }; diff --git a/paddle/fluid/framework/dlpack_tensor.cc b/paddle/fluid/framework/dlpack_tensor.cc new file mode 100644 index 0000000000000000000000000000000000000000..04e3f78afe44bf748e4514fd82e5571ce2a50838 --- /dev/null +++ b/paddle/fluid/framework/dlpack_tensor.cc @@ -0,0 +1,127 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include "paddle/fluid/framework/dlpack_tensor.h" + +namespace paddle { +namespace framework { + +namespace internal { +template +static ::DLDataType GetDLDataTypeCode() { + ::DLDataType dtype; + if (std::is_same::value || + std::is_floating_point::value) { + dtype.code = kDLFloat; + } else if (std::is_unsigned::value) { + dtype.code = kDLUInt; + } else if (std::is_integral::value) { + dtype.code = kDLInt; + } else { + PADDLE_THROW("Unsupported data type %s", typeid(T).name()); + } + dtype.bits = 8 * sizeof(T); + dtype.lanes = 1; + return dtype; +} + +static DLDataType GetDLDataTypeFromTypeIndex(const std::type_index &type) { +#define REG_DL_DATA_TYPE(type) \ + { std::type_index(typeid(type)), GetDLDataTypeCode() } + static const std::unordered_map + type_to_dtype_map({ + REG_DL_DATA_TYPE(platform::float16), // NOLINT + REG_DL_DATA_TYPE(float), // NOLINT + REG_DL_DATA_TYPE(double), // NOLINT + REG_DL_DATA_TYPE(int), // NOLINT + REG_DL_DATA_TYPE(int64_t), // NOLINT + REG_DL_DATA_TYPE(bool), // NOLINT + REG_DL_DATA_TYPE(size_t), // NOLINT + REG_DL_DATA_TYPE(int16_t), // NOLINT + REG_DL_DATA_TYPE(uint8_t), // NOLINT + REG_DL_DATA_TYPE(int8_t) // NOLINT + }); + static auto type_to_dtype_map_end_it = type_to_dtype_map.end(); + auto it = type_to_dtype_map.find(type); + PADDLE_ENFORCE(it != type_to_dtype_map_end_it, "Unsupported data type %s", + type.name()); + return it->second; +#undef REG_DL_DATA_TYPE +} + +struct DLContextVisitor : public boost::static_visitor<::DLContext> { + inline ::DLContext operator()(const platform::CPUPlace &place) const { + DLContext ctx; + ctx.device_type = kDLCPU; + ctx.device_id = 0; + return ctx; + } + + inline ::DLContext operator()(const platform::CUDAPlace &place) const { +#ifdef PADDLE_WITH_CUDA + DLContext ctx; + ctx.device_type = kDLGPU; + ctx.device_id = place.device; + return ctx; +#else + PADDLE_THROW("platform::CUDAPlace is not supported in CPU only version"); +#endif + } + + inline ::DLContext operator()(const platform::CUDAPinnedPlace &place) const { +#ifdef PADDLE_WITH_CUDA + DLContext ctx; + ctx.device_type = kDLCPUPinned; + ctx.device_id = 0; + return ctx; +#else + PADDLE_THROW( + "platform::CUDAPinnedPlace is not supported in CPU only version"); +#endif + } +}; +} // namespace internal + +DLPackTensor::DLPackTensor(const Tensor &tensor, LaneType lanes) { + // init data, data buffer + t_.data = const_cast(tensor.data()); + + // init ctx, DLContext type with device_type and device_id + auto place = tensor.place(); + t_.ctx = boost::apply_visitor(internal::DLContextVisitor(), place); + + // init dtype + t_.dtype = internal::GetDLDataTypeFromTypeIndex(tensor.type()); + t_.dtype.lanes = lanes; + + // init ndim, tensor rank + auto &dims = tensor.dims(); + using DimType = decltype(t_.ndim); // int + t_.ndim = static_cast(dims.size()); + + // init shape, tensor dims + t_.shape = shape_; + for (DimType i = 0; i < t_.ndim; ++i) { + t_.shape[i] = dims[i]; + } + + // init strides, nullptr means the tensor is compact + t_.strides = nullptr; + + // init byte_offset + t_.byte_offset = 0; +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/dlpack_tensor.h b/paddle/fluid/framework/dlpack_tensor.h new file mode 100644 index 0000000000000000000000000000000000000000..0c52bce1ef6af9b92bcb9f87c6781de878ed5898 --- /dev/null +++ b/paddle/fluid/framework/dlpack_tensor.h @@ -0,0 +1,45 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. 
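DLPackTensor above is essentially a zero-copy adapter: it fills a ::DLTensor whose fields point at the framework tensor's buffer, place, dtype and dims. A hedged sketch of the same structure built by hand, assuming the stock dlpack.h header (v0.2-era field names, matching this code) is on the include path:

#include <dlpack/dlpack.h>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  std::vector<float> buffer(4 * 5, 1.0f);
  int64_t shape[2] = {4, 5};

  DLTensor t;
  t.data = buffer.data();       // no copy: DLPack is a view
  t.ctx = {kDLCPU, 0};          // device type + device id
  t.ndim = 2;
  t.dtype = {kDLFloat, 32, 1};  // code, bits, lanes
  t.shape = shape;
  t.strides = nullptr;          // nullptr means compact row-major
  t.byte_offset = 0;

  std::cout << "dlpack view over " << t.shape[0] * t.shape[1]
            << " floats\n";
}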
+// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include "paddle/fluid/framework/tensor.h" + +namespace paddle { +namespace framework { + +class DLPackTensor { + public: + using LaneType = decltype(::DLTensor::dtype.lanes); // uint16_t + using ShapeType = + std::remove_reference::type; // int64_t + + // lanes is only used in CPU to enable vectorization + explicit DLPackTensor(const Tensor& tensor, LaneType lanes = 1); + + inline operator const ::DLTensor&() const { return t_; } + + inline operator ::DLTensor&() { return t_; } + + private: + ::DLTensor t_; + + // The shape in DLTensor is defined as int64_t* + // Add this member to make TVMTensor init without heap allocation + ShapeType shape_[9]; +}; + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/dlpack_tensor_test.cc b/paddle/fluid/framework/dlpack_tensor_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..938b05635004fcc417f753d5912269333e3ebc01 --- /dev/null +++ b/paddle/fluid/framework/dlpack_tensor_test.cc @@ -0,0 +1,113 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/dlpack_tensor.h" +#include +#include +#include + +namespace paddle { +namespace framework { + +namespace { // NOLINT +template +constexpr uint8_t GetDLDataTypeCode() { + return std::is_same::value || + std::is_floating_point::value + ? static_cast(kDLFloat) + : (std::is_unsigned::value + ? static_cast(kDLUInt) + : (std::is_integral::value ? 
static_cast(kDLInt) + : static_cast(-1))); +} +} // NOLINT + +template +void TestMain(const platform::Place &place, uint16_t lanes) { + DDim dims{4, 5, 6, 7}; + Tensor tensor; + tensor.Resize(dims); + void *p = tensor.mutable_data(place); + + DLPackTensor dlpack_tensor(tensor, lanes); + ::DLTensor &dl_tensor = dlpack_tensor; + + CHECK_EQ(p, dl_tensor.data); + if (platform::is_cpu_place(place)) { + CHECK_EQ(kDLCPU, dl_tensor.ctx.device_type); + CHECK_EQ(0, dl_tensor.ctx.device_id); + } else if (platform::is_gpu_place(place)) { + CHECK_EQ(kDLGPU, dl_tensor.ctx.device_type); + CHECK_EQ(boost::get(place).device, + dl_tensor.ctx.device_id); + } else if (platform::is_cuda_pinned_place(place)) { + CHECK_EQ(kDLCPUPinned, dl_tensor.ctx.device_type); + CHECK_EQ(0, dl_tensor.ctx.device_id); + } else { + CHECK_EQ(false, true); + } + + CHECK_EQ(dims.size(), dl_tensor.ndim); + for (auto i = 0; i < dims.size(); ++i) { + CHECK_EQ(dims[i], dl_tensor.shape[i]); + } + + CHECK_EQ(dl_tensor.strides == nullptr, true); + CHECK_EQ(static_cast(0), dl_tensor.byte_offset); + + CHECK_EQ(lanes, dl_tensor.dtype.lanes); + CHECK_EQ(sizeof(T) * 8, dl_tensor.dtype.bits); + + CHECK_EQ(GetDLDataTypeCode(), dl_tensor.dtype.code); +} + +template +void TestMainLoop() { +#ifdef PADDLE_WITH_CUDA + std::vector places{platform::CPUPlace(), + platform::CUDAPlace(0), + platform::CUDAPinnedPlace()}; + if (platform::GetCUDADeviceCount() > 1) { + places.emplace_back(platform::CUDAPlace(1)); + } +#else + std::vector places{platform::CPUPlace()}; +#endif + std::vector lanes{1, 2}; + for (auto &p : places) { + for (auto &l : lanes) { + TestMain(p, l); + } + } +} + +#define PADDLE_DLPACK_TEST(type) \ + TEST(dlpack, test_##type) { TestMainLoop(); } + +using float16 = platform::float16; +PADDLE_DLPACK_TEST(float16); +PADDLE_DLPACK_TEST(float); +PADDLE_DLPACK_TEST(double); +PADDLE_DLPACK_TEST(int); +PADDLE_DLPACK_TEST(int64_t); +PADDLE_DLPACK_TEST(bool); +PADDLE_DLPACK_TEST(size_t); +PADDLE_DLPACK_TEST(int16_t); +PADDLE_DLPACK_TEST(uint8_t); +PADDLE_DLPACK_TEST(int8_t); + +#undef PADDLE_DLPACK_TEST + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/eigen.h b/paddle/fluid/framework/eigen.h index 2b265a773fe967f5b2ab38ce795b0f599d859c2a..5bafa4345f42a1f6209b5ee31ae6ba2ded6a899c 100644 --- a/paddle/fluid/framework/eigen.h +++ b/paddle/fluid/framework/eigen.h @@ -13,11 +13,6 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once -// logging.h and windows.h conflict -#define GLOG_NO_ABBREVIATED_SEVERITIES -// solve static linking error in windows -// https://github.com/google/glog/issues/301 -#define GOOGLE_GLOG_DLL_DECL #include "paddle/fluid/framework/tensor.h" #include "unsupported/Eigen/CXX11/Tensor" diff --git a/paddle/fluid/framework/executor.cc b/paddle/fluid/framework/executor.cc index b212666637a5289c9c6cd3585655deaeed8afd4b..3dc571d75706b732fe9b254897b6cbd2e206cfc3 100644 --- a/paddle/fluid/framework/executor.cc +++ b/paddle/fluid/framework/executor.cc @@ -17,14 +17,17 @@ limitations under the License. 
*/ #include "paddle/fluid/framework/feed_fetch_method.h" #include "paddle/fluid/framework/lod_rank_table.h" #include "paddle/fluid/framework/lod_tensor_array.h" +#include "paddle/fluid/framework/ngraph_operator.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/reader.h" +#include "paddle/fluid/framework/transfer_scope_cache.h" #include "paddle/fluid/operators/detail/macros.h" #include "paddle/fluid/platform/place.h" #include "paddle/fluid/platform/profiler.h" DECLARE_bool(benchmark); DEFINE_bool(use_mkldnn, false, "Use MKLDNN to run"); +DEFINE_bool(use_ngraph, false, "Use NGRAPH to run"); namespace paddle { namespace framework { @@ -43,7 +46,7 @@ ExecutorPrepareContext::ExecutorPrepareContext( } ExecutorPrepareContext::~ExecutorPrepareContext() { - VLOG(5) << "destroy ExecutorPrepareContext"; + VLOG(50) << "destroy ExecutorPrepareContext"; } template @@ -60,7 +63,7 @@ static void DeleteUnusedTensors(const Scope& scope, const OperatorBase* op, if ((it->second)-- == 1) { auto* var = scope.FindVar(name); if (var != nullptr) { - VLOG(10) << "Erase tensor \'" << name << "\'"; + VLOG(100) << "Erase tensor \'" << name << "\'"; if (var->IsType()) { erase_tensors.insert(var->GetMutable()); } else if (var->IsType()) { @@ -81,12 +84,32 @@ static void DeleteUnusedTensors(const Scope& scope, const OperatorBase* op, } } +static void EnableFusedOp(ExecutorPrepareContext* ctx) { +#ifdef PADDLE_WITH_NGRAPH + VLOG(3) << "use_ngraph=True"; + auto intervals = FusedOperator::FusedOpIntervals(&ctx->ops_); + for (auto& interval : intervals) { + auto* fused_op = new FusedOperator(ctx->prog_, ctx->block_id_, + interval.at(0), interval.at(1)); + *interval[0] = std::unique_ptr(fused_op); + } + for (auto it = intervals.rbegin(); it != intervals.rend(); ++it) { + ctx->ops_.erase(it->at(0) + 1, it->at(1)); + } +#else + LOG(WARNING) + << "'NGRAPH' is not supported, Please re-compile with WITH_NGRAPH option"; +#endif +} + Executor::Executor(const platform::Place& place) : place_(place) {} void Executor::Close() { #ifdef PADDLE_WITH_DISTRIBUTE + // TODO(typhoonzero): complete message will need to use real trainer_id, + // except 0. 
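EnableFusedOp above replaces each fusable interval of ops with a single FusedOperator and then erases the now-redundant tail of the interval. The splice bookkeeping is easy to get wrong, so here is a simplified index-based sketch of the same operation (the real code works with iterators into ctx->ops_):

#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Op {
  explicit Op(std::string n) : name(std::move(n)) {}
  std::string name;
};

int main() {
  std::vector<std::unique_ptr<Op>> ops;
  for (const char* n : {"a", "b", "c", "d", "e"})
    ops.emplace_back(new Op(n));

  // Fuse the half-open interval [1, 4): ops b, c, d become one fused op.
  size_t begin = 1, end = 4;
  ops[begin] = std::make_unique<Op>("fused(b,c,d)");
  ops.erase(ops.begin() + begin + 1, ops.begin() + end);

  for (auto& op : ops) std::cout << op->name << "\n";  // a fused(b,c,d) e
}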
::paddle::operators::distributed::RPCClient::GetInstance< - ::paddle::operators::distributed::GRPCClient>() + ::paddle::operators::distributed::GRPCClient>(0) ->SendComplete(); #endif } @@ -139,21 +162,21 @@ void Executor::CreateVariables(const ProgramDesc& pdesc, Scope* scope, if (var->Persistable()) { auto* ptr = const_cast(ancestor_scope)->Var(var->Name()); InitializeVariable(ptr, var->GetType()); - VLOG(3) << "Create Variable " << var->Name() - << " global, which pointer is " << ptr; + VLOG(30) << "Create Variable " << var->Name() + << " global, which pointer is " << ptr; } else { auto* ptr = scope->Var(var->Name()); InitializeVariable(ptr, var->GetType()); - VLOG(3) << "Create Variable " << var->Name() - << " locally, which pointer is " << ptr; + VLOG(30) << "Create Variable " << var->Name() + << " locally, which pointer is " << ptr; } } } else { for (auto& var : global_block.AllVars()) { auto* ptr = scope->Var(var->Name()); InitializeVariable(ptr, var->GetType()); - VLOG(3) << "Create variable " << var->Name() << ", which pointer is " - << ptr; + VLOG(30) << "Create variable " << var->Name() << ", which pointer is " + << ptr; } } } @@ -284,7 +307,7 @@ void Executor::Run(const ProgramDesc& program, Scope* scope, int i = 0; for (auto& feed_target : (*feed_targets)) { std::string var_name = feed_target.first; - VLOG(3) << "feed target's name: " << var_name; + VLOG(30) << "feed target's name: " << var_name; // prepend feed op auto* op = global_block->PrependOp(); @@ -307,7 +330,7 @@ void Executor::Run(const ProgramDesc& program, Scope* scope, int i = 0; for (auto& fetch_target : (*fetch_targets)) { std::string var_name = fetch_target.first; - VLOG(3) << "fetch target's name: " << var_name; + VLOG(30) << "fetch target's name: " << var_name; // append fetch op auto* op = global_block->AppendOp(); @@ -336,6 +359,7 @@ std::unique_ptr Executor::Prepare( for (auto& op_desc : block.AllOps()) { ctx->ops_.push_back(OpRegistry::CreateOp(*op_desc)); } + if (FLAGS_use_ngraph) EnableFusedOp(ctx.get()); return ctx; } @@ -357,6 +381,7 @@ std::vector> Executor::Prepare( void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope, bool create_local_scope, bool create_vars, bool keep_kids) { + PADDLE_ENFORCE_NOT_NULL(scope); Scope* local_scope = scope; if (create_vars) { if (create_local_scope) { @@ -367,8 +392,8 @@ void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope, int64_t max_memory_size = GetEagerDeletionThreshold(); std::unique_ptr> gc; - // WhileOp would set keep_kids to false - // WhileGradOp would need the scopes created in WhileOp + // WhileOp would set keep_kids to true, + // because WhileGradOp needs the scopes created in WhileOp. // Perhaps, we should not perform eager deletion in WhileOp // The scopes and variables created by WhileOp would be deleted // in WhileGradOp. 
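The corrected comment above captures a subtle lifetime rule: WhileOp keeps keep_kids true because WhileGradOp later reads the scopes the forward pass created. A toy model of the scope tree and the DropKids decision:

#include <iostream>
#include <memory>
#include <vector>

// Toy scope tree: with keep_kids false the local scopes created during
// the forward pass would be dropped immediately, but WhileGradOp still
// needs them, hence WhileOp sets keep_kids = true.
struct Scope {
  std::vector<std::unique_ptr<Scope>> kids;
  Scope* NewKid() {
    kids.emplace_back(new Scope);
    return kids.back().get();
  }
  void DropKids() { kids.clear(); }
};

int main() {
  Scope global;
  Scope* step_scope = global.NewKid();  // created by WhileOp
  (void)step_scope;

  bool keep_kids = true;  // WhileOp's setting, per the comment above
  if (!keep_kids) global.DropKids();

  std::cout << "surviving child scopes: " << global.kids.size() << "\n";
}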
@@ -394,11 +419,6 @@ void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope, DeleteUnusedTensors(*local_scope, op.get(), gc.get(), &(ctx->cur_ref_cnts_)); } - - if (FLAGS_benchmark) { - VLOG(2) << "Memory used after operator " + op->Type() + " running: " - << memory::memory_usage(place_); - } } if (gc != nullptr) { @@ -420,13 +440,6 @@ void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope, scope->DropKids(); } } - - if (FLAGS_benchmark) { - VLOG(2) << "-------------------------------------------------------"; - VLOG(2) << "Memory used after deleting local scope: " - << memory::memory_usage(place_); - VLOG(2) << "-------------------------------------------------------"; - } } void Executor::RunPreparedContext( @@ -469,7 +482,7 @@ void Executor::RunPreparedContext( void Executor::EnableMKLDNN(const ProgramDesc& program) { #ifdef PADDLE_WITH_MKLDNN - VLOG(3) << "use_mkldnn=True"; + VLOG(30) << "use_mkldnn=True"; for (size_t bid = 0; bid < program.Size(); ++bid) { auto* block = const_cast(program).MutableBlock(bid); for (auto* op : block->AllOps()) { @@ -483,6 +496,5 @@ void Executor::EnableMKLDNN(const ProgramDesc& program) { << "'MKLDNN' is not supported, Please re-compile with WITH_MKLDNN option"; #endif } - } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/feed_fetch_method.cc b/paddle/fluid/framework/feed_fetch_method.cc index 3e9353f5cf67d8de62c5551f12ea786e49190549..1f3c19c0d5901cec9acc4ac9c5dab538d620c956 100644 --- a/paddle/fluid/framework/feed_fetch_method.cc +++ b/paddle/fluid/framework/feed_fetch_method.cc @@ -25,7 +25,7 @@ void SetFeedVariable(Scope* scope, const LoDTensor& input, const std::string& var_name, size_t index) { // If var_name Variable is not found in GlobalScope, a new variable will // be created. - VLOG(3) << "SetFeedVariable name=" << var_name << " index=" << index; + VLOG(30) << "SetFeedVariable name=" << var_name << " index=" << index; Variable* g_feed_value = scope->Var(var_name); auto& feed_inputs = *(g_feed_value->GetMutable()); if (index >= feed_inputs.size()) { @@ -47,8 +47,8 @@ LoDTensor& GetFetchVariable(const Scope& scope, const std::string& var_name, typeid(FeedFetchList).name()); auto& fetch_outputs = *g_fetch_value->GetMutable(); auto& tensor = fetch_outputs[index]; - VLOG(3) << "Fetch " << var_name << " with index " << index - << " shape= " << tensor.dims(); + VLOG(30) << "Fetch " << var_name << " with index " << index + << " shape= " << tensor.dims(); PADDLE_ENFORCE_LT(index, fetch_outputs.size()); return tensor; } diff --git a/paddle/fluid/framework/framework.proto b/paddle/fluid/framework/framework.proto index 25f0ba418433571343c5b2bbfdbf9fb940eaec52..efdabffb9b33ddf007c13008d0f3afb7a3961eda 100644 --- a/paddle/fluid/framework/framework.proto +++ b/paddle/fluid/framework/framework.proto @@ -35,6 +35,7 @@ enum AttrType { BLOCK = 8; LONG = 9; BLOCKS = 10; + LONGS = 11; } // OpDesc describes an instance of a C++ framework::OperatorBase @@ -55,6 +56,7 @@ message OpDesc { optional int32 block_idx = 12; optional int64 l = 13; repeated int32 blocks_idx = 14; + repeated int64 longs = 15; }; message Var { @@ -80,7 +82,6 @@ message OpProto { optional bool duplicable = 3 [ default = false ]; optional bool intermediate = 4 [ default = false ]; optional bool dispensable = 5 [ default = false ]; - optional string reuse = 6; } // AttrProto describes the C++ type Attribute. 
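On the C++ side, the new LONGS attribute kind in framework.proto corresponds to the attribute variant gaining a std::vector<int64_t> alternative next to the existing scalar LONG. A hedged sketch of the pattern with std::variant (PaddlePaddle itself used boost::variant at the time):

#include <cstdint>
#include <iostream>
#include <string>
#include <variant>
#include <vector>

// Sketch of an attribute value that, like the proto change above,
// supports both a single int64 (LONG) and a list of them (LONGS).
using Attribute =
    std::variant<int, float, std::string, int64_t, std::vector<int64_t>>;

int main() {
  Attribute single = int64_t{42};                  // AttrType::LONG
  Attribute many = std::vector<int64_t>{1, 2, 3};  // AttrType::LONGS

  std::cout << std::get<int64_t>(single) << " / "
            << std::get<std::vector<int64_t>>(many).size() << " longs\n";
}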
diff --git a/paddle/fluid/framework/garbage_collector.h b/paddle/fluid/framework/garbage_collector.h index b403252c972d26da6deeca54ce88a9547ffe7afa..818b3334ea4171fd7a9cbaa896ee1672e8ecca51 100644 --- a/paddle/fluid/framework/garbage_collector.h +++ b/paddle/fluid/framework/garbage_collector.h @@ -29,7 +29,7 @@ template class GarbageCollector { public: GarbageCollector(const platform::Place &place, size_t max_memory_size) - : max_memory_size_(std::max(max_memory_size, static_cast(1))) { + : max_memory_size_((std::max)(max_memory_size, static_cast(1))) { garbages_.reset(new std::deque()); dev_ctx_ = platform::DeviceContextPool::Instance().Get(place); } diff --git a/paddle/fluid/framework/ir/CMakeLists.txt b/paddle/fluid/framework/ir/CMakeLists.txt index 796ce1f91ce6f3e21dc6f0af8fca4960d43f6e2b..883575e41db2d883e9b969978419a10ffc58b97e 100644 --- a/paddle/fluid/framework/ir/CMakeLists.txt +++ b/paddle/fluid/framework/ir/CMakeLists.txt @@ -5,40 +5,50 @@ file(APPEND ${pass_file} "\#include \"paddle/fluid/framework/ir/pass.h\"\n") # Usage: pass_library(target inference) will append to paddle_inference_pass.h +unset(INFER_IR_PASSES CACHE) # clear the global variable function(pass_library TARGET DEST) set(options "") set(oneValueArgs "") set(multiValueArgs SRCS DEPS) cmake_parse_arguments(op_library "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) - cc_library(${TARGET} SRCS ${TARGET}.cc DEPS graph_pattern_detector pass ${op_library_DEPS}) + cc_library(${TARGET} SRCS ${TARGET}.cc DEPS graph_pattern_detector pass fuse_pass_base ${op_library_DEPS}) # add more DEST here, such as train, dist and collect USE_PASS into a file automatically. if (${DEST} STREQUAL "base" OR ${DEST} STREQUAL "inference") message(STATUS "add pass ${TARGET} ${DEST}") file(APPEND ${pass_file} "USE_PASS(${TARGET});\n") - set(PASS_LIBRARY ${TARGET} ${PASS_LIBRARY} PARENT_SCOPE) + set(INFER_IR_PASSES ${INFER_IR_PASSES} ${TARGET} CACHE INTERNAL "") endif() endfunction() + cc_library(node SRCS node.cc DEPS proto_desc) cc_library(graph SRCS graph.cc DEPS node pretty_log) cc_library(graph_helper SRCS graph_helper.cc DEPS graph) cc_library(pass SRCS pass.cc DEPS graph node graph_helper) cc_library(graph_traits SRCS graph_traits.cc DEPS graph) cc_library(graph_pattern_detector SRCS graph_pattern_detector.cc DEPS graph graph_helper graph_traits) +cc_library(fuse_pass_base SRCS fuse_pass_base.cc DEPS pass) pass_library(graph_to_program_pass base) pass_library(graph_viz_pass base) pass_library(fc_fuse_pass inference) -if (WITH_MKLDNN) - pass_library(conv_relu_mkldnn_fuse_pass inference) -endif () pass_library(attention_lstm_fuse_pass inference) pass_library(infer_clean_graph_pass inference) pass_library(fc_lstm_fuse_pass inference) pass_library(embedding_fc_lstm_fuse_pass inference) pass_library(fc_gru_fuse_pass inference) pass_library(seq_concat_fc_fuse_pass inference) +pass_library(multi_batch_merge_pass base) pass_library(conv_bn_fuse_pass inference) +pass_library(seqconv_eltadd_relu_fuse_pass inference) +pass_library(is_test_pass base) +if(WITH_MKLDNN) + pass_library(mkldnn_placement_pass base) + pass_library(depthwise_conv_mkldnn_pass base) + pass_library(conv_bias_mkldnn_fuse_pass inference) + pass_library(conv_relu_mkldnn_fuse_pass inference) + pass_library(conv_elementwise_add_mkldnn_fuse_pass inference) +endif() cc_library(fuse_elewise_add_act_pass SRCS fuse_elewise_add_act_pass.cc DEPS pass graph_pattern_detector ) @@ -46,12 +56,16 @@ set(GLOB_PASS_LIB ${PASS_LIBRARY} CACHE INTERNAL "Global PASS library") 
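pass_library() above gives every pass its own target and appends a USE_PASS(...) line to the generated paddle_inference_pass.h, whose only job is to pull each pass's static registrar into the final binary. A toy version of that static-registration idiom (simplified; the real REGISTER_PASS machinery lives in ir/pass.h):

#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct Pass {
  virtual ~Pass() = default;
  virtual void Apply() const = 0;
};

// Global name -> factory map, populated before main() runs.
std::map<std::string, std::function<std::unique_ptr<Pass>()>>& Registry() {
  static std::map<std::string, std::function<std::unique_ptr<Pass>()>> r;
  return r;
}

#define REGISTER_TOY_PASS(name, type)                 \
  static bool name##_registered = [] {                \
    Registry()[#name] = [] {                          \
      return std::unique_ptr<Pass>(new type);         \
    };                                                \
    return true;                                      \
  }()

struct HelloPass : Pass {
  void Apply() const override { std::cout << "hello pass applied\n"; }
};
REGISTER_TOY_PASS(hello_pass, HelloPass);

int main() { Registry()["hello_pass"]()->Apply(); }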
cc_library(pass_builder SRCS pass_builder.cc DEPS pass) +cc_test(node_test SRCS node_test.cc DEPS node) cc_test(pass_test SRCS pass_test.cc DEPS graph pass graph_helper) cc_test(graph_test SRCS graph_test.cc DEPS graph graph_helper op_registry) cc_test(graph_helper_test SRCS graph_helper_test.cc DEPS graph graph_helper op_registry) cc_test(graph_to_program_pass_test SRCS graph_to_program_pass_test.cc DEPS graph_to_program_pass) cc_test(test_graph_pattern_detector SRCS graph_pattern_detector_tester.cc DEPS graph_pattern_detector) cc_test(test_fc_fuse_pass SRCS fc_fuse_pass_tester.cc DEPS fc_fuse_pass framework_proto) +cc_test(test_is_test_pass SRCS is_test_pass_tester.cc DEPS is_test_pass) if (WITH_MKLDNN) + cc_test(test_depthwise_conv_mkldnn_pass SRCS depthwise_conv_mkldnn_pass_tester.cc DEPS depthwise_conv_mkldnn_pass) cc_test(test_conv_relu_mkldnn_fuse_pass SRCS conv_relu_mkldnn_fuse_pass_tester.cc DEPS conv_relu_mkldnn_fuse_pass) + cc_test(test_conv_elementwise_add_mkldnn_fuse_pass SRCS conv_elementwise_add_mkldnn_fuse_pass_tester.cc DEPS conv_elementwise_add_mkldnn_fuse_pass) endif () diff --git a/paddle/fluid/framework/ir/attention_lstm_fuse_pass.cc b/paddle/fluid/framework/ir/attention_lstm_fuse_pass.cc index 1c75cb5a82029b6a542a3a2f031a353f5e40f4ea..c436dd414d01ab61d143427fe7ecd34a82f11f8d 100644 --- a/paddle/fluid/framework/ir/attention_lstm_fuse_pass.cc +++ b/paddle/fluid/framework/ir/attention_lstm_fuse_pass.cc @@ -91,10 +91,10 @@ void FindWhileOp(Graph* graph) { #undef OP_SET_IN #undef OP_SET_OUT - auto* X = graph->RetriveNode(34); - auto* LSTMOUT = graph->RetriveNode(81); - auto* cell_init = graph->RetriveNode(6); - auto* hidden_init = graph->RetriveNode(8); + auto* X = graph->RetrieveNode(34); + auto* LSTMOUT = graph->RetrieveNode(81); + auto* cell_init = graph->RetrieveNode(6); + auto* hidden_init = graph->RetrieveNode(8); auto* lstm_op = graph->CreateOpNode(&op_desc); PrepareParameters(graph, param); @@ -147,19 +147,19 @@ void PrepareParameters(Graph* graph, const Param& param) { scope->Var(param.LSTMX)->GetMutable(); scope->Var(param.LSTMOUT)->GetMutable(); -#define GATE_W(name__) \ - auto* W_##name__##_w0 = scope->FindVar(#name__ ".w_0"); \ - auto* W_##name__##_w1 = scope->FindVar(#name__ ".w_1"); \ - auto* W_##name__##_b0 = scope->FindVar(#name__ ".b_0"); \ - CHECK_P3(W_##name__##_w0, W_##name__##_w1, W_##name__##_b0); \ - VLOG(4) << #name__ "_w0" \ - << " shape: " << W_##name__##_w0->Get().dims(); \ - VLOG(4) << #name__ "_w1" \ - << " shape: " << W_##name__##_w1->Get().dims(); \ - VLOG(4) << #name__ "_b0" \ - << " shape: " << W_##name__##_b0->Get().dims(); \ - auto& W_##name__##_w0_t = W_##name__##_w0->Get(); \ - auto& W_##name__##_w1_t = W_##name__##_w1->Get(); \ +#define GATE_W(name__) \ + auto* W_##name__##_w0 = scope->FindVar(#name__ ".w_0"); \ + auto* W_##name__##_w1 = scope->FindVar(#name__ ".w_1"); \ + auto* W_##name__##_b0 = scope->FindVar(#name__ ".b_0"); \ + CHECK_P3(W_##name__##_w0, W_##name__##_w1, W_##name__##_b0); \ + VLOG(40) << #name__ "_w0" \ + << " shape: " << W_##name__##_w0->Get().dims(); \ + VLOG(40) << #name__ "_w1" \ + << " shape: " << W_##name__##_w1->Get().dims(); \ + VLOG(40) << #name__ "_b0" \ + << " shape: " << W_##name__##_b0->Get().dims(); \ + auto& W_##name__##_w0_t = W_##name__##_w0->Get(); \ + auto& W_##name__##_w1_t = W_##name__##_w1->Get(); \ auto& W_##name__##_b0_t = W_##name__##_b0->Get(); GATE_W(forget); @@ -208,15 +208,15 @@ void PrepareLSTMWeight(const LoDTensor& W_forget_w0, int D = W_forget_w0.dims()[0]; int M = 
W_forget_w1.dims()[0]; out->Resize(make_ddim({D + M, 4 * D})); - VLOG(3) << "LSTMWeight resized to " << out->dims(); + VLOG(30) << "LSTMWeight resized to " << out->dims(); float* out_data = out->mutable_data(platform::CPUPlace()); - std::array tensors( - {{W_forget_w0.data(), W_input_w0.data(), - W_output_w0.data(), W_cell_w0.data()}}); - std::array tensors1( - {{W_forget_w1.data(), W_input_w1.data(), - W_output_w1.data(), W_cell_w1.data()}}); + std::array tensors{ + W_forget_w0.data(), W_input_w0.data(), + W_output_w0.data(), W_cell_w0.data()}; + std::array tensors1{ + W_forget_w1.data(), W_input_w1.data(), + W_output_w1.data(), W_cell_w1.data()}; for (int row = 0; row < D; row++) { for (int col = 0; col < 4; col++) { @@ -238,9 +238,9 @@ void PrepareLSTMWeight(const LoDTensor& W_forget_w0, void PrepareLSTMBias(const LoDTensor& B_forget, const LoDTensor& B_input, const LoDTensor& B_output, const LoDTensor& B_cell, LoDTensor* out) { - std::array tensors( - {{B_forget.data(), B_input.data(), B_output.data(), - B_cell.data()}}); + std::array tensors{ + B_forget.data(), B_input.data(), B_output.data(), + B_cell.data()}; PADDLE_ENFORCE_EQ(B_forget.dims().size(), 1); int D = B_forget.dims()[0]; @@ -262,7 +262,7 @@ std::unique_ptr AttentionLSTMFusePass::ApplyImpl( std::unordered_set specified_vars({"data_lod_attention", "cell_init", "hidden_init", "data", "week", "minute"}); - int count = 0; + size_t count = 0; for (auto* node : graph->Nodes()) { if (node->IsVar() && specified_vars.count(node->Name())) { ++count; diff --git a/paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.cc b/paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..c9c4d5afe5a0cd67ea14ae7abcf2b2bad1407e39 --- /dev/null +++ b/paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.cc @@ -0,0 +1,137 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
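PrepareLSTMWeight above interleaves the four gate matrices (forget, input, output, cell) row by row into one [D + M, 4 * D] buffer. A simplified sketch of the layout it produces, shrunk to four D x D gates for brevity:

#include <cstring>
#include <iostream>
#include <vector>

int main() {
  const int D = 2;
  std::vector<float> forget(D * D, 0.f), input(D * D, 1.f),
      output(D * D, 2.f), cell(D * D, 3.f);
  const float* gates[4] = {forget.data(), input.data(), output.data(),
                           cell.data()};

  // Row r of the packed buffer holds row r of each gate, side by side:
  // [forget_r | input_r | output_r | cell_r], each block D floats wide.
  std::vector<float> packed(D * 4 * D);
  for (int row = 0; row < D; ++row) {
    for (int col = 0; col < 4; ++col) {
      float* dst = packed.data() + 4 * D * row + D * col;
      std::memcpy(dst, gates[col] + D * row, D * sizeof(float));
    }
  }

  for (float v : packed) std::cout << v << ' ';  // 0 0 1 1 2 2 3 3 ...
  std::cout << '\n';
}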
+ +#include "paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.h" +#include +#include +#include +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace framework { +namespace ir { + +template +LoDTensor tensor_apply_eltwise(const LoDTensor& vec_a, const LoDTensor& vec_b, + BinaryOperation f) { + PADDLE_ENFORCE_EQ(vec_a.dims(), vec_b.dims()); + LoDTensor vec_y; + vec_y.Resize(vec_a.dims()); + const float* a = vec_a.data(); + const float* b = vec_b.data(); + float* y = vec_y.mutable_data(platform::CPUPlace()); + for (int i = 0; i < vec_a.numel(); i++) { + y[i] = f(a[i], b[i]); + } + return vec_y; +} + +std::unique_ptr ConvBiasFusePass::ApplyImpl( + std::unique_ptr graph) const { + PADDLE_ENFORCE(graph.get()); + FusePassBase::Init(name_scope_, graph.get()); + + auto* scope = param_scope(); + PADDLE_ENFORCE(scope); + + GraphPatternDetector gpd; + auto* conv_input = + gpd.mutable_pattern() + ->NewNode(patterns::PDNodeName(name_scope_, "conv_input")) + ->AsInput() + ->assert_is_op_input("conv2d", "Input"); + patterns::ConvBias conv_bias_pattern(gpd.mutable_pattern(), name_scope_); + conv_bias_pattern(conv_input); + int found_conv_bias_count = 0; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(40) << "handle ConvBias fuse"; + GET_IR_NODE_FROM_SUBGRAPH(conv_weight, conv_weight, + conv_bias_pattern); // Filter + GET_IR_NODE_FROM_SUBGRAPH(conv_out, conv_out, conv_bias_pattern); // tmp + GET_IR_NODE_FROM_SUBGRAPH(conv, conv, conv_bias_pattern); // CONV op + // bias + GET_IR_NODE_FROM_SUBGRAPH(eltwise_bias, eltwise_bias, conv_bias_pattern); + // output + GET_IR_NODE_FROM_SUBGRAPH(eltwise_out, eltwise_out, conv_bias_pattern); + // elementwise_add op + GET_IR_NODE_FROM_SUBGRAPH(eltwise, eltwise, conv_bias_pattern); + + PADDLE_ENFORCE(subgraph.count(conv_input)); + + // check if fuse can be done and if MKL-DNN should be used + FuseOptions fuse_option = FindFuseOption(*conv, *eltwise); + if (fuse_option == DO_NOT_FUSE || fuse_option == FUSE_NATIVE) { + VLOG(30) << "do not perform conv+bias fuse"; + return; + } + + auto* eltwise_bias_tensor = + scope->FindVar(eltwise_bias->Name())->GetMutable(); + + auto input_names = conv->Op()->InputNames(); + bool has_bias = std::find(input_names.begin(), input_names.end(), "Bias") != + input_names.end(); + if (has_bias && conv->Op()->Input("Bias").size() > 0) { + auto conv_bias_names = conv->Op()->Input("Bias"); + // add eltwise bias to existing conv bias + PADDLE_ENFORCE_EQ(conv_bias_names.size(), 1); + auto* conv_bias_var = scope->FindVar(conv_bias_names[0]); + auto* conv_bias_tensor = conv_bias_var->GetMutable(); + PADDLE_ENFORCE_EQ(conv_bias_tensor->dims(), eltwise_bias_tensor->dims()); + *conv_bias_tensor = tensor_apply_eltwise( + *conv_bias_tensor, *eltwise_bias_tensor, std::plus()); + + conv->Op()->SetOutput("Output", + std::vector({eltwise_out->Name()})); + + GraphSafeRemoveNodes(graph.get(), {eltwise, conv_out}); + + IR_NODE_LINK_TO(conv, eltwise_out); + } else { + // take eltwise bias as conv bias + OpDesc desc; + + desc.SetInput( + "Input", std::vector({subgraph.at(conv_input)->Name()})); + desc.SetInput("Filter", std::vector({conv_weight->Name()})); + desc.SetInput("Bias", std::vector({eltwise_bias->Name()})); + desc.SetOutput("Output", std::vector({eltwise_out->Name()})); + desc.SetType("conv2d"); + + for (auto& attr : conv->Op()->GetAttrMap()) { + desc.SetAttr(attr.first, attr.second); + } + auto conv_bias_node = g->CreateOpNode(&desc); + + 
IR_NODE_LINK_TO(subgraph.at(conv_input), conv_bias_node); + IR_NODE_LINK_TO(conv_weight, conv_bias_node); + IR_NODE_LINK_TO(eltwise_bias, conv_bias_node); + IR_NODE_LINK_TO(conv_bias_node, eltwise_out); + + GraphSafeRemoveNodes(graph.get(), {conv, eltwise, conv_out}); + } + + found_conv_bias_count++; + }; + gpd(graph.get(), handler); + AddStatis(found_conv_bias_count); + return graph; +} +} // namespace ir +} // namespace framework +} // namespace paddle +REGISTER_PASS(conv_bias_mkldnn_fuse_pass, + paddle::framework::ir::ConvBiasFusePass); diff --git a/paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.h b/paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..5775b83b88730ec298c421a15f5c0b83c27b0750 --- /dev/null +++ b/paddle/fluid/framework/ir/conv_bias_mkldnn_fuse_pass.h @@ -0,0 +1,36 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" +#include "paddle/fluid/framework/ir/pass.h" +namespace paddle { +namespace framework { +namespace ir { +/* +* Fuse the Conv and Elementwise_add to a ConvBiasOp. 
+*/ +class ConvBiasFusePass : public FusePassBase { + public: + virtual ~ConvBiasFusePass() {} + + protected: + std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + const std::string name_scope_{"conv_bias_mkldnn_fuse"}; +}; +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/conv_bn_fuse_pass.cc b/paddle/fluid/framework/ir/conv_bn_fuse_pass.cc index 04459612a726bcb60f9d752dfd8927b6f5c2500d..34b4c26ae3a8c281cd2729f67e49c78a8f440cc5 100644 --- a/paddle/fluid/framework/ir/conv_bn_fuse_pass.cc +++ b/paddle/fluid/framework/ir/conv_bn_fuse_pass.cc @@ -121,17 +121,26 @@ std::unique_ptr ConvBNFusePass::ApplyImpl( int found_conv_bn_count = 0; auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, Graph* g) { - VLOG(4) << "handle ConvBN fuse"; + VLOG(40) << "handle ConvBN fuse"; // conv, batch_norm, // conv_weight, conv_out, // bn_scale, bn_bias, bn_mean, bn_variance, - // bn_out, bn_mean_out, bn_variance_out, bn_saved_mean, bn_saved_variance + // bn_out, bn_mean_out, bn_variance_out, bn_saved_mean, + // bn_saved_variance GET_CONV_BN_NODES(conv_bn_pattern); + // check if fuse can be done and if MKL-DNN should be used + FuseOptions fuse_option = FindFuseOption(*conv, *batch_norm); + if (fuse_option == DO_NOT_FUSE) { + VLOG(30) << "do not perform conv+bn fuse"; + return; + } + // Create eltwise_y (conv bias) variable VarDesc eltwise_y_in_desc( patterns::PDNodeName(name_scope_, "eltwise_y_in")); + eltwise_y_in_desc.SetPersistable(true); auto* eltwise_y_in_node = g->CreateVarNode(&eltwise_y_in_desc); auto* eltwise_y_in_tensor = scope->Var(eltwise_y_in_node->Name())->GetMutable(); @@ -151,27 +160,59 @@ std::unique_ptr ConvBNFusePass::ApplyImpl( *bn_mean, *bn_variance, eltwise_y_in_tensor, epsilon); - // Create an elementwise add node - OpDesc desc; - desc.SetInput("X", std::vector({conv_out->Name()})); - desc.SetInput("Y", std::vector({eltwise_y_in_node->Name()})); - desc.SetOutput("Out", std::vector({bn_out->Name()})); - desc.SetType("elementwise_add"); - desc.SetAttr("axis", 1); - bool a = boost::get(conv->Op()->GetAttr("use_mkldnn")); - desc.SetAttr("use_mkldnn", a); - auto eltwise_op = g->CreateOpNode(&desc); // OpDesc will be copied. 
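The eltwise_y_in tensor filled in above from the batch-norm statistics is the BN transform folded down to a per-channel bias: y = scale * (x - mean) / sqrt(var + eps) + shift. A hedged numeric sketch of the bias part (the matching per-channel rescaling of the conv filter is omitted for brevity):

#include <cmath>
#include <iostream>
#include <vector>

int main() {
  const float eps = 1e-5f;
  std::vector<float> scale{1.f, 2.f}, shift{0.5f, -0.5f};
  std::vector<float> mean{0.1f, 0.2f}, var{1.f, 4.f};

  // Folding BN into a bias: b_c = shift_c - scale_c * mean_c / sqrt(var_c + eps)
  std::vector<float> fused_bias(scale.size());
  for (size_t c = 0; c < scale.size(); ++c) {
    float inv_std = 1.f / std::sqrt(var[c] + eps);
    fused_bias[c] = shift[c] - scale[c] * mean[c] * inv_std;
  }

  for (float b : fused_bias) std::cout << b << ' ';
  std::cout << '\n';
}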
- - GraphSafeRemoveNodes(graph.get(), {bn_scale, bn_bias, bn_mean, bn_variance, - batch_norm, bn_mean_out, bn_variance_out, - bn_saved_mean, bn_saved_variance}); - - PADDLE_ENFORCE(subgraph.count(conv_input)); - IR_NODE_LINK_TO(conv_out, eltwise_op); - IR_NODE_LINK_TO(eltwise_y_in_node, eltwise_op); - IR_NODE_LINK_TO(eltwise_op, bn_out); - - found_conv_bn_count++; + // with MKL-DNN fuse conv+bn into conv with bias + // without MKL-DNN fuse conv+bn into conv+elementwise_add + if (fuse_option == FUSE_MKLDNN) { + auto input_names = conv->Op()->InputNames(); + bool has_bias = std::find(input_names.begin(), input_names.end(), + "Bias") != input_names.end(); + if (has_bias && conv->Op()->Input("Bias").size() > 0) { + // reuse existing conv bias node + auto conv_bias_names = conv->Op()->Input("Bias"); + PADDLE_ENFORCE_EQ(conv_bias_names.size(), 1); + auto* conv_bias_var = scope->FindVar(conv_bias_names[0]); + auto* conv_bias_tensor = conv_bias_var->GetMutable(); + PADDLE_ENFORCE_EQ(conv_bias_tensor->dims(), + eltwise_y_in_tensor->dims()); + + auto eigen_conv_bias = EigenVector::From(*conv_bias_tensor); + eigen_conv_bias += EigenVector::From(*eltwise_y_in_tensor); + } else { + // add new conv_bias node + conv->Op()->SetInput( + "Bias", std::vector({eltwise_y_in_node->Name()})); + IR_NODE_LINK_TO(eltwise_y_in_node, conv); + } + conv->Op()->SetOutput("Output", + std::vector({bn_out->Name()})); + + GraphSafeRemoveNodes( + graph.get(), + {conv_out, bn_scale, bn_bias, bn_mean, bn_variance, batch_norm, + bn_mean_out, bn_variance_out, bn_saved_mean, bn_saved_variance}); + + IR_NODE_LINK_TO(conv, bn_out); + found_conv_bn_count++; + } else { // fuse_option == FUSE_NATIVE + // create an elementwise add node. + OpDesc desc; + desc.SetInput("X", std::vector({conv_out->Name()})); + desc.SetInput("Y", std::vector({eltwise_y_in_node->Name()})); + desc.SetOutput("Out", std::vector({bn_out->Name()})); + desc.SetType("elementwise_add"); + desc.SetAttr("axis", 1); + auto eltwise_op = g->CreateOpNode(&desc); // OpDesc will be copied. + + GraphSafeRemoveNodes( + graph.get(), + {bn_scale, bn_bias, bn_mean, bn_variance, batch_norm, bn_mean_out, + bn_variance_out, bn_saved_mean, bn_saved_variance}); + + IR_NODE_LINK_TO(conv_out, eltwise_op); + IR_NODE_LINK_TO(eltwise_y_in_node, eltwise_op); + IR_NODE_LINK_TO(eltwise_op, bn_out); + found_conv_bn_count++; + } }; gpd(graph.get(), handler); @@ -200,7 +241,7 @@ std::unique_ptr ConvEltwiseAddBNFusePass::ApplyImpl( int found_conv_bn_count = 0; auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, Graph* g) { - VLOG(4) << "handle ConvBN fuse"; + VLOG(40) << "handle ConvBN fuse"; // conv, batch_norm, // conv_weight, conv_out, @@ -237,7 +278,6 @@ std::unique_ptr ConvEltwiseAddBNFusePass::ApplyImpl( {bn_scale, bn_bias, bn_mean, bn_variance, batch_norm, bn_mean_out, bn_variance_out, bn_saved_mean, bn_saved_variance, eltwise_out}); - PADDLE_ENFORCE(subgraph.count(conv_input)); IR_NODE_LINK_TO(eltwise, bn_out); found_conv_bn_count++; diff --git a/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.cc b/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..5376fc163e259e5049955052baf02fd614aa511e --- /dev/null +++ b/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.cc @@ -0,0 +1,438 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. 
+// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.h" +#include +#include +#include +#include + +#include "paddle/fluid/framework/ir/graph_traits.h" + +namespace paddle { +namespace framework { +namespace ir { + +// The function keeps the graph consistent by replacing +// a node 'from' in the set of inputs nodes +// of the visited node by a node 'to'. +void CorrectGraphEdges(Graph* graph, Node* from, Node* to) { + for (auto& node : GraphTraits::DFS(*graph)) { + auto from_in_inputs = + std::find(std::begin(node.inputs), std::end(node.inputs), from); + + if (from_in_inputs != std::end(node.inputs)) { + IR_NODE_LINK_TO(to, (&node)); + + auto inputs = node.Op()->Inputs(); + + using input_type = VariableNameMap::value_type; + + std::for_each(std::begin(inputs), std::end(inputs), + [from, to, &node](const input_type& i) -> void { + auto param_names = i.second; + auto pi = std::find(std::begin(param_names), + std::end(param_names), from->Name()); + + if (pi != std::end(param_names)) { + node.Op()->SetInput(i.first, {to->Name()}); + } + }); + } + } +} + +bool IsReachable(ir::Graph* graph, Node* from, Node* to) { + auto find_node = [](ir::Graph* graph, const Node* node) -> Node* { + for (auto n : graph->Nodes()) { + if (n == node) { + return n; + } + } + + return nullptr; + }; + + if (from == to) { + return true; + } + + std::map visited; + + for (auto& node : GraphTraits::DFS(*graph)) { + visited[&node] = false; + } + + visited[from] = true; + + std::list queue; + queue.push_back(from); + + while (!queue.empty()) { + auto cur = find_node(graph, queue.front()); + queue.pop_front(); + + if (!cur) return false; + + for (auto n : cur->outputs) { + if (n == to) { + return true; + } + + if (!visited[n]) { + visited[n] = true; + queue.push_back(n); + } + } + } + return false; +} + +boost::optional HasBias(const Node& op, const std::string& bias_name) { + auto bias_input_names = op.Op()->Inputs(); + auto bias_it = bias_input_names.find(bias_name); + + if (bias_it != std::end(bias_input_names)) { + bool has_bias = !bias_it->second.empty(); + + if (has_bias) { + auto bias_names = bias_it->second; + auto bias_names_it = + std::find_if(std::begin(op.inputs), std::end(op.inputs), + [&bias_names](Node* n) -> bool { + return n->Name() == bias_names[0]; + }); + return *bias_names_it; + } + } + + return boost::none; +} + +ResidualConnectionMKLDNNFusePass::IdentityFuseHandle::IdentityFuseHandle( + const ResidualConnectionMKLDNNFusePass::CanFuseFunc& can_fuse_func, + const ResidualConnectionMKLDNNFusePass::IdentityConvFunc& + get_node_from_conv_op, + const ResidualConnectionMKLDNNFusePass::IdentityElementwiseAddFunc& + get_node_from_elementwise_add_op) + : fusion_stats{std::make_shared(0)}, + can_fuse_func{can_fuse_func}, + get_node_from_conv_op{get_node_from_conv_op}, + get_node_from_elementwise_add_op{get_node_from_elementwise_add_op} {} + +void ResidualConnectionMKLDNNFusePass::IdentityFuseHandle::operator()( + const 
GraphPatternDetector::subgraph_t& subgraph, Graph* graph) { + Node* conv_op; + Node* conv_input; + Node* conv_filter; + Node* conv_output; + + Node* elementwise_add_op; + Node* elementwise_add_identity; + Node* elementwise_add_out; + + std::tie(conv_op, conv_input, conv_filter, conv_output) = + get_node_from_conv_op(subgraph); + std::tie(elementwise_add_op, elementwise_add_identity, elementwise_add_out) = + get_node_from_elementwise_add_op(subgraph); + + if (!can_fuse_func(conv_op, elementwise_add_op)) return; + + if (!IsReachable(graph, elementwise_add_identity, conv_output)) return; + + OpDesc op_desc; + op_desc.SetType("conv2d"); + + op_desc.SetInput("Input", {conv_input->Name()}); + op_desc.SetInput("Filter", {conv_filter->Name()}); + op_desc.SetInput("ResidualData", {elementwise_add_identity->Name()}); + op_desc.SetOutput("Output", {conv_output->Name()}); + + auto conv_bias = HasBias(*conv_op, "Bias"); + + if (conv_bias) { + op_desc.SetInput("Bias", {(*conv_bias)->Name()}); + } + + for (const auto& attr : conv_op->Op()->GetAttrMap()) { + op_desc.SetAttr(attr.first, attr.second); + } + + op_desc.SetAttr("fuse_residual_connection", true); + + auto fused_conv_op = graph->CreateOpNode(&op_desc); + + IR_NODE_LINK_TO(conv_input, fused_conv_op); + IR_NODE_LINK_TO(conv_filter, fused_conv_op); + IR_NODE_LINK_TO(elementwise_add_identity, fused_conv_op); + IR_NODE_LINK_TO(fused_conv_op, conv_output); + + if (conv_bias) { + IR_NODE_LINK_TO((*conv_bias), fused_conv_op); + } + + CorrectGraphEdges(graph, elementwise_add_out, conv_output); + GraphSafeRemoveNodes(graph, + {elementwise_add_out, conv_op, elementwise_add_op}); + (*fusion_stats)++; +} + +ResidualConnectionMKLDNNFusePass::ProjectionFuseHandle::ProjectionFuseHandle( + const ResidualConnectionMKLDNNFusePass::CanFuseFunc& can_fuse_func, + const ResidualConnectionMKLDNNFusePass::ProjectionConvFunc& + get_node_from_conv_x_op, + const ResidualConnectionMKLDNNFusePass::ProjectionConvFunc& + get_node_from_conv_y_op, + const ResidualConnectionMKLDNNFusePass::ProjectionElementwiseAddFunc& + get_node_from_elementwise_add_op) + : fusion_stats{std::make_shared(0)}, + can_fuse_func{can_fuse_func}, + get_node_from_conv_x_op{get_node_from_conv_x_op}, + get_node_from_conv_y_op{get_node_from_conv_y_op}, + get_node_from_elementwise_add_op{get_node_from_elementwise_add_op} {} + +void ResidualConnectionMKLDNNFusePass::ProjectionFuseHandle::operator()( + const GraphPatternDetector::subgraph_t& subgraph, Graph* graph) { + Node* conv_x_op; + Node* conv_x_input; + Node* conv_x_filter; + Node* conv_x_output; + + Node* conv_y_op; + Node* conv_y_input; + Node* conv_y_filter; + Node* conv_y_output; + + Node* elementwise_add_op; + Node* elementwise_add_out; + + std::tie(conv_x_op, conv_x_input, conv_x_filter, conv_x_output) = + get_node_from_conv_x_op(subgraph); + std::tie(conv_y_op, conv_y_input, conv_y_filter, conv_y_output) = + get_node_from_conv_y_op(subgraph); + std::tie(elementwise_add_op, elementwise_add_out) = + get_node_from_elementwise_add_op(subgraph); + + if (!can_fuse_func(conv_x_op, elementwise_add_op)) return; + if (!can_fuse_func(conv_y_op, elementwise_add_op)) return; + + Node* projection_node; + Node* residual_conv_op; + Node* residual_conv_input; + Node* residual_conv_filter; + Node* residual_conv_output; + + if (IsReachable(graph, conv_x_input, conv_y_output)) { + projection_node = conv_x_output; + residual_conv_op = conv_y_op; + residual_conv_input = conv_y_input; + residual_conv_filter = conv_y_filter; + residual_conv_output = conv_y_output; + } 
else if (IsReachable(graph, conv_y_input, conv_x_output)) { + projection_node = conv_y_output; + residual_conv_op = conv_x_op; + residual_conv_input = conv_x_input; + residual_conv_filter = conv_x_filter; + residual_conv_output = conv_x_output; + } else { + return; + } + + OpDesc op_desc; + op_desc.SetType("conv2d"); + + op_desc.SetInput("Input", {residual_conv_input->Name()}); + op_desc.SetInput("Filter", {residual_conv_filter->Name()}); + op_desc.SetInput("ResidualData", {projection_node->Name()}); + op_desc.SetOutput("Output", {residual_conv_output->Name()}); + + auto residual_conv_bias = HasBias(*residual_conv_op, "Bias"); + + if (residual_conv_bias) { + op_desc.SetInput("Bias", {(*residual_conv_bias)->Name()}); + } + + for (const auto& attr : residual_conv_op->Op()->GetAttrMap()) { + op_desc.SetAttr(attr.first, attr.second); + } + + op_desc.SetAttr("fuse_residual_connection", true); + + auto fused_conv_op = graph->CreateOpNode(&op_desc); + + IR_NODE_LINK_TO(residual_conv_input, fused_conv_op); + IR_NODE_LINK_TO(residual_conv_filter, fused_conv_op); + IR_NODE_LINK_TO(projection_node, fused_conv_op); + IR_NODE_LINK_TO(fused_conv_op, residual_conv_output); + + if (residual_conv_bias) { + IR_NODE_LINK_TO((*residual_conv_bias), fused_conv_op); + } + + CorrectGraphEdges(graph, elementwise_add_out, residual_conv_output); + GraphSafeRemoveNodes( + graph, {elementwise_add_out, residual_conv_op, elementwise_add_op}); + (*fusion_stats)++; +} + +std::tuple +ResidualConnectionMKLDNNFusePass::GetNodesFromConv( + const patterns::Conv& conv_pattern, + const GraphPatternDetector::subgraph_t& subgraph) const { + GET_IR_NODE_FROM_SUBGRAPH(conv_op, conv_op, conv_pattern); + GET_IR_NODE_FROM_SUBGRAPH(conv_input, conv_input, conv_pattern); + GET_IR_NODE_FROM_SUBGRAPH(conv_filter, conv_filter, conv_pattern); + GET_IR_NODE_FROM_SUBGRAPH(conv_output, conv_output, conv_pattern); + + return std::make_tuple(conv_op, conv_input, conv_filter, conv_output); +} + +GraphWithStats ResidualConnectionMKLDNNFusePass::FuseConvAsX( + const std::string& name_scope, + const GraphWithStats& graph_with_stats) const { + ir::Graph* graph; + int stats; + + std::tie(graph, stats) = graph_with_stats; + + GraphPatternDetector gpd; + auto pattern = gpd.mutable_pattern(); + + patterns::Conv conv_pattern{pattern, name_scope}; + auto conv_output = conv_pattern(); + + patterns::ElementwiseAdd elementwise_add_pattern{pattern, name_scope}; + elementwise_add_pattern( + conv_output, + pattern->NewNode(elementwise_add_pattern.elementwise_add_y_repr())); + conv_output->AsIntermediate(); + + auto get_node_from_elementwise_add = [&elementwise_add_pattern]( + const GraphPatternDetector::subgraph_t& subgraph) + -> std::tuple { + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_op, elementwise_add_op, + elementwise_add_pattern); + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_y, elementwise_add_y, + elementwise_add_pattern); + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_out, elementwise_add_out, + elementwise_add_pattern); + + return std::make_tuple(elementwise_add_op, elementwise_add_y, + elementwise_add_out); + }; + + return ExecuteHandleOnGraph( + &gpd, graph_with_stats, + [this, &conv_pattern](const GraphPatternDetector::subgraph_t& subgraph) { + return GetNodesFromConv(conv_pattern, subgraph); + }, + get_node_from_elementwise_add); +} + +GraphWithStats ResidualConnectionMKLDNNFusePass::FuseConvAsY( + const std::string& name_scope, + const GraphWithStats& graph_with_stats) const { + GraphPatternDetector gpd; + auto pattern = gpd.mutable_pattern(); + + 
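// FuseConvAsY matches the convolution output as the Y operand of
+  // elementwise_add; FuseConvAsX above covers the symmetric X case.
+  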
patterns::Conv conv_pattern{pattern, name_scope}; + auto conv_output = conv_pattern(); + + patterns::ElementwiseAdd elementwise_add_pattern{pattern, name_scope}; + elementwise_add_pattern( + pattern->NewNode(elementwise_add_pattern.elementwise_add_x_repr()), + conv_output); + conv_output->AsIntermediate(); + + auto get_node_from_elementwise_add = [&elementwise_add_pattern]( + const GraphPatternDetector::subgraph_t& subgraph) + -> std::tuple { + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_op, elementwise_add_op, + elementwise_add_pattern); + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_x, elementwise_add_x, + elementwise_add_pattern); + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_out, elementwise_add_out, + elementwise_add_pattern); + + return std::make_tuple(elementwise_add_op, elementwise_add_x, + elementwise_add_out); + }; + + return ExecuteHandleOnGraph( + &gpd, graph_with_stats, + [this, &conv_pattern](const GraphPatternDetector::subgraph_t& subgraph) { + return GetNodesFromConv(conv_pattern, subgraph); + }, + get_node_from_elementwise_add); +} + +GraphWithStats ResidualConnectionMKLDNNFusePass::FuseProjectionConv( + const std::string& name_scope, + const GraphWithStats& graph_with_stats) const { + GraphPatternDetector gpd; + auto pattern = gpd.mutable_pattern(); + + patterns::Conv conv_x_pattern{pattern, name_scope}; + auto conv_x_output = conv_x_pattern(); + + patterns::Conv conv_y_pattern{pattern, name_scope}; + auto conv_y_output = conv_y_pattern(); + + patterns::ElementwiseAdd elementwise_add_pattern{pattern, name_scope}; + elementwise_add_pattern(conv_x_output, conv_y_output); + conv_x_output->AsIntermediate(); + conv_y_output->AsIntermediate(); + + auto get_node_from_elementwise_add = [&elementwise_add_pattern]( + const GraphPatternDetector::subgraph_t& subgraph) + -> std::tuple { + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_op, elementwise_add_op, + elementwise_add_pattern); + GET_IR_NODE_FROM_SUBGRAPH(elementwise_add_out, elementwise_add_out, + elementwise_add_pattern); + + return std::make_tuple(elementwise_add_op, elementwise_add_out); + }; + + return ExecuteHandleOnGraph( + &gpd, graph_with_stats, + [this, + &conv_x_pattern](const GraphPatternDetector::subgraph_t& subgraph) { + return GetNodesFromConv(conv_x_pattern, subgraph); + }, + [this, + &conv_y_pattern](const GraphPatternDetector::subgraph_t& subgraph) { + return GetNodesFromConv(conv_y_pattern, subgraph); + }, + get_node_from_elementwise_add); +} + +graph_ptr ResidualConnectionMKLDNNFusePass::ApplyImpl(graph_ptr graph) const { + FusePassBase::Init(name_scope_, graph.get()); + auto fused_graph_with_stats = FuseConvAsY( + name_scope_, + FuseConvAsX( + name_scope_, + FuseProjectionConv(name_scope_, std::make_pair(graph.get(), 0)))); + + std::cout << "Fused graph " << fused_graph_with_stats.second << std::endl; + AddStatis(fused_graph_with_stats.second); + return graph; +} +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(conv_elementwise_add_mkldnn_fuse_pass, + paddle::framework::ir::ResidualConnectionMKLDNNFusePass); diff --git a/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.h b/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..6629dae425ae85446fe2f6c8c172ca53f5ae8bea --- /dev/null +++ b/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.h @@ -0,0 +1,133 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. 
+// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +#include + +namespace paddle { +namespace framework { +namespace ir { + +using graph_ptr = std::unique_ptr; +using GraphWithStats = std::pair; + +void CorrectGraphEdges(Graph* graph, Node* from, Node* to); +bool IsReachable(ir::Graph* graph, Node* from, Node* to); +boost::optional HasBias(const Node& op, const std::string& bias_name); + +class ResidualConnectionMKLDNNFusePass : public FusePassBase { + private: + GraphWithStats FuseConvAsX(const std::string& name_scope, + const GraphWithStats& graph_with_stats) const; + GraphWithStats FuseConvAsY(const std::string& name_scope, + const GraphWithStats& graph_with_stats) const; + GraphWithStats FuseProjectionConv( + const std::string& name_scope, + const GraphWithStats& graph_with_stats) const; + + template + using GetNodeFunc = + std::function; + using IdentityConvFunc = GetNodeFunc>; + using IdentityElementwiseAddFunc = + GetNodeFunc>; + + using ProjectionConvFunc = IdentityConvFunc; + using ProjectionElementwiseAddFunc = GetNodeFunc>; + + using CanFuseFunc = std::function; + + std::tuple GetNodesFromConv( + const patterns::Conv& conv_pattern, + const GraphPatternDetector::subgraph_t& subgraph) const; + + std::tuple GetNodesFromProjectionConv( + const patterns::Conv& conv_pattern, + const GraphPatternDetector::subgraph_t& subgraph) const; + + template + GraphWithStats ExecuteHandleOnGraph(GraphPatternDetector* gpd, + const GraphWithStats& graph_with_stats, + OpFuncs&&... 
op_funcs) const { + ir::Graph* graph; + int stats; + + std::tie(graph, stats) = graph_with_stats; + + auto can_fuse = [this](Node* op1, Node* op2) -> bool { + return this->FindFuseOption(*op1, *op2) == FUSE_MKLDNN; + }; + + auto fuse_handle = HandleType{can_fuse, std::forward(op_funcs)...}; + + (*gpd)(graph, fuse_handle); + + return std::make_pair(graph, stats + fuse_handle.get_stats()); + } + + struct IdentityFuseHandle { + IdentityFuseHandle( + const CanFuseFunc& can_fuse_func, + const IdentityConvFunc& get_node_from_conv_op, + const IdentityElementwiseAddFunc& get_node_from_elementwise_add_op); + + void operator()(const GraphPatternDetector::subgraph_t& subgraph, + Graph* graph); + int get_stats() const { return *fusion_stats; } + + private: + std::shared_ptr fusion_stats; + CanFuseFunc can_fuse_func; + IdentityConvFunc get_node_from_conv_op; + IdentityElementwiseAddFunc get_node_from_elementwise_add_op; + }; + + struct ProjectionFuseHandle { + ProjectionFuseHandle( + const CanFuseFunc& can_fuse_func, + const ProjectionConvFunc& get_node_from_conv_x_op, + const ProjectionConvFunc& get_node_from_conv_y_op, + const ProjectionElementwiseAddFunc& get_node_from_elementwise_add_op); + + void operator()(const GraphPatternDetector::subgraph_t& subgraph, + Graph* graph); + int get_stats() const { return *fusion_stats; } + + private: + std::shared_ptr fusion_stats; + CanFuseFunc can_fuse_func; + ProjectionConvFunc get_node_from_conv_x_op; + ProjectionConvFunc get_node_from_conv_y_op; + ProjectionElementwiseAddFunc get_node_from_elementwise_add_op; + }; + + public: + virtual ~ResidualConnectionMKLDNNFusePass() {} + + protected: + std::unique_ptr ApplyImpl(graph_ptr graph) const; + + const std::string name_scope_{"residual_connection_fuse_pass"}; +}; +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass_tester.cc b/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..61ba097fd8cb55e25bda1947ea97d53308c55bd3 --- /dev/null +++ b/paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass_tester.cc @@ -0,0 +1,244 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
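+//
+// Unit tests for conv_elementwise_add_mkldnn_fuse_pass: each test builds a
+// small ProgramDesc, applies the pass, and then checks graph reachability
+// and the expected conv2d / elementwise_add op counts.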
+ +#include +#include + +#include "paddle/fluid/framework/ir/conv_elementwise_add_mkldnn_fuse_pass.h" +#include "paddle/fluid/framework/ir/graph_traits.h" + +namespace paddle { +namespace framework { +namespace ir { + +namespace { +constexpr int nodes_removed = 3; +constexpr int nodes_added = 1; + +void SetOp(ProgramDesc* prog, const std::string& type, + const std::vector>& inputs, + const std::pair& output) { + auto op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + op->SetAttr("use_mkldnn", true); + + for (const auto& input : inputs) { + op->SetInput(input.first, {input.second}); + } + + op->SetOutput(output.first, {output.second}); +} + +struct TestIsReachable { + using func = std::function; + + auto operator()(const std::unique_ptr& graph) -> func { + auto find_node = [](const std::unique_ptr& graph, + const std::string& name) -> Node* { + for (auto& node : GraphTraits::DFS(*graph)) { + if (name == node.Name()) { + return &node; + } + } + + return nullptr; + }; + + return [&](std::string from, const std::string to) -> bool { + if (from == to) return true; + + std::map visited; + + for (auto& node : GraphTraits::DFS(*graph)) { + visited[node.Name()] = false; + } + + visited[from] = true; + + std::list queue; + queue.push_back(from); + + while (!queue.empty()) { + auto cur = find_node(graph, queue.front()); + queue.pop_front(); + + if (cur == nullptr) return false; + + for (auto n : cur->outputs) { + if (n->Name() == to) return true; + + if (!visited[n->Name()]) { + visited[n->Name()] = true; + queue.push_back(n->Name()); + } + } + } + return false; + }; + } +}; + +void AssertOpsCount(const std::unique_ptr& graph, + int expected_conv_count, + int expected_elementwise_add_count = 0) { + int conv_count = 0; + int elementwise_add_count = 0; + + for (auto* node : graph->Nodes()) { + if (node->IsOp() && node->Op()->Type() == "conv2d") { + ++conv_count; + } + if (node->IsOp() && node->Op()->Type() == "elementwise_add") { + ++elementwise_add_count; + } + } + EXPECT_EQ(conv_count, expected_conv_count); + EXPECT_EQ(elementwise_add_count, expected_elementwise_add_count); +} + +ProgramDesc BuildProgramDesc(const std::vector& transient_vars, + const std::vector& persistent_vars) { + ProgramDesc prog; + + auto add_var_to_prog = [&prog](const std::string& var_name) -> VarDesc* { + auto var = prog.MutableBlock(0)->Var(var_name); + var->SetType(proto::VarType::LOD_TENSOR); + + return var; + }; + + for (const auto& v : transient_vars) { + add_var_to_prog(v); + } + + for (const auto& v : persistent_vars) { + auto var = add_var_to_prog(v); + var->SetPersistable(true); + } + + return prog; +} + +void RunPassAndAssert(ProgramDesc* prog, const std::string& from, + const std::string& to, int expected_conv_num) { + std::unique_ptr graph(new ir::Graph(*prog)); + + TestIsReachable is_reachable; + EXPECT_TRUE(is_reachable(graph)(from, to)); + + auto pass = + PassRegistry::Instance().Get("conv_elementwise_add_mkldnn_fuse_pass"); + int original_nodes_num = graph->Nodes().size(); + graph = pass->Apply(std::move(graph)); + int current_nodes_num = graph->Nodes().size(); + + EXPECT_TRUE(is_reachable(graph)(from, to)); + + EXPECT_EQ(original_nodes_num - nodes_removed + nodes_added, + current_nodes_num); + + AssertOpsCount(graph, expected_conv_num); +} +} // namespace + +TEST(ConvElementwiseAddMKLDNNFusePass, ConvolutionAsYWithElementwiseAddRelu) { + auto prog = BuildProgramDesc({"a", "b", "c", "d", "e"}, {"bias", "weights"}); + + SetOp(&prog, "sigmoid", {{"X", "a"}}, {"Out", "b"}); + SetOp(&prog, "conv2d", + 
{{"Input", "b"}, {"Bias", "bias"}, {"Filter", "weights"}}, + {"Output", "c"}); + + SetOp(&prog, "elementwise_add", {{"X", "a"}, {"Y", "c"}}, {"Out", "d"}); + SetOp(&prog, "relu", {{"X", "d"}}, {"Out", "e"}); + + RunPassAndAssert(&prog, "a", "relu", 1); +} + +TEST(ConvElementwiseAddMKLDNNFusePass, + ConvolutionAsYWithElementwiseAddReluNoBias) { + auto prog = BuildProgramDesc({"a", "b", "c", "d", "e"}, {"weights"}); + + SetOp(&prog, "sigmoid", {{"X", "a"}}, {"Out", "b"}); + SetOp(&prog, "conv2d", {{"Input", "b"}, {"Filter", "weights"}}, + {"Output", "c"}); + SetOp(&prog, "elementwise_add", {{"X", "a"}, {"Y", "c"}}, {"Out", "d"}); + SetOp(&prog, "relu", {{"X", "d"}}, {"Out", "e"}); + + RunPassAndAssert(&prog, "a", "relu", 1); +} + +TEST(ConvElementwiseAddMKLDNNFusePass, ConvolutionAsXWithElementwiseAddRelu) { + auto prog = BuildProgramDesc({"a", "b", "c", "d", "e"}, {"bias", "weights"}); + + SetOp(&prog, "sigmoid", {{"X", "a"}}, {"Out", "b"}); + SetOp(&prog, "conv2d", + {{"Input", "b"}, {"Bias", "bias"}, {"Filter", "weights"}}, + {"Output", "c"}); + + SetOp(&prog, "elementwise_add", {{"X", "c"}, {"Y", "a"}}, {"Out", "d"}); + SetOp(&prog, "relu", {{"X", "d"}}, {"Out", "e"}); + + RunPassAndAssert(&prog, "a", "relu", 1); +} + +TEST(ConvElementwiseAddMKLDNNFusePass, + ConvolutionAsXWithElementwiseAddReluNoBias) { + auto prog = BuildProgramDesc({"a", "b", "c", "d", "e"}, {"weights"}); + + SetOp(&prog, "sigmoid", {{"X", "a"}}, {"Out", "b"}); + SetOp(&prog, "conv2d", {{"Input", "b"}, {"Filter", "weights"}}, + {"Output", "c"}); + SetOp(&prog, "elementwise_add", {{"X", "c"}, {"Y", "a"}}, {"Out", "d"}); + SetOp(&prog, "relu", {{"X", "d"}}, {"Out", "e"}); + + RunPassAndAssert(&prog, "a", "relu", 1); +} + +TEST(ConvElementwiseAddMKLDNNFusePass, NoFusion) { + auto prog = + BuildProgramDesc({"a", "b", "c", "d", "e", "f", "g"}, {"weights"}); + + SetOp(&prog, "sigmoid", {{"X", "a"}}, {"Out", "b"}); + SetOp(&prog, "conv2d", {{"Input", "b"}, {"Filter", "weights"}}, + {"Output", "c"}); + + SetOp(&prog, "conv2d", {{"Input", "d"}, {"Filter", "weights"}}, + {"Output", "e"}); + + SetOp(&prog, "elementwise_add", {{"X", "c"}, {"Y", "e"}}, {"Out", "f"}); + SetOp(&prog, "relu", {{"X", "f"}}, {"Out", "g"}); + + std::unique_ptr graph(new ir::Graph(prog)); + + TestIsReachable is_reachable; + EXPECT_TRUE(is_reachable(graph)("a", "g")); + + auto pass = + PassRegistry::Instance().Get("conv_elementwise_add_mkldnn_fuse_pass"); + int original_nodes_num = graph->Nodes().size(); + graph = pass->Apply(std::move(graph)); + int current_nodes_num = graph->Nodes().size(); + + EXPECT_TRUE(is_reachable(graph)("a", "g")); + EXPECT_EQ(original_nodes_num, current_nodes_num); + + AssertOpsCount(graph, 2, 1); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(conv_elementwise_add_mkldnn_fuse_pass); diff --git a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.cc b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.cc index d7df6389cfd595324e284e0da10f65213ccee80f..048868e1f913e9df3d985b9e66c075a02a7f0bcb 100644 --- a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.cc +++ b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.cc @@ -38,7 +38,7 @@ std::unique_ptr ConvReLUFusePass::ApplyImpl( int found_conv_relu_count = 0; auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, Graph* g) { - VLOG(4) << "handle ConvReLU fuse"; + VLOG(40) << "handle ConvReLU fuse"; GET_IR_NODE_FROM_SUBGRAPH(conv_weight, conv_weight, conv_relu_pattern); // Filter GET_IR_NODE_FROM_SUBGRAPH(conv_out, 
conv_out, conv_relu_pattern); // tmp @@ -46,6 +46,12 @@ std::unique_ptr ConvReLUFusePass::ApplyImpl( GET_IR_NODE_FROM_SUBGRAPH(relu_out, relu_out, conv_relu_pattern); // Out GET_IR_NODE_FROM_SUBGRAPH(relu, relu, conv_relu_pattern); // ReLU op + FuseOptions fuse_option = FindFuseOption(*conv, *relu); + if (fuse_option == DO_NOT_FUSE) { + VLOG(30) << "do not perform conv+relu fuse"; + return; + } + // Transform Conv node into ConvReLU node. OpDesc* desc = conv->Op(); desc->SetOutput("Output", std::vector({relu_out->Name()})); diff --git a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h index b5de0d548713772e7ad41cfb6d8b3e9460683efb..fe585bd7c41bb32ae00462e989ab4c0051fc89a8 100644 --- a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h +++ b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h @@ -31,7 +31,8 @@ class ConvReLUFusePass : public FusePassBase { virtual ~ConvReLUFusePass() {} protected: - std::unique_ptr ApplyImpl(std::unique_ptr graph) const; + std::unique_ptr ApplyImpl( + std::unique_ptr graph) const override; }; } // namespace ir diff --git a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass_tester.cc b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass_tester.cc index 9dd780ec89ab991d6d99cb66fa2a9b683be2b9ca..19248b4dfee1da81d18cd2effac08ba68dde80fb 100644 --- a/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass_tester.cc +++ b/paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass_tester.cc @@ -15,25 +15,30 @@ #include "paddle/fluid/framework/ir/conv_relu_mkldnn_fuse_pass.h" #include +#include "paddle/fluid/framework/op_proto_maker.h" namespace paddle { namespace framework { namespace ir { -void SetOp(ProgramDesc* prog, const std::string& type, +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, const std::vector& inputs, - const std::vector& outputs) { + const std::vector& outputs, bool use_mkldnn = false) { auto* op = prog->MutableBlock(0)->AppendOp(); op->SetType(type); if (type == "conv2d") { - op->SetAttr("use_mkldnn", true); + op->SetAttr("use_mkldnn", use_mkldnn); + op->SetAttr("name", name); op->SetInput("Input", {inputs[0]}); op->SetInput("Filter", {inputs[1]}); op->SetInput("Bias", {inputs[2]}); } else if (type == "relu") { + op->SetAttr("use_mkldnn", use_mkldnn); op->SetInput("X", inputs); } op->SetOutput("Out", outputs); + op->SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), + static_cast(OpRole::kForward)); } // a->OP0->b @@ -43,7 +48,8 @@ void SetOp(ProgramDesc* prog, const std::string& type, ProgramDesc BuildProgramDesc() { ProgramDesc prog; for (auto& v : - std::vector({"a", "b", "c", "weights", "bias", "f", "g"})) { + std::vector({"a", "b", "c", "weights", "bias", "f", "g", + "h", "weights2", "bias2", "k", "l"})) { auto* var = prog.MutableBlock(0)->Var(v); var->SetType(proto::VarType::SELECTED_ROWS); if (v == "weights" || v == "bias") { @@ -51,14 +57,24 @@ ProgramDesc BuildProgramDesc() { } } - SetOp(&prog, "OP0", std::vector({"a"}), + SetOp(&prog, "OP0", "op0", std::vector({"a"}), std::vector({"b"})); - SetOp(&prog, "OP1", std::vector({"b"}), + SetOp(&prog, "OP1", "op1", std::vector({"b"}), std::vector({"c"})); - SetOp(&prog, "conv2d", std::vector({"c", "weights", "bias"}), - std::vector({"f"})); - SetOp(&prog, "relu", std::vector({"f"}), - std::vector({"g"})); + // conv+relu, both with MKL-DNN + SetOp(&prog, "conv2d", "conv1", + std::vector({"c", "weights", "bias"}), + std::vector({"f"}), true); + SetOp(&prog, "relu", "relu1", std::vector({"f"}), + 
std::vector({"g"}), true); + SetOp(&prog, "OP3", "op3", std::vector({"g"}), + std::vector({"h"})); + // conv+relu, only one with MKL-DNN + SetOp(&prog, "conv2d", "conv2", + std::vector({"h", "weights2", "bias2"}), + std::vector({"k"}), true); + SetOp(&prog, "relu", "relu2", std::vector({"k"}), + std::vector({"l"})); return prog; } @@ -88,10 +104,16 @@ TEST(ConvReLUFusePass, basic) { auto* op = node->Op(); ASSERT_TRUE(op->HasAttr("use_mkldnn")); EXPECT_TRUE(boost::get(op->GetAttr("use_mkldnn"))); - ASSERT_TRUE(op->HasAttr("fuse_relu")); - bool fuse_relu = boost::get(op->GetAttr("fuse_relu")); - if (fuse_relu) { - ++conv_relu_count; + // check if only "conv1" convolution is fused + auto op_name = boost::get(op->GetAttr("name")); + if (op_name == "conv1") { + ASSERT_TRUE(op->HasAttr("fuse_relu")); + bool fuse_relu = boost::get(op->GetAttr("fuse_relu")); + if (fuse_relu) { + ++conv_relu_count; + } + } else if (op_name == "conv2") { + ASSERT_FALSE(op->HasAttr("fuse_relu")); } } } diff --git a/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.cc b/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..5f3334578d10f64b197215bfc11d08e30747cb90 --- /dev/null +++ b/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.cc @@ -0,0 +1,58 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +#define GET_NODE(id, pattern) \ + PADDLE_ENFORCE(subgraph.count(pattern.RetrieveNode(#id)), \ + "pattern has no Node called %s", #id); \ + auto* id = subgraph.at(pattern.RetrieveNode(#id)); \ + PADDLE_ENFORCE_NOT_NULL(id, "subgraph has no node %s", #id); + +std::unique_ptr DepthwiseConvMKLDNNPass::ApplyImpl( + std::unique_ptr graph) const { + PADDLE_ENFORCE(graph.get()); + FusePassBase::Init("depthwise_conv_mkldnn_pass", graph.get()); + GraphPatternDetector gpd; + + auto* pattern = gpd.mutable_pattern(); + pattern->NewNode("depthwise_conv") + ->assert_is_op("depthwise_conv2d") + ->assert_op_attr("use_mkldnn", true); + + int found_depthwise_conv_mkldnn_count = 0; + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(30) << "handle DepthwiseConvMKLDNN fuse"; + GET_NODE(depthwise_conv, (*pattern)); + depthwise_conv->Op()->SetType("conv2d"); + found_depthwise_conv_mkldnn_count++; + }; + + gpd(graph.get(), handler); + AddStatis(found_depthwise_conv_mkldnn_count); + return graph; +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(depthwise_conv_mkldnn_pass, + paddle::framework::ir::DepthwiseConvMKLDNNPass); diff --git a/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.h b/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..8ca6a7325186401c26eb7f9375cf83b7b97cc1c9 --- /dev/null +++ b/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.h @@ -0,0 +1,34 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" + +namespace paddle { +namespace framework { +namespace ir { + +class DepthwiseConvMKLDNNPass : public FusePassBase { + public: + virtual ~DepthwiseConvMKLDNNPass() {} + + protected: + std::unique_ptr ApplyImpl( + std::unique_ptr graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass_tester.cc b/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..09d0b15f46a7e50afb6aea46383013ce6a6c6118 --- /dev/null +++ b/paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass_tester.cc @@ -0,0 +1,123 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/depthwise_conv_mkldnn_pass.h" + +#include + +namespace paddle { +namespace framework { +namespace ir { + +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, + const std::vector& inputs, + const std::vector& outputs, bool use_mkldnn = false) { + auto* op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + op->SetAttr("use_mkldnn", use_mkldnn); + op->SetAttr("name", name); + op->SetInput("Input", {inputs[0]}); + op->SetInput("Filter", {inputs[1]}); + op->SetInput("Bias", {inputs[2]}); + op->SetOutput("Out", outputs); +} + +// (a, weights, bias)->depthwise conv mkldnn->b +// (b, weights2, bias2)->depthwise conv no mkldnn->c +// (c, weights3, bias3)->conv mkldnn->d +// (d, weights3, bias3)->conv no mkldnn->e +ProgramDesc BuildProgramDesc() { + ProgramDesc prog; + for (auto& v : std::vector( + {"a", "b", "c", "d", "e", "weights", "bias", "weights2", "bias2", + "weights3", "bias3", "weights4", "bias4"})) { + auto* var = prog.MutableBlock(0)->Var(v); + var->SetType(proto::VarType::SELECTED_ROWS); + if (v == "weights" || v == "bias" || v == "weights2" || v == "bias2" || + v == "weights3" || v == "bias3" || v == "weights4" || v == "bias4") { + var->SetPersistable(true); + } + } + + // depthwise conv with MKL-DNN + SetOp(&prog, "depthwise_conv2d", "conv1", + std::vector({"a", "weights", "bias"}), + std::vector({"b"}), true); + // depthwise conv without MKL-DNN + SetOp(&prog, "depthwise_conv2d", "conv2", + std::vector({"b", "weights2", "bias2"}), + std::vector({"c"}), false); + // conv with MKL-DNN + SetOp(&prog, "conv2d", "conv3", + std::vector({"c", "weights3", "bias3"}), + std::vector({"d"}), true); + // conv without MKL-dNN + SetOp(&prog, "conv2d", "conv4", + std::vector({"d", "weights4", "bias4"}), + std::vector({"e"}), false); + + return prog; +} + +TEST(DepthwiseConvMKLDNNPass, basic) { + auto prog = BuildProgramDesc(); + + std::unique_ptr graph(new ir::Graph(prog)); + + auto pass = PassRegistry::Instance().Get("depthwise_conv_mkldnn_pass"); + + struct counters { + int mkldnn_depthwise_conv_nodes; + int other_depthwise_conv_nodes; + int mkldnn_conv_nodes; + int other_conv_nodes; + }; + + counters before{1, 1, 1, 1}; + + graph = pass->Apply(std::move(graph)); + + // initialize counters before loop + counters after{0, 0, 0, 0}; + + for (auto* node : graph->Nodes()) { + if (node->IsOp()) { + auto* op = node->Op(); + if (op->Type() == "conv2d") { + if (boost::get(op->GetAttr("use_mkldnn"))) + after.mkldnn_conv_nodes++; + else + after.other_conv_nodes++; + } else if (op->Type() == "depthwise_conv2d") { + if (boost::get(op->GetAttr("use_mkldnn"))) + after.mkldnn_depthwise_conv_nodes++; + else + after.other_depthwise_conv_nodes++; + } + } + } + + EXPECT_EQ(after.other_depthwise_conv_nodes, + before.other_depthwise_conv_nodes); + EXPECT_EQ(after.other_conv_nodes, before.other_conv_nodes); + EXPECT_EQ(after.mkldnn_depthwise_conv_nodes, + before.mkldnn_depthwise_conv_nodes - 1); + EXPECT_EQ(after.mkldnn_conv_nodes, before.mkldnn_conv_nodes + 1); +} + +} // namespace ir +} // namespace framework +} // namespace 
paddle + +USE_PASS(depthwise_conv_mkldnn_pass); diff --git a/paddle/fluid/framework/ir/fc_fuse_pass.cc b/paddle/fluid/framework/ir/fc_fuse_pass.cc index ca704c7f5631bbaa88f1bc2caaa22fd021de11c4..7b6ce0da07309a0ed2a5c8bcd5f59d84105261d7 100644 --- a/paddle/fluid/framework/ir/fc_fuse_pass.cc +++ b/paddle/fluid/framework/ir/fc_fuse_pass.cc @@ -39,7 +39,7 @@ std::unique_ptr FCFusePass::ApplyImpl( int found_fc_count = 0; auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, Graph* g) { - VLOG(4) << "handle FC fuse"; + VLOG(40) << "handle FC fuse"; GET_IR_NODE_FROM_SUBGRAPH(w, w, fc_pattern); GET_IR_NODE_FROM_SUBGRAPH(fc_bias, bias, fc_pattern); GET_IR_NODE_FROM_SUBGRAPH(fc_out, Out, fc_pattern); @@ -57,6 +57,7 @@ std::unique_ptr FCFusePass::ApplyImpl( desc.SetInput("W", std::vector({fc_Y_in})); desc.SetInput("Bias", std::vector({fc_bias_in})); desc.SetOutput("Out", std::vector({fc_out_out})); + desc.SetAttr("in_num_col_dims", mul->Op()->GetAttr("x_num_col_dims")); desc.SetType("fc"); auto fc_node = g->CreateOpNode(&desc); // OpDesc will be copied. GraphSafeRemoveNodes(graph.get(), {mul, elementwise_add, mul_out}); diff --git a/paddle/fluid/framework/ir/fc_fuse_pass_tester.cc b/paddle/fluid/framework/ir/fc_fuse_pass_tester.cc index 06286a109d01af638e74e06ccc83e2a5500663ea..4e1e4e27f9ba932b56ecc25e816a2aee9d42362e 100644 --- a/paddle/fluid/framework/ir/fc_fuse_pass_tester.cc +++ b/paddle/fluid/framework/ir/fc_fuse_pass_tester.cc @@ -15,6 +15,7 @@ #include "paddle/fluid/framework/ir/fc_fuse_pass.h" #include +#include "paddle/fluid/framework/op_proto_maker.h" namespace paddle { namespace framework { @@ -28,10 +29,13 @@ void SetOp(ProgramDesc* prog, const std::string& type, if (type == "mul") { op->SetInput("X", {inputs[0]}); op->SetInput("Y", {inputs[1]}); + op->SetAttr("x_num_col_dims", {1}); } else if (type == "elementwise_add") { op->SetInput("X", inputs); } op->SetOutput("Out", outputs); + op->SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), + static_cast(OpRole::kForward)); } // a->OP0->b diff --git a/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.cc b/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.cc index 648acc4a759417240d9a39749b059289182ebb1e..8ed68905beed2faedc34f194070cc76e8ff3c32d 100644 --- a/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.cc +++ b/paddle/fluid/framework/ir/fuse_elewise_add_act_pass.cc @@ -61,7 +61,7 @@ std::unique_ptr FuseElewiseAddActPass::FuseElewiseAddAct( auto handler = [&](const GraphPatternDetector::subgraph_t &subgraph, Graph *g) { - VLOG(4) << "handle FuseElewiseAddAct fuse"; + VLOG(40) << "handle FuseElewiseAddAct fuse"; GET_IR_NODE_FROM_SUBGRAPH(ele_y, ele_y, elewise_add_act_pattern); GET_IR_NODE_FROM_SUBGRAPH(ele_out, elewise_add_out, elewise_add_act_pattern); @@ -77,10 +77,10 @@ std::unique_ptr FuseElewiseAddActPass::FuseElewiseAddAct( Node *elewise_add_act_node = CreateFuseElewiseAddActNode( g, act, ele_add, ele_x_n, ele_y_n, ele_out_n, act_out_n); - VLOG(4) << "\n\t " << ele_x_n << " and " << ele_y_n << " -> " - << ele_add->Name() << " -> " << ele_out_n << "\n" - << "\t " << ele_out_n << " -> " << act->Name() << " -> " - << act_out_n; + VLOG(40) << "\n\t " << ele_x_n << " and " << ele_y_n << " -> " + << ele_add->Name() << " -> " << ele_out_n << "\n" + << "\t " << ele_out_n << " -> " << act->Name() << " -> " + << act_out_n; ReLinkNodes(g, ele_out, ele_add, act, elewise_add_act_node); found_elewise_add_act_count++; @@ -113,7 +113,7 @@ std::unique_ptr FuseElewiseAddActPass::FuseActElewiseAdd( auto handler = [&](const 
GraphPatternDetector::subgraph_t &subgraph, Graph *g) { - VLOG(4) << "handle FuseElewiseAddAct fuse"; + VLOG(40) << "handle FuseElewiseAddAct fuse"; GET_IR_NODE_FROM_SUBGRAPH(act_out, act_out, act_elewise_add_pattern); GET_IR_NODE_FROM_SUBGRAPH(ele_x, ele_x, act_elewise_add_pattern); GET_IR_NODE_FROM_SUBGRAPH(ele_out, elewise_add_out, @@ -129,9 +129,9 @@ std::unique_ptr FuseElewiseAddActPass::FuseActElewiseAdd( Node *elewise_add_act_node = CreateFuseElewiseAddActNode( g, ele_add, act, elewise_add_x_n, act_i_n, act_o_n, elewise_add_out_n); - VLOG(4) << "\n\t " << act_i_n << " -> " << act->Name() << " -> " << act_o_n - << "\n\t " << act_o_n << " and " << elewise_add_x_n << " -> " - << ele_add->Name() << " -> " << elewise_add_out_n; + VLOG(40) << "\n\t " << act_i_n << " -> " << act->Name() << " -> " << act_o_n + << "\n\t " << act_o_n << " and " << elewise_add_x_n << " -> " + << ele_add->Name() << " -> " << elewise_add_out_n; ReLinkNodes(g, act_out, act, ele_add, elewise_add_act_node); found_elewise_add_act_count++; @@ -165,7 +165,7 @@ std::unique_ptr FuseElewiseAddActPass::FuseElewiseAddActInplaceGrad( auto handler = [&](const GraphPatternDetector::subgraph_t &subgraph, Graph *g) { - VLOG(4) << "handle FuseElewiseAddActGrad1 fuse"; + VLOG(40) << "handle FuseElewiseAddActGrad1 fuse"; GET_IR_NODE_FROM_SUBGRAPH(act_out, act_out, elewise_add_act_grad_pattern); GET_IR_NODE_FROM_SUBGRAPH(act_grad, act_grad, elewise_add_act_grad_pattern); GET_IR_NODE_FROM_SUBGRAPH(d_itermediate_out, d_itermediate_out, @@ -208,10 +208,10 @@ std::unique_ptr FuseElewiseAddActPass::FuseElewiseAddActInplaceGrad( auto fused_node = g->CreateOpNode(&desc); - VLOG(4) << "\n\t " << d_act_out_n << " and " << act_out_n << " -> " - << act_grad->Name() << " -> " << d_itermediate_out_n << "\n\t " - << d_itermediate_out_n << " and " << act_out_n << " -> " - << ele_add_grad->Name() << " -> " << d_itermediate_out_n; + VLOG(40) << "\n\t " << d_act_out_n << " and " << act_out_n << " -> " + << act_grad->Name() << " -> " << d_itermediate_out_n << "\n\t " + << d_itermediate_out_n << " and " << act_out_n << " -> " + << ele_add_grad->Name() << " -> " << d_itermediate_out_n; ReLinkNodes(g, d_itermediate_out, act_grad, ele_add_grad, fused_node); found_elewise_add_act_count++; diff --git a/paddle/fluid/framework/ir/fuse_pass_base.cc b/paddle/fluid/framework/ir/fuse_pass_base.cc new file mode 100644 index 0000000000000000000000000000000000000000..d70010089e4b4fbb4542ef7748b8e9ece48d3942 --- /dev/null +++ b/paddle/fluid/framework/ir/fuse_pass_base.cc @@ -0,0 +1,62 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
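+//
+// FindFuseOption (below) inspects the use_mkldnn attribute of both candidate
+// ops: FUSE_MKLDNN when both are MKL-DNN ops, FUSE_NATIVE when neither is,
+// DO_NOT_FUSE for a mixed pair, and always FUSE_NATIVE when built without
+// PADDLE_WITH_MKLDNN.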
+ +#include "paddle/fluid/framework/ir/fuse_pass_base.h" + +namespace paddle { +namespace framework { +namespace ir { + +void FusePassBase::Init(const std::string& repr, Graph* graph) const { + repr_ = repr; + graph_ = graph; +} + +Scope* FusePassBase::param_scope() const { + PADDLE_ENFORCE(graph_->Has(kParamScopeAttr)); + return graph_->Get(kParamScopeAttr); +} + +void FusePassBase::AddStatis(int count_of_fused) const { + PADDLE_ENFORCE(graph_); + PADDLE_ENFORCE(!repr_.empty()); + if (!graph_->Has(kFuseStatisAttr)) { + graph_->Set(kFuseStatisAttr, new std::unordered_map); + } + auto& info = + graph_->Get>(kFuseStatisAttr); + info[repr_] = count_of_fused; +} + +FuseOptions FusePassBase::FindFuseOption(const Node& node1, + const Node& node2) const { +#ifdef PADDLE_WITH_MKLDNN + bool node1_mkldnn = node1.Op()->HasAttr("use_mkldnn") && + boost::get(node1.Op()->GetAttr("use_mkldnn")); + bool node2_mkldnn = node2.Op()->HasAttr("use_mkldnn") && + boost::get(node2.Op()->GetAttr("use_mkldnn")); + if (node1_mkldnn && node2_mkldnn) + return FUSE_MKLDNN; + else if (!node1_mkldnn && !node2_mkldnn) + return FUSE_NATIVE; + else + return DO_NOT_FUSE; +#else + return FUSE_NATIVE; +#endif +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/fuse_pass_base.h b/paddle/fluid/framework/ir/fuse_pass_base.h index 877bbeb502252cac77095981641d7ce283ca1eb7..c53b2a6186741d86f14faf1d21fa19aa09cec036 100644 --- a/paddle/fluid/framework/ir/fuse_pass_base.h +++ b/paddle/fluid/framework/ir/fuse_pass_base.h @@ -25,32 +25,24 @@ namespace ir { static const char kParamScopeAttr[] = "__param_scope__"; static const char kFuseStatisAttr[] = "__fuse_statis__"; +enum FuseOptions { + DO_NOT_FUSE, // fusing will not be done + FUSE_NATIVE, // fusing will be done without MKL-DNN + FUSE_MKLDNN // fusing will be done with MKL-DNN +}; + class FusePassBase : public Pass { public: - void Init(const std::string& repr, Graph* graph) const { - repr_ = repr; - graph_ = graph; - } - - Scope* param_scope() const { - PADDLE_ENFORCE(graph_->Has(kParamScopeAttr)); - return graph_->Get(kParamScopeAttr); - } - - void AddStatis(int count_of_fused) const { - PADDLE_ENFORCE(graph_); - PADDLE_ENFORCE(!repr_.empty()); - if (!graph_->Has(kFuseStatisAttr)) { - graph_->Set(kFuseStatisAttr, new std::unordered_map); - } - auto& info = - graph_->Get>(kFuseStatisAttr); - info[repr_] = count_of_fused; - } + void Init(const std::string& repr, Graph* graph) const; + Scope* param_scope() const; + void AddStatis(int count_of_fused) const; virtual ~FusePassBase() {} protected: + virtual FuseOptions FindFuseOption(const Node& node1, + const Node& node2) const; + mutable Graph* graph_; mutable std::string repr_; }; diff --git a/paddle/fluid/framework/ir/graph.cc b/paddle/fluid/framework/ir/graph.cc index 398f7095968e62f92d610f560d7574b27706d13e..ae0e42ff5e89466013382ab97650e6afeeff3d2d 100644 --- a/paddle/fluid/framework/ir/graph.cc +++ b/paddle/fluid/framework/ir/graph.cc @@ -23,80 +23,81 @@ limitations under the License. 
*/ namespace paddle { namespace framework { namespace ir { - -std::vector FindDistTrainSendVars( - const std::vector &nodes) { - std::vector send_vars; - // since parameters are all in block 0, - // it's enough to only scan send ops in block 0 - for (auto &node : nodes) { - auto op_vars = node->Op()->InputArgumentNames(); - send_vars.reserve(send_vars.size() + - std::distance(op_vars.begin(), op_vars.end())); - send_vars.insert(send_vars.end(), op_vars.begin(), op_vars.end()); - } - return send_vars; -} - -std::vector FindDistTrainRecvVars( - const std::vector &nodes) { - std::vector recv_vars; - for (auto &node : nodes) { - auto op_vars = node->Op()->OutputArgumentNames(); - recv_vars.reserve(recv_vars.size() + - std::distance(op_vars.begin(), op_vars.end())); - recv_vars.insert(recv_vars.end(), op_vars.begin(), op_vars.end()); - } - return recv_vars; -} - -bool IsDistTrainOp(ir::Node *node, const std::vector &send_vars, - const std::vector &recv_vars) { - if (send_vars.size() == 0 || recv_vars.size() == 0) { - return false; - } - - /** - * Check any of opvars contains `.block` and in sendvars - */ - auto checker = [](const std::vector &opvars, - const std::vector &rpc_vars) -> bool { - for (auto &var : opvars) { - // a variable name with the suffix `.block` means it's a splited - // variable by (DistributeTranspiler) - // [python/paddle/fluid/transpiler/distribute_transpiler.py] - if (var.find(".block") != std::string::npos && - std::find(rpc_vars.begin(), rpc_vars.end(), var) != rpc_vars.end()) { - return true; - } +namespace { + +void CheckProgram(const ProgramDesc &program) { +#define _INT(role) static_cast(role) + + std::map visit; + for (OpDesc *op : program.Block(0).AllOps()) { + // For backward compatibility, some program doesn't have role added. + if (!op->HasAttr(OpProtoAndCheckerMaker::OpRoleAttrName())) continue; + int role_id = + boost::get(op->GetAttr(OpProtoAndCheckerMaker::OpRoleAttrName())); + visit[role_id] = true; + switch (role_id) { + case _INT(OpRole::kForward): + if (visit.find(_INT(OpRole::kBackward)) != visit.end()) { + LOG(ERROR) + << "Cannot add backward operator before forward operator %s." + << op->Type(); + } + break; + case _INT(OpRole::kBackward): + case _INT(OpRole::kBackward) | _INT(OpRole::kLoss): + PADDLE_ENFORCE( + visit.find(_INT(OpRole::kOptimize)) == visit.end(), + "Cannot add backward operator %s after optimize operator.", + op->Type()); + break; + case _INT(OpRole::kForward) | _INT(OpRole::kLoss): + PADDLE_ENFORCE(visit.find(_INT(OpRole::kBackward) | + _INT(OpRole::kLoss)) == visit.end(), + "Cannot add backward|loss operator before " + "forward|loss operator %s.", + op->Type()); + PADDLE_ENFORCE( + visit.find(_INT(OpRole::kOptimize)) == visit.end(), + "Cannot add forward|loss operator %s after optimize operator.", + op->Type()); + break; + case _INT(OpRole::kOptimize): + case _INT(OpRole::kOptimize) | _INT(OpRole::kLRSched): + PADDLE_ENFORCE(visit.find(_INT(OpRole::kBackward)) != visit.end(), + "Optimize operators %s must follow backward operator.", + op->Type()); + break; + case _INT(OpRole::kLRSched): + case _INT(OpRole::kDist): + case _INT(OpRole::kRPC): + case _INT(OpRole::kNotSpecified): + break; + default: + LOG(FATAL) << "Unknown operator role. 
Don't add new role because " + "you don't know what you are doing."; } - return false; - }; - - std::vector input_var_names; - std::vector output_var_names; - for (ir::Node *input : node->inputs) { - input_var_names.push_back(input->Name()); - } - for (ir::Node *output : node->outputs) { - output_var_names.push_back(output->Name()); } - return checker(output_var_names, send_vars) || - checker(input_var_names, recv_vars); +#undef _INT } +} // namespace Graph::Graph(const ProgramDesc &program) : program_(program) { - // Make the nodes id start from 0. - Node::ResetId(); + CheckProgram(program_); + auto var_nodes = InitFromProgram(program_); + ResolveHazard(var_nodes); +} - VLOG(3) << "block in program:" << program_.Size(); +std::map> Graph::InitFromProgram( + const ProgramDesc &program) { + VLOG(30) << "block in program:" << program_.Size(); std::unordered_map all_vars; + // var nodes for each var name, will have multiple versions in SSA + std::map> var_nodes; for (auto *var : program.Block(0).AllVars()) { all_vars.emplace(var->Name(), var); } - std::map> var_nodes; for (auto *op : program.Block(0).AllOps()) { ir::Node *node = CreateOpNode(op); // For input args, reuse the same var name if it was created before. @@ -134,7 +135,11 @@ Graph::Graph(const ProgramDesc &program) : program_(program) { var->inputs.push_back(node); } } + return std::move(var_nodes); +} +void Graph::ResolveHazard( + const std::map> &var_nodes) { /** * We should handle write after read(WAR) and write after write(WAW) here. * Because some of the operators of the program can be executed parallelly. @@ -153,6 +158,7 @@ Graph::Graph(const ProgramDesc &program) : program_(program) { auto it_old = versions.rbegin(); ++it_old; for (; it_old != versions.rend(); it_new = it_old, ++it_old) { + VLOG(30) << "deal with var: " << (*it_new)->Name(); ir::Node *write_op = (*it_new)->inputs.empty() ? nullptr : (*it_new)->inputs[0]; const auto &read_ops = (*it_old)->outputs; diff --git a/paddle/fluid/framework/ir/graph.h b/paddle/fluid/framework/ir/graph.h index ab687e760a761d4e445726bd5149966adc2403d0..0c856f8e610077c69416ccfb8a763d4b8ae881b8 100644 --- a/paddle/fluid/framework/ir/graph.h +++ b/paddle/fluid/framework/ir/graph.h @@ -89,7 +89,7 @@ class Graph { attr_name); attrs_[attr_name] = attr; attr_dels_[attr_name] = [attr, attr_name]() { - VLOG(3) << "deleting " << attr_name; + VLOG(30) << "deleting " << attr_name; delete attr; }; } @@ -102,18 +102,31 @@ class Graph { attr_dels_[attr_name] = []() {}; } + template + void Erase(const std::string &attr_name) { + PADDLE_ENFORCE(attrs_.count(attr_name) != 0, "%s not set in the graph", + attr_name); + attr_dels_[attr_name](); + attrs_.erase(attr_name); + attr_dels_.erase(attr_name); + } + const std::unordered_set &Nodes() const { return node_set_; } // Create a normal variable with non-null VarDesc. ir::Node *CreateVarNode(VarDesc *var_desc) { PADDLE_ENFORCE(var_desc); - return AddNode(new ir::Node(var_desc)); + auto *x = AddNode(new ir::Node(var_desc)); + x->SetId(num_node_created_++); + return x; } // Create a normal runnable operator with OpDesc. ir::Node *CreateOpNode(OpDesc *op_desc) { PADDLE_ENFORCE(op_desc); - return AddNode(new ir::Node(op_desc)); + auto *x = AddNode(new ir::Node(op_desc)); + x->SetId(num_node_created_++); + return x; } // Create a control dependency var that connects 2 operations. The @@ -123,13 +136,17 @@ class Graph { // TODO(panyx0718): control var name should be really unique. 
const std::string name = string::Sprintf( "%s@%llu", ir::Node::kControlDepVarName, node_set_.size()); - return AddNode(new ir::Node(name, ir::Node::Type::kVariable)); + auto *x = AddNode(new ir::Node(name, ir::Node::Type::kVariable)); + x->SetId(num_node_created_++); + return x; } // A more free style way of creating a graph node. Mostly use for test // or "copy" from another node. Avoid using it if possible. ir::Node *CreateEmptyNode(const std::string &name, ir::Node::Type type) { - return AddNode(new ir::Node(name, type)); + auto *x = AddNode(new ir::Node(name, type)); + x->SetId(num_node_created_++); + return x; } // Clear all node information of the graph and return the ownership of the @@ -151,7 +168,7 @@ class Graph { } // NOTE low performance, but simple and secure. - Node *RetriveNode(int id) { + Node *RetrieveNode(int id) { for (auto &node : nodes_) { if (node.second->id() == id) { return node.second.get(); @@ -160,6 +177,13 @@ class Graph { return nullptr; } + const ProgramDesc &program() const { return program_; } + std::map> InitFromProgram( + const ProgramDesc &program); + + void ResolveHazard( + const std::map> &var_nodes); + private: // This method takes ownership of `node`. ir::Node *AddNode(ir::Node *node) { @@ -175,6 +199,7 @@ class Graph { std::map> attr_dels_; std::map> nodes_; std::unordered_set node_set_; + size_t num_node_created_{0}; // helps generate a unique node id. }; bool IsControlDepVar(const ir::Node &var); diff --git a/paddle/fluid/framework/ir/graph_helper.cc b/paddle/fluid/framework/ir/graph_helper.cc index c54766d95a61ac1a4b61566c6de62cbc86685a1d..963179192fa6cc959db66f76e0f48393143be0da 100644 --- a/paddle/fluid/framework/ir/graph_helper.cc +++ b/paddle/fluid/framework/ir/graph_helper.cc @@ -15,8 +15,15 @@ limitations under the License. */ #include "paddle/fluid/framework/ir/graph_helper.h" #include #include +#include +#include +#include #include +DEFINE_string(print_sub_graph_dir, "", + "FLAGS_print_sub_graph_dir is used " + "to print the nodes of sub_graphs."); + namespace paddle { namespace framework { namespace ir { @@ -33,8 +40,9 @@ void SortHelper( } } - VLOG(3) << "topology sort insert: " << node->Name() - << reinterpret_cast(node) << " input " << node->inputs.size(); + VLOG(30) << "topology sort insert: " << node->Name() + << reinterpret_cast(node) << " input " + << node->inputs.size(); ret->push_back(node); } @@ -103,9 +111,9 @@ std::map> BuildOperationAdjList( for (auto &var : n->inputs) { for (auto &adj_n : var->inputs) { PADDLE_ENFORCE(adj_n->NodeType() == ir::Node::Type::kOperation); - VLOG(4) << "adj " << adj_n->Name() << reinterpret_cast(adj_n) - << " -> " << n->Name() << reinterpret_cast(n) - << " via " << var->Name() << reinterpret_cast(var); + VLOG(40) << "adj " << adj_n->Name() << reinterpret_cast(adj_n) + << " -> " << n->Name() << reinterpret_cast(n) + << " via " << var->Name() << reinterpret_cast(var); adj_list[n].insert(adj_n); } } @@ -120,19 +128,25 @@ size_t GraphNum(const Graph &graph) { std::deque q_nodes; std::vector> graph_nodes; std::unordered_set g_nodes; + // q_set is used to track the nodes currently in the queue.
+ std::unordered_set q_set; size_t graph_count = 0; - auto traverse_nodes = [&visited_nodes, - &q_nodes](const std::vector &nodes) { - std::copy_if( - nodes.begin(), nodes.end(), std::back_inserter(q_nodes), - [&visited_nodes](Node *node) { return !visited_nodes.count(node); }); + auto traverse_nodes = [&visited_nodes, &q_nodes, + &q_set](const std::vector &nodes) { + for (auto n : nodes) { + if (visited_nodes.count(n) == 0 && q_set.count(n) == 0) { + q_nodes.push_back(n); + q_set.insert(n); + } + } }; while (visited_nodes.size() != nodes.size()) { if (!q_nodes.empty()) { auto cur_node = q_nodes.front(); q_nodes.pop_front(); + q_set.erase(cur_node); visited_nodes.insert(cur_node); g_nodes.insert(cur_node); traverse_nodes(cur_node->inputs); @@ -146,6 +160,7 @@ size_t GraphNum(const Graph &graph) { for (auto &n : nodes) { if (visited_nodes.count(n) == 0) { q_nodes.push_back(n); + q_set.insert(n); break; } } @@ -156,12 +171,15 @@ size_t GraphNum(const Graph &graph) { graph_nodes.emplace_back(g_nodes); } - if (VLOG_IS_ON(10)) { - VLOG(10) << "graph_num: " << graph_nodes.size(); - for (auto &g_n : graph_nodes) { - VLOG(10) << "graph_nodes: " << g_n.size(); - if (g_n.size() < 10) { - std::stringstream out; + if (FLAGS_print_sub_graph_dir.size()) { + if (graph_nodes.size() > 1) { + std::stringstream out; + for (auto &g_n : graph_nodes) { + out << "graph_nodes: " << g_n.size() << "\n"; + } + out << "\n\n"; + for (auto &g_n : graph_nodes) { + out << "graph_nodes: " << g_n.size(); for (auto &node : g_n) { out << "\nNode: " << node->Name() << " in ["; for (auto &n : node->inputs) { @@ -173,8 +191,12 @@ size_t GraphNum(const Graph &graph) { } out << "]"; } - VLOG(10) << out.str(); + out << "\n\n\n"; } + std::unique_ptr fout( + new std::ofstream(FLAGS_print_sub_graph_dir)); + PADDLE_ENFORCE(fout->good()); + *fout << out.str(); } } diff --git a/paddle/fluid/framework/ir/graph_helper.h b/paddle/fluid/framework/ir/graph_helper.h index ec46b38c01b8c369ab37b4fbd5497ec120d8db91..8d92c406689ab3a97596a8666ceb452aec4be170 100644 --- a/paddle/fluid/framework/ir/graph_helper.h +++ b/paddle/fluid/framework/ir/graph_helper.h @@ -37,6 +37,15 @@ std::vector TopologySortOperations(const Graph &graph); std::map> BuildOperationAdjList( const Graph &graph); +template +std::vector FilterByNodeWrapper(const Graph &graph) { + std::vector ret; + for (ir::Node *n : graph.Nodes()) { + if (n->IsWrappedBy()) ret.push_back(&n->Wrapper()); + } + return ret; +} + } // namespace ir } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/ir/graph_helper_test.cc b/paddle/fluid/framework/ir/graph_helper_test.cc index cea902809339f9d45b0e2525163f08a3c1c44c95..260a73ae763bd2cdea9948e4d928377a7c718dda 100644 --- a/paddle/fluid/framework/ir/graph_helper_test.cc +++ b/paddle/fluid/framework/ir/graph_helper_test.cc @@ -200,15 +200,15 @@ TEST(GraphHelperTest, GraphNum) { Graph g(prog); BuildZeroGraph(&g); - ASSERT_EQ(GraphNum(g), 0); + ASSERT_EQ(GraphNum(g), 0UL); Graph g2(prog); BuildOneGraph(&g2); - ASSERT_EQ(GraphNum(g2), 1); + ASSERT_EQ(GraphNum(g2), 1UL); Graph g3(prog); BuildTwoGraphs(&g3); - ASSERT_EQ(GraphNum(g3), 2); + ASSERT_EQ(GraphNum(g3), 2UL); } } // namespace ir diff --git a/paddle/fluid/framework/ir/graph_pattern_detector.cc b/paddle/fluid/framework/ir/graph_pattern_detector.cc index 8625b562e7dfab5a65692863cdc22b62ce15d758..f1f971656ae6ab6bbf66c4a75dd7cf68b5848b7b 100644 --- a/paddle/fluid/framework/ir/graph_pattern_detector.cc +++ b/paddle/fluid/framework/ir/graph_pattern_detector.cc @@ -12,6 +12,7 
@@ // See the License for the specific language governing permissions and // limitations under the License. +#include #include #include #include @@ -91,19 +92,19 @@ void GraphPatternDetector::operator()(Graph *graph, PrettyLogEndl(Style::detail(), "--- detect %d subgraphs", subgraphs.size()); int id = 0; for (auto &g : subgraphs) { - VLOG(3) << "optimizing #" << id++ << " subgraph"; + VLOG(30) << "optimizing #" << id++ << " subgraph"; handler(g, graph); } } bool GraphPatternDetector::MarkPDNodesInGraph(const ir::Graph &graph) { - VLOG(3) << "mark pdnodes in graph"; + VLOG(30) << "mark pdnodes in graph"; if (graph.Nodes().empty()) return false; for (auto &node : GraphTraits::DFS(graph)) { for (const auto &pdnode : pattern_.nodes()) { if (pdnode->Tell(&node)) { - VLOG(4) << "pdnode " << pdnode->name() << " marked"; + VLOG(40) << "pdnode " << pdnode->name() << " marked"; pdnodes2nodes_[pdnode.get()].insert(&node); } } @@ -111,7 +112,7 @@ bool GraphPatternDetector::MarkPDNodesInGraph(const ir::Graph &graph) { // Check to early stop if some PDNode can't find matched Node. for (auto &pdnode : pattern_.nodes()) { if (!pdnodes2nodes_.count(pdnode.get())) { - VLOG(4) << pdnode->name() << " can't find matched Node, early stop"; + VLOG(40) << pdnode->name() << " can't find matched Node, early stop"; // return false; } } @@ -120,7 +121,7 @@ bool GraphPatternDetector::MarkPDNodesInGraph(const ir::Graph &graph) { GetMarkedNodes(const_cast(&graph)).insert(n); } } - VLOG(3) << pdnodes2nodes_.size() << " nodes marked"; + VLOG(30) << pdnodes2nodes_.size() << " nodes marked"; return !pdnodes2nodes_.empty(); } @@ -166,10 +167,12 @@ struct HitGroup { bool Match(Node *node, PDNode *pat) { if (nodes_.count(node)) { - if (!roles.count(pat)) return false; - return roles[pat] == node; + if (roles.count(pat) && roles[pat] == node) return true; + return false; + } else { + if (roles.count(pat) && roles[pat] != node) return false; + return true; } - return !roles.count(pat) || roles.at(pat) == node; } void Register(Node *node, PDNode *pat) { @@ -197,7 +200,6 @@ GraphPatternDetector::DetectPatterns() { std::vector result; std::vector init_groups; std::array, 2> bi_records; - // PADDLE_ENFORCE(!pattern_.edges().empty(), "At least one edge is needed"); auto *first_pnode = pattern_.edges().empty() ? pattern().nodes().front().get() : pattern_.edges().front().first; if (!pdnodes2nodes_.count(first_pnode)) return result; @@ -213,7 +215,7 @@ GraphPatternDetector::DetectPatterns() { // Extend a PDNode to subgraphs by deducing the connection relations defined // in edges of PDNodes. for (const auto &edge : pattern_.edges()) { - VLOG(4) << "check " << edge.first->name() << " -> " << edge.second->name(); + VLOG(40) << "check " << edge.first->name() << " -> " << edge.second->name(); // TODO(Superjomn) Fix bug here, the groups might be duplicate here. // Each role has two PDNodes, which indicates two roles. // Detect two Nodes that can match these two roles and they are connected. @@ -224,14 +226,15 @@ GraphPatternDetector::DetectPatterns() { // source -> target for (Node *source : pdnodes2nodes_[edge.first]) { for (Node *target : pdnodes2nodes_[edge.second]) { - VLOG(8) << "check " << source->id() << " -- " << target->id(); + VLOG(80) << "check " << source->id() << " -- " << target->id(); // TODO(Superjomn) add some prune strategies. 
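
// Aside: the rewritten HitGroup::Match above enforces a one-to-one binding
// between pattern roles (PDNode*) and concrete graph nodes. A hedged
// restatement of the same rule as a free function (CanBind and bound_nodes
// are illustrative names, not part of this patch):
bool CanBind(const std::unordered_set<Node *> &bound_nodes,
             const std::map<PDNode *, Node *> &roles, Node *node,
             PDNode *pat) {
  auto it = roles.find(pat);
  if (bound_nodes.count(node)) {
    // Node already consumed: valid only if it is bound to exactly this role.
    return it != roles.end() && it->second == node;
  }
  // Node unused: valid unless the role is already taken by another node.
  return it == roles.end() || it->second == node;
}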
for (const auto &group : pre_groups) { - HitGroup new_group = group; - if (IsNodesLink(source, target) && - new_group.Match(source, edge.first)) { - new_group.Register(source, edge.first); - if (new_group.Match(target, edge.second)) { + if (IsNodesLink(source, target)) { + HitGroup new_group = group; + bool flag = new_group.Match(source, edge.first) && + new_group.Match(target, edge.second); + if (flag) { + new_group.Register(source, edge.first); new_group.Register(target, edge.second); cur_groups.push_back(new_group); // TODO(Superjomn) need to unique @@ -240,12 +243,13 @@ GraphPatternDetector::DetectPatterns() { } } } - VLOG(3) << "step " << step << " get records: " << cur_groups.size(); + VLOG(30) << "step " << step << " get records: " << cur_groups.size(); for (auto &group : cur_groups) { for (auto &item : group.roles) { - VLOG(4) << "node " << item.second->id() << " as " << item.first->name(); + VLOG(40) << "node " << item.second->id() << " as " + << item.first->name(); } - VLOG(4) << "========================================================="; + VLOG(40) << "========================================================="; } } @@ -259,18 +263,35 @@ GraphPatternDetector::DetectPatterns() { return result; } +struct GraphItemLessThan { + bool operator()(const std::pair &a, + const std::pair &b) { + if (a.first != b.first) { + return a.first < b.first; + } else { + return a.second < b.second; + } + } +}; + +// TODO(Superjomn) enhance the function as it marks unique sub-graphs as duplicates +// see https://github.com/PaddlePaddle/Paddle/issues/13550 void GraphPatternDetector::UniquePatterns( std::vector *subgraphs) { if (subgraphs->empty()) return; std::vector result; std::unordered_set set; + std::hash hasher; for (auto &g : *subgraphs) { - size_t key = 0; - for (auto &item : g) { - key ^= std::hash{}(item.first); - key ^= std::hash{}(item.second); + // Sort the items in the sub-graph, and transform to a string key. + std::vector> sorted_keys(g.begin(), g.end()); + std::sort(sorted_keys.begin(), sorted_keys.end(), GraphItemLessThan()); + std::stringstream ss; + for (auto &item : sorted_keys) { + ss << item.first << ":" << item.second; } + auto key = hasher(ss.str()); if (!set.count(key)) { result.emplace_back(g); set.insert(key); @@ -759,6 +780,51 @@ PDNode *patterns::ConvReLU::operator()( return relu_out_var; } +PDNode *patterns::SeqConvEltAddRelu::operator()( + paddle::framework::ir::PDNode *seqconv_input) { + // Create Operators + seqconv_input->assert_is_op_input("sequence_conv", "X"); + auto *seqconv_op = pattern->NewNode(seqconv_repr()) + ->assert_is_op("sequence_conv") + ->assert_op_attr("paddingTrainable", false) + ->assert_op_attr("contextStride", 1); + + auto *eltadd_op = + pattern->NewNode(eltadd_repr())->assert_is_op("elementwise_add"); + auto *relu_op = pattern->NewNode(relu_repr())->assert_is_op("relu"); + // Create variables + // Filter + auto *seqconv_weight_var = + pattern->NewNode(seqconv_weight_repr()) + ->AsInput() + ->assert_is_persistable_var() + ->assert_is_op_input("sequence_conv", "Filter"); + // Bias + auto *eltadd_bias_var = pattern->NewNode(eltadd_bias_repr()) + ->AsInput() + ->assert_is_op_input("elementwise_add"); + // intermediate variable, will be removed in the IR after fuse.
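
// Aside on UniquePatterns above: the old XOR-of-hashes key ignored how nodes
// were paired and collided easily; the new key sorts the (PDNode*, Node*)
// items and hashes one canonical string, so two subgraphs binding the same
// pairs in different orders produce the same key. A self-contained sketch of
// the same canonicalize-then-hash idea (simplified to int pairs,
// illustrative only):
size_t CanonicalKey(std::vector<std::pair<int, int>> items) {
  std::sort(items.begin(), items.end());  // canonical order
  std::stringstream ss;
  for (const auto &item : items) ss << item.first << ":" << item.second << ";";
  return std::hash<std::string>{}(ss.str());  // order-independent key
}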
+ auto *seqconv_out_var = pattern->NewNode(seqconv_out_repr()) + ->AsIntermediate() + ->assert_is_only_output_of_op("sequence_conv") + ->assert_is_op_input("elementwise_add"); + auto *eltadd_out_var = pattern->NewNode(eltadd_out_repr()) + ->AsIntermediate() + ->assert_is_only_output_of_op("elementwise_add") + ->assert_is_only_input_of_op("relu"); + // output + auto *relu_out_var = pattern->NewNode(relu_out_repr()) + ->AsOutput() + ->assert_is_op_output("relu"); + + seqconv_op->LinksFrom({seqconv_input, seqconv_weight_var}) + .LinksTo({seqconv_out_var}); + eltadd_op->LinksFrom({seqconv_out_var, eltadd_bias_var}) + .LinksTo({eltadd_out_var}); + relu_op->LinksFrom({eltadd_out_var}).LinksTo({relu_out_var}); + return relu_out_var; +} + PDNode *patterns::FC::operator()(paddle::framework::ir::PDNode *x, bool with_bias) { // Create shared nodes. @@ -964,6 +1030,75 @@ PDNode *patterns::ElewiseAddActInplaceGrad::operator()( return ele_add_grad; } +PDNode *patterns::ConvBias::operator()( + paddle::framework::ir::PDNode *conv_input) { + // Create Operators + conv_input->assert_is_op_input("conv2d", "Input"); + auto *conv_op = pattern->NewNode(conv_repr())->assert_is_op("conv2d"); + auto *eltiwse_op = + pattern->NewNode(eltwise_repr())->assert_is_op("elementwise_add"); + // Create variables + // Filter + auto *conv_weight_var = pattern->NewNode(conv_weight_repr()) + ->AsInput() + ->assert_is_persistable_var() + ->assert_is_op_input("conv2d", "Filter"); + // intermediate variable, will be removed in the IR after fuse. + auto *conv_out_var = pattern->NewNode(conv_out_repr()) + ->AsIntermediate() + ->assert_is_only_output_of_op("conv2d") + ->assert_is_op_input("elementwise_add"); + // Bias stored in elementwise_add + auto *eltwise_bias_var = pattern->NewNode(eltwise_bias_repr()) + ->AsInput() + ->assert_is_persistable_var() + ->assert_is_op_input("elementwise_add", "Y"); + // output + auto *eltwise_out_var = pattern->NewNode(eltwise_out_repr()) + ->AsOutput() + ->assert_is_op_output("elementwise_add"); + conv_op->LinksFrom({conv_input, conv_weight_var}).LinksTo({conv_out_var}); + eltiwse_op->LinksFrom({conv_out_var, eltwise_bias_var}) + .LinksTo({eltwise_out_var}); + return eltwise_out_var; +} + +PDNode *patterns::Conv::operator()() { + auto conv_op = pattern->NewNode(conv_op_repr())->assert_is_op("conv2d"); + + auto input_var = pattern->NewNode(conv_input_repr()) + ->AsInput() + ->assert_is_op_input("conv2d", "Input"); + + auto filter_var = pattern->NewNode(conv_filter_repr()) + ->AsInput() + ->assert_is_op_input("conv2d", "Filter"); + + auto output_var = pattern->NewNode(conv_output_repr()) + ->AsOutput() + ->assert_is_op_output("conv2d", "Output"); + + conv_op->LinksFrom({input_var, filter_var}); + conv_op->LinksTo({output_var}); + + return output_var; +} + +PDNode *patterns::ElementwiseAdd::operator()(PDNode *x_var, PDNode *y_var) { + auto elementwise_add_op = pattern->NewNode(elementwise_add_op_repr()) + ->assert_is_op("elementwise_add"); + + x_var->AsInput()->assert_is_op_input("elementwise_add", "X"); + y_var->AsInput()->assert_is_op_input("elementwise_add", "Y"); + auto out_var = pattern->NewNode(elementwise_add_out_repr()) + ->AsOutput() + ->assert_is_op_output("elementwise_add", "Out"); + + elementwise_add_op->LinksFrom({x_var, y_var}); + elementwise_add_op->LinksTo({out_var}); + + return out_var; +} } // namespace ir } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/ir/graph_pattern_detector.h b/paddle/fluid/framework/ir/graph_pattern_detector.h index 
cdd6413d968b065453177ff78b0aad641a09f6e7..c12b9503fd817757ec8d1e988be3e449fc63c6ff 100644 --- a/paddle/fluid/framework/ir/graph_pattern_detector.h +++ b/paddle/fluid/framework/ir/graph_pattern_detector.h @@ -128,6 +128,15 @@ struct PDNode { const std::unordered_set& op_types, const std::string& argument, int nth); + template + PDNode* assert_op_attr(const std::string& attr_name, const T& attr) { + asserts_.emplace_back([=](Node* x) { + return x && x->IsOp() && x->Op()->HasAttr(attr_name) && + boost::get(x->Op()->GetAttr(attr_name)) == attr; + }); + return this; + } + private: PDNode(PDPattern* pattern, const std::string& name = "", Type type = Type::kVar) @@ -301,8 +310,8 @@ void GraphSafeRemoveNodes(Graph* graph, const std::unordered_set& nodes); // Some pre-defined patterns those can be reused in multiple passes. -// The related Fluid Layer or Op should be one pattern here for better reusage -// accross different fusion. +// The related Fluid Layer or Op should be one pattern here for better re-usage +// across different fusion. namespace patterns { struct KeyCounter { @@ -434,6 +443,31 @@ struct ConvReLU : public PatternBase { PATTERN_DECL_NODE(relu_out); }; +// SEQCONV with Elementwise_Add ReLU +// op: seqconv + elementwise_add + relu +// named nodes: +// seqconv_input, seqconv_weight, +// seqconv_out, seqconv, +// elementwise_add_bias, elementwise_add_out, elementwise_add +// relu_out, relu +struct SeqConvEltAddRelu : public PatternBase { + SeqConvEltAddRelu(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "seqconv_eltadd_relu") {} + + PDNode* operator()(PDNode* seqconv_input); + + // declare operator node's name + PATTERN_DECL_NODE(seqconv); + PATTERN_DECL_NODE(eltadd); + PATTERN_DECL_NODE(relu); + // declare variable node's name + PATTERN_DECL_NODE(seqconv_weight); + PATTERN_DECL_NODE(seqconv_out); + PATTERN_DECL_NODE(eltadd_bias); + PATTERN_DECL_NODE(eltadd_out); + PATTERN_DECL_NODE(relu_out); +}; + // FC with bias // op: mul + elementwise_add // named nodes: @@ -578,6 +612,65 @@ struct ElewiseAddActInplaceGrad : public PatternBase { PATTERN_DECL_NODE(d_ele_y); PATTERN_DECL_NODE(ele_y); }; + +// Conv with Elementwise_add as bias +// op: conv + elementwise_add +// named nodes: +// conv_input, conv_weight, +// conv_out, conv, +// eltwise_bias, eltwise_out, +// elementwise_add +struct ConvBias : public PatternBase { + ConvBias(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "conv_bias") {} + PDNode* operator()(PDNode* conv_input); + // declare operator node's name + PATTERN_DECL_NODE(conv); + PATTERN_DECL_NODE(eltwise); + // declare variable node's name + PATTERN_DECL_NODE(conv_weight); + PATTERN_DECL_NODE(conv_out); + PATTERN_DECL_NODE(eltwise_bias); + PATTERN_DECL_NODE(eltwise_out); +}; + +// Convolution op +// Forward pass for convolution. +// conv_input, conv_bias and conv_filter are inputs. +// conv_output is a result of the operator. +// residual_data is data used by skip connection. +// If residual connection fusion is on, the formula is: +// conv_output = conv_op(conv_filter, conv_input, conv_bias) +// + conv_residual_data +// If the fusion is off, conv_residual_data is not added. 
+struct Conv : public PatternBase { + Conv(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "convolution") {} + + PDNode* operator()(); + + PATTERN_DECL_NODE(conv_op); + PATTERN_DECL_NODE(conv_input); + PATTERN_DECL_NODE(conv_filter); + PATTERN_DECL_NODE(conv_residual_data); + PATTERN_DECL_NODE(conv_output); +}; + +// ElementwiseAdd used in residual connections. +// y_var is the convolution output consumed by the residual connection. +// The operator is removed when residual +// connection fusion is on. +struct ElementwiseAdd : public PatternBase { + ElementwiseAdd(PDPattern* pattern, const std::string& name_scope) + : PatternBase(pattern, name_scope, "elementwise_add") {} + + PDNode* operator()(PDNode* x_var, PDNode* y_var); + + PATTERN_DECL_NODE(elementwise_add_op); + PATTERN_DECL_NODE(elementwise_add_x); + PATTERN_DECL_NODE(elementwise_add_y); + PATTERN_DECL_NODE(elementwise_add_out); +}; } // namespace patterns // Link two ir::Nodes from each other. diff --git a/paddle/fluid/framework/ir/graph_test.cc b/paddle/fluid/framework/ir/graph_test.cc index cadda49c399a6d65079cacedfea61f4fd580a69a..7ed2f96eb24239d87965192d73f4ba200ff5dbeb 100644 --- a/paddle/fluid/framework/ir/graph_test.cc +++ b/paddle/fluid/framework/ir/graph_test.cc @@ -124,7 +124,7 @@ TEST(GraphTest, Basic) { ASSERT_EQ(n->outputs.size(), 0UL); } } - ASSERT_EQ(nodes.size(), 5); + ASSERT_EQ(nodes.size(), 5UL); } TEST(GraphTest, WriteAfterRead) { diff --git a/paddle/fluid/framework/ir/graph_to_program_pass.cc b/paddle/fluid/framework/ir/graph_to_program_pass.cc index 414d8f79b15de091c62af5fe099ffae144156e4e..36f36933265c69fcd450894a3e32bbb3e547b62c 100644 --- a/paddle/fluid/framework/ir/graph_to_program_pass.cc +++ b/paddle/fluid/framework/ir/graph_to_program_pass.cc @@ -35,10 +35,11 @@ std::unique_ptr GraphToProgramPass::ApplyImpl( new proto::ProgramDesc(*program.Proto())); auto block = program_pb->mutable_blocks(kRootBlockIndex); + block->set_idx(kRootBlockIndex); block->clear_vars(); std::unordered_set visited_vars; for (ir::Node* n : graph->Nodes()) { - if (n->NodeType() == ir::Node::Type::kVariable) { + if (n->IsVar()) { if (n->Var() && visited_vars.count(n->Var()->Name()) == 0) { visited_vars.insert(n->Var()->Name()); block->add_vars()->MergeFrom(*n->Var()->Proto()); diff --git a/paddle/fluid/framework/ir/graph_traits.cc b/paddle/fluid/framework/ir/graph_traits.cc index 084a4ba2def87eaa8badb3ca2c39865c6e5cb981..2ee12cc410393d1e1aa5fc9e5374d858eca1b901 100644 --- a/paddle/fluid/framework/ir/graph_traits.cc +++ b/paddle/fluid/framework/ir/graph_traits.cc @@ -66,6 +66,76 @@ NodesDFSIterator &NodesDFSIterator::operator=(const NodesDFSIterator &other) { } Node *NodesDFSIterator::operator->() { return stack_.top(); } +inline bool CheckNodeIndegreeEquals(const Node &node, size_t n) { + return node.inputs.size() == n; +} + +NodesTSIterator::NodesTSIterator(const std::vector &source) { + PADDLE_ENFORCE(!source.empty(), + "Start points of topological sorting should not be empty!"); + // CHECK all the inputs' in-degree is 0 + for (auto *node : source) { + PADDLE_ENFORCE(CheckNodeIndegreeEquals(*node, 0)); + } + + std::unordered_set visited; + std::unordered_set to_visit{source.begin(), source.end()}; + + std::vector inlink_visited; + while (!to_visit.empty()) { + std::vector queue(to_visit.begin(), to_visit.end()); + for (auto *p : queue) { + inlink_visited.clear(); + + std::copy_if(p->inputs.begin(), p->inputs.end(), + std::back_inserter(inlink_visited), + [&](Node *x) -> bool { return visited.count(x) != 0; }); + + if
(inlink_visited.size() == p->inputs.size()) { + sorted_.push_back(p); + for (auto *_ : p->outputs) { + if (!visited.count(_)) { + to_visit.insert(_); + } + } + + to_visit.erase(p); + visited.insert(p); + } + } + } +} + +NodesTSIterator::NodesTSIterator(const NodesTSIterator &other) + : sorted_(other.sorted_), cursor_(other.cursor_) {} + +Node &NodesTSIterator::operator*() { + PADDLE_ENFORCE_LT(cursor_, sorted_.size()); + return *sorted_[cursor_]; +} + +NodesTSIterator &NodesTSIterator::operator++() { + if (++cursor_ >= sorted_.size()) { + sorted_.clear(); + cursor_ = 0; + } + return *this; +} +NodesTSIterator &NodesTSIterator::operator=(const NodesTSIterator &other) { + cursor_ = other.cursor_; + sorted_ = other.sorted_; + return *this; +} + +bool NodesTSIterator::operator==(const NodesTSIterator &other) { + return sorted_ == other.sorted_ && cursor_ == other.cursor_; +} + +Node *NodesTSIterator::operator->() { + PADDLE_ENFORCE_LT(cursor_, sorted_.size()); + return sorted_[cursor_]; +} + } // namespace ir } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/ir/graph_traits.h b/paddle/fluid/framework/ir/graph_traits.h index f42bab20ed97e372d2da0c4a492a4458ab94e0a0..f6772f9a37567c83c49bd44d551481edda1a74ae 100644 --- a/paddle/fluid/framework/ir/graph_traits.h +++ b/paddle/fluid/framework/ir/graph_traits.h @@ -62,6 +62,32 @@ struct NodesDFSIterator std::unordered_set visited_; }; +// Topological sorting iterator on nodes. +struct NodesTSIterator + : public std::iterator { + NodesTSIterator() = default; + NodesTSIterator(const std::vector &source); + NodesTSIterator(NodesTSIterator &&other) + : sorted_(std::move(other.sorted_)), cursor_(other.cursor_) { + other.cursor_ = 0; + } + NodesTSIterator(const NodesTSIterator &other); + + Node &operator*(); + NodesTSIterator &operator++(); + // TODO(Superjomn) the current implementation just compares the first + // element; it needs to compare the graph and all the elements in the queue + // and set. + NodesTSIterator &operator=(const NodesTSIterator &other); + bool operator==(const NodesTSIterator &other); + bool operator!=(const NodesTSIterator &other) { return !(*this == other); } + Node *operator->(); + + private: + std::vector sorted_; + size_t cursor_{0}; +}; + /* * GraphTraits contains some graph traversal algorithms. * @@ -76,6 +102,14 @@ struct GraphTraits { NodesDFSIterator()); } + static iterator_range TS(const Graph &g) { + auto start_points = ExtractStartPoints(g); + PADDLE_ENFORCE(!start_points.empty()); + NodesTSIterator x(start_points); + return iterator_range(NodesTSIterator(start_points), + NodesTSIterator()); + } + private: // The nodes that have no input will be treated as start points.
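
// With the TS() range above, a pass can iterate operators so that every node
// appears after all of its inputs. A hypothetical traversal sketch
// (VisitInOrder is an illustrative name, not part of this patch):
void VisitInOrder(const Graph &g) {
  for (auto &node : GraphTraits::TS(g)) {
    VLOG(30) << "visit " << node.Name();  // producers precede consumers
  }
}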
static std::vector ExtractStartPoints(const Graph &g) { diff --git a/paddle/fluid/framework/ir/graph_viz_pass.cc b/paddle/fluid/framework/ir/graph_viz_pass.cc index 31ed98db72c8fd4af8c970861d386687962001ce..13dd354dc59b2bf00a741c565a4c97719eac76c3 100644 --- a/paddle/fluid/framework/ir/graph_viz_pass.cc +++ b/paddle/fluid/framework/ir/graph_viz_pass.cc @@ -41,7 +41,7 @@ std::string FormatName(const Node* node) { std::unique_ptr GraphVizPass::ApplyImpl( std::unique_ptr graph) const { const std::string graph_viz_path = Get(kGraphVizPath); - VLOG(3) << "draw IR graph viz to " << graph_viz_path; + VLOG(30) << "draw IR graph viz to " << graph_viz_path; std::unique_ptr fout(new std::ofstream(graph_viz_path)); PADDLE_ENFORCE(fout->good()); std::ostream& sout = *fout; diff --git a/paddle/fluid/framework/ir/is_test_pass.cc b/paddle/fluid/framework/ir/is_test_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..292f232ffce48593e1827fe2dfe1b8472360054e --- /dev/null +++ b/paddle/fluid/framework/ir/is_test_pass.cc @@ -0,0 +1,57 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/ir/is_test_pass.h" +#include +#include + +namespace paddle { +namespace framework { +namespace ir { + +std::unique_ptr IsTestPass::ApplyImpl( + std::unique_ptr graph) const { + VLOG(3) << "Sets the is_test attribute to true and, if it is missing, inserts it " + "for activations and pooling."; + auto op_list = {"pool2d", "sigmoid", "logsigmoid", + "softshrink", "exp", "brelu", + "pow", "leaky_relu", "stanh", + "relu", "tanh", "tanh_shrink", + "sqrt", "abs", "ceil", + "elu", "floor", "cos", + "sin", "round", "reciprocal", + "hard_shrink", "hard_sigmoid", "relu6", + "soft_relu", "swish", "thresholded_relu", + "log", "square", "softplus", + "softsign"}; + for (const Node* n : graph->Nodes()) { + if (n->IsOp()) { + auto* op = n->Op(); + if (op->HasAttr("is_test")) { + op->SetAttr("is_test", true); + } else if (std::find(begin(op_list), end(op_list), op->Type()) != + end(op_list)) { + op->MutableAttrMap()->insert( + std::pair("is_test", true)); + } + } + } + return graph; +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(is_test_pass, paddle::framework::ir::IsTestPass); diff --git a/paddle/fluid/inference/analysis/graph_traits.cc b/paddle/fluid/framework/ir/is_test_pass.h similarity index 59% rename from paddle/fluid/inference/analysis/graph_traits.cc rename to paddle/fluid/framework/ir/is_test_pass.h index 2ea70a1d2060e03769d67060dc6f008207342b52..99e76ca4a3de21e350e68e05e0f241937a743b9e 100644 --- a/paddle/fluid/inference/analysis/graph_traits.cc +++ b/paddle/fluid/framework/ir/is_test_pass.h @@ -1,15 +1,31 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License.
-You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/graph_traits.h" +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +class IsTestPass : public Pass { + protected: + std::unique_ptr ApplyImpl( + std::unique_ptr graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/is_test_pass_tester.cc b/paddle/fluid/framework/ir/is_test_pass_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..9696441a21661db89146c448742a992d1f7df022 --- /dev/null +++ b/paddle/fluid/framework/ir/is_test_pass_tester.cc @@ -0,0 +1,120 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
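
// What the tester below verifies, in brief: after is_test_pass runs, ops that
// already carried is_test keep it (forced to true), whitelisted activation
// and pooling ops gain is_test = true, and other ops such as a plain conv2d
// stay untouched. A minimal application sketch mirroring the test harness
// (ApplyIsTest is an illustrative name, not part of this patch):
std::unique_ptr<ir::Graph> ApplyIsTest(std::unique_ptr<ir::Graph> graph) {
  auto pass = PassRegistry::Instance().Get("is_test_pass");
  return pass->Apply(std::move(graph));  // rewrites attributes in place
}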
+ +#include "paddle/fluid/framework/ir/is_test_pass.h" + +#include +#ifdef _WIN32 +#undef FALSE +#undef TRUE +#endif +namespace paddle { +namespace framework { +namespace ir { + +enum class ISTEST_STATE { FALSE, TRUE, UNSET }; + +void SetOp(ProgramDesc* prog, const std::string& type, const std::string& name, + const std::vector& inputs, + const std::vector& outputs, bool use_mkldnn = false, + ISTEST_STATE is_test = ISTEST_STATE::UNSET) { + auto* op = prog->MutableBlock(0)->AppendOp(); + op->SetType(type); + op->SetAttr("name", name); + op->SetInput("X", inputs); + op->SetOutput("Out", outputs); + op->SetAttr("use_mkldnn", use_mkldnn); + if (is_test == ISTEST_STATE::UNSET) + op->MutableAttrMap()->erase("is_test"); + else if (is_test == ISTEST_STATE::FALSE) + op->SetAttr("is_test", false); + else + op->SetAttr("is_test", true); +} + +// a->pool2d->b +// b->relu->c +// c,weights1)->conv2d->d +// +// d->pool2d->e +// e->hard_sigmoid->f +// (f,weights2)->conv2d->g +// +// g->pool2d->h +// h->tanh->i +// (i,weights3)->conv2d->j +ProgramDesc BuildProgramDesc() { + ProgramDesc prog; + for (auto& v : + std::vector({"a", "b", "c", "d", "e", "f", "g", "h", "i", + "j", "weights1", "weights2", "weights3"})) { + auto* var = prog.MutableBlock(0)->Var(v); + var->SetType(proto::VarType::SELECTED_ROWS); + if (v == "weights1" || v == "weights2" || v == "weights3") { + var->SetPersistable(true); + } + } + + SetOp(&prog, "pool2d", "pooling1", std::vector({"a"}), + std::vector({"b"}), true, ISTEST_STATE::TRUE); + SetOp(&prog, "relu", "activation1", std::vector({"b"}), + std::vector({"c"}), true, ISTEST_STATE::TRUE); + SetOp(&prog, "conv2d", "conv1", std::vector({"c", "weights1"}), + std::vector({"d"}), true, ISTEST_STATE::TRUE); + + SetOp(&prog, "pool2d", "pooling2", std::vector({"d"}), + std::vector({"e"}), false, ISTEST_STATE::FALSE); + SetOp(&prog, "hard_sigmoid", "activation2", std::vector({"e"}), + std::vector({"f"}), false, ISTEST_STATE::FALSE); + SetOp(&prog, "conv2d", "conv2", std::vector({"f", "weights2"}), + std::vector({"g"}), false, ISTEST_STATE::FALSE); + + SetOp(&prog, "pool2d", "pooling3", std::vector({"g"}), + std::vector({"h"}), false, ISTEST_STATE::UNSET); + SetOp(&prog, "tanh", "activation3", std::vector({"h"}), + std::vector({"i"}), true, ISTEST_STATE::UNSET); + SetOp(&prog, "conv2d", "conv3", std::vector({"i", "weights3"}), + std::vector({"j"}), false, ISTEST_STATE::UNSET); + + return prog; +} + +TEST(IsTestPass, basic) { + auto prog = BuildProgramDesc(); + + std::unique_ptr graph(new ir::Graph(prog)); + + auto pass = PassRegistry::Instance().Get("is_test_pass"); + + graph = pass->Apply(std::move(graph)); + + for (auto* node : graph->Nodes()) { + if (node->IsOp()) { + auto* op = node->Op(); + auto op_name = boost::get(op->GetAttr("name")); + if (op_name == "conv3") { + ASSERT_FALSE(op->HasAttr("is_test")); + } else { + ASSERT_TRUE(op->HasAttr("is_test")); + EXPECT_TRUE(boost::get(op->GetAttr("is_test"))); + } + } + } +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +USE_PASS(is_test_pass); diff --git a/paddle/fluid/framework/ir/mkldnn_placement_pass.cc b/paddle/fluid/framework/ir/mkldnn_placement_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..145a3a455c8ae2c1e6a5bc4fefa3491f420af5ba --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn_placement_pass.cc @@ -0,0 +1,37 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. 
+ +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/ir/mkldnn_placement_pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +std::unique_ptr MKLDNNPlacementPass::ApplyImpl( + std::unique_ptr graph) const { + VLOG(30) << "Applies MKL-DNN placement strategy."; + for (const Node* n : graph->Nodes()) { + if (n->IsOp() && n->Op()->HasAttr("use_mkldnn")) { + n->Op()->SetAttr("use_mkldnn", true); + } + } + return graph; +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(mkldnn_placement_pass, + paddle::framework::ir::MKLDNNPlacementPass); diff --git a/paddle/fluid/framework/ir/mkldnn_placement_pass.h b/paddle/fluid/framework/ir/mkldnn_placement_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..3d4dc9e2b6ecccddea4d63e45710c80d55ef2772 --- /dev/null +++ b/paddle/fluid/framework/ir/mkldnn_placement_pass.h @@ -0,0 +1,31 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +class MKLDNNPlacementPass : public Pass { + protected: + std::unique_ptr ApplyImpl( + std::unique_ptr graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/multi_batch_merge_pass.cc b/paddle/fluid/framework/ir/multi_batch_merge_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..532961e4d59ad3611dc93b20738080d1755290e8 --- /dev/null +++ b/paddle/fluid/framework/ir/multi_batch_merge_pass.cc @@ -0,0 +1,315 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License.
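
// A hypothetical invocation of the pass defined below. "num_repeats" matches
// the kNumRepeats attribute required via RequirePassAttr at the bottom of
// this file; the Set call assumes the usual templated Pass::Set(name, ptr)
// API, and MergeBatches is an illustrative name, not part of this patch:
std::unique_ptr<ir::Graph> MergeBatches(std::unique_ptr<ir::Graph> graph) {
  auto pass = ir::PassRegistry::Instance().Get("multi_batch_merge_pass");
  pass->Set<int>("num_repeats", new int(4));  // replay fwd/bwd ops 4 times
  return pass->Apply(std::move(graph));
}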
+ +#include "paddle/fluid/framework/ir/multi_batch_merge_pass.h" + +#include +#include +#include + +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/op_proto_maker.h" + +namespace paddle { +namespace framework { +namespace ir { + +static const char kNumRepeats[] = "num_repeats"; +typedef std::unordered_map> SSAVarList; + +ir::Node* SameNameVar(std::unordered_set all, ir::Node* target) { + for (auto n : all) { + if (target->IsVar() && target->Name() == n->Name()) { + return n; + } + } + return nullptr; +} + +VarDesc CopyVarDesc(VarDesc* var_desc) { + VarDesc repeated_var(var_desc->Name()); + // copy other variable attributes + if (var_desc->GetType() != proto::VarType::READER) { + repeated_var.SetType(var_desc->GetType()); + repeated_var.SetShape(var_desc->GetShape()); + repeated_var.SetDataType(var_desc->GetDataType()); + repeated_var.SetLoDLevel(var_desc->GetLoDLevel()); + repeated_var.SetPersistable(var_desc->Persistable()); + } else { + // TODO(typhoonzero): copy reader var + } + return repeated_var; +} + +VarDesc UpdateGradVarDesc( + VarDesc* var_desc, int repeat, + const std::unordered_set& grad_names, + const std::unordered_set& bn_vars_need_rename) { + if (grad_names.find(var_desc->Name()) != grad_names.end() || + bn_vars_need_rename.find(var_desc->Name()) != bn_vars_need_rename.end()) { + std::string new_gname = + string::Sprintf("%s.repeat.%d", var_desc->Name(), repeat); + VarDesc repeated_var = CopyVarDesc(var_desc); + repeated_var.SetName(new_gname); + VLOG(30) << "update " << var_desc->Name() << " to repeat " << repeat; + return repeated_var; + } + return *var_desc; +} + +std::unique_ptr BatchMergePass::ApplyImpl( + std::unique_ptr graph) const { + int num_repeats = Get(kNumRepeats); + std::vector forward_backward_ops; + std::vector optimize_ops; + std::vector lr_ops; // ops other than forward/backward/optimize + std::unordered_set grad_names; + + std::vector nodes = TopologySortOperations(*graph); + auto origin_nodes = graph->ReleaseNodes(); + VLOG(30) << "origin nodes count: " << origin_nodes.size(); + ir::Graph& result = *graph; + + // 1. record op nodes of different roles + for (auto node : nodes) { + if (node->IsVar()) continue; + int op_role = boost::get(node->Op()->GetAttr( + framework::OpProtoAndCheckerMaker::OpRoleAttrName())); + if ((op_role == static_cast(framework::OpRole::kForward)) || + (op_role & static_cast(framework::OpRole::kBackward)) || + (op_role & static_cast(framework::OpRole::kLoss))) { + forward_backward_ops.push_back(node); + } else if ((op_role & static_cast(framework::OpRole::kOptimize)) || + (op_role & static_cast(framework::OpRole::kDist)) || + (op_role & static_cast(framework::OpRole::kRPC))) { + optimize_ops.push_back(node); + auto op_role_var = node->Op()->GetNullableAttr( + OpProtoAndCheckerMaker::OpRoleVarAttrName()); + auto op_role_vars = boost::get>(op_role_var); + for (size_t i = 0; i < op_role_vars.size(); i += 2) { + grad_names.insert(op_role_vars[i + 1]); + } + } else if (op_role & static_cast(framework::OpRole::kLRSched)) { + lr_ops.push_back(node); + } else { // NOLINT + PADDLE_THROW("Invalid op_role: %d", static_cast(op_role)); + } + } + + // 2. copy forward backward + ir::Node* prev_repeat_last_op_node = nullptr; + // record origin_grad -> repeated grad list map. 
+ std::map> grad_repeated_map; + std::map> created; + std::unordered_set bn_vars_need_rename; + for (int i = 0; i < num_repeats; ++i) { + std::unordered_set copied; + for (size_t node_idx = 0; node_idx < forward_backward_ops.size(); + ++node_idx) { + auto node = forward_backward_ops[node_idx]; + OpDesc repeated_op(*(node->Op()), node->Op()->Block()); + // 3. rename grad outputs to current repeat. + for (auto outname : repeated_op.OutputArgumentNames()) { + if (grad_names.find(outname) != grad_names.end()) { + std::string new_gname = string::Sprintf("%s.repeat.%d", outname, i); + repeated_op.RenameOutput(outname, new_gname); + } + } + // 3.5 let batch_norm ops use independent vars, note batch_norm_grad does + // not need this update + if (node->Name() == "batch_norm") { + // NOTE: assume bn ops created by layers use saved vars as output mean and + // variance + std::string new_mean_name = + string::Sprintf("%s.repeat.%d", repeated_op.Input("Mean")[0], i); + std::string new_var_name = string::Sprintf( + "%s.repeat.%d", repeated_op.Input("Variance")[0], i); + bn_vars_need_rename.insert(repeated_op.Input("Mean")[0]); + bn_vars_need_rename.insert(repeated_op.Input("Variance")[0]); + VLOG(30) << "renaming " << repeated_op.Input("Mean")[0] << " to " + << new_mean_name; + repeated_op.RenameInput(repeated_op.Input("Mean")[0], new_mean_name); + repeated_op.RenameInput(repeated_op.Input("Variance")[0], new_var_name); + repeated_op.RenameOutput(repeated_op.Output("MeanOut")[0], + new_mean_name); + repeated_op.RenameOutput(repeated_op.Output("VarianceOut")[0], + new_var_name); + } + + // 3.9 do copy + auto repeated_node = result.CreateOpNode(&repeated_op); + copied.insert(node); + + // 4. add deps between repeats + if (node_idx == forward_backward_ops.size() - 1) { + prev_repeat_last_op_node = repeated_node; + } + if (node_idx == 0 && prev_repeat_last_op_node) { + auto* depvar = result.CreateControlDepVar(); + prev_repeat_last_op_node->outputs.push_back(depvar); + depvar->inputs.push_back(prev_repeat_last_op_node); + repeated_node->inputs.push_back(depvar); + depvar->outputs.push_back(repeated_node); + } + + for (auto in_node : node->inputs) { + if (in_node->IsCtrlVar()) { + continue; + } + ir::Node* var = nullptr; + auto updated_var = UpdateGradVarDesc(in_node->Var(), i, grad_names, + bn_vars_need_rename); + // should be initialized by startup, how to initialize tensor in the + // scope?
+ if (node->Name() == "batch_norm" && + bn_vars_need_rename.find(in_node->Name()) != + bn_vars_need_rename.end()) { + // Create bn mean/variance for each repeat + var = result.CreateVarNode(&updated_var); + created[updated_var.Name()].push_back(var); + copied.insert(in_node); + repeated_node->inputs.push_back(var); + var->outputs.push_back(repeated_node); + continue; + } + + // for other ops + if (in_node->inputs.empty() && i > 0) { + // do not copy head vars (inputs, params) in repeats > 0 + var = created.at(in_node->Name()).back(); + } else { + if (copied.find(in_node) == copied.end()) { + var = result.CreateVarNode(&updated_var); + if (grad_names.find(in_node->Var()->Name()) != grad_names.end()) { + grad_repeated_map[in_node].push_back(var); + } + copied.insert(in_node); + created[updated_var.Name()].push_back(var); + } else { + var = created.at(updated_var.Name()).back(); + } + } + repeated_node->inputs.push_back(var); + var->outputs.push_back(repeated_node); + } + for (auto out_node : node->outputs) { + if (out_node->IsCtrlVar()) { + continue; + } + ir::Node* var = nullptr; + auto updated_var = UpdateGradVarDesc(out_node->Var(), i, grad_names, + bn_vars_need_rename); + if (copied.find(out_node) == copied.end()) { + var = result.CreateVarNode(&updated_var); + if (grad_names.find(out_node->Var()->Name()) != grad_names.end()) { + grad_repeated_map[out_node].push_back(var); + } + copied.insert(out_node); + created[updated_var.Name()].push_back(var); + } else { + var = created.at(updated_var.Name()).back(); + } + repeated_node->outputs.push_back(var); + var->inputs.push_back(repeated_node); + } + } + } + + // 5. create GRAD merge op node + for (auto kv : grad_repeated_map) { + OpDesc sum_op; + sum_op.SetType("sum"); + std::vector repeated_grad_names; + for (auto r : kv.second) { + repeated_grad_names.push_back(r->Var()->Name()); + } + sum_op.SetInput("X", repeated_grad_names); + sum_op.SetOutput("Out", {kv.first->Var()->Name()}); + sum_op.SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), + static_cast(OpRole::kBackward)); + auto sum_op_node = result.CreateOpNode(&sum_op); + for (auto r : kv.second) { + sum_op_node->inputs.push_back(r); + r->outputs.push_back(sum_op_node); + } + auto sum_out_var_node = result.CreateVarNode(kv.first->Var()); + sum_op_node->outputs.push_back(sum_out_var_node); + sum_out_var_node->inputs.push_back(sum_op_node); + created[sum_out_var_node->Name()].push_back(sum_out_var_node); + + OpDesc scale_op; + scale_op.SetType("scale"); + scale_op.SetInput("X", {sum_out_var_node->Var()->Name()}); + // NOTE: inplace scale. + scale_op.SetOutput("Out", {sum_out_var_node->Var()->Name()}); + scale_op.SetAttr("scale", static_cast(1.0f / num_repeats)); + scale_op.SetAttr(OpProtoAndCheckerMaker::OpRoleAttrName(), + static_cast(OpRole::kBackward)); + auto scale_op_node = result.CreateOpNode(&scale_op); + scale_op_node->inputs.push_back(sum_out_var_node); + sum_out_var_node->outputs.push_back(scale_op_node); + auto scale_out_var_node = result.CreateVarNode(sum_out_var_node->Var()); + scale_op_node->outputs.push_back(scale_out_var_node); + scale_out_var_node->inputs.push_back(scale_op_node); + created[scale_out_var_node->Name()].push_back(scale_out_var_node); + } + // 6. add optimize ops + { + auto copy_node = [&result, &created](ir::Node* node) { + auto op_node = result.CreateOpNode(node->Op()); + // copy op ins/outs + // NOTE: for send/recv ops, the OpDesc uses ctrldepvar to describe + // dependencies, so create those depvars if OpDesc have in/outs. 
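
// Aside on step 5 above: summing the per-repeat gradients and then scaling in
// place by 1.0f / num_repeats is plain averaging,
//   g_merged = (g.repeat.0 + g.repeat.1 + ... + g.repeat.{n-1}) / n,
// so the optimizer sees the same gradient a single n-times-larger batch would
// produce, for losses that average over the batch. A scalar sketch
// (illustrative only; assumes <numeric> is available):
inline float MergedGrad(const std::vector<float> &per_repeat_grads) {
  float sum = std::accumulate(per_repeat_grads.begin(),
                              per_repeat_grads.end(), 0.0f);  // the sum op
  return sum * (1.0f / per_repeat_grads.size());  // the in-place scale op
}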
+ for (auto in_node : node->inputs) { + if (in_node->IsCtrlVar() && !in_node->Var()) { + continue; + } + ir::Node* var = nullptr; + if (created.find(in_node->Name()) == created.end()) { + var = result.CreateVarNode(in_node->Var()); + created[in_node->Name()].push_back(var); + } else { + var = created.at(in_node->Name()).back(); + } + op_node->inputs.push_back(var); + var->outputs.push_back(op_node); + } + for (auto out_node : node->outputs) { + if (out_node->IsCtrlVar() && !out_node->Var()) { + continue; + } + auto var = result.CreateVarNode(out_node->Var()); + created[out_node->Name()].push_back(var); + op_node->outputs.push_back(var); + var->inputs.push_back(op_node); + } + }; + for (auto node : lr_ops) { + copy_node(node); + } + for (auto node : optimize_ops) { + copy_node(node); + } + } + + result.ResolveHazard(created); + return graph; +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(multi_batch_merge_pass, paddle::framework::ir::BatchMergePass) + .RequirePassAttr(paddle::framework::ir::kNumRepeats); diff --git a/paddle/fluid/framework/ir/multi_batch_merge_pass.h b/paddle/fluid/framework/ir/multi_batch_merge_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..c1e5aef20dbc60c18ed03038818bfd8ab217bf28 --- /dev/null +++ b/paddle/fluid/framework/ir/multi_batch_merge_pass.h @@ -0,0 +1,44 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +// BatchMergePass copies the forward and backward ops several +// times to run several batches and simulate large-batch training +// as if we had more than 1 GPU. +// Users can define how many batches to run; gradients are merged +// across those repeats, and optimization then uses the merged gradients. +// This pass is extremely useful for large-batch distributed +// sync training, since we can simulate an even larger batch size as if we +// had more GPUs. + +class BatchMergePass : public Pass { + public: + virtual ~BatchMergePass() {} + + protected: + std::unique_ptr ApplyImpl(std::unique_ptr graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/node.cc b/paddle/fluid/framework/ir/node.cc index 9277abe8c1b79c5f76f4610d0554bf337f329518..50d9113088903aa7681d6c6af5cc65f846d32787 100644 --- a/paddle/fluid/framework/ir/node.cc +++ b/paddle/fluid/framework/ir/node.cc @@ -17,8 +17,12 @@ limitations under the License. */ namespace paddle { namespace framework { namespace ir { +// msvc15 doesn't support constexpr correctly.
+#if !defined(_WIN32) constexpr char Node::kControlDepVarName[]; -int Node::count_ = 0; +#else +const char Node::kControlDepVarName[] = "__control_var"; +#endif std::unique_ptr CreateNodeForTest(const std::string& name, Node::Type type) { diff --git a/paddle/fluid/framework/ir/node.h b/paddle/fluid/framework/ir/node.h index 5d6da9f1d76a3c0fc64b7ff35264e385cf19a14b..d2a393b3f19e9aab79098757dae663d030b0fa2b 100644 --- a/paddle/fluid/framework/ir/node.h +++ b/paddle/fluid/framework/ir/node.h @@ -15,7 +15,10 @@ limitations under the License. */ #pragma once #include +#include +#include #include + #include "paddle/fluid/framework/op_desc.h" #include "paddle/fluid/framework/var_desc.h" #include "paddle/fluid/platform/macros.h" @@ -24,11 +27,39 @@ namespace paddle { namespace framework { namespace ir { -// Node should normally created by Graph::CreateXXXNode(). +// Node should only be created by Graph::CreateXXXNode(). +// 1. Every Node should be part of a graph. No dangling Node exists. +// 2. Node only contains members necessary for building graph structure. +// It doesn't contain other unrelated members, such as device, etc. +// +// Sometimes, for specific usages, Node needs to have additional members, +// such as device_placement, version in order to be executed. It is suggested +// to use composition pattern. +// +// class RunnableOp { +// RunnableOp(ir::Node* n) : n_(n) { n_.WrappedBy(this); } +// +// int any_thing_; +// } +// +// RunnableOp is owned by the ir::Node that composes it. In other words, +// ir::Node will be responsible for deleting RunnableOp, say, when ir::Node +// is deleted from the graph. class Node { public: + virtual ~Node() { + if (!wrapper_.empty()) { + VLOG(4) << "ir::Node deleting a wrapper node " << Name(); + wrapper_deleter_(); + } + } + enum class Type { kOperation, kVariable }; +#if !defined(_WIN32) // msvc does not support constexpr correctly. static constexpr char kControlDepVarName[] = "__control_var"; +#else + static const char kControlDepVarName[]; +#endif Type NodeType() const { return type_; } @@ -44,6 +75,30 @@ class Node { return op_desc_.get(); } + // Set the `wrapper` that wraps the Node. `wrapper` is owned by Node. + template + void WrappedBy(T* wrapper) { + if (!wrapper_.empty()) { + wrapper_deleter_(); + } + wrapper_ = wrapper; + wrapper_deleter_ = [wrapper]() { delete wrapper; }; + wrapper_type_ = std::type_index(typeid(T)); + } + + // Return a reference to the `wrapper`. + template + T& Wrapper() { + return *boost::any_cast(wrapper_); + } + + // Test if the Node is wrapped by type T. + template + bool IsWrappedBy() { + return std::type_index(typeid(T)) == wrapper_type_; + } + + // Please don't use this API! int id() const { return id_; } bool IsOp() const { return type_ == Type::kOperation; } @@ -64,35 +119,34 @@ class Node { int id_; private: + // ID can only be set by a Graph.
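
// The wrapper members above pair with ir::FilterByNodeWrapper from
// graph_helper.h (added earlier in this patch) to recover typed wrappers
// from a graph. A hedged usage sketch reusing the RunnableOp example from
// the comment above (RunAll is illustrative, not part of this patch):
void RunAll(const Graph &graph) {
  // Collect every node wrapped by a RunnableOp; each node owns its wrapper
  // and deletes it in ~Node() via the registered wrapper_deleter_.
  for (RunnableOp *op : FilterByNodeWrapper<RunnableOp>(graph)) {
    // ... execute *op ...
  }
}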
+ void SetId(int id) { id_ = id; } + friend class Graph; friend std::unique_ptr CreateNodeForTest(const std::string& name, Node::Type type); explicit Node(const std::string& name, Type type) - : name_(name), - var_desc_(nullptr), - op_desc_(nullptr), - type_(type), - id_(count_++) {} + : name_(name), var_desc_(nullptr), op_desc_(nullptr), type_(type) {} explicit Node(VarDesc* var_desc) : name_(var_desc->Name()), var_desc_(new VarDesc(*var_desc)), op_desc_(nullptr), - type_(Type::kVariable), - id_(count_++) {} + type_(Type::kVariable) {} explicit Node(OpDesc* op_desc) : name_(op_desc->Type()), var_desc_(nullptr), op_desc_(new OpDesc(*op_desc, op_desc->Block())), - type_(Type::kOperation), - id_(count_++) {} + type_(Type::kOperation) {} Node() = delete; - static int count_; - static void ResetId() { count_ = 0; } + boost::any wrapper_; + std::function wrapper_deleter_; + std::type_index wrapper_type_ = std::type_index(typeid(void)); + DISABLE_COPY_AND_ASSIGN(Node); }; diff --git a/paddle/fluid/framework/ir/node_test.cc b/paddle/fluid/framework/ir/node_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..694efadda078169c993457181c00f7b357a09e87 --- /dev/null +++ b/paddle/fluid/framework/ir/node_test.cc @@ -0,0 +1,80 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include "gtest/gtest.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/pass.h" + +namespace paddle { +namespace framework { +namespace ir { + +class RunnableOp { + public: + RunnableOp(Node* node, bool* alive) : node_(node), alive_(alive) { + node_->WrappedBy(this); + } + + virtual ~RunnableOp() { *alive_ = false; } + + private: + Node* node_; + bool* alive_; +}; + +class RunnableOp2 { + public: + RunnableOp2(Node* node, bool* alive) : node_(node), alive_(alive) { + node_->WrappedBy(this); + } + + virtual ~RunnableOp2() { *alive_ = false; } + + private: + Node* node_; + bool* alive_; +}; + +TEST(NodeTest, Basic) { + bool alive1 = true; + bool alive2 = true; + std::unique_ptr n1(CreateNodeForTest("n1", Node::Type::kVariable)); + std::unique_ptr n2(CreateNodeForTest("n2", Node::Type::kVariable)); + + EXPECT_FALSE(n1->IsWrappedBy()); + EXPECT_FALSE(n1->IsWrappedBy()); + EXPECT_FALSE(n2->IsWrappedBy()); + EXPECT_FALSE(n2->IsWrappedBy()); + + new RunnableOp(n1.get(), &alive1); + new RunnableOp2(n2.get(), &alive2); + + EXPECT_TRUE(n1->IsWrappedBy()); + EXPECT_FALSE(n1->IsWrappedBy()); + EXPECT_FALSE(n2->IsWrappedBy()); + EXPECT_TRUE(n2->IsWrappedBy()); + + EXPECT_TRUE(alive1); + EXPECT_TRUE(alive2); + + n1.reset(nullptr); + n2.reset(nullptr); + EXPECT_FALSE(alive1); + EXPECT_FALSE(alive2); +} + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/pass.h b/paddle/fluid/framework/ir/pass.h index 9570c59cff2a6afeb1c607f7219b7b455974d6ce..615b539695de8c3f9a256d17d4d49e61902da394 100644 --- a/paddle/fluid/framework/ir/pass.h +++ b/paddle/fluid/framework/ir/pass.h @@ -76,7 +76,7 @@ class Pass { attr_name); attrs_[attr_name] = attr; attr_dels_[attr_name] = [attr, attr_name]() { - VLOG(3) << "deleting " << attr_name; + VLOG(30) << "deleting " << attr_name; delete attr; }; } @@ -93,6 +93,7 @@ class Pass { protected: virtual std::unique_ptr ApplyImpl(std::unique_ptr graph) const { LOG(FATAL) << "Calling virtual Pass not implemented."; + return graph; } private: @@ -196,26 +197,26 @@ struct PassRegistrar : public Registrar { msg) // Register a new pass that can be applied on the IR. 
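
// Hypothetical use of the two macros below: the translation unit that
// defines a pass registers it once with REGISTER_PASS, and any binary that
// wants the pass linked in pulls the registrar with USE_PASS (MyDemoPass and
// my_demo_pass are placeholder names, not part of this patch):
//
//   // in my_demo_pass.cc
//   REGISTER_PASS(my_demo_pass, paddle::framework::ir::MyDemoPass);
//
//   // in the consumer's .cc, at global namespace
//   USE_PASS(my_demo_pass);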
-#define REGISTER_PASS(pass_type, pass_class) \ - STATIC_ASSERT_PASS_GLOBAL_NAMESPACE( \ - __reg_pass__##pass_type, \ - "REGISTER_PASS must be called in global namespace"); \ - static ::paddle::framework::ir::PassRegistrar<pass_class> \ - __pass_registrar_##pass_type##__(#pass_type); \ - int TouchPassRegistrar_##pass_type() { \ - __pass_registrar_##pass_type##__.Touch(); \ - return 0; \ - } \ - static ::paddle::framework::ir::PassRegistrar<pass_class> \ - &__pass_tmp_registrar_##pass_type##__ __attribute__((unused)) = \ +#define REGISTER_PASS(pass_type, pass_class) \ + STATIC_ASSERT_PASS_GLOBAL_NAMESPACE( \ + __reg_pass__##pass_type, \ + "REGISTER_PASS must be called in global namespace"); \ + static ::paddle::framework::ir::PassRegistrar<pass_class> \ + __pass_registrar_##pass_type##__(#pass_type); \ + int TouchPassRegistrar_##pass_type() { \ + __pass_registrar_##pass_type##__.Touch(); \ + return 0; \ + } \ + static ::paddle::framework::ir::PassRegistrar<pass_class> \ + &__pass_tmp_registrar_##pass_type##__ UNUSED = \ __pass_registrar_##pass_type##__ -#define USE_PASS(pass_type) \ - STATIC_ASSERT_PASS_GLOBAL_NAMESPACE( \ - __use_pass_itself_##pass_type, \ - "USE_PASS must be called in global namespace"); \ - extern int TouchPassRegistrar_##pass_type(); \ - static int use_pass_itself_##pass_type##_ __attribute__((unused)) = \ +#define USE_PASS(pass_type) \ + STATIC_ASSERT_PASS_GLOBAL_NAMESPACE( \ + __use_pass_itself_##pass_type, \ + "USE_PASS must be called in global namespace"); \ + extern int TouchPassRegistrar_##pass_type(); \ + static int use_pass_itself_##pass_type##_ UNUSED = \ TouchPassRegistrar_##pass_type() } // namespace ir diff --git a/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.cc b/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.cc index a7d5161c35db804703415066990f34da8109fbd9..b7687d61de3eacd47ff1208ba14c3f482215c1d4 100644 --- a/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.cc +++ b/paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.cc @@ -12,10 +12,13 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.h" +#include +#include + #include "paddle/fluid/framework/ir/fuse_pass_base.h" #include "paddle/fluid/framework/ir/graph_pattern_detector.h" #include "paddle/fluid/framework/ir/graph_viz_pass.h" +#include "paddle/fluid/framework/ir/seq_concat_fc_fuse_pass.h" #include "paddle/fluid/framework/lod_tensor.h" namespace paddle { @@ -159,10 +162,7 @@ PDNode* BuildFCPattern(PDPattern* pattern, PDNode* fc_x) { std::set<std::string> acts({"sigmoid", "tanh", "relu", "identity"}); PDNode* act = pattern->NewNode( - [=](Node* x) { - return x && x->IsOp() && acts.count(x->Op()->Type()); - - }, + [=](Node* x) { return x && x->IsOp() && acts.count(x->Op()->Type()); }, "act"); PDNode* fc_out = pattern->NewNode( @@ -196,7 +196,7 @@ std::unique_ptr<ir::Graph> SeqConcatFcFusePass::ApplyImpl( detector(graph.get(), [&](const GraphPatternDetector::subgraph_t& subgraph, Graph* graph) { - VLOG(4) << "get one concat pattern"; + VLOG(40) << "get one concat pattern"; // fc GET_NODE(fc_w, detector.pattern()); GET_NODE(fc_bias, detector.pattern()); diff --git a/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.cc b/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..015b5e3c6363cc96e31e21095fbbb007543c99af --- /dev/null +++ b/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.cc @@ -0,0 +1,101 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.h" +#include +#include "paddle/fluid/framework/lod_tensor.h" + +namespace paddle { +namespace framework { +namespace ir { + +int BuildFusion(Graph* graph, const std::string& name_scope, Scope* scope) { + GraphPatternDetector gpd; + auto* pattern = gpd.mutable_pattern(); + + PDNode* x = pattern->NewNode(patterns::PDNodeName(name_scope, "X")) + ->assert_is_op_input("sequence_conv") + ->assert_var_not_persistable(); + patterns::SeqConvEltAddRelu fuse_pattern(pattern, name_scope); + fuse_pattern(x); + + // Create New OpDesc + auto fuse_creator = [&](Node* seqconv, Node* input, Node* seqconv_weight, + Node* eltadd_bias, Node* relu_out) { + OpDesc op_desc; + op_desc.SetType("fusion_seqconv_eltadd_relu"); + op_desc.SetInput("X", {input->Name()}); + op_desc.SetInput("Filter", {seqconv_weight->Name()}); + op_desc.SetInput("Bias", {eltadd_bias->Name()}); + op_desc.SetAttr("contextLength", seqconv->Op()->GetAttr("contextLength")); + op_desc.SetAttr("contextStart", seqconv->Op()->GetAttr("contextStart")); + op_desc.SetAttr("contextStride", seqconv->Op()->GetAttr("contextStride")); + PADDLE_ENFORCE(graph->Has(kParamScopeAttr)); + auto* scope = graph->Get<Scope*>(kParamScopeAttr); + const std::string ColMat = patterns::UniqueKey("SeqConvColMat"); + op_desc.SetOutput("ColMat", {ColMat}); + op_desc.SetOutput("Out", {relu_out->Name()}); + scope->Var(ColMat)->GetMutable<LoDTensor>(); + + auto* op = graph->CreateOpNode(&op_desc); + IR_NODE_LINK_TO(input, op); + IR_NODE_LINK_TO(seqconv_weight, op); + IR_NODE_LINK_TO(eltadd_bias, op); + IR_NODE_LINK_TO(op, relu_out); + return op; + }; + + int fusion_count{0}; + + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + VLOG(40) << "handle SeqConv EltAdd Relu fuse"; + GET_IR_NODE_FROM_SUBGRAPH(seqconv, seqconv, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(seqconv_weight, seqconv_weight, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(seqconv_out, seqconv_out, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(eltadd, eltadd, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(eltadd_bias, eltadd_bias, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(eltadd_out, eltadd_out, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(relu, relu, fuse_pattern); + GET_IR_NODE_FROM_SUBGRAPH(relu_out, relu_out, fuse_pattern); + + fuse_creator(seqconv, subgraph.at(x), seqconv_weight, eltadd_bias, + relu_out); + std::unordered_set<const Node*> marked_nodes( + {seqconv, seqconv_out, eltadd, eltadd_out, relu}); + GraphSafeRemoveNodes(graph, marked_nodes); + ++fusion_count; + }; + + gpd(graph, handler); + + return fusion_count; +} + +std::unique_ptr<ir::Graph> SeqConvEltAddReluFusePass::ApplyImpl( + std::unique_ptr<ir::Graph> graph) const { + FusePassBase::Init(name_scope_, graph.get()); + + int fusion_count = BuildFusion(graph.get(), name_scope_, param_scope()); + AddStatis(fusion_count); + + return graph; +} + +} // namespace ir +} // namespace framework +} // namespace paddle +
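A hedged usage sketch, not part of this diff: retrieving the pass registered just below through the pass registry and applying it to a graph, following the ir::Pass/PassRegistry API used elsewhere in this codebase.

#include "paddle/fluid/framework/ir/pass.h"

std::unique_ptr<paddle::framework::ir::Graph> ApplySeqConvFuse(
    std::unique_ptr<paddle::framework::ir::Graph> graph) {
  auto pass = paddle::framework::ir::PassRegistry::Instance().Get(
      "seqconv_eltadd_relu_fuse_pass");
  // The graph is assumed to carry the parameter scope attribute
  // (kParamScopeAttr), since BuildFusion creates the ColMat workspace
  // variable inside that scope.
  return pass->Apply(std::move(graph));
}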
+REGISTER_PASS(seqconv_eltadd_relu_fuse_pass, + paddle::framework::ir::SeqConvEltAddReluFusePass); diff --git a/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.h b/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..dac9de71930c1768bdf416520caae6468449cd3d --- /dev/null +++ b/paddle/fluid/framework/ir/seqconv_eltadd_relu_fuse_pass.h @@ -0,0 +1,38 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" + +namespace paddle { +namespace framework { +namespace ir { + +class SeqConvEltAddReluFusePass : public FusePassBase { + public: + virtual ~SeqConvEltAddReluFusePass() {} + + protected: + std::unique_ptr<ir::Graph> ApplyImpl(std::unique_ptr<ir::Graph> graph) const; + + const std::string name_scope_{"seqconv_eltadd_relu_fuse"}; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/lod_rank_table.cc b/paddle/fluid/framework/lod_rank_table.cc index 6bc795b642bf79b7556869c5ebe9b0323d3cc5fc..660ce2ec85131bafae27e8b7800fbfa3c238b59a 100644 --- a/paddle/fluid/framework/lod_rank_table.cc +++ b/paddle/fluid/framework/lod_rank_table.cc @@ -31,7 +31,7 @@ void LoDRankTable::Reset(const LoD& lod, size_t level) { TableItem item; item.index = i; item.length = vec[i + 1] - vec[i]; - VLOG(10) << "Add item to rank table " << item.index << " " << item.length; + VLOG(100) << "Add item to rank table " << item.index << " " << item.length; items_.emplace_back(item); } // NOTE(yuyang18): diff --git a/paddle/fluid/framework/lod_tensor.cc b/paddle/fluid/framework/lod_tensor.cc index 1e7da9a69c7cbf8c13306656599a759515802b76..9b2eeaf59a536b5e8577f513b392648ed4270aab 100644 --- a/paddle/fluid/framework/lod_tensor.cc +++ b/paddle/fluid/framework/lod_tensor.cc @@ -26,10 +26,8 @@ limitations under the License.
*/ #include "paddle/fluid/memory/memcpy.h" #include "paddle/fluid/memory/memory.h" -#if !defined(_WIN32) #include "paddle/fluid/recordio/scanner.h" #include "paddle/fluid/recordio/writer.h" -#endif // _WIN32 namespace paddle { namespace framework { @@ -305,7 +303,6 @@ void DeserializeFromStream(std::istream &is, LoDTensor *tensor, TensorFromStream(is, static_cast(tensor), dev_ctx); } -#if !defined(_WIN32) void WriteToRecordIO(recordio::Writer *writer, const std::vector &tensor, const platform::DeviceContext &dev_ctx) { @@ -335,19 +332,7 @@ bool ReadFromRecordIO(recordio::Scanner *scanner, return true; } -#else -class Writer {}; -class Scanner {}; -void WriteToRecordIO(recordio::Writer *writer, - const std::vector &tensor, - const platform::DeviceContext &dev_ctx) {} -bool ReadFromRecordIO(recordio::Scanner *scanner, - const platform::DeviceContext &dev_ctx, - std::vector *result_ptr) { - PADDLE_ENFORCE("windows didn't supported recordio!."); - return true; -} -#endif // _WIN32 + std::vector LoDTensor::SplitLoDTensor( const std::vector places) const { check_memory_size(); @@ -418,7 +403,7 @@ void LoDTensor::MergeLoDTensor( PADDLE_ENFORCE_EQ(new_lod.size(), lod.size()); for (size_t j = 0; j < lod.size(); ++j) { auto &sub_lod = new_lod[j]; - auto &offset = sub_lod.back(); + size_t offset = sub_lod.back(); for (size_t k = 1; k < lod[j].size(); ++k) { sub_lod.push_back(lod[j][k] + offset); } diff --git a/paddle/fluid/framework/lod_tensor.h b/paddle/fluid/framework/lod_tensor.h index e9b473d547252e80ed26ec61e1a33fbe1742dbe0..fb6e781fd07b9033bea547118b8338ad8b705c5e 100644 --- a/paddle/fluid/framework/lod_tensor.h +++ b/paddle/fluid/framework/lod_tensor.h @@ -111,9 +111,6 @@ class LoDTensor : public Tensor { public: LoDTensor() : Tensor() {} - /* Constructor with place should only be used in pybind */ - explicit LoDTensor(const platform::Place& place) : Tensor(place) {} - explicit LoDTensor(const LoD& lod) : lod_(lod) {} void set_lod(const LoD& lod) { lod_ = lod; } diff --git a/paddle/fluid/framework/lod_tensor_array.h b/paddle/fluid/framework/lod_tensor_array.h index 6d7b6a4ada8729e3698dab5d2b1861aac632be79..36a5c3c5d601390beedaf37ceb98ee2c63ecf5a6 100644 --- a/paddle/fluid/framework/lod_tensor_array.h +++ b/paddle/fluid/framework/lod_tensor_array.h @@ -18,6 +18,8 @@ limitations under the License. 
*/ namespace paddle { namespace framework { + using LoDTensorArray = std::vector<LoDTensor>; -} + +} // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/lod_tensor_test.cc b/paddle/fluid/framework/lod_tensor_test.cc index cbf5fd04d73007d303d0fd96064e3a2d7f21cfb8..cd50aaa26054b78f1b1e8f0d470b397892155a2b 100644 --- a/paddle/fluid/framework/lod_tensor_test.cc +++ b/paddle/fluid/framework/lod_tensor_test.cc @@ -274,7 +274,6 @@ TEST(LoD, ConvertToOffsetBasedLoD) { EXPECT_EQ(offset_lod, expected); } -#if !defined(_WIN32) template <typename T> static void TestRecordIO() { LoDTensor tensor; @@ -321,7 +320,6 @@ TEST(LoDTensor, RecordIO) { TestRecordIO(); TestRecordIO(); } -#endif // !defined(_WIN32) } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/mixed_vector.h b/paddle/fluid/framework/mixed_vector.h index 77386f4f069489b6ff7b927a281bdc286ff816e0..6940250c3f9663bbb734d5a6eb78135aecbc3a3b 100644 --- a/paddle/fluid/framework/mixed_vector.h +++ b/paddle/fluid/framework/mixed_vector.h @@ -23,6 +23,7 @@ #include "paddle/fluid/framework/details/cow_ptr.h" #include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/memory/malloc.h" #include "paddle/fluid/memory/memcpy.h" #include "glog/logging.h" @@ -31,46 +32,6 @@ namespace paddle { namespace framework { #if defined(PADDLE_WITH_CUDA) -namespace details { -struct CUDABuffer { - void *data_{nullptr}; - size_t size_{0}; - platform::CUDAPlace place_; - - CUDABuffer() {} - CUDABuffer(platform::Place place, size_t size) - : size_(size), place_(boost::get<platform::CUDAPlace>(place)) { - data_ = memory::Alloc(place_, size); - } - - ~CUDABuffer() { ClearMemory(); } - - CUDABuffer(const CUDABuffer &o) = delete; - CUDABuffer &operator=(const CUDABuffer &o) = delete; - - void Resize(platform::Place place, size_t size) { - ClearMemory(); - place_ = boost::get<platform::CUDAPlace>(place); - data_ = memory::Alloc(place_, size); - PADDLE_ENFORCE_NOT_NULL(data_); - size_ = size; - } - - void Swap(CUDABuffer &o) { - std::swap(data_, o.data_); - std::swap(place_, o.place_); - std::swap(size_, o.size_); - } - - private: - void ClearMemory() const { - if (data_ != nullptr) { - memory::Free(place_, data_); - } - } -}; -} // namespace details - // Vector implements the std::vector interface, and can get Data or // MutableData from any place. The data will be synced implicitly inside. template <typename T> @@ -103,8 +64,6 @@ class Vector { o.ImmutableCPU(); cpu_ = o.cpu_; flag_ = kDataInCPU; - details::CUDABuffer null; - gpu_.Swap(null); return *this; } @@ -199,7 +158,7 @@ class Vector { PADDLE_ENFORCE(platform::is_gpu_place(place), "CUDA Data must on CUDA place"); ImmutableCUDA(place); - return reinterpret_cast<T *>(gpu_.data_); + return reinterpret_cast<T *>(gpu_->ptr()); } // get cuda ptr. mutable @@ -234,13 +193,11 @@ class Vector { std::mutex &Mutex() const { return mtx_; } - std::unique_ptr<platform::CUDAPlace> CUDAPlace() const { - if (gpu_.data_ == nullptr) { - return nullptr; - } else { - return std::unique_ptr<platform::CUDAPlace>( - new platform::CUDAPlace(gpu_.place_)); - } + boost::optional<platform::CUDAPlace> CUDAPlace() const { + return gpu_ == nullptr + ? boost::none + : boost::optional<platform::CUDAPlace>( + boost::get<platform::CUDAPlace>(gpu_->place())); + }
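To make the flag machinery around this accessor concrete, a minimal sketch (assuming a CUDA build) of the implicit sync a caller sees: a CPU-side write marks the vector dirty, and the next CUDAData() call re-copies into a freshly memory::Alloc'd buffer held by gpu_.

#include "paddle/fluid/framework/mixed_vector.h"
#include "paddle/fluid/platform/place.h"

void MixedVectorSyncDemo() {
  paddle::framework::Vector<int> v({1, 2, 3});  // data starts in kDataInCPU
  v[0] = 7;                                     // mutable access sets kDirty
  paddle::platform::CUDAPlace gpu(0);
  const int* gpu_ptr = v.CUDAData(gpu);  // copies CPU -> new GPU allocation
  (void)gpu_ptr;  // gpu_ now owns a memory::AllocationPtr on `gpu`
}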
private: @@ -254,13 +211,12 @@ class Vector { void CopyToCPU() const { // COPY GPU Data To CPU auto *dev_ctx = static_cast<platform::CUDADeviceContext *>( - platform::DeviceContextPool::Instance().Get( - platform::Place(gpu_.place_))); + platform::DeviceContextPool::Instance().Get(gpu_->place())); auto stream = dev_ctx->stream(); - void *src = gpu_.data_; + void *src = gpu_->ptr(); void *dst = cpu_.data(); - memory::Copy(platform::CPUPlace(), dst, gpu_.place_, src, gpu_.size_, - stream); + memory::Copy(platform::CPUPlace(), dst, CUDAPlace().get(), src, + gpu_->size(), stream); dev_ctx->Wait(); } @@ -277,8 +233,7 @@ class Vector { CopyCPUDataToCUDA(place); UnsetFlag(kDirty); SetFlag(kDataInCUDA); - } else if (IsInCUDA() && - !(boost::get<platform::CUDAPlace>(place) == gpu_.place_)) { + } else if (IsInCUDA() && !(place == gpu_->place())) { PADDLE_THROW("This situation should not happen"); // Still dirty } else { @@ -290,7 +245,7 @@ class Vector { // Even data is not dirty. However, data is not in CUDA. Copy data. CopyCPUDataToCUDA(place); SetFlag(kDataInCUDA); - } else if (!(boost::get<platform::CUDAPlace>(place) == gpu_.place_)) { + } else if (!(place == gpu_->place())) { PADDLE_THROW("This situation should not happen."); } else { // Not Dirty && DataInCUDA && Device is same @@ -301,13 +256,13 @@ class Vector { void CopyCPUDataToCUDA(const platform::Place &place) const { void *src = cpu_.data(); - gpu_.Resize(place, cpu_.size() * sizeof(T)); - void *dst = gpu_.data_; + gpu_ = memory::Alloc(place, cpu_.size() * sizeof(T)); + void *dst = gpu_->ptr(); auto *dev_ctx = static_cast<platform::CUDADeviceContext *>( platform::DeviceContextPool::Instance().Get(place)); auto stream = dev_ctx->stream(); - memory::Copy(gpu_.place_, dst, platform::CPUPlace(), src, gpu_.size_, - stream); + memory::Copy(CUDAPlace().get(), dst, platform::CPUPlace(), src, + gpu_->size(), stream); } void ImmutableCPU() const { @@ -329,7 +284,7 @@ class Vector { bool IsInCPU() const { return flag_ & kDataInCPU; } mutable std::vector<T> cpu_; - mutable details::CUDABuffer gpu_; + mutable memory::AllocationPtr gpu_; mutable int flag_; mutable std::mutex mtx_; @@ -428,8 +383,8 @@ class Vector { auto &mtx = m_.Data().Mutex(); std::lock_guard<std::mutex> guard(mtx); auto cuda_place = m_.Data().CUDAPlace(); - if (cuda_place == nullptr || - *cuda_place == boost::get<platform::CUDAPlace>(place)) { + if (cuda_place == boost::none || + cuda_place == boost::get<platform::CUDAPlace>(place)) { return m_.Data().CUDAData(place); } } @@ -444,8 +399,8 @@ class Vector { auto &mtx = m_.Data().Mutex(); std::lock_guard<std::mutex> guard(mtx); auto cuda_place = m_.Data().CUDAPlace(); - if (cuda_place == nullptr || - *cuda_place == boost::get<platform::CUDAPlace>(place)) { + if (cuda_place == boost::none || + cuda_place == boost::get<platform::CUDAPlace>(place)) { return m_.MutableData()->CUDAMutableData(place); } } @@ -542,6 +497,33 @@ class CPUVector : public std::vector<T, std::allocator<T>> { this->reserve(this->size() + size_t(end - begin)); this->insert(this->end(), begin, end); } + + const T *CUDAData(platform::Place place) const { + PADDLE_THROW( + "Vector::CUDAData() method is not supported in CPU-only version"); + } + + T *CUDAMutableData(platform::Place place) { + PADDLE_THROW( + "Vector::CUDAMutableData() method is not supported in CPU-only " + "version"); + } + + const T *Data(platform::Place place) const { + PADDLE_ENFORCE( + platform::is_cpu_place(place), + "Vector::Data() method is not supported when not in CPUPlace"); + return this->data(); + } + + T *MutableData(platform::Place place) { + PADDLE_ENFORCE( + platform::is_cpu_place(place), + "Vector::MutableData() method is not supported when not in
CPUPlace"); + return this->data(); + } + + const void *Handle() const { return static_cast(this); } }; template diff --git a/paddle/fluid/framework/mixed_vector_test.cc b/paddle/fluid/framework/mixed_vector_test.cc index 0599c8d384641606b0a5ebb5ba1781b56f539e63..0330cae377c32b2d49d409eff42b968d81356d49 100644 --- a/paddle/fluid/framework/mixed_vector_test.cc +++ b/paddle/fluid/framework/mixed_vector_test.cc @@ -51,7 +51,7 @@ TEST(mixed_vector, InitWithCount) { TEST(mixed_vector, ForEach) { vec tmp; for (auto& v : tmp) { - VLOG(3) << v; + VLOG(30) << v; } } diff --git a/paddle/fluid/framework/naive_executor.cc b/paddle/fluid/framework/naive_executor.cc index 2840d503f1454271afb309efdd435225ab077dc0..e8295639520b5838dce3c9c9e443cc846bd9c1ec 100644 --- a/paddle/fluid/framework/naive_executor.cc +++ b/paddle/fluid/framework/naive_executor.cc @@ -57,60 +57,69 @@ static void InitializeVariable(Variable *var, proto::VarType::Type var_type) { } } -void NaiveExecutor::Prepare(Scope *parent_scope, - const ProgramDesc &program_desc, int block_id, - bool with_feed_fetch_ops) { - if (!parent_scope) { +void NaiveExecutor::Prepare(Scope *scope, const ProgramDesc &program_desc, + int block_id, bool with_feed_fetch_ops) { + if (!scope) { scope_ = new framework::Scope; } else { - scope_ = &parent_scope->NewScope(); + scope_ = scope; } - CreateVariables(program_desc, scope_, block_id); + + VLOG(3) << "NaiveExecutor init with scope " << scope; CreateOps(program_desc, block_id, with_feed_fetch_ops); } void NaiveExecutor::Run() { +#ifndef PADDLE_ON_INFERENCE + LOG_FIRST_N(WARNING, 15) << "The NaiveExecutor can not work properly if the " + "cmake flag ON_INFER is not set."; + LOG_FIRST_N(WARNING, 15) << "Unlike the training phase, all the scopes and " + "variables will be reused to save the allocation " + "overhead."; + LOG_FIRST_N(WARNING, 15) << "Please re-compile the inference library by " + "setting the cmake flag ON_INFER=ON if you are " + "running Paddle Inference"; +#endif // PADDLE_ON_INFERENCE for (auto &op : ops_) { - VLOG(4) << "run " << op->Type(); + VLOG(3) << std::this_thread::get_id() << " run " << op->Type() + << " on scope " << scope_; + op->SetIsCalledByExecutor(false); op->Run(*scope_, place_); } } -void NaiveExecutor::CreateVariables(const ProgramDesc &desc, Scope *scope, - int block_id) { - PADDLE_ENFORCE(scope); +void NaiveExecutor::CreateVariables(const ProgramDesc &desc, int block_id, + bool persistable, Scope *scope) { + PADDLE_ENFORCE_NOT_NULL(scope); + auto &global_block = desc.Block(block_id); - const Scope *ancestor_scope = scope; - while (ancestor_scope->parent()) { - ancestor_scope = ancestor_scope->parent(); + const auto *anc = scope; + PADDLE_ENFORCE(anc->parent() != anc); + while (anc->parent()) { + anc = anc->parent(); } - if (ancestor_scope != scope) { - for (auto &var : global_block.AllVars()) { - if (var->Name() == framework::kEmptyVarName) { - continue; - } - // Create persistable vars in ancestor scope. - if (var->Persistable()) { - auto *ptr = const_cast(ancestor_scope)->Var(var->Name()); - InitializeVariable(ptr, var->GetType()); - VLOG(3) << "Create Variable " << var->Name() - << " global, which pointer is " << ptr; - } else { // Create temporary variables in local scope. 
- auto *ptr = scope->Var(var->Name()); + for (auto &var : global_block.AllVars()) { + if (var->Name() == framework::kEmptyVarName) { + continue; + } + + if (persistable == var->Persistable()) { + if (persistable) { + if (!anc->FindVar(var->Name())) { + auto *ptr = const_cast<Scope *>(anc)->Var(var->Name()); + VLOG(3) << scope << " Create persistable variable " << var->Name() + << ", which pointer is " << ptr; + InitializeVariable(ptr, var->GetType()); + } + } else { + auto *ptr = const_cast<Scope *>(scope)->Var(var->Name()); + VLOG(3) << scope << " Create variable " << var->Name() + << ", which pointer is " << ptr; InitializeVariable(ptr, var->GetType()); - VLOG(3) << "Create Variable " << var->Name() - << " locally, which pointer is " << ptr; } } - } else { - for (auto &var : global_block.AllVars()) { - auto *ptr = scope->Var(var->Name()); - InitializeVariable(ptr, var->GetType()); - VLOG(3) << "Create variable " << var->Name() << ", which pointer is " - << ptr; - } } } @@ -146,22 +155,5 @@ void NaiveExecutor::CleanFeedFetchOps() { ops_.swap(ops); } -void NaiveExecutor::EnableMKLDNN(const ProgramDesc &program) { -#ifdef PADDLE_WITH_MKLDNN - VLOG(3) << "use_mkldnn=True"; - for (size_t block_id = 0; block_id < program.Size(); ++block_id) { - auto *block = const_cast<ProgramDesc &>(program).MutableBlock(block_id); - for (auto *op : block->AllOps()) { - if (op->HasAttr("use_mkldnn")) { - op->SetAttr("use_mkldnn", true); - } - } - } -#else - LOG(WARNING) - << "'MKLDNN' is not supported, Please re-compile with WITH_MKLDNN option"; -#endif -} - } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/naive_executor.h b/paddle/fluid/framework/naive_executor.h index 9374f3f4a35cc0f90e5b2d6e8b397784b8eae123..5e673f68574c4ddaa4c9260367d09e9f62f6b751 100644 --- a/paddle/fluid/framework/naive_executor.h +++ b/paddle/fluid/framework/naive_executor.h @@ -35,8 +35,14 @@ class NaiveExecutor { // Create child scope. // Create variables. // @with_feed_fetch_ops: whether to work with the feed and fetch operators. - void Prepare(Scope* parent_scope, const ProgramDesc& program_desc, - int block_id, bool with_feed_fetch_ops); + void Prepare(Scope* scope, const ProgramDesc& program_desc, int block_id, + bool with_feed_fetch_ops); + + // Create variables before head. + // Create parameters if persistable is true, or create the temporary variables + // instead. + void CreateVariables(const ProgramDesc& desc, int block_id, bool persistable, + Scope* scope);
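A hedged sketch of the reworked calling convention: Prepare() no longer creates variables itself, so a caller creates persistables once (they land in the ancestor scope) and temporaries per working scope.

#include "paddle/fluid/framework/naive_executor.h"

void RunOnce(const paddle::framework::ProgramDesc& program,
             paddle::framework::Scope* scope,
             const paddle::platform::Place& place) {
  paddle::framework::NaiveExecutor exe(place);
  exe.Prepare(scope, program, /*block_id=*/0, /*with_feed_fetch_ops=*/false);
  exe.CreateVariables(program, 0, /*persistable=*/true, scope);   // parameters
  exe.CreateVariables(program, 0, /*persistable=*/false, scope);  // temporaries
  exe.Run();
}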
+ + // Run all the operators. void Run(); @@ -48,11 +54,7 @@ class NaiveExecutor { void CleanFeedFetchOps(); - void EnableMKLDNN(const ProgramDesc& program); - protected: - void CreateVariables(const ProgramDesc& desc, Scope* scope, int block_id); - void CreateOps(const ProgramDesc& desc, int block_id, bool with_feed_fetch_ops); diff --git a/paddle/fluid/framework/naive_executor_test.cc b/paddle/fluid/framework/naive_executor_test.cc index 6b9f79b9d398bc5a0ee6ba66587924daad0dbbc5..c917630666b082ab7148550707f9f1f720aa25d3 100644 --- a/paddle/fluid/framework/naive_executor_test.cc +++ b/paddle/fluid/framework/naive_executor_test.cc @@ -39,7 +39,7 @@ TEST(NaiveExecutor, Basic) { auto place = platform::CPUPlace(); NaiveExecutor exe(place); - exe.Prepare(nullptr, program, 0, false /*with feed fetch ops*/); + exe.Prepare(nullptr, program, 0, false); auto* a_tensor = exe.FindTensor("a"); auto* b_tensor = exe.FindTensor("b"); auto* c_tensor = exe.FindTensor("c"); diff --git a/paddle/fluid/framework/ngraph_bridge.cc b/paddle/fluid/framework/ngraph_bridge.cc new file mode 100644 index 0000000000000000000000000000000000000000..8177436d0bd90c3bcf8f91d5c55b66be188b19f9 --- /dev/null +++ b/paddle/fluid/framework/ngraph_bridge.cc @@ -0,0 +1,39 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#ifdef PADDLE_WITH_NGRAPH +#include +#include + +#include "paddle/fluid/framework/ngraph_bridge.h" + +#include "ngraph/ngraph.hpp" + +namespace paddle { +namespace framework { + +std::map<std::string, + std::function<void(const std::shared_ptr<OperatorBase>&, + std::shared_ptr<std::unordered_map< + std::string, std::shared_ptr<ngraph::Node>>>)>> + NgraphBridge::NG_NODE_MAP = {}; + +void NgraphBridge::build_graph(const std::shared_ptr<OperatorBase>& op) { + auto& op_type = op->Type(); + NG_NODE_MAP[op_type](op, ngb_node_map); +} + +} // namespace framework +} // namespace paddle +#endif diff --git a/paddle/fluid/framework/ngraph_bridge.h b/paddle/fluid/framework/ngraph_bridge.h new file mode 100644 index 0000000000000000000000000000000000000000..55bf0d21f3471013b1fb780e852d813313345f03 --- /dev/null +++ b/paddle/fluid/framework/ngraph_bridge.h @@ -0,0 +1,58 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License.
*/ + +#pragma once + +#ifdef PADDLE_WITH_NGRAPH + +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/platform/enforce.h" + +#include "ngraph/ngraph.hpp" + +namespace paddle { +namespace framework { + +class NgraphBridge { + public: + static std::map< + std::string, + std::function<void(const std::shared_ptr<OperatorBase>&, + std::shared_ptr<std::unordered_map< + std::string, std::shared_ptr<ngraph::Node>>>)>> + NG_NODE_MAP; + + explicit NgraphBridge( + std::shared_ptr< + std::unordered_map<std::string, std::shared_ptr<ngraph::Node>>> + var_node_map) + : ngb_node_map(var_node_map) {} + + void build_graph(const std::shared_ptr<OperatorBase>& op); + + private: + std::shared_ptr< + std::unordered_map<std::string, std::shared_ptr<ngraph::Node>>> + ngb_node_map; +}; + +} // namespace framework +} // namespace paddle +#endif diff --git a/paddle/fluid/framework/ngraph_operator.cc b/paddle/fluid/framework/ngraph_operator.cc new file mode 100644 index 0000000000000000000000000000000000000000..d967b2780c21713a2f9a73a3402964103f44269e --- /dev/null +++ b/paddle/fluid/framework/ngraph_operator.cc @@ -0,0 +1,220 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#ifdef PADDLE_WITH_NGRAPH +#include + +#include +#include + +#include "paddle/fluid/framework/feed_fetch_type.h" +#include "paddle/fluid/framework/ngraph_operator.h" +#include "paddle/fluid/framework/shape_inference.h" +#include "paddle/fluid/framework/var_desc.h" +#include "paddle/fluid/framework/var_type.h" + +namespace paddle { +namespace framework { + +static std::map<proto::VarType::Type, ngraph::element::Type> pd2ng_type_map = { + {proto::VarType::FP32, ngraph::element::f32}, + {proto::VarType::FP64, ngraph::element::f64}, + {proto::VarType::INT32, ngraph::element::i32}, + {proto::VarType::INT64, ngraph::element::i64}, + {proto::VarType::BOOL, ngraph::element::boolean}, +}; + +typedef enum { /* nGraph support state on ops */ + FULL_TRAIN, /* Support full ops for train */ + PARTIAL_TRAIN, /* Support partial ops for train */ + FULL_TEST, /* Support full list of ops for test */ + PARTIAL_TEST /* Support partial list of ops for test */ } op_state; + +class NgraphOperator { + public: + explicit NgraphOperator(const Scope& scope, const platform::Place& place, + const std::vector<std::shared_ptr<OperatorBase>>& ops, + const std::unordered_map< + std::string, ngraph::element::Type>& var_type_map, + const std::unordered_set<std::string>& persist, + const std::unordered_set<std::string>& fetches, + const std::unordered_set<std::string>& post_op_inputs, + op_state ng_op_state) + : scope_(scope), + place_(place), + fused_ops_(ops), + var_type_map_(var_type_map), + persistables_(persist), + fetches_(fetches), + post_op_inputs_(post_op_inputs), + ng_op_state_(ng_op_state) {} + + void Run(const Scope& scope, const platform::Place& place) const; + + private: + static std::unordered_map<std::string, std::shared_ptr<ngraph::Function>> + func_cache; + const Scope& scope_; + const platform::Place& place_; + std::vector<std::shared_ptr<OperatorBase>> fused_ops_; + std::unordered_map<std::string, ngraph::element::Type> var_type_map_; + std::unordered_set<std::string> persistables_; + std::unordered_set<std::string> fetches_; + std::unordered_set<std::string> post_op_inputs_; + op_state ng_op_state_; +}; + +std::vector<std::vector<std::vector<std::unique_ptr<OperatorBase>>::iterator>> +FusedOperator::FusedOpIntervals( + std::vector<std::unique_ptr<paddle::framework::OperatorBase>>* ops) { + std::vector<std::vector<std::vector<std::unique_ptr<OperatorBase>>::iterator>> + intervals; + if (ops->empty()) { + return intervals; + } + size_t size = ops->size(); + size_t left = 0; + while (left < size && ops->at(left)->Type() != kFeedOpType) { + ++left; + } + if (left == size) { + return intervals; + } + while (left < size && ops->at(left)->Type() == kFeedOpType) { + ++left; + } + + size_t right = left; + while (right < size && ops->at(right)->Type() != kFetchOpType) { + ++right; + } + if (right == size) { + return intervals; + } + if (left >= right) return intervals; + + // (left, right - 1) represents indices between feed and fetch + size_t pivot = left; + while (pivot < right) { + auto op_type = ops->at(pivot)->Type(); + if (paddle::framework::NgraphBridge::NG_NODE_MAP.find(op_type) == + paddle::framework::NgraphBridge::NG_NODE_MAP.end()) { + ++pivot; + } else { + size_t start = pivot, end = start; + while (pivot < right && + (paddle::framework::NgraphBridge::NG_NODE_MAP.find( + ops->at(pivot)->Type()) != + paddle::framework::NgraphBridge::NG_NODE_MAP.end())) { + ++pivot; + ++end; + } + std::vector<std::vector<std::unique_ptr<OperatorBase>>::iterator> + interval = {ops->begin() + start, ops->begin() + end}; + intervals.push_back(interval); + } + } // end while + + return intervals; +} +
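The interval scan above is easiest to see on a toy input. A self-contained analogue (hypothetical op names, plain strings instead of OperatorBase) that mirrors the feed/fetch trimming and the grouping of maximal nGraph-supported runs:

#include <cstddef>
#include <iostream>
#include <set>
#include <string>
#include <vector>

int main() {
  const std::set<std::string> supported{"mul", "elementwise_add", "relu"};
  const std::vector<std::string> ops{"feed",         "mul",  "elementwise_add",
                                     "lookup_table", "relu", "fetch"};
  size_t left = 1, right = ops.size() - 1;  // indices between feed and fetch
  for (size_t pivot = left; pivot < right;) {
    if (!supported.count(ops[pivot])) {
      ++pivot;
      continue;
    }
    size_t start = pivot;
    while (pivot < right && supported.count(ops[pivot])) ++pivot;
    // prints "interval [1, 3)" then "interval [4, 5)"
    std::cout << "interval [" << start << ", " << pivot << ")\n";
  }
  return 0;
}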
+FusedOperator::FusedOperator( + const ProgramDesc& prog, size_t block_id, + std::vector<std::unique_ptr<OperatorBase>>::iterator start, + std::vector<std::unique_ptr<OperatorBase>>::iterator end, + const std::string& type, const VariableNameMap& inputs, + const VariableNameMap& outputs, const AttributeMap& attrs) + : OperatorBase(type, inputs, outputs, attrs), pdesc_(prog), block_(block_id) { + for (std::vector<std::unique_ptr<OperatorBase>>::iterator it = start; + it != end; ++it) { + fused_ops_.push_back(std::move(*it)); + } + + for (std::vector<std::unique_ptr<OperatorBase>>::iterator it = end; + (*it)->Type() != kFetchOpType; ++it) { + for (auto& var_name_item : (*it)->Inputs()) { + for (auto& var_name : var_name_item.second) { + post_op_inputs_.insert(var_name); + } + } + } + + if ((*(start - 1))->Type() == kFeedOpType && (*end)->Type() == kFetchOpType) { + is_full_ = true; + } + + Process(); +} + +void FusedOperator::Process() { + auto& bdesc = pdesc_.Block(block_); + for (auto& var : bdesc.AllVars()) { + if (!(var->GetType() == proto::VarType::SELECTED_ROWS || + var->GetType() == proto::VarType::LOD_TENSOR || + var->GetType() == proto::VarType::LOD_TENSOR_ARRAY)) { + continue; + } + + auto var_name = var->Name(); + if (var->Name() == framework::kEmptyVarName) { + continue; + } + + if (var_name != "fetch" && var_name != "feed") { + auto pd_type = var->GetDataType(); + if (pd2ng_type_map.find(pd_type) == pd2ng_type_map.end()) { + PADDLE_THROW("Data type of var %s not found in pd2ng_type_map", + var_name); + } + var_type_map_[var_name] = pd2ng_type_map[pd_type]; + } + + if (var->Persistable()) { + persistables_.insert(var->Name()); + } + } + + for (auto* op : bdesc.AllOps()) { + if (op->Type() == kFetchOpType) { + std::string fetch_target_name = op->Input("X")[0]; + fetches_.insert(fetch_target_name); + } + } +} + +void FusedOperator::RunImpl(const Scope& scope, + const platform::Place& place) const { + op_state ng_op_state = PARTIAL_TEST; + auto& bdesc = pdesc_.Block(block_); + for (auto* op : bdesc.AllOps()) { + if (op->Type().find("_grad") != std::string::npos) { + ng_op_state = PARTIAL_TRAIN; + break; + } + } + + if (is_full_) { + ng_op_state = ng_op_state == PARTIAL_TEST ?
FULL_TEST : FULL_TRAIN; + } + + NgraphOperator ngraph_op(scope, place, fused_ops_, var_type_map_, + persistables_, fetches_, post_op_inputs_, + ng_op_state); + ngraph_op.Run(scope, place); +} + +} // namespace framework +} // namespace paddle +#endif diff --git a/paddle/fluid/framework/ngraph_operator.h b/paddle/fluid/framework/ngraph_operator.h new file mode 100644 index 0000000000000000000000000000000000000000..0f655cef1dde624bcf4944b5c096279097e1c8ae --- /dev/null +++ b/paddle/fluid/framework/ngraph_operator.h @@ -0,0 +1,72 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#ifdef PADDLE_WITH_NGRAPH + +#include +#include +#include +#include +#include + +#include "paddle/fluid/framework/attribute.h" +#include "paddle/fluid/framework/framework.pb.h" +#include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/ngraph_bridge.h" +#include "paddle/fluid/framework/op_info.h" +#include "paddle/fluid/framework/op_kernel_type.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/program_desc.h" +#include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/platform/variant.h" + +#include "ngraph/ngraph.hpp" + +namespace paddle { +namespace framework { + +class FusedOperator : public OperatorBase { + public: + static std::vector< + std::vector<std::vector<std::unique_ptr<OperatorBase>>::iterator>> + FusedOpIntervals( + std::vector<std::unique_ptr<paddle::framework::OperatorBase>>* ops); + + explicit FusedOperator( + const ProgramDesc& prog, size_t block_id, + std::vector<std::unique_ptr<OperatorBase>>::iterator start, + std::vector<std::unique_ptr<OperatorBase>>::iterator end, + const std::string& type = "fused_op", const VariableNameMap& inputs = {}, + const VariableNameMap& outputs = {}, const AttributeMap& attrs = {}); + + void RunImpl(const Scope& scope, const platform::Place& place) const final; + + private: + const ProgramDesc pdesc_; + size_t block_; + std::vector<std::shared_ptr<OperatorBase>> fused_ops_; + std::unordered_map<std::string, ngraph::element::Type> var_type_map_; + std::unordered_set<std::string> persistables_; + std::unordered_set<std::string> fetches_; + std::unordered_set<std::string> post_op_inputs_; + bool is_full_ = false; + + void Process(); +}; +} // namespace framework +} // namespace paddle +#endif diff --git a/paddle/fluid/framework/op_desc.cc b/paddle/fluid/framework/op_desc.cc index b29ac44699463312a1fdcea55e003daa75997302..362cda3f2329bef1abaa93b4529e506d41f07606 100644 --- a/paddle/fluid/framework/op_desc.cc +++ b/paddle/fluid/framework/op_desc.cc @@ -82,13 +82,9 @@ class CompileTimeInferShapeContext : public InferShapeContext { auto *in_var = block_.FindVarRecursive(Inputs(in)[i]); auto *out_var = block_.FindVarRecursive(Outputs(out)[j]); if (in_var->GetType() != proto::VarType::LOD_TENSOR) { - VLOG(3) << "input " << in << " is not LodTensor"; + VLOG(30) << "input " << in << " is not LodTensor"; return; } - PADDLE_ENFORCE_EQ(in_var->GetType(), proto::VarType::LOD_TENSOR, - "The %d-th output of Output(%s) must be LoDTensor.", j, - out); - out_var->SetLoDLevel(in_var->GetLoDLevel()); }
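The SetAttr change in the next hunk handles a subtlety: an empty attribute list parsed from the frontend arrives as a generic empty std::vector<int> (the variant default), so it must be re-typed to the declared proto type before boost::get of the real element type can succeed. A toy analogue of the coercion:

#include <boost/variant.hpp>
#include <cstdint>
#include <iostream>
#include <vector>

using Attr = boost::variant<std::vector<int>, std::vector<int64_t>>;

int main() {
  Attr attr = std::vector<int>{};     // empty list defaults to the INTS slot
  attr = std::vector<int64_t>();      // coerced to the declared LONGS type
  std::cout << attr.which() << "\n";  // prints 1; boost::get<std::vector<int64_t>> now works
  return 0;
}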
@@ -245,32 +241,38 @@ void OpDesc::SetAttr(const std::string &name, const Attribute &v) { const proto::OpProto::Attr &attr = GetProtoAttr(name); switch (attr.type()) { case proto::AttrType::BOOLEANS: { - VLOG(11) << "SetAttr: " << Type() << ", " << name - << " from INTS to BOOLEANS"; + VLOG(110) << "SetAttr: " << Type() << ", " << name + << " from INTS to BOOLEANS"; this->attrs_[name] = std::vector<bool>(); break; } case proto::AttrType::INTS: { - VLOG(11) << "SetAttr: " << Type() << ", " << name - << " from INTS to INTS"; + VLOG(110) << "SetAttr: " << Type() << ", " << name + << " from INTS to INTS"; this->attrs_[name] = std::vector<int>(); break; } + case proto::AttrType::LONGS: { + VLOG(110) << "SetAttr: " << Type() << ", " << name + << " from LONGS to LONGS"; + this->attrs_[name] = std::vector<int64_t>(); + break; + } case proto::AttrType::FLOATS: { - VLOG(11) << "SetAttr: " << Type() << ", " << name - << " from INTS to FLOATS"; + VLOG(110) << "SetAttr: " << Type() << ", " << name + << " from INTS to FLOATS"; this->attrs_[name] = std::vector<float>(); break; } case proto::AttrType::STRINGS: { - VLOG(11) << "SetAttr: " << Type() << ", " << name - << " from INTS to STRINGS"; + VLOG(110) << "SetAttr: " << Type() << ", " << name + << " from INTS to STRINGS"; this->attrs_[name] = std::vector<std::string>(); break; } case proto::AttrType::BLOCKS: { - VLOG(11) << "SetAttr: " << Type() << ", " << name - << " from INTS to BLOCKS"; + VLOG(110) << "SetAttr: " << Type() << ", " << name + << " from INTS to BLOCKS"; this->SetBlocksAttr(name, std::vector<BlockDesc *>()); return; } @@ -423,8 +425,15 @@ struct SetAttrDescVisitor : public boost::static_visitor<void> { } VectorToRepeated(blocks_idx, attr_->mutable_blocks_idx()); } + void operator()(BlockDesc *desc) const { attr_->set_block_idx(desc->ID()); } + void operator()(int64_t v) const { attr_->set_l(v); } + + void operator()(const std::vector<int64_t> &v) const { + VectorToRepeated(v, attr_->mutable_longs()); + } + void operator()(boost::blank) const { PADDLE_THROW("Unexpected branch"); } }; @@ -496,13 +505,13 @@ void OpDesc::CheckAttrs() { } void OpDesc::InferShape(const BlockDesc &block) const { - VLOG(3) << "CompileTime infer shape on " << Type(); + VLOG(30) << "CompileTime infer shape on " << Type(); InitInferShapeFuncs(); auto &infer_shape = OpInfoMap::Instance().Get(this->Type()).infer_shape_; PADDLE_ENFORCE(static_cast<bool>(infer_shape), "%s's infer_shape has not been registered", this->Type()); CompileTimeInferShapeContext ctx(*this, block); - if (VLOG_IS_ON(10)) { + if (VLOG_IS_ON(100)) { std::ostringstream sout; auto inames = this->InputArgumentNames(); sout << " From ["; std::copy(inames.begin(), inames.end(), std::ostream_iterator<std::string>(sout, ", ")); sout << "] to ["; auto onames = this->OutputArgumentNames(); std::copy(onames.begin(), onames.end(), std::ostream_iterator<std::string>(sout, ", ")); sout << "]"; - VLOG(10) << sout.str(); + VLOG(100) << sout.str(); } infer_shape(&ctx); } void OpDesc::InferVarType(BlockDesc *block) const { + // There are a few places where the var type can be set. + // When a VarDesc is created, it defaults to LOD_TENSOR. + // When an output variable is created, it also defaults to LOD_TENSOR. + // We limit this to be the only place where an operator defines its + // customized var type inference. Hence, we don't do any "default" setting + // here. auto &info = OpInfoMap::Instance().Get(this->Type()); if (info.infer_var_type_) { info.infer_var_type_(*this, block); - } else { - // all output type is LoDTensor by default - VLOG(10) << this->Type() - << " has not registered InferVarType.
Set output variables to " - "LOD_TENSOR"; - for (auto &out_pair : this->outputs_) { - for (auto &out_var_name : out_pair.second) { - block->FindRecursiveOrCreateVar(out_var_name) - .SetType(proto::VarType::LOD_TENSOR); - } - } } } @@ -610,7 +613,7 @@ DDim CompileTimeInferShapeContext::GetDim(const std::string &name) const { auto shape = var->GetShape(); res = shape.empty() ? make_ddim({0UL}) : make_ddim(shape); } catch (...) { - VLOG(5) << "GetDim of variable " << name << " error"; + VLOG(50) << "GetDim of variable " << name << " error"; std::rethrow_exception(std::current_exception()); } return res; @@ -627,7 +630,7 @@ std::vector CompileTimeInferShapeContext::GetRepeatedDims( res.push_back(s.empty() ? make_ddim({0UL}) : make_ddim(s)); } } catch (...) { - VLOG(5) << "GetRepeatedDim of variable " << name << " error."; + VLOG(50) << "GetRepeatedDim of variable " << name << " error."; std::rethrow_exception(std::current_exception()); } return res; diff --git a/paddle/fluid/framework/op_desc.h b/paddle/fluid/framework/op_desc.h index b4205aba83e774fb9c08193124adb93935c00157..30c8a26c3d2f0068674aa70b4ff875a2f73c1dca 100644 --- a/paddle/fluid/framework/op_desc.h +++ b/paddle/fluid/framework/op_desc.h @@ -100,16 +100,6 @@ class OpDesc { std::vector InputNames() const { return MapKeys(inputs_); } std::vector OutputNames() const { return MapKeys(outputs_); } - void SetInputMap(const VariableNameMap &input) { - this->inputs_ = input; - this->need_update_ = true; - } - - void SetOutputMap(const VariableNameMap &output) { - this->outputs_ = output; - this->need_update_ = true; - } - const VariableNameMap &Inputs() const { return inputs_; } const VariableNameMap &Outputs() const { return outputs_; } @@ -131,10 +121,6 @@ class OpDesc { BlockDesc *Block() { return this->block_; } - const BlockDesc &BlockRef() const { return *this->block_; } - - void SetBlock(BlockDesc *block) { this->block_ = block; } - private: template static std::vector MapKeys(const MapType &map) { diff --git a/paddle/fluid/framework/op_kernel_type.h b/paddle/fluid/framework/op_kernel_type.h index c59b232191c49ccb47bb9f51dcaf2fd9280fae19..ac0330218973123771367ed5ba9477c90143a043 100644 --- a/paddle/fluid/framework/op_kernel_type.h +++ b/paddle/fluid/framework/op_kernel_type.h @@ -63,6 +63,8 @@ struct OpKernelType { place_(dev_ctx.GetPlace()), library_type_(library_type) {} + size_t hash_key() const { return Hash()(*this); } + bool operator==(const OpKernelType& o) const { return platform::places_are_same_class(place_, o.place_) && data_type_ == o.data_type_ && data_layout_ == o.data_layout_ && diff --git a/paddle/fluid/framework/op_proto_maker.cc b/paddle/fluid/framework/op_proto_maker.cc index df2a7a27ca4a6011b214202ac9bf4f30dc482ece..ca31303f77c4a30eb64c43404e214779ea78aeaf 100644 --- a/paddle/fluid/framework/op_proto_maker.cc +++ b/paddle/fluid/framework/op_proto_maker.cc @@ -21,7 +21,6 @@ namespace framework { void OpProtoAndCheckerMaker::Validate() { validated_ = true; CheckNoDuplicatedInOutAttrs(); - CheckReuseVars(); } OpProtoAndCheckerMaker::VariableBuilder OpProtoAndCheckerMaker::AddInput( @@ -40,40 +39,6 @@ OpProtoAndCheckerMaker::VariableBuilder OpProtoAndCheckerMaker::AddOutput( return OpProtoAndCheckerMaker::VariableBuilder{output}; } -void OpProtoAndCheckerMaker::Reuse(const std::string& name, - const std::string& reused_name) { - bool found = false; - proto::OpProto::Var* var; - - for (auto& var : proto_->inputs()) { - if (var.name() == reused_name) { - found = true; - break; - } - } - PADDLE_ENFORCE(found == true, - 
"Input/Output name: %s reused_name: %s, one of them is not " - "exists or not matched.", - name, reused_name); - - found = false; - for (int i = 0; i < proto_->outputs().size(); ++i) { - var = proto_->mutable_outputs()->Mutable(i); - if (var->name() == name) { - PADDLE_ENFORCE(!var->has_reuse(), - "Output(%s) has been set reused var of %s", name, - var->reuse()); - found = true; - var->set_reuse(reused_name); - break; - } - } - PADDLE_ENFORCE(found == true, - "Input/Output name: %s reused_name: %s, one of them is not " - "exists or not matched.", - name, reused_name); -} - void OpProtoAndCheckerMaker::CheckNoDuplicatedInOutAttrs() { std::unordered_set names; auto checker = [&](const std::string& name) { @@ -91,24 +56,6 @@ void OpProtoAndCheckerMaker::CheckNoDuplicatedInOutAttrs() { } } -void OpProtoAndCheckerMaker::CheckReuseVars() { - std::unordered_set names; - for (auto& input : proto_->inputs()) { - names.insert(input.name()); - } - auto checker = [&](const std::string& name, const std::string& reused) { - PADDLE_ENFORCE( - names.count(reused), - "Output [%s] reuse Input [%s], but the input is not registered.", name, - reused); - }; - for (auto& output : proto_->outputs()) { - if (output.has_reuse()) { - checker(output.name(), output.reuse()); - } - } -} - void OpProtoAndCheckerMaker::operator()(proto::OpProto* proto, OpAttrChecker* attr_checker) { proto_ = proto; @@ -124,6 +71,8 @@ void OpProtoAndCheckerMaker::operator()(proto::OpProto* proto, static_cast(OpRole::kLoss) | static_cast(OpRole::kForward), static_cast(OpRole::kLoss) | static_cast(OpRole::kBackward), + static_cast(OpRole::kOptimize) | + static_cast(OpRole::kLRSched), static_cast(OpRole::kNotSpecified)}) .SetDefault(static_cast(OpRole::kNotSpecified)); AddAttr>(OpRoleVarAttrName(), diff --git a/paddle/fluid/framework/op_proto_maker.h b/paddle/fluid/framework/op_proto_maker.h index 4ed3cc45d66849267ef4945a03da1db76b53e4ea..4c59c73d8779eceb267ad532aabccabbd54b0df2 100644 --- a/paddle/fluid/framework/op_proto_maker.h +++ b/paddle/fluid/framework/op_proto_maker.h @@ -14,25 +14,26 @@ limitations under the License. */ #pragma once #include -#include - #include "glog/logging.h" #include "paddle/fluid/framework/attribute.h" #include "paddle/fluid/framework/framework.pb.h" namespace paddle { namespace framework { +////////////////////////// +// Don't add more roles to make this too complicated! +////////////////////////// enum class OpRole { kForward = 0x0000, kBackward = 0x0001, kOptimize = 0x0002, // RPC role is for send/recv releated op - kRPC = 0x0003, + kRPC = 0x0004, // Dist role is for split_byref/split_selected_rows/concat // used for distributed training. - kDist = 0x0004, + kDist = 0x0008, // Tag all learning rate scheduler operators. - kLRSched = 0x0005, + kLRSched = 0x0010, kLoss = 0x0100, // The default value of op's role. 
This should be only used for unittests and @@ -73,11 +74,6 @@ class OpProtoAndCheckerMaker { var_->set_dispensable(true); return *this; } - - VariableBuilder &Reuse(const std::string &name) { - var_->set_reuse(name); - return *this; - } }; VariableBuilder AddInput(const std::string &name, const std::string &comment); @@ -85,8 +81,6 @@ class OpProtoAndCheckerMaker { VariableBuilder AddOutput(const std::string &name, const std::string &comment); - void Reuse(const std::string &name, const std::string &reused_name); - template <typename T> TypedAttrChecker<T> &AddAttr(const std::string &name, const std::string &comment, @@ -105,8 +99,6 @@ class OpProtoAndCheckerMaker { void CheckNoDuplicatedInOutAttrs(); void Validate(); - void CheckReuseVars(); - proto::OpProto *proto_; OpAttrChecker *op_checker_; bool validated_{false}; diff --git a/paddle/fluid/framework/op_proto_maker_test.cc b/paddle/fluid/framework/op_proto_maker_test.cc index b71c7b646857e11f291748c4c7c2af92b6d53231..a8030d377fdb4d4aef74b315e21792dad10fac96 100644 --- a/paddle/fluid/framework/op_proto_maker_test.cc +++ b/paddle/fluid/framework/op_proto_maker_test.cc @@ -47,120 +47,3 @@ TEST(ProtoMaker, DuplicatedInOut) { ASSERT_THROW(proto_maker(&op_proto, &op_checker), paddle::platform::EnforceNotMet); } - -class TestInplaceProtoMaker : public paddle::framework::OpProtoAndCheckerMaker { - public: - void Make() { - AddInput("X", "input of test op"); - AddOutput("XOut", "output of test op").Reuse("X"); - } -}; - -class TestInplaceProtoMaker2 - : public paddle::framework::OpProtoAndCheckerMaker { - public: - void Make() { - AddInput("X", "input of test op"); - AddOutput("XOut", "output of test op").Reuse("X"); - AddOutput("NoOut", "output of test op").Reuse("NotExists"); - } -}; - -TEST(ProtoMaker, InplaceOutput) { - paddle::framework::proto::OpProto op_proto, op_proto2; - paddle::framework::OpAttrChecker op_checker; - TestInplaceProtoMaker proto_maker; - TestInplaceProtoMaker2 proto_maker2; - - proto_maker(&op_proto, &op_checker); - - ASSERT_THROW(proto_maker2(&op_proto2, &op_checker), - paddle::platform::EnforceNotMet); -} - -// normal reuse -class TestReuseProtoMaker : public paddle::framework::OpProtoAndCheckerMaker { - public: - void Make() { - AddInput("X", "input of test op"); - AddInput("Y", "input of test op"); - AddOutput("Out", "output of test op"); - AddOutput("XOut", "output of test op"); - // avoid destructor exception. - // Validate(); - TestReuse(); - } - - virtual void TestReuse() {} -}; - -// test duplicate reuse error -class TestReuseProtoMaker2 : public TestReuseProtoMaker { - public: - void TestReuse() { - Reuse("Out", "X"); - Reuse("Out", "Y"); - } -}; - -// NotExists Input -class TestReuseProtoMaker3 : public TestReuseProtoMaker { - public: - void TestReuse() { - Reuse("Out", "NotExists"); - Reuse("XOut", "X"); - } -}; - -// NotExists Output -class TestReuseProtoMaker4 : public TestReuseProtoMaker { - public: - void TestReuse() { Reuse("NotExists", "X"); } -}; - -TEST(ProtoMaker, Reuse) { - paddle::framework::proto::OpProto op_proto; - paddle::framework::OpAttrChecker op_checker; - TestReuseProtoMaker proto_maker; - proto_maker(&op_proto, &op_checker); -} - -// NOTE(dzhwinter): -// There is a Fatal CHECK on base class destructor, which will call abort inside -// instead of -// throw an exception. If we throw an exception in Make(), we will trigger the -// CHECK and terminate the tests. -// -// I had tried to replace the default CHECK with an exception, however, it's -// still not supported by glog.
-// the details: -// https://github.com/google/glog/issues/249 -// https://github.com/facebookresearch/TensorComprehensions/issues/351 -/* -TEST(ProtoMaker, ReuseWithException) { - paddle::framework::proto::OpProto op_proto2, op_proto3, op_proto4; - paddle::framework::OpAttrChecker op_checker; - TestReuseProtoMaker2 proto_maker2; - TestReuseProtoMaker3 proto_maker3; - TestReuseProtoMaker4 proto_maker4; - EXPECT_THROW(proto_maker2(&op_proto2, &op_checker), - paddle::platform::EnforceNotMet); - - EXPECT_THROW(proto_maker3(&op_proto3, &op_checker), - paddle::platform::EnforceNotMet); - - EXPECT_THROW(proto_maker4(&op_proto4, &op_checker), - paddle::platform::EnforceNotMet); -} - -void FailureFunction() { - throw std::runtime_error("Check failed in destructor."); - // return 0; -} - -int main(int argc, char** argv) { - testing::InitGoogleTest(&argc, argv); - google::InstallFailureFunction(&FailureFunction); - return RUN_ALL_TESTS(); -} -*/ diff --git a/paddle/fluid/framework/op_registry.cc b/paddle/fluid/framework/op_registry.cc index bfc411ca2c4a483e344b368da089392d8e4a87c1..4a841bae8323f5733ba413a2c623a8147ec32f67 100644 --- a/paddle/fluid/framework/op_registry.cc +++ b/paddle/fluid/framework/op_registry.cc @@ -46,9 +46,9 @@ static VariableNameMap ConvertOpDescVarsToVarNameMap( std::unique_ptr<OperatorBase> OpRegistry::CreateOp( const proto::OpDesc& op_desc) { - VLOG(1) << "CreateOp directly from OpDesc is deprecated. It should only be" - "used in unit tests. Use CreateOp(const OpDesc& op_desc) " - "instead."; + VLOG(10) << "CreateOp directly from OpDesc is deprecated. It should only be" + "used in unit tests. Use CreateOp(const OpDesc& op_desc) " + "instead."; VariableNameMap inputs = ConvertOpDescVarsToVarNameMap(op_desc.inputs()); VariableNameMap outputs = ConvertOpDescVarsToVarNameMap(op_desc.outputs()); AttributeMap attrs; diff --git a/paddle/fluid/framework/op_registry.h b/paddle/fluid/framework/op_registry.h index ef2eb334a4e7f3f482ba6d62d3f325f109c69302..0e6e74293c30d5f8caa58fe6bfa63657d2669b46 100644 --- a/paddle/fluid/framework/op_registry.h +++ b/paddle/fluid/framework/op_registry.h @@ -23,11 +23,6 @@ limitations under the License. */ #include #include -#if defined(_WIN32) -#define GLOG_NO_ABBREVIATED_SEVERITIES // msvc conflict logging with windows.h -#define GOOGLE_GLOG_DLL_DECL -#endif - #include "glog/logging.h" // For VLOG() #include "paddle/fluid/framework/attribute.h" #include "paddle/fluid/framework/details/op_registry.h" diff --git a/paddle/fluid/framework/operator.cc b/paddle/fluid/framework/operator.cc index 14fcde2fe3b1c3acfc0994e9cd37a784c57826d7..1d386a608ae1d98fd47b95fb489662f7edd9d1fb 100644 --- a/paddle/fluid/framework/operator.cc +++ b/paddle/fluid/framework/operator.cc @@ -11,8 +11,6 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#define GLOG_NO_ABBREVIATED_SEVERITIES -#define GOOGLE_GLOG_DLL_DECL #include #include @@ -24,6 +22,7 @@ limitations under the License.
*/ #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/operator.h" #include "paddle/fluid/framework/shape_inference.h" +#include "paddle/fluid/framework/transfer_scope_cache.h" #include "paddle/fluid/framework/var_type.h" #include "paddle/fluid/platform/profiler.h" @@ -140,7 +139,7 @@ static LoD GetLoD(const Scope& scope, const std::string& name) { } void OperatorBase::Run(const Scope& scope, const platform::Place& place) { - VLOG(4) << place << " " << DebugStringEx(&scope); + VLOG(40) << place << " " << DebugStringEx(&scope); if (platform::is_gpu_place(place)) { #ifndef PADDLE_WITH_CUDA PADDLE_THROW("Cannot run operator on place %s", place); @@ -160,7 +159,7 @@ void OperatorBase::Run(const Scope& scope, const platform::Place& place) { } else { RunImpl(scope, place); } - VLOG(3) << place << " " << DebugStringEx(&scope); + VLOG(30) << place << " " << DebugStringEx(&scope); } bool OperatorBase::HasInputs(const std::string& name) const { @@ -259,6 +258,8 @@ std::string OperatorBase::DebugStringEx(const Scope* scope) const { if (row_size >= 0) { ss << "[row_size=" << row_size << "]"; } + std::string dtype = GetDtype(*scope, output.second[i]); + ss << ":" << dtype; ss << "[" << GetDims(*scope, var_name, true) << "]"; ss << "(" << GetLoD(*scope, var_name) << ")"; } @@ -354,22 +355,22 @@ void OperatorBase::GenerateTemporaryNames() { } } -static bool VarIsTensor(const Variable* var) { - return var->IsType() || var->IsType(); +static bool VarIsTensor(const Variable& var) { + return var.IsType() || var.IsType(); } -static const Tensor* GetTensorFromVar(Variable* var) { - if (var->IsType()) { - return var->GetMutable(); - } else if (var->IsType()) { - return var->GetMutable()->mutable_value(); +const Tensor* GetLoDTensorOrSelectedRowsValueFromVar(const Variable& var) { + if (var.IsType()) { + return static_cast(&(var.Get())); + } else if (var.IsType()) { + return &(var.Get().value()); } else { PADDLE_THROW("Variable type_id %s, expect LoDTensor/SelectedRows.", - var->Type().name()); + var.Type().name()); } } -static Tensor* GetMutableTensorFromVar(Variable* var) { +Tensor* GetMutableLoDTensorOrSelectedRowsValueFromVar(Variable* var) { if (var->IsType()) { return var->GetMutable(); } else if (var->IsType()) { @@ -414,9 +415,7 @@ bool ExecutionContext::HasOutput(const std::string& name) const { template <> const Tensor* ExecutionContext::Input(const std::string& name) const { - auto* var = InputVar(name); - return var == nullptr ? nullptr - : GetTensorFromVar(const_cast(var)); + return Input(name); } template <> @@ -426,17 +425,21 @@ const std::vector ExecutionContext::MultiInput( std::vector res; res.reserve(names.size()); std::transform(names.begin(), names.end(), std::back_inserter(res), - [&](const std::string& sub_name) { + [&](const std::string& sub_name) -> const Tensor* { auto var = scope_.FindVar(sub_name); - return var == nullptr ? nullptr : GetTensorFromVar(var); + if (var == nullptr) return nullptr; + PADDLE_ENFORCE( + var->IsType(), + "%s should be LoDTensor, but the received type is %s", + sub_name, var->Type().name()); + return &(var->Get()); }); return res; } template <> Tensor* ExecutionContext::Output(const std::string& name) const { - auto var = OutputVar(name); - return var == nullptr ? 
nullptr : GetMutableTensorFromVar(var); + return Output(name); } template <> @@ -446,10 +449,14 @@ std::vector ExecutionContext::MultiOutput( std::vector res; res.reserve(names.size()); std::transform(names.begin(), names.end(), std::back_inserter(res), - [&](const std::string& sub_name) { + [&](const std::string& sub_name) -> Tensor* { auto var = scope_.FindVar(sub_name); - return var == nullptr ? nullptr - : GetMutableTensorFromVar(var); + if (var == nullptr) return nullptr; + PADDLE_ENFORCE( + var->IsType(), + "%s should be LoDTensor, but the received type is %s", + sub_name, var->Type().name()); + return var->GetMutable(); }); return res; } @@ -709,14 +716,14 @@ void OperatorWithKernel::RunImpl(const Scope& scope, auto expected_kernel_key = this->GetExpectedKernelType(ExecutionContext(*this, scope, *dev_ctx)); - VLOG(3) << "expected_kernel_key:" << expected_kernel_key; + VLOG(30) << "expected_kernel_key:" << expected_kernel_key; auto kernel_iter = kernels.find(expected_kernel_key); #ifdef PADDLE_WITH_MKLDNN // workaround for missing MKLDNN kernel when FLAGS_use_mkldnn env var is set if (kernel_iter == kernels.end() && expected_kernel_key.library_type_ == LibraryType::kMKLDNN) { - VLOG(3) << "missing MKLDNN kernel: fallbacking to PLAIN one"; + VLOG(30) << "missing MKLDNN kernel: falling back to the PLAIN one"; expected_kernel_key.library_type_ = LibraryType::kPlain; expected_kernel_key.data_layout_ = DataLayout::kAnyLayout; kernel_iter = kernels.find(expected_kernel_key); @@ -768,10 +775,14 @@ void OperatorWithKernel::TransferInplaceVarsBack( const Scope& scope, const std::vector& inplace_vars, const Scope& transfer_scope) const { for (auto& var_name : inplace_vars) { - VLOG(3) << "share inplace var " + var_name + " back to it's original scope"; - auto* original_tensor = GetMutableTensorFromVar(scope.FindVar(var_name)); - auto* transformed_tensor = - GetTensorFromVar(transfer_scope.FindVar(var_name)); + VLOG(30) << "share inplace var " + var_name + + " back to its original scope"; + auto* original_tensor = + GetMutableLoDTensorOrSelectedRowsValueFromVar(scope.FindVar(var_name)); + auto* var = transfer_scope.FindVar(var_name); + PADDLE_ENFORCE(var != nullptr, "The var[%s] should not be nullptr", + var_name); + auto* transformed_tensor = GetLoDTensorOrSelectedRowsValueFromVar(*var); original_tensor->ShareDataWith(*transformed_tensor); } } @@ -784,11 +795,11 @@ Scope* OperatorWithKernel::TryTransferData( for (auto& var_name : var_name_item.second) { auto* var = scope.FindVar(var_name); // Only tensors can be transferred to another device. - if (var == nullptr || !VarIsTensor(var)) { + if (var == nullptr || !VarIsTensor(*var)) { continue; } - auto* tensor_in = GetTensorFromVar(var); + auto* tensor_in = GetLoDTensorOrSelectedRowsValueFromVar(*var); if (!tensor_in->IsInitialized()) { continue; } @@ -806,14 +817,31 @@ Scope* OperatorWithKernel::TryTransferData( transfered_inplace_vars->emplace_back(var_name); } - VLOG(3) << "Transform Variable " << var_name << " from " - << kernel_type_for_var << " to " << expected_kernel_key; - - if (new_scope == nullptr) { + VLOG(30) << "Transform Variable " << var_name << " from " + << kernel_type_for_var << " to " << expected_kernel_key; + + // In the inference scenario, the scopes will be reused across the + // batches, so the `new_scope` here will result in GPU memory explosion + // over the running of operators. 
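+ // (That is, without caching, every batch would create a fresh transfer + // scope whose transferred tensors are only freed when the parent scope is + // destroyed, so memory would keep growing batch after batch.)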
+ // We use a thread_local cache to fix that issue, the key in the cache is + // the combination of the `scope` argument, from_kernel_type, + // target_kernel_type. + // Have a discussion with @Superjomn or the inference developers before + // changing this logic, since such changes might not be tested in the other + // scenarios. + // If this op is not called by an Executor or ParallelExecutor, it should + // be called by a NaiveExecutor; the NaiveExecutor will cache the scopes and + // variables, whose behavior is a lot different. + if (!run_by_executor_) { + new_scope = TryCreateTransferScope(kernel_type_for_var, + expected_kernel_key, &scope); + } + if (!new_scope) { new_scope = &scope.NewScope(); } auto* trans_var = new_scope->Var(var_name); + Tensor out; TransformData(expected_kernel_key, kernel_type_for_var, *tensor_in, &out); SetTensorToVariable(*var, out, trans_var); diff --git a/paddle/fluid/framework/operator.h b/paddle/fluid/framework/operator.h index 626b50edfd39424473be33e9f8baec5970471477..bfdfdc56b34098dd170f8ca98d27b41759c2f57b 100644 --- a/paddle/fluid/framework/operator.h +++ b/paddle/fluid/framework/operator.h @@ -20,8 +20,6 @@ limitations under the License. */ #include #include #include -#define GLOG_NO_ABBREVIATED_SEVERITIES -#define GOOGLE_GLOG_DLL_DECL #include "glog/logging.h" // For VLOG #include "paddle/fluid/framework/attribute.h" @@ -54,6 +52,9 @@ constexpr char kGradVarSuffix[] = "@GRAD"; /// Variables with this suffix are supposed to be filled up with zeros. constexpr char kZeroVarSuffix[] = "@ZERO"; +/// Variables with this suffix are the new Gradient. +constexpr char kNewGradSuffix[] = "@NEWGRAD@"; + // define some kernel priority /* Define multiple kernel type fallback order*/ extern std::vector> kKernelPriority; @@ -63,6 +64,8 @@ inline std::string GradVarName(const std::string& var_name) { } proto::VarType::Type GetDataTypeOfVar(const Variable* var); +const Tensor* GetLoDTensorOrSelectedRowsValueFromVar(const Variable& var); +Tensor* GetMutableLoDTensorOrSelectedRowsValueFromVar(Variable* var); class OperatorBase; class ExecutionContext; @@ -95,6 +98,7 @@ class OperatorBase { const std::string& Type() const { return type_; } + bool HasAttr(const std::string& name) const { return attrs_.count(name); } template inline const T& Attr(const std::string& name) const { PADDLE_ENFORCE(attrs_.count(name) != 0, "%s should be in AttributeMap", @@ -123,6 +127,8 @@ class OperatorBase { //! Get all outputs variable names virtual std::vector OutputVars(bool has_intermediate) const; + void SetIsCalledByExecutor(bool x) { run_by_executor_ = x; } + protected: std::string type_; // NOTE: in case of OpGrad, inputs_ contains: @@ -135,6 +141,8 @@ class OperatorBase { // IG (Inputs Gradients) VariableNameMap outputs_; AttributeMap attrs_; + // Whether this operator executes in an Executor. + bool run_by_executor_{true}; private: void GenerateTemporaryNames(); @@ -223,7 +231,7 @@ class ExecutionContext { std::vector res; res.reserve(names.size()); std::transform(names.begin(), names.end(), std::back_inserter(res), - [&](const std::string& sub_name) { + [&](const std::string& sub_name) -> const T* { auto var = scope_.FindVar(sub_name); return var == nullptr ? nullptr : &var->Get(); }); @@ -236,7 +244,7 @@ class ExecutionContext { std::vector res; res.reserve(names.size()); std::transform(names.begin(), names.end(), std::back_inserter(res), - [&](const std::string& sub_name) { + [&](const std::string& sub_name) -> T* { auto var = scope_.FindVar(sub_name); return var == nullptr ? 
nullptr : var->GetMutable(); }); diff --git a/paddle/fluid/framework/parallel_executor.cc b/paddle/fluid/framework/parallel_executor.cc index e8adabd26540754d5b9206294eeeed79757220bf..2c6e337568306502fbaa362015e51f81efc0a5ff 100644 --- a/paddle/fluid/framework/parallel_executor.cc +++ b/paddle/fluid/framework/parallel_executor.cc @@ -38,9 +38,20 @@ class ParallelExecutorPrivate { explicit ParallelExecutorPrivate(const std::vector &places) : places_(places) {} + ~ParallelExecutorPrivate() { + if (own_local_scope_) { + for (size_t i = 1; i < local_scopes_.size(); ++i) { + // Skip the first scope, since it is the global scope. + Scope *local_scope = local_scopes_[i]; + if (global_scope_->HasKid(local_scope)) { + global_scope_->DeleteScope(local_scope); + } + } + } + } std::vector places_; std::vector local_scopes_; - Scope *global_scope_; + Scope *global_scope_; // not owned std::unique_ptr executor_; #ifdef PADDLE_WITH_CUDA @@ -109,18 +120,9 @@ ParallelExecutor::ParallelExecutor( if (member_->local_scopes_.size() != 1 && local_scopes.empty()) { BCastParamsToDevices(bcast_vars); } - // Startup Program has been run. All local scopes has correct parameters. +// Startup Program has been run. All local scopes have correct parameters. - // Step 2. Create vars in each scope; - std::vector var_infos; - for (auto *var : main_program.Block(0).AllVars()) { - var_infos.emplace_back(); - var_infos.back().name_ = var->Name(); - var_infos.back().type_ = var->GetType(); - var_infos.back().persistable_ = var->Persistable(); - } - -// Step 3. Convert main_program to SSA form and dependency graph. Also, insert +// Step 2. Convert main_program to SSA form and dependency graph. Also, insert // ncclOp #ifdef PADDLE_WITH_CUDA std::unique_ptr graph = build_strategy.Apply( @@ -156,11 +158,29 @@ ParallelExecutor::ParallelExecutor( params, member_->local_scopes_, member_->use_cuda_); #endif - if (VLOG_IS_ON(5)) { - // If the loss_var_name is given, the number of graph should be only one. - if (loss_var_name.size()) { - PADDLE_ENFORCE_EQ(ir::GraphNum(*graph), 1, - "The number of graph should be only one"); + // Step 3. Create vars in each scope. Passes may also create new vars. + // skip control vars and empty vars + std::vector var_infos; + for (auto &node : graph->Nodes()) { + if (node->IsVar() && !node->IsCtrlVar() && node->Var()) { + var_infos.emplace_back(); + var_infos.back().name_ = node->Var()->Name(); + var_infos.back().type_ = node->Var()->GetType(); + var_infos.back().persistable_ = node->Var()->Persistable(); + } + } + // If the loss_var_name is given, the number of graphs should be only one. + if (loss_var_name.size()) { + size_t graph_num = ir::GraphNum(*graph); + if (graph_num > 1) { + LOG(WARNING) + << "The number of graphs should be only one, " "but the current graph has " << ir::GraphNum(*graph) << " sub_graphs. If you want to see the nodes of the " "sub_graphs, you should use 'FLAGS_print_sub_graph_dir' " "to specify the output dir. 
NOTES: if you are not doing training, " "please don't pass loss_var_name."; } } @@ -187,6 +207,10 @@ void ParallelExecutor::BCastParamsToDevices( } auto &main_tensor = main_var->Get(); + if (!main_tensor.IsInitialized()) { + VLOG(30) << "one input var is not initialized, skip it!"; + continue; + } auto &dims = main_tensor.dims(); if (paddle::platform::is_gpu_place(main_tensor.place())) { #ifdef PADDLE_WITH_CUDA @@ -299,15 +323,9 @@ void ParallelExecutor::FeedAndSplitTensorIntoLocalScopes( } ParallelExecutor::~ParallelExecutor() { - if (member_->own_local_scope_) { - for (size_t i = 1; i < member_->local_scopes_.size(); ++i) { - Scope *local_scope = member_->local_scopes_[i]; - if (member_->global_scope_->HasKid(local_scope)) { - member_->global_scope_->DeleteScope(local_scope); - } - } + for (auto &p : member_->places_) { + platform::DeviceContextPool::Instance().Get(p)->Wait(); } - // member_ must be destructed before gcs_ since the destructor of // ReferenceCountOpHandle uses raw pointers of gcs_ inside. member_.reset(); diff --git a/paddle/fluid/framework/program_desc.cc b/paddle/fluid/framework/program_desc.cc index 589905828f7793c614c0fe12259e9ba5ab11ceac..4b9667113bc7918c1323f0213213a6ffdb7eed8e 100644 --- a/paddle/fluid/framework/program_desc.cc +++ b/paddle/fluid/framework/program_desc.cc @@ -126,7 +126,7 @@ const std::vector ProgramDesc::GetFeedTargetNames() { std::vector feed_target_names; for (auto *op : global_block.AllOps()) { if (op->Type() == kFeedOpType) { - int col = boost::get(op->GetAttr("col")); + size_t col = boost::get(op->GetAttr("col")); if (col >= feed_target_names.size()) { feed_target_names.resize(col + 1); } @@ -143,7 +143,7 @@ const std::vector ProgramDesc::GetFetchTargetNames() { std::vector fetch_target_names; for (auto *op : global_block.AllOps()) { if (op->Type() == kFetchOpType) { - int col = boost::get(op->GetAttr("col")); + size_t col = boost::get(op->GetAttr("col")); if (col >= fetch_target_names.size()) { fetch_target_names.resize(col + 1); } diff --git a/paddle/fluid/framework/program_desc_test.cc b/paddle/fluid/framework/program_desc_test.cc index 7e689a37da8a16bd9b1ac6650b9322d2eb5a2c85..48bde2785e6a51afc0d2905ac31fe20a3c3019b6 100644 --- a/paddle/fluid/framework/program_desc_test.cc +++ b/paddle/fluid/framework/program_desc_test.cc @@ -103,7 +103,7 @@ TEST(ProgramDesc, copy_ctor) { ASSERT_EQ(1, op->GetBlockAttrId("sub_block")); found_sub_block = true; - ASSERT_EQ(2, op->GetBlocksAttrIds("sub_blocks").size()); + ASSERT_EQ(2UL, op->GetBlocksAttrIds("sub_blocks").size()); found_sub_blocks = true; } } diff --git a/paddle/fluid/framework/reader_test.cc b/paddle/fluid/framework/reader_test.cc index f0d07cb7c1367576084b9494e7758103bb45d1e5..d812417a38200bcfdbdeac78800190647510a144 100644 --- a/paddle/fluid/framework/reader_test.cc +++ b/paddle/fluid/framework/reader_test.cc @@ -39,8 +39,8 @@ TEST(READER, decorate_chain) { { auto endpoints = root->GetEndPoints(); ASSERT_EQ(endpoints.size(), 2U); - ASSERT_NE(endpoints.count(end_point1.get()), 0); - ASSERT_NE(endpoints.count(end_point2.get()), 0); + ASSERT_NE(endpoints.count(end_point1.get()), 0UL); + ASSERT_NE(endpoints.count(end_point2.get()), 0UL); } { diff --git a/paddle/fluid/framework/scope.cc b/paddle/fluid/framework/scope.cc index a4abd1b1283f08fb8431fbeea0cea17c8439fdd7..26cb7d51a88afac15322eecad965912097d19a45 100644 --- a/paddle/fluid/framework/scope.cc +++ b/paddle/fluid/framework/scope.cc @@ -15,7 +15,9 @@ limitations under the License. 
*/ #include "paddle/fluid/framework/scope.h" #include // for unique_ptr +#include #include +#include #include "glog/logging.h" #include "paddle/fluid/framework/threadpool.h" #include "paddle/fluid/string/printf.h" @@ -36,6 +38,16 @@ DEFINE_double( "Memory size threshold (GB) when the garbage collector clear tensors." "Disabled when this value is less than 0"); +// When in inference scenario, the scopes will not be written by two threads in +// a mean time, but a scope may be read by multiple threads concurrently, and +// the mutex will cause serious performance issue. +// So the mutex is disabled when `ON_INFER`. +#ifdef PADDLE_ON_INFERENCE +#define SCOPE_LOCK_GUARD +#else +#define SCOPE_LOCK_GUARD std::lock_guard lock(mutex_); +#endif + namespace paddle { namespace framework { @@ -49,18 +61,18 @@ int64_t GetEagerDeletionThreshold() { Scope::~Scope() { DropKids(); } Scope& Scope::NewScope() const { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD kids_.push_back(new Scope(this)); return *kids_.back(); } Variable* Scope::Var(const std::string& name) { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD return VarInternal(name); } Variable* Scope::Var(std::string* name) { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD auto new_name = string::Sprintf("%p.%d", this, vars_.size()); if (name != nullptr) { *name = new_name; @@ -69,34 +81,34 @@ Variable* Scope::Var(std::string* name) { } Variable* Scope::FindVar(const std::string& name) const { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD return FindVarInternal(name); } Variable* Scope::FindLocalVar(const std::string& name) const { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD return FindVarLocally(name); } const Scope* Scope::FindScope(const Variable* var) const { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD return FindScopeInternal(var); } void Scope::DropKids() { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD for (Scope* s : kids_) delete s; kids_.clear(); } bool Scope::HasKid(const Scope* scope) const { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD auto it = std::find(this->kids_.begin(), this->kids_.end(), scope); return it != this->kids_.end(); } std::vector Scope::LocalVarNames() const { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD std::vector known_vars; known_vars.reserve(this->vars_.size()); for (auto& p : vars_) { @@ -106,9 +118,10 @@ std::vector Scope::LocalVarNames() const { } void Scope::DeleteScope(Scope* scope) const { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD auto it = std::find(this->kids_.begin(), this->kids_.end(), scope); - PADDLE_ENFORCE(it != this->kids_.end(), "Cannot find %p as kid scope", scope); + PADDLE_ENFORCE(it != this->kids_.end(), "%p Cannot find %p as kid scope", + this, scope); this->kids_.erase(it); // When making memory benchmark on Fluid, we have to delete scope sync. 
if (FLAGS_benchmark || FLAGS_eager_delete_scope) { @@ -119,7 +132,7 @@ void Scope::DeleteScope(Scope* scope) const { } void Scope::EraseVars(const std::vector& var_names) { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD std::set var_set(var_names.begin(), var_names.end()); for (auto it = vars_.begin(); it != vars_.end();) { if (var_set.find(it->first) != var_set.end()) { @@ -132,12 +145,12 @@ void Scope::EraseVars(const std::vector& var_names) { void Scope::Rename(const std::string& origin_name, const std::string& new_name) const { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD RenameInternal(origin_name, new_name); } std::string Scope::Rename(const std::string& origin_name) const { - std::lock_guard lock(mutex_); + SCOPE_LOCK_GUARD auto new_name = string::Sprintf("%p.%d", this, vars_.size()); RenameInternal(origin_name, new_name); return new_name; @@ -149,7 +162,7 @@ Variable* Scope::VarInternal(const std::string& name) { v = new Variable(); vars_[name].reset(v); - VLOG(3) << "Create variable " << name; + VLOG(30) << "Create variable " << name; v->name_ = &(vars_.find(name)->first); return v; } @@ -189,5 +202,46 @@ Variable* Scope::FindVarLocally(const std::string& name) const { return nullptr; } +std::string GenScopeTreeDebugInfo(Scope* root) { + std::stringstream os; + + if (!root) return ""; + + // level traversal + std::queue queue; + queue.push(root); + + std::vector scopes; + + while (!queue.empty()) { + auto* end = queue.back(); + Scope* q = nullptr; + while (q != end) { + q = queue.front(); + queue.pop(); + os << q << " "; + scopes.push_back(q); + + for (auto* c : q->kids()) { + queue.push(c); + } + } + // end of a level + os << "\n------------------------------------------\n"; + } + + os << "\nDetails:\n\n"; + + for (Scope* q : scopes) { + os << "====\n"; + os << q << ":\n"; + for (auto& var : q->LocalVarNames()) { + os << " - " << var << "\n"; + } + } + + return os.str(); +} + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/scope.h b/paddle/fluid/framework/scope.h index 14f9f36812d690fc4a7440f2e7e6a85e9993a535..1901ffbe57e0d85193c3a218f06eba06a0f287a5 100644 --- a/paddle/fluid/framework/scope.h +++ b/paddle/fluid/framework/scope.h @@ -81,6 +81,8 @@ class Scope { /// Find if a scope exists in the kid scopes bool HasKid(const Scope* scope) const; + const std::list& kids() const { return kids_; } + // enumerate all the variables the current scope contains. std::vector LocalVarNames() const; @@ -116,12 +118,17 @@ class Scope { // Scopes in `kids_` are owned by this class. mutable std::list kids_; - Scope const* parent_{nullptr}; + const Scope* parent_{nullptr}; DISABLE_COPY_AND_ASSIGN(Scope); private: mutable std::mutex mutex_; }; + +// Generate some debug string about the inheritance structure of scope, quite +// naive. 
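+// For illustration: the debug string built above prints the scope addresses +// level by level, each level closed by a dashed rule, followed by a +// "Details:" section listing every scope's local variable names.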
+std::string GenScopeTreeDebugInfo(Scope*); + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/framework/selected_rows.cc b/paddle/fluid/framework/selected_rows.cc index 8c290bb095d554a973e66a3a19606a06759fd668..f4f2b769d5e47d8fba8d08476df4cd8e54133551 100644 --- a/paddle/fluid/framework/selected_rows.cc +++ b/paddle/fluid/framework/selected_rows.cc @@ -63,6 +63,26 @@ struct TensorCopyVisitor { int64_t size_; }; +struct TensorFillVisitor { + TensorFillVisitor(framework::Tensor* dst, int64_t dst_offset, int64_t size, + float value) + : dst_(dst), dst_offset_(dst_offset), size_(size) {} + + template + void apply() const { + // TODO(qiao): support other place + platform::CPUPlace cpu; + auto* tensor_data = dst_->mutable_data(cpu); + auto* start = tensor_data + dst_offset_; + auto* end = start + size_; + std::fill(start, end, static_cast(0.0)); + } + + framework::Tensor* dst_; + int64_t dst_offset_; + int64_t size_; +}; + void SerializeToStream(std::ostream& os, const SelectedRows& selected_rows, const platform::DeviceContext& dev_ctx) { { // the 1st field, uint32_t version @@ -120,7 +140,17 @@ bool SelectedRows::HasKey(int64_t key) const { : true; } -int64_t SelectedRows::AutoGrownIndex(int64_t key, bool auto_grown) { +int64_t SelectedRows::AutoGrownIndex(int64_t key, bool auto_grown, + bool is_test) { + if (is_test) { + auto iter = id_to_index_.find(key); + if (iter == id_to_index_.end()) { + return -1; + } else { + return iter->second; + } + } + rwlock_->RDLock(); auto iter = id_to_index_.find(key); if (iter == id_to_index_.end()) { @@ -172,22 +202,30 @@ void SelectedRows::SyncIndex() { } void SelectedRows::Get(const framework::Tensor& ids, framework::Tensor* value, - bool auto_grown) { + bool auto_grown, bool is_test) { PADDLE_ENFORCE(value->IsInitialized(), "The value tensor should be initialized."); if (ids.numel() == 0) { - VLOG(3) << "keys is empty, please check data!"; + VLOG(30) << "keys is empty, please check data!"; } else { int64_t value_width = value_->numel() / value_->dims()[0]; PADDLE_ENFORCE_EQ(value_width, value->numel() / value->dims()[0], "output tensor should have the same shape with table " "except the dims[0]."); for (int i = 0; i < ids.numel(); ++i) { - int64_t index = AutoGrownIndex(ids.data()[i], auto_grown); - framework::VisitDataType( - framework::ToDataType(value_->type()), - TensorCopyVisitor(value, i * value_width, *value_.get(), - index * value_width, value_width)); + auto id = ids.data()[i]; + int64_t index = AutoGrownIndex(id, auto_grown, is_test); + if (index < 0) { + VLOG(5) << "id " << id << " not in the table, return 0"; + framework::VisitDataType( + framework::ToDataType(value_->type()), + TensorFillVisitor(value, i * value_width, value_width, 0.0)); + } else { + framework::VisitDataType( + framework::ToDataType(value_->type()), + TensorCopyVisitor(value, i * value_width, *value_.get(), + index * value_width, value_width)); + } } } } diff --git a/paddle/fluid/framework/selected_rows.h b/paddle/fluid/framework/selected_rows.h index daf5e95304fb84eaba26a30c45414d5021e7ffcb..55ca02038e083da4f8984f70fecf4ca2d878088e 100644 --- a/paddle/fluid/framework/selected_rows.h +++ b/paddle/fluid/framework/selected_rows.h @@ -105,7 +105,7 @@ class SelectedRows { * the value */ void Get(const framework::Tensor& ids, framework::Tensor* value, - bool auto_grown = false); + bool auto_grown = false, bool is_test = false); /* * @brief Get the index of the key from id_to_index_ map. 
If the key not @@ -118,7 +118,7 @@ class SelectedRows { * * @return index of the key. */ - int64_t AutoGrownIndex(int64_t key, bool auto_grown); + int64_t AutoGrownIndex(int64_t key, bool auto_grown, bool is_test = false); void SyncIndex(); diff --git a/paddle/fluid/framework/selected_rows_test.cc b/paddle/fluid/framework/selected_rows_test.cc index 928e1ad8b9168e61ddc5782066a4aa29a4296a94..3b0509e0344efedf08ab21cac0a075049617ca97 100644 --- a/paddle/fluid/framework/selected_rows_test.cc +++ b/paddle/fluid/framework/selected_rows_test.cc @@ -84,14 +84,18 @@ TEST(SelectedRows, SparseTable) { data[i * embedding_width + j] = static_cast(i); } } - ASSERT_EQ(table.AutoGrownIndex(10, true), 0); - ASSERT_EQ(table.AutoGrownIndex(8, true), 1); - ASSERT_EQ(table.AutoGrownIndex(8, true), 1); - ASSERT_EQ(table.AutoGrownIndex(6, true), 2); + ASSERT_EQ(table.AutoGrownIndex(10, true, false), 0); + ASSERT_EQ(table.AutoGrownIndex(8, true, false), 1); + ASSERT_EQ(table.AutoGrownIndex(8, true, false), 1); + ASSERT_EQ(table.AutoGrownIndex(6, true, false), 2); + for (int64_t i = 11; i < 20; i++) { + ASSERT_EQ(table.AutoGrownIndex(i, true, true), -1); + ASSERT_TRUE(!table.HasKey(i)); + } ASSERT_TRUE(table.HasKey(10)); ASSERT_TRUE(table.HasKey(8)); ASSERT_TRUE(table.HasKey(6)); - ASSERT_EQ(table.rows().size(), 3); + ASSERT_EQ(table.rows().size(), 3UL); framework::Tensor ids; ids.Resize(framework::make_ddim({4})); diff --git a/paddle/fluid/framework/tensor.cc b/paddle/fluid/framework/tensor.cc index b6ba0df033af12d48e88eb57a3b97b559077250d..41566800e5781d576120ccf5dfbb3024bf4bea24 100644 --- a/paddle/fluid/framework/tensor.cc +++ b/paddle/fluid/framework/tensor.cc @@ -32,10 +32,9 @@ size_t Tensor::memory_size() const { } void* Tensor::mutable_data(platform::Place place, std::type_index type, + memory::Allocator::Attr attr, size_t requested_size) { - if (holder_ != nullptr) { - holder_->set_type(type); - } + type_ = type; PADDLE_ENFORCE_GE(numel(), 0, "When calling this method, the Tensor's numel must be " "equal or larger than zero. 
" @@ -48,35 +47,18 @@ void* Tensor::mutable_data(platform::Place place, std::type_index type, /* some versions of boost::variant don't have operator!= */ if (holder_ == nullptr || !(holder_->place() == place) || holder_->size() < size + offset_) { - if (platform::is_cpu_place(place)) { - holder_.reset(new PlaceholderImpl( - boost::get(place), size, type)); - } else if (platform::is_gpu_place(place) || - platform::is_cuda_pinned_place(place)) { -#ifndef PADDLE_WITH_CUDA - PADDLE_THROW( - "CUDAPlace or CUDAPinnedPlace is not supported in CPU-only mode."); - } -#else - if (platform::is_gpu_place(place)) { - holder_.reset(new PlaceholderImpl( - boost::get(place), size, type)); - } else if (platform::is_cuda_pinned_place(place)) { - holder_.reset(new PlaceholderImpl( - boost::get(place), size, type)); - } - } -#endif + holder_ = memory::AllocShared(place, size, attr); offset_ = 0; } return reinterpret_cast(reinterpret_cast(holder_->ptr()) + offset_); } -void* Tensor::mutable_data(platform::Place place, size_t requested_size) { +void* Tensor::mutable_data(platform::Place place, memory::Allocator::Attr attr, + size_t requested_size) { PADDLE_ENFORCE(this->holder_ != nullptr, "Cannot invoke mutable data if current hold nothing."); - return mutable_data(place, holder_->type(), requested_size); + return mutable_data(place, type_, attr, requested_size); } Tensor& Tensor::ShareDataWith(const Tensor& src) { @@ -101,6 +83,7 @@ Tensor Tensor::Slice(int begin_idx, int end_idx) const { Tensor dst; dst.holder_ = holder_; dst.set_layout(layout_); + dst.type_ = type_; DDim dst_dims = dims_; dst_dims[0] = end_idx - begin_idx; dst.Resize(dst_dims); diff --git a/paddle/fluid/framework/tensor.h b/paddle/fluid/framework/tensor.h index f1d268548578fea12082e2edb213a3749eccbfaf..71e8badd4b6b08e7d380fd45d93a33176172081d 100644 --- a/paddle/fluid/framework/tensor.h +++ b/paddle/fluid/framework/tensor.h @@ -67,12 +67,7 @@ class Tensor { friend struct EigenVector; public: - Tensor() : offset_(0) {} - - /*! Constructor with place should only be used in pybind. */ - explicit Tensor(const platform::Place& place) : offset_(0) { - holder_->set_place(place); - } + Tensor() : type_(typeid(float)), offset_(0) {} /*! Return a pointer to mutable memory block. */ template @@ -89,12 +84,17 @@ class Tensor { * @note If not exist, then allocation. */ template - T* mutable_data(platform::Place place, size_t requested_size = 0); + T* mutable_data(platform::Place place, + memory::Allocator::Attr attr = memory::Allocator::kDefault, + size_t requested_size = 0); void* mutable_data(platform::Place place, std::type_index type, + memory::Allocator::Attr attr = memory::Allocator::kDefault, size_t requested_size = 0); - void* mutable_data(platform::Place place, size_t requested_size = 0); + void* mutable_data(platform::Place place, + memory::Allocator::Attr attr = memory::Allocator::kDefault, + size_t requested_size = 0); /** * @brief Return a pointer to mutable memory block. @@ -106,7 +106,9 @@ class Tensor { * @note If not exist, then allocation. */ template - T* mutable_data(DDim dims, platform::Place place, size_t requested_size = 0); + T* mutable_data(DDim dims, platform::Place place, + memory::Allocator::Attr attr = memory::Allocator::kDefault, + size_t requested_size = 0); /*! Return the dimensions of the memory block. 
*/ const DDim& dims() const; @@ -139,7 +141,7 @@ class Tensor { std::type_index type() const { PADDLE_ENFORCE_NOT_NULL( holder_, "Tensor not initialized yet when Tensor::type() is called."); - return holder_->type(); + return type_; } // memory size returns the holding memory size in byte. @@ -153,56 +155,13 @@ class Tensor { void clear() { holder_ = nullptr; } - private: - /** - * @note Placeholder hides type T, so it doesn't appear as a template - * parameter of Variable. - */ - struct Placeholder { - virtual ~Placeholder() = default; - virtual void* ptr() const = 0; - virtual size_t size() const = 0; - virtual std::type_index type() const = 0; - virtual platform::Place place() const = 0; - virtual void set_type(std::type_index type) = 0; - virtual void set_place(platform::Place place) = 0; - }; - - template - struct PlaceholderImpl : public Placeholder { - PlaceholderImpl(Place place, size_t size, std::type_index type) - : ptr_(static_cast(memory::Alloc(place, size)), - memory::PODDeleter(place)), - place_(place), - size_(size), - type_(type) { - PADDLE_ENFORCE_NOT_NULL(ptr_, "Insufficient %s memory to allocation.", - (is_cpu_place(place_) ? "CPU" : "GPU")); - } - - virtual size_t size() const { return size_; } - virtual platform::Place place() const { return place_; } - virtual void* ptr() const { return static_cast(ptr_.get()); } - virtual std::type_index type() const { return type_; } - virtual void set_type(std::type_index type) { type_ = type; } - virtual void set_place(platform::Place place) { place_ = place; } - - /*! the pointer of memory block. */ - std::unique_ptr> ptr_; - - /*! the place of memory block. */ - platform::Place place_; - - /*! the size of memory block. */ - size_t size_; - - /* the current type of memory */ - std::type_index type_; - }; + const std::shared_ptr& Holder() const { return holder_; } + size_t offset() const { return offset_; } + private: /*! holds the memory block if allocated. */ - std::shared_ptr holder_; - + std::shared_ptr holder_; + std::type_index type_; /** * @brief points to elements dimensions. 
* diff --git a/paddle/fluid/framework/tensor_impl.h b/paddle/fluid/framework/tensor_impl.h index 6d3047c95d6cf30c2a5308d4f69ded367066d78c..0c9c0d782fc73bd8278b82bebf7fd84a4f297b94 100644 --- a/paddle/fluid/framework/tensor_impl.h +++ b/paddle/fluid/framework/tensor_impl.h @@ -23,10 +23,10 @@ namespace framework { template inline const T* Tensor::data() const { check_memory_size(); - bool valid = std::is_same::value || - holder_->type() == std::type_index(typeid(T)); + bool valid = + std::is_same::value || type_ == std::type_index(typeid(T)); PADDLE_ENFORCE(valid, "Tensor holds the wrong type, it holds %s", - this->holder_->type().name()); + type_.name()); return reinterpret_cast( reinterpret_cast(holder_->ptr()) + offset_); @@ -37,26 +37,30 @@ inline bool Tensor::IsInitialized() const { return holder_ != nullptr; } template inline T* Tensor::data() { check_memory_size(); - bool valid = std::is_same::value || - holder_->type() == std::type_index(typeid(T)); + bool valid = + std::is_same::value || type_ == std::type_index(typeid(T)); PADDLE_ENFORCE(valid, "Tensor holds the wrong type, it holds %s", - this->holder_->type().name()); + type_.name()); return reinterpret_cast(reinterpret_cast(holder_->ptr()) + offset_); } template inline T* Tensor::mutable_data(DDim dims, platform::Place place, + memory::Allocator::Attr attr, size_t requested_size) { static_assert(std::is_pod::value, "T must be POD"); Resize(dims); - return mutable_data(place, requested_size); + return mutable_data(place, attr, requested_size); } template -inline T* Tensor::mutable_data(platform::Place place, size_t requested_size) { +inline T* Tensor::mutable_data(platform::Place place, + memory::Allocator::Attr attr, + size_t requested_size) { static_assert(std::is_pod::value, "T must be POD"); - return reinterpret_cast(mutable_data(place, typeid(T), requested_size)); + return reinterpret_cast( + mutable_data(place, typeid(T), attr, requested_size)); } inline Tensor ReshapeToMatrix(const Tensor& src, int num_col_dims) { diff --git a/paddle/fluid/framework/tensor_test.cc b/paddle/fluid/framework/tensor_test.cc index cb2061c06a429d8e8116001a4aa4e8c46ea13428..a0a9a573603ceb6b577529101cb331adbc81337a 100644 --- a/paddle/fluid/framework/tensor_test.cc +++ b/paddle/fluid/framework/tensor_test.cc @@ -75,6 +75,19 @@ TEST(Tensor, MutableData) { platform::CPUPlace()); EXPECT_EQ(p1, p2); } + // Not sure if it's desired, but currently, Tensor type can be changed. 
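+ // That is, a later mutable_data call with a different element type reuses + // the existing allocation when it is large enough and reinterprets the + // stored bytes in place, as the following block checks: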
+ { + framework::Tensor src_tensor; + int8_t* p1 = src_tensor.mutable_data(framework::make_ddim({1}), + platform::CPUPlace()); + EXPECT_NE(p1, nullptr); + *p1 = 1; + + uint8_t* p2 = src_tensor.mutable_data(framework::make_ddim({1}), + platform::CPUPlace()); + EXPECT_NE(p2, nullptr); + EXPECT_EQ(static_cast(p2[0]), 1); + } #ifdef PADDLE_WITH_CUDA { diff --git a/paddle/fluid/framework/tensor_util.cc b/paddle/fluid/framework/tensor_util.cc index 69bcbc0e5891f95af4de8dfd49a25648ca920ab1..8d8f07a1f52b3062498b59a4dbc20219d42e4735 100644 --- a/paddle/fluid/framework/tensor_util.cc +++ b/paddle/fluid/framework/tensor_util.cc @@ -22,8 +22,8 @@ namespace framework { void TensorCopy(const Tensor& src, const platform::Place& dst_place, const platform::DeviceContext& ctx, Tensor* dst) { - VLOG(3) << "TensorCopy " << src.dims() << " from " << src.place() << " to " - << dst_place; + VLOG(30) << "TensorCopy " << src.dims() << " from " << src.place() << " to " + << dst_place; src.check_memory_size(); dst->Resize(src.dims()); @@ -37,8 +37,8 @@ void TensorCopy(const Tensor& src, const platform::Place& dst_place, if (platform::is_cpu_place(src_place) && platform::is_cpu_place(dst_place)) { if (src_ptr == dst_ptr) { - VLOG(3) << "Skip copy the same data async from " << src_place << " to " - << dst_place; + VLOG(30) << "Skip copy the same data async from " << src_place << " to " + << dst_place; return; } memory::Copy(boost::get(dst_place), dst_ptr, @@ -77,8 +77,8 @@ void TensorCopy(const Tensor& src, const platform::Place& dst_place, reinterpret_cast(ctx).stream(); if (platform::is_same_place(src_place, dst_place)) { if (src_ptr == dst_ptr) { - VLOG(3) << "Skip copy the same data async from " << src_place << " to " - << dst_place; + VLOG(30) << "Skip copy the same data async from " << src_place << " to " + << dst_place; return; } memory::Copy(dst_gpu_place, dst_ptr, src_gpu_place, src_ptr, size, @@ -114,8 +114,8 @@ void TensorCopy(const Tensor& src, const platform::Place& dst_place, void TensorCopySync(const Tensor& src, const platform::Place& dst_place, Tensor* dst) { - VLOG(3) << "TensorCopySync " << src.dims() << " from " << src.place() - << " to " << dst_place; + VLOG(30) << "TensorCopySync " << src.dims() << " from " << src.place() + << " to " << dst_place; src.check_memory_size(); dst->Resize(src.dims()); dst->set_layout(src.layout()); @@ -125,8 +125,8 @@ void TensorCopySync(const Tensor& src, const platform::Place& dst_place, auto size = src.numel() * SizeOfType(src.type()); if (platform::is_cpu_place(src_place) && platform::is_cpu_place(dst_place)) { if (src_ptr == dst_ptr) { - VLOG(3) << "Skip copy the same data from " << src_place << " to " - << dst_place; + VLOG(30) << "Skip copy the same data from " << src_place << " to " + << dst_place; return; } memory::Copy(boost::get(dst_place), dst_ptr, @@ -146,13 +146,19 @@ void TensorCopySync(const Tensor& src, const platform::Place& dst_place, } else if (platform::is_gpu_place(src_place) && platform::is_gpu_place(dst_place)) { if (src_ptr == dst_ptr && platform::is_same_place(src_place, dst_place)) { - VLOG(3) << "Skip copy the same data from " << src_place << " to " - << dst_place; + VLOG(30) << "Skip copy the same data from " << src_place << " to " + << dst_place; return; } auto src_gpu_place = boost::get(src_place); auto dst_gpu_place = boost::get(dst_place); memory::Copy(dst_gpu_place, dst_ptr, src_gpu_place, src_ptr, size, nullptr); + } else if (platform::is_cuda_pinned_place(src_place) && + platform::is_gpu_place(dst_place)) { + auto src_pinned_place = 
boost::get(src_place); + auto dst_gpu_place = boost::get(dst_place); + memory::Copy(dst_gpu_place, dst_ptr, src_pinned_place, src_ptr, size, + nullptr); } #endif } diff --git a/paddle/fluid/framework/tensor_util_test.cc b/paddle/fluid/framework/tensor_util_test.cc index 793ccfc79fe56707f226477b9d50b1d972ab6a59..17c55378178325b40e394f4b422c22c1c10bd130 100644 --- a/paddle/fluid/framework/tensor_util_test.cc +++ b/paddle/fluid/framework/tensor_util_test.cc @@ -379,7 +379,9 @@ TEST(Tensor, FromAndToStream) { TensorToStream(oss, gpu_tensor, gpu_ctx); std::istringstream iss(oss.str()); - TensorFromStream(iss, &dst_tensor, gpu_ctx); + TensorFromStream( + iss, &dst_tensor, + *platform::DeviceContextPool::Instance().Get(platform::CPUPlace())); int* dst_ptr = dst_tensor.mutable_data(platform::CPUPlace()); for (int i = 0; i < 6; ++i) { diff --git a/paddle/fluid/framework/threadpool.cc b/paddle/fluid/framework/threadpool.cc index 18cdca3a658a6a89e6ab637a7f38825756acfea8..2dab4e793eeacd65239786976948b8043aeeb215 100644 --- a/paddle/fluid/framework/threadpool.cc +++ b/paddle/fluid/framework/threadpool.cc @@ -25,7 +25,6 @@ DEFINE_int32(dist_threadpool_size, 0, namespace paddle { namespace framework { - std::unique_ptr ThreadPool::threadpool_(nullptr); std::once_flag ThreadPool::init_flag_; @@ -40,15 +39,14 @@ void ThreadPool::Init() { int num_threads = std::thread::hardware_concurrency(); if (FLAGS_dist_threadpool_size > 0) { num_threads = FLAGS_dist_threadpool_size; - VLOG(1) << "set dist_threadpool_size to " << num_threads; + VLOG(10) << "set dist_threadpool_size to " << num_threads; } PADDLE_ENFORCE_GT(num_threads, 0); threadpool_.reset(new ThreadPool(num_threads)); } } -ThreadPool::ThreadPool(int num_threads) - : total_threads_(num_threads), idle_threads_(num_threads), running_(true) { +ThreadPool::ThreadPool(int num_threads) : running_(true) { threads_.resize(num_threads); for (auto& thread : threads_) { // TODO(Yancey1989): binding the thread on the specify CPU number @@ -59,9 +57,10 @@ ThreadPool::ThreadPool(int num_threads) ThreadPool::~ThreadPool() { { // notify all threads to stop running + std::unique_lock l(mutex_); running_ = false; - scheduled_.notify_all(); } + scheduled_.notify_all(); for (auto& t : threads_) { t->join(); @@ -69,36 +68,30 @@ ThreadPool::~ThreadPool() { } } -void ThreadPool::Wait() { - std::unique_lock lock(mutex_); - completed_.wait(lock, [=] { return Done() == true; }); -} - void ThreadPool::TaskLoop() { - while (running_) { - std::unique_lock lock(mutex_); - scheduled_.wait(lock, [=] { return !tasks_.empty() || !running_; }); - - if (!running_) { - break; - } - // pop a task from the task queue - auto task = std::move(tasks_.front()); - tasks_.pop(); - - --idle_threads_; - lock.unlock(); - - // run the task - task(); + while (true) { + Task task; { std::unique_lock lock(mutex_); - ++idle_threads_; - if (Done()) { - completed_.notify_all(); + scheduled_.wait( + lock, [this] { return !this->tasks_.empty() || !this->running_; }); + + if (!running_ && tasks_.empty()) { + return; + } + + if (tasks_.empty()) { + PADDLE_THROW("This thread has no task to Run"); } + + // pop a task from the task queue + task = std::move(tasks_.front()); + tasks_.pop(); } + + // run the task + task(); } } diff --git a/paddle/fluid/framework/threadpool.h b/paddle/fluid/framework/threadpool.h index 94111ee335b1a5df327b3e46d62069b4735c54f6..7a51d18fbbf65f68725aa86a6a0ce4d15dff5673 100644 --- a/paddle/fluid/framework/threadpool.h +++ b/paddle/fluid/framework/threadpool.h @@ -57,17 +57,8 @@ class 
ThreadPool { ~ThreadPool(); - // Returns the number of threads created by the constructor. - size_t Threads() const { return total_threads_; } - - // Returns the number of currently idle threads. - size_t IdleThreads() { - std::unique_lock lock(mutex_); - return idle_threads_; - } - // Run pushes a function to the task queue and returns a std::future - // object. To wait for the completion of the task, call + // object. To wait for the completion of the task, call // std::future::wait(). template std::future Run(Callback fn) { @@ -78,7 +69,6 @@ class ThreadPool { template std::future> RunAndGetException( Callback fn) { - std::unique_lock lock(mutex_); Task task([fn]() -> std::unique_ptr { try { fn(); @@ -93,26 +83,20 @@ class ThreadPool { return nullptr; }); std::future> f = task.get_future(); - tasks_.push(std::move(task)); - lock.unlock(); + { + std::unique_lock lock(mutex_); + if (!running_) { + PADDLE_THROW("enqueue on stopped ThreadPool"); + } + tasks_.push(std::move(task)); + } scheduled_.notify_one(); return f; } - // Wait until all the tasks are completed. - void Wait(); - private: DISABLE_COPY_AND_ASSIGN(ThreadPool); - // If the task queue is empty and avaialbe is equal to the number of - // threads, means that all tasks are completed. Note: this function - // is not thread-safe. Returns true if all tasks are completed. - // Note: don't delete the data member total_threads_ and use - // threads_.size() instead; because you'd need to lock the mutex - // before accessing threads_. - bool Done() { return tasks_.empty() && idle_threads_ == total_threads_; } - // The constructor starts threads to run TaskLoop, which retrieves // and runs tasks from the queue. void TaskLoop(); @@ -125,14 +109,11 @@ class ThreadPool { static std::once_flag init_flag_; std::vector> threads_; - const size_t total_threads_; - size_t idle_threads_; std::queue tasks_; std::mutex mutex_; bool running_; std::condition_variable scheduled_; - std::condition_variable completed_; }; class ThreadPoolIO : ThreadPool { diff --git a/paddle/fluid/framework/threadpool_test.cc b/paddle/fluid/framework/threadpool_test.cc index 27a4ffd4fcbf293a3dea1744b29384d0bee0c137..884d61e23428a0ad758946295ca9c470767e93ef 100644 --- a/paddle/fluid/framework/threadpool_test.cc +++ b/paddle/fluid/framework/threadpool_test.cc @@ -19,10 +19,11 @@ limitations under the License. 
*/ namespace framework = paddle::framework; -void do_sum(framework::ThreadPool* pool, std::atomic* sum, int cnt) { - std::vector> fs; +void do_sum(std::vector>* fs, std::mutex* mu, + std::atomic* sum, int cnt) { for (int i = 0; i < cnt; ++i) { - fs.push_back(framework::Async([sum]() { sum->fetch_add(1); })); + std::lock_guard l(*mu); + fs->push_back(framework::Async([sum]() { sum->fetch_add(1); })); } } @@ -40,18 +41,21 @@ TEST(ThreadPool, ConcurrentInit) { } TEST(ThreadPool, ConcurrentRun) { - framework::ThreadPool* pool = framework::ThreadPool::GetInstance(); std::atomic sum(0); std::vector threads; + std::vector> fs; + std::mutex fs_mu; int n = 50; // sum = (n * (n + 1)) / 2 for (int i = 1; i <= n; ++i) { - std::thread t(do_sum, pool, &sum, i); + std::thread t(do_sum, &fs, &fs_mu, &sum, i); threads.push_back(std::move(t)); } for (auto& t : threads) { t.join(); } - pool->Wait(); + for (auto& t : fs) { + t.wait(); + } EXPECT_EQ(sum, ((n + 1) * n) / 2); } diff --git a/paddle/fluid/framework/transfer_scope_cache.cc b/paddle/fluid/framework/transfer_scope_cache.cc new file mode 100644 index 0000000000000000000000000000000000000000..f6219a14173094d15e9c60a2e26f98da1b04ec2e --- /dev/null +++ b/paddle/fluid/framework/transfer_scope_cache.cc @@ -0,0 +1,62 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/transfer_scope_cache.h" + +namespace paddle { +namespace framework { + +// Holds all the transfer scopes across the process. +std::unordered_map& global_transfer_data_cache() { + typedef std::unordered_map map_t; + thread_local std::unique_ptr x(new map_t); + return *x; +} + +// Holds all the transfer scopes for this thread. +std::unordered_set& global_transfer_scope_cache() { + typedef std::unordered_set set_t; + thread_local std::unique_ptr x(new set_t); + return *x; +} + +// Try to create a transfer scope. If one cached scope matches the +// requirement, just return that one. +// Inputs: +// @type0: the source kernel type. +// @type1: the target kernel type. +// @scope: the execution scope of this op. +// Returns: A scope used to hold the transfer data across the different kernel +// types. 
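+// Usage sketch (the kernel-type variables below are hypothetical, not part of +// this file): repeated calls with the same (type0, type1, scope) triple hit +// the cache and return the same scope object, e.g. +// Scope* s1 = TryCreateTransferScope(cpu_kernel_type, gpu_kernel_type, &scope); +// Scope* s2 = TryCreateTransferScope(cpu_kernel_type, gpu_kernel_type, &scope); +// // here s1 == s2, so per-batch transfer data reuses a single cached scope.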
+Scope* TryCreateTransferScope(OpKernelType type0, OpKernelType type1, + const Scope* scope) { + Scope* new_scope{nullptr}; + size_t infer_cache_key = + CombineHash(OpKernelType::Hash()(type0), OpKernelType::Hash()(type1)); + infer_cache_key = + CombineHash(infer_cache_key, std::hash()(scope)); + + auto it = global_transfer_data_cache().find(infer_cache_key); + if (it != global_transfer_data_cache().end()) { + new_scope = global_transfer_data_cache()[infer_cache_key]; + } else { + new_scope = &scope->NewScope(); + global_transfer_data_cache()[infer_cache_key] = new_scope; + } + global_transfer_scope_cache().insert(new_scope); + return new_scope; +} + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/transfer_scope_cache.h b/paddle/fluid/framework/transfer_scope_cache.h new file mode 100644 index 0000000000000000000000000000000000000000..86fc0bf52972ae22a9bf269d71291f7fcd593880 --- /dev/null +++ b/paddle/fluid/framework/transfer_scope_cache.h @@ -0,0 +1,41 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include // NOLINT +#include +#include +#include "paddle/fluid/framework/op_kernel_type.h" +#include "paddle/fluid/framework/scope.h" + +namespace paddle { +namespace framework { + +std::unordered_map& global_transfer_data_cache(); + +std::unordered_set& global_transfer_scope_cache(); + +// Combine two hash values to a single hash. +static size_t CombineHash(size_t seed, size_t a) { + return (seed ^ a) + 0x9e3779b9 + (seed << 6) + (seed >> 2); +} + +Scope* TryCreateTransferScope(OpKernelType type0, OpKernelType type1, + const Scope* scope); + +void RemoveKidsFromTransferScopeCache(Scope* scope); + +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/type_defs.h b/paddle/fluid/framework/type_defs.h index e099e40f121ff13657e563eb608feecbca0551be..2de6233a9e0d320ec9a06d547db3575eb61925c0 100644 --- a/paddle/fluid/framework/type_defs.h +++ b/paddle/fluid/framework/type_defs.h @@ -36,7 +36,7 @@ using Attribute = boost::variant, std::vector, std::vector, bool, std::vector, BlockDesc*, int64_t, - std::vector>; + std::vector, std::vector>; using AttributeMap = std::unordered_map; diff --git a/paddle/fluid/framework/var_desc.cc b/paddle/fluid/framework/var_desc.cc index 7e3f002b53351ba5892aaa50482b21a83db94069..29ef459b454075a30c3a4d0ff0f9ef1212292b4b 100644 --- a/paddle/fluid/framework/var_desc.cc +++ b/paddle/fluid/framework/var_desc.cc @@ -61,10 +61,10 @@ size_t VarDesc::GetTensorDescNum() const { void VarDesc::SetShapes( const std::vector> &multiple_dims) { if (multiple_dims.size() != GetTensorDescNum()) { - VLOG(3) << "WARNING: The number of given shapes(" << multiple_dims.size() - << ") doesn't match the existing tensor number(" - << GetTensorDescNum() - << "). 
The Reader is going to be reinitialized."; + VLOG(30) << "WARNING: The number of given shapes(" << multiple_dims.size() + << ") doesn't match the existing tensor number(" + << GetTensorDescNum() + << "). The Reader is going to be reinitialized."; SetTensorDescNum(multiple_dims.size()); } std::vector tensors = mutable_tensor_descs(); @@ -94,11 +94,11 @@ void VarDesc::SetDataType(proto::VarType::Type data_type) { void VarDesc::SetDataTypes( const std::vector &multiple_data_type) { if (multiple_data_type.size() != GetTensorDescNum()) { - VLOG(3) << "WARNING: The number of given data types(" - << multiple_data_type.size() - << ") doesn't match the existing tensor number(" - << GetTensorDescNum() - << "). The Reader is going to be reinitialized."; + VLOG(30) << "WARNING: The number of given data types(" + << multiple_data_type.size() + << ") doesn't match the existing tensor number(" + << GetTensorDescNum() + << "). The Reader is going to be reinitialized."; SetTensorDescNum(multiple_data_type.size()); } std::vector tensor_descs = @@ -139,11 +139,11 @@ void VarDesc::SetLoDLevel(int32_t lod_level) { void VarDesc::SetLoDLevels(const std::vector &multiple_lod_level) { if (multiple_lod_level.size() != GetTensorDescNum()) { - VLOG(3) << "WARNING: The number of given lod_levels(" - << multiple_lod_level.size() - << ") doesn't match the existing tensor number(" - << GetTensorDescNum() - << "). The Reader is going to be reinitialized."; + VLOG(30) << "WARNING: The number of given lod_levels(" + << multiple_lod_level.size() + << ") doesn't match the existing tensor number(" + << GetTensorDescNum() + << "). The Reader is going to be reinitialized."; SetTensorDescNum(multiple_lod_level.size()); } switch (desc_.type().type()) { diff --git a/paddle/fluid/framework/var_type_inference.h b/paddle/fluid/framework/var_type_inference.h index f3035cd712bdea517068b4c172bb2794d5fccddb..64236b78d2e390ea5f6c43c76a4b33b62c67629f 100644 --- a/paddle/fluid/framework/var_type_inference.h +++ b/paddle/fluid/framework/var_type_inference.h @@ -13,6 +13,9 @@ See the License for the specific language governing permissions and limitations under the License. 
*/ #pragma once +#include +#include "paddle/fluid/framework/block_desc.h" +#include "paddle/fluid/framework/op_desc.h" #include "paddle/fluid/framework/type_defs.h" namespace paddle { namespace framework { @@ -24,5 +27,27 @@ class VarTypeInference { virtual void operator()(const OpDesc& op_desc, BlockDesc* block) const = 0; }; +class PassInDtypeAndVarTypeToOutput : public framework::VarTypeInference { + public: + void operator()(const framework::OpDesc& op_desc, + framework::BlockDesc* block) const final { + auto in_out_var_names = this->GetInputOutputWithSameType(); + + for (auto& i_o_n : in_out_var_names) { + auto& x_name = op_desc.Input(i_o_n.first).at(0); + auto& out_name = op_desc.Output(i_o_n.second).at(0); + + auto& x = block->FindRecursiveOrCreateVar(x_name); + auto& out = block->FindRecursiveOrCreateVar(out_name); + out.SetType(x.GetType()); + out.SetDataType(x.GetDataType()); + } + } + + protected: + virtual std::unordered_map + GetInputOutputWithSameType() const = 0; +}; + } // namespace framework } // namespace paddle diff --git a/paddle/fluid/inference/CMakeLists.txt b/paddle/fluid/inference/CMakeLists.txt index 9794a193bcfaae19552b1f6fbdf2dab2898033d5..058a5b5f460d2bd3c4c0248929dd0c87f7506930 100644 --- a/paddle/fluid/inference/CMakeLists.txt +++ b/paddle/fluid/inference/CMakeLists.txt @@ -1,6 +1,10 @@ +if(WITH_TESTING) + include(tests/test.cmake) # some generic cmake functions for inference +endif() # analysis and tensorrt must be added before creating static library, # otherwise, there would be undefined reference to them in static library. add_subdirectory(analysis) +add_subdirectory(utils) if (TENSORRT_FOUND) add_subdirectory(tensorrt) endif() @@ -10,12 +14,24 @@ set(FLUID_CORE_MODULES proto_desc memory lod_tensor executor) # TODO(panyx0718): Should this be called paddle_fluid_inference_api_internal? 
cc_library(paddle_fluid_api SRCS io.cc - DEPS ${FLUID_CORE_MODULES} ${GLOB_OP_LIB}) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OP_LIB} ${GLOB_OPERATOR_DEPS}) get_property(fluid_modules GLOBAL PROPERTY FLUID_MODULES) +get_property(cuda_modules GLOBAL PROPERTY CUDA_MODULES) +get_property(fluid_third_partys GLOBAL PROPERTY FLUID_THRID_PARTYS) +if (WIN32) +list(APPEND fluid_third_partys gflags glog protobuf cblas) +endif(WIN32) # paddle_fluid_origin excludes the inference api interface -cc_library(paddle_fluid_origin DEPS ${fluid_modules} paddle_fluid_api) +if(WIN32) + sep_library(paddle_fluid_origin DEPS ${fluid_modules} paddle_fluid_api) + if(WITH_GPU AND NOT WITH_DSO) + target_link_libraries(paddle_fluid_origin ${cuda_modules}) + endif(WITH_GPU AND NOT WITH_DSO) +else(WIN32) + cc_library(paddle_fluid_origin DEPS ${fluid_modules} paddle_fluid_api) +endif(WIN32) add_subdirectory(api) @@ -24,13 +40,17 @@ set(SHARED_INFERENCE_SRCS io.cc ${CMAKE_CURRENT_SOURCE_DIR}/api/api.cc ${CMAKE_CURRENT_SOURCE_DIR}/api/api_impl.cc ${CMAKE_CURRENT_SOURCE_DIR}/api/analysis_predictor.cc ${CMAKE_CURRENT_SOURCE_DIR}/api/details/zero_copy_tensor.cc) -if (WITH_GPU AND TENSORRT_FOUND) - set(STATIC_INFERENCE_APIS ${STATIC_INFERENCE_APIS} paddle_inference_tensorrt_subgraph_engine) - set(SHARED_INFERENCE_SRCS ${SHARED_INFERENCE_SRCS} ${CMAKE_CURRENT_SOURCE_DIR}/api/api_tensorrt_subgraph_engine.cc) -endif() -# Create static library -cc_library(paddle_fluid DEPS ${fluid_modules} ${STATIC_INFERENCE_APIS} zero_copy_tensor) +if(WIN32) + sep_library(paddle_fluid DEPS ${fluid_modules} ${STATIC_INFERENCE_APIS} zero_copy_tensor reset_tensor_array + analysis_config paddle_pass_builder) + if(WITH_GPU AND NOT WITH_DSO) + target_link_libraries(paddle_fluid ${cuda_modules}) + endif(WITH_GPU AND NOT WITH_DSO) +else(WIN32) + cc_library(paddle_fluid DEPS ${fluid_modules} ${STATIC_INFERENCE_APIS} zero_copy_tensor reset_tensor_array + analysis_config paddle_pass_builder) +endif(WIN32) if(NOT APPLE) # TODO(liuyiqun): Temporarily disable the link flag because it is not supported on Mac. @@ -39,11 +59,20 @@ if(NOT APPLE) endif() # Create shared library -cc_library(paddle_fluid_shared SHARED SRCS ${SHARED_INFERENCE_SRCS} - DEPS ${fluid_modules} paddle_fluid_api) +if(WIN32) + sep_library(paddle_fluid_shared SHARED SRCS ${SHARED_INFERENCE_SRCS} + DEPS ${fluid_modules} paddle_fluid_api reset_tensor_array analysis_config paddle_pass_builder) + target_link_libraries(paddle_fluid_shared shlwapi) + if(WITH_GPU AND NOT WITH_DSO) + target_link_libraries(paddle_fluid_origin ${cuda_modules}) + endif(WITH_GPU AND NOT WITH_DSO) +else(WIN32) + cc_library(paddle_fluid_shared SHARED SRCS ${SHARED_INFERENCE_SRCS} + DEPS ${fluid_modules} paddle_fluid_api reset_tensor_array analysis_config paddle_pass_builder) +endif() set_target_properties(paddle_fluid_shared PROPERTIES OUTPUT_NAME paddle_fluid) -if(NOT APPLE) +if(NOT APPLE AND NOT WIN32) # TODO(liuyiqun): Temporarily disable the link flag because it is not supported on Mac. 
set(LINK_FLAGS "-Wl,--version-script ${CMAKE_CURRENT_SOURCE_DIR}/paddle_fluid.map") set_target_properties(paddle_fluid_shared PROPERTIES LINK_FLAGS "${LINK_FLAGS}") diff --git a/paddle/fluid/inference/analysis/CMakeLists.txt b/paddle/fluid/inference/analysis/CMakeLists.txt index d4d2fd4634f9e11f3f002e11e177c332ced49885..4bd3f93ef75ada545751fef5af77a78e4872b690 100644 --- a/paddle/fluid/inference/analysis/CMakeLists.txt +++ b/paddle/fluid/inference/analysis/CMakeLists.txt @@ -1,51 +1,38 @@ -cc_library(ir_pass_manager SRCS ir_pass_manager.cc DEPS graph pass) -set(analysis_deps - framework_proto proto_desc ir_pass_manager graph pass paddle_fluid_api executor pretty_log) +unset(analysis_deps CACHE) +set(analysis_deps # analysis_deps can be extended across the project + framework_proto proto_desc graph pass paddle_fluid_api executor pretty_log + ir_pass_manager + CACHE INTERNAL "") -cc_library(analysis SRCS pass_manager.cc node.cc data_flow_graph.cc graph_traits.cc subgraph_splitter.cc +add_subdirectory(ir_passes) +add_subdirectory(passes) + +cc_library(analysis_helper SRCS helper.cc DEPS framework_proto proto_desc graph paddle_fluid_api) + +cc_library(ir_pass_manager SRCS ir_pass_manager.cc DEPS graph pass ${INFER_IR_PASSES} analysis_helper) + +cc_library(argument SRCS argument.cc DEPS scope proto_desc) +cc_library(analysis_pass SRCS analysis_pass.cc DEPS proto_desc) + +cc_library(analysis SRCS analyzer.cc - helper.cc - # passes - analysis_pass.cc - fluid_to_data_flow_graph_pass.cc - data_flow_graph_to_fluid_pass.cc - dfg_graphviz_draw_pass.cc - tensorrt_subgraph_pass.cc - tensorrt_subgraph_node_mark_pass.cc - fluid_to_ir_pass.cc - model_store_pass.cc - DEPS ${analysis_deps}) - -cc_test(test_node SRCS node_tester.cc DEPS analysis) + analysis_pass + DEPS ${analysis_deps} analysis_helper + ) + cc_test(test_dot SRCS dot_tester.cc DEPS analysis) -cc_binary(inference_analyzer SRCS analyzer_main.cc DEPS analysis paddle_fluid) - -function (inference_analysis_test TARGET) - if(WITH_TESTING) - set(options "") - set(oneValueArgs "") - set(multiValueArgs SRCS ARGS EXTRA_DEPS) - cmake_parse_arguments(analysis_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) - set(mem_opt "") - if(WITH_GPU) - set(mem_opt "--fraction_of_gpu_memory_to_use=0.5") - endif() - cc_test(${TARGET} - SRCS "${analysis_test_SRCS}" - DEPS analysis pass ${GLOB_PASS_LIB} ${analysis_test_EXTRA_DEPS} - ARGS --inference_model_dir=${PYTHON_TESTS_DIR}/book/word2vec.inference.model ${mem_opt} ${analysis_test_ARGS}) - set_tests_properties(${TARGET} PROPERTIES DEPENDS test_word2vec) - endif(WITH_TESTING) + +function(inference_analysis_test TARGET) + if(WITH_TESTING) + set(options "") + set(oneValueArgs "") + set(multiValueArgs SRCS ARGS EXTRA_DEPS) + cmake_parse_arguments(analysis_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + inference_base_test(${TARGET} + SRCS ${analysis_test_SRCS} + DEPS analysis pass ${GLOB_PASS_LIB} ${analysis_test_EXTRA_DEPS} + ARGS --inference_model_dir=${WORD2VEC_MODEL_DIR} ${analysis_test_ARGS}) + endif() endfunction(inference_analysis_test) -inference_analysis_test(test_analyzer SRCS analyzer_tester.cc EXTRA_DEPS paddle_inference_api) -inference_analysis_test(test_data_flow_graph SRCS data_flow_graph_tester.cc) -inference_analysis_test(test_data_flow_graph_to_fluid_pass SRCS data_flow_graph_to_fluid_pass_tester.cc) -inference_analysis_test(test_fluid_to_ir_pass SRCS fluid_to_ir_pass_tester.cc) -inference_analysis_test(test_fluid_to_data_flow_graph_pass SRCS 
fluid_to_data_flow_graph_pass_tester.cc) -inference_analysis_test(test_subgraph_splitter SRCS subgraph_splitter_tester.cc) -inference_analysis_test(test_dfg_graphviz_draw_pass SRCS dfg_graphviz_draw_pass_tester.cc) -inference_analysis_test(test_tensorrt_subgraph_pass SRCS tensorrt_subgraph_pass_tester.cc) -inference_analysis_test(test_pass_manager SRCS pass_manager_tester.cc) -inference_analysis_test(test_tensorrt_subgraph_node_mark_pass SRCS tensorrt_subgraph_node_mark_pass_tester.cc) -inference_analysis_test(test_model_store_pass SRCS model_store_pass_tester.cc) +inference_analysis_test(test_analyzer SRCS analyzer_tester.cc EXTRA_DEPS reset_tensor_array paddle_inference_api) diff --git a/paddle/fluid/inference/analysis/analysis_pass.h b/paddle/fluid/inference/analysis/analysis_pass.h index 13805ea4acf936b242bcd86b2faf89813753a9fe..299f235a74ae0ffb663be61079607d8ac1105a97 100644 --- a/paddle/fluid/inference/analysis/analysis_pass.h +++ b/paddle/fluid/inference/analysis/analysis_pass.h @@ -19,42 +19,36 @@ limitations under the License. */ #include #include "paddle/fluid/framework/framework.pb.h" +#include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/inference/analysis/argument.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" #include "paddle/fluid/inference/analysis/helper.h" -#include "paddle/fluid/inference/analysis/node.h" namespace paddle { namespace inference { namespace analysis { +/* + * AnalysisPass is a pass used to control the IR passes. + */ class AnalysisPass { public: AnalysisPass() = default; virtual ~AnalysisPass() = default; - // Mutable Pass. - virtual bool Initialize(Argument *argument) { return false; } - // Readonly Pass. - virtual bool Initialize(const Argument &argument) { return false; } - // Virtual method overriden by subclasses to do any necessary clean up after - // all passes have run. - virtual bool Finalize() { return false; } - - // Create a debugger Pass that draw the DFG by graphviz toolkit. - virtual AnalysisPass *CreateGraphvizDebugerPass() const { return nullptr; } - - // Run on a single DataFlowGraph. - virtual void Run(DataFlowGraph *x) = 0; + // Run on a single Graph. + void Run(Argument* argument) { RunImpl(argument); } // Human-readable short representation. virtual std::string repr() const = 0; // Human-readable long description. virtual std::string description() const { return "No DOC"; } -}; -// GraphPass processes on any GraphType. -class DataFlowGraphPass : public AnalysisPass {}; + protected: + // User should implement these. 
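// For illustration (DemoPass is hypothetical, not defined in this patch):
// a concrete pass overrides only RunImpl(); Run() above stays the fixed
// public entry point, so every pass is invoked uniformly:
//
//   class DemoPass final : public AnalysisPass {
//    public:
//     std::string repr() const override { return "demo_pass"; }
//
//    protected:
//     void RunImpl(Argument *argument) override {
//       auto &graph = argument->main_graph();  // transform the IR graph here
//     }
//   };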
+ virtual void RunImpl(Argument* argument) = 0; + + Argument* argument_{nullptr}; +}; } // namespace analysis } // namespace inference diff --git a/paddle/fluid/inference/analysis/analyzer.cc b/paddle/fluid/inference/analysis/analyzer.cc index d780592eb9f79e39e34fcd3bd6b086992eaa931f..c8ed373ee7c32552608d501aa642677f940cd520 100644 --- a/paddle/fluid/inference/analysis/analyzer.cc +++ b/paddle/fluid/inference/analysis/analyzer.cc @@ -15,111 +15,23 @@ #include "paddle/fluid/inference/analysis/analyzer.h" #include #include - -#include "paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h" -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" -#include "paddle/fluid/inference/analysis/fluid_to_ir_pass.h" -#include "paddle/fluid/inference/analysis/model_store_pass.h" -#include "paddle/fluid/inference/analysis/pass_manager.h" -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h" -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h" - -DEFINE_bool(IA_enable_tensorrt_subgraph_engine, false, - "Enable subgraph to TensorRT engine for acceleration"); - -DEFINE_bool(IA_enable_ir, false, "Turn on IR support"); - -DEFINE_string(IA_graphviz_log_root, "./", - "Graphviz debuger for data flow graphs."); - -DEFINE_string(IA_output_storage_path, "", "optimized model output path"); +#include "paddle/fluid/inference/analysis/passes/ir_analysis_compose_pass.h" +#include "paddle/fluid/inference/analysis/passes/passes.h" namespace paddle { namespace inference { namespace analysis { -class DfgPassManagerImpl final : public DfgPassManager { - public: - DfgPassManagerImpl() { - // TODO(Superjomn) set the key with pass reprs. - if (!FLAGS_IA_enable_ir) { - AddPass("fluid-to-data-flow-graph", new FluidToDataFlowGraphPass); - } else { - AddPass("fluid-to-ir-pass", new FluidToIrPass); - } - TryAddTensorRtPass(); - AddPass("data-flow-graph-to-fluid", new DataFlowGraphToFluidPass); - if (!FLAGS_IA_output_storage_path.empty()) { - AddPass("model-store-pass", new ModelStorePass); - } - } - - std::string repr() const override { return "dfg-pass-manager"; } - std::string description() const override { return "DFG pass manager."; } - - private: - void AddPass(const std::string& name, AnalysisPass* pass) { - VLOG(3) << "Adding pass " << name; - Register(name, pass); - AddGraphvizDebugerPass(pass); - } - - void TryAddTensorRtPass() { - if (FLAGS_IA_enable_tensorrt_subgraph_engine) { - auto trt_teller = [&](const Node* node) { - std::unordered_set teller_set( - {"mul", "conv2d", "pool2d", "relu", "softmax", "sigmoid", - "depthwise_conv2d", "batch_norm", "concat", "tanh", "pad", - "elementwise_add", "dropout"}); - if (!node->IsFunction()) return false; - - const auto* func = static_cast(node); - if (teller_set.count(func->func_type())) { - return true; - } else { - return false; - } - }; +Analyzer::Analyzer() {} - AddPass("tensorrt-subgraph-marker", - new TensorRTSubgraphNodeMarkPass(trt_teller)); - AddPass("tensorrt-subgraph", new TensorRTSubGraphPass(trt_teller)); - } - } - - // Add the graphviz debuger pass if the parent pass has one. 
- void AddGraphvizDebugerPass(AnalysisPass* pass) { - auto* debuger_pass = pass->CreateGraphvizDebugerPass(); - if (debuger_pass) { - Register(debuger_pass->repr(), debuger_pass); - } - } -}; +void Analyzer::Run(Argument *argument) { RunIrAnalysis(argument); } -Analyzer::Analyzer() { Register("manager1", new DfgPassManagerImpl); } +void Analyzer::RunIrAnalysis(Argument *argument) { + std::vector passes({"ir_analysis_compose_pass"}); -void Analyzer::Run(Argument* argument) { - std::vector passes; - for (auto& pass : all_ir_passes_) { - if (!disabled_ir_passes_.count(pass)) { - passes.push_back(pass); - passes.push_back("graph_viz_pass"); // add graphviz for debug. - } + for (auto &pass : passes) { + PassRegistry::Global().Retreive(pass)->Run(argument); } - passes.push_back("graph_viz_pass"); - argument->Set(kFluidToIrPassesAttr, new std::vector(passes)); - - for (auto& x : data_) { - PADDLE_ENFORCE(x->Initialize(argument)); - x->RunAll(); - PADDLE_ENFORCE(x->Finalize()); - } -} - -Analyzer& Analyzer::DisableIrPasses(const std::vector& passes) { - disabled_ir_passes_.insert(passes.begin(), passes.end()); - return *this; } } // namespace analysis diff --git a/paddle/fluid/inference/analysis/analyzer.h b/paddle/fluid/inference/analysis/analyzer.h index 765145cb7da44ca13c5394ad1dc2e879e69d69d1..b43e67f20f493cd8151871ca3a36eb6fdadcf9ff 100644 --- a/paddle/fluid/inference/analysis/analyzer.h +++ b/paddle/fluid/inference/analysis/analyzer.h @@ -40,47 +40,21 @@ limitations under the License. */ #include #include "paddle/fluid/inference/analysis/analysis_pass.h" #include "paddle/fluid/inference/analysis/flags.h" -#include "paddle/fluid/inference/analysis/pass_manager.h" namespace paddle { namespace inference { namespace analysis { -class Analyzer : public OrderedRegistry { +class Analyzer final { public: - // Register all the pass-managers. Analyzer(); void Run(Argument* argument); - Analyzer& DisableIrPasses(const std::vector& passes); - DISABLE_COPY_AND_ASSIGN(Analyzer); - private: - // All avaiable IR passes. - // The bigger fuse comes first, so that the small operators prefer to be - // merged in a larger fuse op. The small fusion will not break the pattern of - // larger fusion. - const std::vector all_ir_passes_{{ - // Manual update the passes here. 
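// The hand-maintained pass list below is removed by this patch; pass
// composition now lives in ir_analysis_compose_pass, fetched through the
// global registry. A sketch of the new call path, using only names this
// patch introduces (note the registry method really is spelled "Retreive"):
//
//   Argument argument;
//   argument.SetModelDir(model_dir);  // model_dir: some model directory
//   PassRegistry::Global().Retreive("ir_analysis_compose_pass")->Run(&argument);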
- "infer_clean_graph_pass", // - "attention_lstm_fuse_pass", // - "embedding_fc_lstm_fuse_pass", // - "fc_lstm_fuse_pass", // - "mul_lstm_fuse_pass", // - "fc_gru_fuse_pass", // - "mul_gru_fuse_pass", // - "seq_concat_fc_fuse_pass", // - "fc_fuse_pass", // - "conv_bn_fuse_pass", // - "conv_eltwiseadd_bn_fuse_pass", // -#ifdef PADDLE_WITH_MKLDNN - "conv_relu_mkldnn_fuse_pass", // -#endif - }}; - - std::unordered_set disabled_ir_passes_; + protected: + void RunIrAnalysis(Argument* argument); }; } // namespace analysis diff --git a/paddle/fluid/inference/analysis/analyzer_tester.cc b/paddle/fluid/inference/analysis/analyzer_tester.cc index f90910ac0d0a897ef01d4ca2bd0bca575baf4c40..7710ed7b613a180a4e24237a2cf6b5f212f677a3 100644 --- a/paddle/fluid/inference/analysis/analyzer_tester.cc +++ b/paddle/fluid/inference/analysis/analyzer_tester.cc @@ -19,6 +19,7 @@ #include "paddle/fluid/inference/analysis/ut_helper.h" #include "paddle/fluid/inference/api/paddle_inference_api.h" #include "paddle/fluid/inference/api/paddle_inference_pass.h" +#include "paddle/fluid/platform/port.h" namespace paddle { namespace inference { @@ -27,21 +28,23 @@ namespace analysis { using namespace framework; // NOLINT TEST(Analyzer, analysis_without_tensorrt) { - FLAGS_IA_enable_tensorrt_subgraph_engine = false; Argument argument; - argument.fluid_model_dir.reset(new std::string(FLAGS_inference_model_dir)); + argument.SetModelDir(FLAGS_inference_model_dir); + argument.SetIrAnalysisPasses({"infer_clean_graph_pass"}); + argument.SetUseGPU(false); + Analyzer analyser; analyser.Run(&argument); } TEST(Analyzer, analysis_with_tensorrt) { - FLAGS_IA_enable_tensorrt_subgraph_engine = true; Argument argument; - argument.Set("minimum_subgraph_size", new int(0)); - argument.Set("max_batch_size", new int(3)); - argument.Set("workspace_size", new int(1 << 20)); - argument.Set("precision_mode", new std::string("FP32")); - argument.fluid_model_dir.reset(new std::string(FLAGS_inference_model_dir)); + argument.SetTensorRtMaxBatchSize(3); + argument.SetTensorRtWorkspaceSize(1 << 20); + argument.SetModelDir(FLAGS_inference_model_dir); + argument.SetIrAnalysisPasses({"infer_clean_graph_pass"}); + argument.SetUseGPU(false); + Analyzer analyser; analyser.Run(&argument); } @@ -51,9 +54,7 @@ void TestWord2vecPrediction(const std::string& model_path) { config.model_dir = model_path; config.use_gpu = false; config.device = 0; - auto predictor = - ::paddle::CreatePaddlePredictor( - config); + auto predictor = ::paddle::CreatePaddlePredictor(config); // One single batch @@ -75,7 +76,7 @@ void TestWord2vecPrediction(const std::string& model_path) { 0.000932706}; const size_t num_elements = outputs.front().data.length() / sizeof(float); // The outputs' buffers are in CPU memory. 
- for (size_t i = 0; i < std::min(5UL, num_elements); i++) { + for (size_t i = 0; i < std::min((size_t)5UL, num_elements); i++) { LOG(INFO) << "data: " << static_cast(outputs.front().data.data())[i]; PADDLE_ENFORCE(static_cast(outputs.front().data.data())[i], diff --git a/paddle/fluid/inference/analysis/argument.h b/paddle/fluid/inference/analysis/argument.h index e8fb0775b45761f64fd6fd28306c35b76d1e40c4..21203e2d9f4e4cd22ea49ea7b6808aff07e70eff 100644 --- a/paddle/fluid/inference/analysis/argument.h +++ b/paddle/fluid/inference/analysis/argument.h @@ -24,13 +24,16 @@ #pragma once #include +#include +#include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" +#include "paddle/fluid/framework/scope.h" #include "paddle/fluid/platform/variant.h" namespace paddle { namespace inference { namespace analysis { +using framework::ir::Graph; /* * The argument definition of both Pass and PassManagers. @@ -39,75 +42,100 @@ namespace analysis { */ struct Argument { Argument() = default; - explicit Argument(const std::string& fluid_model_dir) - : fluid_model_dir(new std::string(fluid_model_dir)) {} - // The directory of the trained model. - std::unique_ptr fluid_model_dir; - // The path of `__model__` and `param`, this is used when the file name of - // model and param is changed. - std::unique_ptr fluid_model_program_path; - std::unique_ptr fluid_model_param_path; - - // The graph that process by the Passes or PassManagers. - std::unique_ptr main_dfg; - - // The original program desc. - std::unique_ptr origin_program_desc; - - // The processed program desc. - std::unique_ptr transformed_program_desc; - - // The output storage path of ModelStorePass. - std::unique_ptr model_output_store_path; - - // Support for any other attributes. 
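// (This generic attribute map, with boost::any storage and per-key deleters,
// is what the typed DECL_ARGUMENT_FIELD machinery below replaces.)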
- template - void Set(const std::string& key, T* data) { - PADDLE_ENFORCE_NOT_NULL(data); - PADDLE_ENFORCE(!attrs_.count(key), "Duplicate set Argument's attr [%s]", - key); - attrs_[key] = data; - attr_deleters_[key] = [data, key]() { - VLOG(3) << "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"; - VLOG(3) << "argument delete attr: " << key; - delete data; - }; - } - - bool Has(const std::string& name) const { return attrs_.count(name); } - - template - T* Release(const std::string& key) { - PADDLE_ENFORCE(attrs_.count(key)); - auto* res = boost::any_cast(attrs_.at(key)); - attrs_.erase(key); - attr_deleters_.erase(key); - return res; - } - - template - T& Get(const std::string& key) { - PADDLE_ENFORCE(Has(key)); - return *boost::any_cast(attrs_.at(key)); - } - - ~Argument() { - for (auto& item : attr_deleters_) { - item.second(); - } - } + explicit Argument(const std::string& model_dir) { SetModelDir(model_dir); } + + using unique_ptr_t = std::unique_ptr>; + using fusion_statis_t = std::unordered_map; + + bool Has(const std::string& key) const { return valid_fields_.count(key); } + +#define DECL_ARGUMENT_FIELD(field__, Field, type__) \ + public: \ + type__& field__() { \ + PADDLE_ENFORCE(Has(#field__)); \ + return field__##_; \ + } \ + void Set##Field(const type__& x) { \ + field__##_ = x; \ + valid_fields_.insert(#field__); \ + } \ + DECL_ARGUMENT_FIELD_VALID(field__); \ + type__* field__##_ptr() { return &field__##_; } \ + \ + private: \ + type__ field__##_; + +#define DECL_ARGUMENT_FIELD_VALID(field__) \ + bool field__##_valid() { return Has(#field__); } + +#define DECL_ARGUMENT_UNIQUE_FIELD(field__, Field, type__) \ + public: \ + type__& field__() { \ + PADDLE_ENFORCE_NOT_NULL(field__##_); \ + PADDLE_ENFORCE(Has(#field__)); \ + return *static_cast(field__##_.get()); \ + } \ + void Set##Field(type__* x) { \ + field__##_ = \ + unique_ptr_t(x, [](void* x) { delete static_cast(x); }); \ + valid_fields_.insert(#field__); \ + } \ + void Set##Field##NotOwned(type__* x) { \ + valid_fields_.insert(#field__); \ + field__##_ = unique_ptr_t(x, [](void* x) {}); \ + } \ + DECL_ARGUMENT_FIELD_VALID(field__); \ + type__* field__##_ptr() { \ + PADDLE_ENFORCE(Has(#field__)); \ + return static_cast(field__##_.get()); \ + } \ + type__* Release##Field() { \ + PADDLE_ENFORCE(Has(#field__)); \ + valid_fields_.erase(#field__); \ + return static_cast(field__##_.release()); \ + } \ + \ + private: \ + unique_ptr_t field__##_; + + // Model path + DECL_ARGUMENT_FIELD(model_dir, ModelDir, std::string); + // Model specified with program and parameters files. + DECL_ARGUMENT_FIELD(model_program_path, ModelProgramPath, std::string); + DECL_ARGUMENT_FIELD(model_params_path, ModelParamsPath, std::string); + + // The overall graph to work on. + DECL_ARGUMENT_UNIQUE_FIELD(main_graph, MainGraph, framework::ir::Graph); + // The overall Scope to work on. + DECL_ARGUMENT_UNIQUE_FIELD(scope, Scope, framework::Scope); + + DECL_ARGUMENT_UNIQUE_FIELD(main_program, MainProgram, framework::ProgramDesc); + + // The ir passes to perform in analysis phase. 
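// For reference, a sketch of what DECL_ARGUMENT_FIELD expands to for the
// declarations that follow, e.g. DECL_ARGUMENT_FIELD(model_dir, ModelDir,
// std::string) generates roughly:
//
//   std::string &model_dir();               // checked getter (PADDLE_ENFORCE)
//   void SetModelDir(const std::string &);  // setter; marks the field valid
//   bool model_dir_valid();                 // equivalent to Has("model_dir")
//   std::string *model_dir_ptr();           // raw pointer to the storage
//
// Every Set* call records the field name in valid_fields_, which is what
// Has() consults.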
+ DECL_ARGUMENT_FIELD(ir_analysis_passes, IrAnalysisPasses, + std::vector); + + DECL_ARGUMENT_FIELD(use_gpu, UseGPU, bool); + DECL_ARGUMENT_FIELD(gpu_device_id, GPUDeviceId, int); + DECL_ARGUMENT_FIELD(use_tensorrt, UseTensorRT, bool); + DECL_ARGUMENT_FIELD(tensorrt_node_teller, TensorRtNodeTeller, + std::function); + DECL_ARGUMENT_FIELD(tensorrt_max_batch_size, TensorRtMaxBatchSize, int); + DECL_ARGUMENT_FIELD(tensorrt_workspace_size, TensorRtWorkspaceSize, int); + + // The program transformed by IR analysis phase. + DECL_ARGUMENT_UNIQUE_FIELD(ir_analyzed_program, IrAnalyzedProgram, + framework::proto::ProgramDesc); + + DECL_ARGUMENT_FIELD(fusion_statis, FusionStatis, fusion_statis_t); private: - std::unordered_map attrs_; - std::unordered_map> attr_deleters_; + std::unordered_set valid_fields_; }; -#define UNLIKELY(condition) __builtin_expect(static_cast(condition), 0) -#define ANALYSIS_ARGUMENT_CHECK_FIELD(field__) \ - if (UNLIKELY(!(field__))) { \ - LOG(ERROR) << "field " << #field__ << " should be set."; \ - return false; \ - } +#define ARGUMENT_CHECK_FIELD(argument__, fieldname__) \ + PADDLE_ENFORCE(argument__->Has(#fieldname__), \ + "the argument field [%s] should be set", #fieldname__); } // namespace analysis } // namespace inference diff --git a/paddle/fluid/inference/analysis/data_flow_graph.cc b/paddle/fluid/inference/analysis/data_flow_graph.cc deleted file mode 100644 index 8c7d58678fd29cb25d13d64a08e6c6f26f242d8b..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph.cc +++ /dev/null @@ -1,496 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/data_flow_graph.h" -#include "paddle/fluid/inference/analysis/dot.h" -#include "paddle/fluid/inference/analysis/node.h" - -namespace paddle { -namespace inference { -namespace analysis { -using ir_node_t = framework::ir::Node; -using ir_graph_t = framework::ir::Graph; - -// It is a better idea that the inputs and outputs of this graph is set manually -// before, but there must be a Pass that helps to prune the unnecessary ops that -// do not contribute to the given targets, so in this pass, analysis and get the -// inputs and outputs is OK. 
-void DataFlowGraph::Build() { - inputs_.clear(); - outputs_.clear(); - std::unordered_set ins; - std::unordered_set outs; - for (auto &node : nodes.nodes()) { - for (auto *in : node->inlinks) { - ins.insert(in); - } - for (auto *out : node->outlinks) { - outs.insert(out); - } - } - - // The nodes that in ins but not in outs is the graph's inputs - // similarly, the nodes that in outs but not in ins is the graphs' outputs - for (auto *in : ins) { - if (!outs.count(in)) { - inputs_.push_back(in); - } - } - for (auto *out : outs) { - if (!ins.count(out)) { - outputs_.push_back(out); - } - } - - Clean(); -} - -void DataFlowGraph::Build(const framework::proto::ProgramDesc &prog) { - // insert vars - // The `var2id` keeps a map from a variable's name to its Node-id, the Node-id - // will keep updating to its latest alias during the graph-building. - std::unordered_map var2id; - auto &main_block = prog.blocks(framework::kRootBlockIndex); - for (int i = 0; i < main_block.vars_size(); i++) { - const auto &var = main_block.vars(i); - auto *v = nodes.Create(Node::Type::kValue); - v->SetName(var.name()); - v->SetPbDesc(const_cast(static_cast(&var))); - v->SetPbMsg(var.SerializeAsString()); - var2id[var.name()] = v->id(); - } - - // The variables in a SSA can only write once, so if a variable is written - // multiple times(quite common in our ProgramDesc design), multiple alias - // Nodes of this variable will be created, and each will just write once. - - // An set that keep all the names of the variables(the original, not alias) - // that have been written(as outputs). Once an Op's output variable hit the - // set, it should create a new alias and update the global alias for this - // variable. And that make a Data Flow Graph a SSA. - std::unordered_set unique_written_vars; - for (int i = 0; i < main_block.ops_size(); i++) { - const auto &op = main_block.ops(i); - auto *o = nodes.Create(Node::Type::kFunction); - o->SetName(op.type()); - static_cast(o)->SetFuncType(op.type()); - // Link to the original protobuf message's memory, make it easier to - // generate from a data flow graph to fluid ProgramDesc. - o->SetPbDesc(const_cast(static_cast(&op))); - o->SetPbMsg(op.SerializeAsString()); - - // set inputs and outputs - for (int j = 0; j < op.inputs_size(); j++) { - auto &in_var = op.inputs(j); - for (int k = 0; k < in_var.arguments_size(); k++) { - auto *in = nodes.GetMutable(var2id.at(in_var.arguments(k))); - in->outlinks.push_back(o); - o->inlinks.push_back(in); - unique_written_vars.insert(in); - } - } - for (int j = 0; j < op.outputs_size(); j++) { - auto &out_var = op.outputs(j); - for (int k = 0; k < out_var.arguments_size(); k++) { - auto *out = nodes.GetMutable(var2id[out_var.arguments(k)]); - if (unique_written_vars.count(out)) { - // Loop found, for example, a = op(a), use SSA, change to a1 = op(a). - auto *out_alias = nodes.Create(Node::Type::kValue); - out_alias->SetName(out->name()); - out_alias->SetPbDesc(out->pb_desc()); - out_alias->SetPbMsg(out->pb_msg()); - var2id[out_alias->name()] = - out_alias->id(); // update variable's alias Node - LOG(INFO) << "loop found in graph, create SSA alias node [" - << out_alias->repr() << "] for [" << out->repr() << "]"; - out = out_alias; - } - out->inlinks.push_back(o); - o->outlinks.push_back(out); - } - } - } - // Analysis and extract the inputs and outputs of this graph. 
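// To make the SSA rewrite described above concrete: for a block like
//
//   a = relu(a)   // second write to `a`
//   b = fc(a)
//
// the builder creates an alias node for the rewritten output, yielding
//
//   a1 = relu(a)
//   b  = fc(a1)
//
// so every value node is written at most once and var2id always maps a name
// to its latest alias.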
- Build(); -} - -void DataFlowGraph::Build(const framework::ir::Graph &graph) { - // Create nodes - std::unordered_map ir_node_map; - for (auto *ir_node : graph.Nodes()) { - Node *x{nullptr}; - if (ir_node->IsOp()) { - PADDLE_ENFORCE(ir_node->Op()); - VLOG(4) << "get op " << ir_node << " " << ir_node->Name(); - x = nodes.Create(Node::Type::kFunction); - x->attr("ir_node").Pointer() = ir_node; - PADDLE_ENFORCE(ir_node->Op()->Proto()); - x->SetName(ir_node->Op()->Proto()->type()); - x->SetPbMsg(ir_node->Op()->Proto()->SerializeAsString()); - } else if (ir_node->IsVar()) { - // Not create a Node for IR ControlDepVar, considering Inference currently - // just used in single thread scenerio. - VLOG(4) << "get var " << ir_node->Name(); - x = nodes.Create(Node::Type::kValue); - x->attr("ir_node").Pointer() = ir_node; - x->SetName(ir_node->Name()); - // x->SetPbMsg(ir_node->Var()->Proto()->SerializeAsString()); - } else { - PADDLE_THROW("Failed to create an Node from IR, unknown type"); - } - ir_node_map.emplace(ir_node, x); - } - VLOG(4) << "finish creating Nodes"; - - VLOG(4) << "to create edge"; - // Create links - for (auto *ir_node : graph.Nodes()) { - auto it = ir_node_map.find(ir_node); - // Skip ControlDepVar. - if (it == ir_node_map.end()) continue; - auto *node = it->second; - for (auto *x : ir_node->inputs) { - if (!ir_node_map.count(x)) continue; - node->inlinks.push_back(ir_node_map.at(x)); - } - for (auto *x : ir_node->outputs) { - if (!ir_node_map.count(x)) continue; - node->outlinks.push_back(ir_node_map.at(x)); - } - } - - Build(); - PADDLE_ENFORCE(!inputs_.empty(), - "Can't deduce any inputs from the graph, Is the graph empty?"); - - ir_graph = &graph; - VLOG(3) << "finished build from IR"; -} - -void DataFlowGraph::Clean() { - for (auto &node : nodes.nodes()) { - std::unordered_set inlinks_set(node->inlinks.begin(), - node->inlinks.end()); - std::unordered_set outlinks_set(node->outlinks.begin(), - node->outlinks.end()); - if (inlinks_set.size() < node->inlinks.size()) { - node->inlinks.assign(inlinks_set.begin(), inlinks_set.end()); - } - if (outlinks_set.size() < node->outlinks.size()) { - node->outlinks.assign(outlinks_set.begin(), outlinks_set.end()); - } - } -} - -std::string DataFlowGraph::DotString() const { - Dot dot; - - // Add nodes - for (size_t i = 0; i < nodes.size(); i++) { - const Node &node = nodes.Get(i); - dot.AddNode(node.repr(), node.dot_attrs()); - } - - // Add edges - for (size_t i = 0; i < nodes.size(); i++) { - const Node &node = nodes.Get(i); - for (auto &in : node.inlinks) { - dot.AddEdge(in->repr(), node.repr(), {}); - } - } - return dot.Build(); -} - -std::string DataFlowGraph::HumanReadableInfo(bool show_values, - bool show_functions) const { - std::stringstream values, functions; - for (auto &n : nodes.nodes()) { - if (show_values && n->IsValue()) { - values << n->repr() << "\n"; - } - if (show_functions && n->IsFunction()) { - functions << n->repr() << "\n"; - } - } - return "Values:\n" + values.str() + "\n\n" + "Functions:\n" + functions.str(); -} - -// -// NodesBFSIterator -// - -GraphTraits::NodesBFSIterator::NodesBFSIterator( - const std::vector &source) - : queue_(source.begin(), source.end()) {} - -GraphTraits::NodesBFSIterator::NodesBFSIterator( - GraphTraits::NodesBFSIterator &&other) noexcept - : queue_(std::move(other.queue_)), - visited_(std::move(other.visited_)) {} - -GraphTraits::NodesBFSIterator::NodesBFSIterator( - const GraphTraits::NodesBFSIterator &other) - : queue_(other.queue_), visited_(other.visited_) {} - -Node 
&GraphTraits::NodesBFSIterator::operator*() { - PADDLE_ENFORCE(!queue_.empty()); - return *queue_.front(); -} - -Node *GraphTraits::NodesBFSIterator::operator->() { - PADDLE_ENFORCE(!queue_.empty()); - return queue_.front(); -} - -GraphTraits::NodesBFSIterator & -GraphTraits::NodesBFSIterator::operator=( - const GraphTraits::NodesBFSIterator &other) { - queue_ = other.queue_; - visited_ = other.visited_; - return *this; -} - -GraphTraits::NodesBFSIterator - &GraphTraits::NodesBFSIterator::operator++() { - PADDLE_ENFORCE(!queue_.empty()); - auto *cur = queue_.front(); - visited_.insert(cur); - queue_.pop_front(); - for (auto *output : cur->outlinks) { - if (!visited_.count(output)) { - queue_.push_back(output); - visited_.insert(output); - } - } - return *this; -} - -bool GraphTraits::NodesBFSIterator::operator==( - const GraphTraits::NodesBFSIterator &other) { - if (queue_.empty()) return other.queue_.empty(); - if ((!queue_.empty()) && (!other.queue_.empty())) { - return queue_.front() == other.queue_.front() && - visited_.size() == other.visited_.size(); - // equality of queue and - // visited. Just a light but week implementation. - } - return false; -} - -// -// NodesDFSIterator -// -GraphTraits::NodesDFSIterator::NodesDFSIterator( - const std::vector &source) { - for (auto *x : source) stack_.push(x); -} - -GraphTraits::NodesDFSIterator::NodesDFSIterator( - GraphTraits::NodesDFSIterator &&other) noexcept - : stack_(std::move(other.stack_)), - visited_(std::move(other.visited_)) {} - -GraphTraits::NodesDFSIterator::NodesDFSIterator( - const GraphTraits::NodesDFSIterator &other) - : stack_(other.stack_), visited_(other.visited_) {} - -Node &GraphTraits::NodesDFSIterator::operator*() { - PADDLE_ENFORCE(!stack_.empty()); - return *stack_.top(); -} - -GraphTraits::NodesDFSIterator - &GraphTraits::NodesDFSIterator::operator++() { - if (stack_.empty()) return *this; - visited_.insert(stack_.top()); - auto *cur = stack_.top(); - stack_.pop(); - for (auto *x : cur->outlinks) { - if (!visited_.count(x)) { - stack_.push(x); - visited_.insert(x); - } - } - return *this; -} -bool GraphTraits::NodesDFSIterator::operator==( - const GraphTraits::NodesDFSIterator &other) { - if (stack_.empty()) return other.stack_.empty(); - if ((!stack_.empty()) && (!other.stack_.empty())) { - return stack_.top() == other.stack_.top(); - } - return false; -} - -GraphTraits::NodesDFSIterator & -GraphTraits::NodesDFSIterator::operator=( - const GraphTraits::NodesDFSIterator &other) { - stack_ = other.stack_; - visited_ = other.visited_; - return *this; -} -Node *GraphTraits::NodesDFSIterator::operator->() { - return stack_.top(); -} - -inline bool CheckNodeIndegreeEquals(const Node &node, size_t n) { - return node.inlinks.size() == n; -} - -GraphTraits::NodesTSIterator::NodesTSIterator( - const std::vector &source) { - PADDLE_ENFORCE(!source.empty(), - "Start points of topological sorting should not be empty!"); - // CHECK all the inputs' in-degree is 0 - for (auto *node : source) { - PADDLE_ENFORCE(CheckNodeIndegreeEquals(*node, 0)); - } - - std::unordered_set visited; - std::unordered_set to_visit{source.begin(), source.end()}; - - std::vector inlink_visited; - while (!to_visit.empty()) { - std::vector queue(to_visit.begin(), to_visit.end()); - for (auto *p : queue) { - if (p->deleted()) { - visited.insert(p); - to_visit.erase(p); - continue; - } - inlink_visited.clear(); - - std::copy_if(p->inlinks.begin(), p->inlinks.end(), - std::back_inserter(inlink_visited), - [&](Node *x) { return visited.count(x); }); - - if 
(inlink_visited.size() == p->inlinks.size()) { - sorted_.push_back(p); - for (auto *_ : p->outlinks) { - if (!visited.count(_)) { - to_visit.insert(_); - } - } - - to_visit.erase(p); - visited.insert(p); - } - } - } -} - -GraphTraits::NodesTSIterator::NodesTSIterator( - const paddle::inference::analysis::GraphTraits< - DataFlowGraph>::NodesTSIterator &other) - : sorted_(other.sorted_), cursor_(other.cursor_) {} - -Node &GraphTraits::NodesTSIterator::operator*() { - PADDLE_ENFORCE_LT(cursor_, sorted_.size()); - return *sorted_[cursor_]; -} - -paddle::inference::analysis::GraphTraits::NodesTSIterator - &GraphTraits::NodesTSIterator::operator++() { - if (++cursor_ >= sorted_.size()) { - sorted_.clear(); - cursor_ = 0; - } - return *this; -} -paddle::inference::analysis::GraphTraits::NodesTSIterator & -GraphTraits::NodesTSIterator::operator=( - const paddle::inference::analysis::GraphTraits< - DataFlowGraph>::NodesTSIterator &other) { - cursor_ = other.cursor_; - sorted_ = other.sorted_; - return *this; -} - -bool GraphTraits::NodesTSIterator::operator==( - const paddle::inference::analysis::GraphTraits< - DataFlowGraph>::NodesTSIterator &other) { - return sorted_ == other.sorted_ && cursor_ == other.cursor_; -} - -Node *GraphTraits::NodesTSIterator::operator->() { - PADDLE_ENFORCE_LT(cursor_, sorted_.size()); - return sorted_[cursor_]; -} - -std::pair, std::vector> -ExtractInputAndOutputOfSubGraph(std::vector &graph) { // NOLINT - std::unordered_set nodes(graph.begin(), graph.end()); - std::unordered_set inputs; - std::unordered_set outputs; - // Input a Value, check whether its inlink is in the subgraph. - auto inlink_in_subgraph = [&](Node *n) { - for (auto *in : n->inlinks) { - if (nodes.count(in)) return true; - } - return false; - }; - - for (auto &node : graph) { - for (auto *in : node->inlinks) { - // The Value that is written by nodes inside a sub-graph shouldn't be the - // input of the sub-graph. - if (!nodes.count(in) && in->type() == Node::Type::kValue && - !inlink_in_subgraph(in)) { - inputs.insert(in); - } - } - for (auto *out : node->outlinks) { - if (!nodes.count(out) && out->type() == Node::Type::kValue) { - outputs.insert(out); - } - } - } - return std::make_pair(std::vector(inputs.begin(), inputs.end()), - std::vector(outputs.begin(), outputs.end())); -} - -// Filter the Intermediate results of the subgraph node. -void FilterRedundantOutputOfSubGraph(DataFlowGraph *graph) { - std::vector op_nodes; - for (auto &node : GraphTraits(*graph).nodes_in_TS()) { - if (node.type() == Node::Type::kValue || node.deleted()) { - continue; - } - op_nodes.push_back(&node); - } - size_t op_num = op_nodes.size(); - for (size_t i = 0; i < op_num; i++) { - if (op_nodes[i]->type() == Node::Type::kFunction) continue; - std::unordered_set follow_up_input_names; - for (size_t j = i + 1; j < op_num; j++) { - for (auto *in : op_nodes[j]->inlinks) { - follow_up_input_names.insert(in->name()); - } - } - std::vector filtered_subgraph_outlinks; - for (auto *out : op_nodes[i]->outlinks) { - if (follow_up_input_names.count(out->name())) { - filtered_subgraph_outlinks.push_back(out); - } else { - out->SetDeleted(); - } - } - // The filtered_subgraph_outlinks may be empty. 
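// i.e. when none of a fused block's outputs are consumed by any later op,
// every outlink is marked deleted and the assignment below leaves the block
// with an empty outlink list.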
- op_nodes[i]->outlinks = filtered_subgraph_outlinks; - } -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/data_flow_graph.h b/paddle/fluid/inference/analysis/data_flow_graph.h deleted file mode 100644 index 437e097acd24aad384df6712ce0de6106b3b5c65..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph.h +++ /dev/null @@ -1,209 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -/* - * Data flow graph is an pass that build the basic graph. It contains a graph - * and the iterators that enable the iteration over the graph. - */ - -#pragma once - -#include -#include -#include -#include -#include -#include - -#include "paddle/fluid/framework/ir/graph.h" -#include "paddle/fluid/inference/analysis/graph_traits.h" -#include "paddle/fluid/inference/analysis/node.h" -#include "paddle/fluid/platform/enforce.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * DataFlowGraph - A container of Value and Function Nodes. - * - * This is the base graph for any other type of graphs, such as SSA or CFG. - */ -struct DataFlowGraph { - NodeMap nodes; - // inputs and outputs are deduced from the graph. - // Used to interact with IR. - const framework::ir::Graph *ir_graph{nullptr}; - - // Extract inputs and outputs of the graph. - void Build(); - - void Build(const framework::proto::ProgramDesc &prog); - - // Build a graph from ir::Graph. - void Build(const framework::ir::Graph &graph); - - // Get an attribute. - AnyAttr &Attr(const std::string &key) { return attrs_[key]; } - - // Output a DOT graph file for debug. - std::string DotString() const; - - std::string HumanReadableInfo(bool show_values = true, - bool show_functions = true) const; - - const std::vector &inputs() const { - PADDLE_ENFORCE(!inputs_.empty(), - "No inputs are deduced, need to Build() first."); - return inputs_; - } - const std::vector &outputs() const { - PADDLE_ENFORCE(!outputs_.empty(), - "No outputs are deduced, need to Build() first."); - return outputs_; - } - - private: - mutable std::vector inputs_; - mutable std::vector outputs_; - std::unordered_map attrs_; - - // Remove duplicate edges and so on. - void Clean(); -}; - -/* - * An graph trait help to traverse the graph using BFS. - * The BFS start from a graph's inputs, the graph should be fully-connected, so - * that the iterator can reach the end. - */ -template <> -struct GraphTraits { - // BFS iterator on nodes. - struct NodesBFSIterator - : public std::iterator { - NodesBFSIterator() = default; - explicit NodesBFSIterator(const std::vector &source); - NodesBFSIterator(NodesBFSIterator &&other) noexcept; - // NOTE Heavy to use. 
- NodesBFSIterator(const NodesBFSIterator &other); - - Node &operator*(); - NodesBFSIterator &operator++(); - Node *operator->(); - // TODO(Superjomn) current implementation just compare the first - // element, need to compare the graph and all the elements in the queue and - // set. - NodesBFSIterator &operator=(const NodesBFSIterator &other); - bool operator==(const NodesBFSIterator &other); - bool operator!=(const NodesBFSIterator &other) { return !(*this == other); } - - private: - std::deque queue_; - std::unordered_set visited_; - }; - - // DFS iterator on nodes. - struct NodesDFSIterator - : public std::iterator { - NodesDFSIterator() = default; - NodesDFSIterator(const std::vector &source); - NodesDFSIterator(NodesDFSIterator &&other) noexcept; - NodesDFSIterator(const NodesDFSIterator &other); - - Node &operator*(); - NodesDFSIterator &operator++(); - // TODO(Superjomn) current implementation just compare the first - // element, need to compare the graph and all the elements in the queue and - // set. - NodesDFSIterator &operator=(const NodesDFSIterator &other); - bool operator==(const NodesDFSIterator &other); - bool operator!=(const NodesDFSIterator &other) { return !(*this == other); } - Node *operator->(); - - private: - std::stack stack_; - std::unordered_set visited_; - }; - - // Topological sorting iterator on nodes. - struct NodesTSIterator - : public std::iterator { - NodesTSIterator() = default; - NodesTSIterator(const std::vector &source); - NodesTSIterator(NodesTSIterator &&other) - : sorted_(std::move(other.sorted_)), cursor_(other.cursor_) { - other.cursor_ = 0; - } - NodesTSIterator(const NodesTSIterator &other); - - Node &operator*(); - NodesTSIterator &operator++(); - // TODO(Superjomn) current implementation just compare the first - // element, need to compare the graph and all the elements in the queue and - // set. - NodesTSIterator &operator=(const NodesTSIterator &other); - bool operator==(const NodesTSIterator &other); - bool operator!=(const NodesTSIterator &other) { return !(*this == other); } - Node *operator->(); - - private: - std::vector sorted_; - size_t cursor_{0}; - }; - - explicit GraphTraits(const DataFlowGraph &graph) : graph_(graph) {} - - // default use BFS to visit the nodes. - iterator_range nodes() { - return iterator_range(nodes_bfs_begin(), nodes_bfs_end()); - } - iterator_range nodes_in_BFS() { - return iterator_range(nodes_bfs_begin(), nodes_bfs_end()); - } - iterator_range nodes_in_DFS() { - return iterator_range(nodes_dfs_begin(), nodes_dfs_end()); - } - iterator_range nodes_in_TS() { - return iterator_range(nodes_ts_begin(), nodes_ts_end()); - } - - private: - NodesBFSIterator nodes_bfs_begin() { - return NodesBFSIterator(graph_.inputs()); - } - NodesBFSIterator nodes_bfs_end() { return NodesBFSIterator(); } - - NodesDFSIterator nodes_dfs_begin() { - return NodesDFSIterator(graph_.inputs()); - } - NodesDFSIterator nodes_dfs_end() { return NodesDFSIterator(); } - - NodesTSIterator nodes_ts_begin() { return NodesTSIterator(graph_.inputs()); } - NodesTSIterator nodes_ts_end() { return NodesTSIterator(); } - - private: - const DataFlowGraph &graph_; -}; - -// Extract the inputs and outputs of a graph. The inputs and outputs of a -// sub-graph is the inputs nodes and output nodes that doesn't inside the -// sub-graph. 
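// Concretely: for a subgraph {op1, op2} inside x -> op1 -> y -> op2 -> z,
// x is the only extracted input (its producer lies outside the subgraph),
// while both y and z land in the output set, since every value written by a
// subgraph op and not in the node set is collected; intermediates such as y
// are intended to be pruned later by FilterRedundantOutputOfSubGraph.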
-std::pair, std::vector> -ExtractInputAndOutputOfSubGraph(std::vector &graph); // NOLINT - -void FilterRedundantOutputOfSubGraph(DataFlowGraph *graph); -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/data_flow_graph_tester.cc b/paddle/fluid/inference/analysis/data_flow_graph_tester.cc deleted file mode 100644 index 1682011c3d8cc9927a4b026b370671798cace625..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph_tester.cc +++ /dev/null @@ -1,165 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/data_flow_graph.h" -#include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(DataFlowGraph, BFS) { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - auto dfg = ProgramDescToDFG(desc); - dfg.Build(); - - for (auto* in : dfg.inputs()) { - LOG(INFO) << "inputs: " << in->name() << " " - << static_cast(in->type()); - } - for (auto* out : dfg.outputs()) { - LOG(INFO) << "outputs: " << out->name() << " " - << static_cast(out->type()); - } - - size_t count = 0; - for (auto& node : GraphTraits(dfg).nodes()) { - LOG(INFO) << "visiting " << node.name(); - ++count; - } - ASSERT_EQ(count, dfg.nodes.size()); -} - -TEST(DataFlowGraph, DFS) { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - DataFlowGraph dfg; - dfg.Build(desc); - size_t count = 0; - for (auto& node : GraphTraits(dfg).nodes_in_DFS()) { - LOG(INFO) << "visiting " << node.name(); - ++count; - } - ASSERT_EQ(count, dfg.nodes.size()); -} - -// Topological sorting. -/* - * Graph topology - * inputs: 0, 1, 2 - * 0 -> 4 - * 0 -> 5 - * 1 -> 6 - * 2 -> 7 - * 4 -> 5 - * 4 -> 7 - * 4 -> 3 - * 7 -> 3 - */ -TEST(DataFlowGraph, TS) { - DataFlowGraph graph; - - for (int i = 0; i < 8; i++) { - auto* node = graph.nodes.Create(Node::Type::kValue); - node->SetName("node-" + std::to_string(i)); - } - - auto add_link = [&](int i, int j) { - Node* source = graph.nodes.GetMutable(i); - Node* target = graph.nodes.GetMutable(j); - target->inlinks.push_back(source); - source->outlinks.push_back(target); - }; - - add_link(0, 4); - add_link(0, 5); - add_link(1, 6); - add_link(2, 7); - add_link(4, 5); - add_link(4, 7); - add_link(4, 3); - add_link(7, 3); - graph.Build(); - - auto its = GraphTraits(graph).nodes_in_TS(); - std::vector sorted_ids; - for (auto it = its.begin(); it != its.end(); ++it) { - LOG(INFO) << it->name(); - sorted_ids.push_back(it->id()); - } - - // Assert a occurs prior to b in the sorted_ids. 
- auto assert_positive_sequence_pair = [&](int a, int b) { - auto a_offset = std::find(sorted_ids.begin(), sorted_ids.end(), a); - auto b_offset = std::find(sorted_ids.begin(), sorted_ids.end(), b); - ASSERT_LT(a_offset, b_offset); - }; - - assert_positive_sequence_pair(2, 7); - assert_positive_sequence_pair(7, 3); - assert_positive_sequence_pair(4, 3); - assert_positive_sequence_pair(0, 4); - assert_positive_sequence_pair(0, 5); - assert_positive_sequence_pair(1, 6); - assert_positive_sequence_pair(4, 5); - assert_positive_sequence_pair(4, 7); -} - -TEST(DataFlowGraph, Build_ProgramDesc) { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - DataFlowGraph graph; - graph.Build(desc); - ASSERT_EQ(graph.nodes.size(), 38UL); -} - -void SetOp(framework::ProgramDesc* prog, const std::string& type, - const std::vector& inputs, - const std::vector& outputs) { - auto* op = prog->MutableBlock(0)->AppendOp(); - op->SetType(type); - op->SetInput("Xs", inputs); - op->SetOutput("Xs", outputs); -} - -TEST(DataFlowGraph, Build_IR_Graph) { - framework::ProgramDesc prog; - for (auto& v : std::vector({"a", "b", "c", "d", "e", "f"})) { - auto* var = prog.MutableBlock(0)->Var(v); - var->SetType(framework::proto::VarType::SELECTED_ROWS); - if (v == "c") { - var->SetPersistable(true); - } - } - - SetOp(&prog, "OP0", std::vector({"a"}), - std::vector({"b"})); - SetOp(&prog, "OP1", std::vector({"a"}), - std::vector({"c"})); - SetOp(&prog, "mul", std::vector({"b", "c"}), - std::vector({"d"})); - SetOp(&prog, "elementwise_add", std::vector({"d", "e"}), - std::vector({"f"})); - - DataFlowGraph graph; - - framework::ir::Graph ir_graph(prog); - - graph.Build(ir_graph); - - ASSERT_EQ(graph.nodes.size(), ir_graph.Nodes().size()); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.cc b/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.cc deleted file mode 100644 index cb549f4b50cf56154a951d16b58b022dbad3e990..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.cc +++ /dev/null @@ -1,284 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. 
- -#include "paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h" -#include -#include "paddle/fluid/framework/block_desc.h" -#include "paddle/fluid/framework/ir/fuse_pass_base.h" -#include "paddle/fluid/framework/op_desc.h" -#include "paddle/fluid/framework/proto_desc.h" -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/io.h" - -namespace paddle { -namespace inference { - -namespace analysis { - -using framework::proto::ProgramDesc; - -std::vector ExtractParameters( - const std::vector> &nodes); - -bool DataFlowGraphToFluidPass::Initialize(Argument *argument) { - ANALYSIS_ARGUMENT_CHECK_FIELD(argument) - ANALYSIS_ARGUMENT_CHECK_FIELD(argument->origin_program_desc) - // The transformed_program_desc should inherit all the VarDesc and BlockDesc - // from the original program desc. The operators of the main block(the first - // block) should rewritten by data flow graph. - argument->transformed_program_desc.reset( - new ProgramDesc(*argument->origin_program_desc)); - argument->transformed_program_desc->mutable_blocks(framework::kRootBlockIndex) - ->clear_ops(); - desc_ = argument->transformed_program_desc.get(); - argument_ = argument; - return true; -} - -bool DataFlowGraphToFluidPass::Finalize() { return true; } - -void DataFlowGraphToFluidPass::Run(DataFlowGraph *graph) { - // FilterRedundantOutputOfSubGraph(graph); - for (auto &node : GraphTraits(*graph).nodes_in_TS()) { - if (node.deleted()) continue; - - switch (node.type()) { - case Node::Type::kFunction: { - AddFluidOp(&node); - } break; - case Node::Type::kFunctionBlock: { - AddEngineOp(&node); - } break; - default: - continue; - } - } - - if (argument_->Has(framework::ir::kParamScopeAttr)) { - LOG(WARNING) << "parameter changes in the scope takes effect"; - } - - PADDLE_ENFORCE(argument_->transformed_program_desc.get()); -} - -void DataFlowGraphToFluidPass::AddFluidOp(Node *node) { - PADDLE_ENFORCE(node); - PADDLE_ENFORCE(node->IsFunction()); - PADDLE_ENFORCE(node->pb_desc() || !node->pb_msg().empty(), - "node has invalid protobuf repr."); - - // currently only the main block is analyzed. - PADDLE_ENFORCE(desc_); - auto *main_block = desc_->mutable_blocks(framework::kRootBlockIndex); - auto *op = main_block->add_ops(); - - if (node->pb_desc()) { - auto *ori_op = static_cast(node->pb_desc()); - *op = - *ori_op; // copy the attributes, by default, these will not be changed - // by analysis phrase. - // The inputs and outputs of the existing ops are not changed by tensorrt - // subgraph pass. - // NOTE It might be changed by other passes in the long run. 
- } else { - op->ParseFromString(node->pb_msg()); - } -} - -void CreateTrtEngineOp(Node *node, Argument *argument, - framework::proto::BlockDesc *block) { - PADDLE_ENFORCE(argument->main_dfg.get()); - const DataFlowGraph &graph = *(argument->main_dfg); - static int counter{0}; - PADDLE_ENFORCE(node->IsFunctionBlock()); - framework::OpDesc desc; - auto *func = static_cast(node); - - // collect inputs - std::unordered_set input_names; - std::unordered_set input_names_with_id; - for (auto *x : func->inlinks) { - input_names.insert(x->name()); - input_names_with_id.insert(x->name() + std::to_string(x->id())); - } - desc.SetInput( - "Xs", std::vector(input_names.begin(), input_names.end())); - - std::unordered_set output_names; - std::unordered_set output_names_with_id; - for (auto *x : func->outlinks) { - output_names.insert(x->name()); - output_names_with_id.insert(x->name() + std::to_string(x->id())); - } - - desc.SetOutput( - "Ys", std::vector(output_names.begin(), output_names.end())); - desc.SetType("tensorrt_engine"); - - std::unordered_map output_name_map; - - // The following procedure is used to rename all the intermediate - // variables and the output variables of the subgraph. - // Why we do this? - // During the transition from fluid OP to tensorrt OP, we map - // the input and output Tensor(fluid data structure) of fluid OP - // to the correspondin ITensor (trt data structure) through the - // Tensor name. When we set up ITensor for an variable, we must - // ensure that it has not been set before. - // If there is variable in the fluid graph, which is not only the - // input of a OP, but also the output of a Op, there will be problems. - // So we have to rename the variable in the subgraph to make sure - // it is either an OP's input or an OP's output. 
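// A concrete (hypothetical) example of the renaming scheme described above:
// if value `b` has node id 7 and is both produced and consumed inside the
// subgraph, every reference to it becomes "b7", so the ITensor for that name
// is set up exactly once. Subgraph inputs keep their original names, and for
// subgraph outputs the old -> new mapping (e.g. "z" -> "z9") is recorded in
// output_name_map so the engine can copy results back to the original
// variables when the op finishes.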
- - auto subgraph_nodes = func->subgraph; - for (int index = 0; index < block->ops_size(); index++) { - framework::proto::OpDesc *op = block->mutable_ops(index); - auto correspond_node = subgraph_nodes[index]; - PADDLE_ENFORCE_EQ(correspond_node->name(), op->type()); - - std::unordered_map var2id; - for (auto *in_var : correspond_node->inlinks) { - var2id[in_var->name()] = in_var->id(); - } - // rename for the input variables of op inside subgraph - for (int i = 0; i < op->inputs_size(); i++) { - framework::proto::OpDesc_Var *in_var = op->mutable_inputs(i); - std::vector replaced_names; - for (int k = 0; k < in_var->arguments_size(); k++) { - std::string arg_value = in_var->arguments(k); - std::string arg_value_with_id = - arg_value + std::to_string(var2id[arg_value]); - if (input_names_with_id.count(arg_value_with_id)) { - replaced_names.push_back(arg_value); - } else { - replaced_names.push_back(arg_value_with_id); - } - } - in_var->clear_arguments(); - for (size_t k = 0; k < replaced_names.size(); k++) { - in_var->add_arguments(replaced_names[k]); - } - } - var2id.clear(); - for (auto out_var : correspond_node->outlinks) { - var2id[out_var->name()] = out_var->id(); - } - - // rename for the output variables of op inside subgraph - for (int i = 0; i < op->outputs_size(); i++) { - framework::proto::OpDesc_Var *out_var = op->mutable_outputs(i); - std::vector replaced_names; - for (int k = 0; k < out_var->arguments_size(); k++) { - std::string arg_value = out_var->arguments(k); - std::string arg_value_with_id = - arg_value + std::to_string(var2id[arg_value]); - if (output_names_with_id.count(arg_value_with_id)) { - output_name_map[arg_value] = arg_value_with_id; - } - replaced_names.push_back(arg_value_with_id); - } - out_var->clear_arguments(); - for (size_t k = 0; k < replaced_names.size(); k++) { - out_var->add_arguments(replaced_names[k]); - } - } - } - // When tensorrt engine runs at the end of the operation, - // output_mapping help us copy the data from the renamed ITensor - // to Tensor. - std::vector output_mapping; - for (auto name : output_names) { - PADDLE_ENFORCE(output_name_map.count(name) != 0); - output_mapping.push_back(output_name_map[name]); - } - - PADDLE_ENFORCE(!block->vars().empty(), "the block has no var-desc"); - // Set attrs - - SetAttr(desc.Proto(), "subgraph", block->SerializeAsString()); - SetAttr(desc.Proto(), "max_batch_size", argument->Get("max_batch_size")); - SetAttr(desc.Proto(), "workspace_size", argument->Get("workspace_size")); - SetAttr(desc.Proto(), "engine_uniq_key", "trt-" + std::to_string(counter++)); - SetAttr(desc.Proto(), "parameters", ExtractParameters(graph.nodes.nodes())); - SetAttr(desc.Proto(), "output_name_mapping", output_mapping); - node->SetPbMsg(desc.Proto()->SerializeAsString()); -} - -std::vector ExtractParameters( - const std::vector> &nodes) { - std::vector parameters; - for (const auto &node : nodes) { - if (!node->IsValue()) continue; - PADDLE_ENFORCE(!node->pb_msg().empty(), "pb_msg should be set first"); - framework::proto::VarDesc var; - var.ParseFromString(node->pb_msg()); - if (var.persistable()) { - parameters.push_back(var.name()); - } - } - return parameters; -} - -void DataFlowGraphToFluidPass::AddEngineOp(Node *node) { - // TODO(Superjomn) Here need to expose some arguments for default setting. 
- PADDLE_ENFORCE(node->IsFunctionBlock()); - auto *block_node = static_cast(node); - framework::proto::BlockDesc proto; - framework::BlockDesc block_desc(nullptr, &proto); - block_desc.Proto()->set_parent_idx(-1); - block_desc.Proto()->set_idx(0); - VLOG(4) << "origin variable size: " - << argument_->origin_program_desc->blocks(0).vars().size(); - VLOG(4) << "transformed variable size: " << block_desc.Proto()->vars().size(); - // copy ops. - - for (auto *node : block_node->subgraph) { - auto *op = block_desc.AppendOp(); - PADDLE_ENFORCE(!node->pb_msg().empty()); - op->Proto()->ParseFromString(node->pb_msg()); - } - - *block_desc.Proto()->mutable_vars() = - argument_->origin_program_desc->blocks(0).vars(); - PADDLE_ENFORCE(!block_desc.Proto()->vars().empty()); - CreateTrtEngineOp(node, argument_, block_desc.Proto()); - auto *main_block = desc_->mutable_blocks(framework::kRootBlockIndex); - auto *op = main_block->add_ops(); - PADDLE_ENFORCE(!node->pb_msg().empty(), "failed to set desc for block"); - op->ParseFromString(node->pb_msg()); -} - -namespace { -class DFG_DebuggerPass : public DFG_GraphvizDrawPass { - public: - using Config = DFG_GraphvizDrawPass::Config; - explicit DFG_DebuggerPass(const Config &config) - : DFG_GraphvizDrawPass(config) {} - - std::string repr() const override { return "dfg-to-fluid-debuger-pass"; } - - bool Finalize() override { return true; } -}; -} // namespace - -AnalysisPass *DataFlowGraphToFluidPass::CreateGraphvizDebugerPass() const { - return new DFG_DebuggerPass(DFG_GraphvizDrawPass::Config( - FLAGS_IA_graphviz_log_root, - "data_flow_graph_to_fluid_graphviz_debugger")); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h b/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h deleted file mode 100644 index 891c7226e245fa3b92892785362c186185a61f62..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h +++ /dev/null @@ -1,59 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -/* - * This file implements the transformation from fluid ProgramDesc to data flow - * graph. 
- */ - -#pragma once - -#include -#include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" - -namespace paddle { -namespace inference { - -namespace analysis { -class DataFlowGraphToFluidPass final : public DataFlowGraphPass { - public: - DataFlowGraphToFluidPass() = default; - - bool Initialize(Argument *argument) override; - bool Finalize() override; - - void Run(DataFlowGraph *graph) override; - - std::string repr() const override { return "DFG to fluid"; } - std::string description() const override { - return "Transform a DFG to a Fluid ProgramDesc"; - } - - AnalysisPass *CreateGraphvizDebugerPass() const override; - - protected: - // Add a Fluid Op into the ProgramDesc. - void AddFluidOp(Node *node); - // Add a EngineOp into the ProgramDesc. - void AddEngineOp(Node *node); - - private: - framework::proto::ProgramDesc *desc_; - Argument *argument_; -}; -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass_tester.cc b/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass_tester.cc deleted file mode 100644 index 4ef381db295b986b91173a728b6d98640f6f4f51..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass_tester.cc +++ /dev/null @@ -1,48 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h" - -#include -#include -#include -#include "paddle/fluid/framework/executor.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" -#include "paddle/fluid/inference/io.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(DataFlowGraph, Test) { - Argument argument(FLAGS_inference_model_dir); - - FluidToDataFlowGraphPass pass0; - DataFlowGraphToFluidPass pass1; - ASSERT_TRUE(pass0.Initialize(&argument)); - ASSERT_TRUE(pass1.Initialize(&argument)); - - pass0.Run(argument.main_dfg.get()); - pass1.Run(argument.main_dfg.get()); - - pass0.Finalize(); - pass1.Finalize(); - - LOG(INFO) << argument.main_dfg->nodes.size(); -} - -}; // namespace analysis -}; // namespace inference -}; // namespace paddle diff --git a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.cc b/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.cc deleted file mode 100644 index 648b8f7d6a6ec4bafbad2838c5631e776c8699b1..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.cc +++ /dev/null @@ -1,59 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. 
-You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -int DFG_GraphvizDrawPass::counter_{0}; - -void DFG_GraphvizDrawPass::Run(DataFlowGraph *graph) { - auto content = Draw(graph); - auto dot_path = GenDotPath(); - std::ofstream file(dot_path); - file.write(content.c_str(), content.size()); - file.close(); - - auto png_path = dot_path.substr(0, dot_path.size() - 4) + ".png"; - std::string message; - VLOG(3) << "draw to " << png_path; - ExecShellCommand("dot -Tpng " + dot_path + " -o " + png_path, &message); -} - -std::string DFG_GraphvizDrawPass::Draw(DataFlowGraph *graph) { - Dot dot; - // Add nodes - for (size_t i = 0; i < graph->nodes.size(); i++) { - const Node &node = graph->nodes.Get(i); - if (config_.display_deleted_node || !node.deleted()) { - dot.AddNode(node.repr(), node.dot_attrs()); - } - } - // Add edges - for (size_t i = 0; i < graph->nodes.size(); i++) { - const Node &node = graph->nodes.Get(i); - if (!config_.display_deleted_node && node.deleted()) continue; - for (auto &out : node.outlinks) { - if (!config_.display_deleted_node && out->deleted()) continue; - dot.AddEdge(node.repr(), out->repr(), {}); - } - } - return dot.Build(); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h b/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h deleted file mode 100644 index e537bfc0e64d4ff46b3d61499a1a0298ed83533f..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h +++ /dev/null @@ -1,78 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -/* - * This file create an DFG_GraphvizDrawPass which helps to draw a data flow - * graph's structure using graphviz. - */ - -#pragma once - -#include -#include -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/dot.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * Output a dot file and write to some place. - */ -class DFG_GraphvizDrawPass : public DataFlowGraphPass { - public: - struct Config { - Config(const std::string &dir, const std::string &id, - bool display_deleted_node = false) - : dir(dir), id(id), display_deleted_node(display_deleted_node) {} - - // The directory to store the .dot or .png files. - const std::string dir; - // The identifier for this dot file. - const std::string id; - // Whether to display deleted nodes, default false. 
- const bool display_deleted_node; - }; - - explicit DFG_GraphvizDrawPass(const Config &config) : config_(config) {} - - bool Initialize(Argument *argument) override { return true; } - void Run(DataFlowGraph *graph) override; - bool Finalize() override { return true; } - - std::string repr() const override { return "DFG graphviz drawer"; } - std::string description() const override { - return "Debug a DFG by draw with graphviz"; - } - - protected: - // A counter to add a number prefix to the debugger image output so that they - // will sort in the triggered order. - static int counter_; - - // Path of the dot file to output. - std::string GenDotPath() const { - return config_.dir + "/" + std::to_string(counter_++) + "-graph_" + - config_.id + ".dot"; - } - - virtual std::string Draw(DataFlowGraph *graph); - - Config config_; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass_tester.cc b/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass_tester.cc deleted file mode 100644 index 928be7917047382d9b86294f6039b26b0ebf6f49..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/dfg_graphviz_draw_pass_tester.cc +++ /dev/null @@ -1,54 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" - -#include -#include -#include -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(DFG_GraphvizDrawPass, dfg_graphviz_draw_pass_tester) { - Argument argument(FLAGS_inference_model_dir); - FluidToDataFlowGraphPass pass0; - ASSERT_TRUE(pass0.Initialize(&argument)); - pass0.Run(argument.main_dfg.get()); - - // auto dfg = ProgramDescToDFG(*argument.origin_program_desc); - - DFG_GraphvizDrawPass::Config config("./", "test"); - DFG_GraphvizDrawPass pass(config); - pass.Initialize(&argument); - pass.Run(argument.main_dfg.get()); - - // test content - std::ifstream file("./0-graph_test.dot"); - ASSERT_TRUE(file.is_open()); - - std::string line; - int no{0}; - while (std::getline(file, line)) { - no++; - } - // DFG is sensitive to ProgramDesc, be careful to change the existing models. 
- ASSERT_EQ(no, 83); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/dot_tester.cc b/paddle/fluid/inference/analysis/dot_tester.cc index 56ceb9bd5d6f41a601d66f6124fb7b4099c9337e..c785a312bf96c3586ea990fd9028cfd3b930d577 100644 --- a/paddle/fluid/inference/analysis/dot_tester.cc +++ b/paddle/fluid/inference/analysis/dot_tester.cc @@ -16,7 +16,6 @@ #include #include -#include "paddle/fluid/inference/analysis/data_flow_graph.h" namespace paddle { namespace inference { diff --git a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.cc b/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.cc deleted file mode 100644 index 2b7d632c839e735ca03c6e17b94307b40cc13374..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.cc +++ /dev/null @@ -1,76 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include -#include -#include - -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -bool FluidToDataFlowGraphPass::Initialize(Argument *argument) { - ANALYSIS_ARGUMENT_CHECK_FIELD(argument); - if (argument->origin_program_desc) { - LOG(WARNING) << "argument's origin_program_desc is already set, might " - "duplicate called"; - } - if (!argument->fluid_model_program_path) { - ANALYSIS_ARGUMENT_CHECK_FIELD(argument->fluid_model_dir); - argument->fluid_model_program_path.reset( - new std::string(*argument->fluid_model_dir + "/__model__")); - } - ANALYSIS_ARGUMENT_CHECK_FIELD(argument->fluid_model_program_path); - auto program = LoadProgramDesc(*argument->fluid_model_program_path); - argument->origin_program_desc.reset( - new framework::proto::ProgramDesc(program)); - - if (!argument->main_dfg) { - argument->main_dfg.reset(new DataFlowGraph); - } - desc_ = argument->origin_program_desc.get(); - return true; -} - -bool FluidToDataFlowGraphPass::Finalize() { return true; } - -void FluidToDataFlowGraphPass::Run(DataFlowGraph *graph) { - PADDLE_ENFORCE(graph); - PADDLE_ENFORCE(desc_); - graph->Build(*desc_); -} - -namespace { -class DFG_DebuggerPass : public DFG_GraphvizDrawPass { - public: - using Config = DFG_GraphvizDrawPass::Config; - explicit DFG_DebuggerPass(const Config &config) - : DFG_GraphvizDrawPass(config) {} - std::string repr() const override { return "fluid-to-dfg-debuger-pass"; } - bool Finalize() override { return true; } -}; -} - -AnalysisPass *FluidToDataFlowGraphPass::CreateGraphvizDebugerPass() const { - return new DFG_DebuggerPass(DFG_GraphvizDrawPass::Config( - FLAGS_IA_graphviz_log_root, "fluid-to-dfg-debuger")); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h 
b/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h deleted file mode 100644 index b9e262020e9522e167b998d57e2be2ac19b48447..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h +++ /dev/null @@ -1,57 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -/* - * This file implements the transformation from data flow graph to fluid - * ProgramDesc. - */ - -#pragma once - -#include - -#include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * Transform a FluidDesc to a SSA. - */ -class FluidToDataFlowGraphPass final : public DataFlowGraphPass { - public: - FluidToDataFlowGraphPass() = default; - - bool Initialize(Argument *argument) override; - bool Finalize() override; - - void Run(DataFlowGraph *graph) override; - - std::string repr() const override { return "fluid-to-data-flow-graph"; } - std::string description() const override { - return "transform a fluid ProgramDesc to a data flow graph."; - } - - AnalysisPass *CreateGraphvizDebugerPass() const override; - - private: - framework::proto::ProgramDesc const *desc_; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass_tester.cc b/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass_tester.cc deleted file mode 100644 index 267a0a84ebf75615e0b390f4a1b3bf3b51793fc7..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass_tester.cc +++ /dev/null @@ -1,38 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" - -#include -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(FluidToDataFlowGraphPass, Test) { - FluidToDataFlowGraphPass pass; - Argument argument(FLAGS_inference_model_dir); - pass.Initialize(&argument); - pass.Run(argument.main_dfg.get()); - // Analysis is sensitive to ProgramDesc, careful to change the original model. 
- ASSERT_EQ(argument.main_dfg->nodes.size(), 38UL); - pass.Finalize(); - ASSERT_FALSE(argument.main_dfg->DotString().empty()); - EXPECT_FALSE(argument.main_dfg->inputs().empty()); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/fluid_to_ir_pass.cc b/paddle/fluid/inference/analysis/fluid_to_ir_pass.cc deleted file mode 100644 index fc60ca3bd0bf706407defb2655a093d999aef7c2..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/fluid_to_ir_pass.cc +++ /dev/null @@ -1,60 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/fluid_to_ir_pass.h" -#include "paddle/fluid/framework/executor.h" -#include "paddle/fluid/framework/ir/fuse_pass_base.h" -#include "paddle/fluid/inference/io.h" -#include "paddle/fluid/platform/device_context.h" -#include "paddle/fluid/platform/place.h" - -namespace paddle { -namespace inference { -namespace analysis { - -void FluidToIrPass::EnableParamModify(const std::string &model_dir, - const std::string &prog_file, - const std::string ¶m_file) { - PADDLE_ENFORCE(argument_); - argument_->Set(framework::ir::kParamScopeAttr, new framework::Scope); - // Load parameters. - VLOG(3) << "Loading parameters from " << model_dir; - LoadParams(&argument_->Get(framework::ir::kParamScopeAttr), - model_dir, prog_file, param_file); -} - -bool FluidToIrPass::LoadParams(framework::Scope *scope, const std::string &dir, - const std::string &prog_file, - const std::string ¶m_file) { - platform::CPUPlace place; - platform::CPUDeviceContext ctx(place); - framework::Executor executor(place); - PADDLE_ENFORCE(argument_->origin_program_desc.get()); - framework::ProgramDesc program(*argument_->origin_program_desc); - if ((!prog_file.empty()) && (!param_file.empty())) { - LOG(INFO) << "load single model file from " << prog_file; - Load(&executor, scope, prog_file, param_file); - } else if (!dir.empty()) { - LOG(INFO) << "load from dir " << dir; - Load(&executor, scope, dir); - } else { - LOG(ERROR) << "failed to load parameters"; - return false; - } - return true; -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/fluid_to_ir_pass.h b/paddle/fluid/inference/analysis/fluid_to_ir_pass.h deleted file mode 100644 index c2599e218a2306f9353b843b7ea3f18aeacb008e..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/fluid_to_ir_pass.h +++ /dev/null @@ -1,128 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. 
-// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once - -#include -#include - -#include "paddle/fluid/framework/ir/fuse_pass_base.h" -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/flags.h" -#include "paddle/fluid/inference/analysis/ir_pass_manager.h" - -namespace paddle { -namespace inference { -namespace analysis { - -static const char kFluidToIrPassesAttr[] = "__fluid_to_ir_passes__"; - -class FluidToIrPass final : public DataFlowGraphPass { - public: - FluidToIrPass() = default; - - bool Initialize(Argument *argument) override { - ANALYSIS_ARGUMENT_CHECK_FIELD(argument); - PADDLE_ENFORCE(argument->Has(kFluidToIrPassesAttr), - "argument need the attr %s", kFluidToIrPassesAttr); - argument_ = argument; - if (argument->origin_program_desc) { - LOG(WARNING) << "argument's origin_program_desc is already set, might " - "duplicate called"; - } - // set fluid model program path - if (!argument->fluid_model_program_path) { - ANALYSIS_ARGUMENT_CHECK_FIELD(argument->fluid_model_dir); - argument->fluid_model_program_path.reset( - new std::string(*argument->fluid_model_dir + "/__model__")); - } - ANALYSIS_ARGUMENT_CHECK_FIELD(argument->fluid_model_program_path); - // Load program. - auto program = LoadProgramDesc(*argument->fluid_model_program_path); - argument->origin_program_desc.reset( - new framework::proto::ProgramDesc(program)); - // Create main data flow graph. - if (!argument->main_dfg) { - argument->main_dfg.reset(new DataFlowGraph); - } - argument->Set("ir_program_desc", new ProgramDesc(program)); - - LOG(INFO) << "Loading parameters"; - // Load parameters to argument if needed. - if (argument->fluid_model_dir || (argument->fluid_model_program_path && - argument->fluid_model_param_path)) { -#define SAFE_GET(ATTR) std::string ATTR = argument->ATTR ? *argument->ATTR : ""; - SAFE_GET(fluid_model_dir); - SAFE_GET(fluid_model_program_path); - SAFE_GET(fluid_model_param_path); -#undef SAFE_GET - EnableParamModify(fluid_model_dir, fluid_model_program_path, - fluid_model_param_path); - } - - return true; - } - - bool Finalize() override { return true; } - - void Run(DataFlowGraph *graph) override { - // Call all the IR Passes - IRPassManager ir_passes(argument_->Get("ir_program_desc"), - nullptr); - // Pass the scope from analysis to IR if needed. - if (argument_->Has(framework::ir::kParamScopeAttr)) { - // Here the address is passed, attention that IR doesn't own the scope, so - // the real scope in analysis should live during the IR phase. - ir_passes.graph().Set( - framework::ir::kParamScopeAttr, - new framework::Scope *(&argument_->Get( - framework::ir::kParamScopeAttr))); - } - - if (FLAGS_IA_enable_ir) { - const auto &ir_passes_to_apply = - argument_->Get>(kFluidToIrPassesAttr); - ir_passes.Apply(ir_passes_to_apply); - } - - PADDLE_ENFORCE(argument_->main_dfg.get()); - argument_->main_dfg->Build(ir_passes.graph()); - // inherit the arguments from ir. 
- if (ir_passes.graph().Has(framework::ir::kFuseStatisAttr)) { - argument_->Set( - framework::ir::kFuseStatisAttr, - new std::unordered_map( - ir_passes.graph().Get>( - framework::ir::kFuseStatisAttr))); - } - } - - void EnableParamModify(const std::string &model_dir, - const std::string &prog_file, - const std::string ¶m_file); - - std::string repr() const override { return "fluid-to-ir-pass"; } - - private: - // Load parameters from a single file or from a directory. - bool LoadParams(framework::Scope *scope, const std::string &dir, - const std::string &prog_file, const std::string ¶m_file); - - private: - Argument *argument_{nullptr}; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/graph_traits.h b/paddle/fluid/inference/analysis/graph_traits.h deleted file mode 100644 index aed2b1e8e27d94b430201d70ecf09d4acc33c8fa..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/graph_traits.h +++ /dev/null @@ -1,63 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -/* - * This file defines the GraphTraits template class that should be specified - * by classes that want to be iteratable by generic graph iterators. - * - * This file also defines the marker class Inverse that is used to iterate over - * graphs in a graph defined, inverse ordering... - */ - -#pragma once - -#include "paddle/fluid/inference/analysis/helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * This class should be specialized by different graph types... - * That's why the base class is empty. - */ -template -struct GraphTraits { - // using NodesBFSIterator = xxx - - // NodesBFSIterator nodes_begin(); - // NodesBFSIterator nodes_end(); -}; - -/* - * Inverse - This class is used as a marker class to tell the graph iterator to - * iterate in a graph defined Inverse order. - */ -template -struct Inverse { - const GraphType &graph; - - explicit Inverse(const GraphType &graph) : graph(graph) {} -}; - -/* - * Provide a partial specialization of GraphTraits so that the inverse of an - * inverse turns into the original graph. - */ -template -struct GraphTraits>> : GraphTraits {}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/helper.h b/paddle/fluid/inference/analysis/helper.h index 5151e2b69ac199dea136535ba445e890596f6227..269a0da9f9378601373e42d741f519843b111ec6 100644 --- a/paddle/fluid/inference/analysis/helper.h +++ b/paddle/fluid/inference/analysis/helper.h @@ -26,6 +26,7 @@ limitations under the License. 
*/ #include "paddle/fluid/framework/scope.h" #include "paddle/fluid/framework/variable.h" #include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/platform/port.h" namespace paddle { namespace inference { @@ -101,20 +102,20 @@ class OrderedRegistry { public: T *Register(const std::string &name, T *x) { PADDLE_ENFORCE(!dic_.count(name), "duplicate key [%s]", name); - dic_[name] = data_.size(); - data_.emplace_back(std::unique_ptr(x)); - return data_.back().get(); + dic_[name] = elements_.size(); + elements_.emplace_back(std::unique_ptr(x)); + return elements_.back().get(); } T *Lookup(const std::string &name) { auto it = dic_.find(name); if (it == dic_.end()) return nullptr; - return data_[it->second].get(); + return elements_[it->second].get(); } protected: std::unordered_map dic_; - std::vector> data_; + std::vector> elements_; }; template @@ -124,20 +125,6 @@ T &GetFromScope(const framework::Scope &scope, const std::string &name) { return *var->GetMutable(); } -static void ExecShellCommand(const std::string &cmd, std::string *message) { - char buffer[128]; - std::shared_ptr pipe(popen(cmd.c_str(), "r"), pclose); - if (!pipe) { - LOG(ERROR) << "error running command: " << cmd; - return; - } - while (!feof(pipe.get())) { - if (fgets(buffer, 128, pipe.get()) != nullptr) { - *message += buffer; - } - } -} - static framework::proto::ProgramDesc LoadProgramDesc( const std::string &model_path) { std::ifstream fin(model_path, std::ios::in | std::ios::binary); diff --git a/paddle/fluid/inference/analysis/ir_pass_manager.cc b/paddle/fluid/inference/analysis/ir_pass_manager.cc index e76708baf4b39afb0febbcf3ff71281dfbfc8627..fce5e1cac92064a320179243380ea02b2c5d7838 100644 --- a/paddle/fluid/inference/analysis/ir_pass_manager.cc +++ b/paddle/fluid/inference/analysis/ir_pass_manager.cc @@ -18,6 +18,8 @@ #include "paddle/fluid/framework/ir/fuse_pass_base.h" #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/inference/analysis/argument.h" +#include "paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h" #include "paddle/fluid/string/pretty_log.h" namespace paddle { @@ -27,21 +29,33 @@ using string::PrettyLogEndl; using string::PrettyLog; using string::Style; -IRPassManager::IRPassManager(const ProgramDesc &program, - framework::Scope *scope) - : program_(program) { - graph_.reset(new framework::ir::Graph(program)); - if (scope) - graph_->Set(framework::ir::kParamScopeAttr, new framework::Scope *(scope)); +IRPassManager::IRPassManager(Argument *argument) { + ARGUMENT_CHECK_FIELD(argument, main_program); + graph_ = std::unique_ptr(new Graph(argument->main_program())); + if (argument->Has("scope")) { + graph_->Set(framework::ir::kParamScopeAttr, + new framework::Scope *( + const_cast(&argument->scope()))); + } + + ARGUMENT_CHECK_FIELD(argument, ir_analysis_passes); + CreatePasses(argument, argument->ir_analysis_passes()); } -void IRPassManager::Apply(const std::vector &passes) { - // Apply all the passes +void IRPassManager::CreatePasses(Argument *argument, + const std::vector &passes) { std::string pre_pass; int pass_num = 0; for (const std::string &pass_name : passes) { - PrettyLogEndl(Style::H2(), "--- Running IR pass [%s]", pass_name); auto pass = framework::ir::PassRegistry::Instance().Get(pass_name); + + // Set some pass attributes. 
+ if (pass_name == "ir_analysis_pass") { + pass->Set("tensorrt_node_teller", + new SubgraphDetector::NodeInsideSubgraphTeller( + argument->tensorrt_node_teller())); + } + if (pass_name == "graph_viz_pass") { std::string dot_file_path = std::to_string(pass_num) + "_ir_" + (pre_pass.empty() ? "origin" : pre_pass) + @@ -49,11 +63,47 @@ void IRPassManager::Apply(const std::vector &passes) { pass->Set("graph_viz_path", new std::string(std::move(dot_file_path))); pass_num++; } - graph_ = pass->Apply(std::move(graph_)); + + if (pass_name == "tensorrt_subgraph_pass") { + PADDLE_ENFORCE(argument->tensorrt_node_teller_valid()); + pass->SetNotOwned("tensorrt_node_teller", + argument->tensorrt_node_teller_ptr()); + pass->Set("workspace_size", new int(argument->tensorrt_workspace_size())); + pass->Set("max_batch_size", new int(argument->tensorrt_max_batch_size())); + } + + // graph_ = pass->Apply(std::move(graph_)); pre_pass = pass_name; + + passes_.emplace_back(std::move(pass)); } } +std::unique_ptr IRPassManager::Apply(std::unique_ptr graph) { + if (passes_.empty()) { + return graph; + } + PADDLE_ENFORCE(graph.get()); + // Apply all the passes + for (const auto &pass : passes_) { + PrettyLogEndl(Style::H2(), "--- Running IR pass [%s]", pass->Type()); + graph = pass->Apply(std::move(graph)); + } + return std::move(graph); +} + +framework::proto::ProgramDesc IRPassManager::AcquireProgram( + std::unique_ptr *graph, const ProgramDesc &program) const { + auto pass = + framework::ir::PassRegistry::Instance().Get("graph_to_program_pass"); + + ProgramDesc desc(program); + pass->SetNotOwned("program", &desc); + auto *the_graph = graph->release(); + *graph = pass->Apply(std::unique_ptr(the_graph)); + return *desc.Proto(); +} + } // namespace analysis } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/analysis/ir_pass_manager.h b/paddle/fluid/inference/analysis/ir_pass_manager.h index bb230283b7c2cc783d0b68ea0aa3cca1cabc75e6..983a582649706fa6eedb5aa459b5ac53b98f658b 100644 --- a/paddle/fluid/inference/analysis/ir_pass_manager.h +++ b/paddle/fluid/inference/analysis/ir_pass_manager.h @@ -20,27 +20,38 @@ * for inference. 
*/ +#pragma once + +#include +#include #include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/pass.h" #include "paddle/fluid/framework/program_desc.h" #include "paddle/fluid/framework/scope.h" +#include "paddle/fluid/inference/analysis/argument.h" namespace paddle { namespace inference { namespace analysis { using framework::ProgramDesc; +using framework::ir::Graph; class IRPassManager final { public: - IRPassManager(const ProgramDesc &program, framework::Scope *scope); + explicit IRPassManager(Argument *argument); + + std::unique_ptr Apply(std::unique_ptr graph); - void Apply(const std::vector &passes); + framework::proto::ProgramDesc AcquireProgram( + std::unique_ptr *graph, const ProgramDesc &program) const; framework::ir::Graph &graph() const { return *graph_; } private: - std::unique_ptr graph_; - ProgramDesc program_; + void CreatePasses(Argument *argument, const std::vector &passes); + + std::unique_ptr graph_; + std::vector> passes_; }; } // namespace analysis diff --git a/paddle/fluid/inference/analysis/ir_passes/CMakeLists.txt b/paddle/fluid/inference/analysis/ir_passes/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..822c7799bb3ae6d79da6cf2a7b3c8c9b20353ed7 --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/CMakeLists.txt @@ -0,0 +1,9 @@ +cc_library(subgraph_detector SRCS subgraph_detector.cc DEPS proto_desc) +cc_library(tensorrt_subgraph_pass SRCS tensorrt_subgraph_pass.cc DEPS subgraph_detector) +set(analysis_deps ${analysis_deps} + subgraph_detector tensorrt_subgraph_pass + CACHE INTERNAL "") + +set(pass_file ${PADDLE_BINARY_DIR}/paddle/fluid/inference/api/paddle_inference_pass.h) +file(APPEND ${pass_file} "USE_PASS(tensorrt_subgraph_pass);\n") +set(INFER_IR_PASSES ${INFER_IR_PASSES} tensorrt_subgraph_pass CACHE INTERNAL "") diff --git a/paddle/fluid/inference/analysis/subgraph_splitter.cc b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.cc similarity index 54% rename from paddle/fluid/inference/analysis/subgraph_splitter.cc rename to paddle/fluid/inference/analysis/ir_passes/subgraph_detector.cc index 526bbbadfe90c3064d7c620cc22e30f7fef99088..b6a5dfd087c95d0ccb0f5adfa4f754cfa5a44f14 100644 --- a/paddle/fluid/inference/analysis/subgraph_splitter.cc +++ b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.cc @@ -12,46 +12,110 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/inference/analysis/subgraph_splitter.h" +#include "paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h" +#include +#include +#include "paddle/fluid/framework/ir/graph_helper.h" +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" +#include "paddle/fluid/framework/ir/node.h" namespace paddle { namespace inference { namespace analysis { -const char *SubGraphSplitter::kMarkerAttrName = - "_sub_graph_splitter_inside_sub_graph"; +using framework::ir::Node; + +std::pair, std::vector> +ExtractInputAndOutputOfSubGraph(std::vector &graph) { // NOLINT + std::unordered_set nodes(graph.begin(), graph.end()); + std::unordered_set inputs; + std::unordered_set outputs; + // Input a Value, check whether its inlink is in the subgraph. 
+ auto inlink_in_subgraph = [&](Node *n) { + for (auto *in : n->inputs) { + if (nodes.count(in)) return true; + } + return false; + }; + + for (auto &node : graph) { + for (auto *in : node->inputs) { + // The Value that is written by nodes inside a sub-graph shouldn't be the + // input of the sub-graph. + if (!nodes.count(in) && in->IsVar() && !inlink_in_subgraph(in)) { + inputs.insert(in); + } + } + for (auto *out : node->outputs) { + if (!nodes.count(out) && out->IsVar()) { + outputs.insert(out); + } + } + } + return std::make_pair(std::vector(inputs.begin(), inputs.end()), + std::vector(outputs.begin(), outputs.end())); +} + +// Filter the Intermediate results of the subgraph node. +void FilterRedundantOutputOfSubGraph(Graph *graph) { + std::vector op_nodes; + for (auto &node : TopologicalSort(*graph)) { + if (node.IsVar() || Agent(&node).deleted()) { + continue; + } + op_nodes.push_back(&node); + } + size_t op_num = op_nodes.size(); + for (size_t i = 0; i < op_num; i++) { + if (op_nodes[i]->IsOp()) continue; + std::unordered_set follow_up_input_names; + for (size_t j = i + 1; j < op_num; j++) { + for (auto *in : op_nodes[j]->inputs) { + follow_up_input_names.insert(in->Name()); + } + } + std::vector filtered_subgraph_outlinks; + for (auto *out : op_nodes[i]->outputs) { + if (follow_up_input_names.count(out->Name())) { + filtered_subgraph_outlinks.push_back(out); + } else { + Agent(out).set_deleted(true); + } + } + // The filtered_subgraph_outlinks may be empty. + op_nodes[i]->outputs = filtered_subgraph_outlinks; + } +} -std::vector> SubGraphSplitter::operator()() { +std::vector> SubgraphDetector::operator()() { MarkNodesInsideSubGraph(); return ExtractSubGraphs(); } // Mark the output variables inside a subgraph with the func. -inline void MarkOutLinksInSubGraph(const Function *func) { - for (auto *var : func->outlinks) { - var->attr(SubGraphSplitter::kMarkerAttrName).Bool() = true; +inline void MarkOutLinksInSubGraph(const Node *func) { + for (auto *var : func->outputs) { + Agent(var).set_marked(true); } } -void SubGraphSplitter::MarkNodesInsideSubGraph() { - for (auto &node : GraphTraits(*graph_).nodes()) { +void SubgraphDetector::MarkNodesInsideSubGraph() { + for (auto &node : framework::ir::GraphTraits::DFS(*graph_)) { if (node_inside_subgraph_teller_(&node)) { - node.attr(kMarkerAttrName).Bool() = true; - if (node.type() == Node::Type::kFunction) { + Agent(&node).set_marked(true); + if (node.IsOp()) { // If a function is inside the sub-graph, mark all the output variables // to be inside too, so that two marked functions will be inside a same // sub-graph, lets take a example: A_function->var->B_function, if // A_function is marked, var should also be marked, so that B_function // will be in the same sub-graph with A_function if B_function is // marked. - MarkOutLinksInSubGraph(static_cast(&node)); + MarkOutLinksInSubGraph(&node); } } } } -const char *kUnionFindParent = "_sub_graph_splitter_union_find_parent_"; - // Use the Union Find(UF) algorithm to find fully connected sub-graphs, if node // a's output is node b, that is a and b is in the same sub-graph. The UF // algorithm will group them to the same cluster. 
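For readers new to this part of the code, a minimal self-contained sketch of the union-find clustering that the comment above describes. It is illustrative only: it keeps the parent links in a plain array indexed by integer node id, whereas the pass stores them per node in `NodeWrapper::union_find_parent` through the `Agent` accessors.

```cpp
#include <numeric>
#include <vector>

// Minimal union-find: every node starts as its own root; Union links two
// clusters; Find returns the cluster's representative (with path halving).
struct UnionFind {
  std::vector<int> parent;
  explicit UnionFind(int n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);  // each node is its own root
  }
  int Find(int x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];
    return x;
  }
  void Union(int a, int b) { parent[Find(b)] = Find(a); }
};

// Usage: for every edge (a, b) whose endpoints are both marked, call
// uf.Union(a, b); afterwards uf.Find(x) identifies x's sub-graph cluster.
```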
@@ -60,8 +124,8 @@ using node_map_t = std::unordered_map; int UnionFindGetAncestor(const node_map_t &node_map, size_t id) { int tmp = id; do { - tmp = node_map.at(tmp)->attr(kUnionFindParent).Int32(); - } while (node_map.at(tmp)->attr(kUnionFindParent).Int32() != tmp); + tmp = Agent(node_map.at(tmp)).union_find_parent(); + } while (Agent(node_map.at(tmp)).union_find_parent() != tmp); return tmp; } // Make this two node share the same ancestor. @@ -69,9 +133,9 @@ int UnionFindGetAncestor(const node_map_t &node_map, size_t id) { void UnionFindCombine(const node_map_t &node_map, size_t a, size_t b) { int a_ancestor = UnionFindGetAncestor(node_map, a); int b_ancestor = UnionFindGetAncestor(node_map, b); - node_map.at(b_ancestor)->attr(kUnionFindParent).Int32() = a_ancestor; - node_map.at(a)->attr(kUnionFindParent).Int32() = a_ancestor; - node_map.at(b)->attr(kUnionFindParent).Int32() = a_ancestor; + Agent(node_map.at(b_ancestor)).set_union_find_parent(a_ancestor); + Agent(node_map.at(a)).set_union_find_parent(a_ancestor); + Agent(node_map.at(b)).set_union_find_parent(a_ancestor); } // This is a simple representation of a graph. @@ -195,16 +259,21 @@ void FlexibleDFS(const std::vector &source, bool reverse, } } -std::vector> SubGraphSplitter::ExtractSubGraphs() { +std::vector> SubgraphDetector::ExtractSubGraphs() { // Run the Extract algorithm to find all subgraphs. std::vector marked_nodes; // We use brief_node_map to represent the original graph in order to avoid // changing the original graph. std::unordered_map brief_node_map; - for (auto &node : GraphTraits(*graph_).nodes_in_TS()) { + std::unordered_set valid_node_ids; + for (auto *node : graph_->Nodes()) { + valid_node_ids.insert(node->id()); + } + + for (auto &node : framework::ir::GraphTraits::TS(*graph_)) { brief_node_map[node.id()] = new BriefNode(&node); - if (node.attr(kMarkerAttrName).Bool()) { + if (Agent(&node).marked()) { marked_nodes.push_back(&node); } } @@ -213,26 +282,34 @@ std::vector> SubGraphSplitter::ExtractSubGraphs() { node_map_t node_map; // id to ptr for (auto *n : marked_nodes) { // n's parent == n.id means it is the ancestor - n->attr(kUnionFindParent).Int32() = n->id(); + Agent(n).set_union_find_parent(n->id()); node_map[n->id()] = n; } // create breif node map for (auto &itr : brief_node_map) { - for (Node *node : itr.second->node->inlinks) { - itr.second->inlinks.push_back(brief_node_map[node->id()]); + for (Node *node : itr.second->node->inputs) { + if (!valid_node_ids.count(node->id())) { + LOG(INFO) << "invalid node id " << node->id(); + continue; + } + itr.second->inlinks.push_back(brief_node_map.at(node->id())); } - for (Node *node : itr.second->node->outlinks) { - itr.second->outlinks.push_back(brief_node_map[node->id()]); + for (Node *node : itr.second->node->outputs) { + if (!valid_node_ids.count(node->id())) { + LOG(INFO) << "invalid node id " << node->id(); + continue; + } + itr.second->outlinks.push_back(brief_node_map.at(node->id())); } } for (auto &itr : brief_node_map) { BriefNode *brief_node = itr.second; - if (!brief_node->node->attr(kMarkerAttrName).Bool()) { - VLOG(4) << brief_node->node->id() << " node not a trt candicate."; + if (!Agent(brief_node->node).marked()) { + VLOG(4) << brief_node->node->id() << " node not a trt candidate."; continue; } @@ -254,7 +331,7 @@ std::vector> SubGraphSplitter::ExtractSubGraphs() { std::unordered_set contract_nodes; for (auto *out : brief_node->outlinks) { // must be an trt candidate - if (!out->node->attr(kMarkerAttrName).Bool()) continue; + if 
(!Agent(out->node).marked()) continue; // get all dst input nodes except src. std::vector source_nodes; for (auto *n : out->inlinks) { @@ -289,9 +366,8 @@ std::vector> SubGraphSplitter::ExtractSubGraphs() { std::unordered_map> clusters; for (auto *n : marked_nodes) { - if (n->type() == Node::Type::kFunction) { - clusters[UnionFindGetAncestor(node_map, - n->attr(kUnionFindParent).Int32())] + if (n->IsOp()) { + clusters[UnionFindGetAncestor(node_map, Agent(n).union_find_parent())] .push_back(n); } } @@ -304,28 +380,59 @@ std::vector> SubGraphSplitter::ExtractSubGraphs() { return result; } -void SubGraphFuse::operator()() { ReplaceNodesWithSubGraphs(); } +void SubGraphFuser::operator()() { ReplaceNodesWithSubGraphs(); } + +void RemoveIntermediateOutputInSubgraph(const std::vector &subgraph, + Graph *graph, + std::vector *outputs) { + std::unordered_set subgraph_set(subgraph.begin(), subgraph.end()); + std::unordered_set valid_output; + + for (auto *output : *outputs) { + int num_used = 0; + for (auto *node : output->outputs) { + if (!subgraph_set.count(node)) ++num_used; + if (num_used > 0) valid_output.insert(output); + } + } + + outputs->assign(valid_output.begin(), valid_output.end()); +} + +void DetachDeletedNodes(framework::ir::Graph *graph) { + std::unordered_set nodes; + for (auto *node : graph->Nodes()) { + if (Agent(node).deleted()) { + node->inputs.clear(); + node->outputs.clear(); + } + } +} -void SubGraphFuse::ReplaceNodesWithSubGraphs() { - auto subgraphs = SubGraphSplitter(graph_, node_inside_subgraph_teller_)(); +void SubGraphFuser::ReplaceNodesWithSubGraphs() { + auto subgraphs = SubgraphDetector(graph_, node_inside_subgraph_teller_)(); for (auto &subgraph : subgraphs) { - if (subgraph.size() <= argument_->Get("minimum_subgraph_size")) - continue; + if (subgraph.size() <= (size_t)min_subgraph_size_) continue; + LOG(INFO) << "detect a subgraph size " << subgraph.size(); std::unordered_set subgraph_uniq(subgraph.begin(), subgraph.end()); // replace this sub-graph with the first node. Two steps: 1. Create a Block // Node that contains this subgraph 2. Mark the nodes inside the sub-graph // as deleted. 3. Replace the deleted node with the new Block Node. - auto *block_node = static_cast( - graph_->nodes.Create(Node::Type::kFunctionBlock)); + framework::OpDesc empty_desc; + empty_desc.SetType("tensorrt_engine"); + auto *block_node = graph_->CreateOpNode(&empty_desc); + Agent(block_node).set_subgraph({}); auto io = ExtractInputAndOutputOfSubGraph(subgraph); - block_node->inlinks = std::move(io.first); - block_node->outlinks = std::move(io.second); + block_node->inputs = std::move(io.first); + block_node->outputs = std::move(io.second); + + RemoveIntermediateOutputInSubgraph(subgraph, graph_, &block_node->outputs); for (auto *node : subgraph) { // TODO(Superjomn) need a unified mechanism to treat deleted node in each // pass. 
- node->SetDeleted(); - block_node->subgraph.push_back(node); + Agent(node).set_deleted(true); + Agent(block_node).subgraph()->push_back(node); } // Change all the sub-graph's inputs and outputs corresponding inlink and @@ -339,16 +446,92 @@ void SubGraphFuse::ReplaceNodesWithSubGraphs() { std::unordered_set uniq(nodes.begin(), nodes.end()); nodes.assign(uniq.begin(), uniq.end()); }; - for (auto *i : block_node->inlinks) { - inlink_or_outlink_cleaner(i->outlinks); + for (auto *i : block_node->inputs) { + inlink_or_outlink_cleaner(i->outputs); } - for (auto *&o : block_node->outlinks) { - inlink_or_outlink_cleaner(o->inlinks); + for (auto *&o : block_node->outputs) { + inlink_or_outlink_cleaner(o->inputs); } } + // DetachDeletedNodes(graph_); FilterRedundantOutputOfSubGraph(graph_); } +inline bool CheckNodeIndegreeEquals(const Node &node, size_t n) { + return node.inputs.size() == n; +} + +NodesTSIterator::NodesTSIterator(const std::vector &source) { + PADDLE_ENFORCE(!source.empty(), + "Start points of topological sorting should not be empty!"); + // CHECK all the inputs' in-degree is 0 + for (auto *node : source) { + PADDLE_ENFORCE(CheckNodeIndegreeEquals(*node, 0)); + } + + std::unordered_set visited; + std::unordered_set to_visit{source.begin(), source.end()}; + + std::vector inlink_visited; + while (!to_visit.empty()) { + std::vector queue(to_visit.begin(), to_visit.end()); + for (auto *p : queue) { + if (Agent(p).deleted()) { + visited.insert(p); + to_visit.erase(p); + } + + inlink_visited.clear(); + + std::copy_if(p->inputs.begin(), p->inputs.end(), + std::back_inserter(inlink_visited), + [&](Node *x) -> bool { return visited.count(x) != 0; }); + + if (inlink_visited.size() == p->inputs.size()) { + sorted_.push_back(p); + for (auto *_ : p->outputs) { + if (!visited.count(_)) { + to_visit.insert(_); + } + } + + to_visit.erase(p); + visited.insert(p); + } + } + } +} + +NodesTSIterator::NodesTSIterator(const NodesTSIterator &other) + : sorted_(other.sorted_), cursor_(other.cursor_) {} + +Node &NodesTSIterator::operator*() { + PADDLE_ENFORCE_LT(cursor_, sorted_.size()); + return *sorted_[cursor_]; +} + +NodesTSIterator &NodesTSIterator::operator++() { + if (++cursor_ >= sorted_.size()) { + sorted_.clear(); + cursor_ = 0; + } + return *this; +} +NodesTSIterator &NodesTSIterator::operator=(const NodesTSIterator &other) { + cursor_ = other.cursor_; + sorted_ = other.sorted_; + return *this; +} + +bool NodesTSIterator::operator==(const NodesTSIterator &other) { + return sorted_ == other.sorted_ && cursor_ == other.cursor_; +} + +Node *NodesTSIterator::operator->() { + PADDLE_ENFORCE_LT(cursor_, sorted_.size()); + return sorted_[cursor_]; +} + } // namespace analysis } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h new file mode 100644 index 0000000000000000000000000000000000000000..ea88edd042aa9d46f66af1aa92f2cb273696c118 --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h @@ -0,0 +1,182 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +/* + * This file defines the class to partition a graph. + */ + +#pragma once + +#include +#include "paddle/fluid/framework/ir/graph.h" +#include "paddle/fluid/framework/ir/graph_traits.h" +#include "paddle/fluid/framework/ir/node.h" +#include "paddle/fluid/inference/analysis/argument.h" +#include "paddle/fluid/inference/analysis/helper.h" + +namespace paddle { +namespace inference { +namespace analysis { + +using framework::ir::Graph; + +const char kIsFunctionNode[] = "__is_function_node__"; +const char kFunctionNodeSubGraph[] = "__function_node_sub_graph__"; +const char kSubgraphSplitterMarkerAttrName[] = + "_sub_graph_splitter_inside_sub_graph"; + +/* + * Detect the nodes in a sub-graph that meet some conditions. This class doesn't + * modify the graph. + */ +class SubgraphDetector { + public: + // Tell whether a node is inside a sub-graph. + using NodeInsideSubgraphTeller = + std::function; + + SubgraphDetector(Graph *graph, const NodeInsideSubgraphTeller &teller) + : graph_(graph), node_inside_subgraph_teller_(teller) {} + + std::vector> operator()(); + + protected: + // Mark the nodes inside the accepted sub-graph using + // node_inside_subgraph_teller. + void MarkNodesInsideSubGraph(); + + // Merge the marked nodes into sub-graphs and return the sub-graphs. + std::vector> ExtractSubGraphs(); + + private: + Graph *graph_; + NodeInsideSubgraphTeller node_inside_subgraph_teller_; +}; + +/* + * SubGraphFuser - Replace some nodes with the sub-graph node they are inside. + * To some extent, the TensorRT engine is just a fusion op for a model. + */ +class SubGraphFuser { + public: + using NodeInsideSubgraphTeller = SubgraphDetector::NodeInsideSubgraphTeller; + + SubGraphFuser(Graph *graph, const NodeInsideSubgraphTeller &teller, + int min_subgraph_size) + : graph_(graph), + node_inside_subgraph_teller_(teller), + min_subgraph_size_{min_subgraph_size} {} + + // The main method which runs all the logic. + void operator()(); + + protected: + // Remove the nodes inside sub-graphs and replace them with the SubGraphNode. + void ReplaceNodesWithSubGraphs(); + + private: + Graph *graph_; + NodeInsideSubgraphTeller node_inside_subgraph_teller_; + int min_subgraph_size_; +}; + +struct NodeWrapper { + bool deleted{false}; + bool marked{false}; + int union_find_parent{-1}; + std::vector subgraph; +}; + +/* + * ir::Node agent for subgraph detector.
+ */ +struct Agent { + explicit Agent(framework::ir::Node *x) : x_(x) {} + + NodeWrapper &wrapper() { + if (!x_->IsWrappedBy()) { + x_->WrappedBy(new NodeWrapper); + } + return x_->template Wrapper(); + } + + bool deleted() { return wrapper().deleted; } + void set_deleted(bool x) { wrapper().deleted = x; } + + bool marked() { return wrapper().marked; } + void set_marked(bool x) { wrapper().marked = x; } + + void set_subgraph(const std::vector &x) { + wrapper().subgraph = x; + } + + int union_find_parent() { return wrapper().union_find_parent; } + void set_union_find_parent(int v) { wrapper().union_find_parent = v; } + + std::vector *subgraph() { return &wrapper().subgraph; } + std::vector &inputs() { return x_->inputs; } + std::vector &outputs() { return x_->outputs; } + + private: + framework::ir::Node *x_; +}; + +// Topological sorting iterator on nodes. +struct NodesTSIterator + : public std::iterator { + NodesTSIterator() = default; + explicit NodesTSIterator(const std::vector &source); + NodesTSIterator(NodesTSIterator &&other) + : sorted_(std::move(other.sorted_)), cursor_(other.cursor_) { + other.cursor_ = 0; + } + NodesTSIterator(const NodesTSIterator &other); + + framework::ir::Node &operator*(); + NodesTSIterator &operator++(); + // TODO(Superjomn) the current implementation just compares the first + // element; we need to compare the graph and all the elements in the queue + // and set. + NodesTSIterator &operator=(const NodesTSIterator &other); + bool operator==(const NodesTSIterator &other); + bool operator!=(const NodesTSIterator &other) { return !(*this == other); } + framework::ir::Node *operator->(); + + private: + std::vector sorted_; + size_t cursor_{0}; +}; + +// The nodes that have no inputs will be treated as start points. +static std::vector ExtractStartPoints(const Graph &g) { + std::vector result; + for (auto *node : g.Nodes()) { + if (node->inputs.empty()) { + result.push_back(node); + } + } + return result; +} + +static iterator_range TopologicalSort(const Graph &g) { + auto start_points = ExtractStartPoints(g); + PADDLE_ENFORCE(!start_points.empty()); + NodesTSIterator x(start_points); + return iterator_range(NodesTSIterator(start_points), + NodesTSIterator()); +} + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.cc b/paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..c6b7c05f784b7c44fe30dd69529fe48405538ab6 --- /dev/null +++ b/paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.cc @@ -0,0 +1,220 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License.
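To make the header above easier to follow, a usage sketch of the `TopologicalSort` and `Agent` helpers it declares. `VisitOpsInOrder` is a hypothetical name; the snippet is illustrative only, not part of the patch.

```cpp
#include "paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h"

// Walk the ops of a graph in topological order, skipping nodes that the
// detector has marked as deleted.
void VisitOpsInOrder(const paddle::framework::ir::Graph &g) {
  namespace an = paddle::inference::analysis;
  for (auto &node : an::TopologicalSort(g)) {
    if (node.IsOp() && !an::Agent(&node).deleted()) {
      VLOG(4) << "visiting op " << node.Name();
    }
  }
}
```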
+ +#include "paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.h" +#include +#include +#include "paddle/fluid/framework/ir/graph_pattern_detector.h" +#include "paddle/fluid/inference/analysis/helper.h" +#include "paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h" + +namespace paddle { +namespace inference { +namespace analysis { + +using framework::ir::Node; + +std::vector ExtractParameters( + const std::unordered_set &nodes); + +std::unique_ptr analysis::TensorRtSubgraphPass::ApplyImpl( + + std::unique_ptr graph) const { + framework::ir::FusePassBase::Init("tensorrt_subgraph_pass", graph.get()); + + auto teller = + Get("tensorrt_node_teller"); + + SubGraphFuser fuser(graph.get(), teller, 2 /*min subgraph size*/); + fuser(); + + for (auto *node : graph->Nodes()) { + if (node->IsOp() && !Agent(node).subgraph()->empty()) { + CreateTensorRTOp(node, graph.get()); + + std::unordered_set nodes2remove( + Agent(node).subgraph()->begin(), Agent(node).subgraph()->end()); + framework::ir::GraphSafeRemoveNodes(graph.get(), nodes2remove); + } + } + + std::unordered_set nodes2remove; + for (auto *node : graph->Nodes()) { + if (node->IsOp() && Agent(node).deleted()) { + nodes2remove.insert(node); + } + } + framework::ir::GraphSafeRemoveNodes(graph.get(), nodes2remove); + + return graph; +} + +void TensorRtSubgraphPass::CreateTensorRTOp(framework::ir::Node *node, + Graph *graph) const { + auto *op_desc = node->Op(); + static int counter{0}; + auto &subgraph = *Agent(node).subgraph(); + PADDLE_ENFORCE(!subgraph.empty()); + + // A fake block desc. + framework::proto::BlockDesc block_proto; + framework::BlockDesc block_desc(nullptr, &block_proto); + block_desc.Proto()->set_parent_idx(-1); + block_desc.Proto()->set_idx(0); + for (auto *node : subgraph) { + auto *op = block_desc.AppendOp(); + *op->Proto() = *node->Op()->Proto(); + } + + // collect inputs + std::unordered_set input_names; + std::unordered_set input_names_with_id; + for (auto *x : node->inputs) { + input_names.insert(x->Name()); + input_names_with_id.insert(x->Name() + std::to_string(x->id())); + } + op_desc->SetInput( + "Xs", std::vector(input_names.begin(), input_names.end())); + + std::unordered_set output_names; + std::unordered_set output_names_with_id; + for (auto *x : node->outputs) { + output_names.insert(x->Name()); + output_names_with_id.insert(x->Name() + std::to_string(x->id())); + } + + op_desc->SetOutput( + "Ys", std::vector(output_names.begin(), output_names.end())); + op_desc->SetType("tensorrt_engine"); + + std::unordered_map output_name_map; + + // The following procedure is used to rename all the intermediate + // variables and the output variables of the subgraph. + // Why do we do this? + // During the transition from fluid OP to tensorrt OP, we map + // the input and output Tensor (fluid data structure) of fluid OP + // to the corresponding ITensor (trt data structure) through the + // Tensor name. When we set up an ITensor for a variable, we must + // ensure that it has not been set before. + // If there is a variable in the fluid graph that is not only the + // input of one op but also the output of another, there will be problems. + // So we have to rename the variables in the subgraph to make sure + // each one is either an op's input or an op's output.
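A tiny runnable illustration of the id-suffix rename rule described above; the variable name `fc_out` and the node id `7` are made up for the example.

```cpp
#include <iostream>
#include <string>

// Illustrative only: append the ir node id to a variable name, mirroring
// arg_value + std::to_string(var2id[arg_value]) in the rename loop below.
int main() {
  std::string arg_value = "fc_out";  // hypothetical: output of one op, input of another
  int node_id = 7;                   // hypothetical ir node id
  std::string arg_value_with_id = arg_value + std::to_string(node_id);
  std::cout << arg_value_with_id << "\n";  // prints fc_out7
  return 0;
}
```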
+ + auto &subgraph_nodes = *Agent(node).subgraph(); + for (size_t index = 0; index < block_desc.OpSize(); ++index) { + framework::proto::OpDesc *op = block_desc.Op(index)->Proto(); + auto correspond_node = subgraph_nodes[index]; + PADDLE_ENFORCE_EQ(correspond_node->Name(), op->type()); + + std::unordered_map var2id; + for (auto *in_var : correspond_node->inputs) { + var2id[in_var->Name()] = in_var->id(); + } + // Rename the input variables of the ops inside the subgraph. + for (int i = 0; i < op->inputs_size(); i++) { + // one input + auto *in_var = op->mutable_inputs(i); + std::vector replaced_names; + for (int k = 0; k < in_var->arguments_size(); k++) { // all the arguments + std::string arg_value = in_var->arguments(k); + std::string arg_value_with_id = + arg_value + std::to_string(var2id[arg_value]); + if (input_names_with_id.count(arg_value_with_id)) { + replaced_names.push_back(arg_value); + } else { + replaced_names.push_back(arg_value_with_id); + } + } + in_var->clear_arguments(); + for (size_t k = 0; k < replaced_names.size(); k++) { + in_var->add_arguments(replaced_names[k]); + } + } + var2id.clear(); + for (auto out_var : correspond_node->outputs) { + var2id[out_var->Name()] = out_var->id(); + } + + // Rename the output variables of the ops inside the subgraph. + for (int i = 0; i < op->outputs_size(); i++) { + framework::proto::OpDesc_Var *out_var = op->mutable_outputs(i); + std::vector replaced_names; + for (int k = 0; k < out_var->arguments_size(); k++) { + std::string arg_value = out_var->arguments(k); + std::string arg_value_with_id = + arg_value + std::to_string(var2id[arg_value]); + if (output_names_with_id.count(arg_value_with_id)) { + output_name_map[arg_value] = arg_value_with_id; + } + replaced_names.push_back(arg_value_with_id); + } + out_var->clear_arguments(); + for (size_t k = 0; k < replaced_names.size(); k++) { + out_var->add_arguments(replaced_names[k]); + } + } + } + + // When the tensorrt engine runs at the end of the operation, + // output_mapping helps us copy the data from the renamed ITensor + // to the Tensor.
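Continuing the hypothetical `fc_out` example, the copy-back bookkeeping the comment above refers to behaves roughly like this (a sketch, not the patch's code):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// output_name_map records original name -> id-suffixed engine-side name;
// output_mapping lists the engine-side names in output order so the
// runtime can copy each renamed ITensor back to its fluid Tensor.
int main() {
  std::unordered_map<std::string, std::string> output_name_map{
      {"fc_out", "fc_out7"}};  // hypothetical entry built by the rename loop
  std::vector<std::string> output_names{"fc_out"};
  std::vector<std::string> output_mapping;
  for (const auto &name : output_names) {
    output_mapping.push_back(output_name_map.at(name));  // -> "fc_out7"
  }
  return 0;
}
```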
+ std::vector output_mapping; + for (auto name : output_names) { + // LOG(INFO) << name << " " << output_name_map.size(); + PADDLE_ENFORCE(output_name_map.count(name) != 0); + output_mapping.push_back(output_name_map[name]); + } + + *block_desc.Proto()->mutable_vars() = + const_cast(&graph->program()) + ->Proto() + ->blocks(0) + .vars(); + PADDLE_ENFORCE(!block_desc.Proto()->vars().empty(), + "the block has no var-desc"); + PADDLE_ENFORCE(!output_mapping.empty()); + // Set attrs + SetAttr(op_desc->Proto(), "subgraph", + block_desc.Proto()->SerializeAsString()); + SetAttr(op_desc->Proto(), "max_batch_size", Get("max_batch_size")); + SetAttr(op_desc->Proto(), "workspace_size", Get("workspace_size")); + SetAttr(op_desc->Proto(), "engine_uniq_key", + "trt-" + std::to_string(counter++)); + SetAttr(op_desc->Proto(), "parameters", ExtractParameters(graph->Nodes())); + SetAttr(op_desc->Proto(), "output_name_mapping", output_mapping); +} + +std::vector ExtractParameters( + const std::unordered_set &nodes) { + std::vector parameters; + for (const auto &node : nodes) { + if (!node->IsVar()) continue; + if (node->Var()->Persistable()) { + parameters.push_back(node->Name()); + } + } + return parameters; +} + +} // namespace analysis +} // namespace inference +} // namespace paddle + +REGISTER_PASS(tensorrt_subgraph_pass, + paddle::inference::analysis::TensorRtSubgraphPass) + .RequirePassAttr("tensorrt_node_teller") + .RequirePassAttr("max_batch_size") + .RequirePassAttr("workspace_size"); diff --git a/paddle/fluid/inference/analysis/model_store_pass_tester.cc b/paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.h similarity index 55% rename from paddle/fluid/inference/analysis/model_store_pass_tester.cc rename to paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.h index d6493fc25edf25003504542f1b01c4105754c8df..502353b95fc15e763900a0caf1649257508f0880 100644 --- a/paddle/fluid/inference/analysis/model_store_pass_tester.cc +++ b/paddle/fluid/inference/analysis/ir_passes/tensorrt_subgraph_pass.h @@ -12,31 +12,24 @@ // See the License for the specific language governing permissions and // limitations under the License. 
-#include "paddle/fluid/inference/analysis/model_store_pass.h" - -#include -#include -#include "paddle/fluid/inference/analysis/analyzer.h" +#pragma once +#include +#include "paddle/fluid/framework/ir/pass.h" namespace paddle { namespace inference { namespace analysis { -DEFINE_string(inference_model_dir, "", "Model path"); - -TEST(DFG_StorePass, test) { - Analyzer analyzer; - Argument argument(FLAGS_inference_model_dir); - argument.model_output_store_path.reset( - new std::string("./_dfg_store_pass_tmp")); - // disable storage in alalyzer - FLAGS_IA_output_storage_path = ""; - analyzer.Run(&argument); +class TensorRtSubgraphPass : public framework::ir::FusePassBase { + public: + std::unique_ptr ApplyImpl( + std::unique_ptr graph) const override; - ModelStorePass pass; - pass.Initialize(&argument); - pass.Run(argument.main_dfg.get()); -} + private: + void CreateTensorRTOp(framework::ir::Node *x, + framework::ir::Graph *graph) const; + void CleanIntermediateOutputs(framework::ir::Node *node); +}; } // namespace analysis } // namespace inference diff --git a/paddle/fluid/inference/analysis/model_store_pass.cc b/paddle/fluid/inference/analysis/model_store_pass.cc deleted file mode 100644 index c313db08875669010ddcca13aa66b383ee6d26f8..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/model_store_pass.cc +++ /dev/null @@ -1,67 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include -#include -#include - -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/analysis/argument.h" -#include "paddle/fluid/inference/analysis/model_store_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -void ModelStorePass::Run(DataFlowGraph *x) { - if (!argument_->fluid_model_param_path) { - PADDLE_ENFORCE_NOT_NULL(argument_->fluid_model_dir); - argument_->fluid_model_param_path.reset( - new std::string(*argument_->fluid_model_dir + "param")); - } - PADDLE_ENFORCE_NOT_NULL(argument_->model_output_store_path); - // Directly copy param file to destination. - std::stringstream ss; - // NOTE these commands only works on linux. 
- ss << "mkdir -p " << *argument_->model_output_store_path; - VLOG(3) << "run command: " << ss.str(); - PADDLE_ENFORCE_EQ(system(ss.str().c_str()), 0); - ss.str(""); - - ss << "cp " << *argument_->fluid_model_dir << "/*" - << " " << *argument_->model_output_store_path; - VLOG(3) << "run command: " << ss.str(); - PADDLE_ENFORCE_EQ(system(ss.str().c_str()), 0); - - // Store program - PADDLE_ENFORCE_NOT_NULL(argument_->transformed_program_desc, - "program desc is not transformed, should call " - "DataFlowGraphToFluidPass first."); - VLOG(3) << "store analyzed program to " - << *argument_->model_output_store_path; - const std::string program_output_path = - *argument_->model_output_store_path + "/__model__"; - std::ofstream file(program_output_path, std::ios::binary); - PADDLE_ENFORCE(file.is_open(), "failed to open %s to write.", - program_output_path); - const std::string serialized_message = - argument_->transformed_program_desc->SerializeAsString(); - file.write(serialized_message.c_str(), serialized_message.size()); -} - -bool ModelStorePass::Finalize() { return true; } - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/node.cc b/paddle/fluid/inference/analysis/node.cc deleted file mode 100644 index 3339b5044df0cf91d00aa9ddad310d4bf263bc3c..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/node.cc +++ /dev/null @@ -1,70 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
*/
-
-#include "paddle/fluid/inference/analysis/node.h"
-#include "glog/logging.h"
-#include "paddle/fluid/platform/enforce.h"
-
-namespace paddle {
-namespace inference {
-namespace analysis {
-
-std::vector<Dot::Attr> Value::dot_attrs() const {
-  return std::vector<Dot::Attr>({Dot::Attr("style", "filled,rounded"),
-                                 Dot::Attr("shape", "box"),
-                                 Dot::Attr("fillcolor", "red")});
-}
-
-std::vector<Dot::Attr> Function::dot_attrs() const {
-  return std::vector<Dot::Attr>({Dot::Attr("style", "filled,rounded"),
-                                 Dot::Attr("shape", "diamond"),
-                                 Dot::Attr("fillcolor", "yellow")});
-}
-
-Node *NodeMap::Create(Node::Type type) {
-  switch (type) {
-    case Node::Type::kFunction:
-      nodes_.emplace_back(new Function);
-      break;
-    case Node::Type::kValue:
-      nodes_.emplace_back(new Value);
-      break;
-    case Node::Type::kFunctionBlock:
-      nodes_.emplace_back(new FunctionBlock);
-      break;
-    default:
-      PADDLE_THROW("Not supported node type.");
-  }
-  nodes_.back()->id_ = size() - 1;
-  return nodes_.back().get();
-}
-
-Node *NodeMap::GetMutable(size_t id) {
-  PADDLE_ENFORCE_GT(size(), id);
-  return nodes_[id].get();
-}
-
-const Node &NodeMap::Get(size_t id) const {
-  PADDLE_ENFORCE_GT(size(), id);
-  return *nodes_[id].get();
-}
-
-void NodeMap::Delete(size_t id) {
-  PADDLE_ENFORCE_LT(id, size());
-  nodes_[id]->SetDeleted();
-}
-
-}  // namespace analysis
-}  // namespace inference
-}  // namespace paddle
diff --git a/paddle/fluid/inference/analysis/node.h b/paddle/fluid/inference/analysis/node.h
deleted file mode 100644
index af34156bc2f101465d87cb10e2155745022eb521..0000000000000000000000000000000000000000
--- a/paddle/fluid/inference/analysis/node.h
+++ /dev/null
@@ -1,244 +0,0 @@
-/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-
-/*
- * This file defines the Node class and its subclasses. A Node is the basis
- * analysis element in a computation graph.
- * There are basically two kinds of nodes, the function node and value node.
- */
-#pragma once
-
-#include <memory>
-#include <string>
-#include <typeindex>
-#include <unordered_map>
-#include <vector>
-
-#include "paddle/fluid/framework/var_type.h"
-#include "paddle/fluid/inference/analysis/device.h"
-#include "paddle/fluid/inference/analysis/dot.h"
-#include "paddle/fluid/inference/analysis/helper.h"
-#include "paddle/fluid/platform/variant.h"
-
-namespace paddle {
-namespace inference {
-namespace analysis {
-
-class NodeMap;
-
-// A helper class to maintain the status from Pass.
-struct AnyAttr {
-  using any_t =
-      boost::variant<bool, float, int32_t, int64_t, void *, std::string>;
-  // NOTE T should be a primary type or a struct combined by several primary
-  // types.
-  // NOTE the STL containers should not use here.
-  // Some usages
-  // Attr attr;
-  // attr.Bool() = true;
-  bool &Bool() { return As<bool>(); }
-  float &Float() { return As<float>(); }
-  int32_t &Int32() { return As<int32_t>(); }
-  int64_t &Int64() { return As<int64_t>(); }
-  void *&Pointer() { return As<void *>(); }
-  std::string &String() { return As<std::string>(); }
-
-  template <typename T>
-  T &As() {
-    if (type_index_ == typeid(AnyAttr)) {
-      type_index_ = typeid(T);
-      any_data_ = T();
-    } else {
-      PADDLE_ENFORCE(type_index_ == typeid(T), "fetch error type");
-    }
-    return boost::get<T>(any_data_);
-  }
-
- private:
-  any_t any_data_;
-  std::type_index type_index_{typeid(AnyAttr)};
-};
-
-/*
- * Node Representation.
- *
- * This is a very important class for analysis. It is the base class of all
- * nodes computed by a program that may be used as operands to other nodes.
- * Node is the super class of other important classes such as Function and
- * Value, some nodes can have a name.
- */
-class Node {
- public:
-  // Node type. NOTE the new node types should add here.
-  enum class Type { kNone = -1, kFunction, kValue, kFunctionBlock };
-
-  Node() = default;
-
-  // Cast to a subclass type, Function for example.
-  template <typename Subclass>
-  Subclass &As() {
-    return *dynamic_cast<Subclass *>(this);
-  }
-
-  // Formatted representation of this Node.
-  virtual std::string repr() const {
-    return name() + "(" + std::to_string(id()) + ")";
-  }
-
-  // DOT node representation. One Node type can customize its own node
-  // representation.
-  virtual std::vector<Dot::Attr> dot_attrs() const {
-    return std::vector<Dot::Attr>({Dot::Attr("style", "filled")});
-  }
-
-  // Get an additional attribute and convert it to T data type. NOTE this will
-  // silently create a new attribute if not exists.
-  AnyAttr &attr(const std::string &name) const { return attrs_[name]; }
-
-  int id() const { return id_; }
-
-  // The Protobuf description is set/get with a void* to decouple Node interface
-  // from a specific kind of Protobuf message.
-  void SetPbDesc(void *pb) { attr("pb_desc").Pointer() = pb; }
-  void *pb_desc() const { return attr("pb_desc").Pointer(); }
-
-  void SetPbMsg(const std::string &s) { attr("pb_msg").String() = s; }
-  const std::string &pb_msg() const { return attr("pb_msg").String(); }
-
-  void SetDeleted() { deleted_ = true; }
-  bool deleted() const { return deleted_; }
-
-  void SetName(const std::string &name) { name_ = name; }
-  const std::string &name() const { return name_; }
-
-  void SetType(Type type) { type_ = type; }
-  Type type() const { return type_; }
-
-  // Input links.
-  std::vector<Node *> inlinks;
-  // Output links.
-  std::vector<Node *> outlinks;
-
-  // Type checks.
-  bool IsFunction() const { return type_ == Node::Type::kFunction; }
-  bool IsValue() const { return type_ == Node::Type::kValue; }
-  bool IsFunctionBlock() const { return type_ == Node::Type::kFunctionBlock; }
-
-  virtual ~Node() {}
-
-  friend class NodeMap;
-
-  PADDLE_DISALLOW_COPY_AND_ASSIGN(Node);
-
- protected:
-  // The id number not the name is a node's unique identifier in the computation
-  // graph.
-  int id_{-1};
-  std::string name_;
-  Type type_{Type::kNone};
-  // Mark this node is deleted by some pass.
-  bool deleted_{false};
-  mutable std::unordered_map<std::string, AnyAttr> attrs_;
-};
-
-class Function;
-/*
- * Value represents a value node, it has some attributes including dims, data
- * type and so on.
- */
-class Value : public Node {
- public:
-  enum class DataType { kInt32, kInt64, kFloat32, kFloat64 };
-  using Dims = std::vector<int>;
-
-  void SetDataType(DataType data_type) { data_type_ = data_type; }
-  DataType data_type() const { return data_type_; }
-
-  void SetDims(const Dims &dims) { dims_ = dims; }
-  const Dims &dims() const { return dims_; }
-
-  Device device() const { return device_; }
-  void SetDevice(Device device) { device_ = device; }
-
-  std::vector<Dot::Attr> dot_attrs() const override;
-
-  PADDLE_DISALLOW_COPY_AND_ASSIGN(Value);
-
- protected:
-  Value() { SetType(Node::Type::kValue); }
-  friend class NodeMap;
-
- private:
-  DataType data_type_;
-  Dims dims_;
-  Device device_;
-};
-
-/*
- * Function represents any kind of executable concepts that takes several Values
- * as input, and outputs several Values.
- */
-class Function : public Node {
- public:
-  std::vector<Dot::Attr> dot_attrs() const override;
-
-  // Get the operator's type from Desc.
-  const std::string &func_type() const { return func_type_; }
-  // Set the operator's type.
-  void SetFuncType(const std::string &func_type) { func_type_ = func_type; }
-
-  PADDLE_DISALLOW_COPY_AND_ASSIGN(Function);
-
- protected:
-  std::string func_type_;
-  Function() { SetType(Node::Type::kFunction); }
-  friend class NodeMap;
-};
-
-/*
- * FunctionBlock is a Node that contains a sub-graph multiple Node.
- */
-struct FunctionBlock : public Node {
-  std::string repr() const override { return "block-" + std::to_string(id()); }
-  std::vector<Node *> subgraph;
-
- protected:
-  FunctionBlock() { SetType(Node::Type::kFunctionBlock); }
-  friend class NodeMap;
-};
-
-class NodeMap {
- public:
-  // Create a new node with type.
-  Node *Create(Node::Type type);
-
-  // Get a node by its id.
-  Node *GetMutable(size_t id);
-
-  const Node &Get(size_t id) const;
-
-  void Delete(size_t id);
-
-  const std::vector<std::unique_ptr<Node>> &nodes() const { return nodes_; }
-
-  size_t size() const { return nodes_.size(); }
-
- private:
-  std::vector<std::unique_ptr<Node>> nodes_;
-  std::unordered_map<std::string, size_t> map_;
-};
-
-}  // namespace analysis
-}  // namespace inference
-}  // namespace paddle
diff --git a/paddle/fluid/inference/analysis/node_tester.cc b/paddle/fluid/inference/analysis/node_tester.cc
deleted file mode 100644
index 9207c15373fb4264ff0e738e93ae88e1c08b554c..0000000000000000000000000000000000000000
--- a/paddle/fluid/inference/analysis/node_tester.cc
+++ /dev/null
@@ -1,55 +0,0 @@
-/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-
-#include "paddle/fluid/inference/analysis/node.h"
-
-#include <gtest/gtest.h>
-
-namespace paddle {
-namespace inference {
-namespace analysis {
-
-TEST(NodeAttr, bool) {
-  AnyAttr x;
-  x.Bool() = true;
-  ASSERT_EQ(x.Bool(), true);
-}
-
-TEST(NodeAttr, int32) {
-  AnyAttr x;
-  x.Int32() = 32;
-  ASSERT_EQ(x.Int32(), 32);
-}
-
-TEST(NodeAttr, string) {
-  AnyAttr x;
-  x.String() = "Hello";
-  ASSERT_EQ(x.String(), "Hello");
-}
-
-TEST(Node, Attr) {
-  // Node is an abstract class, use Value instead for they share the same Attr
-  // logic.
- NodeMap nodes; - auto* node = nodes.Create(Node::Type::kValue); - node->attr("v0").Int32() = 2008; - ASSERT_EQ(node->attr("v0").Int32(), 2008); - - node->attr("str").String() = "hello world"; - ASSERT_EQ(node->attr("str").String(), "hello world"); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/pass_manager.cc b/paddle/fluid/inference/analysis/pass_manager.cc deleted file mode 100644 index a6ac0ee49f8f408faa7a17bf5ef5d2799a9a6238..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/pass_manager.cc +++ /dev/null @@ -1,47 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/pass_manager.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" -#include "paddle/fluid/string/pretty_log.h" - -namespace paddle { -namespace inference { -namespace analysis { - -bool PassManager::Initialize(Argument* argument) { - argument_ = argument; - for (auto& pass : data_) { - VLOG(3) << "Initializing pass [" << pass->repr() << "]"; - if (!pass->Initialize(argument)) { - LOG(ERROR) << "Failed to initialize pass [" << pass->repr() << "]"; - return false; - } - } - return true; -} - -void DfgPassManager::RunAll() { - PADDLE_ENFORCE(argument_); - VLOG(3) << "Total " << data_.size() << " Analysys passes"; - for (auto& pass : data_) { - string::PrettyLogEndl(string::Style::H1(), "* Running Analysis pass [%s]", - pass->repr()); - pass->Run(argument_->main_dfg.get()); - } -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/pass_manager.h b/paddle/fluid/inference/analysis/pass_manager.h deleted file mode 100644 index 412747c4fcce73303703f586f7a04edf4cc5ee76..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/pass_manager.h +++ /dev/null @@ -1,94 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -/* - * This file defines the logic of pass management. The analysis for inference is - * a pipeline of Passes, a PassManager is a agency that helps to manage the - * executation of the Passes. - * - * There are two modes of Passes, the first one is called NodePass and takes - * an Node as input and output; the second one is called DFGPass and takes a - * DFG(Data Flow Graph) as input and output. 
It is hard to put all the passes in - * the same pipeline, there are two kinds of PassManagers, both takes a DFG as - * input and output a DFG, but the Passes inside are different: - * - * 1. NodePassManager: the passes inside are all NodePasses, it can have - * different graph trivial algorithm, for example, DFS_NodePassManager will - * trigger the passes in depth first order; - * 2. DfgPassManager: the passes inside are all DfgPasses. - */ - -#pragma once - -#include -#include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/inference/analysis/analysis_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * PassManager is the base class for all pass managers, a pass manager has - * several Pass-es registered, and execute them in the linear order. - */ -class PassManager : public OrderedRegistry { - public: - PassManager() = default; - // Call all the passes' Initialize methods. The desc and data_flow_graph are - // globally shared, so pass them as the arguemnts for all the pass managers. - virtual bool Initialize(const Argument& argument) { return false; } - - virtual bool Initialize(Argument* argument); - - // Call all the passes' Finalize methods. - virtual bool Finalize() { - for (auto& pass : data_) { - if (!pass->Finalize()) { - LOG(ERROR) << "Failed to finalize pass [" << pass->repr() << "]"; - return false; - } - } - return true; - } - - // Run all the passes. - virtual void RunAll() = 0; - - // Short identifier. - virtual std::string repr() const = 0; - // Long description. - virtual std::string description() const = 0; - - virtual ~PassManager() = default; - - protected: - Argument* argument_{nullptr}; -}; - -/* - * A pass manager that process a DFG. - */ -class DfgPassManager : public PassManager { - public: - DfgPassManager() = default; - - void RunAll() override; - - virtual ~DfgPassManager() = default; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/pass_manager_tester.cc b/paddle/fluid/inference/analysis/pass_manager_tester.cc deleted file mode 100644 index 72b0fbf7e571ec97a0ea093d01449c1d5ddb9b91..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/pass_manager_tester.cc +++ /dev/null @@ -1,54 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include - -#include "paddle/fluid/inference/analysis/data_flow_graph_to_fluid_pass.h" -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" -#include "paddle/fluid/inference/analysis/pass_manager.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -class TestDfgPassManager final : public DfgPassManager { - public: - TestDfgPassManager() = default; - virtual ~TestDfgPassManager() = default; - // Short identifier. 
- std::string repr() const override { return "test-pass-manager"; } - // Long description. - std::string description() const override { return "test doc"; } -}; - -TEST(PassManager, DFG_pass_manager) { - TestDfgPassManager manager; - DFG_GraphvizDrawPass::Config config("./", "dfg.dot"); - - manager.Register("fluid-to-flow-graph", new FluidToDataFlowGraphPass); - manager.Register("graphviz", new DFG_GraphvizDrawPass(config)); - manager.Register("dfg-to-fluid", new DataFlowGraphToFluidPass); - - Argument argument(FLAGS_inference_model_dir); - - ASSERT_TRUE(&argument); - ASSERT_TRUE(manager.Initialize(&argument)); - manager.RunAll(); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/CMakeLists.txt b/paddle/fluid/inference/analysis/passes/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..a30c27b1183a75de8c0bb50ef3617d747b239fae --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/CMakeLists.txt @@ -0,0 +1,9 @@ +cc_library(ir_graph_build_pass SRCS ir_graph_build_pass.cc DEPS analysis_pass argument ir_pass_manager) +cc_library(ir_analysis_pass SRCS ir_analysis_pass.cc DEPS analysis_pass argument ir_pass_manager) +cc_library(analysis_passes SRCS passes.cc DEPS ir_graph_build_pass ir_analysis_pass) + +set(analysis_deps ${analysis_deps} + ir_graph_build_pass + ir_analysis_pass + analysis_passes + CACHE INTERNAL "") diff --git a/paddle/fluid/inference/analysis/passes/ir_analysis_compose_pass.cc b/paddle/fluid/inference/analysis/passes/ir_analysis_compose_pass.cc new file mode 100644 index 0000000000000000000000000000000000000000..108cb6f74b1208395a4faabdf6184152c300d244 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/ir_analysis_compose_pass.cc @@ -0,0 +1,84 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
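[Editor's illustration, not part of the patch] The new files below all follow the same slimmed-down AnalysisPass contract (a `RunImpl` hook plus a short `repr` identifier) that replaces the `Initialize`/`Run`/`Finalize` pass-manager protocol deleted above. A minimal sketch of the shape such a pass takes; `Argument`, the base class, and `DummyPass` here are simplified stand-ins, not the real analysis_pass.h API:

```cpp
#include <iostream>
#include <string>

// Simplified stand-ins (illustrative only).
struct Argument {};

class AnalysisPass {
 public:
  virtual ~AnalysisPass() = default;
  virtual void RunImpl(Argument *argument) = 0;  // all work happens here
  virtual std::string repr() const = 0;          // short identifier
};

// A concrete pass only needs to implement the hook and its identifier.
class DummyPass : public AnalysisPass {
 public:
  void RunImpl(Argument * /*argument*/) override {
    std::cout << "running " << repr() << "\n";
  }
  std::string repr() const override { return "dummy-pass"; }
};

int main() {
  Argument arg;
  DummyPass pass;
  pass.RunImpl(&arg);
}
```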
+ +#include "paddle/fluid/inference/analysis/passes/ir_analysis_compose_pass.h" +#include +#include +#include "paddle/fluid/framework/ir/fuse_pass_base.h" +#include "paddle/fluid/framework/ir/pass.h" +#include "paddle/fluid/inference/analysis/ir_pass_manager.h" +#include "paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h" +#include "paddle/fluid/string/pretty_log.h" + +namespace paddle { +namespace inference { +namespace analysis { + +void IrAnalysisComposePass::RunImpl(Argument *argument) { + ARGUMENT_CHECK_FIELD(argument, ir_analysis_passes); + if (argument->use_tensorrt_valid() && argument->use_tensorrt()) { + InitTensorRTAttrs(argument); + } + ApplyIrPasses(argument); + CollectFusionStatis(argument); +} + +std::string IrAnalysisComposePass::repr() const { + return "ir-analysis-compose-pass"; +} + +void IrAnalysisComposePass::InitTensorRTAttrs(Argument *argument) { + if (argument->use_tensorrt_valid() && argument->use_tensorrt()) { + LOG(INFO) << "Initing TensorRT pass"; + argument->SetTensorRtNodeTeller([](const framework::ir::Node *node) { + std::unordered_set teller_set( + {"mul", "conv2d", "pool2d", "relu", "softmax", "sigmoid", + "depthwise_conv2d", "batch_norm", "concat", "tanh", "pad", + "elementwise_add", "elementwise_mul", "dropout", "split", "prelu", + "conv2d_transpose", "leaky_relu"}); + if (!node->IsOp()) return false; + + if (teller_set.count(node->Op()->Type())) { + return true; + } else { + return false; + } + }); + } +} + +void IrAnalysisComposePass::ApplyIrPasses(Argument *argument) { + std::vector passes({ + "ir_graph_build_pass", "ir_analysis_pass", + }); + for (const auto &pass : passes) { + VLOG(2) << "Run pass " << pass; + auto *the_pass = PassRegistry::Global().Retreive(pass); + the_pass->Run(argument); + } +} + +void IrAnalysisComposePass::CollectFusionStatis(Argument *argument) { + if (!argument->main_graph().Has(framework::ir::kFuseStatisAttr)) { + LOG(INFO) << "argument has no fuse statis"; + return; + } + argument->SetFusionStatis( + argument->main_graph().Get( + framework::ir::kFuseStatisAttr)); +} + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/model_store_pass.h b/paddle/fluid/inference/analysis/passes/ir_analysis_compose_pass.h similarity index 53% rename from paddle/fluid/inference/analysis/model_store_pass.h rename to paddle/fluid/inference/analysis/passes/ir_analysis_compose_pass.h index f14b49e09c2f8e79c6fc4accdbf17f4f7a9bb1a3..53e2ebb0038a5c105f68a0146b3da90a6ae34af8 100644 --- a/paddle/fluid/inference/analysis/model_store_pass.h +++ b/paddle/fluid/inference/analysis/passes/ir_analysis_compose_pass.h @@ -12,42 +12,35 @@ // See the License for the specific language governing permissions and // limitations under the License. -/* - * This file defines ModelStorePass, which store the runtime DFG to a Paddle - * model in the disk, and that model can be reloaded for prediction. - */ - #pragma once + #include +#include #include "paddle/fluid/inference/analysis/analysis_pass.h" +#include "paddle/fluid/inference/analysis/passes/passes.h" namespace paddle { namespace inference { namespace analysis { -class ModelStorePass : public DataFlowGraphPass { +/* + * The analysis pass to run a list of IR passes (like a function call). + * Currently, it should be the first pass of analysis phase. 
+ */
+class IrAnalysisComposePass : public AnalysisPass {
  public:
-  bool Initialize(Argument* argument) override {
-    if (!argument) {
-      LOG(ERROR) << "invalid argument";
-      return false;
-    }
-    argument_ = argument;
-    return true;
-  }
+  void RunImpl(Argument* argument) override;
+  std::string repr() const override;
 
-  void Run(DataFlowGraph* x) override;
+ private:
+  void InitTensorRTAttrs(Argument* argument);
 
-  std::string repr() const override { return "DFG-store-pass"; }
-  std::string description() const override {
-    return R"DD(This file defines ModelStorePass, which store the runtime DFG to a Paddle
- model in the disk, and that model can be reloaded for prediction again.)DD";
-  }
+  void ApplyIrPasses(Argument* argument);
 
-  bool Finalize() override;
+  void CollectFusionStatis(Argument* argument);
 
- private:
-  Argument* argument_{nullptr};
+  // Assign a Scope for IR passes to modify the weights.
+  void AssignScopeToModify(Argument* argument);
 };
 
 }  // namespace analysis
diff --git a/paddle/fluid/inference/analysis/passes/ir_analysis_pass.cc b/paddle/fluid/inference/analysis/passes/ir_analysis_pass.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e327bd39f0ae0b8fbe3b189e4bb26a23c44d910c
--- /dev/null
+++ b/paddle/fluid/inference/analysis/passes/ir_analysis_pass.cc
@@ -0,0 +1,43 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/inference/analysis/passes/ir_analysis_pass.h"
+#include "paddle/fluid/inference/analysis/ir_pass_manager.h"
+
+namespace paddle {
+namespace inference {
+namespace analysis {
+
+void IrAnalysisPass::RunImpl(Argument* argument) {
+  ARGUMENT_CHECK_FIELD(argument, ir_analysis_passes);
+  ARGUMENT_CHECK_FIELD(argument, main_program);
+  ARGUMENT_CHECK_FIELD(argument, scope);
+
+  auto* the_graph = argument->ReleaseMainGraph();
+  auto graph = std::unique_ptr<Graph>(the_graph);
+
+  // Apply passes.
+  IRPassManager the_ir_manager(argument);
+  graph = the_ir_manager.Apply(std::move(graph));
+  PADDLE_ENFORCE_GT(graph->Nodes().size(), 0);
+  argument->SetIrAnalyzedProgram(new framework::proto::ProgramDesc(
+      the_ir_manager.AcquireProgram(&graph, argument->main_program())));
+  argument->SetMainGraph(graph.release());
+}
+
+std::string IrAnalysisPass::repr() const { return "ir-analysis-pass"; }
+
+}  // namespace analysis
+}  // namespace inference
+}  // namespace paddle
diff --git a/paddle/fluid/inference/analysis/node_attr_flags.h b/paddle/fluid/inference/analysis/passes/ir_analysis_pass.h
similarity index 70%
rename from paddle/fluid/inference/analysis/node_attr_flags.h
rename to paddle/fluid/inference/analysis/passes/ir_analysis_pass.h
index a3f70e5419a66969e8fb20152a8a8ace39316f57..d8a7449807585257c153d3c8958555ea2306afa3 100644
--- a/paddle/fluid/inference/analysis/node_attr_flags.h
+++ b/paddle/fluid/inference/analysis/passes/ir_analysis_pass.h
@@ -12,20 +12,25 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 
-/*
- * This file contains all the flags that declared in Node::Attr.
- *
- * The Node::Attr is designed to share information between different passes, one
- * can get other's attributes in a Node by the flags in this file.
- */
 #pragma once
+
+#include <string>
+#include "paddle/fluid/inference/analysis/analysis_pass.h"
+
 namespace paddle {
 namespace inference {
 namespace analysis {
 
-#define DECLARE_NODE_ATTR(flag__) const char ATTR_##flag__[] = #flag__;
-
-DECLARE_NODE_ATTR(supported_by_tensorrt)  // bool
+/*
+ * Perform IR analysis passes.
+ *
+ * It is used to fuse some ops and optimize the graph before execution.
+ */
+class IrAnalysisPass : public AnalysisPass {
+ public:
+  void RunImpl(Argument* argument) override;
+  std::string repr() const override;
+};
 
 }  // namespace analysis
 }  // namespace inference
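[Editor's illustration, not part of the patch] The node teller registered in IrAnalysisComposePass::InitTensorRTAttrs above is just a predicate over op types: the subgraph detector asks it for every node and fuses maximal regions that answer true. A self-contained sketch of that decision (the `Node` struct here is a toy, not framework::ir::Node):

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

struct Node {  // toy node: just an op type
  std::string op_type;
};

int main() {
  // Same idea as the teller_set in InitTensorRTAttrs above.
  std::unordered_set<std::string> teller_set{"mul", "conv2d", "relu"};
  std::function<bool(const Node &)> teller = [&](const Node &n) {
    return teller_set.count(n.op_type) > 0;
  };

  // Ops the teller accepts would be grouped into a TensorRT subgraph.
  std::vector<Node> graph{{"feed"}, {"conv2d"}, {"relu"}, {"fetch"}};
  for (const auto &n : graph) {
    std::cout << n.op_type << (teller(n) ? " -> TensorRT" : " -> fluid")
              << "\n";
  }
}
```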
+ +#include "paddle/fluid/inference/analysis/passes/ir_graph_build_pass.h" +#include +#include +#include "paddle/fluid/framework/executor.h" +#include "paddle/fluid/inference/io.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace inference { + +extern void ReadBinaryFile(const std::string &filename, std::string *contents); + +namespace analysis { + +void IrGraphBuildPass::RunImpl(Argument *argument) { + if (!argument->scope_valid()) { + argument->SetScope(new framework::Scope); + } + PADDLE_ENFORCE(argument->use_gpu_valid()); + + // The load program should run on the same device with the inference program, + // so that the parameters will on the same device, or they will keep copying + // between difference devices. + platform::Place place; + if (argument->use_gpu()) { + PADDLE_ENFORCE(argument->gpu_device_id_valid()); + place = platform::CUDAPlace(argument->gpu_device_id()); + } else { + place = platform::CPUPlace(); + } + + if (argument->model_dir_valid()) { + auto program = + LoadModel(argument->model_dir(), argument->scope_ptr(), place); + argument->SetMainProgram(program.release()); + } else if (argument->model_program_path_valid() && + argument->model_params_path_valid()) { + auto program = + LoadModel(argument->model_program_path(), argument->model_params_path(), + argument->scope_ptr(), place); + argument->SetMainProgram(program.release()); + } else { + PADDLE_THROW( + "either model_dir or (program path and parameter path) should be set."); + } + + auto graph = std::unique_ptr(new Graph(argument->main_program())); + argument->SetMainGraph(graph.release()); + argument->main_graph().Set(framework::ir::kParamScopeAttr, + new framework::Scope *(argument->scope_ptr())); +} + +std::unique_ptr IrGraphBuildPass::LoadModel( + const std::string &path, framework::Scope *scope, + const platform::Place &place) { + framework::Executor exe(place); + return Load(&exe, scope, path); +} + +std::unique_ptr IrGraphBuildPass::LoadModel( + const std::string &program_path, const std::string ¶ms_path, + framework::Scope *scope, const platform::Place &place) { + framework::Executor exe(place); + return Load(&exe, scope, program_path, params_path); +} + +std::string IrGraphBuildPass::repr() const { return "ir-graph-build-pass"; } + +} // namespace analysis +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/analysis/passes/ir_graph_build_pass.h b/paddle/fluid/inference/analysis/passes/ir_graph_build_pass.h new file mode 100644 index 0000000000000000000000000000000000000000..271e64fce579bc9001b1dd632576571cec949752 --- /dev/null +++ b/paddle/fluid/inference/analysis/passes/ir_graph_build_pass.h @@ -0,0 +1,48 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+
+#pragma once
+
+#include <string>
+#include "paddle/fluid/framework/scope.h"
+#include "paddle/fluid/inference/analysis/analysis_pass.h"
+#include "paddle/fluid/platform/place.h"
+
+namespace paddle {
+namespace inference {
+namespace analysis {
+
+/*
+ * Load the program and parameters from disk into memory.
+ */
+class IrGraphBuildPass : public AnalysisPass {
+ public:
+  void RunImpl(Argument *argument) override;
+
+  std::string repr() const override;
+
+ private:
+  std::unique_ptr<framework::ProgramDesc> LoadModel(
+      const std::string &path, framework::Scope *scope,
+      const platform::Place &place);
+  std::unique_ptr<framework::ProgramDesc> LoadModel(
+      const std::string &program_path, const std::string &params_path,
+      framework::Scope *scope, const platform::Place &place);
+
+  std::string model_binary_str_;
+};
+
+}  // namespace analysis
+}  // namespace inference
+}  // namespace paddle
diff --git a/paddle/fluid/inference/analysis/passes/passes.cc b/paddle/fluid/inference/analysis/passes/passes.cc
new file mode 100644
index 0000000000000000000000000000000000000000..2ef515f45f2483df8d1238b4758d6729d0299ce9
--- /dev/null
+++ b/paddle/fluid/inference/analysis/passes/passes.cc
@@ -0,0 +1,34 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/inference/analysis/passes/passes.h"
+#include "paddle/fluid/inference/analysis/passes/ir_analysis_compose_pass.cc"
+#include "paddle/fluid/inference/analysis/passes/ir_analysis_pass.h"
+#include "paddle/fluid/inference/analysis/passes/ir_graph_build_pass.h"
+
+namespace paddle {
+namespace inference {
+namespace analysis {
+PassRegistry::PassRegistry() {
+  passes_.emplace("ir_analysis_pass",
+                  std::unique_ptr<AnalysisPass>(new IrAnalysisPass));
+  passes_.emplace("ir_graph_build_pass",
+                  std::unique_ptr<AnalysisPass>(new IrGraphBuildPass));
+  passes_.emplace("ir_analysis_compose_pass",
+                  std::unique_ptr<AnalysisPass>(new IrAnalysisComposePass));
+}
+
+}  // namespace analysis
+}  // namespace inference
+}  // namespace paddle
diff --git a/paddle/fluid/inference/analysis/fluid_to_ir_pass_tester.cc b/paddle/fluid/inference/analysis/passes/passes.h
similarity index 61%
rename from paddle/fluid/inference/analysis/fluid_to_ir_pass_tester.cc
rename to paddle/fluid/inference/analysis/passes/passes.h
index 367c25805d05f8d10fb8341158760ac6356a5c48..ea07e0dcbd992c9d10c6662909798ef79a01e3a7 100644
--- a/paddle/fluid/inference/analysis/fluid_to_ir_pass_tester.cc
+++ b/paddle/fluid/inference/analysis/passes/passes.h
@@ -12,24 +12,30 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
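[Editor's illustration, not part of the patch] The PassRegistry populated in passes.cc above is a string-keyed singleton registry; ApplyIrPasses in ir_analysis_compose_pass.cc drives the whole analysis phase by looking passes up by name and running them in order. A minimal sketch of that lookup-and-run pattern with toy types (mirroring `Global`/`Retreive` in the header below, which keeps the original spelling of `Retreive`):

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>

struct Pass {
  virtual ~Pass() = default;
  virtual void Run() = 0;
};
struct BuildPass : Pass {
  void Run() override { std::cout << "build graph\n"; }
};
struct AnalyzePass : Pass {
  void Run() override { std::cout << "analyze graph\n"; }
};

struct Registry {
  Registry() {  // mirrors PassRegistry::PassRegistry() above
    passes_.emplace("ir_graph_build_pass", std::unique_ptr<Pass>(new BuildPass));
    passes_.emplace("ir_analysis_pass", std::unique_ptr<Pass>(new AnalyzePass));
  }
  Pass *Retrieve(const std::string &name) { return passes_[name].get(); }
  static Registry &Global() {
    static Registry x;
    return x;
  }

 private:
  std::unordered_map<std::string, std::unique_ptr<Pass>> passes_;
};

int main() {
  // Same order as ApplyIrPasses in ir_analysis_compose_pass.cc.
  for (const std::string name : {"ir_graph_build_pass", "ir_analysis_pass"}) {
    Registry::Global().Retrieve(name)->Run();
  }
}
```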
-#include "paddle/fluid/inference/analysis/fluid_to_ir_pass.h" +#pragma once -#include -#include "paddle/fluid/inference/analysis/ut_helper.h" -#include "paddle/fluid/inference/api/paddle_inference_pass.h" +#include +#include "paddle/fluid/inference/analysis/analysis_pass.h" namespace paddle { namespace inference { namespace analysis { -TEST(FluidToIrPass, Test) { - FluidToIrPass pass; - Argument argument(FLAGS_inference_model_dir); - argument.Set(kFluidToIrPassesAttr, - new std::vector({"infer_clean_graph_pass"})); - pass.Initialize(&argument); - pass.Run(argument.main_dfg.get()); -} +struct PassRegistry { + PassRegistry(); + + AnalysisPass* Retreive(const std::string& pass_type) { + return passes_[pass_type].get(); + } + + static PassRegistry& Global() { + static auto* x = new PassRegistry; + return *x; + } + + private: + std::unordered_map> passes_; +}; } // namespace analysis } // namespace inference diff --git a/paddle/fluid/inference/analysis/subgraph_splitter.h b/paddle/fluid/inference/analysis/subgraph_splitter.h deleted file mode 100644 index 76e4fda0249e03c617d1b37c079dcd97f21387c1..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/subgraph_splitter.h +++ /dev/null @@ -1,88 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -/* - * This file defines the the class to partition a graph. - */ - -#pragma once - -#include - -#include "paddle/fluid/inference/analysis/argument.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" -#include "paddle/fluid/inference/analysis/node.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * Detect the nodes in a sub-graph that meet some conditions. This class doesn't - * modify the graph. - */ -class SubGraphSplitter { - public: - static const char *kMarkerAttrName; - // Tell whether a node is inside a sub-graph. - using NodeInsideSubgraphTeller = std::function; - - SubGraphSplitter(DataFlowGraph *graph, const NodeInsideSubgraphTeller &teller) - : graph_(graph), node_inside_subgraph_teller_(teller) {} - - std::vector> operator()(); - - protected: - // Mark the nodes inside the accepted sub-graph using - // node_inside_subgraph_teller. - void MarkNodesInsideSubGraph(); - - // Merge the marked nodes into sub-graphs and return the sub-graphs. - std::vector> ExtractSubGraphs(); - - private: - DataFlowGraph *graph_; - NodeInsideSubgraphTeller node_inside_subgraph_teller_; -}; - -/* - * SubGraphFuse - Replace some nodes with the sub-graph node they are inside. To - * some extent, the TensorRT engine is just a fusion op for a model. - */ -class SubGraphFuse { - public: - using NodeInsideSubgraphTeller = SubGraphSplitter::NodeInsideSubgraphTeller; - - SubGraphFuse(DataFlowGraph *graph, const NodeInsideSubgraphTeller &teller, - Argument *argument) - : graph_(graph), - node_inside_subgraph_teller_(teller), - argument_(argument) {} - - // The main method which run all the logic. 
- void operator()(); - - protected: - // Remove the nodes inside sub-graphs and replace with the SubGraphNode. - void ReplaceNodesWithSubGraphs(); - - private: - DataFlowGraph *graph_; - NodeInsideSubgraphTeller node_inside_subgraph_teller_; - Argument *argument_; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/subgraph_splitter_tester.cc b/paddle/fluid/inference/analysis/subgraph_splitter_tester.cc deleted file mode 100644 index e1dc89fab5fb76d456b07c316ab1cabe6de23b26..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/subgraph_splitter_tester.cc +++ /dev/null @@ -1,92 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/inference/analysis/subgraph_splitter.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -SubGraphSplitter::NodeInsideSubgraphTeller teller = [](const Node* node) { - if (node->type() != Node::Type::kFunction) return false; - const auto* func = static_cast(node); - if (func->func_type() == "elementwise_add" || func->func_type() == "relu" || - func->func_type() == "conv2d" || func->func_type() == "mul" || - func->func_type() == "sigmoid" || func->func_type() == "softmax") { - LOG(INFO) << "sub-graph marked " << node->repr(); - return true; - } - return false; -}; - -TEST(SubGraphSplitter, Split) { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - auto dfg = ProgramDescToDFG(desc); - LOG(INFO) << "spliter\n" << dfg.DotString(); - - ASSERT_GT(dfg.nodes.size(), 5UL); - - auto subgraphs = SubGraphSplitter(&dfg, teller)(); - - // Check the number of the marked nodes. - int marked_nodes = 0; - for (auto& node : dfg.nodes.nodes()) { - if (node->IsFunction() && - node->attr(SubGraphSplitter::kMarkerAttrName).Bool()) { - ++marked_nodes; - } - } - EXPECT_EQ(marked_nodes, 6); - - // For human debug. - for (auto& subgraph : subgraphs) { - LOG(INFO) << "subgraph size " << subgraph.size(); - for (auto* node : subgraph) { - LOG(INFO) << "node " << node->repr(); - } - } - - ASSERT_EQ(subgraphs.size(), 1UL); - // The last sub-graph has 5 Functions. - ASSERT_EQ(subgraphs.back().size(), 6UL); -} - -TEST(SubGraphSplitter, Fuse) { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - auto dfg = ProgramDescToDFG(desc); - Argument argument; - argument.Set("minimum_subgraph_size", new int(3)); - - size_t count0 = dfg.nodes.size(); - - SubGraphFuse fuse(&dfg, teller, &argument); - fuse(); - - int count1 = 0; - for (auto& node : dfg.nodes.nodes()) { - if (node->deleted()) { - LOG(INFO) << "deleted " << node->repr(); - } - count1 += node->deleted(); - } - - // At least one nodes should be deleted. 
- ASSERT_EQ(dfg.nodes.size(), count0 + 1); // added a new FunctionBlock - ASSERT_EQ(11, count1); -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.cc b/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.cc deleted file mode 100644 index 174c8513f92cf869419f04cab5a54af65e9673b8..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.cc +++ /dev/null @@ -1,80 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include - -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/analysis/node_attr_flags.h" -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h" - -namespace paddle { -namespace inference { -namespace analysis { - -void TensorRTSubgraphNodeMarkPass::Run(DataFlowGraph *graph) { - for (auto &node : graph->nodes.nodes()) { - node->attr(ATTR_supported_by_tensorrt).Bool() = teller_(node.get()); - } -} - -class DfgDebuggerPass : public DFG_GraphvizDrawPass { - public: - explicit DfgDebuggerPass(const DFG_GraphvizDrawPass::Config &config) - : DFG_GraphvizDrawPass(config) {} - - std::string repr() const override { - return "tensorrt-subgraph-node-mark-debugger"; - } - - bool Finalize() override { return true; } - - protected: - std::string Draw(DataFlowGraph *graph) override { - Dot dot; - // Add nodes - for (size_t i = 0; i < graph->nodes.size(); i++) { - const Node &node = graph->nodes.Get(i); - if (config_.display_deleted_node || !node.deleted()) { - auto dot_attr = node.dot_attrs(); - if (node.attr(ATTR_supported_by_tensorrt).Bool()) { - dot_attr.assign( - {Dot::Attr{"color", "green"}, Dot::Attr{"style", "filled"}}); - } - dot.AddNode(node.repr(), dot_attr); - } - } - // Add edges - for (size_t i = 0; i < graph->nodes.size(); i++) { - const Node &node = graph->nodes.Get(i); - if (!config_.display_deleted_node && node.deleted()) continue; - for (auto &in : node.inlinks) { - if (!config_.display_deleted_node && in->deleted()) continue; - dot.AddEdge(in->repr(), node.repr(), {}); - } - } - return dot.Build(); - } -}; - -AnalysisPass *TensorRTSubgraphNodeMarkPass::CreateGraphvizDebugerPass() const { - DFG_GraphvizDrawPass::Config config(FLAGS_IA_graphviz_log_root, - "tensorrt_marked_node"); - return new DfgDebuggerPass(config); -} -bool TensorRTSubgraphNodeMarkPass::Finalize() { return true; } - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h b/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h deleted file mode 100644 index c881a54c240538b68abdcb9060db69de3bf2b8bb..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h +++ /dev/null 
@@ -1,60 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -/* - * This file defines TensorRTSubgraphNodeMarkPass which helps to mark the ops - * that supported by TensorRT engine. - */ - -#pragma once - -#include -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/subgraph_splitter.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * Mark the operators that TensorRT engine supports. - */ -class TensorRTSubgraphNodeMarkPass : public DataFlowGraphPass { - public: - using teller_t = SubGraphSplitter::NodeInsideSubgraphTeller; - - explicit TensorRTSubgraphNodeMarkPass(const teller_t& teller) - : teller_(teller) {} - - bool Initialize(Argument* argument) override { return true; } - - // This class get a sub-graph as input and determine whether to transform this - // sub-graph into TensorRT. - void Run(DataFlowGraph* graph) override; - - std::string repr() const override { return "tensorrt-sub-subgraph-mark"; } - std::string description() const override { - return "tensorrt sub-graph mark pass"; - } - - AnalysisPass* CreateGraphvizDebugerPass() const override; - bool Finalize() override; - - private: - teller_t teller_; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass_tester.cc b/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass_tester.cc deleted file mode 100644 index c1d932878e559180af987594535959afdf475587..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass_tester.cc +++ /dev/null @@ -1,50 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. 
- -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_node_mark_pass.h" - -#include -#include "paddle/fluid/inference/analysis/node_attr_flags.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TEST(TensorRTSubgraphNodeMarkPass, test) { - // init - FluidToDataFlowGraphPass pass; - Argument argument(FLAGS_inference_model_dir); - ASSERT_TRUE(pass.Initialize(&argument)); - pass.Run(argument.main_dfg.get()); - - TensorRTSubgraphNodeMarkPass::teller_t teller = [](const Node* node) { - return node->IsFunction() && - static_cast(node)->func_type() == "mul"; - }; - TensorRTSubgraphNodeMarkPass pass1(teller); - ASSERT_TRUE(pass1.Initialize(&argument)); - pass1.Run(argument.main_dfg.get()); - - int counter{0}; - for (auto& node : argument.main_dfg->nodes.nodes()) { - counter += node->attr(ATTR_supported_by_tensorrt).Bool(); - } - ASSERT_EQ(counter, 2); - LOG(INFO) << counter << " nodes marked"; -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.cc b/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.cc deleted file mode 100644 index cc1746ecb34c983d219693bcec17c8789c38fa9f..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.cc +++ /dev/null @@ -1,36 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h" -#include "paddle/fluid/inference/analysis/subgraph_splitter.h" - -namespace paddle { -namespace inference { -namespace analysis { - -TensorRTSubGraphPass::TensorRTSubGraphPass( - const TensorRTSubGraphPass::NodeInsideSubgraphTeller &teller) - : node_inside_subgraph_teller_(teller) {} - -void TensorRTSubGraphPass::Run(DataFlowGraph *graph) { - SubGraphFuse(graph, node_inside_subgraph_teller_, argument_)(); - VLOG(4) << "debug info " - << graph->HumanReadableInfo(false /*show_values*/, - true /*show_functions*/); -} - -} // namespace analysis -} // namespace inference - -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h b/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h deleted file mode 100644 index 3545da9109d79964f36c3d7e738620cc2e0f9a6c..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h +++ /dev/null @@ -1,57 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once - -#include -#include "paddle/fluid/inference/analysis/analysis_pass.h" -#include "paddle/fluid/inference/analysis/node.h" -#include "paddle/fluid/inference/analysis/subgraph_splitter.h" - -namespace paddle { -namespace inference { -namespace analysis { - -/* - * Parse the graph and replace TensorRT supported nodes with SubGraphNode - */ -class TensorRTSubGraphPass : public DataFlowGraphPass { - public: - // Tell whether to transform a sub-graph into TensorRT. - using NodeInsideSubgraphTeller = SubGraphFuse::NodeInsideSubgraphTeller; - - explicit TensorRTSubGraphPass(const NodeInsideSubgraphTeller& teller); - - bool Initialize(Argument* argument) override { - argument_ = argument; - return true; - } - - // This class get a sub-graph as input and determine whether to transform this - // sub-graph into TensorRT. - void Run(DataFlowGraph* graph) override; - - bool Finalize() override { return true; } - - std::string repr() const override { return "tensorrt-sub-graph"; } - std::string description() const override { return "tensorrt sub graph pass"; } - - private: - NodeInsideSubgraphTeller node_inside_subgraph_teller_; - Argument* argument_; -}; - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass_tester.cc b/paddle/fluid/inference/analysis/tensorrt_subgraph_pass_tester.cc deleted file mode 100644 index 9748e24b06295a4e7c2995429e6588cd0f225fe6..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/analysis/tensorrt_subgraph_pass_tester.cc +++ /dev/null @@ -1,73 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
*/ - -#include "paddle/fluid/inference/analysis/tensorrt_subgraph_pass.h" - -#include -#include -#include "paddle/fluid/inference/analysis/dfg_graphviz_draw_pass.h" -#include "paddle/fluid/inference/analysis/ut_helper.h" - -namespace paddle { -namespace inference { -namespace analysis { - -DEFINE_string(dot_dir, "./", ""); - -TEST(TensorRTSubGraphPass, main) { - std::unordered_set teller_set( - {"elementwise_add", "mul", "sigmoid"}); - SubGraphSplitter::NodeInsideSubgraphTeller teller = [&](const Node* node) { - if (node->type() != Node::Type::kFunction) return false; - const auto* func = static_cast(node); - if (teller_set.count(func->func_type())) return true; - return false; - }; - - Argument argument(FLAGS_inference_model_dir); - argument.Set("minimum_subgraph_size", new int(0)); - argument.Set("max_batch_size", new int(3)); - argument.Set("workspace_size", new int(1 << 20)); - argument.Set("precision_mode", new std::string("FP32")); - - DFG_GraphvizDrawPass::Config config{FLAGS_dot_dir, "origin"}; - DFG_GraphvizDrawPass::Config config1{FLAGS_dot_dir, "fusion"}; - - DFG_GraphvizDrawPass dfg_pass(config); - DFG_GraphvizDrawPass dfg_pass1(config1); - FluidToDataFlowGraphPass pass0; - TensorRTSubGraphPass trt_pass(std::move(teller)); - - dfg_pass.Initialize(&argument); - dfg_pass1.Initialize(&argument); - pass0.Initialize(&argument); - trt_pass.Initialize(&argument); - - argument.main_dfg.reset(new DataFlowGraph); - pass0.Run(argument.main_dfg.get()); - dfg_pass.Run(argument.main_dfg.get()); - trt_pass.Run(argument.main_dfg.get()); - dfg_pass1.Run(argument.main_dfg.get()); - - // Check the TRT op's block desc - for (auto& node : argument.main_dfg->nodes.nodes()) { - if (node->IsFunctionBlock()) { - LOG(INFO) << "get function block"; - } - } -} - -} // namespace analysis -} // namespace inference -} // namespace paddle diff --git a/paddle/fluid/inference/analysis/ut_helper.h b/paddle/fluid/inference/analysis/ut_helper.h index 1073a6f686eaeeaaae2d93ab044149b7df518085..d599099a8050eaeabb8e0544b1bfe3b6b46b17ec 100644 --- a/paddle/fluid/inference/analysis/ut_helper.h +++ b/paddle/fluid/inference/analysis/ut_helper.h @@ -18,8 +18,6 @@ limitations under the License. 
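Both TensorRT testers above hinge on the same abstraction: a "teller" is a plain predicate over graph nodes that decides which ops TensorRT may absorb. A minimal sketch of that contract, using the Node/Function types from the analysis namespace (the whitelist is illustrative, not exhaustive):

    // A teller that whitelists a few op types, as in the test above.
    std::unordered_set<std::string> supported(
        {"elementwise_add", "mul", "sigmoid"});
    SubGraphSplitter::NodeInsideSubgraphTeller teller = [&](const Node* node) {
      if (!node->IsFunction()) return false;  // only Function nodes carry an op type
      const auto* func = static_cast<const Function*>(node);
      return supported.count(func->func_type()) > 0;
    };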
*/ #include #include #include "paddle/fluid/framework/executor.h" -#include "paddle/fluid/inference/analysis/data_flow_graph.h" -#include "paddle/fluid/inference/analysis/fluid_to_data_flow_graph_pass.h" #include "paddle/fluid/inference/analysis/helper.h" namespace paddle { @@ -32,29 +30,6 @@ namespace analysis { DEFINE_string(inference_model_dir, "", "inference test model dir"); -static DataFlowGraph ProgramDescToDFG( - const framework::proto::ProgramDesc& desc) { - DataFlowGraph graph; - FluidToDataFlowGraphPass pass; - Argument argument; - argument.fluid_model_dir.reset(new std::string(FLAGS_inference_model_dir)); - argument.origin_program_desc.reset(new framework::proto::ProgramDesc(desc)); - pass.Initialize(&argument); - pass.Run(&graph); - pass.Finalize(); - return graph; -} - -class DFG_Tester : public ::testing::Test { - protected: - void SetUp() override { - auto desc = LoadProgramDesc(FLAGS_inference_model_dir + "/__model__"); - argument.origin_program_desc.reset(new framework::proto::ProgramDesc(desc)); - } - - Argument argument; -}; - } // namespace analysis } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/api/CMakeLists.txt b/paddle/fluid/inference/api/CMakeLists.txt index 0ddd5d53f836131fe37d412fc867cb38f11ee2b5..eda251c5346a6d970ecd0956f976cbef41e6c1c1 100644 --- a/paddle/fluid/inference/api/CMakeLists.txt +++ b/paddle/fluid/inference/api/CMakeLists.txt @@ -18,57 +18,38 @@ if(APPLE) endif(APPLE) -set(inference_deps paddle_inference_api paddle_fluid_api analysis pass ir_pass_manager naive_executor ${GLOB_PASS_LIB}) +set(inference_deps paddle_inference_api paddle_fluid_api analysis pass ir_pass_manager naive_executor analysis_predictor ${GLOB_PASS_LIB}) if(WITH_GPU AND TENSORRT_FOUND) - set(inference_deps ${inference_deps} paddle_inference_tensorrt_subgraph_engine analysis_predictor) + set(inference_deps ${inference_deps} tensorrt_engine tensorrt_converter) endif() -function(inference_api_test TARGET_NAME) - if (WITH_TESTING) - set(options "") - set(oneValueArgs SRC) - set(multiValueArgs ARGS) - cmake_parse_arguments(inference_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) +cc_library(reset_tensor_array SRCS details/reset_tensor_array.cc DEPS lod_tensor scope) +cc_library(analysis_config SRCS analysis_config.cc DEPS lod_tensor paddle_pass_builder) +cc_library(paddle_pass_builder SRCS paddle_pass_builder.cc) +cc_library(analysis_predictor SRCS analysis_predictor.cc DEPS paddle_inference_api analysis naive_executor zero_copy_tensor reset_tensor_array analysis_config paddle_pass_builder ir_pass_manager) +cc_library(zero_copy_tensor SRCS details/zero_copy_tensor.cc DEPS scope lod_tensor enforce) +cc_library(zero_copy_tensor_dummy SRCS details/zero_copy_tensor_dummy.cc) +cc_library(paddle_inference_api SRCS api.cc api_impl.cc helper.cc DEPS + lod_tensor scope paddle_pass_builder reset_tensor_array analysis_config + analysis_config paddle_pass_builder zero_copy_tensor reset_tensor_array) - cc_test(${TARGET_NAME} - SRCS ${inference_test_SRC} - DEPS "${inference_deps}" - ARGS --dirname=${PYTHON_TESTS_DIR}/book/) - if(inference_test_ARGS) - set_tests_properties(${TARGET_NAME} - PROPERTIES DEPENDS "${inference_test_ARGS}") - endif() - endif(WITH_TESTING) -endfunction(inference_api_test) - -cc_library(paddle_inference_api SRCS api.cc api_impl.cc helper.cc DEPS lod_tensor scope) -cc_library(analysis_predictor SRCS analysis_predictor.cc DEPS paddle_inference_api analysis naive_executor zero_copy_tensor) -cc_library(zero_copy_tensor SRCS 
details/zero_copy_tensor.cc DEPS paddle_inference_api) -cc_library(zero_copy_tensor_dummy SRCS details/zero_copy_tensor_dummy.cc DEPS paddle_inference_api) cc_test(test_paddle_inference_api SRCS api_tester.cc DEPS paddle_inference_api) -inference_api_test(test_api_impl SRC api_impl_tester.cc - ARGS test_word2vec test_image_classification) - -set(PYTHON_TESTS_DIR ${PADDLE_BINARY_DIR}/python/paddle/fluid/tests) -cc_test(test_analysis_predictor SRCS analysis_predictor_tester.cc DEPS analysis_predictor ${inference_deps} paddle_inference_api - ARGS --dirname=${PYTHON_TESTS_DIR}/book) - -if(WITH_GPU AND TENSORRT_FOUND) -cc_library(paddle_inference_tensorrt_subgraph_engine - SRCS api_tensorrt_subgraph_engine.cc - DEPS paddle_inference_api analysis tensorrt_engine paddle_inference_api paddle_fluid_api tensorrt_converter zero_copy_tensor_dummy) - -inference_api_test(test_api_tensorrt_subgraph_engine SRC api_tensorrt_subgraph_engine_tester.cc ARGS test_word2vec) +if(WITH_TESTING) + inference_base_test(test_api_impl SRCS api_impl_tester.cc DEPS ${inference_deps} + ARGS --word2vec_dirname=${WORD2VEC_MODEL_DIR} --book_dirname=${PYTHON_TESTS_DIR}/book) + set_tests_properties(test_api_impl PROPERTIES DEPENDS test_image_classification) endif() +cc_test(test_analysis_predictor SRCS analysis_predictor_tester.cc DEPS analysis_predictor ${inference_deps} + ARGS --dirname=${WORD2VEC_MODEL_DIR}) if (WITH_ANAKIN AND WITH_MKL) # only needed in CI # compile the libinference_anakin_api.a and anakin.so. - cc_library(inference_anakin_api SRCS api.cc api_anakin_engine.cc DEPS anakin_shared anakin_saber mklml scope zero_copy_tensor_dummy) - cc_library(inference_anakin_api_shared SHARED SRCS api.cc api_anakin_engine.cc DEPS anakin_shared anakin_saber scope) + cc_library(inference_anakin_api SRCS api.cc api_anakin_engine.cc DEPS anakin_shared anakin_saber mklml zero_copy_tensor_dummy) + cc_library(inference_anakin_api_shared SHARED SRCS api.cc api_anakin_engine.cc DEPS anakin_shared anakin_saber zero_copy_tensor_dummy) function(anakin_target target_name) target_compile_options(${target_name} BEFORE PUBLIC ${ANAKIN_COMPILE_EXTRA_FLAGS}) endfunction() diff --git a/paddle/fluid/inference/api/README.md b/paddle/fluid/inference/api/README.md index 20969fac6c8f894ffb4a02b48f795e2a0dcbd096..a2d685d723bd9ab2b84969adb86e177a8754328d 100644 --- a/paddle/fluid/inference/api/README.md +++ b/paddle/fluid/inference/api/README.md @@ -2,25 +2,15 @@ Paddle inference offers the APIs in `C` and `C++` languages. -One can easily deploy a model trained by Paddle following the steps as below: +You can easily deploy a model trained by Paddle following the steps as below: 1. Optimize the native model; 2. Write some codes for deployment. +## The APIs -Let's explain the steps in detail. - -## Optimize the native Fluid Model - -The native model that get from the training phase needs to be optimized for that. - -- Clean the noise such as the cost operators that do not need inference; -- Prune unnecessary computation fork that has nothing to do with the output; -- Remove extraneous variables; -- Memory reuse for native Fluid executor; -- Translate the model storage format to some third-party engine's, so that the inference API can utilize the engine for acceleration; - -We have an official tool to do the optimization, call `paddle_inference_optimize --help` for more information. +All the released APIs are located in the `paddle_inference_api.h` header file. 
+The stable APIs are wrapped by `namespace paddle`, while the unstable APIs are protected by `namespace paddle::contrib`.
 
 ## Write some codes
diff --git a/paddle/fluid/inference/api/analysis_config.cc b/paddle/fluid/inference/api/analysis_config.cc
new file mode 100644
index 0000000000000000000000000000000000000000..dd75f0d9a65404908667d873786160ddaa73fa57
--- /dev/null
+++ b/paddle/fluid/inference/api/analysis_config.cc
@@ -0,0 +1,105 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/framework/lod_tensor.h"
+#include "paddle/fluid/framework/scope.h"
+#include "paddle/fluid/inference/api/paddle_inference_api.h"
+#include "paddle/fluid/platform/enforce.h"
+#include "paddle_pass_builder.h"  // NOLINT
+
+namespace paddle {
+
+PassStrategy *contrib::AnalysisConfig::pass_builder() const {
+  PADDLE_ENFORCE(
+      pass_builder_.get(),
+      "Should call constructor first, that will init the pass_builder_.");
+  return pass_builder_.get();
+}
+
+contrib::AnalysisConfig::AnalysisConfig(bool use_gpu) {
+  this->use_gpu = use_gpu;
+  if (use_gpu) {
+    pass_builder_.reset(new GpuPassStrategy);
+  } else {
+    pass_builder_.reset(new CpuPassStrategy);
+  }
+}
+
+contrib::AnalysisConfig::AnalysisConfig(const contrib::AnalysisConfig &other) {
+  // fields from Config
+  model_dir = other.model_dir;
+  // fields from NativeConfig
+  use_gpu = other.use_gpu;
+  device = other.device;
+  fraction_of_gpu_memory = other.fraction_of_gpu_memory;
+  prog_file = other.prog_file;
+  param_file = other.param_file;
+  specify_input_name = other.specify_input_name;
+  cpu_math_library_num_threads_ = other.cpu_math_library_num_threads_;
+  // fields from this.
+  enable_ir_optim = other.enable_ir_optim;
+  use_feed_fetch_ops = other.use_feed_fetch_ops;
+  use_tensorrt_ = other.use_tensorrt_;
+  tensorrt_max_batchsize_ = other.tensorrt_max_batchsize_;
+  tensorrt_workspace_size_ = other.tensorrt_workspace_size_;
+
+  if (use_gpu) {
+    pass_builder_.reset(new GpuPassStrategy(
+        *static_cast<GpuPassStrategy *>(other.pass_builder())));
+  } else {
+    pass_builder_.reset(new CpuPassStrategy(
+        *static_cast<CpuPassStrategy *>(other.pass_builder())));
+  }
+}
+
+contrib::AnalysisConfig::AnalysisConfig(contrib::AnalysisConfig &&other) {
+  // fields from Config
+  model_dir = other.model_dir;
+  // fields from NativeConfig
+  use_gpu = other.use_gpu;
+  device = other.device;
+  fraction_of_gpu_memory = other.fraction_of_gpu_memory;
+  prog_file = other.prog_file;
+  param_file = other.param_file;
+  specify_input_name = other.specify_input_name;
+  cpu_math_library_num_threads_ = other.cpu_math_library_num_threads_;
+  // fields from this.
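The copy constructor above deliberately re-creates the pass builder rather than sharing it, so a copied config keeps an independent, device-appropriate pass list. A hypothetical usage sketch (not part of the patch; the deleted pass name is only an example):

    contrib::AnalysisConfig gpu_config(true /*use_gpu*/);  // installs a GpuPassStrategy
    contrib::AnalysisConfig copy(gpu_config);              // deep-copies the strategy
    copy.pass_builder()->DeletePass("fc_fuse_pass");       // leaves gpu_config untouched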
+ enable_ir_optim = other.enable_ir_optim; + use_feed_fetch_ops = other.use_feed_fetch_ops; + use_tensorrt_ = other.use_tensorrt_; + tensorrt_max_batchsize_ = other.tensorrt_max_batchsize_; + tensorrt_workspace_size_ = other.tensorrt_workspace_size_; + pass_builder_ = std::move(other.pass_builder_); +} + +void contrib::AnalysisConfig::EnableMKLDNN() { +#ifdef PADDLE_WITH_MKLDNN + pass_builder()->EnableMKLDNN(); + use_mkldnn_ = true; +#else + LOG(ERROR) << "Please compile with MKLDNN first to use MKLDNN"; + use_mkldnn_ = false; +#endif +} + +void contrib::AnalysisConfig::EnableTensorRtEngine(int workspace_size, + int max_batch_size) { + use_tensorrt_ = true; + tensorrt_workspace_size_ = workspace_size; + tensorrt_max_batchsize_ = max_batch_size; + // Append after the infer_clean pass. + pass_builder()->InsertPass(1, "tensorrt_subgraph_pass"); +} + +} // namespace paddle diff --git a/paddle/fluid/inference/api/analysis_predictor.cc b/paddle/fluid/inference/api/analysis_predictor.cc index 3095dee0f0106b2408663cd32bb4fb310111eda4..ebe56734c6d6a9137e877f8a5ea188f59a1b4f98 100644 --- a/paddle/fluid/inference/api/analysis_predictor.cc +++ b/paddle/fluid/inference/api/analysis_predictor.cc @@ -13,10 +13,13 @@ // limitations under the License. #include "paddle/fluid/inference/api/analysis_predictor.h" +#include +#include #include #include #include #include "paddle/fluid/framework/feed_fetch_method.h" +#include "paddle/fluid/framework/feed_fetch_type.h" #include "paddle/fluid/framework/ir/fuse_pass_base.h" #include "paddle/fluid/framework/ir/pass.h" #include "paddle/fluid/framework/naive_executor.h" @@ -24,22 +27,35 @@ #include "paddle/fluid/inference/api/helper.h" #include "paddle/fluid/inference/api/paddle_inference_api.h" #include "paddle/fluid/inference/api/paddle_inference_pass.h" +#if PADDLE_WITH_TENSORRT +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#endif #include "paddle/fluid/inference/utils/singleton.h" +#include "paddle/fluid/memory/memcpy.h" #include "paddle/fluid/platform/cpu_helper.h" #include "paddle/fluid/platform/profiler.h" DECLARE_bool(profile); -DECLARE_int32(paddle_num_threads); namespace paddle { using contrib::AnalysisConfig; +namespace { +bool IsPersistable(const framework::VarDesc *var) { + if (var->Persistable() && + var->GetType() != framework::proto::VarType::FEED_MINIBATCH && + var->GetType() != framework::proto::VarType::FETCH_LIST) { + return true; + } + return false; +} +} // namespace + bool AnalysisPredictor::Init( const std::shared_ptr &parent_scope, const std::shared_ptr &program) { - VLOG(3) << "Predictor::init()"; -#if !defined(_WIN32) + VLOG(30) << "Predictor::init()"; if (FLAGS_profile) { LOG(WARNING) << "Profiler is actived, might affect the performance"; LOG(INFO) << "You can turn off by set gflags '-profile false'"; @@ -47,56 +63,116 @@ bool AnalysisPredictor::Init( : platform::ProfilerState::kCPU; platform::EnableProfiler(tracking_device); } -#endif // no matter with or without MKLDNN - paddle::platform::SetNumThreads(FLAGS_paddle_num_threads); + paddle::platform::SetNumThreads(config_.cpu_math_library_num_threads()); - if (config_.use_gpu) { - place_ = paddle::platform::CUDAPlace(config_.device); - LOG(WARNING) << "ir optimize only supports CPU currently, enable_ir_optim " - "is turned false."; - config_.enable_ir_optim = false; - } else { - place_ = paddle::platform::CPUPlace(); + if (!PrepareScope(parent_scope)) { + return false; } + if (!CreateExecutor()) { + return false; + } + if (!PrepareProgram(program)) { + return false; + } 
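Taken together, EnableMKLDNN() and EnableTensorRtEngine() above are the whole tuning surface this file adds; a sketch of the intended call sequence (the model path and sizes are placeholders):

    contrib::AnalysisConfig config(true /*use_gpu*/);
    config.model_dir = "/path/to/model";  // placeholder
    config.EnableTensorRtEngine(1 << 20 /*workspace_size*/, 16 /*max_batch_size*/);
    // EnableTensorRtEngine() has now inserted "tensorrt_subgraph_pass" at slot 1,
    // right after the infer-clean pass, as the comment in the patch explains.
    auto predictor = CreatePaddlePredictor<contrib::AnalysisConfig,
                                           PaddleEngineKind::kAnalysis>(config);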
+ + // Prepare executor, create local variables. + if (!PrepareExecutor()) { + return true; + } + + // Get the feed_target_names and fetch_target_names + PrepareFeedFetch(); + + return true; +} + +bool AnalysisPredictor::PrepareScope( + const std::shared_ptr &parent_scope) { if (parent_scope) { + PADDLE_ENFORCE_NOT_NULL( + parent_scope, + "Both program and parent_scope should be set in Clone mode."); scope_ = parent_scope; - sub_scope_ = &(parent_scope->NewScope()); + status_is_cloned_ = true; } else { paddle::framework::InitDevices(false); scope_.reset(new paddle::framework::Scope()); + status_is_cloned_ = false; } - - executor_.reset(new paddle::framework::NaiveExecutor(place_)); - + sub_scope_ = &scope_->NewScope(); + return true; +} +bool AnalysisPredictor::PrepareProgram( + const std::shared_ptr &program) { if (!program) { if (!LoadProgramDesc()) return false; - OptimizeInferenceProgram(); + + // Optimize the program, and load parameters and modify them in the + // scope_. + // This will change the scope_ address. + if (config_.enable_ir_optim) { + status_ir_optim_enabled_ = true; + OptimizeInferenceProgram(); + } else { + // If the parent_scope is passed, we assert that the persistable variables + // are already created, so just create the no persistable variables. + + // If not cloned, the parameters should be loaded + // OptimizeInferenceProgram. + // So in both cases, just the local variables are needed to load, not the + // parematers. + executor_->CreateVariables(*inference_program_, 0, true, sub_scope_); + + // Load parameters + LOG(INFO) << "load parameters "; + LoadParameters(); + } } else { + // If the program is passed from external, no need to optimize it, this + // logic is used in the clone scenario. inference_program_ = program; } - if (config_._use_mkldnn) { - executor_->EnableMKLDNN(*inference_program_); - } + executor_->CreateVariables(*inference_program_, 0, false, sub_scope_); - executor_->Prepare(scope_.get(), *inference_program_, 0, + return true; +} +bool AnalysisPredictor::CreateExecutor() { + if (config_.use_gpu) { + status_use_gpu_ = true; + place_ = paddle::platform::CUDAPlace(config_.device); + } else { + place_ = paddle::platform::CPUPlace(); + } + executor_.reset(new paddle::framework::NaiveExecutor(place_)); + return true; +} +bool AnalysisPredictor::PrepareExecutor() { + executor_->Prepare(sub_scope_, *inference_program_, 0, config_.use_feed_fetch_ops); - // Get the feed_target_names and fetch_target_names - PrepareFeedFetch(); + PADDLE_ENFORCE_NOT_NULL(sub_scope_); + return true; } +void AnalysisPredictor::SetMkldnnThreadID(int tid) { +#ifdef PADDLE_WITH_MKLDNN + platform::set_cur_thread_id(tid); +#else + LOG(ERROR) << "Please compile with MKLDNN first to use MKLDNN"; +#endif +} + bool AnalysisPredictor::Run(const std::vector &inputs, std::vector *output_data, int batch_size) { - VLOG(3) << "Predictor::predict"; + VLOG(30) << "Predictor::predict"; inference::Timer timer; timer.tic(); // set feed variable - std::vector feeds; framework::Scope *scope = sub_scope_ ? sub_scope_ : scope_.get(); if (!SetFeed(inputs, scope)) { LOG(ERROR) << "fail to set feed"; @@ -112,13 +188,17 @@ bool AnalysisPredictor::Run(const std::vector &inputs, LOG(ERROR) << "fail to get fetches"; return false; } - VLOG(3) << "predict cost: " << timer.toc() << "ms"; + VLOG(30) << "predict cost: " << timer.toc() << "ms"; + + // Fix TensorArray reuse not cleaned bug. 
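Init() above now delegates to four small stages (PrepareScope, CreateExecutor, PrepareProgram, PrepareExecutor), and, as the next lines show, Run() finishes every batch by draining tensor arrays. The cleaner's assumed contract, judging from its use here and in ZeroCopyRun():

    details::TensorArrayBatchCleaner cleaner;
    cleaner.CollectTensorArrays(scope);  // record the LoDTensorArray variables once
    cleaner.ResetTensorArray();          // clear them so the next batch starts empty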
+ tensor_array_batch_cleaner_.CollectTensorArrays(scope_.get()); + tensor_array_batch_cleaner_.ResetTensorArray(); return true; } bool AnalysisPredictor::SetFeed(const std::vector &inputs, framework::Scope *scope) { - VLOG(3) << "Predictor::set_feed"; + VLOG(30) << "Predictor::set_feed"; if (inputs.size() != feeds_.size()) { LOG(ERROR) << "wrong feed input size, need " << feeds_.size() << " but get " << inputs.size(); @@ -133,17 +213,29 @@ bool AnalysisPredictor::SetFeed(const std::vector &inputs, framework::DDim ddim = framework::make_ddim(inputs[i].shape); void *input_ptr; if (inputs[i].dtype == PaddleDType::INT64) { - input_ptr = input.mutable_data(ddim, platform::CPUPlace()); + input_ptr = input.mutable_data(ddim, place_); } else if (inputs[i].dtype == PaddleDType::FLOAT32) { - input_ptr = input.mutable_data(ddim, platform::CPUPlace()); + input_ptr = input.mutable_data(ddim, place_); } else { LOG(ERROR) << "unsupported feed type " << inputs[i].dtype; return false; } - // TODO(panyx0718): Init LoDTensor from existing memcpy to save a copy. - std::memcpy(static_cast(input_ptr), inputs[i].data.data(), - inputs[i].data.length()); + if (platform::is_cpu_place(place_)) { + // TODO(panyx0718): Init LoDTensor from existing memcpy to save a copy. + std::memcpy(static_cast(input_ptr), inputs[i].data.data(), + inputs[i].data.length()); + } else { +#ifdef PADDLE_WITH_CUDA + auto dst_gpu_place = boost::get(place_); + memory::Copy(dst_gpu_place, static_cast(input_ptr), + platform::CPUPlace(), inputs[i].data.data(), + inputs[i].data.length(), + 0); // stream 0 for sync copy +#else + PADDLE_THROW("Not compile with CUDA, should not reach here."); +#endif + } // TODO(Superjomn) Low performance, need optimization for heavy LoD copy. framework::LoD lod; for (auto &level : inputs[i].lod) { @@ -183,7 +275,7 @@ void AnalysisPredictor::GetFetchOne(const framework::LoDTensor &fetch, bool AnalysisPredictor::GetFetch(std::vector *outputs, framework::Scope *scope) { - VLOG(3) << "Predictor::get_fetch"; + VLOG(30) << "Predictor::get_fetch"; outputs->resize(fetchs_.size()); for (size_t i = 0; i < fetchs_.size(); ++i) { int idx = boost::get(fetchs_[i]->GetAttr("col")); @@ -205,47 +297,48 @@ bool AnalysisPredictor::GetFetch(std::vector *outputs, return true; } +// NOTE All the members in AnalysisConfig should be copied to Argument. void AnalysisPredictor::OptimizeInferenceProgram() { - LOG(INFO) << "optimize begin"; - FLAGS_IA_enable_ir = config_.enable_ir_optim; - FLAGS_IA_enable_tensorrt_subgraph_engine = false; - FLAGS_IA_output_storage_path = ""; // Don't output the model. 
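SetFeed() above accepts the same PaddleTensor regardless of device; only the final copy differs (std::memcpy on CPU, a synchronous memory::Copy on stream 0 under CUDA). Building such an input, exactly as the testers later in this patch do:

    int64_t ids[4] = {1, 2, 3, 4};
    PaddleTensor input;
    input.shape = std::vector<int>({4, 1});
    input.data.Reset(ids, sizeof(ids));  // borrows the caller's buffer
    input.dtype = PaddleDType::INT64;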
+ status_program_optimized_ = true; + + argument_.SetUseGPU(config_.use_gpu); + argument_.SetGPUDeviceId(config_.device); // Analyze inference_program if (!config_.model_dir.empty()) { - argument_.fluid_model_dir.reset(new std::string(config_.model_dir)); + argument_.SetModelDir(config_.model_dir); } else { PADDLE_ENFORCE( !config_.param_file.empty(), "Either model_dir or (param_file, prog_file) should be set."); PADDLE_ENFORCE(!config_.prog_file.empty()); - argument_.fluid_model_program_path.reset( - new std::string(config_.prog_file)); - argument_.fluid_model_param_path.reset(new std::string(config_.param_file)); + argument_.SetModelProgramPath(config_.prog_file); + argument_.SetModelParamsPath(config_.param_file); + } + + if (config_.use_gpu && config_.use_tensorrt_) { + argument_.SetUseTensorRT(true); + argument_.SetTensorRtWorkspaceSize(config_.tensorrt_workspace_size_); + argument_.SetTensorRtMaxBatchSize(config_.tensorrt_max_batchsize_); } - argument_.origin_program_desc.reset( - new ProgramDesc(*inference_program_->Proto())); - PADDLE_ENFORCE( - config_.ir_mode == contrib::AnalysisConfig::IrPassMode::kExclude, - "Only kExclude is supported yet."); - Analyzer().DisableIrPasses(config_.ir_passes).Run(&argument_); + auto passes = config_.pass_builder()->AllPasses(); + if (!config_.enable_ir_optim) passes.clear(); + argument_.SetIrAnalysisPasses(passes); + argument_.SetScopeNotOwned(const_cast(scope_.get())); + Analyzer().Run(&argument_); - CHECK(argument_.transformed_program_desc); + PADDLE_ENFORCE(argument_.scope_valid()); VLOG(5) << "to prepare executor"; + ARGUMENT_CHECK_FIELD((&argument_), ir_analyzed_program); inference_program_.reset( - new framework::ProgramDesc(*argument_.transformed_program_desc)); - if (argument_.Has(framework::ir::kParamScopeAttr)) { - // Update scope. - scope_.reset( - argument_.Release(framework::ir::kParamScopeAttr)); - } + new framework::ProgramDesc(argument_.ir_analyzed_program())); LOG(INFO) << "== optimize end =="; } template <> std::unique_ptr CreatePaddlePredictor< AnalysisConfig, PaddleEngineKind::kAnalysis>(const AnalysisConfig &config) { - VLOG(3) << "create AnalysisConfig"; + VLOG(30) << "create AnalysisConfig"; if (config.use_gpu) { // 1. 
GPU memeroy PADDLE_ENFORCE_GT( @@ -259,7 +352,7 @@ std::unique_ptr CreatePaddlePredictor< std::string flag = "--fraction_of_gpu_memory_to_use=" + std::to_string(config.fraction_of_gpu_memory); flags.push_back(flag); - VLOG(3) << "set flag: " << flag; + VLOG(30) << "set flag: " << flag; framework::InitGflags(flags); } } @@ -268,10 +361,12 @@ std::unique_ptr CreatePaddlePredictor< if (!dynamic_cast(predictor.get())->Init(nullptr)) { return nullptr; } - return predictor; + return std::move(predictor); } void AnalysisPredictor::PrepareFeedFetch() { + PADDLE_ENFORCE_NOT_NULL(sub_scope_); + CreateFeedFetchVar(sub_scope_); for (auto *op : inference_program_->Block(0).AllOps()) { if (op->Type() == "feed") { int idx = boost::get(op->GetAttr("col")); @@ -290,6 +385,14 @@ void AnalysisPredictor::PrepareFeedFetch() { } } +void AnalysisPredictor::CreateFeedFetchVar(framework::Scope *scope) { + PADDLE_ENFORCE_NOT_NULL(scope); + auto *var = scope->Var("feed"); + var->GetMutable(); + var = scope->Var("fetch"); + var->GetMutable(); +} + std::unique_ptr AnalysisPredictor::GetInputTensor( const std::string &name) { PADDLE_ENFORCE(executor_->scope()->FindVar(name), "no name called %s", name); @@ -312,42 +415,113 @@ std::unique_ptr AnalysisPredictor::GetOutputTensor( bool AnalysisPredictor::ZeroCopyRun() { executor_->Run(); + // Fix TensorArray reuse not cleaned bug. + tensor_array_batch_cleaner_.CollectTensorArrays(scope_.get()); + tensor_array_batch_cleaner_.ResetTensorArray(); return true; } bool AnalysisPredictor::LoadProgramDesc() { // Initialize the inference program - std::unique_ptr tmp_exe( - new framework::Executor(platform::CPUPlace())); + std::string filename; if (!config_.model_dir.empty()) { - // Parameters are saved in separate files sited in - // the specified `dirname`. - inference_program_ = paddle::inference::Load( - static_cast(tmp_exe.get()), scope_.get(), - config_.model_dir); + filename = config_.model_dir + "/__model__"; } else if (!config_.prog_file.empty() && !config_.param_file.empty()) { // All parameters are saved in a single file. // The file names should be consistent with that used // in Python API `fluid.io.save_inference_model`. - inference_program_ = paddle::inference::Load( - static_cast(tmp_exe.get()), scope_.get(), - config_.prog_file, config_.param_file); + filename = config_.prog_file; } else { + if (config_.model_dir.empty() && config_.prog_file.empty()) { + LOG(ERROR) + << "Either model_dir or (prog_file, param_file) should be set."; + return false; + } LOG(ERROR) << string::Sprintf( "not valid model path '%s' or program path '%s'.", config_.model_dir, config_.param_file); return false; } + + std::string pb_content; + // Read binary + std::ifstream fin(filename, std::ios::in | std::ios::binary); + PADDLE_ENFORCE(static_cast(fin), "Cannot open file %s", filename); + fin.seekg(0, std::ios::end); + + pb_content.resize(fin.tellg()); + fin.seekg(0, std::ios::beg); + fin.read(&(pb_content.at(0)), pb_content.size()); + fin.close(); + + // Create ProgramDesc + framework::proto::ProgramDesc proto; + proto.ParseFromString(pb_content); + inference_program_.reset(new framework::ProgramDesc(proto)); + return true; +} + +bool AnalysisPredictor::LoadParameters() { + PADDLE_ENFORCE_NOT_NULL(inference_program_.get(), + "The inference program should be loaded first."); + const auto &global_block = inference_program_->MutableBlock(0); + + // create a temporary program to load parameters. 
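LoadProgramDesc() now bypasses the temporary Executor entirely: it just reads the serialized proto from disk and parses it. Condensed, with `filename` standing in for the resolved model path:

    std::ifstream fin(filename, std::ios::in | std::ios::binary);
    std::string pb_content((std::istreambuf_iterator<char>(fin)),
                           std::istreambuf_iterator<char>());
    framework::proto::ProgramDesc proto;
    proto.ParseFromString(pb_content);
    inference_program_.reset(new framework::ProgramDesc(proto));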
+ + std::unique_ptr load_program( + new framework::ProgramDesc()); + framework::BlockDesc *load_block = load_program->MutableBlock(0); + std::vector params; + + for (auto *var : global_block->AllVars()) { + if (IsPersistable(var)) { + VLOG(3) << "persistable variable's name: " << var->Name(); + + framework::VarDesc *new_var = load_block->Var(var->Name()); + new_var->SetShape(var->GetShape()); + new_var->SetDataType(var->GetDataType()); + new_var->SetType(var->GetType()); + new_var->SetLoDLevel(var->GetLoDLevel()); + new_var->SetPersistable(true); + + if (!config_.param_file.empty()) { + params.push_back(new_var->Name()); + } else { + // append_op + framework::OpDesc *op = load_block->AppendOp(); + op->SetType("load"); + op->SetOutput("Out", {new_var->Name()}); + op->SetAttr("file_path", {config_.model_dir + "/" + new_var->Name()}); + op->CheckAttrs(); + } + } + } + + if (!config_.param_file.empty()) { + // sort paramlist to have consistent ordering + std::sort(params.begin(), params.end()); + // append just the load_combine op + framework::OpDesc *op = load_block->AppendOp(); + op->SetType("load_combine"); + op->SetOutput("Out", params); + op->SetAttr("file_path", {config_.param_file}); + op->CheckAttrs(); + } + + // Use NaiveExecutor to Load parameters. + framework::NaiveExecutor e(place_); + e.Prepare(scope_.get(), *load_program, 0, false); + e.Run(); + VLOG(3) << "get " << scope_->LocalVarNames().size() << " vars after load"; + return true; } AnalysisPredictor::~AnalysisPredictor() { -#if !defined(_WIN32) if (FLAGS_profile) { platform::DisableProfiler(platform::EventSortingKey::kTotal, "./profile.log"); } -#endif if (sub_scope_) { scope_->DeleteScope(sub_scope_); } @@ -367,3 +541,30 @@ std::unique_ptr CreatePaddlePredictor( } } // namespace paddle + +#if PADDLE_WITH_TENSORRT +USE_TRT_CONVERTER(elementwise_add_weight); +USE_TRT_CONVERTER(elementwise_add_tensor); +USE_TRT_CONVERTER(elementwise_sub_tensor); +USE_TRT_CONVERTER(elementwise_div_tensor); +USE_TRT_CONVERTER(elementwise_mul_tensor); +USE_TRT_CONVERTER(elementwise_max_tensor); +USE_TRT_CONVERTER(elementwise_min_tensor); +USE_TRT_CONVERTER(elementwise_pow_tensor); +USE_TRT_CONVERTER(mul); +USE_TRT_CONVERTER(conv2d); +USE_TRT_CONVERTER(relu); +USE_TRT_CONVERTER(sigmoid); +USE_TRT_CONVERTER(tanh); +USE_TRT_CONVERTER(fc); +USE_TRT_CONVERTER(pool2d); +USE_TRT_CONVERTER(softmax); +USE_TRT_CONVERTER(batch_norm); +USE_TRT_CONVERTER(concat); +USE_TRT_CONVERTER(dropout); +USE_TRT_CONVERTER(pad); +USE_TRT_CONVERTER(split); +USE_TRT_CONVERTER(prelu); +USE_TRT_CONVERTER(conv2d_transpose); +USE_TRT_CONVERTER(leaky_relu); +#endif diff --git a/paddle/fluid/inference/api/analysis_predictor.h b/paddle/fluid/inference/api/analysis_predictor.h index 5a9f4d36959d4ee7ca16dec769d9d1283b8787cb..db57812bc3ba8e6d578e665524cb5749e6bfecd6 100644 --- a/paddle/fluid/inference/api/analysis_predictor.h +++ b/paddle/fluid/inference/api/analysis_predictor.h @@ -13,14 +13,20 @@ // limitations under the License. 
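The FRIEND_TEST declarations added below come from gtest_prod.h; each one expands to a friend declaration for the class gtest generates for that test, which is what lets the tester read private members such as status_program_optimized_. Roughly:

    // FRIEND_TEST(AnalysisPredictor, analysis_off) expands to approximately:
    friend class AnalysisPredictor_analysis_off_Test;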
 #pragma once
+#include
+#include
 #include <string>
 #include <vector>
 #include "paddle/fluid/framework/naive_executor.h"
 #include "paddle/fluid/inference/analysis/analyzer.h"
 #include "paddle/fluid/inference/api/api_impl.h"
+#include "paddle/fluid/inference/api/details/reset_tensor_array.h"
 #include "paddle/fluid/inference/api/paddle_inference_api.h"
 #include "paddle/fluid/string/printf.h"
-
+#ifdef PADDLE_WITH_TESTING
+#include <gtest/gtest.h>
+#include <gtest/gtest_prod.h>
+#endif
 namespace paddle {
 using inference::analysis::Argument;
@@ -51,6 +57,7 @@ class AnalysisPredictor : public PaddlePredictor {
   bool ZeroCopyRun() override;
+  void CreateFeedFetchVar(framework::Scope *scope);
   void PrepareFeedFetch();
   void OptimizeInferenceProgram();
@@ -59,11 +66,19 @@ class AnalysisPredictor : public PaddlePredictor {
   std::unique_ptr<PaddlePredictor> Clone() override;
-  framework::Scope *scope() { return executor_->scope(); }
+  framework::Scope *scope() { return scope_.get(); }
   framework::ProgramDesc &program() { return *inference_program_; }
+  void SetMkldnnThreadID(int tid);
+
  protected:
+  bool PrepareProgram(const std::shared_ptr<framework::ProgramDesc> &program);
+  bool PrepareScope(const std::shared_ptr<framework::Scope> &parent_scope);
+  bool CreateExecutor();
+  bool PrepareExecutor();
+  bool LoadProgramDesc();
+  bool LoadParameters();
   bool SetFeed(const std::vector<PaddleTensor> &input_datas,
                framework::Scope *scope);
@@ -74,6 +89,14 @@ class AnalysisPredictor : public PaddlePredictor {
                     PaddleTensor *output_data);
   ~AnalysisPredictor();
+// Some more detailed tests; they are made friends of the predictor so that
+// all the details can be tested.
+#if PADDLE_WITH_TESTING
+  FRIEND_TEST(AnalysisPredictor, analysis_off);
+  FRIEND_TEST(AnalysisPredictor, analysis_on);
+  FRIEND_TEST(AnalysisPredictor, with_gpu);
+#endif
+
  private:
   contrib::AnalysisConfig config_;
   Argument argument_;
@@ -88,6 +111,14 @@ class AnalysisPredictor : public PaddlePredictor {
   std::vector<framework::LoDTensor> feed_tensors_;
+  details::TensorArrayBatchCleaner tensor_array_batch_cleaner_;
+
+ private:
+  // Some status flags that help to determine the state inside the predictor.
+  bool status_program_optimized_{false};
+  bool status_is_cloned_{false};
+  bool status_use_gpu_{false};
+  bool status_ir_optim_enabled_{false};
 };
 }  // namespace paddle
diff --git a/paddle/fluid/inference/api/analysis_predictor_tester.cc b/paddle/fluid/inference/api/analysis_predictor_tester.cc
index 1d25f55b3188a684fe38df1417d114348cfa2e8a..d67305670c91bb0814b8771332641e96974ade9d 100644
--- a/paddle/fluid/inference/api/analysis_predictor_tester.cc
+++ b/paddle/fluid/inference/api/analysis_predictor_tester.cc
@@ -12,24 +12,91 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
+#include "paddle/fluid/inference/api/analysis_predictor.h"
 #include <glog/logging.h>
 #include <gtest/gtest.h>
+#include <thread>  // NOLINT
+#include "paddle/fluid/inference/api/helper.h"
 #include "paddle/fluid/inference/api/paddle_inference_api.h"
 
 DEFINE_string(dirname, "", "dirname to tests.");
 
 namespace paddle {
-namespace inference {
 using contrib::AnalysisConfig;
 
+TEST(AnalysisPredictor, analysis_off) {
+  AnalysisConfig config(false);
+  config.model_dir = FLAGS_dirname;
+  config.enable_ir_optim = false;
+
+  auto _predictor = CreatePaddlePredictor<AnalysisConfig>(config);
+  auto* predictor = static_cast<AnalysisPredictor*>(_predictor.get());
+
+  // Without analysis, the scope_ and sub_scope_ are created by the predictor
+  // itself.
+ ASSERT_TRUE(predictor->scope_); + ASSERT_TRUE(predictor->sub_scope_); + ASSERT_EQ(predictor->scope_->parent(), nullptr); + ASSERT_EQ(predictor->sub_scope_->parent(), predictor->scope_.get()); + // ir is turned off, so program shouldn't be optimized. + ASSERT_FALSE(predictor->status_program_optimized_); + LOG(INFO) << "scope parameters " << predictor->scope_->LocalVarNames().size(); + + // 2. Dummy Input Data + int64_t data[4] = {1, 2, 3, 4}; + PaddleTensor tensor; + tensor.shape = std::vector({4, 1}); + tensor.data.Reset(data, sizeof(data)); + tensor.dtype = PaddleDType::INT64; + + std::vector inputs(4, tensor); + std::vector outputs; + ASSERT_TRUE(predictor->Run(inputs, &outputs)); +} + +TEST(AnalysisPredictor, analysis_on) { + AnalysisConfig config(false); + config.model_dir = FLAGS_dirname; + config.enable_ir_optim = true; + + auto _predictor = CreatePaddlePredictor(config); + auto* predictor = static_cast(_predictor.get()); + + ASSERT_TRUE(predictor->scope_); + ASSERT_TRUE(predictor->sub_scope_); + ASSERT_EQ(predictor->scope_->parent(), nullptr); + ASSERT_EQ(predictor->sub_scope_->parent(), predictor->scope_.get()); + // ir is turned on, so program should be optimized. + ASSERT_TRUE(predictor->status_program_optimized_); + // 2. Dummy Input Data + int64_t data[4] = {1, 2, 3, 4}; + PaddleTensor tensor; + tensor.shape = std::vector({4, 1}); + tensor.data.Reset(data, sizeof(data)); + tensor.dtype = PaddleDType::INT64; + + std::vector inputs(4, tensor); + std::vector outputs; + ASSERT_TRUE(predictor->Run(inputs, &outputs)); + + for (auto& output : outputs) { + LOG(INFO) << inference::DescribeTensor(output); + } + + // compare with NativePredictor + auto naive_predictor = CreatePaddlePredictor(config); + std::vector naive_outputs; + ASSERT_TRUE(naive_predictor->Run(inputs, &naive_outputs)); + ASSERT_EQ(naive_outputs.size(), 1UL); + inference::CompareTensor(outputs.front(), naive_outputs.front()); +} + TEST(AnalysisPredictor, ZeroCopy) { AnalysisConfig config; - config.model_dir = FLAGS_dirname + "/word2vec.inference.model"; + config.model_dir = FLAGS_dirname; config.use_feed_fetch_ops = false; - auto predictor = - CreatePaddlePredictor( - config); + auto predictor = CreatePaddlePredictor(config); auto w0 = predictor->GetInputTensor("firstw"); auto w1 = predictor->GetInputTensor("secondw"); @@ -63,5 +130,59 @@ TEST(AnalysisPredictor, ZeroCopy) { LOG(INFO) << "output_data: " << out_data; } -} // namespace inference +TEST(AnalysisPredictor, Clone) { + AnalysisConfig config; + config.model_dir = FLAGS_dirname; + config.use_feed_fetch_ops = true; + config.enable_ir_optim = true; + + std::vector> predictors; + predictors.emplace_back(CreatePaddlePredictor(config)); + + LOG(INFO) << "************** to clone ************************"; + const int num_threads = 3; + for (int i = 1; i < num_threads; i++) { + predictors.emplace_back(predictors.front()->Clone()); + } + + auto* root_scope = + static_cast(predictors[0].get())->scope(); + ASSERT_FALSE(root_scope->kids().empty()); + LOG(INFO) << "***** scope ******\n" + << framework::GenScopeTreeDebugInfo(root_scope); + + // 2. 
Dummy Input Data + int64_t data[4] = {1, 2, 3, 4}; + PaddleTensor tensor; + tensor.shape = std::vector({4, 1}); + tensor.data.Reset(data, sizeof(data)); + tensor.dtype = PaddleDType::INT64; + + std::vector inputs(4, tensor); + std::vector outputs; + predictors[0]->Run(inputs, &outputs); + + LOG(INFO) << "Run with single thread"; + for (int i = 0; i < num_threads; i++) { + LOG(INFO) << "run predictor " << i; + ASSERT_TRUE(predictors[i]->Run(inputs, &outputs)); + } + + LOG(INFO) << "Run with multiple threads"; + std::vector threads; + for (int i = 0; i < num_threads; i++) { + threads.emplace_back([&predictors, &inputs, i] { + LOG(INFO) << "thread #" << i << " running"; + std::vector outputs; + for (int j = 0; j < 10; j++) { + ASSERT_TRUE(predictors[i]->Run(inputs, &outputs)); + } + }); + } + + for (auto& t : threads) { + t.join(); + } +} + } // namespace paddle diff --git a/paddle/fluid/inference/api/api.cc b/paddle/fluid/inference/api/api.cc index 01ea942d3c8d20180cfc9664b8601ba87a898e86..9be059c73e20ebeeff2c4b6e8e5502e4a56fd0d6 100644 --- a/paddle/fluid/inference/api/api.cc +++ b/paddle/fluid/inference/api/api.cc @@ -15,8 +15,8 @@ #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/scope.h" #include "paddle/fluid/inference/api/paddle_inference_api.h" +#include "paddle/fluid/inference/api/paddle_pass_builder.h" #include "paddle/fluid/platform/enforce.h" -#include "paddle_inference_api.h" namespace paddle { diff --git a/paddle/fluid/inference/api/api_anakin_engine.h b/paddle/fluid/inference/api/api_anakin_engine.h index 04536ea3a53bbbc9293d92e69a23567e4bfd84c0..6a8b81cc57281b12cd3a4c89c863b20a824ce34a 100644 --- a/paddle/fluid/inference/api/api_anakin_engine.h +++ b/paddle/fluid/inference/api/api_anakin_engine.h @@ -19,11 +19,13 @@ limitations under the License. */ #pragma once +#define WITH_ANAKIN + #include #include "framework/core/net/net.h" #include "framework/graph/graph.h" -#include "paddle/fluid/inference/api/paddle_inference_api.h" +#include "paddle/fluid/inference/api/paddle_anakin_config.h" #include "saber/core/shape.h" #include "saber/saber_types.h" diff --git a/paddle/fluid/inference/api/api_impl.cc b/paddle/fluid/inference/api/api_impl.cc index 7cda9c5d8a8366bd097491f37f5352a10e4fb16c..a7cef426d17908bfae59a4a5a204237ab580a5a1 100644 --- a/paddle/fluid/inference/api/api_impl.cc +++ b/paddle/fluid/inference/api/api_impl.cc @@ -22,12 +22,13 @@ limitations under the License. 
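The Clone test above pins down the intended threading model: create one predictor, Clone() it once per thread (clones share the persistable weights through the root scope, each with its own sub-scope), and never share a single predictor across threads. Condensed from the test:

    std::vector<std::unique_ptr<PaddlePredictor>> predictors;
    predictors.emplace_back(CreatePaddlePredictor(config));
    for (int i = 1; i < num_threads; i++) {
      predictors.emplace_back(predictors.front()->Clone());
    }
    // Each thread then calls predictors[i]->Run(inputs, &outputs) independently.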
*/ #include "paddle/fluid/framework/feed_fetch_method.h" #include "paddle/fluid/inference/api/api_impl.h" +#include "paddle/fluid/inference/api/details/reset_tensor_array.h" #include "paddle/fluid/inference/api/helper.h" +#include "paddle/fluid/memory/memcpy.h" #include "paddle/fluid/platform/cpu_helper.h" #include "paddle/fluid/platform/profiler.h" DEFINE_bool(profile, false, "Turn on profiler for fluid"); -DECLARE_int32(paddle_num_threads); namespace paddle { namespace { @@ -63,7 +64,6 @@ void NativePaddlePredictor::PrepareFeedFetch() { bool NativePaddlePredictor::Init( std::shared_ptr parent_scope) { VLOG(3) << "Predictor::init()"; -#if !defined(_WIN32) if (FLAGS_profile) { LOG(WARNING) << "Profiler is actived, might affect the performance"; LOG(INFO) << "You can turn off by set gflags '-profile false'"; @@ -72,10 +72,9 @@ bool NativePaddlePredictor::Init( : platform::ProfilerState::kCPU; platform::EnableProfiler(tracking_device); } -#endif // no matter with or without MKLDNN - paddle::platform::SetNumThreads(FLAGS_paddle_num_threads); + paddle::platform::SetNumThreads(config_.cpu_math_library_num_threads()); if (config_.use_gpu) { place_ = paddle::platform::CUDAPlace(config_.device); @@ -120,12 +119,10 @@ bool NativePaddlePredictor::Init( } NativePaddlePredictor::~NativePaddlePredictor() { -#if !defined(_WIN32) if (FLAGS_profile) { platform::DisableProfiler(platform::EventSortingKey::kTotal, "./profile.log"); } -#endif if (sub_scope_) { scope_->DeleteScope(sub_scope_); } @@ -138,7 +135,6 @@ bool NativePaddlePredictor::Run(const std::vector &inputs, Timer timer; timer.tic(); // set feed variable - std::vector feeds; framework::Scope *scope = sub_scope_ != nullptr ? sub_scope_ : scope_.get(); if (!SetFeed(inputs, scope)) { LOG(ERROR) << "fail to set feed"; @@ -156,7 +152,11 @@ bool NativePaddlePredictor::Run(const std::vector &inputs, LOG(ERROR) << "fail to get fetches"; return false; } - VLOG(3) << "predict cost: " << timer.toc() << "ms"; + VLOG(30) << "predict cost: " << timer.toc() << "ms"; + + // Fix TensorArray reuse not cleaned bug. + tensor_array_batch_cleaner_.CollectTensorArrays(scope_.get()); + tensor_array_batch_cleaner_.ResetTensorArray(); return true; } @@ -190,17 +190,30 @@ bool NativePaddlePredictor::SetFeed(const std::vector &inputs, framework::DDim ddim = framework::make_ddim(inputs[i].shape); void *input_ptr; if (inputs[i].dtype == PaddleDType::INT64) { - input_ptr = input.mutable_data(ddim, platform::CPUPlace()); + input_ptr = input.mutable_data(ddim, place_); } else if (inputs[i].dtype == PaddleDType::FLOAT32) { - input_ptr = input.mutable_data(ddim, platform::CPUPlace()); + input_ptr = input.mutable_data(ddim, place_); } else { LOG(ERROR) << "unsupported feed type " << inputs[i].dtype; return false; } - // TODO(panyx0718): Init LoDTensor from existing memcpy to save a copy. - std::memcpy(static_cast(input_ptr), inputs[i].data.data(), - inputs[i].data.length()); + if (platform::is_cpu_place(place_)) { + // TODO(panyx0718): Init LoDTensor from existing memcpy to save a copy. + std::memcpy(static_cast(input_ptr), inputs[i].data.data(), + inputs[i].data.length()); + } else { +#ifdef PADDLE_WITH_CUDA + auto dst_gpu_place = boost::get(place_); + memory::Copy(dst_gpu_place, static_cast(input_ptr), + platform::CPUPlace(), inputs[i].data.data(), + inputs[i].data.length(), + 0); // stream 0 for sync copy +#else + PADDLE_THROW("Not compile with CUDA, should not reach here."); +#endif + } + // TODO(Superjomn) Low performance, need optimization for heavy LoD copy. 
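The loop that follows converts the offset vectors carried in PaddleTensor::lod into a framework::LoD before attaching it to the input tensor. From the caller's side, sequence information is supplied as plain offsets (values illustrative):

    PaddleTensor t;
    t.lod = {{0, 3, 4}};  // one LoD level: sequences [0, 3) and [3, 4)
    // SetFeed() copies each level into framework::LoD and sets it on the LoDTensor.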
framework::LoD lod; for (auto &level : inputs[i].lod) { diff --git a/paddle/fluid/inference/api/api_impl.h b/paddle/fluid/inference/api/api_impl.h index 7882f6a53c7ce9a2486158ea9b50c018d1814091..9dfa48d501f17fa654ec50049608b1a87c586cb6 100644 --- a/paddle/fluid/inference/api/api_impl.h +++ b/paddle/fluid/inference/api/api_impl.h @@ -14,23 +14,17 @@ limitations under the License. */ #pragma once -// logging.h and windows.h conflict -#define GLOG_NO_ABBREVIATED_SEVERITIES -// solve static linking error in windows -// https://github.com/google/glog/issues/301 -#define GOOGLE_GLOG_DLL_DECL - #include #include #include #include #include -#include "paddle/fluid/inference/api/paddle_inference_api.h" - #include "paddle/fluid/framework/ddim.h" #include "paddle/fluid/framework/lod_tensor.h" +#include "paddle/fluid/framework/lod_tensor_array.h" #include "paddle/fluid/framework/naive_executor.h" +#include "paddle/fluid/inference/api/details/reset_tensor_array.h" #include "paddle/fluid/inference/api/paddle_inference_api.h" #include "paddle/fluid/inference/io.h" #include "paddle/fluid/platform/init.h" @@ -77,6 +71,7 @@ class NativePaddlePredictor : public PaddlePredictor { std::vector fetchs_; // Do not use unique_ptr, use parent scope to delete framework::Scope *sub_scope_{nullptr}; + details::TensorArrayBatchCleaner tensor_array_batch_cleaner_; }; } // namespace paddle diff --git a/paddle/fluid/inference/api/api_impl_tester.cc b/paddle/fluid/inference/api/api_impl_tester.cc index bed7c871311e476b4ed113d982c690383ad732de..014bdc6a379744463e535df97af4c9c2e1651656 100644 --- a/paddle/fluid/inference/api/api_impl_tester.cc +++ b/paddle/fluid/inference/api/api_impl_tester.cc @@ -27,7 +27,9 @@ limitations under the License. */ #define ACC_DIFF 1e-3 #endif -DEFINE_string(dirname, "", "Directory of the inference model."); +DEFINE_string(word2vec_dirname, "", + "Directory of the word2vec inference model."); +DEFINE_string(book_dirname, "", "Directory of the book inference model."); namespace paddle { @@ -49,7 +51,7 @@ PaddleTensor LodTensorToPaddleTensor(framework::LoDTensor* t) { NativeConfig GetConfig() { NativeConfig config; - config.model_dir = FLAGS_dirname + "/word2vec.inference.model"; + config.model_dir = FLAGS_word2vec_dirname; LOG(INFO) << "dirname " << config.model_dir; config.fraction_of_gpu_memory = 0.15; #ifdef PADDLE_WITH_CUDA @@ -116,7 +118,7 @@ void MainImageClassification(bool use_gpu) { NativeConfig config = GetConfig(); config.use_gpu = use_gpu; config.model_dir = - FLAGS_dirname + "/image_classification_resnet.inference.model"; + FLAGS_book_dirname + "/image_classification_resnet.inference.model"; const bool is_combined = false; std::vector> feed_target_shapes = @@ -187,7 +189,7 @@ void MainThreadsWord2Vec(bool use_gpu) { std::vector threads; for (int tid = 0; tid < num_jobs; ++tid) { threads.emplace_back([&, tid]() { - auto predictor = main_predictor->Clone(); + auto predictor = CreatePaddlePredictor(config); auto& local_inputs = paddle_tensor_feeds[tid]; std::vector local_outputs; ASSERT_TRUE(predictor->Run(local_inputs, &local_outputs)); @@ -205,7 +207,7 @@ void MainThreadsWord2Vec(bool use_gpu) { float* ref_data = refs[tid].data(); EXPECT_EQ(refs[tid].numel(), static_cast(len / sizeof(float))); for (int i = 0; i < refs[tid].numel(); ++i) { - EXPECT_NEAR(ref_data[i], data[i], ACC_DIFF); + EXPECT_NEAR(ref_data[i], data[i], 2e-3); } }); } @@ -220,7 +222,7 @@ void MainThreadsImageClassification(bool use_gpu) { NativeConfig config = GetConfig(); config.use_gpu = use_gpu; config.model_dir = 
- FLAGS_dirname + "/image_classification_resnet.inference.model"; + FLAGS_book_dirname + "/image_classification_resnet.inference.model"; auto main_predictor = CreatePaddlePredictor(config); std::vector jobs(num_jobs); @@ -245,7 +247,7 @@ void MainThreadsImageClassification(bool use_gpu) { std::vector threads; for (int tid = 0; tid < num_jobs; ++tid) { threads.emplace_back([&, tid]() { - auto predictor = main_predictor->Clone(); + auto predictor = CreatePaddlePredictor(config); auto& local_inputs = paddle_tensor_feeds[tid]; std::vector local_outputs; ASSERT_TRUE(predictor->Run(local_inputs, &local_outputs)); @@ -271,7 +273,7 @@ TEST(inference_api_native, word2vec_cpu_threads) { MainThreadsWord2Vec(false /*use_gpu*/); } TEST(inference_api_native, image_classification_cpu) { - MainThreadsImageClassification(false /*use_gpu*/); + MainImageClassification(false /*use_gpu*/); } TEST(inference_api_native, image_classification_cpu_threads) { MainThreadsImageClassification(false /*use_gpu*/); @@ -279,16 +281,25 @@ TEST(inference_api_native, image_classification_cpu_threads) { #ifdef PADDLE_WITH_CUDA TEST(inference_api_native, word2vec_gpu) { MainWord2Vec(true /*use_gpu*/); } -TEST(inference_api_native, word2vec_gpu_threads) { - MainThreadsWord2Vec(true /*use_gpu*/); -} +// Turn off temporarily for the unstable result. +// TEST(inference_api_native, word2vec_gpu_threads) { +// MainThreadsWord2Vec(true /*use_gpu*/); +// } TEST(inference_api_native, image_classification_gpu) { - MainThreadsImageClassification(true /*use_gpu*/); -} -TEST(inference_api_native, image_classification_gpu_threads) { - MainThreadsImageClassification(true /*use_gpu*/); + MainImageClassification(true /*use_gpu*/); } - +// Turn off temporarily for the unstable result. +// TEST(inference_api_native, image_classification_gpu_threads) { +// MainThreadsImageClassification(true /*use_gpu*/); +// } #endif +TEST(PassBuilder, Delete) { + contrib::AnalysisConfig config(false); + config.pass_builder()->DeletePass("attention_lstm_fuse_pass"); + const auto& passes = config.pass_builder()->AllPasses(); + auto it = std::find(passes.begin(), passes.end(), "attention_lstm_fuse_pass"); + ASSERT_EQ(it, passes.end()); +} + } // namespace paddle diff --git a/paddle/fluid/inference/api/api_tensorrt_subgraph_engine.cc b/paddle/fluid/inference/api/api_tensorrt_subgraph_engine.cc deleted file mode 100644 index 7ac468ee4d33f49bba20a07c976055a083743cbc..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/api/api_tensorrt_subgraph_engine.cc +++ /dev/null @@ -1,188 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. 
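The new PassBuilder test above, together with EnableTensorRtEngine()'s InsertPass(1, ...), documents the supported ways to customize the IR pipeline per config:

    contrib::AnalysisConfig config(false /*use_gpu*/);
    config.pass_builder()->DeletePass("attention_lstm_fuse_pass");  // drop one fusion
    // Insertion by position is also available, e.g. what EnableTensorRtEngine() does:
    // config.pass_builder()->InsertPass(1, "tensorrt_subgraph_pass");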
- -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/api/api_impl.h" -#include "paddle/fluid/inference/api/paddle_inference_api.h" -#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" -#include "paddle/fluid/inference/utils/singleton.h" -#include "paddle/fluid/operators/tensorrt_engine_op.h" - -namespace paddle { - -using inference::analysis::Argument; -using inference::Singleton; -using inference::analysis::Analyzer; -using framework::proto::ProgramDesc; -using paddle::contrib::MixedRTConfig; - -class TensorRTSubgraphPredictor : public NativePaddlePredictor { - public: - explicit TensorRTSubgraphPredictor(const MixedRTConfig& config) - : NativePaddlePredictor(config), config_(config) {} - - bool Init(const std::shared_ptr& parent_scope) { - FLAGS_IA_enable_tensorrt_subgraph_engine = true; - VLOG(3) << "Predictor::init()"; - if (config_.use_gpu) { - place_ = paddle::platform::CUDAPlace(config_.device); - } else { - place_ = paddle::platform::CPUPlace(); - } - if (parent_scope) { - scope_ = parent_scope; - sub_scope_ = &(parent_scope->NewScope()); - } else { - paddle::framework::InitDevices(false); - scope_.reset(new paddle::framework::Scope()); - } - - executor_.reset(new paddle::framework::Executor(place_)); - - // Initialize the inference program - if (!config_.model_dir.empty()) { - // Parameters are saved in separate files sited in - // the specified `dirname`. - inference_program_ = paddle::inference::Load( - executor_.get(), scope_.get(), config_.model_dir); - } else if (!config_.prog_file.empty() && !config_.param_file.empty()) { - // All parameters are saved in a single file. - // The file names should be consistent with that used - // in Python API `fluid.io.save_inference_model`. - inference_program_ = paddle::inference::Load( - executor_.get(), scope_.get(), config_.prog_file, config_.param_file); - } else { - LOG(ERROR) << "fail to load inference model."; - return false; - } - - OptimizeInferenceProgram(); - ctx_ = executor_->Prepare(*inference_program_, 0); - - VLOG(5) << "to create variables"; - executor_->CreateVariables(*inference_program_, - sub_scope_ ? 
sub_scope_ : scope_.get(), 0); - // Get the feed_target_names and fetch_target_names - PrepareFeedFetch(); - return true; - } - - bool Run(const std::vector& inputs, - std::vector* output_data, - int batch_size = -1) override { - PADDLE_ENFORCE_GT(batch_size, 0, - "TensorRT engine needs the argument batch_size set"); - FLAGS_tensorrt_engine_batch_size = batch_size; - return NativePaddlePredictor::Run(inputs, output_data, batch_size); - } - - void OptimizeInferenceProgram() { - // Analyze inference_program - Argument argument; - - argument.Set("minimum_subgraph_size", - new int(config_.minimum_subgraph_size)); - argument.Set("max_batch_size", new int(config_.max_batch_size)); - argument.Set("workspace_size", new int(config_.workspace_size)); - argument.Set("precision_mode", - new std::string(config_.precision_mode)); - - if (!config_.model_dir.empty()) { - argument.fluid_model_dir.reset(new std::string(config_.model_dir)); - } else { - PADDLE_ENFORCE( - !config_.param_file.empty(), - "Either model_dir or (param_file, prog_file) should be set."); - PADDLE_ENFORCE(!config_.prog_file.empty()); - argument.fluid_model_program_path.reset( - new std::string(config_.prog_file)); - argument.fluid_model_param_path.reset( - new std::string(config_.param_file)); - } - argument.origin_program_desc.reset( - new ProgramDesc(*inference_program_->Proto())); - Singleton::Global().Run(&argument); - CHECK(argument.transformed_program_desc); - VLOG(5) << "transformed program:\n" - << argument.transformed_program_desc->SerializeAsString(); - VLOG(5) << "to prepare executor"; - inference_program_.reset( - new framework::ProgramDesc(*argument.transformed_program_desc)); - } - - private: - MixedRTConfig config_; -}; - -template <> -std::unique_ptr -CreatePaddlePredictor( - const MixedRTConfig& config) { - VLOG(3) << "create TensorRTSubgraphPredictor"; - if (config.use_gpu) { - // 1. 
GPU memeroy - PADDLE_ENFORCE_GT( - config.fraction_of_gpu_memory, 0.f, - "fraction_of_gpu_memory in the config should be set to range (0., 1.]"); - PADDLE_ENFORCE_GE(config.device, 0, "Invalid device id %d", config.device); - std::vector flags; - if (config.fraction_of_gpu_memory >= 0.0f || - config.fraction_of_gpu_memory <= 0.95f) { - flags.push_back("dummpy"); - std::string flag = "--fraction_of_gpu_memory_to_use=" + - std::to_string(config.fraction_of_gpu_memory); - flags.push_back(flag); - VLOG(3) << "set flag: " << flag; - framework::InitGflags(flags); - } - } - - std::unique_ptr predictor( - new TensorRTSubgraphPredictor(config)); - if (!dynamic_cast(predictor.get()) - ->Init(nullptr)) { - return nullptr; - } - return std::move(predictor); -} - -template <> -std::unique_ptr CreatePaddlePredictor( - const MixedRTConfig& config) { - return CreatePaddlePredictor(config); -} - -} // namespace paddle - -USE_TRT_CONVERTER(elementwise_add_weight); -USE_TRT_CONVERTER(elementwise_add_tensor); -USE_TRT_CONVERTER(elementwise_sub_tensor); -USE_TRT_CONVERTER(elementwise_div_tensor); -USE_TRT_CONVERTER(elementwise_mul_tensor); -USE_TRT_CONVERTER(elementwise_max_tensor); -USE_TRT_CONVERTER(elementwise_min_tensor); -USE_TRT_CONVERTER(elementwise_pow_tensor); -USE_TRT_CONVERTER(mul); -USE_TRT_CONVERTER(conv2d); -USE_TRT_CONVERTER(relu); -USE_TRT_CONVERTER(sigmoid); -USE_TRT_CONVERTER(tanh); -USE_TRT_CONVERTER(fc); -USE_TRT_CONVERTER(pool2d); -USE_TRT_CONVERTER(softmax); -USE_TRT_CONVERTER(batch_norm); -USE_TRT_CONVERTER(concat); -USE_TRT_CONVERTER(dropout); -USE_TRT_CONVERTER(pad); diff --git a/paddle/fluid/inference/api/api_tensorrt_subgraph_engine_tester.cc b/paddle/fluid/inference/api/api_tensorrt_subgraph_engine_tester.cc deleted file mode 100644 index fc6310e90b0257bc84742fb617a00f5778bb1866..0000000000000000000000000000000000000000 --- a/paddle/fluid/inference/api/api_tensorrt_subgraph_engine_tester.cc +++ /dev/null @@ -1,95 +0,0 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include -#include -#include -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/api/paddle_inference_api.h" - -namespace paddle { - -using contrib::MixedRTConfig; - -DEFINE_string(dirname, "", "Directory of the inference model."); - -void CompareTensorRTWithFluid(bool enable_tensorrt) { - FLAGS_IA_enable_tensorrt_subgraph_engine = enable_tensorrt; - - //# 1. Create PaddlePredictor with a config. 
- NativeConfig config0; - config0.model_dir = FLAGS_dirname + "word2vec.inference.model"; - config0.use_gpu = true; - config0.fraction_of_gpu_memory = 0.3; - config0.device = 0; - - MixedRTConfig config1; - config1.model_dir = FLAGS_dirname + "word2vec.inference.model"; - config1.use_gpu = true; - config1.fraction_of_gpu_memory = 0.3; - config1.device = 0; - config1.max_batch_size = 10; - - auto predictor0 = - CreatePaddlePredictor(config0); - auto predictor1 = - CreatePaddlePredictor(config1); - - for (int batch_id = 0; batch_id < 1; batch_id++) { - //# 2. Prepare input. - std::vector data(20); - for (int i = 0; i < 20; i++) data[i] = i; - - PaddleTensor tensor; - tensor.shape = std::vector({10, 1}); - tensor.data = PaddleBuf(data.data(), data.size() * sizeof(int64_t)); - tensor.dtype = PaddleDType::INT64; - - // For simplicity, we set all the slots with the same data. - std::vector slots(4, tensor); - - //# 3. Run - std::vector outputs0; - std::vector outputs1; - CHECK(predictor0->Run(slots, &outputs0)); - CHECK(predictor1->Run(slots, &outputs1, 10)); - - //# 4. Get output. - ASSERT_EQ(outputs0.size(), 1UL); - ASSERT_EQ(outputs1.size(), 1UL); - - const size_t num_elements = outputs0.front().data.length() / sizeof(float); - const size_t num_elements1 = outputs1.front().data.length() / sizeof(float); - EXPECT_EQ(num_elements, num_elements1); - - auto *data0 = static_cast(outputs0.front().data.data()); - auto *data1 = static_cast(outputs1.front().data.data()); - - ASSERT_GT(num_elements, 0UL); - for (size_t i = 0; i < std::min(num_elements, num_elements1); i++) { - EXPECT_NEAR(data0[i], data1[i], 1e-3); - } - } -} - -TEST(paddle_inference_api_tensorrt_subgraph_engine, without_tensorrt) { - CompareTensorRTWithFluid(false); -} - -TEST(paddle_inference_api_tensorrt_subgraph_engine, with_tensorrt) { - CompareTensorRTWithFluid(true); -} - -} // namespace paddle diff --git a/paddle/fluid/inference/api/demo_ci/CMakeLists.txt b/paddle/fluid/inference/api/demo_ci/CMakeLists.txt index ec8471ef960a2fc44af23c52be09cd678fab3f70..8fb464c0f5443f116815b14324f6cbc966dc6482 100644 --- a/paddle/fluid/inference/api/demo_ci/CMakeLists.txt +++ b/paddle/fluid/inference/api/demo_ci/CMakeLists.txt @@ -46,12 +46,11 @@ if(WITH_GPU) endif() endif(NOT WIN32) endif() - -include_directories("D:/Paddle/") include_directories("${PADDLE_LIB}") include_directories("${PADDLE_LIB}/third_party/install/protobuf/include") include_directories("${PADDLE_LIB}/third_party/install/glog/include") include_directories("${PADDLE_LIB}/third_party/install/gflags/include") +include_directories("${PADDLE_LIB}/third_party/install/xxhash/include") if (NOT WIN32) include_directories("${PADDLE_LIB}/third_party/install/snappy/include") include_directories("${PADDLE_LIB}/third_party/install/snappystream/include") @@ -61,8 +60,8 @@ endif(NOT WIN32) include_directories("${PADDLE_LIB}/third_party/boost") include_directories("${PADDLE_LIB}/third_party/eigen3") -if (NOT WIN32) - if (USE_TENSORRT AND WITH_GPU) +if (NOT WIN32) + if (USE_TENSORRT AND WITH_GPU) include_directories("${TENSORRT_INCLUDE_DIR}") link_directories("${TENSORRT_LIB_DIR}") endif() @@ -77,13 +76,14 @@ endif(NOT WIN32) link_directories("${PADDLE_LIB}/third_party/install/protobuf/lib") link_directories("${PADDLE_LIB}/third_party/install/glog/lib") link_directories("${PADDLE_LIB}/third_party/install/gflags/lib") -link_directories("${PADDLE_LIB}/paddle/fluid/inference") +link_directories("${PADDLE_LIB}/third_party/install/xxhash/lib") +link_directories("${PADDLE_LIB}/paddle/lib") 
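+# The demo now links against the released inference layout: headers under
+# ${PADDLE_LIB}/paddle/include and libpaddle_fluid under
+# ${PADDLE_LIB}/paddle/lib, with xxhash as an extra third-party dependency.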
add_executable(${DEMO_NAME} ${DEMO_NAME}.cc) if(WITH_MKL) include_directories("${PADDLE_LIB}/third_party/install/mklml/include") - set(MATH_LIB ${PADDLE_LIB}/third_party/install/mklml/lib/libmklml_intel${CMAKE_SHARED_LIBRARY_SUFFIX} + set(MATH_LIB ${PADDLE_LIB}/third_party/install/mklml/lib/libmklml_intel${CMAKE_SHARED_LIBRARY_SUFFIX} ${PADDLE_LIB}/third_party/install/mklml/lib/libiomp5${CMAKE_SHARED_LIBRARY_SUFFIX}) set(MKLDNN_PATH "${PADDLE_LIB}/third_party/install/mkldnn") if(EXISTS ${MKLDNN_PATH}) @@ -97,17 +97,17 @@ endif() # Note: libpaddle_inference_api.so/a must put before libpaddle_fluid.so/a if(WITH_STATIC_LIB) set(DEPS - ${PADDLE_LIB}/paddle/fluid/inference/libpaddle_fluid${CMAKE_STATIC_LIBRARY_SUFFIX}) + ${PADDLE_LIB}/paddle/lib/libpaddle_fluid${CMAKE_STATIC_LIBRARY_SUFFIX}) else() set(DEPS - ${PADDLE_LIB}/paddle/fluid/inference/libpaddle_fluid${CMAKE_SHARED_LIBRARY_SUFFIX}) + ${PADDLE_LIB}/paddle/lib/libpaddle_fluid${CMAKE_SHARED_LIBRARY_SUFFIX}) endif() if (NOT WIN32) set(EXTERNAL_LIB "-lrt -ldl -lpthread") set(DEPS ${DEPS} ${MATH_LIB} ${MKLDNN_LIB} - glog gflags protobuf snappystream snappy z + glog gflags protobuf snappystream snappy z xxhash ${EXTERNAL_LIB}) else() set(DEPS ${DEPS} @@ -120,7 +120,7 @@ endif(NOT WIN32) if(WITH_GPU) if(NOT WIN32) - if (USE_TENSORRT) + if (USE_TENSORRT) set(DEPS ${DEPS} ${TENSORRT_LIB_DIR}/libnvinfer${CMAKE_STATIC_LIBRARY_SUFFIX}) set(DEPS ${DEPS} ${TENSORRT_LIB_DIR}/libnvinfer_plugin${CMAKE_STATIC_LIBRARY_SUFFIX}) endif() diff --git a/paddle/fluid/inference/api/demo_ci/run.sh b/paddle/fluid/inference/api/demo_ci/run.sh index 65c95f0834a9356fc14faed8342f5d1e474edf8f..ff718077c1ba6b10fe87aac10d84f96a23ad6bba 100755 --- a/paddle/fluid/inference/api/demo_ci/run.sh +++ b/paddle/fluid/inference/api/demo_ci/run.sh @@ -5,22 +5,23 @@ TEST_GPU_CPU=$3 # test both GPU/CPU mode or only CPU mode DATA_DIR=$4 # dataset TENSORRT_INCLUDE_DIR=$5 # TensorRT header file dir, defalut to /usr/local/TensorRT/include TENSORRT_LIB_DIR=$6 # TensorRT lib file dir, default to /usr/local/TensorRT/lib +inference_install_dir=${PADDLE_ROOT}/build/fluid_inference_install_dir cd `dirname $0` current_dir=`pwd` if [ $2 == ON ]; then # You can export yourself if move the install path - MKL_LIB=${PADDLE_ROOT}/build/fluid_install_dir/third_party/install/mklml/lib + MKL_LIB=${inference_install_dir}/third_party/install/mklml/lib export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${MKL_LIB} fi if [ $3 == ON ]; then use_gpu_list='true false' -else +else use_gpu_list='false' fi USE_TENSORRT=OFF -if [ [-d"$TENSORRT_INCLUDE_DIR"] -a [-d"$TENSORRT_LIB_DIR"] ]; then +if [ -d "$TENSORRT_INCLUDE_DIR" -a -d "$TENSORRT_LIB_DIR" ]; then USE_TENSORRT=ON fi @@ -55,13 +56,13 @@ cd build for WITH_STATIC_LIB in ON OFF; do # -----simple_on_word2vec----- rm -rf * - cmake .. -DPADDLE_LIB=${PADDLE_ROOT}/build/fluid_install_dir/ \ + cmake .. -DPADDLE_LIB=${inference_install_dir} \ -DWITH_MKL=$TURN_ON_MKL \ -DDEMO_NAME=simple_on_word2vec \ -DWITH_GPU=$TEST_GPU_CPU \ -DWITH_STATIC_LIB=$WITH_STATIC_LIB make -j - word2vec_model=${PADDLE_ROOT}'/build/python/paddle/fluid/tests/book/word2vec.inference.model' + word2vec_model=$DATA_DIR'/word2vec/word2vec.inference.model' if [ -d $word2vec_model ]; then for use_gpu in $use_gpu_list; do ./simple_on_word2vec \ @@ -75,14 +76,14 @@ for WITH_STATIC_LIB in ON OFF; do fi # ---------vis_demo--------- rm -rf * - cmake .. -DPADDLE_LIB=${PADDLE_ROOT}/build/fluid_install_dir/ \ + cmake .. 
-DPADDLE_LIB=${inference_install_dir} \ -DWITH_MKL=$TURN_ON_MKL \ -DDEMO_NAME=vis_demo \ -DWITH_GPU=$TEST_GPU_CPU \ -DWITH_STATIC_LIB=$WITH_STATIC_LIB make -j for use_gpu in $use_gpu_list; do - for vis_demo_name in $vis_demo_list; do + for vis_demo_name in $vis_demo_list; do ./vis_demo \ --modeldir=$DATA_DIR/$vis_demo_name/model \ --data=$DATA_DIR/$vis_demo_name/data.txt \ @@ -94,11 +95,11 @@ for WITH_STATIC_LIB in ON OFF; do fi done done - + # --------tensorrt mobilenet------ if [ $USE_TENSORRT == ON -a $TEST_GPU_CPU == ON ]; then rm -rf * - cmake .. -DPADDLE_LIB=${PADDLE_ROOT}/build/fluid_install_dir/ \ + cmake .. -DPADDLE_LIB=${inference_install_dir} \ -DWITH_MKL=$TURN_ON_MKL \ -DDEMO_NAME=trt_mobilenet_demo \ -DWITH_GPU=$TEST_GPU_CPU \ @@ -106,7 +107,7 @@ for WITH_STATIC_LIB in ON OFF; do -DUSE_TENSORRT=$USE_TENSORRT \ -DTENSORRT_INCLUDE_DIR=$TENSORRT_INCLUDE_DIR \ -DTENSORRT_LIB_DIR=$TENSORRT_LIB_DIR - make -j + make -j ./trt_mobilenet_demo \ --modeldir=$DATA_DIR/mobilenet/model \ --data=$DATA_DIR/mobilenet/data.txt \ diff --git a/paddle/fluid/inference/api/demo_ci/simple_on_word2vec.cc b/paddle/fluid/inference/api/demo_ci/simple_on_word2vec.cc index 8058d7e881025b1d806efe187d4523adadff367d..3dd1d3c838c4b1bcdefdadff16b02dbfb4a02ee9 100644 --- a/paddle/fluid/inference/api/demo_ci/simple_on_word2vec.cc +++ b/paddle/fluid/inference/api/demo_ci/simple_on_word2vec.cc @@ -23,7 +23,7 @@ limitations under the License. */ #include #include //NOLINT -#include "paddle/fluid/inference/paddle_inference_api.h" +#include "utils.h" // NOLINT DEFINE_string(dirname, "", "Directory of the inference model."); DEFINE_bool(use_gpu, false, "Whether use gpu."); @@ -42,8 +42,7 @@ void Main(bool use_gpu) { config.use_gpu = use_gpu; config.fraction_of_gpu_memory = 0.15; config.device = 0; - auto predictor = - CreatePaddlePredictor(config); + auto predictor = CreatePaddlePredictor(config); for (int batch_id = 0; batch_id < 3; batch_id++) { //# 2. Prepare input. @@ -85,8 +84,7 @@ void MainThreads(int num_threads, bool use_gpu) { config.use_gpu = use_gpu; config.fraction_of_gpu_memory = 0.15; config.device = 0; - auto main_predictor = - CreatePaddlePredictor(config); + auto main_predictor = CreatePaddlePredictor(config); std::vector threads; for (int tid = 0; tid < num_threads; ++tid) { diff --git a/paddle/fluid/inference/api/demo_ci/trt_mobilenet_demo.cc b/paddle/fluid/inference/api/demo_ci/trt_mobilenet_demo.cc index ffb12b5871f088f15e43a1b0ff7e2a8b2f5fd079..0eb620ea516d28fb9598af8dbd297e84580a99f9 100644 --- a/paddle/fluid/inference/api/demo_ci/trt_mobilenet_demo.cc +++ b/paddle/fluid/inference/api/demo_ci/trt_mobilenet_demo.cc @@ -4,7 +4,7 @@ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, @@ -18,7 +18,7 @@ limitations under the License. */ #include #include // use glog instead of CHECK to avoid importing other paddle header files. 
-#include "paddle/fluid/inference/demo_ci/utils.h" +#include "utils.h" // NOLINT DECLARE_double(fraction_of_gpu_memory_to_use); DEFINE_string(modeldir, "", "Directory of the inference model."); @@ -36,16 +36,15 @@ namespace demo { */ void Main() { std::unique_ptr predictor; - paddle::contrib::MixedRTConfig config; + paddle::contrib::AnalysisConfig config(true); config.param_file = FLAGS_modeldir + "/__params__"; config.prog_file = FLAGS_modeldir + "/__model__"; - config.use_gpu = true; config.device = 0; - config.max_batch_size = 1; + config.EnableTensorRtEngine(); config.fraction_of_gpu_memory = 0.1; // set by yourself - predictor = CreatePaddlePredictor(config); + predictor = CreatePaddlePredictor(config); - VLOG(3) << "begin to process data"; + VLOG(30) << "begin to process data"; // Just a single batch of data. std::string line; std::ifstream file(FLAGS_data); @@ -60,13 +59,13 @@ void Main() { PaddleBuf(record.data.data(), record.data.size() * sizeof(float)); input.dtype = PaddleDType::FLOAT32; - VLOG(3) << "run executor"; + VLOG(30) << "run executor"; std::vector output; predictor->Run({input}, &output, 1); - VLOG(3) << "output.size " << output.size(); + VLOG(30) << "output.size " << output.size(); auto& tensor = output.front(); - VLOG(3) << "output: " << SummaryTensor(tensor); + VLOG(30) << "output: " << SummaryTensor(tensor); // compare with reference result CheckOutput(FLAGS_refer, tensor); diff --git a/paddle/fluid/inference/api/demo_ci/utils.h b/paddle/fluid/inference/api/demo_ci/utils.h index 4792c97fe7d0a3f9c904774ad4a8e580cefcf237..664b9d01c7810aa4f053cd6ebbff5f3f7619fd05 100644 --- a/paddle/fluid/inference/api/demo_ci/utils.h +++ b/paddle/fluid/inference/api/demo_ci/utils.h @@ -18,7 +18,7 @@ #include #include #include -#include "paddle/fluid/inference/paddle_inference_api.h" +#include "paddle/include/paddle_inference_api.h" namespace paddle { namespace demo { @@ -47,7 +47,7 @@ static void split(const std::string& str, char sep, } Record ProcessALine(const std::string& line) { - VLOG(3) << "process a line"; + VLOG(30) << "process a line"; std::vector columns; split(line, '\t', &columns); CHECK_EQ(columns.size(), 2UL) @@ -65,8 +65,8 @@ Record ProcessALine(const std::string& line) { for (auto& s : shape_strs) { record.shape.push_back(std::stoi(s)); } - VLOG(3) << "data size " << record.data.size(); - VLOG(3) << "data shape size " << record.shape.size(); + VLOG(30) << "data size " << record.data.size(); + VLOG(30) << "data shape size " << record.shape.size(); return record; } @@ -78,8 +78,8 @@ void CheckOutput(const std::string& referfile, const PaddleTensor& output) { file.close(); size_t numel = output.data.length() / PaddleDtypeSize(output.dtype); - VLOG(3) << "predictor output numel " << numel; - VLOG(3) << "reference output numel " << refer.data.size(); + VLOG(30) << "predictor output numel " << numel; + VLOG(30) << "reference output numel " << refer.data.size(); CHECK_EQ(numel, refer.data.size()); switch (output.dtype) { case PaddleDType::INT64: { diff --git a/paddle/fluid/inference/api/demo_ci/vis_demo.cc b/paddle/fluid/inference/api/demo_ci/vis_demo.cc index db61786e2fefda29256d84b5357028ec9c39b014..bc8891455dc8e4a30ddfcc5f89792296e59c2548 100644 --- a/paddle/fluid/inference/api/demo_ci/vis_demo.cc +++ b/paddle/fluid/inference/api/demo_ci/vis_demo.cc @@ -17,8 +17,8 @@ limitations under the License. */ */ #include -#include // use glog instead of CHECK to avoid importing other paddle header files. 
-#include "paddle/fluid/inference/demo_ci/utils.h" +#include +#include "utils.h" // NOLINT #ifdef PADDLE_WITH_CUDA DECLARE_double(fraction_of_gpu_memory_to_use); @@ -34,25 +34,23 @@ DEFINE_bool(use_gpu, false, "Whether use gpu."); namespace paddle { namespace demo { +using contrib::AnalysisConfig; /* - * Use the native fluid engine to inference the demo. + * Use the native and analysis fluid engine to inference the demo. */ void Main(bool use_gpu) { - std::unique_ptr predictor; - NativeConfig config; + std::unique_ptr predictor, analysis_predictor; + AnalysisConfig config(use_gpu); config.param_file = FLAGS_modeldir + "/__params__"; config.prog_file = FLAGS_modeldir + "/__model__"; - config.use_gpu = use_gpu; config.device = 0; if (FLAGS_use_gpu) { config.fraction_of_gpu_memory = 0.1; // set by yourself } - VLOG(3) << "init predictor"; - predictor = - CreatePaddlePredictor(config); + predictor = CreatePaddlePredictor(config); + analysis_predictor = CreatePaddlePredictor(config); - VLOG(3) << "begin to process data"; // Just a single batch of data. std::string line; std::ifstream file(FLAGS_data); @@ -67,16 +65,17 @@ void Main(bool use_gpu) { PaddleBuf(record.data.data(), record.data.size() * sizeof(float)); input.dtype = PaddleDType::FLOAT32; - VLOG(3) << "run executor"; - std::vector output; + std::vector output, analysis_output; predictor->Run({input}, &output, 1); - VLOG(3) << "output.size " << output.size(); auto& tensor = output.front(); - VLOG(3) << "output: " << SummaryTensor(tensor); // compare with reference result CheckOutput(FLAGS_refer, tensor); + + // the analysis_output has some diff with native_output, + // TODO(luotao): add CheckOutput for analysis_output later. + analysis_predictor->Run({input}, &analysis_output, 1); } } // namespace demo diff --git a/paddle/fluid/inference/api/details/reset_tensor_array.cc b/paddle/fluid/inference/api/details/reset_tensor_array.cc new file mode 100644 index 0000000000000000000000000000000000000000..244b0b567b5df6735acd7f1bf3c2056f449be872 --- /dev/null +++ b/paddle/fluid/inference/api/details/reset_tensor_array.cc @@ -0,0 +1,50 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/api/details/reset_tensor_array.h" + +namespace paddle { +namespace details { + +// Should be called after the parameters are loaded. +void TensorArrayBatchCleaner::CollectTensorArrays(framework::Scope *scope) { + if (flag_) { + for (auto &var_name : scope->LocalVarNames()) { + auto *var = scope->FindVar(var_name); + // TODO(Superjomn) should avoid the case when a TensorArray is a + // parameter. 
+ if (var_name == "feed" || var_name == "fetch") continue; + if (var->Type() == typeid(framework::LoDTensorArray)) { + VLOG(40) << "collect " << var_name; + arrays_.push_back(var->GetMutable()); + } + } + for (auto *kid : scope->kids()) { + CollectTensorArrays(kid); + } + + VLOG(30) << "Collect " << arrays_.size() << " arrays"; + flag_ = false; + } +} + +// Should be called when `Run` finished. +void TensorArrayBatchCleaner::ResetTensorArray() { + for (auto *arr : arrays_) { + arr->clear(); + } +} + +} // namespace details +} // namespace paddle diff --git a/paddle/fluid/inference/api/details/reset_tensor_array.h b/paddle/fluid/inference/api/details/reset_tensor_array.h new file mode 100644 index 0000000000000000000000000000000000000000..a39449ff0e67786815dfb8d2d30d79dcdba757d7 --- /dev/null +++ b/paddle/fluid/inference/api/details/reset_tensor_array.h @@ -0,0 +1,37 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include "paddle/fluid/framework/lod_tensor_array.h" +#include "paddle/fluid/framework/scope.h" + +namespace paddle { +namespace details { + +// Clean the TensorArray each batch to make the behavior the same with the +// training phase. +struct TensorArrayBatchCleaner { + // Fix the tensor array not clear in the inference scenarios. 
+ void CollectTensorArrays(framework::Scope *scope); + void ResetTensorArray(); + + private: + bool flag_{true}; + std::vector arrays_; +}; + +} // namespace details +} // namespace paddle diff --git a/paddle/fluid/inference/api/details/zero_copy_tensor.cc b/paddle/fluid/inference/api/details/zero_copy_tensor.cc index 14698f6dfc8885ec1d35f1912bad10a9caa13db4..0f540699b8ffea94c3f3aaf3354a0462e0e674a9 100644 --- a/paddle/fluid/inference/api/details/zero_copy_tensor.cc +++ b/paddle/fluid/inference/api/details/zero_copy_tensor.cc @@ -51,7 +51,7 @@ T *ZeroCopyTensor::mutable_data(PaddlePlace place) { } template -T *ZeroCopyTensor::data(PaddlePlace *place, int *size) { +T *ZeroCopyTensor::data(PaddlePlace *place, int *size) const { auto *tensor = static_cast(FindTensor()); auto *res = tensor->data(); @@ -67,8 +67,10 @@ T *ZeroCopyTensor::data(PaddlePlace *place, int *size) { return res; } -template float *ZeroCopyTensor::data(PaddlePlace *place, int *size); -template int64_t *ZeroCopyTensor::data(PaddlePlace *place, int *size); +template float *ZeroCopyTensor::data(PaddlePlace *place, + int *size) const; +template int64_t *ZeroCopyTensor::data(PaddlePlace *place, + int *size) const; template float *ZeroCopyTensor::mutable_data(PaddlePlace place); template int64_t *ZeroCopyTensor::mutable_data(PaddlePlace place); @@ -84,7 +86,7 @@ void *ZeroCopyTensor::FindTensor() const { return tensor; } -std::vector ZeroCopyTensor::shape() { +std::vector ZeroCopyTensor::shape() const { auto *tensor = static_cast(FindTensor()); PADDLE_ENFORCE(tensor, "not found tensor called %s in the scope", name_); return framework::vectorize(tensor->dims()); diff --git a/paddle/fluid/inference/api/details/zero_copy_tensor_dummy.cc b/paddle/fluid/inference/api/details/zero_copy_tensor_dummy.cc index 2d5b561d801cd9e734cab13b28e7285493e30f94..12071e09f8442f2c52a06b7c3fe4bed2c28b524a 100644 --- a/paddle/fluid/inference/api/details/zero_copy_tensor_dummy.cc +++ b/paddle/fluid/inference/api/details/zero_copy_tensor_dummy.cc @@ -24,18 +24,20 @@ T *ZeroCopyTensor::mutable_data(PaddlePlace place) { } template -T *ZeroCopyTensor::data(PaddlePlace *place, int *size) { +T *ZeroCopyTensor::data(PaddlePlace *place, int *size) const { return nullptr; } -template float *ZeroCopyTensor::data(PaddlePlace *place, int *size); -template int64_t *ZeroCopyTensor::data(PaddlePlace *place, int *size); +template float *ZeroCopyTensor::data(PaddlePlace *place, + int *size) const; +template int64_t *ZeroCopyTensor::data(PaddlePlace *place, + int *size) const; template float *ZeroCopyTensor::mutable_data(PaddlePlace place); template int64_t *ZeroCopyTensor::mutable_data(PaddlePlace place); void *ZeroCopyTensor::FindTensor() const { return nullptr; } -std::vector ZeroCopyTensor::shape() { return {}; } +std::vector ZeroCopyTensor::shape() const { return {}; } void ZeroCopyTensor::SetLoD(const std::vector> &x) {} diff --git a/paddle/fluid/inference/api/helper.h b/paddle/fluid/inference/api/helper.h index 24f59cf43a9700ff1732e1ef6ad82e1a6294eede..9a393a61c4b4586009a022884e88f3f5c6392ed3 100644 --- a/paddle/fluid/inference/api/helper.h +++ b/paddle/fluid/inference/api/helper.h @@ -15,14 +15,17 @@ #pragma once #include -#include + +#include #include // NOLINT +#include #include #include #include #include +#include "paddle/fluid/inference/api/paddle_inference_api.h" +#include "paddle/fluid/platform/port.h" #include "paddle/fluid/string/printf.h" -#include "paddle_inference_api.h" namespace paddle { namespace inference { @@ -124,6 +127,51 @@ static int 
ZeroCopyTensorAssignData(ZeroCopyTensor *tensor, return size; } +static bool CompareTensor(const PaddleTensor &a, const PaddleTensor &b) { + if (a.dtype != b.dtype) { + LOG(ERROR) << "dtype not match"; + return false; + } + + if (a.lod.size() != b.lod.size()) { + LOG(ERROR) << "lod not match"; + return false; + } + for (size_t i = 0; i < a.lod.size(); i++) { + if (a.lod[i].size() != b.lod[i].size()) { + LOG(ERROR) << "lod not match"; + return false; + } + for (size_t j = 0; j < a.lod[i].size(); j++) { + if (a.lod[i][j] != b.lod[i][j]) { + LOG(ERROR) << "lod not match"; + return false; + } + } + } + + if (a.shape.size() != b.shape.size()) { + LOG(INFO) << "shape not match"; + return false; + } + for (size_t i = 0; i < a.shape.size(); i++) { + if (a.shape[i] != b.shape[i]) { + LOG(ERROR) << "shape not match"; + return false; + } + } + + auto *adata = static_cast(a.data.data()); + auto *bdata = static_cast(b.data.data()); + for (int i = 0; i < VecReduceToInt(a.shape); i++) { + if (adata[i] != bdata[i]) { + LOG(ERROR) << "data not match"; + return false; + } + } + return true; +} + static std::string DescribeTensor(const PaddleTensor &tensor) { std::stringstream os; os << "Tensor [" << tensor.name << "]\n"; @@ -156,11 +204,32 @@ static std::string DescribeTensor(const PaddleTensor &tensor) { return os.str(); } +static std::string DescribeZeroCopyTensor(const ZeroCopyTensor &tensor) { + std::stringstream os; + os << "Tensor [" << tensor.name() << "]\n"; + + os << " - shape: " << to_string(tensor.shape()) << '\n'; + os << " - lod: "; + for (auto &l : tensor.lod()) { + os << to_string(l) << "; "; + } + os << "\n"; + os << " - data: "; + PaddlePlace place; + int size; + const auto *data = tensor.data(&place, &size); + for (int i = 0; i < size; i++) { + os << data[i] << " "; + } + return os.str(); +} + static void PrintTime(int batch_size, int repeat, int num_threads, int tid, double latency, int epoch = 1) { LOG(INFO) << "====== batch_size: " << batch_size << ", repeat: " << repeat << ", threads: " << num_threads << ", thread id: " << tid - << ", latency: " << latency << "ms ======"; + << ", latency: " << latency << "ms, fps: " << 1 / (latency / 1000.f) + << " ======"; if (epoch > 1) { int samples = batch_size * epoch; LOG(INFO) << "====== sample number: " << samples diff --git a/paddle/fluid/inference/api/paddle_anakin_config.h b/paddle/fluid/inference/api/paddle_anakin_config.h new file mode 100644 index 0000000000000000000000000000000000000000..0e91c2624bed4459b936ac4477d73ae954e55bcc --- /dev/null +++ b/paddle/fluid/inference/api/paddle_anakin_config.h @@ -0,0 +1,35 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once + +#include +#include +#include +#include + +#include "paddle_api.h" // NOLINT + +namespace paddle { +namespace contrib { +// Configurations for Anakin engine. 
+struct AnakinConfig : public PaddlePredictor::Config {
+  enum TargetType { NVGPU = 0, X86 };
+  int device;
+  std::string model_file;
+  int max_batch_size{-1};
+  TargetType target_type;
+};
+
+} // namespace contrib
+} // namespace paddle
diff --git a/paddle/fluid/inference/api/paddle_analysis_config.h b/paddle/fluid/inference/api/paddle_analysis_config.h
new file mode 100644
index 0000000000000000000000000000000000000000..a09bd1cac2aa31b2ecee9b0f77d2b777104f1161
--- /dev/null
+++ b/paddle/fluid/inference/api/paddle_analysis_config.h
@@ -0,0 +1,79 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#pragma once

+#include
+#include
+#include
+#include
+
+// Here we include some header files with relative paths, because after
+// deployment the absolute path of this header file will be changed.
+#include "paddle_api.h"           // NOLINT
+#include "paddle_pass_builder.h"  // NOLINT
+
+namespace paddle {
+
+class AnalysisPredictor;
+// ==
+//
+// -----------------------------------------------------------------------------------
+// NOTE: The following APIs are not mature yet, we are still working on them.
+namespace contrib {
+
+// NOTE WIP, not stable yet.
+struct AnalysisConfig : public NativeConfig {
+  explicit AnalysisConfig(bool use_gpu = false);
+  explicit AnalysisConfig(const AnalysisConfig& other);
+  explicit AnalysisConfig(AnalysisConfig&& other);
+
+  // Determine whether to perform graph optimization.
+  bool enable_ir_optim = true;
+
+  // Get a pass builder to customize the passes run in the IR analysis phase.
+  PassStrategy* pass_builder() const;
+
+  // NOT stable yet.
+  bool use_feed_fetch_ops{true};
+
+  void EnableTensorRtEngine(int workspace_size = 1 << 20,
+                            int max_batch_size = 1);
+  bool use_tensorrt() const { return use_tensorrt_; }
+
+  void EnableMKLDNN();
+  // NOTE this is just for internal development, please do not use it.
+  // NOT stable yet.
+  bool use_mkldnn() const { return use_mkldnn_; }
+
+  friend class ::paddle::AnalysisPredictor;
+
+ protected:
+  bool use_tensorrt_{false};
+  bool use_mkldnn_{false};
+  int tensorrt_workspace_size_;
+  int tensorrt_max_batchsize_;
+  std::unique_ptr<PassStrategy> pass_builder_;
+};
+
+// Configurations for Anakin engine.
+struct AnakinConfig : public PaddlePredictor::Config {
+  enum TargetType { NVGPU = 0, X86 };
+  int device;
+  std::string model_file;
+  int max_batch_size{-1};
+  TargetType target_type;
+};
+
+} // namespace contrib
+} // namespace paddle
diff --git a/paddle/fluid/inference/api/paddle_api.h b/paddle/fluid/inference/api/paddle_api.h
new file mode 100644
index 0000000000000000000000000000000000000000..1513a4b3b4f6620aa8bf53e082afa203af15ca13
--- /dev/null
+++ b/paddle/fluid/inference/api/paddle_api.h
@@ -0,0 +1,233 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#pragma once
+
+#include
+#include
+#include
+#include
+
+namespace paddle {
+
+// Data type.
+enum PaddleDType {
+  FLOAT32,
+  INT64,
+  // TODO(Superjomn) support more data types if needed.
+};
+
+/*
+ * Memory manager for PaddleTensor.
+ * The PaddleBuf holds a buffer for data input or output. The memory can be
+ * allocated by the user or by the PaddleBuf itself, but in any case, the
+ * PaddleBuf should be reused for better performance.
+ *
+ * For user allocated memory, the following API can be used:
+ * - PaddleBuf(void* data, size_t length) to set an external memory by
+ *   specifying the memory address and length.
+ * - Reset(void* data, size_t length) to reset the PaddleBuf with an external
+ *   memory.
+ * ATTENTION, for user allocated memory, deallocation should be done by the
+ * user externally after the program finishes. The PaddleBuf won't do any
+ * allocation or deallocation.
+ *
+ * To have the PaddleBuf allocate and manage the memory:
+ * - PaddleBuf(size_t length) will allocate memory of size `length`.
+ * - Resize(size_t length) resizes the memory to no less than `length`.
+ *   ATTENTION, if the allocated memory is larger than `length`, nothing will
+ *   be done.
+ */
+class PaddleBuf {
+ public:
+  // PaddleBuf allocates memory internally, and manages it.
+  explicit PaddleBuf(size_t length)
+      : data_(new char[length]), length_(length), memory_owned_(true) {}
+  // Set external memory; the PaddleBuf won't manage it.
+  PaddleBuf(void* data, size_t length)
+      : data_(data), length_(length), memory_owned_{false} {}
+  // Copy only available when memory is managed externally.
+  explicit PaddleBuf(const PaddleBuf&);
+
+  // Resize the memory.
+  void Resize(size_t length);
+  // Reset to external memory, with address and length set.
+  void Reset(void* data, size_t length);
+  // Tell whether the buffer is empty.
+  bool empty() const { return length_ == 0; }
+  // Get the memory address.
+  void* data() const { return data_; }
+  // Get the memory length.
+  size_t length() const { return length_; }
+
+  ~PaddleBuf() { Free(); }
+  PaddleBuf& operator=(const PaddleBuf&);
+  PaddleBuf& operator=(PaddleBuf&&);
+  PaddleBuf() = default;
+  PaddleBuf(PaddleBuf&& other);
+
+ private:
+  void Free();
+  void* data_{nullptr};  // pointer to the data memory.
+  size_t length_{0};     // number of memory bytes.
+  bool memory_owned_{true};
+};
+
+// Basic input and output data structure for PaddlePredictor.
+struct PaddleTensor {
+  PaddleTensor() = default;
+  std::string name;  // variable name.
+  std::vector<int> shape;
+  PaddleBuf data;  // blob of data.
+  PaddleDType dtype;
+  std::vector<std::vector<size_t>> lod;  // Tensor+LoD equals LoDTensor
+};
+
+enum class PaddlePlace { kUNK = -1, kCPU, kGPU };
+// Tensor without copy, currently only supports AnalysisPredictor.
+class ZeroCopyTensor {
+ public:
+  void Reshape(const std::vector<int>& shape);
+
+  // Get the memory in CPU or GPU with the specific data type; Reshape should
+  // be called first to tell the data size.
+  // One can directly write to this memory to feed the data.
+  // This is for writing the input tensor.
+  template <typename T>
+  T* mutable_data(PaddlePlace place);
+  // Get the memory directly; the place and memory size are returned through
+  // the pointer arguments. This is for reading the output tensor.
+  template <typename T>
+  T* data(PaddlePlace* place, int* size) const;
+
+  std::vector<int> shape() const;
+
+  void SetLoD(const std::vector<std::vector<size_t>>& x);
+  std::vector<std::vector<size_t>> lod() const;
+  const std::string& name() const { return name_; }
+
+ protected:
+  explicit ZeroCopyTensor(void* scope) : scope_{scope} {}
+  void SetName(const std::string& name) { name_ = name; }
+  void* FindTensor() const;
+
+ private:
+  std::string name_;
+  bool input_or_output_;
+  friend class AnalysisPredictor;
+  void* scope_{nullptr};
+};
+
+/*
+ * A simple Inference API for Paddle.
+ */
+class PaddlePredictor {
+ public:
+  struct Config;
+  PaddlePredictor() = default;
+  PaddlePredictor(const PaddlePredictor&) = delete;
+  PaddlePredictor& operator=(const PaddlePredictor&) = delete;
+
+  // Predict a record.
+  // The caller is responsible for allocating and releasing the memory of
+  // `inputs`. `inputs` should be available until Run returns. The caller is
+  // also responsible for the output tensor's buffer, either allocated or
+  // passed from outside.
+  virtual bool Run(const std::vector<PaddleTensor>& inputs,
+                   std::vector<PaddleTensor>* output_data,
+                   int batch_size = -1) = 0;
+
+  // Zero copy input and output optimization.
+  // Get the input or output tensors, and operate on their memory directly,
+  // without copy.
+  virtual std::unique_ptr<ZeroCopyTensor> GetInputTensor(
+      const std::string& name) {
+    return nullptr;
+  }
+  virtual std::unique_ptr<ZeroCopyTensor> GetOutputTensor(
+      const std::string& name) {
+    return nullptr;
+  }
+  virtual bool ZeroCopyRun() { return false; }
+
+  // Clone a predictor that shares the model weights; the cloned predictor
+  // should be thread-safe.
+  virtual std::unique_ptr<PaddlePredictor> Clone() = 0;
+
+  // Destroy the Predictor.
+  virtual ~PaddlePredictor() = default;
+
+  // The common configs for all the predictors.
+  struct Config {
+    std::string model_dir;  // path to the model directory.
+  };
+};
+
+struct NativeConfig : public PaddlePredictor::Config {
+  // GPU related fields.
+  bool use_gpu{false};
+  int device{0};
+  float fraction_of_gpu_memory{-1.f};  // Change to a float in (0,1] if needed.
+
+  // Specify the exact path of program and parameter files.
+  std::string prog_file;
+  std::string param_file;
+
+  // Specify each input variable's name if the input tensors don't follow the
+  // `feeds` and `fetches` of the `save_inference_model` phase.
+  bool specify_input_name{false};
+
+  // Set and get the number of cpu math library threads.
+  void SetCpuMathLibraryNumThreads(int cpu_math_library_num_threads) {
+    cpu_math_library_num_threads_ = cpu_math_library_num_threads;
+  }
+  int cpu_math_library_num_threads() const {
+    return cpu_math_library_num_threads_;
+  }
+
+ protected:
+  // number of cpu math library (such as MKL, OpenBlas) threads for each
+  // instance.
+  int cpu_math_library_num_threads_{1};
+};
+
+// A factory to help create different predictors.
+//
+// Usage:
+//
+//   NativeConfig config;
+//   ... // change the configs.
+//   auto native_predictor = CreatePaddlePredictor(config);
+//
+// FOR EXTENSION DEVELOPERS:
+// Different predictors are designated by config type. Similar configs can be
+// merged, but there shouldn't be a huge config containing different fields
+// for more than one kind of predictor.
+template <typename ConfigT>
+std::unique_ptr<PaddlePredictor> CreatePaddlePredictor(const ConfigT& config);
+
+// NOTE The following APIs are too trivial, we will discard them in later
+// versions.
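// Before the legacy engine-kind overloads that the NOTE above refers to, a
// minimal usage sketch of the zero-copy path declared in this header. It is
// illustrative, not part of the header: it assumes an AnalysisPredictor
// configured with use_feed_fetch_ops = false (see paddle_analysis_config.h)
// and hypothetical tensor names "x" and "out".
void ZeroCopyRunSketch(paddle::PaddlePredictor* predictor) {
  auto input = predictor->GetInputTensor("x");
  input->Reshape({1, 128});
  // Write the features straight into the predictor-owned buffer.
  float* in_data = input->mutable_data<float>(paddle::PaddlePlace::kCPU);
  for (int i = 0; i < 128; ++i) in_data[i] = 1.f;

  if (!predictor->ZeroCopyRun()) return;  // runs without feed/fetch copies

  auto output = predictor->GetOutputTensor("out");
  paddle::PaddlePlace place;
  int size = 0;
  // Read the result in place; `place` and `size` describe the buffer.
  float* out_data = output->data<float>(&place, &size);
  (void)out_data;
}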
+enum class PaddleEngineKind { + kNative = 0, // Use the native Fluid facility. + kAutoMixedTensorRT, // Automatically mix Fluid with TensorRT. + kAnalysis, // More optimization. + kAnakin // Use Anakin for inference, not mature yet. +}; + +template +std::unique_ptr CreatePaddlePredictor(const ConfigT& config); + +int PaddleDtypeSize(PaddleDType dtype); + +} // namespace paddle diff --git a/paddle/fluid/inference/api/paddle_inference_api.h b/paddle/fluid/inference/api/paddle_inference_api.h index d2876dc27c8826c2f27be21fa0b9fef92d03067a..92fb51d647cf4e2c8a4914d8df2e8b7b6318d1d1 100644 --- a/paddle/fluid/inference/api/paddle_inference_api.h +++ b/paddle/fluid/inference/api/paddle_inference_api.h @@ -26,264 +26,9 @@ limitations under the License. */ #include #include -namespace paddle { - -// Data type. -enum PaddleDType { - FLOAT32, - INT64, - // TODO(Superjomn) support more data types if needed. -}; - -/* - * Memory menage for PaddleTensor. - * The PaddleBuf holds a buffer for data input or output. The memory can be - * allocated by user or by PaddleBuf itself, but in any case, the PaddleBuf - * should be reused for better performance. - * - * For user allocated memory, the following API can be used: - * - PaddleBuf(void* data, size_t length) to set an external memory by - * specifying - * the memory address and length. - * - Reset(void* data, size_t length) to reset the PaddleBuf with an external - * memory. - * ATTENTION, for user allocated memory, deallocation should be done by users - * externally after the program finished. The PaddleBuf won't do any allocation - * or deallocation. - * - * To have the PaddleBuf allocate and manage the memory: - * - PaddleBuf(size_t length) will allocate a memory of size `length`. - * - Resize(size_t length) resize the memory to no less than `length`, ATTENTION - * if the allocated memory is larger than `length`, nothing will done. - */ -class PaddleBuf { - public: - // PaddleBuf allocate memory internally, and manage it. - explicit PaddleBuf(size_t length) - : data_(new char[length]), length_(length), memory_owned_(true) {} - // Set external memory, the PaddleBuf won't manage it. - PaddleBuf(void* data, size_t length) - : data_(data), length_(length), memory_owned_{false} {} - // Copy only available when memory is managed externally. - explicit PaddleBuf(const PaddleBuf&); - - // Resize the memory. - void Resize(size_t length); - // Reset to external memory, with address and length set. - void Reset(void* data, size_t length); - // Tell whether the buffer is empty. - bool empty() const { return length_ == 0; } - // Get the memory address. - void* data() const { return data_; } - // Get the memory length. - size_t length() const { return length_; } - - ~PaddleBuf() { Free(); } - PaddleBuf& operator=(const PaddleBuf&); - PaddleBuf& operator=(PaddleBuf&&); - PaddleBuf() = default; - PaddleBuf(PaddleBuf&& other); - - private: - void Free(); - void* data_{nullptr}; // pointer to the data memory. - size_t length_{0}; // number of memory bytes. - bool memory_owned_{true}; -}; - -// Basic input and output data structure for PaddlePredictor. -struct PaddleTensor { - PaddleTensor() = default; - std::string name; // variable name. - std::vector shape; - PaddleBuf data; // blob of data. - PaddleDType dtype; - std::vector> lod; // Tensor+LoD equals LoDTensor -}; - -enum class PaddlePlace { kUNK = -1, kCPU, kGPU }; -// Tensor without copy, currently only supports AnalysisPredictor. 
-class ZeroCopyTensor { - public: - void Reshape(const std::vector& shape); - - // Get the memory in CPU or GPU with specific data type, should Reshape first - // to tell the data size. - // Once can directly call this data to feed the data. - // This is for write the input tensor. - template - T* mutable_data(PaddlePlace place); - // Get the memory directly, will return the place and memory size by pointer. - // This is for reading the output tensor. - template - T* data(PaddlePlace* place, int* size); - - std::vector shape(); - - void SetLoD(const std::vector>& x); - std::vector> lod() const; - - protected: - ZeroCopyTensor(void* scope) : scope_{scope} {} - void SetName(const std::string& name) { name_ = name; } - void* FindTensor() const; - - private: - std::string name_; - bool input_or_output_; - friend class AnalysisPredictor; - void* scope_{nullptr}; -}; - -/* - * A simple Inference API for Paddle. - */ -class PaddlePredictor { - public: - struct Config; - PaddlePredictor() = default; - PaddlePredictor(const PaddlePredictor&) = delete; - PaddlePredictor& operator=(const PaddlePredictor&) = delete; - - // Predict an record. - // The caller should be responsible for allocating and releasing the memory of - // `inputs`. `inputs` should be available until Run returns. Caller should be - // responsible for the output tensor's buffer, either allocated or passed from - // outside. - virtual bool Run(const std::vector& inputs, - std::vector* output_data, - int batch_size = -1) = 0; - - // Zero copy input and output optimization. - // Get the input or output tensors, and operate on their memory directly, - // without copy. - virtual std::unique_ptr GetInputTensor( - const std::string& name) { - return nullptr; - } - virtual std::unique_ptr GetOutputTensor( - const std::string& name) { - return nullptr; - } - virtual bool ZeroCopyRun() { return false; } - - // Clone a predictor that share the model weights, the Cloned predictor should - // be thread-safe. - virtual std::unique_ptr Clone() = 0; - - // Destroy the Predictor. - virtual ~PaddlePredictor() = default; - - // The common configs for all the predictors. - struct Config { - std::string model_dir; // path to the model directory. - }; -}; - -struct NativeConfig : public PaddlePredictor::Config { - // GPU related fields. - bool use_gpu{false}; - int device{0}; - float fraction_of_gpu_memory{-1.f}; // Change to a float in (0,1] if needed. - - // Specify the exact path of program and parameter files. - std::string prog_file; - std::string param_file; - - // Specify the variable's name of each input if input tensors don't follow the - // `feeds` and `fetches` of the phase `save_inference_model`. - bool specify_input_name{false}; -}; - -// A factory to help create different predictors. -// -// Usage: -// -// NativeConfig config; -// ... // change the configs. -// auto native_predictor = CreatePaddlePredictor(config); -// -// FOR EXTENSION DEVELOPER: -// Different predictors are designated by config type. Similar configs can be -// merged, but there shouldn't be a huge config containing different fields for -// more than one kind of predictors. -template -std::unique_ptr CreatePaddlePredictor(const ConfigT& config); - -// NOTE The following APIs are too trivial, we will discard it in the following -// versions. -enum class PaddleEngineKind { - kNative = 0, // Use the native Fluid facility. - kAutoMixedTensorRT, // Automatically mix Fluid with TensorRT. - kAnalysis, // More optimization. - kAnakin // Use Anakin for inference, not mature yet. 
-}; - -template -std::unique_ptr CreatePaddlePredictor(const ConfigT& config); - -// == -// -// ----------------------------------------------------------------------------------- -// NOTE: The following APIs are not mature yet, we are still working on them. - -namespace contrib { - -// Accelerate GPU computation with TensorRT engine. -struct MixedRTConfig : public NativeConfig { - // Determine whether a subgraph will be executed by TRT. - int min_subgraph_size{1}; - // While TensorRT allows an engine optimized for a given max batch size - // to run at any smaller size, the performance for those smaller - // sizes may not be as well-optimized. Therefore, Max batch is best - // equivalent to the runtime batch size. - int max_batch_size{1}; - // For workspace_size, refer it from here: - // https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#troubleshooting - int workspace_size{1 << 30}; - // We transform the Ops that can be converted into TRT layer in the model, - // and aggregate these Ops into subgraphs for TRT execution. - // We set this variable to control the minimum number of nodes in the - // subgraph, 3 as default value. - int minimum_subgraph_size = 3; - // Reserved configuration - // We just support "FP32" now, "FP16" and "INT8" will be supported. - std::string precision_mode = "FP32"; -}; - -// NOTE WIP, not stable yet. -struct AnalysisConfig : public NativeConfig { - enum class IrPassMode { - kSystem, // Use system default passes, not customize. - kInclude, // Specify the passes in `ir_passes`. - kExclude // Specify the disabled passes in `ir_passes`. - }; - - // Determine whether to perform graph optimization. - bool enable_ir_optim = true; - // Manually determine the IR passes to run. - IrPassMode ir_mode{IrPassMode::kExclude}; - std::vector ir_passes{"embedding_fc_lstm_fuse_pass"}; - - // NOT stable yet. - bool use_feed_fetch_ops{true}; - - // NOTE this is just for internal development, please not use it. - // NOT stable yet. - bool _use_mkldnn{false}; -}; - -// Configurations for Anakin engine. -struct AnakinConfig : public PaddlePredictor::Config { - enum TargetType { NVGPU = 0, X86 }; - int device; - std::string model_file; - int max_batch_size{-1}; - TargetType target_type; -}; - -} // namespace contrib - -int PaddleDtypeSize(PaddleDType dtype); - -} // namespace paddle +#include "paddle_api.h" // NOLINT +#ifndef WITH_ANAKIN +#include "paddle_analysis_config.h" // NOLINT +#else +#include "paddle_anakin_config.h" // NOLINT +#endif diff --git a/paddle/fluid/inference/api/paddle_pass_builder.cc b/paddle/fluid/inference/api/paddle_pass_builder.cc new file mode 100644 index 0000000000000000000000000000000000000000..bc3ce72f0832c4bf029f86e023bd9ff11f6578bd --- /dev/null +++ b/paddle/fluid/inference/api/paddle_pass_builder.cc @@ -0,0 +1,68 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
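// For orientation, the builder implemented below is typically driven through
// contrib::AnalysisConfig::pass_builder() from paddle_analysis_config.h. A
// minimal sketch (illustrative; the pass name is one entry of the default
// CpuPassStrategy list declared in paddle_pass_builder.h):
void CustomizePassesSketch() {
  paddle::contrib::AnalysisConfig config(false /*use_gpu*/);
  paddle::PassStrategy* builder = config.pass_builder();
  builder->DeletePass("conv_bn_fuse_pass");  // drop all passes of this type
  builder->TurnOnDebug();  // interleave graph_viz_pass after each pass
  LOG(INFO) << builder->DebugString();  // human-readable pass list
}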
+
+#include "paddle/fluid/inference/api/paddle_pass_builder.h"
+#include <glog/logging.h>
+
+namespace paddle {
+
+void PaddlePassBuilder::AppendPass(const std::string &pass_type) {
+  passes_.push_back(pass_type);
+}
+
+void PaddlePassBuilder::TurnOnDebug() {
+  auto it = std::begin(passes_);
+  while (it != std::end(passes_)) {
+    if (*it != "graph_viz_pass") {
+      it = passes_.insert(it + 1, "graph_viz_pass");
+    } else {
+      ++it;
+    }
+  }
+}
+
+std::string PaddlePassBuilder::DebugString() {
+  std::stringstream ss;
+  ss << "Passes to apply:\n";
+  for (auto &pass : passes_) {
+    ss << "  - " << pass << '\n';
+  }
+  return ss.str();
+}
+
+void PaddlePassBuilder::DeletePass(const std::string &pass_type) {
+  auto it = std::begin(passes_);
+  while (it != std::end(passes_)) {
+    if (*it == pass_type) {
+      it = passes_.erase(it);
+    } else {
+      ++it;
+    }
+  }
+}
+
+void PaddlePassBuilder::InsertPass(size_t idx, const std::string &pass_type) {
+  passes_.insert(std::begin(passes_) + idx, pass_type);
+}
+
+void PaddlePassBuilder::DeletePass(size_t idx) {
+  passes_.erase(std::begin(passes_) + idx);
+}
+
+void GpuPassStrategy::EnableMKLDNN() {
+  LOG(ERROR) << "GPU does not support MKLDNN yet";
+}
+
+} // namespace paddle
diff --git a/paddle/fluid/inference/api/paddle_pass_builder.h b/paddle/fluid/inference/api/paddle_pass_builder.h
new file mode 100644
index 0000000000000000000000000000000000000000..12e3a6f42e14010feedbbb5d8f8a98f60cea4556
--- /dev/null
+++ b/paddle/fluid/inference/api/paddle_pass_builder.h
@@ -0,0 +1,136 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include
+#include
+#include
+
+namespace paddle {
+/*
+ * This is a pass builder based on strings. It is part of the inference API.
+ */
+class PaddlePassBuilder {
+ public:
+  explicit PaddlePassBuilder(const std::vector<std::string> &passes)
+      : passes_(passes) {}
+
+  void AppendPass(const std::string &pass_type);
+
+  void InsertPass(size_t idx, const std::string &pass_type);
+
+  // Delete the `idx`-th pass.
+  void DeletePass(size_t idx);
+
+  // Delete all the passes that have type `pass_type`.
+  void DeletePass(const std::string &pass_type);
+
+  // Visualize the computation graph after each pass by generating a DOT
+  // language file; one can draw them with the Graphviz toolkit.
+  void TurnOnDebug();
+
+  // Human-readable information.
+  std::string DebugString();
+
+  const std::vector<std::string> &AllPasses() const { return passes_; }
+
+ protected:
+  std::vector<std::string> passes_;
+};
+
+/*
+ * Pass strategy to help control the IR passes.
+ */
+class PassStrategy : public PaddlePassBuilder {
+ public:
+  explicit PassStrategy(const std::vector<std::string> &passes)
+      : PaddlePassBuilder(passes) {}
+
+  // The MKLDNN control exists in both CPU and GPU mode, because there can
+  // still be some CPU kernels running in GPU mode.
+  virtual void EnableMKLDNN() = 0;
+
+  virtual ~PassStrategy() = default;
+};
+
+/*
+ * The CPU passes controller; it is used in AnalysisPredictor with CPU mode.
+ */
+class CpuPassStrategy : public PassStrategy {
+ public:
+  CpuPassStrategy() : PassStrategy({}) {
+    // NOTE the large fusions should be located at the front, so that they
+    // will not be damaged by smaller ones.
+    passes_.assign({
+        "infer_clean_graph_pass",         //
+        "attention_lstm_fuse_pass",       //
+        "seqconv_eltadd_relu_fuse_pass",  //
+        // "embedding_fc_lstm_fuse_pass", //
+        "fc_lstm_fuse_pass",              //
+        "mul_lstm_fuse_pass",             //
+        "fc_gru_fuse_pass",               //
+        "mul_gru_fuse_pass",              //
+        "seq_concat_fc_fuse_pass",        //
+        "fc_fuse_pass",                   //
+        "conv_bn_fuse_pass",              //
+        "conv_eltwiseadd_bn_fuse_pass",   //
+        "is_test_pass",                   //
+    });
+  }
+
+  virtual ~CpuPassStrategy() = default;
+
+  void EnableMKLDNN() override {
+// TODO(Superjomn) Consider the way to mix CPU with GPU.
+#ifdef PADDLE_WITH_MKLDNN
+    passes_.insert(passes_.begin(), "mkldnn_placement_pass");
+
+    for (auto &pass : std::vector<std::string>(
+             {"depthwise_conv_mkldnn_pass",            //
+              "conv_bias_mkldnn_fuse_pass",            //
+              "conv_relu_mkldnn_fuse_pass",            //
+              "conv_elementwise_add_mkldnn_fuse_pass"})) {
+      passes_.push_back(pass);
+    }
+#endif
+  }
+
+  CpuPassStrategy(const CpuPassStrategy &other) : PassStrategy(other.passes_) {}
+};
+
+/*
+ * The GPU passes strategy; it is used in AnalysisPredictor with GPU mode.
+ */
+class GpuPassStrategy : public PassStrategy {
+ public:
+  GpuPassStrategy() : PassStrategy({}) {
+    // TODO(NHZlX) Problem with data synchronization between GPU and CPU.
+    // When running in GPU mode, the parameters are all on GPU, but the
+    // operations of "conv_bn_fuse_pass" are on CPU.
+    passes_.assign({
+        "infer_clean_graph_pass",
+        // "infer_clean_graph_pass", "conv_bn_fuse_pass",
+    });
+  }
+
+  GpuPassStrategy(const GpuPassStrategy &other)
+      : PassStrategy(other.AllPasses()) {}
+
+  void EnableMKLDNN() override;
+
+  virtual ~GpuPassStrategy() = default;
+};
+
+} // namespace paddle
diff --git a/paddle/fluid/inference/io.cc b/paddle/fluid/inference/io.cc
index e246a06fd079d837ac321197914c9f70b528f2c8..bb749e8f8b0ba9d5cd82d91ce86c619f52f34c30 100644
--- a/paddle/fluid/inference/io.cc
+++ b/paddle/fluid/inference/io.cc
@@ -59,7 +59,8 @@ void ReadBinaryFile(const std::string& filename, std::string* contents) {
 bool IsPersistable(const framework::VarDesc* var) {
   if (var->Persistable() &&
       var->GetType() != framework::proto::VarType::FEED_MINIBATCH &&
-      var->GetType() != framework::proto::VarType::FETCH_LIST) {
+      var->GetType() != framework::proto::VarType::FETCH_LIST &&
+      var->GetType() != framework::proto::VarType::RAW) {
     return true;
   }
   return false;
@@ -77,7 +78,7 @@ void LoadPersistables(framework::Executor* executor, framework::Scope* scope,
 
   for (auto* var : global_block.AllVars()) {
     if (IsPersistable(var)) {
-      VLOG(3) << "persistable variable's name: " << var->Name();
+      VLOG(30) << "persistable variable's name: " << var->Name();
 
       framework::VarDesc* new_var = load_block->Var(var->Name());
       new_var->SetShape(var->GetShape());
@@ -120,7 +121,7 @@ std::unique_ptr Load(framework::Executor* executor,
                      const std::string& dirname) {
   std::string model_filename = dirname + "/__model__";
   std::string program_desc_str;
-  VLOG(3) << "loading model from " << model_filename;
+  VLOG(30) << "loading model from " << model_filename;
   ReadBinaryFile(model_filename, &program_desc_str);
 
   std::unique_ptr main_program(
diff --git a/paddle/fluid/inference/tensorrt/CMakeLists.txt b/paddle/fluid/inference/tensorrt/CMakeLists.txt
index a610687a5b11999a7cb7426dbe961e5972ee1746..17f6c6d9f10abf99fd93364d1356e2b3ef1b3934 100644
--- a/paddle/fluid/inference/tensorrt/CMakeLists.txt
+++
b/paddle/fluid/inference/tensorrt/CMakeLists.txt @@ -1,4 +1,5 @@ -nv_library(tensorrt_engine SRCS engine.cc DEPS framework_proto device_context) +nv_library(tensorrt_engine SRCS engine.cc DEPS ${GLOB_OPERATOR_DEPS} framework_proto device_context) nv_test(test_tensorrt SRCS test_tensorrt.cc DEPS dynload_cuda device_context dynamic_loader) nv_test(test_tensorrt_engine SRCS test_engine.cc DEPS dynload_cuda tensorrt_engine) +add_subdirectory(plugin) add_subdirectory(convert) diff --git a/paddle/fluid/inference/tensorrt/convert/CMakeLists.txt b/paddle/fluid/inference/tensorrt/convert/CMakeLists.txt index 0a35e10f6936313928ab21a6f17c40335e8fc882..840abd26a755c39bc9c17315aefdd0dec862e77c 100644 --- a/paddle/fluid/inference/tensorrt/convert/CMakeLists.txt +++ b/paddle/fluid/inference/tensorrt/convert/CMakeLists.txt @@ -1,33 +1,42 @@ # Add TRT tests nv_library(tensorrt_converter - SRCS mul_op.cc conv2d_op.cc fc_op.cc pool2d_op.cc elementwise_op.cc -batch_norm_op.cc activation_op.cc softmax_op.cc concat_op.cc dropout_op.cc pad_op.cc - DEPS tensorrt_engine operator scope framework_proto op_registry) + SRCS mul_op.cc conv2d_op.cc fc_op.cc pool2d_op.cc elementwise_op.cc + batch_norm_op.cc activation_op.cc softmax_op.cc concat_op.cc dropout_op.cc + pad_op.cc split_op.cc prelu_op.cc leaky_relu_op.cc + DEPS tensorrt_engine tensorrt_plugin operator scope framework_proto op_registry) nv_test(test_op_converter SRCS test_op_converter.cc DEPS - ${FLUID_CORE_MODULES} tensorrt_engine tensorrt_converter) + ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine tensorrt_converter) nv_test(test_io_converter SRCS test_io_converter.cc io_converter.cc DEPS dynload_cuda dynamic_loader lod_tensor) nv_test(test_trt_mul_op SRCS test_mul_op.cc mul_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine mul_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine mul_op SERIAL) nv_test(test_trt_fc_op SRCS test_fc_op.cc fc_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine mul_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine mul_op SERIAL) nv_test(test_trt_activation_op SRCS test_activation_op.cc activation_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine activation_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine activation_op SERIAL) nv_test(test_trt_conv_op SRCS test_conv2d_op.cc conv2d_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine conv_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine conv_op conv_transpose_op SERIAL) nv_test(test_trt_pool2d_op SRCS test_pool2d_op.cc pool2d_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine pool_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine pool_op tensorrt_plugin SERIAL) nv_test(test_trt_elementwise_op SRCS test_elementwise_op.cc elementwise_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine elementwise_add_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine tensorrt_plugin + elementwise_add_op elementwise_mul_op SERIAL) nv_test(test_trt_softmax_op SRCS test_softmax_op.cc softmax_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine softmax_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine softmax_op SERIAL) nv_test(test_trt_batch_norm_op SRCS test_batch_norm_op.cc batch_norm_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine batch_norm_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine batch_norm_op SERIAL) nv_test(test_trt_concat_op SRCS test_concat_op.cc 
concat_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine concat_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine concat_op SERIAL) nv_test(test_trt_dropout_op SRCS test_dropout_op.cc dropout_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine dropout_op SERIAL) - + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine dropout_op SERIAL) nv_test(test_trt_pad_op SRCS test_pad_op.cc pad_op.cc - DEPS ${FLUID_CORE_MODULES} tensorrt_engine pad_op SERIAL) + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine pad_op SERIAL) +nv_test(test_trt_split_op SRCS test_split_op.cc split_op.cc + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine tensorrt_plugin + split_op concat_op SERIAL) +nv_test(test_trt_prelu_op SRCS test_prelu_op.cc prelu_op.cc + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine tensorrt_plugin + prelu_op SERIAL) +nv_test(test_trt_leaky_relu_op SRCS test_leaky_relu_op.cc leaky_relu_op.cc + DEPS ${FLUID_CORE_MODULES} ${GLOB_OPERATOR_DEPS} tensorrt_engine activation_op SERIAL) diff --git a/paddle/fluid/inference/tensorrt/convert/activation_op.cc b/paddle/fluid/inference/tensorrt/convert/activation_op.cc index e73c5bbf57501e4ff3c080a46d91685035652bfa..0b756534ec6fbf27a3e92bf39fb7544d9785ca48 100644 --- a/paddle/fluid/inference/tensorrt/convert/activation_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/activation_op.cc @@ -27,7 +27,7 @@ class ActivationOpConverter : public OpConverter { // Here the two nullptr looks strange, that's because the // framework::OpDesc's constructor is strange. framework::OpDesc op_desc(op, nullptr); - LOG(INFO) + VLOG(3) << "convert a fluid Activation op to tensorrt activation layer whose " "type is " << op_type_; diff --git a/paddle/fluid/inference/tensorrt/convert/batch_norm_op.cc b/paddle/fluid/inference/tensorrt/convert/batch_norm_op.cc index 3330af2da6c97ad153dcecd86be4b441eac62b5e..d017bac66dd99a4b54c44ec786de61d1e66b8981 100644 --- a/paddle/fluid/inference/tensorrt/convert/batch_norm_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/batch_norm_op.cc @@ -23,7 +23,7 @@ class BatchNormOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - LOG(INFO) << "convert a fluid batch norm op to tensorrt batch_norm"; + VLOG(3) << "convert a fluid batch norm op to tensorrt batch_norm"; framework::OpDesc op_desc(op, nullptr); PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); diff --git a/paddle/fluid/inference/tensorrt/convert/concat_op.cc b/paddle/fluid/inference/tensorrt/convert/concat_op.cc index a11dfa1e8f2dacfad067d025678911200db500fb..525ba9dc341c8c1343553ac9523611f79ac3aa2d 100644 --- a/paddle/fluid/inference/tensorrt/convert/concat_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/concat_op.cc @@ -19,13 +19,13 @@ namespace inference { namespace tensorrt { /* - * MulOp, IMatrixMultiplyLayer in TRT. This Layer doesn't has weights. 
+ * ConcatOp */ class ConcatOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) << "convert a fluid mul op to tensorrt mul layer without bias"; + VLOG(3) << "convert a fluid concat op to a tensorrt concat layer"; framework::OpDesc op_desc(op, nullptr); // Declare inputs diff --git a/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc b/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc index 0a37d3968c39d2c244bbd82161afddf6330e421d..7900f56c9ce17ffc7c62c85a42c62ba326dea16e 100644 --- a/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc @@ -18,74 +18,139 @@ namespace paddle { namespace inference { namespace tensorrt { +bool to_skip_merging_optimize(TensorRTEngine* engine, + const std::vector<int>& filters, + const std::vector<int>& strides, + const std::vector<int>& paddings, + std::string input_name) { + if (engine->itensor_quote_num[input_name] > 0) { + return true; + } + if (filters[0] == 1 && filters[1] == 1 && strides[0] == 1 && + strides[1] == 1 && paddings[0] == 0 && paddings[1] == 0) + engine->itensor_quote_num[input_name] += 1; + + return false; +} + +template <typename RegistFunc, typename SetDilationFunc> +void ConvertConv2d(TensorRTEngine* engine, const framework::proto::OpDesc& op, + const framework::Scope& scope, bool test_mode, + RegistFunc fadd_layer, SetDilationFunc fset_dilation, + const std::string& name) { + VLOG(3) << "convert a fluid " << name << " op to a tensorrt layer without bias"; + + framework::OpDesc op_desc(op, nullptr); + PADDLE_ENFORCE_EQ(op_desc.Input("Input").size(), 1); + PADDLE_ENFORCE_EQ(op_desc.Input("Filter").size(), 1); // Y is a weight + PADDLE_ENFORCE_EQ(op_desc.Output("Output").size(), 1); + + PADDLE_ENFORCE(engine != nullptr); + auto* X = engine->GetITensor(op_desc.Input("Input").front()); + + // Declare weights + auto* Y_v = scope.FindVar(op_desc.Input("Filter").front()); + PADDLE_ENFORCE_NOT_NULL(Y_v); + auto* Y_t = Y_v->GetMutable<framework::LoDTensor>(); + + platform::CPUPlace cpu_place; + std::unique_ptr<framework::LoDTensor> weight_tensor( + new framework::LoDTensor()); + weight_tensor->Resize(Y_t->dims()); + TensorCopySync((*Y_t), cpu_place, weight_tensor.get()); + + auto* weight_data = weight_tensor->mutable_data<float>(platform::CPUPlace()); + + PADDLE_ENFORCE_EQ(weight_tensor->dims().size(), 4UL); + const int n_output = weight_tensor->dims()[0]; + const int n_input = weight_tensor->dims()[1]; + const int filter_h = weight_tensor->dims()[2]; + const int filter_w = weight_tensor->dims()[3]; + const int groups = boost::get<int>(op_desc.GetAttr("groups")); + const std::vector<int> dilations = + boost::get<std::vector<int>>(op_desc.GetAttr("dilations")); + const std::vector<int> strides = + boost::get<std::vector<int>>(op_desc.GetAttr("strides")); + const std::vector<int> paddings = + boost::get<std::vector<int>>(op_desc.GetAttr("paddings")); + + nvinfer1::DimsHW nv_ksize(filter_h, filter_w); + nvinfer1::DimsHW nv_dilations(dilations[0], dilations[1]); + nvinfer1::DimsHW nv_strides(strides[0], strides[1]); + nvinfer1::DimsHW nv_paddings(paddings[0], paddings[1]); + + TensorRTEngine::Weight weight{nvinfer1::DataType::kFLOAT, + static_cast<void*>(weight_data), + static_cast<size_t>(weight_tensor->numel())}; + + TensorRTEngine::Weight bias{nvinfer1::DataType::kFLOAT, nullptr, 0}; + auto* layer = fadd_layer(const_cast<nvinfer1::ITensor*>(X), n_output, n_input, + nv_ksize, weight, bias); + PADDLE_ENFORCE(layer != nullptr); + layer->setStride(nv_strides); + layer->setPadding(nv_paddings); + layer->setNbGroups(groups); + // set dilations + fset_dilation(layer, nv_dilations); + + auto
output_name = op_desc.Output("Output").front(); + layer->setName((name + " (Output: " + output_name + ")").c_str()); + engine->weight_map[op_desc.Input("Filter").front()] = + std::move(weight_tensor); + layer->getOutput(0)->setName(output_name.c_str()); + engine->SetITensor(output_name, layer->getOutput(0)); + + if (test_mode || + to_skip_merging_optimize(engine, {filter_h, filter_w}, strides, paddings, + op_desc.Input("Input").front())) { + engine->DeclareOutput(output_name); + } +} + class Conv2dOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - LOG(INFO) - << "convert a fluid conv2d op to tensorrt conv layer without bias"; - - framework::OpDesc op_desc(op, nullptr); - PADDLE_ENFORCE_EQ(op_desc.Input("Input").size(), 1); - PADDLE_ENFORCE_EQ(op_desc.Input("Filter").size(), 1); // Y is a weight - PADDLE_ENFORCE_EQ(op_desc.Output("Output").size(), 1); - - auto* X = engine_->GetITensor(op_desc.Input("Input").front()); - // Declare weights - auto* Y_v = scope.FindVar(op_desc.Input("Filter").front()); - PADDLE_ENFORCE_NOT_NULL(Y_v); - auto* Y_t = Y_v->GetMutable(); - - platform::CPUPlace cpu_place; - std::unique_ptr weight_tensor( - new framework::LoDTensor()); - weight_tensor->Resize(Y_t->dims()); - TensorCopySync((*Y_t), cpu_place, weight_tensor.get()); - - auto* weight_data = - weight_tensor->mutable_data(platform::CPUPlace()); - - PADDLE_ENFORCE_EQ(weight_tensor->dims().size(), 4UL); - const int n_output = weight_tensor->dims()[0]; - const int filter_h = weight_tensor->dims()[2]; - const int filter_w = weight_tensor->dims()[3]; - - const int groups = boost::get(op_desc.GetAttr("groups")); - const std::vector dilations = - boost::get>(op_desc.GetAttr("dilations")); - const std::vector strides = - boost::get>(op_desc.GetAttr("strides")); - const std::vector paddings = - boost::get>(op_desc.GetAttr("paddings")); - - nvinfer1::DimsHW nv_ksize(filter_h, filter_w); - nvinfer1::DimsHW nv_dilations(dilations[0], dilations[1]); - nvinfer1::DimsHW nv_strides(strides[0], strides[1]); - nvinfer1::DimsHW nv_paddings(paddings[0], paddings[1]); - - TensorRTEngine::Weight weight{nvinfer1::DataType::kFLOAT, - static_cast(weight_data), - weight_tensor->memory_size() / sizeof(float)}; - - TensorRTEngine::Weight bias{nvinfer1::DataType::kFLOAT, nullptr, 0}; - auto* layer = TRT_ENGINE_ADD_LAYER( - engine_, Convolution, *const_cast(X), n_output, - nv_ksize, weight.get(), bias.get()); - PADDLE_ENFORCE(layer != nullptr); - layer->setStride(nv_strides); - layer->setPadding(nv_paddings); - layer->setDilation(nv_dilations); - layer->setNbGroups(groups); - - auto output_name = op_desc.Output("Output").front(); - layer->setName(("conv2d (Output: " + output_name + ")").c_str()); - engine_->weight_map[op_desc.Input("Filter").front()] = - std::move(weight_tensor); - layer->getOutput(0)->setName(output_name.c_str()); - engine_->SetITensor(output_name, layer->getOutput(0)); - if (test_mode) { - engine_->DeclareOutput(output_name); - } + ConvertConv2d( + engine_, op, scope, test_mode, + [&](nvinfer1::ITensor* inputs, int n_output, /* Conv output maps */ + int n_input, /* Conv input maps */ + nvinfer1::DimsHW& ksize, TensorRTEngine::Weight& weight, + TensorRTEngine::Weight& bias) -> nvinfer1::IConvolutionLayer* { + auto* layer = + TRT_ENGINE_ADD_LAYER(engine_, Convolution, *inputs, n_output, + ksize, weight.get(), bias.get()); + return layer; + }, + [](nvinfer1::IConvolutionLayer* layer, nvinfer1::DimsHW& dilations) { + 
layer->setDilation(dilations); + }, + "conv2d"); + } +}; + +class Deconv2dOpConverter : public OpConverter { + public: + void operator()(const framework::proto::OpDesc& op, + const framework::Scope& scope, bool test_mode) override { + ConvertConv2d( + engine_, op, scope, test_mode, + [&](nvinfer1::ITensor* inputs, int n_output, /* Deconv input maps */ + int n_input, /* Deconv output maps */ + nvinfer1::DimsHW& ksize, TensorRTEngine::Weight& weight, + TensorRTEngine::Weight& bias) -> nvinfer1::IDeconvolutionLayer* { + auto* layer = + TRT_ENGINE_ADD_LAYER(engine_, Deconvolution, *inputs, n_input, + ksize, weight.get(), bias.get()); + return layer; + }, + [](nvinfer1::IDeconvolutionLayer* layer, nvinfer1::DimsHW& dilations) { + PADDLE_ENFORCE( + dilations.d[0] == 1 && dilations.d[1] == 1, + "Dilations must be (1, 1) for tensorRT, but given (%d, %d)", + dilations.d[0], dilations.d[1]); + }, + "conv2d_transpose"); } }; @@ -94,3 +159,4 @@ class Conv2dOpConverter : public OpConverter { } // namespace paddle REGISTER_TRT_OP_CONVERTER(conv2d, Conv2dOpConverter); +REGISTER_TRT_OP_CONVERTER(conv2d_transpose, Deconv2dOpConverter); diff --git a/paddle/fluid/inference/tensorrt/convert/dropout_op.cc b/paddle/fluid/inference/tensorrt/convert/dropout_op.cc index 9533ecbcfda4e2500fd201d8efc64fc5bd97169a..ddbc724e3b2a48b75df17f9bda691a1fd3883c32 100644 --- a/paddle/fluid/inference/tensorrt/convert/dropout_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/dropout_op.cc @@ -25,7 +25,7 @@ class DropoutOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) << "convert a fluid dropout op to tensorrt dropout layer"; + VLOG(3) << "convert a fluid dropout op to tensorrt dropout layer"; framework::OpDesc op_desc(op, nullptr); // Declare inputs auto* input1 = engine_->GetITensor(op_desc.Input("X")[0]); diff --git a/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc b/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc index 0a6ce568f194f03c7259e1ebf28dd6ce4df2d594..6975086193d991dc9f53b2d9d988f960c8ad118d 100644 --- a/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc @@ -4,7 +4,7 @@ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, @@ -13,11 +13,25 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.h" namespace paddle { namespace inference { namespace tensorrt { +static bool CheckDims(const nvinfer1::Dims& dims_x, + const nvinfer1::Dims& dims_y) { + if (dims_x.nbDims != dims_y.nbDims) { + return false; + } + for (int i = 0; i < dims_x.nbDims; i++) { + if (dims_x.d[i] != dims_y.d[i]) { + return false; + } + } + return true; +} + class ElementwiseWeightOpConverter : public OpConverter { public: ElementwiseWeightOpConverter() {} @@ -26,7 +40,7 @@ class ElementwiseWeightOpConverter : public OpConverter { // Here the two nullptr looks strange, that's because the // framework::OpDesc's constructor is strange. 
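ConvertConv2d keeps everything conv2d and conv2d_transpose share (copying the filter to CPU, reading attributes, setting stride, padding, and groups) in one function, and injects only the two steps that differ as callables: fadd_layer creates the Convolution or Deconvolution layer, and fset_dilation applies or rejects dilations. The same shape, reduced to a minimal compilable sketch; Layer and Convert are made-up names standing in for the real types:

#include <iostream>
#include <stdexcept>
#include <string>

struct Layer {
  std::string kind;
  int dilation = 1;
};

// Shared skeleton: the common plumbing lives here once; per-op differences
// arrive as callables, mirroring fadd_layer / fset_dilation above.
template <typename AddLayer, typename SetDilation>
Layer Convert(int n_output, int dilation, AddLayer add_layer,
              SetDilation set_dilation, const std::string &name) {
  Layer layer = add_layer(n_output);  // differs: Convolution vs. Deconvolution
  set_dilation(layer, dilation);      // differs: deconv rejects dilation != 1
  layer.kind = name;
  return layer;
}

int main() {
  Layer conv = Convert(8, 2, [](int) { return Layer{"conv"}; },
                       [](Layer &l, int d) { l.dilation = d; }, "conv2d");
  Layer deconv = Convert(8, 1, [](int) { return Layer{"deconv"}; },
                         [](Layer &, int d) {
                           if (d != 1) throw std::runtime_error("dilation");
                         },
                         "conv2d_transpose");
  std::cout << conv.kind << " " << deconv.kind << "\n";
}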
framework::OpDesc op_desc(op, nullptr); - LOG(INFO) << "convert a fluid elementwise op to tensorrt IScaleLayer"; + VLOG(3) << "Convert a fluid elementwise op to TensorRT IScaleLayer"; PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); PADDLE_ENFORCE_EQ(op_desc.Input("Y").size(), 1); // Y is a weight @@ -34,7 +48,8 @@ auto* X = engine_->GetITensor(op_desc.Input("X").front()); nvinfer1::Dims dims_x = X->getDimensions(); - PADDLE_ENFORCE(dims_x.nbDims >= 3); + PADDLE_ENFORCE(dims_x.nbDims >= 3, "x dims is expected to be at least 3, but %d is given.", + dims_x.nbDims); auto* Y_v = scope.FindVar(op_desc.Input("Y").front()); PADDLE_ENFORCE_NOT_NULL(Y_v); @@ -105,10 +120,12 @@ class ElementwiseTensorOpConverter : public OpConverter { ElementwiseTensorOpConverter() {} void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { + auto op_pair = ops.find(op_type_); + PADDLE_ENFORCE(op_pair != ops.end(), "Wrong elementwise op type!"); + // Here the two nullptr looks strange, that's because the // framework::OpDesc's constructor is strange. framework::OpDesc op_desc(op, nullptr); - LOG(INFO) << "convert a fluid elementwise op to tensorrt IScaleLayer"; PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); PADDLE_ENFORCE_EQ(op_desc.Input("Y").size(), 1); // Y is a weight @@ -119,29 +136,35 @@ class ElementwiseTensorOpConverter : public OpConverter { nvinfer1::Dims dims_x = X->getDimensions(); nvinfer1::Dims dims_y = Y->getDimensions(); - // The two input tensor should have the same dims - PADDLE_ENFORCE(dims_x.nbDims >= 3); - if (dims_x.nbDims == dims_y.nbDims) { - for (int i = 0; i < dims_x.nbDims; i++) { - if (dims_x.d[i] != dims_y.d[i]) - PADDLE_THROW("TensorRT unsupported tensor shape for Elementwise op!"); - } - } else { - PADDLE_THROW("TensorRT unsupported tensor shape for Elementwise op!"); - } + int axis = boost::get<int>(op_desc.GetAttr("axis")); + auto output_name = op_desc.Output("Out")[0]; + if (CheckDims(dims_x, dims_y)) { + // The two input tensors should have the same dims + VLOG(3) << "Convert a fluid elementwise op to TensorRT IElementWiseLayer"; - auto op_pair = ops.find(op_type_); - if (op_pair == ops.end()) { - PADDLE_THROW("Wrong elementwise op type!"); - } - nvinfer1::IElementWiseLayer* layer = TRT_ENGINE_ADD_LAYER( - engine_, ElementWise, *const_cast<nvinfer1::ITensor*>(X), - *const_cast<nvinfer1::ITensor*>(Y), op_pair->second); + nvinfer1::IElementWiseLayer* layer = TRT_ENGINE_ADD_LAYER( + engine_, ElementWise, *const_cast<nvinfer1::ITensor*>(X), + *const_cast<nvinfer1::ITensor*>(Y), op_pair->second); - auto output_name = op_desc.Output("Out")[0]; - layer->setName(("elementwise (Output: " + output_name + ")").c_str()); - layer->getOutput(0)->setName(output_name.c_str()); - engine_->SetITensor(output_name, layer->getOutput(0)); + layer->setName(("elementwise (Output: " + output_name + ")").c_str()); + layer->getOutput(0)->setName(output_name.c_str()); + engine_->SetITensor(output_name, layer->getOutput(0)); + } else { + VLOG(3) << "Convert a fluid elementwise op to TensorRT " + "ElementWisePluginLayer"; + + plugin::ElementWisePlugin* plugin = + new plugin::ElementWisePlugin(op_pair->second, dims_x, dims_y, axis); + plugin->AddInput(X); + plugin->AddInput(Y); + nvinfer1::IPluginLayer* layer = engine_->AddPlugin( + const_cast<nvinfer1::ITensor* const*>(plugin->GetInputs().data()), 2, + reinterpret_cast<plugin::PluginTensorRT*>(plugin)); + + layer->setName(("elementwise (Output: " + output_name + ")").c_str()); + layer->getOutput(0)->setName(output_name.c_str()); + engine_->SetITensor(output_name, layer->getOutput(0)); + } if (test_mode)
{ // the test framework can not determine which is the // output, so place the declaration inside. engine_->DeclareOutput(output_name); diff --git a/paddle/fluid/inference/tensorrt/convert/fc_op.cc b/paddle/fluid/inference/tensorrt/convert/fc_op.cc index 7c21ecd95da07b498eed2ab1bbdcc0e8cd184787..eef4fab4e86f05fa80bc614371f1aa43e433407e 100644 --- a/paddle/fluid/inference/tensorrt/convert/fc_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/fc_op.cc @@ -52,7 +52,7 @@ class FcOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) << "convert a fluid fc op to tensorrt fc layer without bias"; + VLOG(3) << "convert a fluid fc op to tensorrt fc layer without bias"; framework::OpDesc op_desc(op, nullptr); PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); diff --git a/paddle/fluid/inference/tensorrt/convert/leaky_relu_op.cc b/paddle/fluid/inference/tensorrt/convert/leaky_relu_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..3f6ed04c46d70b1ab68b4c01ef0c908a1a8d1a19 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/leaky_relu_op.cc @@ -0,0 +1,95 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +// LeakyRelu converter from fluid to tensorRT +class LeakyReluOpConverter : public OpConverter { + public: + void operator()(const framework::proto::OpDesc& op, + const framework::Scope& scope, bool test_mode) override { + VLOG(4) << "convert fluid leaky_relu op to tensorrt layer"; + + framework::OpDesc op_desc(op, nullptr); + // Declare inputs + int input_num = op_desc.Input("X").size(); + PADDLE_ENFORCE(input_num == 1); + auto* input = engine_->GetITensor(op_desc.Input("X")[0]); + // Get output + size_t output_num = op_desc.Output("Out").size(); + PADDLE_ENFORCE(output_num == 1); + // Get attrs + float alpha = boost::get(op_desc.GetAttr("alpha")); + + platform::CPUPlace place; + std::unique_ptr alpha_tensor( + new framework::LoDTensor()); + alpha_tensor->Resize(framework::make_ddim({2})); + float* alpha_data = alpha_tensor->mutable_data(place); + alpha_data[0] = alpha; + alpha_data[1] = 1.f - alpha; + // the leaky relu formula y = (x > 0) ? x : alpha * x is equal to + // y = alpha * x + (x > 0) ? 
(1 - alpha) * x : 0 + TensorRTEngine::Weight scale{nvinfer1::DataType::kFLOAT, &alpha_data[0], 1}; + TensorRTEngine::Weight shift{nvinfer1::DataType::kFLOAT, nullptr, 0}; + TensorRTEngine::Weight power{nvinfer1::DataType::kFLOAT, nullptr, 0}; + // y_scale = alpha * x + auto* scale_layer = TRT_ENGINE_ADD_LAYER( + engine_, Scale, *input, nvinfer1::ScaleMode::kUNIFORM, shift.get(), + scale.get(), power.get()); + PADDLE_ENFORCE(nullptr != scale_layer); + // y_relu = (x > 0) ? x : 0 + auto* relu_layer = TRT_ENGINE_ADD_LAYER(engine_, Activation, *input, + nvinfer1::ActivationType::kRELU); + PADDLE_ENFORCE(nullptr != relu_layer); + // + TensorRTEngine::Weight sub_scale{nvinfer1::DataType::kFLOAT, &alpha_data[1], + 1}; + auto* scale_relu_layer = + TRT_ENGINE_ADD_LAYER(engine_, Scale, *(relu_layer->getOutput(0)), + nvinfer1::ScaleMode::kUNIFORM, shift.get(), + sub_scale.get(), power.get()); + PADDLE_ENFORCE(nullptr != scale_relu_layer); + auto* output_layer = + TRT_ENGINE_ADD_LAYER(engine_, ElementWise, *(scale_layer->getOutput(0)), + *(scale_relu_layer->getOutput(0)), + nvinfer1::ElementWiseOperation::kSUM); + PADDLE_ENFORCE(nullptr != output_layer); + // keep the alpha tensor alive to avoid releasing its memory + std::string alpha_name = op_desc.Output("Out")[0] + "_alpha"; + PADDLE_ENFORCE(engine_->weight_map.find(alpha_name) == + engine_->weight_map.end()); + engine_->weight_map[alpha_name] = std::move(alpha_tensor); + + std::string layer_name = "leaky_relu (Output: "; + auto output_name = op_desc.Output("Out")[0]; + output_layer->getOutput(0)->setName(output_name.c_str()); + engine_->SetITensor(output_name, output_layer->getOutput(0)); + layer_name += output_name; + if (test_mode) { + engine_->DeclareOutput(output_name); + } + output_layer->setName((layer_name + ")").c_str()); + } +}; + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +REGISTER_TRT_OP_CONVERTER(leaky_relu, LeakyReluOpConverter); diff --git a/paddle/fluid/inference/tensorrt/convert/mul_op.cc b/paddle/fluid/inference/tensorrt/convert/mul_op.cc index 514eb659a8da73b6e56b5d17148ec0cb2aeaa135..5b6aaad49833cedbd8d1ee0ec5d24c7f983190e6 100644 --- a/paddle/fluid/inference/tensorrt/convert/mul_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/mul_op.cc @@ -25,7 +25,7 @@ class MulOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) << "convert a fluid mul op to tensorrt mul layer without bias"; + VLOG(3) << "convert a fluid mul op to tensorrt mul layer without bias"; framework::OpDesc op_desc(op, nullptr); // Declare inputs diff --git a/paddle/fluid/inference/tensorrt/convert/op_converter.h b/paddle/fluid/inference/tensorrt/convert/op_converter.h index d309d94c560f2b484fac6b6cd40cc2704d641069..d61d635ed707bc455d495f2420925a3585234b5c 100644 --- a/paddle/fluid/inference/tensorrt/convert/op_converter.h +++ b/paddle/fluid/inference/tensorrt/convert/op_converter.h @@ -61,7 +61,7 @@ class OpConverter { // TODO(xingzhaolong): all mul, sub, div // static std::unordered_set<std::string> add_weight_op_set {"add", "mul", // "sub", "div"}; - static std::unordered_set<std::string> add_weight_op_set{"add"}; + static std::unordered_set<std::string> add_weight_op_set{"add", "mul"}; PADDLE_ENFORCE_EQ(op_desc.Input("Y").size(), 1UL); int op_type_len = op_desc.Type().size(); std::string op_type = op_desc.Type().substr(op_type_len - 3, op_type_len); diff --git a/paddle/fluid/inference/tensorrt/convert/pad_op.cc b/paddle/fluid/inference/tensorrt/convert/pad_op.cc index
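This converter exists because the TRT version targeted here has no native leaky-ReLU layer, so it composes one from a uniform Scale, a ReLU, a second Scale, and an elementwise kSUM, leaning on the identity leaky(x) = alpha * x + (1 - alpha) * max(x, 0). A few lines of plain C++ to sanity-check that algebra against the reference definition:

#include <algorithm>
#include <cassert>
#include <cmath>

// Reference definition and the scale/relu/scale/sum decomposition above.
float leaky_ref(float x, float alpha) { return x > 0 ? x : alpha * x; }

float leaky_composed(float x, float alpha) {
  float y_scale = alpha * x;                     // first Scale layer
  float y_relu = std::max(x, 0.0f);              // ReLU layer
  float y_scale_relu = (1.0f - alpha) * y_relu;  // second Scale layer
  return y_scale + y_scale_relu;                 // ElementWise kSUM
}

int main() {
  for (float x : {-2.0f, -0.5f, 0.0f, 1.0f, 3.0f}) {
    assert(std::fabs(leaky_ref(x, 0.1f) - leaky_composed(x, 0.1f)) < 1e-6f);
  }
}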
218030a591fcc7e533ef37062265449d4b6044bc..4afcb0aecec9d07b52d2fd701fae8750067a6041 100644 --- a/paddle/fluid/inference/tensorrt/convert/pad_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/pad_op.cc @@ -25,7 +25,7 @@ class PadOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) << "convert a fluid transpose op to tensorrt tranpose layer"; + VLOG(3) << "convert a fluid pad op to a tensorrt pad layer"; framework::OpDesc op_desc(op, nullptr); // Declare inputs diff --git a/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc b/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc index f9bb66a6e9f81a10368db7710108c319860e940a..d700e08590ec5f9a397c3a6de80e0394c0dd4dc5 100644 --- a/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc @@ -13,25 +13,57 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.h" namespace paddle { namespace inference { namespace tensorrt { +void DealCeilMode(const nvinfer1::Dims &input_shape, std::vector<int> ksize, + std::vector<int> strides, std::vector<int> paddings, + nvinfer1::DimsHW *pre_pad, nvinfer1::DimsHW *post_pad, + int input_dims) { + int input_height = input_shape.d[input_dims - 2]; + int input_width = input_shape.d[input_dims - 1]; + int floor_h_output_size = + (input_height - ksize[0] + 2 * paddings[0]) / strides[0] + 1; + int ceil_h_output_size = + (input_height - ksize[0] + 2 * paddings[0] + strides[0] - 1) / + strides[0] + + 1; + + int floor_w_output_size = + (input_width - ksize[1] + 2 * paddings[1]) / strides[1] + 1; + int ceil_w_output_size = + (input_width - ksize[1] + 2 * paddings[1] + strides[1] - 1) / strides[1] + + 1; + if (floor_h_output_size != ceil_h_output_size) { + post_pad->h() = strides[0] - 1; + } + + if (floor_w_output_size != ceil_w_output_size) { + post_pad->w() = strides[1] - 1; + } +} + /* * Pool2dOp, IPoolingLayer in TRT. This Layer doesn't have weights.
*/ class Pool2dOpConverter : public OpConverter { public: - void operator()(const framework::proto::OpDesc& op, - const framework::Scope& scope, bool test_mode) override { - VLOG(4) + void operator()(const framework::proto::OpDesc &op, + const framework::Scope &scope, bool test_mode) override { + VLOG(40) << "convert a fluid pool2d op to tensorrt pool2d layer without bias"; framework::OpDesc op_desc(op, nullptr); // Declare inputs PADDLE_ENFORCE_EQ(op_desc.Input("X").size(), 1); PADDLE_ENFORCE_EQ(op_desc.Output("Out").size(), 1); - auto* input1 = engine_->GetITensor(op_desc.Input("X")[0]); + auto *input1 = engine_->GetITensor(op_desc.Input("X")[0]); + nvinfer1::Dims input_shape = input1->getDimensions(); + int input_dims = input_shape.nbDims; + + PADDLE_ENFORCE_EQ(input_dims, 3UL); bool global_pooling = boost::get(op_desc.GetAttr("global_pooling")); std::string pool_type = @@ -42,18 +74,7 @@ class Pool2dOpConverter : public OpConverter { boost::get>(op_desc.GetAttr("strides")); std::vector paddings = boost::get>(op_desc.GetAttr("paddings")); - - nvinfer1::DimsHW nv_ksize(ksize[0], ksize[1]); - if (global_pooling == true) { - nvinfer1::Dims input_shape = input1->getDimensions(); - int nbDims = input_shape.nbDims; - nv_ksize.d[0] = input_shape.d[nbDims - 2]; - nv_ksize.d[1] = input_shape.d[nbDims - 1]; - } - const nvinfer1::DimsHW nv_strides(strides[0], strides[1]); - const nvinfer1::DimsHW nv_paddings(paddings[0], paddings[1]); - - PADDLE_ENFORCE_EQ(input1->getDimensions().nbDims, 3UL); + bool ceil_mode = boost::get(op_desc.GetAttr("ceil_mode")); nvinfer1::PoolingType nv_pool_type = nvinfer1::PoolingType::kMAX; if (pool_type == "max") { @@ -64,12 +85,63 @@ class Pool2dOpConverter : public OpConverter { PADDLE_THROW("TensorRT unsupported pooling type!"); } - auto* layer = TRT_ENGINE_ADD_LAYER(engine_, Pooling, - *const_cast(input1), - nv_pool_type, nv_ksize); - PADDLE_ENFORCE_NOT_NULL(layer, "pool layer could not be created."); - layer->setStride(nv_strides); - layer->setPadding(nv_paddings); + nvinfer1::DimsHW nv_ksize(ksize[0], ksize[1]); + nvinfer1::DimsHW nv_strides(strides[0], strides[1]); + nvinfer1::DimsHW nv_paddings(paddings[0], paddings[1]); + + nvinfer1::ILayer *layer = nullptr; + + if (global_pooling == true) { + nv_ksize.d[0] = input_shape.d[input_dims - 2]; + nv_ksize.d[1] = input_shape.d[input_dims - 1]; + auto *layer = TRT_ENGINE_ADD_LAYER( + engine_, Pooling, *const_cast(input1), + nv_pool_type, nv_ksize); + PADDLE_ENFORCE_NOT_NULL(layer, "pool layer could not be created."); + auto output_name = op_desc.Output("Out")[0]; + layer->setName(("pool2d (Output: " + output_name + ")").c_str()); + layer->getOutput(0)->setName(output_name.c_str()); + engine_->SetITensor(output_name, layer->getOutput(0)); + if (test_mode) { + engine_->DeclareOutput(output_name); + } + return; + } + + if (pool_type == "max") { + nvinfer1::DimsHW pre_pad(paddings[0], paddings[1]); + nvinfer1::DimsHW post_pad(paddings[0], paddings[1]); + if (ceil_mode) { + // If ceil mode is true, we will pad the appropriate size to the input. 
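DealCeilMode, defined above, only adds trailing padding along a dimension when floor and ceil division actually disagree, so the final partial pooling window is kept instead of dropped. Worked through for the 3x6x7 input that test_pool2d_op.cc later declares (ksize 2, stride 2, padding 0), as a standalone sketch:

#include <iostream>

// Output extent of one spatial dim, floor (default) vs. ceil mode.
int out_floor(int in, int k, int s, int p) { return (in - k + 2 * p) / s + 1; }
int out_ceil(int in, int k, int s, int p) {
  return (in - k + 2 * p + s - 1) / s + 1;
}

int main() {
  // H = 6: (6-2)/2+1 = 3 either way -> no extra padding on height.
  std::cout << out_floor(6, 2, 2, 0) << " " << out_ceil(6, 2, 2, 0) << "\n";
  // W = 7: floor gives 3, ceil gives 4 -> post_pad->w() = s - 1 = 1,
  // which is why the ceil-mode test expects an output of (3, 3, 4).
  std::cout << out_floor(7, 2, 2, 0) << " " << out_ceil(7, 2, 2, 0) << "\n";
}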
+ DealCeilMode(input_shape, ksize, strides, paddings, &pre_pad, &post_pad, + input_dims); + auto *pad_layer = TRT_ENGINE_ADD_LAYER( + engine_, Padding, *const_cast(input1), pre_pad, + post_pad); + PADDLE_ENFORCE_NOT_NULL( + pad_layer, "pad layer in poolOp converter could not be created."); + input1 = pad_layer->getOutput(0); + } + auto *pool_layer = TRT_ENGINE_ADD_LAYER( + engine_, Pooling, *const_cast(input1), + nv_pool_type, nv_ksize); + PADDLE_ENFORCE_NOT_NULL(pool_layer, "pool layer could not be created."); + pool_layer->setStride(nv_strides); + pool_layer->setPadding(nv_paddings); + layer = pool_layer; + } else { + // Average pooling needs to exclude the padding pixels from the average + // mean. + // It is not supported well by TRT, we use a plugin here. + std::vector input_shape_v; + for (int i = 0; i < input_dims; i++) { + input_shape_v.push_back(input_shape.d[i]); + } + plugin::AvgPoolPlugin *plugin = new plugin::AvgPoolPlugin( + ceil_mode, ksize, strides, paddings, input_shape_v); + auto *avg_pool_layer = engine_->AddPlugin(&input1, 1, plugin); + layer = avg_pool_layer; + } auto output_name = op_desc.Output("Out")[0]; layer->setName(("pool2d (Output: " + output_name + ")").c_str()); diff --git a/paddle/fluid/inference/tensorrt/convert/prelu_op.cc b/paddle/fluid/inference/tensorrt/convert/prelu_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..dbdff85ddebc85bc51938a204a48affe485b8240 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/prelu_op.cc @@ -0,0 +1,80 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +/* + * PRelu converter from fluid to tensorRT. 
+ */ +class PReluOpConverter : public OpConverter { + public: + void operator()(const framework::proto::OpDesc& op, + const framework::Scope& scope, bool test_mode) override { + VLOG(4) << "convert fluid prelu op to tensorrt prelu layer"; + + framework::OpDesc op_desc(op, nullptr); + // Declare inputs + int input_num = op_desc.Input("X").size(); + PADDLE_ENFORCE(input_num == 1); + auto* input = engine_->GetITensor(op_desc.Input("X")[0]); + // Get output + size_t output_num = op_desc.Output("Out").size(); + PADDLE_ENFORCE(output_num == 1); + // Get attrs + std::string mode = boost::get<std::string>(op_desc.GetAttr("mode")); + // + auto* alpha_var = scope.FindVar(op_desc.Input("Alpha")[0]); + PADDLE_ENFORCE_NOT_NULL(alpha_var); + auto* alpha_tensor = alpha_var->GetMutable<framework::LoDTensor>(); + + platform::CUDAPlace place; + std::unique_ptr<framework::LoDTensor> alpha_tensor_device( + new framework::LoDTensor()); + alpha_tensor_device->Resize(alpha_tensor->dims()); + TensorCopySync(*alpha_tensor, place, alpha_tensor_device.get()); + float* alpha_data = alpha_tensor_device->mutable_data<float>(place); + + // Transform alpha to TensorRTEngine::Weight + TensorRTEngine::Weight alpha_rt(nvinfer1::DataType::kFLOAT, + static_cast<void*>(alpha_data), + alpha_tensor_device->numel()); + plugin::PReluPlugin* plugin = new plugin::PReluPlugin(alpha_rt, mode); + nvinfer1::IPluginLayer* layer = + engine_->AddPlugin(&input, input_num, plugin); + // keep the alpha tensor alive to avoid releasing its memory + engine_->weight_map[op_desc.Input("Alpha")[0]] = + std::move(alpha_tensor_device); + + std::string layer_name = "prelu (Output: "; + auto output_name = op_desc.Output("Out")[0]; + layer->getOutput(0)->setName(output_name.c_str()); + engine_->SetITensor(output_name, layer->getOutput(0)); + layer_name += output_name; + if (test_mode) { + engine_->DeclareOutput(output_name); + } + layer->setName((layer_name + ")").c_str()); + } +}; + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +REGISTER_TRT_OP_CONVERTER(prelu, PReluOpConverter); diff --git a/paddle/fluid/inference/tensorrt/convert/softmax_op.cc b/paddle/fluid/inference/tensorrt/convert/softmax_op.cc index 0064f90fd7944403c14d4d47616ea82f681ceb74..80bfb2d190a5637032e7c18fbac7f22b3a9e81e1 100644 --- a/paddle/fluid/inference/tensorrt/convert/softmax_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/softmax_op.cc @@ -25,7 +25,7 @@ class SoftMaxOpConverter : public OpConverter { public: void operator()(const framework::proto::OpDesc& op, const framework::Scope& scope, bool test_mode) override { - VLOG(4) + VLOG(3) << "convert a fluid softmax op to tensorrt softmax layer without bias"; framework::OpDesc op_desc(op, nullptr); // Declare inputs diff --git a/paddle/fluid/inference/tensorrt/convert/split_op.cc b/paddle/fluid/inference/tensorrt/convert/split_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..ae5b1b98060a4e73b2d1761d4edafb152f364070 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/split_op.cc @@ -0,0 +1,67 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
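Parking alpha_tensor_device in engine_->weight_map (the leaky_relu converter does the same with its alpha tensor) is a lifetime trick rather than bookkeeping: the plugin keeps a raw pointer into the tensor's storage, so the owning LoDTensor must stay alive until the engine is done with it. The ownership pattern in isolation; ToyEngine and Blob are illustrative stand-ins:

#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Blob {
  std::vector<float> data;
};

// Stand-in for the engine: hands out a raw pointer but keeps the owning
// unique_ptr alive in a map, which is exactly the weight_map trick.
class ToyEngine {
 public:
  const float *Pin(const std::string &name, std::unique_ptr<Blob> blob) {
    const float *raw = blob->data.data();
    weight_map_[name] = std::move(blob);  // owner now outlives the raw pointer
    return raw;
  }

 private:
  std::unordered_map<std::string, std::unique_ptr<Blob>> weight_map_;
};

int main() {
  ToyEngine engine;
  auto alpha = std::make_unique<Blob>();
  alpha->data = {0.25f};
  const float *p = engine.Pin("prelu_out_alpha", std::move(alpha));
  std::cout << *p << "\n";  // still valid: the map owns the storage
}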
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +class SplitOpConverter : public OpConverter { + public: + void operator()(const framework::proto::OpDesc& op, + const framework::Scope& scope, bool test_mode) override { + VLOG(4) << "convert a fluid split op to tensorrt split layer"; + + framework::OpDesc op_desc(op, nullptr); + // Declare inputs + auto* input = engine_->GetITensor(op_desc.Input("X")[0]); + auto input_dims = input->getDimensions(); + int input_num = op_desc.Input("X").size(); + size_t output_num = op_desc.Output("Out").size(); + + // Get Attrs + PADDLE_ENFORCE(input_num == 1); + int axis = boost::get(op_desc.GetAttr("axis")); + std::vector output_lengths = + boost::get>(op_desc.GetAttr("sections")); + // split on batch is not supported in TensorRT + PADDLE_ENFORCE(axis != 0); + axis += (axis < 0) ? input_dims.nbDims : -1; + + PADDLE_ENFORCE(output_lengths.size() == output_num); + plugin::SplitPlugin* plugin = new plugin::SplitPlugin(axis, output_lengths); + nvinfer1::IPluginLayer* layer = + engine_->AddPlugin(&input, input_num, plugin); + + std::string layer_name = "split (Output: "; + for (size_t i = 0; i < output_num; i++) { + auto output_name = op_desc.Output("Out")[i]; + layer->getOutput(i)->setName(output_name.c_str()); + engine_->SetITensor(output_name, layer->getOutput(i)); + layer_name += output_name; + if (test_mode) { + engine_->DeclareOutput(output_name); + } + } + layer->setName((layer_name + ")").c_str()); + } +}; + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +REGISTER_TRT_OP_CONVERTER(split, SplitOpConverter); diff --git a/paddle/fluid/inference/tensorrt/convert/test_conv2d_op.cc b/paddle/fluid/inference/tensorrt/convert/test_conv2d_op.cc index f8711c6b60d74639529624c25429bc245de46479..95916746d6fcb528d26a8f8bb39980b55c4f3704 100644 --- a/paddle/fluid/inference/tensorrt/convert/test_conv2d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/test_conv2d_op.cc @@ -16,6 +16,9 @@ limitations under the License. 
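The one subtle line in SplitOpConverter above is the axis adjustment: fluid axes count the batch dimension while TRT ITensor dims do not, so a positive fluid axis shifts down by one and a negative axis is resolved against the TRT rank. A tiny self-contained version of that remapping:

#include <cassert>

// Map a fluid axis (batch included) to a TRT axis (batch excluded).
// Mirrors: axis += (axis < 0) ? input_dims.nbDims : -1;
int FluidAxisToTrt(int axis, int trt_nb_dims) {
  assert(axis != 0);  // splitting on the batch dim is unsupported
  return axis + ((axis < 0) ? trt_nb_dims : -1);
}

int main() {
  assert(FluidAxisToTrt(1, 3) == 0);   // fluid axis 1 (C) -> TRT axis 0
  assert(FluidAxisToTrt(3, 3) == 2);   // fluid axis 3 (W) -> TRT axis 2
  assert(FluidAxisToTrt(-1, 3) == 2);  // negative axis counts from the end
}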
*/ #include "paddle/fluid/inference/tensorrt/convert/op_converter.h" #include "paddle/fluid/inference/tensorrt/convert/ut_helper.h" +USE_OP(conv2d); +USE_OP(conv2d_transpose); + namespace paddle { namespace inference { namespace tensorrt { @@ -51,7 +54,37 @@ TEST(conv2d_op, test) { validator.Execute(3); } +TEST(conv2d_transpose_op, test) { + std::unordered_set parameters({"deconv2d-Y"}); + framework::Scope scope; + TRTConvertValidation validator(5, parameters, scope, 1 << 15); + + validator.DeclInputVar("deconv2d-X", nvinfer1::Dims3(3, 5, 5)); + validator.DeclParamVar("deconv2d-Y", nvinfer1::Dims4(3, 2, 3, 3)); + validator.DeclOutputVar("deconv2d-Out", nvinfer1::Dims3(2, 5, 5)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("conv2d_transpose"); + desc.SetInput("Input", {"deconv2d-X"}); + desc.SetInput("Filter", {"deconv2d-Y"}); + desc.SetOutput("Output", {"deconv2d-Out"}); + + const std::vector strides({1, 1}); + const std::vector paddings({1, 1}); + const std::vector dilations({1, 1}); + const int groups = 1; + + desc.SetAttr("strides", strides); + desc.SetAttr("paddings", paddings); + desc.SetAttr("dilations", dilations); + desc.SetAttr("groups", groups); + + validator.SetOp(*desc.Proto()); + + validator.Execute(3); +} + } // namespace tensorrt } // namespace inference } // namespace paddle -USE_OP(conv2d); diff --git a/paddle/fluid/inference/tensorrt/convert/test_elementwise_op.cc b/paddle/fluid/inference/tensorrt/convert/test_elementwise_op.cc index 7537d02a35b66a41c158cd8eb1b1e5d4107e7d84..cc967464a5f29151a061e99cda6870f9f370ec1b 100644 --- a/paddle/fluid/inference/tensorrt/convert/test_elementwise_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/test_elementwise_op.cc @@ -20,13 +20,12 @@ namespace paddle { namespace inference { namespace tensorrt { -TEST(elementwise_op, add_weight_test) { +TEST(elementwise_op, add_weight) { std::unordered_set parameters({"elementwise_add-Y"}); framework::Scope scope; TRTConvertValidation validator(10, parameters, scope, 1 << 15); validator.DeclInputVar("elementwise_add-X", nvinfer1::DimsCHW(10, 3, 3)); validator.DeclParamVar("elementwise_add-Y", nvinfer1::Dims3(10, 1, 1)); - // validator.DeclParamVar("mul-Y", nvinfer1::Dims2(8, 2)); validator.DeclOutputVar("elementwise_add-Out", nvinfer1::DimsCHW(10, 3, 3)); // Prepare Op description @@ -44,30 +43,65 @@ TEST(elementwise_op, add_weight_test) { validator.Execute(8); } -TEST(elementwise_op, add_tensor_test) { - std::unordered_set parameters; - framework::Scope scope; - TRTConvertValidation validator(8, parameters, scope, 1 << 15); - validator.DeclInputVar("elementwise_add-X", nvinfer1::DimsCHW(10, 3, 3)); - validator.DeclInputVar("elementwise_add-Y", nvinfer1::Dims3(10, 3, 3)); - // validator.DeclParamVar("mul-Y", nvinfer1::Dims2(8, 2)); - validator.DeclOutputVar("elementwise_add-Out", nvinfer1::DimsCHW(10, 3, 3)); - - // Prepare Op description - framework::OpDesc desc; - desc.SetType("elementwise_add"); - desc.SetInput("X", {"elementwise_add-X"}); - desc.SetInput("Y", {"elementwise_add-Y"}); - desc.SetOutput("Out", {"elementwise_add-Out"}); - - // the defalut axis of elementwise op is -1 - - validator.SetOp(*desc.Proto()); +TEST(elementwise_op, native) { + for (std::string type : {"add", "mul"}) { + int batch_size = 8; + std::unordered_set parameters; + framework::Scope scope; + TRTConvertValidation validator(batch_size, parameters, scope, 1 << 15); + validator.DeclInputVar("elementwise_" + type + "-X", + nvinfer1::DimsCHW(10, 3, 3)); + validator.DeclInputVar("elementwise_" 
+ type + "-Y", + nvinfer1::Dims3(10, 3, 3)); + validator.DeclOutputVar("elementwise_" + type + "-Out", + nvinfer1::DimsCHW(10, 3, 3)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("elementwise_" + type); + desc.SetInput("X", {"elementwise_" + type + "-X"}); + desc.SetInput("Y", {"elementwise_" + type + "-Y"}); + desc.SetOutput("Out", {"elementwise_" + type + "-Out"}); + + int axis = -1; + desc.SetAttr("axis", axis); + + validator.SetOp(*desc.Proto()); + validator.Execute(batch_size); + } +} - validator.Execute(8); +TEST(elementwise_op, plugin) { + for (std::string type : {"add", "mul"}) { + int batch_size = 8; + std::unordered_set parameters; + framework::Scope scope; + TRTConvertValidation validator(batch_size, parameters, scope, 1 << 15); + validator.DeclInputVar("elementwise_" + type + "-X", + nvinfer1::DimsCHW(10, 3, 3)); + validator.DeclInputVar("elementwise_" + type + "-Y", + nvinfer1::Dims3(10, 1, 1)); + validator.DeclOutputVar("elementwise_" + type + "-Out", + nvinfer1::DimsCHW(10, 3, 3)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("elementwise_" + type); + desc.SetInput("X", {"elementwise_" + type + "-X"}); + desc.SetInput("Y", {"elementwise_" + type + "-Y"}); + desc.SetOutput("Out", {"elementwise_" + type + "-Out"}); + + int axis = -1; + desc.SetAttr("axis", axis); + + validator.SetOp(*desc.Proto()); + validator.Execute(batch_size); + } } } // namespace tensorrt } // namespace inference } // namespace paddle + USE_OP(elementwise_add); +USE_OP(elementwise_mul); diff --git a/paddle/fluid/inference/tensorrt/convert/test_leaky_relu_op.cc b/paddle/fluid/inference/tensorrt/convert/test_leaky_relu_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..d00826af075159004d3727a7519e7c319dbddb02 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/test_leaky_relu_op.cc @@ -0,0 +1,48 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +TEST(leaky_relu_op, test_leaky_relu) { + std::unordered_set parameters; + framework::Scope scope; + TRTConvertValidation validator(10, parameters, scope, 1000); + validator.DeclInputVar("leaky_relu_input", nvinfer1::DimsCHW(3, 2, 2)); + validator.DeclOutputVar("leaky_relu_out", nvinfer1::DimsCHW(3, 2, 2)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("leaky_relu"); + desc.SetInput("X", {"leaky_relu_input"}); + desc.SetOutput("Out", {"leaky_relu_out"}); + + desc.SetAttr("alpha", 0.1f); + + validator.SetOp(*desc.Proto()); + + validator.Execute(1); +} + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +// USE_OP(leaky_relu); +USE_OP(leaky_relu); diff --git a/paddle/fluid/inference/tensorrt/convert/test_mul_op.cc b/paddle/fluid/inference/tensorrt/convert/test_mul_op.cc index 3d34cd7d5d0deca4d83a3f5b5ed0fb396c6acd56..282f53559aa75b2c7c252450e392e1996f9b1d81 100644 --- a/paddle/fluid/inference/tensorrt/convert/test_mul_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/test_mul_op.cc @@ -1,16 +1,16 @@ /* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ #include #include "paddle/fluid/framework/op_registry.h" diff --git a/paddle/fluid/inference/tensorrt/convert/test_pool2d_op.cc b/paddle/fluid/inference/tensorrt/convert/test_pool2d_op.cc index aedd6b62df040eeee4e48f628128511cd8bf4439..bded833505cd25352adc4123de415613d1fc926d 100644 --- a/paddle/fluid/inference/tensorrt/convert/test_pool2d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/test_pool2d_op.cc @@ -20,18 +20,21 @@ namespace paddle { namespace inference { namespace tensorrt { -void test_pool2d(bool global_pooling) { +void test_pool2d(bool global_pooling, bool ceil_mode, + std::string pool_type = "max") { framework::Scope scope; std::unordered_set parameters; TRTConvertValidation validator(5, parameters, scope, 1 << 15); // The ITensor's Dims should not contain the batch size. // So, the ITensor's Dims of input and output should be C * H * W. 
- validator.DeclInputVar("pool2d-X", nvinfer1::Dims3(3, 4, 4)); + validator.DeclInputVar("pool2d-X", nvinfer1::Dims3(3, 6, 7)); if (global_pooling) validator.DeclOutputVar("pool2d-Out", nvinfer1::Dims3(3, 1, 1)); + else if (ceil_mode) + validator.DeclOutputVar("pool2d-Out", nvinfer1::Dims3(3, 3, 4)); else - validator.DeclOutputVar("pool2d-Out", nvinfer1::Dims3(3, 2, 2)); + validator.DeclOutputVar("pool2d-Out", nvinfer1::Dims3(3, 3, 3)); // Prepare Op description framework::OpDesc desc; @@ -42,13 +45,14 @@ void test_pool2d(bool global_pooling) { std::vector ksize({2, 2}); std::vector strides({2, 2}); std::vector paddings({0, 0}); - std::string pooling_t = "max"; + std::string pooling_t = pool_type; desc.SetAttr("pooling_type", pooling_t); desc.SetAttr("ksize", ksize); desc.SetAttr("strides", strides); desc.SetAttr("paddings", paddings); desc.SetAttr("global_pooling", global_pooling); + desc.SetAttr("ceil_mode", ceil_mode); LOG(INFO) << "set OP"; validator.SetOp(*desc.Proto()); @@ -57,9 +61,11 @@ void test_pool2d(bool global_pooling) { validator.Execute(3); } -TEST(Pool2dOpConverter, normal) { test_pool2d(false); } +TEST(Pool2dOpConverter, normal) { test_pool2d(false, false); } +TEST(Pool2dOpConverter, test_global_pooling) { test_pool2d(true, false); } -TEST(Pool2dOpConverter, test_global_pooling) { test_pool2d(true); } +TEST(Pool2dOpConverter, max_ceil_test) { test_pool2d(false, true); } +TEST(Pool2dOpConverter, avg_ceil_test) { test_pool2d(false, true, "avg"); } } // namespace tensorrt } // namespace inference diff --git a/paddle/fluid/inference/tensorrt/convert/test_prelu_op.cc b/paddle/fluid/inference/tensorrt/convert/test_prelu_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..453f222f1f1e3f3b9ee8fa7bd49f4cab2286e7ea --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/test_prelu_op.cc @@ -0,0 +1,94 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +TEST(prelu_op, test_channel_wise) { + std::unordered_set parameters({"prelu_alpha"}); + framework::Scope scope; + TRTConvertValidation validator(10, parameters, scope, 1000); + validator.DeclInputVar("prelu_input", nvinfer1::DimsCHW(3, 2, 2)); + validator.DeclParamVar("prelu_alpha", nvinfer1::Dims3(3, 1, 1)); + validator.DeclOutputVar("prelu_out", nvinfer1::DimsCHW(3, 2, 2)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("prelu"); + desc.SetInput("X", {"prelu_input"}); + desc.SetInput("Alpha", {"prelu_alpha"}); + desc.SetOutput("Out", {"prelu_out"}); + + desc.SetAttr("mode", std::string("channel")); + + validator.SetOp(*desc.Proto()); + + validator.Execute(1); +} + +TEST(prelu_op, test_element_wise) { + std::unordered_set parameters({"prelu_alpha"}); + framework::Scope scope; + TRTConvertValidation validator(10, parameters, scope, 1000); + validator.DeclInputVar("prelu_input", nvinfer1::DimsCHW(3, 2, 2)); + validator.DeclParamVar("prelu_alpha", nvinfer1::Dims4(10, 3, 2, 2)); + validator.DeclOutputVar("prelu_out", nvinfer1::DimsCHW(3, 2, 2)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("prelu"); + desc.SetInput("X", {"prelu_input"}); + desc.SetInput("Alpha", {"prelu_alpha"}); + desc.SetOutput("Out", {"prelu_out"}); + + desc.SetAttr("mode", std::string("element")); + + validator.SetOp(*desc.Proto()); + + validator.Execute(1); +} + +TEST(prelu_op, test_scalar) { + std::unordered_set parameters({"prelu_alpha"}); + framework::Scope scope; + TRTConvertValidation validator(10, parameters, scope, 1000); + validator.DeclInputVar("prelu_input", nvinfer1::DimsCHW(3, 2, 2)); + validator.DeclParamVar("prelu_alpha", nvinfer1::Dims3(1, 1, 1)); + validator.DeclOutputVar("prelu_out", nvinfer1::DimsCHW(3, 2, 2)); + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("prelu"); + desc.SetInput("X", {"prelu_input"}); + desc.SetInput("Alpha", {"prelu_alpha"}); + desc.SetOutput("Out", {"prelu_out"}); + + desc.SetAttr("mode", std::string("all")); + + validator.SetOp(*desc.Proto()); + + validator.Execute(1); +} + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +// USE_OP(prelu); +USE_CPU_ONLY_OP(prelu); diff --git a/paddle/fluid/inference/tensorrt/convert/test_split_op.cc b/paddle/fluid/inference/tensorrt/convert/test_split_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..5aacc5c600dd1371e3865adc888bb8e24640e7d9 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/convert/test_split_op.cc @@ -0,0 +1,115 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include "paddle/fluid/inference/tensorrt/convert/op_converter.h" +#include "paddle/fluid/inference/tensorrt/convert/ut_helper.h" + +namespace paddle { +namespace inference { +namespace tensorrt { + +template +void TensorRTSplitTest(const std::vector &in_shape, + const std::vector §ions) { + std::unordered_set parameters({""}); + framework::Scope scope; + TRTConvertValidation validator(BatchSize + 1, parameters, scope, 10000); + + auto make_dim = [](const std::vector &shape) { + nvinfer1::DimsCHW dim; + dim.c() = shape[0]; + dim.h() = shape[1]; + dim.w() = shape[2]; + return dim; + }; + validator.DeclInputVar("split_input", make_dim(in_shape)); + std::vector output_vars; + for (size_t i = 0; i < sections.size(); ++i) { + auto out_shape = in_shape; + out_shape[Axis - 1] = sections[i]; + std::string output_name = "split_out" + std::to_string(i); + validator.DeclOutputVar(output_name, make_dim(out_shape)); + output_vars.push_back(output_name); + } + + // Prepare Op description + framework::OpDesc desc; + desc.SetType("split"); + desc.SetInput("X", {"split_input"}); + desc.SetOutput("Out", output_vars); + + desc.SetAttr("axis", Axis); + desc.SetAttr("num", 0); + desc.SetAttr("sections", sections); + + validator.SetOp(*desc.Proto()); + + validator.Execute(BatchSize); +} + +// batch = 0, axis = 1, same shape +TEST(split_op, test_same_shape_axis1_batch1) { + TensorRTSplitTest<1, 1>({4, 2, 2}, {2, 2}); +} +// batch = 0, axis = 1, different shape +TEST(split_op, test_different_shape_axis1_batch1) { + TensorRTSplitTest<1, 1>({3, 2, 2}, {2, 1}); +} +// batch = 10, axis = 1, same shape +TEST(split_op, test_same_shape_axis1_batch10) { + TensorRTSplitTest<10, 1>({4, 2, 2}, {2, 2}); +} +// batch = 10, axis = 1, different shape +TEST(split_op, test_different_shape_axis1_batch10) { + TensorRTSplitTest<10, 1>({3, 2, 2}, {2, 1}); +} +// batch = 0, axis = 2, same shape +TEST(split_op, test_same_shape_axis2_batch1) { + TensorRTSplitTest<1, 2>({3, 4, 2}, {2, 2}); +} +// batch = 0, axis = 2, different shape +TEST(split_op, test_different_shape_axis2_batch1) { + TensorRTSplitTest<1, 2>({3, 3, 2}, {2, 1}); +} +// batch = 10, axis = 2, same shape +TEST(split_op, test_same_shape_axis2_batch10) { + TensorRTSplitTest<10, 2>({3, 4, 2}, {2, 2}); +} +// batch = 10, axis = 2, different shape +TEST(split_op, test_different_shape_axis2_batch10) { + TensorRTSplitTest<10, 2>({3, 3, 2}, {2, 1}); +} +// batch = 0, axis = 3, same shape +TEST(split_op, test_same_shape_axis3_batch1) { + TensorRTSplitTest<1, 3>({3, 2, 4}, {2, 2}); +} +// batch = 0, axis = 3, different shape +TEST(split_op, test_different_shape_axis3_batch1) { + TensorRTSplitTest<1, 3>({3, 2, 3}, {2, 1}); +} +// batch = 10, axis = 3, same shape +TEST(split_op, test_same_shape_axis3_batch10) { + TensorRTSplitTest<10, 3>({3, 2, 4}, {2, 2}); +} +// batch = 10, axis = 3, different shape +TEST(split_op, test_different_shape_axis3_batch10) { + TensorRTSplitTest<10, 3>({3, 2, 3}, {2, 1}); +} + +} // namespace tensorrt +} // namespace inference +} // namespace paddle + +USE_OP(split); diff --git a/paddle/fluid/inference/tensorrt/convert/ut_helper.h b/paddle/fluid/inference/tensorrt/convert/ut_helper.h index 0a6f171fc40a838fd81d6a51aca0430d5526f188..f313beb73bb0d21cab1d62859a46fcc76a373548 100644 --- a/paddle/fluid/inference/tensorrt/convert/ut_helper.h +++ b/paddle/fluid/inference/tensorrt/convert/ut_helper.h @@ -4,7 +4,7 @@ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. 
You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, diff --git a/paddle/fluid/inference/tensorrt/engine.cc b/paddle/fluid/inference/tensorrt/engine.cc index 14e9e14d33d637ee68e37593cc48721e5169499f..f739752cbc44805cb0fb3246385609cf16ba744a 100644 --- a/paddle/fluid/inference/tensorrt/engine.cc +++ b/paddle/fluid/inference/tensorrt/engine.cc @@ -61,6 +61,7 @@ TensorRTEngine::~TensorRTEngine() { } void TensorRTEngine::FreezeNetwork() { + VLOG(3) << "TRT to freeze network"; freshDeviceId(); PADDLE_ENFORCE(infer_builder_ != nullptr, "Call InitNetwork first to initialize network."); @@ -133,6 +134,10 @@ void TensorRTEngine::DeclareOutput(const nvinfer1::ILayer *layer, int offset, buffer_sizes_[name] = 0; } +bool TensorRTEngine::HasDeclared(const std::string &name) { + return buffer_sizes_.count(name) > 0; +} + void TensorRTEngine::DeclareOutput(const std::string &name) { PADDLE_ENFORCE_EQ(0, buffer_sizes_.count(name), "duplicate output name %s", name); @@ -195,7 +200,8 @@ void TensorRTEngine::GetOutputInCPU(const std::string &name, void *dst, Buffer &TensorRTEngine::buffer(const std::string &name) { PADDLE_ENFORCE(infer_engine_ != nullptr, "call FreezeNetwork first."); auto it = buffer_sizes_.find(name); - PADDLE_ENFORCE(it != buffer_sizes_.end()); + PADDLE_ENFORCE(it != buffer_sizes_.end(), "tried to access buffer named %s", + name); auto slot_offset = infer_engine_->getBindingIndex(name.c_str()); return buffers_[slot_offset]; } @@ -250,6 +256,13 @@ void TensorRTEngine::freshDeviceId() { cudaSetDevice(device_); } +nvinfer1::IPluginLayer *TensorRTEngine::AddPlugin( + nvinfer1::ITensor *const *inputs, int num_inputs, + plugin::PluginTensorRT *plugin) { + owned_plugin_.emplace_back(plugin); + return infer_network_.get()->addPluginExt(inputs, num_inputs, *plugin); +} + } // namespace tensorrt } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/engine.h b/paddle/fluid/inference/tensorrt/engine.h index bd3ba4cea6551a7f6651e311e2649de191a6faa1..f5b2c28ba9e6fefc1d6c14640d696c3bf3ac8249 100644 --- a/paddle/fluid/inference/tensorrt/engine.h +++ b/paddle/fluid/inference/tensorrt/engine.h @@ -22,6 +22,7 @@ limitations under the License. */ #include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/inference/engine.h" #include "paddle/fluid/inference/tensorrt/helper.h" +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" #include "paddle/fluid/inference/utils/singleton.h" namespace paddle { @@ -39,6 +40,7 @@ class TensorRTEngine : public EngineBase { // Weight is model parameter. class Weight { public: + Weight() = default; Weight(nvinfer1::DataType dtype, void* value, size_t num_elem) { w_.type = dtype; w_.values = value; @@ -91,6 +93,8 @@ class TensorRTEngine : public EngineBase { const std::string& name); // Set the itensor_map_[name] as the network's output, and set its name. void DeclareOutput(const std::string& name); + // Check if the ITensor has been declared + bool HasDeclared(const std::string& name); // GPU memory address for an ITensor with specific name. 
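The new HasDeclared gives callers a safe probe before DeclareOutput, which still hard-fails on duplicate names; both lean on buffer_sizes_ doubling as a declaration registry. The contract in miniature, with ToyEngine standing in for TensorRTEngine:

#include <cassert>
#include <string>
#include <unordered_map>

// Declaration registry in the spirit of buffer_sizes_: DeclareOutput
// rejects duplicates, HasDeclared lets callers probe first.
class ToyEngine {
 public:
  bool HasDeclared(const std::string &name) const {
    return buffer_sizes_.count(name) > 0;
  }
  void DeclareOutput(const std::string &name) {
    assert(!HasDeclared(name) && "duplicate output name");
    buffer_sizes_[name] = 0;  // actual size is filled in at freeze time
  }

 private:
  std::unordered_map<std::string, size_t> buffer_sizes_;
};

int main() {
  ToyEngine engine;
  if (!engine.HasDeclared("prob")) engine.DeclareOutput("prob");
  assert(engine.HasDeclared("prob"));
}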
One can operate on
// this memory directly for acceleration, for example, output the converted
@@ -123,6 +127,8 @@ class TensorRTEngine : public EngineBase {
   void SetRuntimeBatch(size_t batch_size);
   int GetRuntimeBatch();
   int GetDevice() { return device_; }
+  nvinfer1::IPluginLayer* AddPlugin(nvinfer1::ITensor* const* inputs,
+                                    int num_inputs, plugin::PluginTensorRT*);
 
   // A pointer to CPU memory is needed for the TRT weight.
   // Before TRT runs, fluid loads weight into GPU storage.
@@ -132,6 +138,16 @@ class TensorRTEngine : public EngineBase {
   std::unordered_map<std::string, std::unique_ptr<framework::Tensor>>
       weight_map;
 
+  // TODO(NHZLX)
+  // Paddle-TRT currently hits a bug when running GoogleNet: when more than
+  // two 1x1 convolutions share the same input, TensorRT's layer-merging
+  // optimization fuses them into a single convolution, which triggers the
+  // bug. As a workaround we avoid this optimization for the time being;
+  // the bug will be fixed in the future.
+  std::unordered_map<std::string, int> itensor_quote_num;
+
  private:
   // the max batch size
   int max_batch_;
@@ -152,8 +168,10 @@ class TensorRTEngine : public EngineBase {
   std::unordered_map<std::string /*name*/, size_t /*max size*/> buffer_sizes_;
   std::unordered_map<std::string /*name*/, nvinfer1::ITensor* /*ITensor*/>
       itensor_map_;
 
+  // The specific GPU id that the TensorRTEngine is bound to.
   int device_;
+  std::vector<std::unique_ptr<plugin::PluginTensorRT>> owned_plugin_;
 
   // TensorRT related internal members
   template <typename T>
diff --git a/paddle/fluid/inference/tensorrt/helper.h b/paddle/fluid/inference/tensorrt/helper.h
index b6e7968108403c9c9c192759c44eac040d1c5073..fc7ca7714e9325d2b6bce6189300aa339c81c2ba 100644
--- a/paddle/fluid/inference/tensorrt/helper.h
+++ b/paddle/fluid/inference/tensorrt/helper.h
@@ -52,7 +52,7 @@ class NaiveLogger : public nvinfer1::ILogger {
   void log(nvinfer1::ILogger::Severity severity, const char* msg) override {
     switch (severity) {
       case Severity::kINFO:
-        LOG(INFO) << msg;
+        VLOG(3) << msg;
         break;
       case Severity::kWARNING:
         LOG(WARNING) << msg;
diff --git a/paddle/fluid/inference/tensorrt/plugin/CMakeLists.txt b/paddle/fluid/inference/tensorrt/plugin/CMakeLists.txt
new file mode 100644
index 0000000000000000000000000000000000000000..e822785ad6f4f6f67b72141f3e7b04aefa72e58b
--- /dev/null
+++ b/paddle/fluid/inference/tensorrt/plugin/CMakeLists.txt
@@ -0,0 +1,4 @@
+nv_library(tensorrt_plugin
+  SRCS trt_plugin.cc split_op_plugin.cu elementwise_op_plugin.cu prelu_op_plugin.cu
+  avg_pool_op_plugin.cu
+  DEPS enforce tensorrt_engine)
diff --git a/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.cu b/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.cu
new file mode 100644
index 0000000000000000000000000000000000000000..5d747af8c55d71fee90ee0cc06fd328e583f3700
--- /dev/null
+++ b/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.cu
@@ -0,0 +1,64 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
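The AvgPoolPlugin constructor further down in this file derives the plugin's output height and width from the kernel size, strides, and paddings, in either floor or ceil mode. A standalone sketch of that arithmetic (the concrete numbers are illustrative and not taken from this PR):

// Output-size arithmetic used by pooling-style plugins: floor vs. ceil mode.
#include <cstdio>

static int PoolOutSize(int in, int k, int pad, int stride, bool ceil_mode) {
  // floor: (in - k + 2*pad) / stride + 1
  // ceil : (in - k + 2*pad + stride - 1) / stride + 1
  int numer = in - k + 2 * pad + (ceil_mode ? stride - 1 : 0);
  return numer / stride + 1;
}

int main() {
  // e.g. a 7x7 feature map, 2x2 window, stride 2, no padding:
  std::printf("floor: %d\n", PoolOutSize(7, 2, 0, 2, false));  // 3
  std::printf("ceil : %d\n", PoolOutSize(7, 2, 0, 2, true));   // 4
  return 0;
}

Ceil mode keeps the partial window at the border, so it can produce one extra output row or column compared to floor mode.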
+ +#include "paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.h" +#include "paddle/fluid/operators/math/pooling.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +nvinfer1::Dims AvgPoolPlugin::getOutputDimensions( + int index, const nvinfer1::Dims* inputDims, int nbInputs) { + assert(nbInputs == 1); + assert(index == 0); + assert(inputDims[0].nbDims == 3); + nvinfer1::Dims const& input_dims = inputDims[0]; + + nvinfer1::Dims output_dims = input_dims; + + output_dims.d[1] = output_shape_[1]; + output_dims.d[2] = output_shape_[2]; + return output_dims; +} + +int AvgPoolPlugin::enqueue(int batchSize, const void* const* inputs, + void** outputs, void* workspace, + cudaStream_t stream) { + auto const& input_dims = this->getInputDims(0); + int input_size = 0; + float const* idata = reinterpret_cast(inputs[0]); + float** odatas = reinterpret_cast(outputs); + + paddle::operators::math::AvgPool pool_process; + paddle::operators::math::Pool2dDirectCUDAFunctor< + paddle::operators::math::AvgPool, float> + pool2d_forward; + + std::vector input_shape = input_shape_; + std::vector output_shape = output_shape_; + input_shape.insert(input_shape.begin(), batchSize); + output_shape.insert(output_shape.begin(), batchSize); + + pool2d_forward(idata, input_shape, output_shape, ksize_, strides_, paddings_, + pool_process, true, odatas[0], stream); + + return cudaGetLastError() != cudaSuccess; +} + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.h b/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.h new file mode 100644 index 0000000000000000000000000000000000000000..b5e4ece0fba446627d619df6fe225e8c07231487 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/avg_pool_op_plugin.h @@ -0,0 +1,111 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +class AvgPoolPlugin : public PluginTensorRT { + private: + bool ceil_mode_; + std::vector ksize_; + std::vector strides_; + std::vector paddings_; + std::vector input_shape_; + std::vector output_shape_; + + protected: + size_t getSerializationSize() override { + return SerializedSize(ceil_mode_) + SerializedSize(ksize_) + + SerializedSize(strides_) + SerializedSize(paddings_) + + SerializedSize(input_shape_) + getBaseSerializationSize(); + } + + // TRT will call this func when we need to serialize the configuration of + // tensorrt. + // It should not be called by users. 
+ void serialize(void *buffer) override { + serializeBase(buffer); + SerializeValue(&buffer, ceil_mode_); + SerializeValue(&buffer, ksize_); + SerializeValue(&buffer, strides_); + SerializeValue(&buffer, paddings_); + SerializeValue(&buffer, input_shape_); + } + + public: + AvgPoolPlugin(bool ceil_mode, std::vector ksize, + std::vector strides, std::vector paddings, + std::vector input_shape) + : ceil_mode_(ceil_mode), + ksize_(ksize), + strides_(strides), + paddings_(paddings), + input_shape_(input_shape) { + int output_h, output_w; + output_shape_ = input_shape_; + if (!ceil_mode_) { + output_h = + (input_shape[1] - ksize_[0] + 2 * paddings_[0]) / strides_[0] + 1; + output_w = + (input_shape[2] - ksize_[1] + 2 * paddings_[1]) / strides_[1] + 1; + } else { + output_h = + (input_shape[1] - ksize_[0] + 2 * paddings_[0] + strides_[0] - 1) / + strides_[0] + + 1; + output_w = + (input_shape[2] - ksize_[1] + 2 * paddings_[1] + strides_[1] - 1) / + strides_[1] + + 1; + } + output_shape_[1] = output_h; + output_shape_[2] = output_w; + } + + // It was used for tensorrt deserialization. + // It should not be called by users. + AvgPoolPlugin(void const *serialData, size_t serialLength) { + deserializeBase(serialData, serialLength); + DeserializeValue(&serialData, &serialLength, &ceil_mode_); + DeserializeValue(&serialData, &serialLength, &ksize_); + DeserializeValue(&serialData, &serialLength, &strides_); + DeserializeValue(&serialData, &serialLength, &paddings_); + DeserializeValue(&serialData, &serialLength, &input_shape_); + } + + AvgPoolPlugin *clone() const override { + return new AvgPoolPlugin(ceil_mode_, ksize_, strides_, paddings_, + input_shape_); + } + + const char *getPluginType() const override { return "avg_pool"; } + int getNbOutputs() const override { return 1; } + nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims *inputs, + int nbInputDims) override; + int initialize() override { return 0; } + int enqueue(int batchSize, const void *const *inputs, void **outputs, + void *workspace, cudaStream_t stream) override; +}; + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.cu b/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.cu new file mode 100644 index 0000000000000000000000000000000000000000..9cd9026b7328083389b5af484bbb15c07b4908b0 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.cu @@ -0,0 +1,138 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/
+
+#include
+#include "paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.h"
+
+namespace paddle {
+namespace inference {
+namespace tensorrt {
+namespace plugin {
+
+namespace details {
+
+template <typename T>
+struct Add {
+  __device__ T operator()(const T& a, const T& b) const { return a + b; }
+};
+
+template <typename T>
+struct Mul {
+  __device__ T operator()(const T& a, const T& b) const { return a * b; }
+};
+
+template <typename T, typename Operator>
+__global__ void ColumnWiseKernel(Operator op, const T* x, const T* y, T* out,
+                                 int batch_size, int num_rows, int num_cols) {
+  for (int batch_id = 0; batch_id < batch_size; ++batch_id) {
+    int row = blockIdx.x;
+    for (; row < num_rows; row += gridDim.x) {
+      T value_y = y[batch_id * num_rows + row];
+      int col = threadIdx.x;
+      int offset = (batch_id * num_rows + row) * num_cols;
+      for (; col < num_cols; col += blockDim.x) {
+        T value_x = x[offset + col];
+        out[offset + col] = op(value_x, value_y);
+      }
+    }
+  }
+}
+
+template <typename T, typename Operator>
+static void ElementWise(Operator op, const T* x, const T* y, T* out,
+                        int batch_size, int prev, int midd, int post,
+                        cudaStream_t stream) {
+  const int kThreadsPerBlock = 1024;
+  const int kMaximumBlocks = 65535;
+  if (prev == 1) {
+    int num_threads = (post > kThreadsPerBlock) ? kThreadsPerBlock
+                                                : (((post + 31) >> 5) << 5);
+    int num_blocks = (midd < kMaximumBlocks) ? midd : kMaximumBlocks;
+    ColumnWiseKernel<<<num_blocks, num_threads, 0, stream>>>(
+        op, x, y, out, batch_size, midd, post);
+  } else if (post == 1) {
+    PADDLE_THROW("Not implemented.");
+  } else {
+    PADDLE_THROW("Not implemented.");
+  }
+}
+
+}  // namespace details
+
+nvinfer1::Dims ElementWisePlugin::getOutputDimensions(
+    int index, const nvinfer1::Dims* input_dims, int num_inputs) {
+  PADDLE_ENFORCE_EQ(index, 0);
+  PADDLE_ENFORCE_EQ(num_inputs, 2);
+  PADDLE_ENFORCE_NOT_NULL(input_dims);
+  return input_dims[0];
+}
+
+int ElementWisePlugin::initialize() {
+  PADDLE_ENFORCE_GT(dims_y_.nbDims, 0);
+
+  axis_ = (axis_ == -1) ?
dims_x_.nbDims - dims_y_.nbDims : axis_; + int trimed_nb_dims = dims_y_.nbDims; + for (; trimed_nb_dims > 0; --trimed_nb_dims) { + if (dims_y_.d[trimed_nb_dims - 1] != 1) { + break; + } + } + dims_y_.nbDims = trimed_nb_dims; + + PADDLE_ENFORCE_GE(dims_x_.nbDims, dims_y_.nbDims + axis_); + PADDLE_ENFORCE_LT(axis_, dims_x_.nbDims); + + prev_size_ = 1; + midd_size_ = 1; + post_size_ = 1; + for (int i = 0; i < axis_; ++i) { + prev_size_ *= dims_x_.d[i]; + } + + for (int i = 0; i < dims_y_.nbDims; ++i) { + PADDLE_ENFORCE_EQ(dims_x_.d[i + axis_], dims_y_.d[i], + "Broadcast dimension mismatch."); + midd_size_ *= dims_y_.d[i]; + } + + for (int i = axis_ + dims_y_.nbDims; i < dims_x_.nbDims; ++i) { + post_size_ *= dims_x_.d[i]; + } + return 0; +} + +int ElementWisePlugin::enqueue(int batch_size, const void* const* inputs, + void** outputs, void* workspace, + cudaStream_t stream) { + const float* x = reinterpret_cast(inputs[0]); + const float* y = reinterpret_cast(inputs[1]); + float* out = reinterpret_cast(outputs[0]); + + if (type_ == nvinfer1::ElementWiseOperation::kSUM) { + details::ElementWise(details::Add(), x, y, out, batch_size, + prev_size_, midd_size_, post_size_, stream); + } else if (type_ == nvinfer1::ElementWiseOperation::kPROD) { + details::ElementWise(details::Mul(), x, y, out, batch_size, + prev_size_, midd_size_, post_size_, stream); + } else { + PADDLE_THROW("Not implemented."); + } + + return cudaGetLastError() != cudaSuccess; +} + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.h b/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.h new file mode 100644 index 0000000000000000000000000000000000000000..9c461f7a5c44ebb9d4a755288c69abff55e2dea8 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.h @@ -0,0 +1,87 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +class ElementWisePlugin : public PluginTensorRT { + public: + ElementWisePlugin(nvinfer1::ElementWiseOperation type, + nvinfer1::Dims const &dims_x, nvinfer1::Dims const &dims_y, + int axis) + : type_(type), + dims_x_(dims_x), + dims_y_(dims_y), + axis_(axis), + prev_size_(1), + midd_size_(1), + post_size_(1) {} + + ElementWisePlugin(void const *serial_data, size_t serial_length) { + deserializeBase(serial_data, serial_length); + DeserializeValue(&serial_data, &serial_length, &axis_); + DeserializeValue(&serial_data, &serial_length, &dims_x_); + DeserializeValue(&serial_data, &serial_length, &dims_y_); + } + + ElementWisePlugin *clone() const override { + // return new ElementWisePlugin(dims_x_, dims_y_, axis_); + return nullptr; + } + + const char *getPluginType() const override { return "elementwise"; } + + nvinfer1::Dims getOutputDimensions(int index, + const nvinfer1::Dims *input_dims, + int num_inputs) override; + + int initialize() override; + + // execute the layer + int enqueue(int batch_size, const void *const *inputs, void **outputs, + void *workspace, cudaStream_t stream); + + protected: + size_t getSerializationSize() override { + return SerializedSize(axis_) + SerializedSize(dims_x_) + + SerializedSize(dims_y_) + getBaseSerializationSize(); + } + + void serialize(void *buffer) override { + serializeBase(buffer); + SerializeValue(&buffer, axis_); + SerializeValue(&buffer, dims_x_); + SerializeValue(&buffer, dims_y_); + } + + nvinfer1::ElementWiseOperation type_; + nvinfer1::Dims dims_x_; + nvinfer1::Dims dims_y_; + int axis_; + int prev_size_; + int midd_size_; + int post_size_; +}; + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.cu b/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.cu new file mode 100644 index 0000000000000000000000000000000000000000..e8f4254402a5d8a5e6c5a2384bf9fbe48341956e --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.cu @@ -0,0 +1,133 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
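Before the PReLU plugin below, it is worth spelling out the broadcasting scheme that ElementWisePlugin::initialize() above sets up: X is viewed as [prev, midd, post], and each element of Y is applied to a post-sized slice, repeated prev times. A minimal standalone sketch of that decomposition, with assumed toy shapes:

// prev/midd/post decomposition for broadcast elementwise ops.
#include <cassert>
#include <cstdio>
#include <vector>

int main() {
  // X has per-sample dims [2, 3, 4, 5]; Y has dims [3, 4], broadcast at axis 1.
  std::vector<int> dims_x = {2, 3, 4, 5};
  std::vector<int> dims_y = {3, 4};
  int axis = 1;

  int prev = 1, midd = 1, post = 1;
  for (int i = 0; i < axis; ++i) prev *= dims_x[i];
  for (size_t i = 0; i < dims_y.size(); ++i) {
    assert(dims_x[axis + i] == dims_y[i]);  // broadcast dims must match
    midd *= dims_y[i];
  }
  for (size_t i = axis + dims_y.size(); i < dims_x.size(); ++i)
    post *= dims_x[i];

  // Each y[j] (j < midd) is applied to a post-sized row of x, prev times over.
  std::printf("prev=%d midd=%d post=%d\n", prev, midd, post);  // 2 12 5
  return 0;
}

This is why the kernel above only handles the prev == 1 column-wise case so far: the other layouts fall through to "Not implemented".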
+ +#include +#include +#include "glog/logging.h" +#include "paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +static const int CUDA_NUM_THREADS = 1024; +static const int CUDA_MAX_NUM_BLOCKS = 65535; +inline static int GET_NUM_BLOCKS(const int N) { + return (N + CUDA_NUM_THREADS - 1) / CUDA_NUM_THREADS; +} + +__global__ void PReluChannelWiseKernel(const float *input, const float *alpha, + float *output, int channel, + size_t spatial_size) { + size_t offset = blockIdx.x * spatial_size; + const float *in = input + offset; + float *out = output + offset; + float scale = alpha[blockIdx.x % channel]; + + for (size_t i = threadIdx.x; i < spatial_size; i += blockDim.x) { + float x = in[i]; + out[i] = (x > 0) ? x : scale * x; + } +} + +__global__ void PReluElementWiseKernel(const float *input, const float *alpha, + float *output, size_t spatial_size) { + size_t offset = blockIdx.x * spatial_size; + const float *in = input + offset; + const float *scale = alpha + offset; + float *out = output + offset; + + for (size_t i = threadIdx.x; i < spatial_size; i += blockDim.x) { + float x = in[i]; + out[i] = (x > 0) ? x : scale[i] * x; + } +} + +__global__ void PReluScalarKernel(const float *input, const float *alpha, + float *output, size_t spatial_size) { + size_t offset = blockIdx.x * spatial_size; + const float *in = input + offset; + float scale = *alpha; + float *out = output + offset; + + for (size_t i = threadIdx.x; i < spatial_size; i += blockDim.x) { + float x = in[i]; + out[i] = (x > 0) ? x : scale * x; + } +} + +static inline void PReluChannelWise(cudaStream_t stream, const float *input, + const float *alpha, float *output, + int batch_size, + const nvinfer1::Dims &dims) { + size_t unroll = batch_size * dims.d[0]; + size_t spatial_size = dims.d[1] * dims.d[2]; + CHECK_LT(unroll, CUDA_MAX_NUM_BLOCKS); + PReluChannelWiseKernel<<>>( + input, alpha, output, dims.d[0], spatial_size); +} + +static inline void PReluElementWise(cudaStream_t stream, const float *input, + const float *alpha, float *output, + int batch_size, + const nvinfer1::Dims &dims) { + size_t unroll = batch_size * dims.d[0]; + size_t spatial_size = dims.d[1] * dims.d[2]; + CHECK_LT(unroll, CUDA_MAX_NUM_BLOCKS); + PReluElementWiseKernel<<>>( + input, alpha, output, spatial_size); +} + +static inline void PReluScalar(cudaStream_t stream, const float *input, + const float *alpha, float *output, + int batch_size, const nvinfer1::Dims &dims) { + size_t unroll = batch_size * dims.d[0]; + size_t spatial_size = dims.d[1] * dims.d[2]; + CHECK_LT(unroll, CUDA_MAX_NUM_BLOCKS); + PReluScalarKernel<<>>( + input, alpha, output, spatial_size); +} + +nvinfer1::Dims PReluPlugin::getOutputDimensions(int index, + const nvinfer1::Dims *inputDims, + int nbInputs) { + assert(nbInputs == 1); + assert(index < this->getNbOutputs()); + nvinfer1::Dims const &input_dims = inputDims[0]; + nvinfer1::Dims output_dims = input_dims; + return output_dims; +} + +int PReluPlugin::enqueue(int batchSize, const void *const *inputs, + void **outputs, void *workspace, cudaStream_t stream) { + // input dims is CHW. 
+ const auto &input_dims = this->getInputDims(0); + const float *input = reinterpret_cast(inputs[0]); + const float *alpha = reinterpret_cast(alpha_.get().values); + float *output = reinterpret_cast(outputs)[0]; + if (mode_ == "channel") { + PReluChannelWise(stream, input, alpha, output, batchSize, input_dims); + } else if (mode_ == "element") { + PReluElementWise(stream, input, alpha, output, batchSize, input_dims); + } else { + PReluScalar(stream, input, alpha, output, batchSize, input_dims); + } + return cudaGetLastError() != cudaSuccess; +} + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.h b/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.h new file mode 100644 index 0000000000000000000000000000000000000000..0db56a310b072e64425f70ac23267ec72353e54b --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/prelu_op_plugin.h @@ -0,0 +1,70 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include "paddle/fluid/inference/tensorrt/engine.h" +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +class PReluPlugin : public PluginTensorRT { + TensorRTEngine::Weight alpha_; + std::string mode_; + + protected: + size_t getSerializationSize() override { + // return getBaseSerializationSize(alpha_) + SerializedSize(mode_); + return 0; + } + + // TRT will call this func when we need to serialize the configuration of + // tensorrt. + // It should not be called by users. + void serialize(void *buffer) override { + // serializeBase(buffer); + // SerializeValue(&buffer, alpha_); + // SerializeValue(&buffer, mode_); + } + + public: + PReluPlugin(TensorRTEngine::Weight const &alpha, std::string const &mode) + : alpha_(alpha), mode_(mode) {} + + // It was used for tensorrt deserialization. + // It should not be called by users. 
+  PReluPlugin(void const *serialData, size_t serialLength) {
+    // deserializeBase(serialData, serialLength);
+    // DeserializeValue(&serialData, &serialLength, &alpha_);
+    // DeserializeValue(&serialData, &serialLength, &mode_);
+  }
+
+  PReluPlugin *clone() const override { return new PReluPlugin(alpha_, mode_); }
+
+  const char *getPluginType() const override { return "prelu"; }
+  int getNbOutputs() const override { return 1; }
+  nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims *inputs,
+                                     int nbInputDims) override;
+  int enqueue(int batchSize, const void *const *inputs, void **outputs,
+              void *workspace, cudaStream_t stream) override;
+};
+
+}  // namespace plugin
+}  // namespace tensorrt
+}  // namespace inference
+}  // namespace paddle
diff --git a/paddle/fluid/inference/tensorrt/plugin/serialize.h b/paddle/fluid/inference/tensorrt/plugin/serialize.h
new file mode 100644
index 0000000000000000000000000000000000000000..ce859f16fc87479adf090687121ff06951b5684c
--- /dev/null
+++ b/paddle/fluid/inference/tensorrt/plugin/serialize.h
@@ -0,0 +1,127 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include
+#include
+#include
+#include "paddle/fluid/platform/enforce.h"
+
+namespace paddle {
+namespace inference {
+namespace tensorrt {
+namespace plugin {
+
+template <typename T>
+inline void SerializeValue(void** buffer, T const& value);
+
+template <typename T>
+inline void DeserializeValue(void const** buffer, size_t* buffer_size,
+                             T* value);
+
+namespace details {
+
+template <typename T, class Enable = void>
+struct Serializer {};
+
+template <typename T>
+struct Serializer<T, typename std::enable_if<std::is_arithmetic<T>::value ||
+                                             std::is_enum<T>::value ||
+                                             std::is_pod<T>::value>::type> {
+  static size_t SerializedSize(T const& value) { return sizeof(T); }
+
+  static void Serialize(void** buffer, T const& value) {
+    std::memcpy(*buffer, &value, sizeof(T));
+    reinterpret_cast<char*&>(*buffer) += sizeof(T);
+  }
+
+  static void Deserialize(void const** buffer, size_t* buffer_size, T* value) {
+    assert(*buffer_size >= sizeof(T));
+    std::memcpy(value, *buffer, sizeof(T));
+    reinterpret_cast<char const*&>(*buffer) += sizeof(T);
+    *buffer_size -= sizeof(T);
+  }
+};
+
+template <>
+struct Serializer<const char*> {
+  static size_t SerializedSize(const char* value) { return strlen(value) + 1; }
+
+  static void Serialize(void** buffer, const char* value) {
+    std::strcpy(static_cast<char*>(*buffer), value);  // NOLINT
+    reinterpret_cast<char*&>(*buffer) += strlen(value) + 1;
+  }
+
+  static void Deserialize(void const** buffer, size_t* buffer_size,
+                          const char** value) {
+    *value = static_cast<char const*>(*buffer);
+    size_t data_size = strnlen(*value, *buffer_size) + 1;
+    assert(*buffer_size >= data_size);
+    reinterpret_cast<char const*&>(*buffer) += data_size;
+    *buffer_size -= data_size;
+  }
+};
+
+template <typename T>
+struct Serializer<std::vector<T>,
+                  typename std::enable_if<std::is_arithmetic<T>::value ||
+                                          std::is_enum<T>::value ||
+                                          std::is_pod<T>::value>::type> {
+  static size_t SerializedSize(std::vector<T> const& value) {
+    return sizeof(value.size()) + value.size() * sizeof(T);
+  }
+
+  static void Serialize(void** buffer, std::vector<T> const& value) {
+    SerializeValue(buffer, value.size());
+    size_t nbyte = value.size() * sizeof(T);
+    std::memcpy(*buffer, value.data(), nbyte);
+    reinterpret_cast<char*&>(*buffer) += nbyte;
+  }
+
+  static void Deserialize(void const** buffer, size_t* buffer_size,
+                          std::vector<T>* value) {
+    size_t size;
+    DeserializeValue(buffer, buffer_size, &size);
+    value->resize(size);
+    size_t nbyte = value->size() * sizeof(T);
+    PADDLE_ENFORCE_GE(*buffer_size, nbyte);
+    std::memcpy(value->data(), *buffer, nbyte);
+    reinterpret_cast<char const*&>(*buffer) += nbyte;
+    *buffer_size -= nbyte;
+  }
+};
+
+}  // namespace details
+
+template <typename T>
+inline size_t SerializedSize(T const& value) {
+  return details::Serializer<T>::SerializedSize(value);
+}
+
+template <typename T>
+inline void SerializeValue(void** buffer, T const& value) {
+  return details::Serializer<T>::Serialize(buffer, value);
+}
+
+template <typename T>
+inline void DeserializeValue(void const** buffer, size_t* buffer_size,
+                             T* value) {
+  return details::Serializer<T>::Deserialize(buffer, buffer_size, value);
+}
+
+}  // namespace plugin
+}  // namespace tensorrt
+}  // namespace inference
+}  // namespace paddle
diff --git a/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.cu b/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.cu
new file mode 100644
index 0000000000000000000000000000000000000000..de61ace59e299a1f51940e4b433a0133d4fbe7ff
--- /dev/null
+++ b/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.cu
@@ -0,0 +1,178 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
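The serialize.h helpers above pack values back to back with memcpy and advance the write cursor by reinterpreting the buffer pointer as a char reference. A self-contained round-trip sketch of the same pattern (a simplified reimplementation for illustration, not the header itself):

// Round-trip sketch: POD values are packed contiguously; deserialization
// walks the same layout and shrinks the remaining-size counter.
#include <cassert>
#include <cstring>

template <typename T>
void Ser(void** buf, const T& v) {
  std::memcpy(*buf, &v, sizeof(T));
  reinterpret_cast<char*&>(*buf) += sizeof(T);  // advance the cursor
}

template <typename T>
void Deser(const void** buf, size_t* left, T* v) {
  assert(*left >= sizeof(T));
  std::memcpy(v, *buf, sizeof(T));
  reinterpret_cast<const char*&>(*buf) += sizeof(T);
  *left -= sizeof(T);
}

int main() {
  char storage[64];
  void* w = storage;
  Ser(&w, 3);     // e.g. an axis attribute
  Ser(&w, 2.5f);  // e.g. some scalar attribute

  const void* r = storage;
  size_t left = sizeof(storage);
  int axis;
  float attr;
  Deser(&r, &left, &axis);
  Deser(&r, &left, &attr);
  assert(axis == 3 && attr == 2.5f);
  return 0;
}

Field order is the contract: serialize() and the deserialization constructor of each plugin must read and write the fields in exactly the same sequence.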
+ +#include +#include +#include "paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +// copied from operators::math::SplitFunctor +template +__global__ void SplitKernel(const T* input_data, const int in_row, + const int in_col, const int* out_cols, + int out_cols_size, T** outputs_data) { + int tid_x = blockIdx.x * blockDim.x + threadIdx.x; + int curr_segment = 0; + int curr_offset = out_cols[0]; + for (; tid_x < in_col; tid_x += blockDim.x * gridDim.x) { + int curr_col_offset = out_cols[curr_segment + 1]; + while (curr_col_offset <= tid_x) { + curr_offset = curr_col_offset; + ++curr_segment; + curr_col_offset = out_cols[curr_segment + 1]; + } + + int local_col = tid_x - curr_offset; + int segment_width = curr_col_offset - curr_offset; + T* output_ptr = outputs_data[curr_segment]; + if (output_ptr != nullptr) { + int tid_y = blockIdx.y * blockDim.y + threadIdx.y; + for (; tid_y < in_row; tid_y += blockDim.y * gridDim.y) + output_ptr[tid_y * segment_width + local_col] = + input_data[tid_y * in_col + tid_x]; + } + } +} + +template +__global__ void SplitKernel(const T* input_data, const int in_row, + const int in_col, const int fixed_out_col, + T** outputs_data) { + int tid_x = blockIdx.x * blockDim.x + threadIdx.x; + for (; tid_x < in_col; tid_x += blockDim.x * gridDim.x) { + int split = tid_x / fixed_out_col; + int in_offset = tid_x - split * fixed_out_col; + T* output_ptr = outputs_data[split]; + if (output_ptr != nullptr) { + int tid_y = blockIdx.y * blockDim.y + threadIdx.y; + for (; tid_y < in_row; tid_y += blockDim.y * gridDim.y) + output_ptr[tid_y * fixed_out_col + in_offset] = + input_data[tid_y * in_col + tid_x]; + } + } +} + +nvinfer1::Dims SplitPlugin::getOutputDimensions( + int index, const nvinfer1::Dims* input_dims, int num_inputs) { + PADDLE_ENFORCE_EQ(num_inputs, 1); + PADDLE_ENFORCE_LT(index, this->getNbOutputs()); + + nvinfer1::Dims output_dims = input_dims[0]; + output_dims.d[axis_] = output_length_.at(index); + return output_dims; +} + +int SplitPlugin::initialize() { + PADDLE_ENFORCE_LE(axis_, nvinfer1::Dims::MAX_DIMS); + // notice input dims is [C, H, W] + nvinfer1::Dims dims = this->getInputDims(0); + outer_rows_ = 1; + inner_cols_ = 1; + for (int i = 0; i < axis_; ++i) { + outer_rows_ *= dims.d[i]; + } + for (int i = axis_ + 1; i < dims.nbDims; ++i) { + inner_cols_ *= dims.d[i]; + } + same_shape_ = true; + std::vector segment_offsets(1, 0); + for (int i = 0; i < this->getNbOutputs(); ++i) { + if (output_length_[i] != output_length_[0]) { + same_shape_ = false; + } + segment_offsets.push_back(segment_offsets.back() + + output_length_[i] * inner_cols_); + } + inner_cols_ *= dims.d[axis_]; + d_segment_offsets_ = segment_offsets; + segment_offsets_ = std::move(segment_offsets); + d_output_ptrs_.resize(this->getNbOutputs(), nullptr); + return 0; +} + +template +inline void Split(cudaStream_t stream, const bool same_shape, + const int outer_rows, const int inner_cols, + const std::vector& segment_offsets, + const int* d_segment_offsets, const T* input, T** outputs) { + const int kThreadsPerBlock = 1024; + const int kMaxBlocks = 65535; + int block_cols = kThreadsPerBlock; + if (inner_cols < kThreadsPerBlock) { // block_cols is aligned by 32. 
+ block_cols = ((inner_cols + 31) >> 5) << 5; + } + int block_rows = kThreadsPerBlock / block_cols; + dim3 block_size = dim3(block_cols, block_rows, 1); + + int grid_cols = + std::min((inner_cols + block_cols - 1) / block_cols, kMaxBlocks); + int grid_rows = + std::min(kMaxBlocks / grid_cols, std::max(outer_rows / block_rows, 1)); + dim3 grid_size = dim3(grid_cols, grid_rows, 1); + + if (same_shape) { + SplitKernel<<>>( + input, outer_rows, inner_cols, segment_offsets[1], outputs); + } else { + SplitKernel<<>>( + input, outer_rows, inner_cols, d_segment_offsets, + static_cast(segment_offsets.size()), outputs); + } +} + +int SplitPlugin::enqueue(int batchSize, const void* const* inputs, + void** outputs, void* workspace, cudaStream_t stream) { + float const* input_ptr = reinterpret_cast(inputs[0]); + if (((batchSize == 1 && axis_ == 0) || axis_ == -1) && + this->getNbOutputs() < 10) { + float** output_ptrs = reinterpret_cast(outputs); + int data_type_size = (this->getDataType() == nvinfer1::DataType::kFLOAT) + ? sizeof(float) + : sizeof(__half); + for (int i = 0; i < this->getNbOutputs(); ++i) { + PADDLE_ENFORCE( + cudaMemcpyAsync( + output_ptrs[i], input_ptr + segment_offsets_[i], + (segment_offsets_[i + 1] - segment_offsets_[i]) * data_type_size, + cudaMemcpyDeviceToDevice, stream) == cudaSuccess); + } + } else { + outer_rows_ *= batchSize; + const int* d_segment_offsets_ptr = + thrust::raw_pointer_cast(&d_segment_offsets_[0]); + float** output_ptrs = thrust::raw_pointer_cast(&d_output_ptrs_[0]); + PADDLE_ENFORCE(cudaMemcpyAsync(output_ptrs, outputs, + this->getNbOutputs() * sizeof(float*), + cudaMemcpyHostToDevice, + stream) == cudaSuccess); + if (this->getDataType() == nvinfer1::DataType::kFLOAT) { + Split(stream, same_shape_, outer_rows_, inner_cols_, segment_offsets_, + d_segment_offsets_ptr, input_ptr, output_ptrs); + } else { + Split(stream, same_shape_, outer_rows_, inner_cols_, segment_offsets_, + d_segment_offsets_ptr, (__half*)input_ptr, // NOLINT + (__half**)output_ptrs); // NOLINT + } + } + return cudaGetLastError() != cudaSuccess; +} + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h b/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h new file mode 100644 index 0000000000000000000000000000000000000000..6f028d3d72ae3cc7d96c6782b734cdbf1243c06c --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h @@ -0,0 +1,76 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
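SplitPlugin::initialize() above flattens the per-sample input into a [outer_rows, inner_cols] view and precomputes per-output column offsets. A standalone sketch with assumed dims, mirroring the axis-2 cases in test_split_op.cc earlier in this diff (the op's axis counts the batch dimension, the plugin's axis_ does not):

// Offset bookkeeping behind SplitPlugin::initialize().
#include <cstdio>
#include <vector>

int main() {
  // Per-sample dims [C,H,W] = {3, 4, 2}; split H (plugin axis_ = 1)
  // into sections {2, 2}.
  const std::vector<int> dims = {3, 4, 2};
  const int axis = 1;
  const std::vector<int> output_length = {2, 2};

  int outer_rows = 1;
  int inner_cols = 1;
  for (int i = 0; i < axis; ++i) outer_rows *= dims[i];
  for (int i = axis + 1; i < static_cast<int>(dims.size()); ++i)
    inner_cols *= dims[i];

  std::vector<int> segment_offsets(1, 0);
  for (int len : output_length)
    segment_offsets.push_back(segment_offsets.back() + len * inner_cols);
  inner_cols *= dims[axis];  // total columns per outer row

  // outer_rows=3 inner_cols=8 offsets={0,4,8}: each outer row is an 8-wide
  // strip, and output i takes columns [offsets[i], offsets[i+1]).
  std::printf("outer_rows=%d inner_cols=%d offsets={%d,%d,%d}\n", outer_rows,
              inner_cols, segment_offsets[0], segment_offsets[1],
              segment_offsets[2]);
  return 0;
}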
+ +#pragma once + +#include +#include +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +class SplitPlugin : public PluginTensorRT { + public: + SplitPlugin(int axis, std::vector const &output_lengths) + : axis_(axis), same_shape_(true), output_length_(output_lengths) {} + + SplitPlugin(void const *serial_data, size_t serial_length) { + deserializeBase(serial_data, serial_length); + DeserializeValue(&serial_data, &serial_length, &axis_); + DeserializeValue(&serial_data, &serial_length, &output_length_); + } + + SplitPlugin *clone() const override { + return new SplitPlugin(axis_, output_length_); + } + + const char *getPluginType() const override { return "split"; } + int getNbOutputs() const override { return output_length_.size(); } + nvinfer1::Dims getOutputDimensions(int index, + const nvinfer1::Dims *input_dims, + int num_inputs) override; + + int initialize() override; + int enqueue(int batchSize, const void *const *inputs, void **outputs, + void *workspace, cudaStream_t stream) override; + + protected: + size_t getSerializationSize() override { + return SerializedSize(axis_) + SerializedSize(output_length_) + + getBaseSerializationSize(); + } + + void serialize(void *buffer) override { + serializeBase(buffer); + SerializeValue(&buffer, axis_); + SerializeValue(&buffer, output_length_); + } + + int axis_; + int outer_rows_; + int inner_cols_; + bool same_shape_; + std::vector output_length_; + std::vector segment_offsets_; + thrust::device_vector d_segment_offsets_; + thrust::device_vector d_output_ptrs_; +}; + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/trt_plugin.cc b/paddle/fluid/inference/tensorrt/plugin/trt_plugin.cc new file mode 100644 index 0000000000000000000000000000000000000000..b0f4cff3ac184beeed2ebd3a4b7531d570c87075 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/trt_plugin.cc @@ -0,0 +1,61 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
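SplitPlugin::getSerializationSize() above sums SerializedSize() over its own fields plus the base size. A standalone sketch of that bookkeeping, using the vector encoding from serialize.h (a size_t length followed by the raw elements):

// Size accounting behind getSerializationSize() for a SplitPlugin-like type.
#include <cstdio>
#include <vector>

template <typename T>
size_t SerializedSize(const std::vector<T>& v) {
  return sizeof(size_t) + v.size() * sizeof(T);
}

int main() {
  int axis = 1;
  std::vector<int> output_length = {2, 2};
  size_t base = 0;  // stands in for getBaseSerializationSize()
  size_t total = sizeof(axis) + SerializedSize(output_length) + base;
  std::printf("%zu bytes\n", total);  // 4 + (8 + 2*4) + 0 = 20 on LP64
  return 0;
}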
+ +#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +void PluginTensorRT::serializeBase(void*& buffer) { + SerializeValue(&buffer, input_dims_); + SerializeValue(&buffer, max_batch_size_); + SerializeValue(&buffer, data_type_); + SerializeValue(&buffer, data_format_); +} + +void PluginTensorRT::deserializeBase(void const*& serial_data, + size_t& serial_length) { + DeserializeValue(&serial_data, &serial_length, &input_dims_); + DeserializeValue(&serial_data, &serial_length, &max_batch_size_); + DeserializeValue(&serial_data, &serial_length, &data_type_); + DeserializeValue(&serial_data, &serial_length, &data_format_); +} + +size_t PluginTensorRT::getBaseSerializationSize() { + return (SerializedSize(input_dims_) + SerializedSize(max_batch_size_) + + SerializedSize(data_type_) + SerializedSize(data_format_)); +} + +bool PluginTensorRT::supportsFormat(nvinfer1::DataType type, + nvinfer1::PluginFormat format) const { + return ((type == nvinfer1::DataType::kFLOAT) && + (format == nvinfer1::PluginFormat::kNCHW)); +} + +void PluginTensorRT::configureWithFormat( + const nvinfer1::Dims* input_dims, int num_inputs, + const nvinfer1::Dims* output_dims, int num_outputs, nvinfer1::DataType type, + nvinfer1::PluginFormat format, int max_batch_size) { + data_type_ = type; + data_format_ = format; + input_dims_.assign(input_dims, input_dims + num_inputs); + max_batch_size_ = max_batch_size; +} + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tensorrt/plugin/trt_plugin.h b/paddle/fluid/inference/tensorrt/plugin/trt_plugin.h new file mode 100644 index 0000000000000000000000000000000000000000..86084829e150f8a39610319a8f2138f2b2fdec68 --- /dev/null +++ b/paddle/fluid/inference/tensorrt/plugin/trt_plugin.h @@ -0,0 +1,110 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include + +#include "paddle/fluid/inference/tensorrt/plugin/serialize.h" +#include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/platform/profiler.h" + +DECLARE_bool(profile); + +namespace paddle { +namespace inference { +namespace tensorrt { +namespace plugin { + +class PluginTensorRT : public nvinfer1::IPluginExt { + public: + PluginTensorRT() {} + // It was used for TensorRT deserialization. + // It should not be called by users. 
+  PluginTensorRT(const void* serialized_data, size_t length) {}
+  virtual ~PluginTensorRT() {}
+
+  nvinfer1::Dims const& getInputDims(int index) const {
+    return input_dims_.at(index);
+  }
+  size_t getMaxBatchSize() const { return max_batch_size_; }
+  nvinfer1::DataType getDataType() const { return data_type_; }
+  nvinfer1::PluginFormat getDataFormat() const { return data_format_; }
+  virtual const char* getPluginVersion() const { return "1"; }
+
+  void AddInput(nvinfer1::ITensor* input) { inputs_.push_back(input); }
+  std::vector<nvinfer1::ITensor*>& GetInputs() { return inputs_; }
+
+  virtual nvinfer1::IPluginExt* clone() const = 0;
+  virtual const char* getPluginType() const = 0;
+
+  // The following functions are inherited from nvinfer1::IPluginExt.
+  // Get the number of outputs from the layer
+  int getNbOutputs() const { return 1; }
+  // Get the dimension of an output tensor
+  virtual nvinfer1::Dims getOutputDimensions(int index,
+                                             const nvinfer1::Dims* input_dims,
+                                             int num_inputs) = 0;
+  // Find the workspace size required by the layer
+  size_t getWorkspaceSize(int) const override { return 0; }
+
+  // Initialize the layer for execution.
+  // This is called when the engine is created.
+  int initialize() override { return 0; }
+  // Shutdown the layer. This is called when the engine is destroyed.
+  void terminate() override {}
+  // Execute the layer
+  virtual int enqueue(int batch_size, const void* const* inputs, void** outputs,
+                      void* workspace, cudaStream_t stream) = 0;
+
+  // Find the size of the serialization buffer required
+  virtual size_t getSerializationSize() = 0;
+  // Serialize the layer config to buffer.
+  // TensorRT will call this func to serialize the configuration of the
+  // engine. It should not be called by users.
+  virtual void serialize(void* buffer) = 0;
+
+  // Check format support. The default is FLOAT32 and NCHW.
+ bool supportsFormat(nvinfer1::DataType type, + nvinfer1::PluginFormat format) const override; + // Configure the layer + void configureWithFormat(const nvinfer1::Dims* input_dims, int num_inputs, + const nvinfer1::Dims* output_dims, int num_outputs, + nvinfer1::DataType type, + nvinfer1::PluginFormat format, + int max_batch_size) override; + + protected: + // Deserialize input_dims, max_batch_size, data_type, data_format + void deserializeBase(void const*& serial_data, // NOLINT + size_t& serial_length); // NOLINT + size_t getBaseSerializationSize(); + // Serialize input_dims, max_batch_size, data_type, data_format + void serializeBase(void*& buffer); // NOLINT + + std::vector input_dims_; + size_t max_batch_size_; + nvinfer1::DataType data_type_; + nvinfer1::PluginFormat data_format_; + + std::vector inputs_; +}; + +} // namespace plugin +} // namespace tensorrt +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/tests/api/CMakeLists.txt b/paddle/fluid/inference/tests/api/CMakeLists.txt index c3dd1f433691e1c96e9f38ef7b595befad26408f..7dc88d9dd052c59aaa59b7802ee5a38ea9d89bc6 100644 --- a/paddle/fluid/inference/tests/api/CMakeLists.txt +++ b/paddle/fluid/inference/tests/api/CMakeLists.txt @@ -1,17 +1,13 @@ -set(INFERENCE_URL "http://paddle-inference-dist.cdn.bcebos.com") -set(INFERENCE_DEMO_INSTALL_DIR "${THIRD_PARTY_PATH}/inference_demo" CACHE STRING - "A path setting inference demo download directories.") set(INFERENCE_EXTRA_DEPS paddle_inference_api paddle_fluid_api ir_pass_manager analysis_predictor) -function (inference_download install_dir url filename) - message(STATUS "Download inference test stuff from ${url}/${filename}") - execute_process(COMMAND bash -c "mkdir -p ${install_dir}") - execute_process(COMMAND bash -c "cd ${install_dir} && wget -q ${url}/${filename}") - message(STATUS "finish downloading ${filename}") -endfunction() -function (inference_download_and_uncompress install_dir url filename) - inference_download(${install_dir} ${url} ${filename}) - execute_process(COMMAND bash -c "cd ${install_dir} && tar xzf ${filename}") +if(WITH_GPU AND TENSORRT_FOUND) + set(INFERENCE_EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} analysis ${analysis_deps} ir_pass_manager analysis_predictor) +endif() + +function(download_model install_dir model_name) + if (NOT EXISTS ${install_dir}) + inference_download_and_uncompress(${install_dir} ${INFERENCE_URL} ${model_name}) + endif() endfunction() function(download_model_and_data install_dir model_name data_name) @@ -27,15 +23,22 @@ function(inference_analysis_api_test target install_dir filename) ARGS --infer_model=${install_dir}/model --infer_data=${install_dir}/data.txt) endfunction() +function(inference_analysis_api_test_with_fake_data target install_dir filename model_name) + download_model(${install_dir} ${model_name}) + inference_analysis_test(${target} SRCS ${filename} + EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} + ARGS --infer_model=${install_dir}/model) +endfunction() + # RNN1 -if(NOT APPLE) +if(NOT APPLE AND WITH_MKLML) set(RNN1_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/rnn1") download_model_and_data(${RNN1_INSTALL_DIR} "rnn1%2Fmodel.tar.gz" "rnn1%2Fdata.txt.tar.gz") inference_analysis_api_test(test_analyzer_rnn1 ${RNN1_INSTALL_DIR} analyzer_rnn1_tester.cc) else() - # TODO: fix this test on MACOS, the reason is that - # fusion_seqexpand_concat_fc_op is not supported on MACOS - message(WARNING "These tests has been disabled in OSX before being fixed: \n test_analyzer_rnn1") + # TODO: fix this test on MACOS and OPENBLAS, 
the reason is that
+  # fusion_seqexpand_concat_fc_op is not supported on MACOS and OPENBLAS
+  message(WARNING "These tests have been disabled on macOS or with WITH_MKL=OFF until they are fixed: \n test_analyzer_rnn1")
 endif()
 
 # RNN2
@@ -43,6 +46,11 @@ set(RNN2_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/rnn2")
 download_model_and_data(${RNN2_INSTALL_DIR} "rnn2_model.tar.gz" "rnn2_data.txt.tar.gz")
 inference_analysis_api_test(test_analyzer_rnn2 ${RNN2_INSTALL_DIR} analyzer_rnn2_tester.cc)
 
+# DAM
+set(DAM_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/dam")
+download_model_and_data(${DAM_INSTALL_DIR} "DAM_model.tar.gz" "DAM_data.txt.tar.gz")
+inference_analysis_api_test(test_analyzer_dam ${DAM_INSTALL_DIR} analyzer_dam_tester.cc)
+
 # chinese_ner
 set(CHINESE_NER_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/chinese_ner")
 download_model_and_data(${CHINESE_NER_INSTALL_DIR} "chinese_ner_model.tar.gz" "chinese_ner-data.txt.tar.gz")
@@ -71,40 +79,40 @@ endif()
 inference_analysis_api_test(test_analyzer_ocr ${OCR_INSTALL_DIR} analyzer_vis_tester.cc)
 
 # resnet50
-set(RESNET50_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/resnet50")
-if (NOT EXISTS ${RESNET50_INSTALL_DIR})
-  inference_download_and_uncompress(${RESNET50_INSTALL_DIR} ${INFERENCE_URL} "resnet50_model.tar.gz")
-endif()
-inference_analysis_test(test_analyzer_resnet50 SRCS analyzer_resnet50_tester.cc
-  EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} ARGS --infer_model=${RESNET50_INSTALL_DIR}/model)
+inference_analysis_api_test_with_fake_data(test_analyzer_resnet50
+  "${INFERENCE_DEMO_INSTALL_DIR}/resnet50" analyzer_resnet50_tester.cc "resnet50_model.tar.gz")
+
+# mobilenet with depthwise_conv op
+inference_analysis_api_test_with_fake_data(test_analyzer_mobilenet
+  "${INFERENCE_DEMO_INSTALL_DIR}/mobilenet_depthwise_conv" analyzer_resnet50_tester.cc "mobilenet_model.tar.gz")
 
 # anakin
 if (WITH_ANAKIN AND WITH_MKL) # only needed in CI
-  # anakin rnn1
-  set(ANAKIN_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/anakin")
-  set(ANAKIN_RNN1_INSTALL_DIR "${ANAKIN_INSTALL_DIR}/rnn1")
-  inference_download(${ANAKIN_RNN1_INSTALL_DIR} ${INFERENCE_URL} "anakin_test%2Fditu_rnn.anakin2.model.bin")
-  inference_download(${ANAKIN_RNN1_INSTALL_DIR} ${INFERENCE_URL} "anakin_test%2Fditu_rnn_data.txt")
-  cc_test(test_anakin_rnn1 SRCS anakin_rnn1_tester.cc
-          ARGS --model=${ANAKIN_RNN1_INSTALL_DIR}/anakin_test%2Fditu_rnn.anakin2.model.bin
-               --datapath=${ANAKIN_RNN1_INSTALL_DIR}/anakin_test%2Fditu_rnn_data.txt
-          DEPS inference_anakin_api_shared SERIAL)
-  # anakin mobilenet
-  if(WITH_GPU)
-    set(ANAKIN_MOBILENET_INSTALL_DIR "${ANAKIN_INSTALL_DIR}/mobilenet")
-    inference_download(${ANAKIN_MOBILENET_INSTALL_DIR} ${INFERENCE_URL} "mobilenet_v2.anakin.bin")
-    cc_test(test_anakin_mobilenet SRCS anakin_mobilenet_tester.cc
-            ARGS --model=${ANAKIN_MOBILENET_INSTALL_DIR}/mobilenet_v2.anakin.bin
-            DEPS inference_anakin_api_shared dynload_cuda SERIAL)
-  endif()
+  # anakin rnn1
+  set(ANAKIN_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/anakin")
+  set(ANAKIN_RNN1_INSTALL_DIR "${ANAKIN_INSTALL_DIR}/rnn1")
+  inference_download(${ANAKIN_RNN1_INSTALL_DIR} ${INFERENCE_URL} "anakin_test%2Fditu_rnn.anakin2.model.bin")
+  inference_download(${ANAKIN_RNN1_INSTALL_DIR} ${INFERENCE_URL} "anakin_test%2Fditu_rnn_data.txt")
+  cc_test(test_anakin_rnn1 SRCS anakin_rnn1_tester.cc
+          ARGS --model=${ANAKIN_RNN1_INSTALL_DIR}/anakin_test%2Fditu_rnn.anakin2.model.bin
+               --datapath=${ANAKIN_RNN1_INSTALL_DIR}/anakin_test%2Fditu_rnn_data.txt
+          DEPS inference_anakin_api_shared SERIAL)
+  # anakin mobilenet
+  if(WITH_GPU)
+    
set(ANAKIN_MOBILENET_INSTALL_DIR "${ANAKIN_INSTALL_DIR}/mobilenet") + inference_download(${ANAKIN_MOBILENET_INSTALL_DIR} ${INFERENCE_URL} "mobilenet_v2.anakin.bin") + cc_test(test_anakin_mobilenet SRCS anakin_mobilenet_tester.cc + ARGS --model=${ANAKIN_MOBILENET_INSTALL_DIR}/mobilenet_v2.anakin.bin + DEPS inference_anakin_api_shared dynload_cuda SERIAL) + endif() endif() if(WITH_GPU AND TENSORRT_FOUND) - set(TRT_MODEL_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/trt") - if (NOT EXISTS ${TRT_MODEL_INSTALL_DIR}) - inference_download_and_uncompress(${TRT_MODEL_INSTALL_DIR} ${INFERENCE_URL}/tensorrt_test "trt_test_models.tar.gz") - endif() - cc_test(test_trt_models SRCS trt_models_tester.cc - ARGS --dirname=${TRT_MODEL_INSTALL_DIR}/trt_test_models - DEPS paddle_inference_tensorrt_subgraph_engine) + set(TRT_MODEL_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/trt") + if (NOT EXISTS ${TRT_MODEL_INSTALL_DIR}) + inference_download_and_uncompress(${TRT_MODEL_INSTALL_DIR} ${INFERENCE_URL}/tensorrt_test "trt_test_models.tar.gz") + endif() + inference_analysis_test(test_trt_models SRCS trt_models_tester.cc + EXTRA_DEPS ${INFERENCE_EXTRA_DEPS} + ARGS --infer_model=${TRT_MODEL_INSTALL_DIR}/trt_test_models SERIAL) endif() diff --git a/paddle/fluid/inference/tests/api/anakin_rnn1_tester.cc b/paddle/fluid/inference/tests/api/anakin_rnn1_tester.cc index c4022225fd4526998af8526d0afb87e7a5be6336..da42688f29f044639d6d97ff09f932490d4c3d54 100644 --- a/paddle/fluid/inference/tests/api/anakin_rnn1_tester.cc +++ b/paddle/fluid/inference/tests/api/anakin_rnn1_tester.cc @@ -13,7 +13,6 @@ See the License for the specific language governing permissions and limitations under the License. */ #include -#include #include #include #include diff --git a/paddle/fluid/inference/tests/api/analyzer_dam_tester.cc b/paddle/fluid/inference/tests/api/analyzer_dam_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..b369cba5c8b3f8aadd1123d6b7345fad6e47bd0f --- /dev/null +++ b/paddle/fluid/inference/tests/api/analyzer_dam_tester.cc @@ -0,0 +1,222 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
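The DataRecord::Load() that follows assumes each line of the test file carries 2 * MAX_TURN_NUM + 3 comma-separated fields: the turns, the turn masks, the response, the response mask, and a reference score. A standalone sketch of that slicing, with toy values and a smaller turn count than the tester's 9:

// Field layout expected by the DAM tester's data loader.
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

int main() {
  const int kMaxTurnNum = 2;  // the tester uses 9; 2 keeps the demo short
  // two turns, two turn masks, response, response mask, reference score
  std::string line = "1 2,3 4,1 1,1 0,5 6,1 1,0.75";

  std::vector<std::string> fields;
  std::stringstream ss(line);
  for (std::string f; std::getline(ss, f, ',');) fields.push_back(f);

  if (fields.size() != static_cast<size_t>(2 * kMaxTurnNum + 3)) return 1;
  std::printf("turn0='%s' response='%s' score='%s'\n", fields[0].c_str(),
              fields[2 * kMaxTurnNum].c_str(),
              fields[2 * kMaxTurnNum + 2].c_str());
  return 0;
}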
+ +#include "paddle/fluid/inference/tests/api/tester_helper.h" + +namespace paddle { +namespace inference { +using contrib::AnalysisConfig; +#define MAX_TURN_NUM 9 +#define MAX_TURN_LEN 50 +static std::vector result_data; + +struct DataRecord { + std::vector> + turns[MAX_TURN_NUM]; // turns data : MAX_TURN_NUM + std::vector> + turns_mask[MAX_TURN_NUM]; // turns mask data : MAX_TURN_NUM + std::vector> response; // response data : 1 + std::vector> response_mask; // response mask data : 1 + size_t batch_iter{0}; + size_t batch_size{1}; + size_t num_samples; // total number of samples + DataRecord() = default; + explicit DataRecord(const std::string &path, int batch_size = 1) + : batch_size(batch_size) { + Load(path); + } + DataRecord NextBatch() { + DataRecord data; + size_t batch_end = batch_iter + batch_size; + // NOTE skip the final batch, if no enough data is provided. + if (batch_end <= response.size()) { + for (int i = 0; i < MAX_TURN_NUM; ++i) { + data.turns[i].assign(turns[i].begin() + batch_iter, + turns[i].begin() + batch_end); + } + for (int i = 0; i < MAX_TURN_NUM; ++i) { + data.turns_mask[i].assign(turns_mask[i].begin() + batch_iter, + turns_mask[i].begin() + batch_end); + } + data.response.assign(response.begin() + batch_iter, + response.begin() + batch_end); + data.response_mask.assign(response_mask.begin() + batch_iter, + response_mask.begin() + batch_end); + CHECK(!data.response.empty()); + CHECK(!data.response_mask.empty()); + CHECK_EQ(data.response.size(), data.response_mask.size()); + } + batch_iter += batch_size; + return data; + } + void Load(const std::string &path) { + std::ifstream file(path); + std::string line; + size_t num_lines = 0; + result_data.clear(); + while (std::getline(file, line)) { + num_lines++; + std::vector data; + split(line, ',', &data); + CHECK_EQ(data.size(), (size_t)(2 * MAX_TURN_NUM + 3)); + // load turn data + std::vector turns_tmp[MAX_TURN_NUM]; + for (int i = 0; i < MAX_TURN_NUM; ++i) { + split_to_int64(data[i], ' ', &turns_tmp[i]); + turns[i].push_back(std::move(turns_tmp[i])); + } + // load turn_mask data + std::vector turns_mask_tmp[MAX_TURN_NUM]; + for (int i = 0; i < MAX_TURN_NUM; ++i) { + split_to_float(data[MAX_TURN_NUM + i], ' ', &turns_mask_tmp[i]); + turns_mask[i].push_back(std::move(turns_mask_tmp[i])); + } + // load response data + std::vector response_tmp; + split_to_int64(data[2 * MAX_TURN_NUM], ' ', &response_tmp); + response.push_back(std::move(response_tmp)); + // load response_mask data + std::vector response_mask_tmp; + split_to_float(data[2 * MAX_TURN_NUM + 1], ' ', &response_mask_tmp); + response_mask.push_back(std::move(response_mask_tmp)); + // load result data + float result_tmp; + result_tmp = std::stof(data[2 * MAX_TURN_NUM + 2]); + result_data.push_back(result_tmp); + } + num_samples = num_lines; + } +}; + +void PrepareInputs(std::vector *input_slots, DataRecord *data, + int batch_size) { + PaddleTensor turns_tensor[MAX_TURN_NUM]; + PaddleTensor turns_mask_tensor[MAX_TURN_NUM]; + PaddleTensor response_tensor; + PaddleTensor response_mask_tensor; + std::string turn_pre = "turn_"; + std::string turn_mask_pre = "turn_mask_"; + + auto one_batch = data->NextBatch(); + int size = one_batch.response[0].size(); + CHECK_EQ(size, MAX_TURN_LEN); + // turn tensor assignment + for (int i = 0; i < MAX_TURN_NUM; ++i) { + turns_tensor[i].name = turn_pre + std::to_string(i); + turns_tensor[i].shape.assign({batch_size, size, 1}); + turns_tensor[i].dtype = PaddleDType::INT64; + TensorAssignData(&turns_tensor[i], one_batch.turns[i]); + 
} + // turn mask tensor assignment + for (int i = 0; i < MAX_TURN_NUM; ++i) { + turns_mask_tensor[i].name = turn_mask_pre + std::to_string(i); + turns_mask_tensor[i].shape.assign({batch_size, size, 1}); + turns_mask_tensor[i].dtype = PaddleDType::FLOAT32; + TensorAssignData(&turns_mask_tensor[i], one_batch.turns_mask[i]); + } + // response tensor assignment + response_tensor.name = "response"; + response_tensor.shape.assign({batch_size, size, 1}); + response_tensor.dtype = PaddleDType::INT64; + TensorAssignData(&response_tensor, one_batch.response); + // response mask tensor assignment + response_mask_tensor.name = "response_mask"; + response_mask_tensor.shape.assign({batch_size, size, 1}); + response_mask_tensor.dtype = PaddleDType::FLOAT32; + TensorAssignData(&response_mask_tensor, one_batch.response_mask); + + // Set inputs. + for (int i = 0; i < MAX_TURN_NUM; ++i) { + input_slots->push_back(std::move(turns_tensor[i])); + } + for (int i = 0; i < MAX_TURN_NUM; ++i) { + input_slots->push_back(std::move(turns_mask_tensor[i])); + } + input_slots->push_back(std::move(response_tensor)); + input_slots->push_back(std::move(response_mask_tensor)); +} + +void SetConfig(contrib::AnalysisConfig *cfg) { + cfg->prog_file = FLAGS_infer_model + "/__model__"; + cfg->param_file = FLAGS_infer_model + "/param"; + cfg->use_gpu = false; + cfg->device = 0; + cfg->specify_input_name = true; + cfg->enable_ir_optim = true; +} + +void SetInput(std::vector> *inputs) { + DataRecord data(FLAGS_infer_data, FLAGS_batch_size); + std::vector input_slots; + int test_batch_num = + FLAGS_test_all_data ? data.num_samples / FLAGS_batch_size : 1; + LOG(INFO) << "The number of samples to be test: " + << test_batch_num * FLAGS_batch_size; + for (int bid = 0; bid < test_batch_num; ++bid) { + input_slots.clear(); + PrepareInputs(&input_slots, &data, FLAGS_batch_size); + (*inputs).emplace_back(input_slots); + } +} + +// Easy for profiling independently. 
+
+// Easy for profiling independently.
+TEST(Analyzer_dam, profile) {
+  contrib::AnalysisConfig cfg;
+  SetConfig(&cfg);
+
+  std::vector<PaddleTensor> outputs;
+  std::vector<std::vector<PaddleTensor>> input_slots_all;
+  SetInput(&input_slots_all);
+  TestPrediction(reinterpret_cast<const PaddlePredictor::Config *>(&cfg),
+                 input_slots_all, &outputs, FLAGS_num_threads);
+
+  if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) {
+    PADDLE_ENFORCE_GT(outputs.size(), 0);
+    size_t size = GetSize(outputs[0]);
+    PADDLE_ENFORCE_GT(size, 0);
+    float *result = static_cast<float *>(outputs[0].data.data());
+    for (size_t i = 0; i < size; i++) {
+      EXPECT_NEAR(result[i], result_data[i], 1e-3);
+    }
+  }
+}
+
+// Check the fuse status
+TEST(Analyzer_dam, fuse_statis) {
+  contrib::AnalysisConfig cfg;
+  SetConfig(&cfg);
+
+  int num_ops;
+  auto predictor = CreatePaddlePredictor<AnalysisConfig>(cfg);
+  auto fuse_statis = GetFuseStatis(
+      static_cast<AnalysisPredictor *>(predictor.get()), &num_ops);
+  ASSERT_TRUE(fuse_statis.count("fc_fuse"));
+  EXPECT_EQ(fuse_statis.at("fc_fuse"), 317);
+  EXPECT_EQ(num_ops, 2020);
+}
+
+// Compare result of NativeConfig and AnalysisConfig
+TEST(Analyzer_dam, compare) {
+  contrib::AnalysisConfig cfg;
+  SetConfig(&cfg);
+
+  std::vector<std::vector<PaddleTensor>> input_slots_all;
+  SetInput(&input_slots_all);
+
+  CompareNativeAndAnalysis(
+      reinterpret_cast<const PaddlePredictor::Config *>(&cfg), input_slots_all);
+}
+
+}  // namespace inference
+}  // namespace paddle
diff --git a/paddle/fluid/inference/tests/api/analyzer_lac_tester.cc b/paddle/fluid/inference/tests/api/analyzer_lac_tester.cc
index 5fb551810fd4d1c56547a8aa581cb6c4587df031..310852e2f7cb284bda3041911d0059b55ee3b477 100644
--- a/paddle/fluid/inference/tests/api/analyzer_lac_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_lac_tester.cc
@@ -133,7 +133,8 @@ TEST(Analyzer_LAC, profile) {
 
   std::vector<std::vector<PaddleTensor>> input_slots_all;
   SetInput(&input_slots_all);
-  TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads);
+  TestPrediction(reinterpret_cast<const PaddlePredictor::Config *>(&cfg),
+                 input_slots_all, &outputs, FLAGS_num_threads);
 
   if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) {
     // the first inference result
@@ -175,7 +176,8 @@ TEST(Analyzer_LAC, compare) {
 
   std::vector<std::vector<PaddleTensor>> input_slots_all;
   SetInput(&input_slots_all);
-  CompareNativeAndAnalysis(cfg, input_slots_all);
+  CompareNativeAndAnalysis(
+      reinterpret_cast<const PaddlePredictor::Config *>(&cfg), input_slots_all);
 }
 
 }  // namespace analysis
diff --git a/paddle/fluid/inference/tests/api/analyzer_ner_tester.cc b/paddle/fluid/inference/tests/api/analyzer_ner_tester.cc
index 577b97e271aacab5d6740de7c8bc00bc87ae54dd..3a5f844de3cae7eb9b6e3555c5219c6cf8ee1919 100644
--- a/paddle/fluid/inference/tests/api/analyzer_ner_tester.cc
+++ b/paddle/fluid/inference/tests/api/analyzer_ner_tester.cc
@@ -20,7 +20,6 @@ using contrib::AnalysisConfig;
 
 struct DataRecord {
   std::vector<std::vector<int64_t>> word_data_all, mention_data_all;
-  std::vector<std::vector<int64_t>> rnn_word_datas, rnn_mention_datas;
   std::vector<size_t> lod;  // two inputs have the same lod info.
size_t batch_iter{0}; size_t batch_size{1}; @@ -45,8 +44,6 @@ struct DataRecord { CHECK(!data.mention_data_all.empty()); CHECK_EQ(data.word_data_all.size(), data.mention_data_all.size()); for (size_t j = 0; j < data.word_data_all.size(); j++) { - data.rnn_word_datas.push_back(data.word_data_all[j]); - data.rnn_mention_datas.push_back(data.mention_data_all[j]); // calculate lod data.lod.push_back(data.lod.back() + data.word_data_all[j].size()); } @@ -87,8 +84,8 @@ void PrepareInputs(std::vector *input_slots, DataRecord *data, lod_mention_tensor.shape.assign({size, 1}); lod_mention_tensor.lod.assign({one_batch.lod}); // assign data - TensorAssignData(&lod_word_tensor, one_batch.rnn_word_datas); - TensorAssignData(&lod_mention_tensor, one_batch.rnn_mention_datas); + TensorAssignData(&lod_word_tensor, one_batch.word_data_all); + TensorAssignData(&lod_mention_tensor, one_batch.mention_data_all); // Set inputs. input_slots->assign({lod_word_tensor, lod_mention_tensor}); for (auto &tensor : *input_slots) { @@ -124,7 +121,8 @@ TEST(Analyzer_Chinese_ner, profile) { std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { // the first inference result @@ -163,7 +161,8 @@ TEST(Analyzer_Chinese_ner, compare) { std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } } // namespace inference diff --git a/paddle/fluid/inference/tests/api/analyzer_resnet50_tester.cc b/paddle/fluid/inference/tests/api/analyzer_resnet50_tester.cc index 290fb007d8ba94a2d121947fe67c6474586ac0e0..abc63577b7913a3c9de7d6c16d8ac3e85ffd7c3c 100644 --- a/paddle/fluid/inference/tests/api/analyzer_resnet50_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_resnet50_tester.cc @@ -27,48 +27,34 @@ void SetConfig(AnalysisConfig *cfg) { cfg->device = 0; cfg->enable_ir_optim = true; cfg->specify_input_name = true; + cfg->SetCpuMathLibraryNumThreads(FLAGS_paddle_num_threads); } void SetInput(std::vector> *inputs) { - PADDLE_ENFORCE_EQ(FLAGS_test_all_data, 0, "Only have single batch of data."); - - PaddleTensor input; - // channel=3, height/width=318 - std::vector shape({FLAGS_batch_size, 3, 318, 318}); - input.shape = shape; - input.dtype = PaddleDType::FLOAT32; - - // fill input data, for profile easily, do not use random data here. - size_t size = FLAGS_batch_size * 3 * 318 * 318; - input.data.Resize(size * sizeof(float)); - float *input_data = static_cast(input.data.data()); - for (size_t i = 0; i < size; i++) { - *(input_data + i) = static_cast(i) / size; - } - - std::vector input_slots; - input_slots.assign({input}); - (*inputs).emplace_back(input_slots); + SetFakeImageInput(inputs, FLAGS_infer_model); } // Easy for profiling independently. 
-TEST(Analyzer_resnet50, profile) { +void profile(bool use_mkldnn = false) { AnalysisConfig cfg; SetConfig(&cfg); + + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } std::vector outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); - - if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { - PADDLE_ENFORCE_EQ(outputs.size(), 1UL); - size_t size = GetSize(outputs[0]); - // output is a 512-dimension feature - EXPECT_EQ(size, 512 * FLAGS_batch_size); - } + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); } +TEST(Analyzer_resnet50, profile) { profile(); } +#ifdef PADDLE_WITH_MKLDNN +TEST(Analyzer_resnet50, profile_mkldnn) { profile(true /* use_mkldnn */); } +#endif + // Check the fuse status TEST(Analyzer_resnet50, fuse_statis) { AnalysisConfig cfg; @@ -77,20 +63,28 @@ TEST(Analyzer_resnet50, fuse_statis) { auto predictor = CreatePaddlePredictor(cfg); auto fuse_statis = GetFuseStatis( static_cast(predictor.get()), &num_ops); - ASSERT_TRUE(fuse_statis.count("fc_fuse")); - EXPECT_EQ(fuse_statis.at("fc_fuse"), 1); + LOG(INFO) << "num_ops: " << num_ops; } // Compare result of NativeConfig and AnalysisConfig -TEST(Analyzer_resnet50, compare) { +void compare(bool use_mkldnn = false) { AnalysisConfig cfg; SetConfig(&cfg); + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } +TEST(Analyzer_resnet50, compare) { compare(); } +#ifdef PADDLE_WITH_MKLDNN +TEST(Analyzer_resnet50, compare_mkldnn) { compare(true /* use_mkldnn */); } +#endif + } // namespace analysis } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/tests/api/analyzer_rnn1_tester.cc b/paddle/fluid/inference/tests/api/analyzer_rnn1_tester.cc index c76d72ccd99649913aefcb2aa57fe6061db8ca6d..1ae2b4b03a1b2a66b3ddc8cb66d9575751a52297 100644 --- a/paddle/fluid/inference/tests/api/analyzer_rnn1_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_rnn1_tester.cc @@ -210,7 +210,6 @@ void SetConfig(AnalysisConfig *cfg) { cfg->device = 0; cfg->specify_input_name = true; cfg->enable_ir_optim = true; - cfg->ir_passes.clear(); // Do not exclude any pass. } void SetInput(std::vector> *inputs) { @@ -226,13 +225,16 @@ void SetInput(std::vector> *inputs) { // Easy for profiling independently. TEST(Analyzer_rnn1, profile) { - contrib::AnalysisConfig cfg; + contrib::AnalysisConfig cfg(false); SetConfig(&cfg); + cfg.fraction_of_gpu_memory = 0.1; + cfg.pass_builder()->TurnOnDebug(); std::vector outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); } // Check the fuse status @@ -259,7 +261,8 @@ TEST(Analyzer_rnn1, compare) { std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } // Test Multi-Thread. 
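A pattern worth noting across the hunks above: the testers now hand configs to the helpers through the base-class pointer `PaddlePredictor::Config`, so one set of helpers serves both NativeConfig and contrib::AnalysisConfig; the helpers cast back to the concrete type according to `use_analysis`. The calling convention, exactly as used in these tests:

    contrib::AnalysisConfig cfg;
    SetConfig(&cfg);

    std::vector<PaddleTensor> outputs;
    std::vector<std::vector<PaddleTensor>> input_slots_all;
    SetInput(&input_slots_all);

    // tester_helper.h decides which concrete config type to cast this
    // pointer back to, based on its use_analysis argument.
    TestPrediction(reinterpret_cast<const PaddlePredictor::Config *>(&cfg),
                   input_slots_all, &outputs, FLAGS_num_threads);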
@@ -270,32 +273,8 @@ TEST(Analyzer_rnn1, multi_thread) { std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, 4 /* multi_thread */); -} - -bool CompareTensors(const framework::Scope &a_scope, - const framework::Scope &b_scope, - const std::vector &tensors) { - for (auto &x : tensors) { - auto *a_var = a_scope.FindVar(x); - auto *b_var = b_scope.FindVar(x); - if (a_var && b_var) { - if (a_var->Type() == typeid(framework::LoDTensor) || - a_var->Type() == typeid(framework::Tensor)) { - LOG(INFO) << "comparing tensor " << x; - auto &a_t = a_var->Get(); - auto &b_t = b_var->Get(); - if (!inference::CompareTensor(a_t, b_t)) { - LOG(ERROR) << string::Sprintf("tensor %s not match in two scopes", x); - } - } else { - LOG(INFO) << "skip no tensor " << x; - } - } else { - LOG(INFO) << "skip tensor " << x; - } - } - return true; + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, 4 /* multi_thread */); } // Validate that the AnalysisPredictor + ZeroCopyTensor really works by testing @@ -306,20 +285,14 @@ TEST(Analyzer_rnn1, ZeroCopy) { config.use_feed_fetch_ops = false; PaddlePlace place; - int output_size{0}; - auto predictor = - CreatePaddlePredictor( - config); + auto predictor = CreatePaddlePredictor(config); config.use_feed_fetch_ops = true; - auto native_predictor = - CreatePaddlePredictor(config); + auto native_predictor = CreatePaddlePredictor(config); config.use_feed_fetch_ops = true; // the analysis predictor needs feed/fetch. - auto analysis_predictor = - CreatePaddlePredictor( - config); + auto analysis_predictor = CreatePaddlePredictor(config); #define NEW_TENSOR(name__) \ auto name__##_tensor = predictor->GetInputTensor(#name__); @@ -357,86 +330,22 @@ TEST(Analyzer_rnn1, ZeroCopy) { Timer timer; double total_time{0}; - double native_total_time{0}; - double analysis_total_time{0.}; - for (int i = 0; i < FLAGS_repeat; i++) { timer.tic(); predictor->ZeroCopyRun(); total_time += timer.toc(); } + LOG(INFO) << "ZeroCopy output: " << DescribeZeroCopyTensor(*output_tensor); - auto *output_data = output_tensor->data(&place, &output_size); - ASSERT_GT(output_size, 0); // more than one output! - - for (int i = 0; i < FLAGS_repeat; i++) { - // Run native predictor. 
- timer.tic(); - ASSERT_TRUE(native_predictor->Run(native_inputs.front(), &native_outputs)); - native_total_time += timer.toc(); - } - - for (int i = 0; i < FLAGS_repeat; i++) { - timer.tic(); - ASSERT_TRUE( - analysis_predictor->Run(native_inputs.front(), &analysis_outputs)); - analysis_total_time += timer.toc(); - } - - if (!FLAGS_with_precision_check) { - return; - } - int native_output_size = VecReduceToInt(native_outputs.front().shape); - - EXPECT_EQ(native_output_size, output_size); + ASSERT_TRUE(native_predictor->Run(native_inputs.front(), &native_outputs)); + LOG(INFO) << "native output " << DescribeTensor(native_outputs.front()); - // Compare tensors between analysis and zerocopy - auto *p0 = static_cast(predictor.get()); - auto *p1 = static_cast(analysis_predictor.get()); - auto *p2 = static_cast(native_predictor.get()); - - std::vector tensor_names; - for (auto &var_desc : p0->program().Block(0).AllVars()) { - tensor_names.push_back(var_desc->Name()); - } - - LOG(INFO) << "Comparing tensors"; - ASSERT_TRUE( - CompareTensors(*p0->scope(), *p1->scope(), {"final_output.tmp_1"})); - ASSERT_TRUE( - CompareTensors(*p0->scope(), *p2->scope(), {"final_output.tmp_1"})); - - LOG(INFO) << "output1 " << inference::LoDTensorSummary( - p0->scope() - ->FindVar("final_output.tmp_1") - ->Get()); - LOG(INFO) << "output2 " << inference::LoDTensorSummary( - p1->scope() - ->FindVar("final_output.tmp_1") - ->Get()); - LOG(INFO) << "output3 " << inference::LoDTensorSummary( - p2->scope() - ->FindVar("final_output.tmp_1") - ->Get()); - - for (int i = 0; i < output_size; i++) { - LOG(INFO) << output_data[i] << " " - << static_cast(native_outputs.front().data.data())[i] - << " " - << static_cast(analysis_outputs.front().data.data())[i]; - EXPECT_NEAR(output_data[i], - static_cast(native_outputs.front().data.data())[i], - 1e-3); + int output_size{0}; + auto *zero_copy_data = output_tensor->data(&place, &output_size); + auto *native_data = static_cast(native_outputs.front().data.data()); + for (size_t i = 0; i < output_size / sizeof(float); i++) { + EXPECT_NEAR(zero_copy_data[i], native_data[i], 1e-3); } - - LOG(INFO) << "batch_size: " << FLAGS_batch_size; - - LOG(INFO) << "zero average time: " - << total_time / (FLAGS_repeat * FLAGS_batch_size); - LOG(INFO) << "analysis average time: " - << analysis_total_time / (FLAGS_repeat * FLAGS_batch_size); - LOG(INFO) << "native average time: " - << native_total_time / (FLAGS_repeat * FLAGS_batch_size); } TEST(Analyzer_rnn1, ZeroCopyMultiThread) { diff --git a/paddle/fluid/inference/tests/api/analyzer_rnn2_tester.cc b/paddle/fluid/inference/tests/api/analyzer_rnn2_tester.cc index ba04d030b94c0924311dcff5c6a34270a764f877..e2985006f0ed858e778bf4737be3aaee0e056021 100644 --- a/paddle/fluid/inference/tests/api/analyzer_rnn2_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_rnn2_tester.cc @@ -18,12 +18,12 @@ namespace paddle { namespace inference { using namespace framework; // NOLINT +static std::vector result_data; struct DataRecord { std::vector>> link_step_data_all; std::vector lod; std::vector> rnn_link_data; - std::vector result_data; size_t num_samples; // total number of samples size_t batch_iter{0}; size_t batch_size{1}; @@ -57,6 +57,7 @@ struct DataRecord { std::ifstream file(path); std::string line; int num_lines = 0; + result_data.clear(); while (std::getline(file, line)) { num_lines++; std::vector data; @@ -131,17 +132,17 @@ TEST(Analyzer_rnn2, profile) { std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, 
input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { // the first inference result - DataRecord data(FLAGS_infer_data, FLAGS_batch_size); PADDLE_ENFORCE_GT(outputs.size(), 0); size_t size = GetSize(outputs[0]); PADDLE_ENFORCE_GT(size, 0); float *result = static_cast(outputs[0].data.data()); for (size_t i = 0; i < size; i++) { - EXPECT_NEAR(result[i], data.result_data[i], 1e-3); + EXPECT_NEAR(result[i], result_data[i], 1e-3); } } } @@ -153,7 +154,8 @@ TEST(Analyzer_rnn2, compare) { std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } } // namespace inference diff --git a/paddle/fluid/inference/tests/api/analyzer_seq_conv1_tester.cc b/paddle/fluid/inference/tests/api/analyzer_seq_conv1_tester.cc index cb4671c4379b5f6f144bfd5330866aa38163f4d4..858191184a377a26042c98e17d5b8df782575efc 100644 --- a/paddle/fluid/inference/tests/api/analyzer_seq_conv1_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_seq_conv1_tester.cc @@ -161,7 +161,8 @@ TEST(Analyzer_seq_conv1, profile) { std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { // the first inference result @@ -183,7 +184,13 @@ TEST(Analyzer_seq_conv1, fuse_statis) { SetConfig(&cfg); int num_ops; auto predictor = CreatePaddlePredictor(cfg); - GetFuseStatis(predictor.get(), &num_ops); + + auto fuse_statis = GetFuseStatis(predictor.get(), &num_ops); + ASSERT_TRUE(fuse_statis.count("fc_fuse")); + ASSERT_TRUE(fuse_statis.count("seqconv_eltadd_relu_fuse")); + EXPECT_EQ(fuse_statis.at("fc_fuse"), 2); + EXPECT_EQ(fuse_statis.at("seqconv_eltadd_relu_fuse"), 6); + EXPECT_EQ(num_ops, 32); } // Compare result of NativeConfig and AnalysisConfig @@ -193,7 +200,8 @@ TEST(Analyzer_seq_conv1, compare) { std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } } // namespace inference diff --git a/paddle/fluid/inference/tests/api/analyzer_text_classification_tester.cc b/paddle/fluid/inference/tests/api/analyzer_text_classification_tester.cc index ca19475bda372398d425b0fa6f9a732cd79a8166..34a241f070fdc62d1b1e94835fb1dad405baafa9 100644 --- a/paddle/fluid/inference/tests/api/analyzer_text_classification_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_text_classification_tester.cc @@ -74,7 +74,8 @@ TEST(Analyzer_Text_Classification, profile) { std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); if (FLAGS_num_threads == 1) { // Get output @@ -101,20 +102,20 @@ TEST(Analyzer_Text_Classification, compare) { std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } TEST(Analyzer_Text_Classification, compare_against_embedding_fc_lstm_fused) { AnalysisConfig cfg; SetConfig(&cfg); // Enable embedding_fc_lstm_fuse_pass (disabled by default) - auto it = 
std::find(cfg.ir_passes.begin(), cfg.ir_passes.end(), - "embedding_fc_lstm_fuse_pass"); - if (it != cfg.ir_passes.end()) cfg.ir_passes.erase(it); + cfg.pass_builder()->InsertPass(2, "embedding_fc_lstm_fuse_pass"); std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } } // namespace inference diff --git a/paddle/fluid/inference/tests/api/analyzer_vis_tester.cc b/paddle/fluid/inference/tests/api/analyzer_vis_tester.cc index 305b8bfe158150d5dfd8bdaee2c0a89afe264de4..956a235edcefb7d688983c3b63b187e284efb02a 100644 --- a/paddle/fluid/inference/tests/api/analyzer_vis_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_vis_tester.cc @@ -27,7 +27,7 @@ struct Record { }; Record ProcessALine(const std::string &line) { - VLOG(3) << "process a line"; + VLOG(30) << "process a line"; std::vector columns; split(line, '\t', &columns); CHECK_EQ(columns.size(), 2UL) @@ -45,8 +45,8 @@ Record ProcessALine(const std::string &line) { for (auto &s : shape_strs) { record.shape.push_back(std::stoi(s)); } - VLOG(3) << "data size " << record.data.size(); - VLOG(3) << "data shape size " << record.shape.size(); + VLOG(30) << "data size " << record.data.size(); + VLOG(30) << "data shape size " << record.shape.size(); return record; } @@ -58,10 +58,7 @@ void SetConfig(AnalysisConfig *cfg) { cfg->enable_ir_optim = true; cfg->specify_input_name = true; // TODO(TJ): fix fusion gru - cfg->ir_passes.push_back("fc_gru_fuse_pass"); -#ifdef PADDLE_WITH_MKLDNN - cfg->_use_mkldnn = true; -#endif + cfg->pass_builder()->DeletePass("fc_gru_fuse_pass"); } void SetInput(std::vector> *inputs) { @@ -84,14 +81,18 @@ void SetInput(std::vector> *inputs) { // Easy for profiling independently. 
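These hunks also replace direct edits of `cfg.ir_passes` with the `pass_builder()` API. A minimal sketch of the new idiom, restricted to the calls that appear in this patch (`DeletePass`, `InsertPass`, `TurnOnDebug`); the pass names are the ones used above:

    contrib::AnalysisConfig cfg;
    // Remove a pass that misbehaves for a given model (see the TODO above).
    cfg.pass_builder()->DeletePass("fc_gru_fuse_pass");
    // Enable an off-by-default pass at a chosen position in the pipeline.
    cfg.pass_builder()->InsertPass(2, "embedding_fc_lstm_fuse_pass");
    // Emit debug output while the IR passes run.
    cfg.pass_builder()->TurnOnDebug();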
// ocr, mobilenet and se_resnext50 -TEST(Analyzer_vis, profile) { +void profile(bool use_mkldnn = false) { AnalysisConfig cfg; SetConfig(&cfg); + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } std::vector outputs; std::vector> input_slots_all; SetInput(&input_slots_all); - TestPrediction(cfg, input_slots_all, &outputs, FLAGS_num_threads); + TestPrediction(reinterpret_cast(&cfg), + input_slots_all, &outputs, FLAGS_num_threads); if (FLAGS_num_threads == 1 && !FLAGS_test_all_data) { const float ocr_result_data[] = { @@ -108,6 +109,12 @@ TEST(Analyzer_vis, profile) { } } +TEST(Analyzer_vis, profile) { profile(); } + +#ifdef PADDLE_WITH_MKLDNN +TEST(Analyzer_vis, profile_mkldnn) { profile(true /* use_mkldnn */); } +#endif + // Check the fuse status TEST(Analyzer_vis, fuse_statis) { AnalysisConfig cfg; @@ -118,15 +125,24 @@ TEST(Analyzer_vis, fuse_statis) { } // Compare result of NativeConfig and AnalysisConfig -TEST(Analyzer_vis, compare) { +void compare(bool use_mkldnn = false) { AnalysisConfig cfg; SetConfig(&cfg); + if (use_mkldnn) { + cfg.EnableMKLDNN(); + } std::vector> input_slots_all; SetInput(&input_slots_all); - CompareNativeAndAnalysis(cfg, input_slots_all); + CompareNativeAndAnalysis( + reinterpret_cast(&cfg), input_slots_all); } +TEST(Analyzer_vis, compare) { compare(); } +#ifdef PADDLE_WITH_MKLDNN +TEST(Analyzer_vis, compare_mkldnn) { compare(true /* use_mkldnn */); } +#endif + } // namespace analysis } // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/tests/api/config_printer.h b/paddle/fluid/inference/tests/api/config_printer.h new file mode 100644 index 0000000000000000000000000000000000000000..4231eef7220735d0b80eb1adc951c55ff7378f1b --- /dev/null +++ b/paddle/fluid/inference/tests/api/config_printer.h @@ -0,0 +1,81 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
 */
+
+#pragma once
+
+#include <sstream>
+#include <string>
+#include "paddle/fluid/inference/api/paddle_inference_api.h"
+
+namespace paddle {
+namespace inference {
+
+thread_local int num_spaces = 0;
+
+static std::string GenSpaces(int num_spaces) {
+  std::ostringstream os;
+  for (int i = 0; i < num_spaces; ++i) {
+    os << "  ";
+  }
+  return os.str();
+}
+
+std::ostream &operator<<(std::ostream &os,
+                         const PaddlePredictor::Config &config) {
+  os << GenSpaces(num_spaces) << "PaddlePredictor::Config {\n";
+  num_spaces++;
+  os << GenSpaces(num_spaces) << "model_dir: " << config.model_dir << "\n";
+  num_spaces--;
+  os << GenSpaces(num_spaces) << "}\n";
+  return os;
+}
+
+std::ostream &operator<<(std::ostream &os, const NativeConfig &config) {
+  os << GenSpaces(num_spaces) << "NativeConfig {\n";
+  num_spaces++;
+  os << *reinterpret_cast<const PaddlePredictor::Config *>(&config);
+  os << GenSpaces(num_spaces) << "use_gpu: " << config.use_gpu << "\n";
+  os << GenSpaces(num_spaces) << "device: " << config.device << "\n";
+  os << GenSpaces(num_spaces)
+     << "fraction_of_gpu_memory: " << config.fraction_of_gpu_memory << "\n";
+  os << GenSpaces(num_spaces) << "prog_file: " << config.prog_file << "\n";
+  os << GenSpaces(num_spaces) << "param_file: " << config.param_file << "\n";
+  os << GenSpaces(num_spaces)
+     << "specify_input_name: " << config.specify_input_name << "\n";
+  os << GenSpaces(num_spaces)
+     << "cpu_num_threads: " << config.cpu_math_library_num_threads() << "\n";
+  num_spaces--;
+  os << GenSpaces(num_spaces) << "}\n";
+  return os;
+}
+
+std::ostream &operator<<(std::ostream &os,
+                         const contrib::AnalysisConfig &config) {
+  os << GenSpaces(num_spaces) << "contrib::AnalysisConfig {\n";
+  num_spaces++;
+  os << *reinterpret_cast<const NativeConfig *>(&config);
+  os << GenSpaces(num_spaces) << "enable_ir_optim: " << config.enable_ir_optim
+     << "\n";
+  os << GenSpaces(num_spaces)
+     << "use_feed_fetch_ops: " << config.use_feed_fetch_ops << "\n";
+  os << GenSpaces(num_spaces) << "use_tensorrt: " << config.use_tensorrt()
+     << "\n";
+  os << GenSpaces(num_spaces) << "use_mkldnn: " << config.use_mkldnn() << "\n";
+  num_spaces--;
+  os << GenSpaces(num_spaces) << "}\n";
+  return os;
+}
+
+}  // namespace inference
+}  // namespace paddle
diff --git a/paddle/fluid/inference/tests/api/tester_helper.h b/paddle/fluid/inference/tests/api/tester_helper.h
index 8603d09cbd09e4cee254faf52c2ccbe8d661994d..d572ea0177c1e398229a02718ca31cc78a7059ef 100644
--- a/paddle/fluid/inference/tests/api/tester_helper.h
+++ b/paddle/fluid/inference/tests/api/tester_helper.h
@@ -19,12 +19,17 @@
 #include <string>
 #include <thread>  // NOLINT
 #include <vector>
+
 #include "paddle/fluid/framework/ir/fuse_pass_base.h"
+#include "paddle/fluid/framework/scope.h"
 #include "paddle/fluid/inference/analysis/analyzer.h"
 #include "paddle/fluid/inference/analysis/ut_helper.h"
 #include "paddle/fluid/inference/api/analysis_predictor.h"
-#include "paddle/fluid/inference/api/helper.h"
 #include "paddle/fluid/inference/api/paddle_inference_pass.h"
+
+#include "paddle/fluid/inference/api/helper.h"
+#include "paddle/fluid/inference/tests/api/config_printer.h"
+#include "paddle/fluid/inference/tests/test_helper.h"
 #include "paddle/fluid/platform/profiler.h"
 
 DEFINE_string(infer_model, "", "model path");
@@ -36,10 +41,19 @@ DEFINE_int32(num_threads, 1, "Running the inference program in multi-threads.")
 DEFINE_bool(use_analysis, true,
             "Running the inference program in analysis mode.");
 
+DECLARE_bool(profile);
+DECLARE_int32(paddle_num_threads);
+
 namespace paddle {
 namespace inference {
 
-using contrib::AnalysisConfig;
+void PrintConfig(const PaddlePredictor::Config *config, bool use_analysis) {
+  if (use_analysis) {
+    LOG(INFO) << *reinterpret_cast<const contrib::AnalysisConfig *>(config);
+    return;
+  }
+  LOG(INFO) << *reinterpret_cast<const NativeConfig *>(config);
+}
 
 void CompareResult(const std::vector<PaddleTensor> &outputs,
                    const std::vector<PaddleTensor> &ref_outputs) {
@@ -50,7 +64,7 @@ void CompareResult(const std::vector<PaddleTensor> &outputs,
     auto &ref_out = ref_outputs[i];
     size_t size = VecReduceToInt(out.shape);
     size_t ref_size = VecReduceToInt(ref_out.shape);
-    EXPECT_GT(size, 0);
+    EXPECT_GT(size, 0UL);
     EXPECT_EQ(size, ref_size);
     EXPECT_EQ(out.dtype, ref_out.dtype);
     switch (out.dtype) {
@@ -75,85 +89,167 @@ void CompareResult(const std::vector<PaddleTensor> &outputs,
 }
 
 std::unique_ptr<PaddlePredictor> CreateTestPredictor(
-    const AnalysisConfig &config, bool use_analysis = true) {
+    const PaddlePredictor::Config *config, bool use_analysis = true) {
   if (use_analysis) {
-    return CreatePaddlePredictor<contrib::AnalysisConfig>(config);
-  } else {
-    return CreatePaddlePredictor<NativeConfig, PaddleEngineKind::kNative>(
-        config);
+    return CreatePaddlePredictor<contrib::AnalysisConfig>(
+        *(reinterpret_cast<const contrib::AnalysisConfig *>(config)));
   }
+  return CreatePaddlePredictor<NativeConfig>(
+      *(reinterpret_cast<const NativeConfig *>(config)));
 }
 
 size_t GetSize(const PaddleTensor &out) { return VecReduceToInt(out.shape); }
 
 std::unordered_map<std::string, int> GetFuseStatis(PaddlePredictor *predictor,
                                                    int *num_ops) {
+  std::unordered_map<std::string, int> res;
   auto *analysis_predictor = static_cast<AnalysisPredictor *>(predictor);
-  auto &fuse_statis = analysis_predictor->analysis_argument()
-                          .Get<std::unordered_map<std::string, int>>(
-                              framework::ir::kFuseStatisAttr);
-  for (auto &item : fuse_statis) {
+  auto *fusion_status =
+      analysis_predictor->analysis_argument().fusion_statis_ptr();
+  if (!fusion_status) {
+    return res;
+  }
+  for (auto &item : *fusion_status) {
     LOG(INFO) << "fused " << item.first << " " << item.second;
   }
   int num = 0;
   for (auto &node :
-       analysis_predictor->analysis_argument().main_dfg->nodes.nodes()) {
-    if (node->IsFunction()) {
+       analysis_predictor->analysis_argument().main_graph().Nodes()) {
+    if (node->IsOp()) {
       ++num;
     }
   }
   *num_ops = num;
-  return fuse_statis;
+  return *fusion_status;
+}
+
+void SetFakeImageInput(std::vector<std::vector<PaddleTensor>> *inputs,
+                       const std::string &dirname, bool is_combined = true,
+                       std::string model_filename = "model",
+                       std::string params_filename = "params") {
+  // Set fake_image_data
+  PADDLE_ENFORCE_EQ(FLAGS_test_all_data, 0, "Only have single batch of data.");
+  std::vector<std::vector<int64_t>> feed_target_shapes = GetFeedTargetShapes(
+      dirname, is_combined, model_filename, params_filename);
+  std::ostringstream os;
+  for (size_t i = 0; i < feed_target_shapes.size(); ++i) {
+    os << "feed target " << i << ": {" << feed_target_shapes[i][0];
+    for (size_t j = 1; j < feed_target_shapes[i].size(); ++j) {
+      os << ", " << feed_target_shapes[i][j];
+    }
+    os << "}\n";
+  }
+  LOG(INFO) << os.str();
+
+  int dim1 = feed_target_shapes[0][1];
+  int dim2 = feed_target_shapes[0][2];
+  int dim3 = feed_target_shapes[0][3];
+
+  PaddleTensor input;
+  std::vector<int> shape({FLAGS_batch_size, dim1, dim2, dim3});
+  input.shape = shape;
+  input.dtype = PaddleDType::FLOAT32;
+
+  // fill input data, for profile easily, do not use random data here.
+ size_t size = FLAGS_batch_size * dim1 * dim2 * dim3; + input.data.Resize(size * sizeof(float)); + float *input_data = static_cast(input.data.data()); + for (size_t i = 0; i < size; i++) { + *(input_data + i) = static_cast(i) / size; + } + + std::vector input_slots; + input_slots.assign({input}); + (*inputs).emplace_back(input_slots); } void TestOneThreadPrediction( - const AnalysisConfig &config, + const PaddlePredictor::Config *config, const std::vector> &inputs, std::vector *outputs, bool use_analysis = true) { int batch_size = FLAGS_batch_size; int num_times = FLAGS_repeat; auto predictor = CreateTestPredictor(config, use_analysis); - Timer timer; - timer.tic(); - for (int i = 0; i < num_times; i++) { - for (size_t j = 0; j < inputs.size(); j++) { - predictor->Run(inputs[j], outputs); + + // warmup run + LOG(INFO) << "Warm up run..."; + { + Timer warmup_timer; + warmup_timer.tic(); + predictor->Run(inputs[0], outputs, batch_size); + PrintTime(batch_size, 1, 1, 0, warmup_timer.toc(), 1); + if (FLAGS_profile) { + paddle::platform::ResetProfiler(); + } + } + + LOG(INFO) << "Run " << num_times << " times..."; + { + Timer run_timer; + run_timer.tic(); + for (int i = 0; i < num_times; i++) { + for (size_t j = 0; j < inputs.size(); j++) { + predictor->Run(inputs[j], outputs, batch_size); + } } + PrintTime(batch_size, num_times, 1, 0, run_timer.toc() / num_times, + inputs.size()); } - PrintTime(batch_size, num_times, 1, 0, timer.toc() / num_times, - inputs.size()); } void TestMultiThreadPrediction( - const AnalysisConfig &config, + const PaddlePredictor::Config *config, const std::vector> &inputs, std::vector *outputs, int num_threads, bool use_analysis = true) { int batch_size = FLAGS_batch_size; int num_times = FLAGS_repeat; std::vector threads; - std::vector> predictors; - // TODO(yanchunwei): Bug here, the analyzer phase can't be parallelled - // because AttentionLSTM's hard code nodeid will be damanged. - for (int tid = 0; tid < num_threads; ++tid) { - predictors.emplace_back(CreateTestPredictor(config, use_analysis)); - } + auto main_predictor = CreateTestPredictor(config, use_analysis); + + size_t total_time{0}; for (int tid = 0; tid < num_threads; ++tid) { threads.emplace_back([&, tid]() { // Each thread should have local inputs and outputs. // The inputs of each thread are all the same. - std::vector> inputs_tid = inputs; std::vector outputs_tid; - Timer timer; - timer.tic(); - for (int i = 0; i < num_times; i++) { - for (size_t j = 0; j < inputs_tid.size(); j++) { - predictors[tid]->Run(inputs_tid[j], &outputs_tid); + // To ensure the thread binding correctly, + // please clone inside the threadpool. 
+ auto predictor = main_predictor->Clone(); +#ifdef PADDLE_WITH_MKLDNN + if (use_analysis) { + static_cast(predictor.get()) + ->SetMkldnnThreadID(static_cast(tid) + 1); + } +#endif + + // warmup run + LOG(INFO) << "Running thread " << tid << ", warm up run..."; + { + Timer warmup_timer; + warmup_timer.tic(); + predictor->Run(inputs[0], outputs, batch_size); + PrintTime(batch_size, 1, num_threads, tid, warmup_timer.toc(), 1); + if (FLAGS_profile) { + paddle::platform::ResetProfiler(); } } - PrintTime(batch_size, num_times, num_threads, tid, - timer.toc() / num_times, inputs_tid.size()); + + LOG(INFO) << "Thread " << tid << " run " << num_times << " times..."; + { + Timer timer; + timer.tic(); + for (int i = 0; i < num_times; i++) { + for (const auto &input : inputs) { + ASSERT_TRUE(predictor->Run(input, &outputs_tid)); + } + } + + auto time = timer.toc(); + total_time += time; + PrintTime(batch_size, num_times, num_threads, tid, time / num_times, + inputs.size()); + } }); } for (int i = 0; i < num_threads; ++i) { @@ -161,11 +257,11 @@ void TestMultiThreadPrediction( } } -void TestPrediction(const AnalysisConfig &config, +void TestPrediction(const PaddlePredictor::Config *config, const std::vector> &inputs, std::vector *outputs, int num_threads, bool use_analysis = FLAGS_use_analysis) { - LOG(INFO) << "use_analysis: " << use_analysis; + PrintConfig(config, use_analysis); if (num_threads == 1) { TestOneThreadPrediction(config, inputs, outputs, use_analysis); } else { @@ -175,8 +271,9 @@ void TestPrediction(const AnalysisConfig &config, } void CompareNativeAndAnalysis( - const AnalysisConfig &config, + const PaddlePredictor::Config *config, const std::vector> &inputs) { + PrintConfig(config, true); std::vector native_outputs, analysis_outputs; TestOneThreadPrediction(config, inputs, &native_outputs, false); TestOneThreadPrediction(config, inputs, &analysis_outputs, true); diff --git a/paddle/fluid/inference/tests/api/trt_models_tester.cc b/paddle/fluid/inference/tests/api/trt_models_tester.cc index bf320a0cbc2fff5f973c48768281e26d0fde232b..ef612ce6148329c33f194842945bb5438afcf645 100644 --- a/paddle/fluid/inference/tests/api/trt_models_tester.cc +++ b/paddle/fluid/inference/tests/api/trt_models_tester.cc @@ -1,106 +1,147 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ #include #include #include -#include "paddle/fluid/inference/analysis/analyzer.h" -#include "paddle/fluid/inference/api/paddle_inference_api.h" + +#include "paddle/fluid/inference/tests/api/tester_helper.h" namespace paddle { -using paddle::contrib::MixedRTConfig; - -DEFINE_string(dirname, "", "Directory of the inference model."); - -NativeConfig GetConfigNative() { - NativeConfig config; - config.model_dir = FLAGS_dirname; - // LOG(INFO) << "dirname " << config.model_dir; - config.fraction_of_gpu_memory = 0.45; - config.use_gpu = true; - config.device = 0; - return config; +namespace inference { + +DEFINE_bool(use_tensorrt, true, "Test the performance of TensorRT engine."); +DEFINE_string(prog_filename, "", "Name of model file."); +DEFINE_string(param_filename, "", "Name of parameters file."); + +template +void SetConfig(ConfigType* config, std::string model_dir, bool use_gpu, + bool use_tensorrt = false, int batch_size = -1) { + if (!FLAGS_prog_filename.empty() && !FLAGS_param_filename.empty()) { + config->prog_file = model_dir + "/" + FLAGS_prog_filename; + config->param_file = model_dir + "/" + FLAGS_param_filename; + } else { + config->model_dir = model_dir; + } + if (use_gpu) { + config->use_gpu = true; + config->device = 0; + config->fraction_of_gpu_memory = 0.15; + } } -MixedRTConfig GetConfigTRT() { - MixedRTConfig config; - config.model_dir = FLAGS_dirname; - config.use_gpu = true; - config.fraction_of_gpu_memory = 0.2; - config.device = 0; - config.max_batch_size = 3; - return config; +template <> +void SetConfig(contrib::AnalysisConfig* config, + std::string model_dir, bool use_gpu, + bool use_tensorrt, int batch_size) { + if (!FLAGS_prog_filename.empty() && !FLAGS_param_filename.empty()) { + config->prog_file = model_dir + "/" + FLAGS_prog_filename; + config->param_file = model_dir + "/" + FLAGS_param_filename; + } else { + config->model_dir = model_dir; + } + if (use_gpu) { + config->use_gpu = true; + config->device = 0; + config->fraction_of_gpu_memory = 0.15; + if (use_tensorrt) { + config->EnableTensorRtEngine(1 << 10, batch_size); + config->pass_builder()->DeletePass("conv_bn_fuse_pass"); + config->pass_builder()->DeletePass("fc_fuse_pass"); + config->pass_builder()->TurnOnDebug(); + } else { + config->enable_ir_optim = true; + } + } } -void CompareTensorRTWithFluid(int batch_size, std::string model_dirname) { - NativeConfig config0 = GetConfigNative(); - config0.model_dir = model_dirname; - - MixedRTConfig config1 = GetConfigTRT(); - config1.model_dir = model_dirname; - config1.max_batch_size = batch_size; - - auto predictor0 = - CreatePaddlePredictor(config0); - auto predictor1 = - CreatePaddlePredictor(config1); - // Prepare inputs - int height = 224; - int width = 224; - float *data = new float[batch_size * 3 * height * width]; - memset(data, 0, sizeof(float) * (batch_size * 3 * height * width)); - data[0] = 1.0f; - - // Prepare inputs - PaddleTensor tensor; - tensor.name = "input_0"; - tensor.shape = std::vector({batch_size, 3, height, width}); - tensor.data = PaddleBuf(static_cast(data), - sizeof(float) * (batch_size * 3 * height * width)); - tensor.dtype = PaddleDType::FLOAT32; - std::vector paddle_tensor_feeds(1, tensor); - - // Prepare outputs - std::vector outputs0; - std::vector outputs1; - CHECK(predictor0->Run(paddle_tensor_feeds, &outputs0)); - - CHECK(predictor1->Run(paddle_tensor_feeds, &outputs1, batch_size)); - - // Get output. 
- ASSERT_EQ(outputs0.size(), 1UL); - ASSERT_EQ(outputs1.size(), 1UL); - - const size_t num_elements = outputs0.front().data.length() / sizeof(float); - const size_t num_elements1 = outputs1.front().data.length() / sizeof(float); - EXPECT_EQ(num_elements, num_elements1); - - auto *data0 = static_cast(outputs0.front().data.data()); - auto *data1 = static_cast(outputs1.front().data.data()); - - ASSERT_GT(num_elements, 0UL); - for (size_t i = 0; i < std::min(num_elements, num_elements1); i++) { - EXPECT_NEAR(data0[i], data1[i], 1e-3); +void profile(std::string model_dir, bool use_analysis, bool use_tensorrt) { + std::vector> inputs_all; + if (!FLAGS_prog_filename.empty() && !FLAGS_param_filename.empty()) { + SetFakeImageInput(&inputs_all, model_dir, true, FLAGS_prog_filename, + FLAGS_param_filename); + } else { + SetFakeImageInput(&inputs_all, model_dir, false, "__model__", ""); + } + + std::vector outputs; + if (use_analysis || use_tensorrt) { + contrib::AnalysisConfig config(true); + SetConfig(&config, model_dir, true, use_tensorrt, + FLAGS_batch_size); + TestPrediction(reinterpret_cast(&config), + inputs_all, &outputs, FLAGS_num_threads, true); + } else { + NativeConfig config; + SetConfig(&config, model_dir, true, false); + TestPrediction(reinterpret_cast(&config), + inputs_all, &outputs, FLAGS_num_threads, false); } } -TEST(trt_models_test, main) { - std::vector infer_models = {"mobilenet", "resnet50", - "resnext50"}; - for (auto &model_dir : infer_models) { - CompareTensorRTWithFluid(1, FLAGS_dirname + "/" + model_dir); +void compare(std::string model_dir, bool use_tensorrt) { + std::vector> inputs_all; + if (!FLAGS_prog_filename.empty() && !FLAGS_param_filename.empty()) { + SetFakeImageInput(&inputs_all, model_dir, true, FLAGS_prog_filename, + FLAGS_param_filename); + } else { + SetFakeImageInput(&inputs_all, model_dir, false, "__model__", ""); } + + std::vector native_outputs; + NativeConfig native_config; + SetConfig(&native_config, model_dir, true, false, + FLAGS_batch_size); + TestOneThreadPrediction( + reinterpret_cast(&native_config), inputs_all, + &native_outputs, false); + + std::vector analysis_outputs; + contrib::AnalysisConfig analysis_config(true); + SetConfig(&analysis_config, model_dir, true, + use_tensorrt, FLAGS_batch_size); + TestOneThreadPrediction( + reinterpret_cast(&analysis_config), inputs_all, + &analysis_outputs, true); + + CompareResult(native_outputs, analysis_outputs); } + +TEST(TensorRT_mobilenet, compare) { + std::string model_dir = FLAGS_infer_model + "/mobilenet"; + compare(model_dir, /* use_tensorrt */ true); +} + +TEST(TensorRT_resnet50, compare) { + std::string model_dir = FLAGS_infer_model + "/resnet50"; + compare(model_dir, /* use_tensorrt */ true); +} + +TEST(TensorRT_resnext50, compare) { + std::string model_dir = FLAGS_infer_model + "/resnext50"; + compare(model_dir, /* use_tensorrt */ true); +} + +TEST(TensorRT_resnext50, profile) { + std::string model_dir = FLAGS_infer_model + "/resnext50"; + profile(model_dir, /* use_analysis */ true, FLAGS_use_tensorrt); +} + +TEST(TensorRT_mobilenet, analysis) { + std::string model_dir = FLAGS_infer_model + "/" + "mobilenet"; + compare(model_dir, /* use_tensorrt */ false); +} + +} // namespace inference } // namespace paddle diff --git a/paddle/fluid/inference/tests/book/test_inference_nlp.cc b/paddle/fluid/inference/tests/book/test_inference_nlp.cc index cbcfc964c91c33ab41a72ad7fec759086ad887cc..5c1204b9e6b78e42d999b12a2b7be6f822ecf818 100644 --- a/paddle/fluid/inference/tests/book/test_inference_nlp.cc +++ 
b/paddle/fluid/inference/tests/book/test_inference_nlp.cc @@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include #include #include #include // NOLINT diff --git a/paddle/fluid/inference/tests/test.cmake b/paddle/fluid/inference/tests/test.cmake new file mode 100644 index 0000000000000000000000000000000000000000..ab3a30ce6bba14a7d5ec700a159d90031e6b5dc7 --- /dev/null +++ b/paddle/fluid/inference/tests/test.cmake @@ -0,0 +1,31 @@ +set(INFERENCE_URL "http://paddle-inference-dist.cdn.bcebos.com" CACHE STRING "inference download url") +set(INFERENCE_DEMO_INSTALL_DIR "${THIRD_PARTY_PATH}/inference_demo" CACHE STRING + "A path setting inference demo download directories.") +function (inference_download install_dir url filename) + message(STATUS "Download inference test stuff from ${url}/${filename}") + execute_process(COMMAND bash -c "mkdir -p ${install_dir}") + execute_process(COMMAND bash -c "cd ${install_dir} && wget -q ${url}/${filename}") + message(STATUS "finish downloading ${filename}") +endfunction() + +function (inference_download_and_uncompress install_dir url filename) + inference_download(${install_dir} ${url} ${filename}) + execute_process(COMMAND bash -c "cd ${install_dir} && tar xzf ${filename}") +endfunction() + +set(WORD2VEC_INSTALL_DIR "${INFERENCE_DEMO_INSTALL_DIR}/word2vec") +if (NOT EXISTS ${WORD2VEC_INSTALL_DIR}) + inference_download_and_uncompress(${WORD2VEC_INSTALL_DIR} ${INFERENCE_URL} "word2vec.inference.model.tar.gz") +endif() +set(WORD2VEC_MODEL_DIR "${WORD2VEC_INSTALL_DIR}/word2vec.inference.model") + +function (inference_base_test TARGET) + set(options "") + set(oneValueArgs "") + set(multiValueArgs SRCS ARGS DEPS) + cmake_parse_arguments(base_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + if(WITH_GPU) + set(mem_opt "--fraction_of_gpu_memory_to_use=0.5") + endif() + cc_test(${TARGET} SRCS ${base_test_SRCS} DEPS ${base_test_DEPS} ARGS ${mem_opt} ${base_test_ARGS}) +endfunction() diff --git a/paddle/fluid/inference/tests/test_helper.h b/paddle/fluid/inference/tests/test_helper.h index 94f0550df57e79fa68c135f5c9c4b7effe6ac156..75fa611c0d701dd81dfe5b33231655e0959c7dbf 100644 --- a/paddle/fluid/inference/tests/test_helper.h +++ b/paddle/fluid/inference/tests/test_helper.h @@ -18,9 +18,9 @@ limitations under the License. */ #include #include -#include "paddle/fluid/framework/ir/graph_to_program_pass.h" #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/inference/io.h" +#include "paddle/fluid/platform/port.h" #include "paddle/fluid/platform/profiler.h" DECLARE_bool(use_mkldnn); @@ -94,15 +94,15 @@ void CheckError(const paddle::framework::LoDTensor& output1, std::unique_ptr InitProgram( paddle::framework::Executor* executor, paddle::framework::Scope* scope, - const std::string& dirname, const bool is_combined = false) { + const std::string& dirname, const bool is_combined = false, + const std::string& prog_filename = "__model_combined__", + const std::string& param_filename = "__params_combined__") { std::unique_ptr inference_program; if (is_combined) { // All parameters are saved in a single file. // Hard-coding the file names of program and parameters in unittest. // The file names should be consistent with that used in Python API // `fluid.io.save_inference_model`. 
- std::string prog_filename = "__model_combined__"; - std::string param_filename = "__params_combined__"; inference_program = paddle::inference::Load(executor, scope, dirname + "/" + prog_filename, dirname + "/" + param_filename); @@ -115,12 +115,15 @@ std::unique_ptr InitProgram( } std::vector> GetFeedTargetShapes( - const std::string& dirname, const bool is_combined = false) { + const std::string& dirname, const bool is_combined = false, + const std::string& prog_filename = "__model_combined__", + const std::string& param_filename = "__params_combined__") { auto place = paddle::platform::CPUPlace(); auto executor = paddle::framework::Executor(place); auto* scope = new paddle::framework::Scope(); - auto inference_program = InitProgram(&executor, scope, dirname, is_combined); + auto inference_program = InitProgram(&executor, scope, dirname, is_combined, + prog_filename, param_filename); auto& global_block = inference_program->Block(0); const std::vector& feed_target_names = @@ -136,15 +139,6 @@ std::vector> GetFeedTargetShapes( return feed_target_shapes; } -void Compile(paddle::framework::ProgramDesc* program) { - std::unique_ptr g( - new paddle::framework::ir::Graph(*program)); - auto pass = paddle::framework::ir::PassRegistry::Instance().Get( - "graph_to_program_pass"); - pass->SetNotOwned("program", program); - pass->Apply(std::move(g)); -} - template void TestInference(const std::string& dirname, const std::vector& cpu_feeds, @@ -182,7 +176,6 @@ void TestInference(const std::string& dirname, paddle::platform::DeviceContextPool::Instance().Get(place)); inference_program = InitProgram(&executor, scope, dirname, is_combined); } - Compile(inference_program.get()); // Disable the profiler and print the timing information paddle::platform::DisableProfiler(paddle::platform::EventSortingKey::kDefault, @@ -261,5 +254,3 @@ void TestInference(const std::string& dirname, delete scope; } - -USE_PASS(graph_to_program_pass); diff --git a/paddle/fluid/inference/utils/CMakeLists.txt b/paddle/fluid/inference/utils/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..2104e4ac7222258ee025bd5acd60b1db251df654 --- /dev/null +++ b/paddle/fluid/inference/utils/CMakeLists.txt @@ -0,0 +1,2 @@ +cc_library(benchmark SRCS benchmark.cc DEPS enforce) +cc_test(test_benchmark SRCS benchmark_tester.cc DEPS benchmark) diff --git a/paddle/fluid/inference/utils/benchmark.cc b/paddle/fluid/inference/utils/benchmark.cc new file mode 100644 index 0000000000000000000000000000000000000000..021edc2de5e90023fcd1431dd2025450e7462bd9 --- /dev/null +++ b/paddle/fluid/inference/utils/benchmark.cc @@ -0,0 +1,49 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include "paddle/fluid/inference/utils/benchmark.h" +#include +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace inference { + +std::string Benchmark::SerializeToString() const { + std::stringstream ss; + ss << "-----------------------------------------------------\n"; + ss << "name\t"; + ss << "batch_size\t"; + ss << "num_threads\t"; + ss << "latency\t"; + ss << "qps"; + ss << '\n'; + + ss << name_ << "\t"; + ss << batch_size_ << "\t"; + ss << num_threads_ << "\t"; + ss << latency_ << "\t"; + ss << 1000 / latency_; + ss << '\n'; + return ss.str(); +} +void Benchmark::PersistToFile(const std::string &path) const { + std::ofstream file(path, std::ios::app); + PADDLE_ENFORCE(file.is_open(), "Can not open %s to add benchmark", path); + file << SerializeToString(); + file.flush(); + file.close(); +} + +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/utils/benchmark.h b/paddle/fluid/inference/utils/benchmark.h new file mode 100644 index 0000000000000000000000000000000000000000..80e8f77adb4ff2cc81a2a3dd0c44e4e304800122 --- /dev/null +++ b/paddle/fluid/inference/utils/benchmark.h @@ -0,0 +1,52 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include + +namespace paddle { +namespace inference { + +/* + * Helper class to calculate the performance. + */ +struct Benchmark { + int batch_size() const { return batch_size_; } + void SetBatchSize(int x) { batch_size_ = x; } + + int num_threads() const { return num_threads_; } + void SetNumThreads(int x) { num_threads_ = x; } + + bool use_gpu() const { return use_gpu_; } + void SetUseGpu() { use_gpu_ = true; } + + int latency() const { return latency_; } + void SetLatency(int x) { latency_ = x; } + + const std::string& name() const { return name_; } + void SetName(const std::string& name) { name_ = name; } + + std::string SerializeToString() const; + void PersistToFile(const std::string& path) const; + + private: + bool use_gpu_{false}; + int batch_size_{0}; + int latency_; + int num_threads_{1}; + std::string name_; +}; + +} // namespace inference +} // namespace paddle diff --git a/paddle/fluid/inference/utils/benchmark_tester.cc b/paddle/fluid/inference/utils/benchmark_tester.cc new file mode 100644 index 0000000000000000000000000000000000000000..eb255474082b27180a8b3176b5f880c0d38f6c3b --- /dev/null +++ b/paddle/fluid/inference/utils/benchmark_tester.cc @@ -0,0 +1,39 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/inference/utils/benchmark.h" +#include +#include + +using namespace paddle::inference; +TEST(Benchmark, basic) { + Benchmark benchmark; + benchmark.SetName("key0"); + benchmark.SetBatchSize(10); + benchmark.SetUseGpu(); + benchmark.SetLatency(220); + LOG(INFO) << "benchmark:\n" << benchmark.SerializeToString(); +} + +TEST(Benchmark, PersistToFile) { + Benchmark benchmark; + benchmark.SetName("key0"); + benchmark.SetBatchSize(10); + benchmark.SetUseGpu(); + benchmark.SetLatency(220); + + benchmark.PersistToFile("1.log"); + benchmark.PersistToFile("1.log"); + benchmark.PersistToFile("1.log"); +} \ No newline at end of file diff --git a/paddle/fluid/memory/CMakeLists.txt b/paddle/fluid/memory/CMakeLists.txt index 709fc7e12e1db537ceece30c405c0e8a2582e8ca..e7268077643c3988c59a52bf54873f1e8db4619b 100644 --- a/paddle/fluid/memory/CMakeLists.txt +++ b/paddle/fluid/memory/CMakeLists.txt @@ -1,15 +1,12 @@ add_subdirectory(detail) - -cc_library(malloc SRCS malloc.cc DEPS buddy_allocator place enforce) +add_subdirectory(allocation) +cc_library(malloc SRCS malloc.cc DEPS place enforce allocator_facade) cc_library(memcpy SRCS memcpy.cc DEPS place) cc_library(memory DEPS malloc memcpy) - -cc_test(malloc_test SRCS malloc_test.cc DEPS malloc) - #if (WITH_GPU) # nv_test(pinned_memory_test SRCS pinned_memory_test.cu DEPS place memory) #endif() diff --git a/paddle/fluid/memory/allocation/CMakeLists.txt b/paddle/fluid/memory/allocation/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..4b7b9064dcde9b5209264257d51bbd976ba8eb85 --- /dev/null +++ b/paddle/fluid/memory/allocation/CMakeLists.txt @@ -0,0 +1,64 @@ +cc_library(allocator SRCS allocator.cc DEPS place) +cc_library(cpu_allocator SRCS cpu_allocator.cc DEPS allocator) +cc_library(best_fit_allocator SRCS best_fit_allocator.cc DEPS allocator) +cc_library(locked_allocator SRCS locked_allocator.cc DEPS allocator) +cc_library(buffered_allocator SRCS buffered_allocator.cc DEPS allocator) +cc_library(legacy_allocator SRCS legacy_allocator.cc DEPS allocator buddy_allocator) +cc_test(buffered_allocator_test SRCS buffered_allocator_test.cc DEPS best_fit_allocator locked_allocator buffered_allocator cpu_allocator) + +if (WITH_GPU) + nv_library(cuda_allocator SRCS cuda_allocator.cc DEPS allocator cuda_device_guard) +endif() + +cc_library(retry_allocator SRCS retry_allocator.cc DEPS allocator) + +if (WITH_GPU) + nv_test(best_fit_allocator_test + SRCS best_fit_allocator_test.cc + best_fit_allocator_test.cu + DEPS best_fit_allocator + locked_allocator + cpu_allocator + cuda_allocator + device_context + memcpy) +else() + cc_test(best_fit_allocator_test + SRCS best_fit_allocator_test.cc + DEPS best_fit_allocator + locked_allocator + cpu_allocator) +endif() + +nv_library(pinned_allocator SRCS pinned_allocator.cc DEPS allocator) +if (WITH_GPU) + set(AllocatorFacadeDeps gpu_info cuda_allocator pinned_allocator cuda_device_guard) +else () + set(AllocatorFacadeDeps) +endif() + +cc_library(aligned_allocator SRCS aligned_allocator.cc DEPS allocator) +cc_library(auto_increment_allocator SRCS auto_increment_allocator.cc DEPS allocator) +cc_library(zero_size_allocator SRCS zero_size_allocator.cc DEPS allocator) +cc_library(conditional_allocator SRCS conditional_allocator.cc DEPS allocator) +cc_library(allocator_strategy SRCS allocator_strategy.cc DEPS gflags) +cc_library(allocator_facade SRCS allocator_facade.cc DEPS + 
${AllocatorFacadeDeps} + cpu_allocator + locked_allocator + best_fit_allocator + aligned_allocator + auto_increment_allocator + zero_size_allocator + conditional_allocator + retry_allocator + buffered_allocator + allocator_strategy + legacy_allocator + ) + +nv_test(allocation_and_eigen_test SRCS allocation_and_eigen_test.cu DEPS allocator_facade) + +cc_test(retry_allocator_test SRCS retry_allocator_test.cc DEPS retry_allocator best_fit_allocator locked_allocator cpu_allocator) + +cc_test(allocator_facade_test SRCS allocator_facade_test.cc DEPS allocator_facade) diff --git a/paddle/fluid/inference/analysis/analyzer_main.cc b/paddle/fluid/memory/allocation/aligned_allocator.cc similarity index 56% rename from paddle/fluid/inference/analysis/analyzer_main.cc rename to paddle/fluid/memory/allocation/aligned_allocator.cc index 5e1fe3eb797cdced56a61aa2db0c3d18601824f8..efae280dbd47a1db476f9c371ba73eac96c30df6 100644 --- a/paddle/fluid/inference/analysis/analyzer_main.cc +++ b/paddle/fluid/memory/allocation/aligned_allocator.cc @@ -12,22 +12,20 @@ // See the License for the specific language governing permissions and // limitations under the License. -/* - * This file implements analysizer -- an executation help to analyze and - * optimize trained model. - */ -#include "paddle/fluid/inference/analysis/analyzer.h" -#include -#include +#include "paddle/fluid/memory/allocation/aligned_allocator.h" -int main(int argc, char** argv) { - google::ParseCommandLineFlags(&argc, &argv, true); - using paddle::inference::analysis::Analyzer; - using paddle::inference::analysis::Argument; +namespace paddle { +namespace memory { +namespace allocation { - Argument argument; - Analyzer analyzer; - analyzer.Run(&argument); +ThinAlignedAllocator::ThinAlignedAllocator( + std::shared_ptr underlyning_allocator) + : underlying_allocator_(std::move(underlyning_allocator)) {} - return 0; +bool ThinAlignedAllocator::IsAllocThreadSafe() const { + return underlying_allocator_->IsAllocThreadSafe(); } + +} // namespace allocation +} // namespace memory +} // namespace paddle diff --git a/paddle/fluid/memory/allocation/aligned_allocator.h b/paddle/fluid/memory/allocation/aligned_allocator.h new file mode 100644 index 0000000000000000000000000000000000000000..fc1a8e9247b16374037bfde44449fd552b44c6b4 --- /dev/null +++ b/paddle/fluid/memory/allocation/aligned_allocator.h @@ -0,0 +1,100 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include "paddle/fluid/memory/allocation/allocator.h" + +namespace paddle { +namespace memory { +namespace allocation { + +// The aligned allocation and allocator will wrap a managed allocator, +// and returns the aligned pointer. +// +// NOTE(yy): For speed reason, I just use a template parameter to get +// alignment, however, it can be an private member if necessary. +// +// NOTE(yy): kAlignment must be 2^N. a `static_assert` should be added. 
+template <size_t kAlignment>
+class AlignedAllocation : public Allocation {
+  static_assert(kAlignment > 0 && (kAlignment & (kAlignment - 1)) == 0,
+                "kAlignment must be 2^N");
+
+ public:
+  AlignedAllocation(AllocationPtr&& underlying_allocation, size_t size)
+      : Allocation(AlignedPtr(underlying_allocation->ptr()),
+                   size + kAlignment - Offset(underlying_allocation->ptr()),
+                   underlying_allocation->place()),
+        underlying_allocation_(std::move(underlying_allocation)) {}
+
+ private:
+  static void* AlignedPtr(void* ptr) {
+    return reinterpret_cast<void*>(reinterpret_cast<uintptr_t>(ptr) +
+                                   Offset(ptr));
+  }
+
+  // Offset to the aligned pointer.
+  // If ptr is already aligned, returns 0.
+  static size_t Offset(void* ptr) {
+    auto ptr_addr = reinterpret_cast<intptr_t>(ptr);
+    intptr_t aligned_addr = (ptr_addr & ~(kAlignment - 1));
+    intptr_t diff = aligned_addr - ptr_addr;
+    if (diff == 0) {
+      return 0;
+    } else {
+      return kAlignment + diff;
+    }
+  }
+
+  AllocationPtr underlying_allocation_;
+};
+
+// The thin aligned allocator is trivial and used to generate a small binary.
+//
+// NOTE(yy): This is a trick to make a template class. This class extracts
+// the common code into a `thin` class, so if there are multiple
+// specializations of the template class, the binary size will not grow too
+// much.
+//
+// NOTE(yy): This could be over-design. If it harms code readability, it can
+// be removed later.
+class ThinAlignedAllocator : public Allocator {
+ public:
+  explicit ThinAlignedAllocator(
+      std::shared_ptr<Allocator> underlying_allocator);
+
+  bool IsAllocThreadSafe() const;
+
+ protected:
+  std::shared_ptr<Allocator> underlying_allocator_;
+};
+
+// An aligned allocator will allocate a `size + kAlignment` buffer and adjust
+// the pointer offset.
+template <size_t kAlignment>
+class AlignedAllocator : public ThinAlignedAllocator {
+ public:
+  using ThinAlignedAllocator::ThinAlignedAllocator;
+
+ protected:
+  Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override {
+    auto raw_allocation =
+        underlying_allocator_->Allocate(size + kAlignment, attr);
+    return new AlignedAllocation<kAlignment>(std::move(raw_allocation), size);
+  }
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/allocation_and_eigen_test.cu b/paddle/fluid/memory/allocation/allocation_and_eigen_test.cu
new file mode 100644
index 0000000000000000000000000000000000000000..b61649e59d326a64aa806460feffc3a910b1cab8
--- /dev/null
+++ b/paddle/fluid/memory/allocation/allocation_and_eigen_test.cu
@@ -0,0 +1,48 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "gtest/gtest.h"
+#include "paddle/fluid/framework/eigen.h"
+#include "paddle/fluid/framework/tensor.h"
+#include "paddle/fluid/platform/device_context.h"
+#include "paddle/fluid/platform/for_range.h"
+#include "unsupported/Eigen/CXX11/Tensor"
+
+// NOTE(yy): this unittest is not important. It is just used for debugging.
+// It can be removed later.
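+//
+// It fills a 10x10 float tensor twice -- once through a CUDA ForRange
+// functor and once through an Eigen expression -- to check that both code
+// paths accept the memory produced by the new allocator.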
+struct FillZero {
+ public:
+  float* ptr_;
+
+  __device__ void operator()(size_t i) { ptr_[i] = 0.0f; }
+};
+
+namespace paddle {
+TEST(Eigen, main) {
+  framework::Tensor tensor;
+  platform::CUDAPlace gpu(0);
+  float* ptr = tensor.mutable_data<float>({10, 10}, gpu);
+  auto& dev_ctx = *reinterpret_cast<platform::CUDADeviceContext*>(
+      platform::DeviceContextPool::Instance().Get(gpu));
+  PADDLE_ENFORCE(cudaMemset(ptr, 0, sizeof(float) * 100));
+
+  platform::ForRange<platform::CUDADeviceContext> for_range(dev_ctx, 100);
+  for_range(FillZero{ptr});
+  dev_ctx.Wait();
+
+  auto eigen_vec = framework::EigenVector<float>::Flatten(tensor);
+  auto& eigen_dev = *dev_ctx.eigen_device();
+  eigen_vec.device(eigen_dev) = eigen_vec.constant(0.0f);
+}
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/allocation_with_underlying.h b/paddle/fluid/memory/allocation/allocation_with_underlying.h
new file mode 100644
index 0000000000000000000000000000000000000000..69f78667d7d33c59245a9890b9a2ce469f629450
--- /dev/null
+++ b/paddle/fluid/memory/allocation/allocation_with_underlying.h
@@ -0,0 +1,33 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include "paddle/fluid/memory/allocation/allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+class AllocationWithUnderlying : public Allocation {
+ public:
+  explicit AllocationWithUnderlying(AllocationPtr allocation)
+      : Allocation(allocation->ptr(), allocation->size(), allocation->place()),
+        allocation_(std::move(allocation)) {}
+  AllocationPtr allocation_;
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/allocator.cc b/paddle/fluid/memory/allocation/allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..8fb8a5fb897a736d7515951ba08c633da9a7706c
--- /dev/null
+++ b/paddle/fluid/memory/allocation/allocator.cc
@@ -0,0 +1,45 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
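+
+// Ownership sketch (illustrative): an AllocationPtr returned by
+// Allocator::Allocate() carries an AllocationDeleter, which routes
+// destruction back to the allocator that produced the allocation:
+//
+//   AllocationPtr p = some_allocator.Allocate(1024);  // any Allocator
+//   void* raw = p->ptr();  // use the memory
+//   p.reset();             // AllocationDeleter -> Allocator::Free(...)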
+
+#include "paddle/fluid/memory/allocation/allocator.h"
+
+#include <functional>
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+Allocation::~Allocation() {}
+
+Allocator::~Allocator() {}
+
+bool Allocator::IsAllocThreadSafe() const { return false; }
+
+AllocationPtr Allocator::Allocate(size_t size, Allocator::Attr attr) {
+  auto ptr = AllocateImpl(size, attr);
+  ptr->set_allocator(this);
+  return AllocationPtr(ptr);
+}
+
+void Allocator::Free(Allocation* allocation) { delete allocation; }
+
+const char* BadAlloc::what() const noexcept { return msg_.c_str(); }
+
+void AllocationDeleter::operator()(Allocation* allocation) const {
+  auto* allocator = allocation->allocator();
+  allocator->Free(allocation);
+}
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/allocator.h b/paddle/fluid/memory/allocation/allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..f2b6f438c382275cab4ecf9aceea1c55e5885dee
--- /dev/null
+++ b/paddle/fluid/memory/allocation/allocator.h
@@ -0,0 +1,145 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+#include <memory>
+#include <string>
+#include "paddle/fluid/platform/place.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+// Exception thrown when `Alloc`/`AllocShared` fails.
+class BadAlloc : public std::exception {
+ public:
+  explicit BadAlloc(std::string msg) : msg_(std::move(msg)) {}
+  const char* what() const noexcept override;
+
+ private:
+  std::string msg_;
+};
+
+class Allocation;
+class AllocationDeleter {
+ public:
+  void operator()(Allocation* allocation) const;
+};
+
+class Allocator;
+// Allocation is the object holding the actual pointer;
+// `Allocation::ptr()` returns the allocated pointer.
+//
+// NOTE: this is the base class of Allocation. Each allocator can use its own
+// allocation object.
+// NOTE: `Allocation::ptr()` could be nullptr if the allocation size is 0.
+class Allocation {
+ public:
+  Allocation(void* ptr, size_t size, platform::Place place)
+      : allocator_(nullptr), ptr_(ptr), size_(size), place_(place) {}
+
+  Allocation(const Allocation& o) = delete;
+  Allocation& operator=(const Allocation& o) = delete;
+
+  // Returns the holding pointer.
+  // NOTE: For performance consideration, it is better not to make this method
+  // virtual. If we want to implement `defragmentation` later, we might need
+  // to make the `ptr_` field protected and add a virtual method like
+  // `defragmentation` to change `ptr_`.
+  void* ptr() const { return ptr_; }
+
+  // Returns the size of this memory buffer, i.e., ptr() + size() - 1 is the
+  // last valid element.
+  //
+  // NOTE: Some allocators might allocate more memory than requested, so the
+  // size could be larger than the request. For example,
+  // the AlignedAllocator always allocates size + kAlignment bytes.
+  // The raw pointer might not be aligned, so an offset might be added to
+  // the raw pointer. The size of such an allocation will be
+  // `size + kAlignment - offset`.
+  size_t size() const { return size_; }
+
+  const platform::Place& place() const { return place_; }
+
+  Allocator* allocator() { return allocator_; }
+
+  void set_allocator(Allocator* allocator) { allocator_ = allocator; }
+
+  virtual ~Allocation();
+
+ private:
+  Allocator* allocator_;
+  void* ptr_;
+  size_t size_;
+  platform::Place place_;
+};
+
+using AllocationPtr = std::unique_ptr<Allocation, AllocationDeleter>;
+
+// Base interface class of memory Allocator.
+// To allocate memory, an allocator needs two parameters:
+// 1. size of bytes.
+// 2. Attribute of memory.
+// NOTE: the attribute of memory might be ignored if the allocator does not
+// care about it.
+class Allocator {
+ public:
+  enum Attr {
+    kDefault = 0,  // Default attribute. Use the fastest or most stable
+                   // allocation algorithm.
+
+    kFixedHuge = 1,  // The allocation may not be freed until the program
+                     // ends. e.g., `Parameters` and `Momentum`.
+
+    kFluxHuge = 2,  // The allocation may be created and freed frequently
+                    // and is considerably huge, like `activations` and
+                    // gradients.
+
+    kScratchpad =
+        3,  // The `Scratchpad` memory is allocated and freed very soon,
+            // usually within an operator or as aux memory,
+            // like CUDNN workspace, AUX memory in batch norm, etc.
+            //
+            // https://en.wikipedia.org/wiki/Scratchpad_memory
+
+    kCrossDevice =
+        4,  // The memory is used for cross-device memory copy/communication.
+            // For example:
+            // 1. it can use a `pinned` memory for CPU-GPU
+            //    communication.
+            // 2. it can use a `registered` memory for RDMA
+            //    communication.
+
+    NumOfAttrs = 5  // The number of all attributes. It is used internally.
+  };
+
+  virtual ~Allocator();
+
+  // Allocate an allocation.
+  AllocationPtr Allocate(size_t size, Allocator::Attr attr = kDefault);
+
+  // Returns true if the `Allocate` method is thread-safe.
+  virtual bool IsAllocThreadSafe() const;
+
+ protected:
+  virtual void Free(Allocation* allocation);
+  virtual Allocation* AllocateImpl(size_t size, Allocator::Attr attr) = 0;
+
+ private:
+  friend class AllocationDeleter;
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/allocator_facade.cc b/paddle/fluid/memory/allocation/allocator_facade.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e207a853c8f782698b19d7f71caacf92f8df8e41
--- /dev/null
+++ b/paddle/fluid/memory/allocation/allocator_facade.cc
@@ -0,0 +1,271 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
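+
+// Composition sketch (illustrative, non-legacy strategy): this file stacks
+// the allocators roughly as
+//
+//   ZeroSizeAllocator
+//     -> ChunkedAllocator (a ConditionalAllocator inside)
+//          small request -> [AutoIncrementAllocator ->] AlignedAllocator<64u>
+//                             -> [RetryAllocator ->] LockedAllocator
+//                             -> BestFitAllocator over one chunk
+//          large request -> the raw system allocator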
+
+#include "paddle/fluid/memory/allocation/allocator.h"
+#include <gflags/gflags.h>
+#include <map>
+#include <string>
+#include <vector>
+#include "paddle/fluid/memory/allocation/aligned_allocator.h"
+#include "paddle/fluid/memory/allocation/allocator_facade.h"
+#include "paddle/fluid/memory/allocation/allocator_strategy.h"
+#include "paddle/fluid/memory/allocation/auto_increment_allocator.h"
+#include "paddle/fluid/memory/allocation/best_fit_allocator.h"
+#include "paddle/fluid/memory/allocation/conditional_allocator.h"
+#include "paddle/fluid/memory/allocation/cpu_allocator.h"
+#include "paddle/fluid/memory/allocation/legacy_allocator.h"
+#include "paddle/fluid/memory/allocation/locked_allocator.h"
+#include "paddle/fluid/memory/allocation/retry_allocator.h"
+#include "paddle/fluid/memory/allocation/zero_size_allocator.h"
+#include "paddle/fluid/platform/cpu_info.h"
+#include "paddle/fluid/platform/place.h"
+#ifdef PADDLE_WITH_CUDA
+#include "paddle/fluid/memory/allocation/cuda_allocator.h"
+#include "paddle/fluid/memory/allocation/pinned_allocator.h"
+#include "paddle/fluid/platform/cuda_device_guard.h"
+#include "paddle/fluid/platform/gpu_info.h"
+#endif
+
+DEFINE_int64(
+    gpu_allocator_retry_time, 0,
+    "The retry time (milliseconds) when allocator fails "
+    "to allocate memory. No retry if this value is not greater than 0");
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+// TODO(yy): Dirty code here. This class should be configurable at runtime.
+class CPUManagedAllocator : public Allocator {
+ public:
+  CPUManagedAllocator() : normal_allocator_(new CPUAllocator()) {}
+
+  bool IsAllocThreadSafe() const override { return true; }
+
+ protected:
+  Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override {
+    return normal_allocator_->Allocate(size, attr).release();
+  }
+
+ private:
+  std::shared_ptr<Allocator> normal_allocator_;
+};
+
+// TODO(yy): Dirty code here. This class should be configurable at runtime.
+class ChunkedAllocator : public Allocator {
+ public:
+  explicit ChunkedAllocator(std::unique_ptr<Allocator> system_allocator,
+                            size_t max_chunk_size, size_t capacity = 1,
+                            int64_t retry_time = -1)
+      : max_chunk_size_(max_chunk_size), retry_time_(retry_time) {
+    raw_allocator_ = std::move(system_allocator);
+
+    if (max_chunk_size_ == 0) {
+      default_allocator_ = raw_allocator_;
+    } else {
+      if (capacity == 1) {
+        VLOG(10) << "Create BestFitAllocator with chunk_size "
+                 << max_chunk_size_;
+        default_allocator_ = CreateAllocatorWithChunk();
+      } else {
+        VLOG(10) << "Create AutoIncrementAllocator with chunk_size "
+                 << max_chunk_size_ << " and capacity " << capacity;
+        default_allocator_ = std::make_shared<AutoIncrementAllocator>(
+            [this] { return std::move(CreateAllocatorWithChunk()); }, capacity);
+      }
+    }
+
+    auto* cond_allocator = new ConditionalAllocator();
+    cond_allocator
+        ->AddAllocator(
+            [this](size_t size, Attr attr) { return size < max_chunk_size_; },
+            default_allocator_)
+        .AddAllocator(
+            [](size_t size, Attr attr) {
+              return true;  // default case
+            },
+            raw_allocator_);
+    default_allocator_.reset(cond_allocator);
+  }
+
+  ~ChunkedAllocator() override {
+    // Specify destruct order.
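+    // default_allocator_ hands out memory carved from chunks_, and chunks_
+    // were allocated by raw_allocator_, so members must be destroyed in
+    // reverse dependency order: consumers first, the raw source last.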
+    default_allocator_.reset();
+    chunks_.clear();
+    raw_allocator_.reset();
+  }
+
+  std::shared_ptr<Allocator> CreateAllocatorWithChunk() {
+    chunks_.emplace_back(raw_allocator_->Allocate(max_chunk_size_));
+    auto* allocation = chunks_.back().get();
+    std::unique_ptr<Allocator> allocator(new LockedAllocator(
+        std::unique_ptr<Allocator>(new BestFitAllocator(allocation))));
+
+    if (retry_time_ > 0) {
+      auto* retry_allocator =
+          new RetryAllocator(std::move(allocator), retry_time_);
+      allocator.reset(retry_allocator);
+    }
+
+    return std::make_shared<AlignedAllocator<64u>>(std::move(allocator));
+  }
+
+  bool IsAllocThreadSafe() const override { return true; }
+
+ protected:
+  Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override {
+    return default_allocator_->Allocate(size, attr).release();
+  }
+
+ protected:
+  size_t max_chunk_size_;
+  int64_t retry_time_;
+  std::vector<AllocationPtr> chunks_;
+  std::shared_ptr<Allocator> raw_allocator_;
+  std::shared_ptr<Allocator> default_allocator_;
+};
+
+#ifdef PADDLE_WITH_CUDA
+
+class CUDAChunkedAllocator : public ChunkedAllocator {
+ public:
+  explicit CUDAChunkedAllocator(int dev_id)
+      : ChunkedAllocator(std::unique_ptr<Allocator>(
+                             new CUDAAllocator(platform::CUDAPlace(dev_id))),
+                         GetMaxChunkSize(dev_id), GetCapacity(dev_id),
+                         GetRetryTime()) {}
+
+ private:
+  static size_t GetMaxChunkSize(int dev_id) {
+    platform::CUDADeviceGuard guard(dev_id);
+    return platform::GpuMaxChunkSize();
+  }
+
+  static size_t GetCapacity(int dev_id) {
+    platform::CUDADeviceGuard guard(dev_id);
+    size_t available, total;
+    platform::GpuMemoryUsage(&available, &total);
+    size_t max_chunk_size = platform::GpuMaxChunkSize();
+    return max_chunk_size == 0 ? 0 : available / max_chunk_size;
+  }
+
+  static int64_t GetRetryTime() { return FLAGS_gpu_allocator_retry_time; }
+};
+
+class CUDAPinnedChunkedAllocator : public ChunkedAllocator {
+ public:
+  CUDAPinnedChunkedAllocator()
+      : ChunkedAllocator(
+            std::unique_ptr<Allocator>(new CPUPinnedAllocator()),
+            platform::CUDAPinnedMaxChunkSize(), GetCapacity(),
+            -1) {}  // never retry
+
+ private:
+  static size_t GetCapacity() {
+    size_t total = platform::CpuTotalPhysicalMemory();
+    size_t max_chunk_size = platform::CUDAPinnedMaxChunkSize();
+    return max_chunk_size == 0 ? 0 : total / max_chunk_size;
+  }
+};
+
+#endif
+
+class AllocatorFacadePrivate {
+ public:
+  std::map<platform::Place, std::shared_ptr<Allocator>> allocators_;
+
+  ~AllocatorFacadePrivate() = default;
+
+  AllocatorFacadePrivate() {
+    if (GetAllocatorStrategy() == AllocatorStrategy::kLegacy) {
+      InitLegacyAllocator();
+    } else {
+      InitCPUAllocator();
+      InitCUDAAllocator();
+      InitCUDAPinnedAllocator();
+      WrapZeroSizeAllocator();
+    }
+  }
+
+ private:
+  void InitLegacyAllocator() {
+    std::vector<platform::Place> places{platform::CPUPlace()};
+#ifdef PADDLE_WITH_CUDA
+    for (int dev_id = 0; dev_id < platform::GetCUDADeviceCount(); ++dev_id) {
+      places.emplace_back(platform::CUDAPlace(dev_id));
+    }
+    places.emplace_back(platform::CUDAPinnedPlace());
+#endif
+    for (auto& p : places) {
+      allocators_[p] = std::make_shared<LegacyAllocator>(p);
+    }
+  }
+
+  void InitCPUAllocator() {
+    allocators_[platform::CPUPlace()] =
+        std::make_shared<CPUManagedAllocator>();
+  }
+
+  void InitCUDAAllocator() {
+#ifdef PADDLE_WITH_CUDA
+    int device_count = platform::GetCUDADeviceCount();
+    for (int dev_id = 0; dev_id < device_count; ++dev_id) {
+      allocators_[platform::CUDAPlace(dev_id)] =
+          std::make_shared<CUDAChunkedAllocator>(dev_id);
+    }
+#endif
+  }
+
+  void InitCUDAPinnedAllocator() {
+#ifdef PADDLE_WITH_CUDA
+    allocators_[platform::CUDAPinnedPlace()] =
+        std::make_shared<CUDAPinnedChunkedAllocator>();
+#endif
+  }
+
+  void WrapZeroSizeAllocator() {
+    for (auto& pair : allocators_) {
+      pair.second =
+          std::make_shared<ZeroSizeAllocator>(pair.second, pair.first);
+    }
+  }
+};
+
+// Pimpl. Make the interface clean.
+AllocatorFacade::AllocatorFacade() : m_(new AllocatorFacadePrivate()) {}
+AllocatorFacade::~AllocatorFacade() { delete m_; }
+
+AllocatorFacade& AllocatorFacade::Instance() {
+  static AllocatorFacade instance;
+  return instance;
+}
+
+std::shared_ptr<Allocation> AllocatorFacade::AllocShared(
+    const platform::Place& place, size_t size, Allocator::Attr attr) {
+  return std::shared_ptr<Allocation>(Alloc(place, size, attr).release(),
+                                     AllocationDeleter());
+}
+
+AllocationPtr AllocatorFacade::Alloc(const platform::Place& place, size_t size,
+                                     Allocator::Attr attr) {
+  auto it = m_->allocators_.find(place);
+  if (it == m_->allocators_.end()) {
+    throw BadAlloc(
+        string::Sprintf("No such allocator for the place, %s", place));
+  }
+  return m_->allocators_.at(place)->Allocate(size, attr);
+}
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/allocator_facade.h b/paddle/fluid/memory/allocation/allocator_facade.h
new file mode 100644
index 0000000000000000000000000000000000000000..16da30bec0d9f524bd076fe76d15c2fcfa7edd3a
--- /dev/null
+++ b/paddle/fluid/memory/allocation/allocator_facade.h
@@ -0,0 +1,57 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+#include <memory>
+#include "paddle/fluid/memory/allocation/allocator.h"
+#include "paddle/fluid/platform/place.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+// Allocator Facade is the interface exposed to other modules.
+// All the configuration or dirty code under development should
+// be hidden behind this facade.
+//
+// NOTE(yy): This class is a singleton class.
+// NOTE(yy): To create a stable ABI and make compilation faster, we use the
+// Pimpl trick here.
+class AllocatorFacadePrivate;
+class AllocatorFacade {
+ public:
+  ~AllocatorFacade();
+  AllocatorFacade(const AllocatorFacade& o) = delete;
+  const AllocatorFacade& operator=(const AllocatorFacade& o) = delete;
+
+  static AllocatorFacade& Instance();
+
+  // Allocate a shared allocation.
+  std::shared_ptr<Allocation> AllocShared(
+      const platform::Place& place, size_t size,
+      Allocator::Attr attr = Allocator::kDefault);
+
+  // Allocate a unique allocation.
+  AllocationPtr Alloc(const platform::Place& place, size_t size,
+                      Allocator::Attr attr = Allocator::kDefault);
+
+  // TODO(yy): Allocate a Copy-On-Write allocation?
+ private:
+  AllocatorFacade();
+  AllocatorFacadePrivate* m_;
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/allocator_facade_test.cc b/paddle/fluid/memory/allocation/allocator_facade_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..802d79e15de253d4e67e35046bdf1d689258da6d
--- /dev/null
+++ b/paddle/fluid/memory/allocation/allocator_facade_test.cc
@@ -0,0 +1,87 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
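+
+// Usage sketch (mirrors the assertions below): callers go through the
+// singleton facade instead of a concrete allocator, e.g.
+//
+//   auto p = AllocatorFacade::Instance().Alloc(platform::CPUPlace(), 1024);
+//   auto s = AllocatorFacade::Instance().AllocShared(place, size);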
+
+#include "paddle/fluid/memory/allocation/allocator_facade.h"
+#include <gflags/gflags.h>
+#include <gtest/gtest.h>
+
+#ifdef PADDLE_WITH_CUDA
+DECLARE_double(fraction_of_gpu_memory_to_use);
+DECLARE_double(fraction_of_cuda_pinned_memory_to_use);
+DECLARE_int64(gpu_allocator_retry_time);
+#endif
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+TEST(allocator, allocator) {
+#ifdef PADDLE_WITH_CUDA
+  FLAGS_fraction_of_gpu_memory_to_use = 0.01;
+  FLAGS_gpu_allocator_retry_time = 500;
+  FLAGS_fraction_of_cuda_pinned_memory_to_use = 0.5;
+#endif
+
+  auto &instance = AllocatorFacade::Instance();
+  platform::Place place;
+  size_t size = 1024;
+
+  {
+    place = platform::CPUPlace();
+    size = 1024;
+    auto cpu_allocation = instance.Alloc(place, size);
+    ASSERT_NE(cpu_allocation, nullptr);
+    ASSERT_NE(cpu_allocation->ptr(), nullptr);
+    ASSERT_EQ(cpu_allocation->place(), place);
+    ASSERT_EQ(cpu_allocation->size(), size);
+  }
+
+#ifdef PADDLE_WITH_CUDA
+  {
+    place = platform::CUDAPlace(0);
+    size = 1024;
+    auto gpu_allocation = instance.Alloc(place, size);
+    ASSERT_NE(gpu_allocation, nullptr);
+    ASSERT_NE(gpu_allocation->ptr(), nullptr);
+    ASSERT_EQ(gpu_allocation->place(), place);
+    ASSERT_GE(gpu_allocation->size(), size);
+  }
+
+  {
+    // Allocate 2GB gpu memory
+    place = platform::CUDAPlace(0);
+    size = 2 * static_cast<size_t>(1 << 30);
+    auto gpu_allocation = instance.Alloc(place, size);
+    ASSERT_NE(gpu_allocation, nullptr);
+    ASSERT_NE(gpu_allocation->ptr(), nullptr);
+    ASSERT_EQ(gpu_allocation->place(), place);
+    ASSERT_GE(gpu_allocation->size(), size);
+  }
+
+  {
+    place = platform::CUDAPinnedPlace();
+    size = (1 << 20);
+    auto cuda_pinned_allocation =
+        instance.Alloc(platform::CUDAPinnedPlace(), 1 << 20);
+    ASSERT_NE(cuda_pinned_allocation, nullptr);
+    ASSERT_NE(cuda_pinned_allocation->ptr(), nullptr);
+    ASSERT_EQ(cuda_pinned_allocation->place(), place);
+    ASSERT_GE(cuda_pinned_allocation->size(), size);
+  }
+#endif
+}
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/allocator_strategy.cc b/paddle/fluid/memory/allocation/allocator_strategy.cc
new file mode 100644
index 0000000000000000000000000000000000000000..b46b1e9ae206b82f5810b4ba7345ebc60fb84285
--- /dev/null
+++ b/paddle/fluid/memory/allocation/allocator_strategy.cc
@@ -0,0 +1,41 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/allocator_strategy.h"
+#include "gflags/gflags.h"
+
+DEFINE_string(
+    allocator_strategy, "legacy",
+    "The allocation strategy. Legacy means the original allocator of Fluid. "
+    "New means the experimental allocators of Fluid. Must be in "
+    "[legacy, new]");
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+static AllocatorStrategy GetStrategyFromFlag() {
+  return FLAGS_allocator_strategy == "legacy"
+             ? AllocatorStrategy::kLegacy
+             : AllocatorStrategy::kNaiveBestFit;
+}
+
+AllocatorStrategy GetAllocatorStrategy() {
+  static AllocatorStrategy strategy = GetStrategyFromFlag();
+  return strategy;
+}
+
+void UseAllocatorStrategyGFlag() {}
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/allocator_strategy.h b/paddle/fluid/memory/allocation/allocator_strategy.h
new file mode 100644
index 0000000000000000000000000000000000000000..9adbd879939c562cf84579a92f21d3b82e69a7e5
--- /dev/null
+++ b/paddle/fluid/memory/allocation/allocator_strategy.h
@@ -0,0 +1,30 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+enum class AllocatorStrategy { kLegacy, kNaiveBestFit };
+
+extern AllocatorStrategy GetAllocatorStrategy();
+
+// Do nothing, just make sure the linker does not prune this file.
+extern void UseAllocatorStrategyGFlag();
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/auto_increment_allocator.cc b/paddle/fluid/memory/allocation/auto_increment_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..c4785d2078601d7f9c5eeb7b902c7d1020340214
--- /dev/null
+++ b/paddle/fluid/memory/allocation/auto_increment_allocator.cc
@@ -0,0 +1,78 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/auto_increment_allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+bool AutoIncrementAllocator::IsAllocThreadSafe() const { return true; }
+
+std::shared_ptr<Allocator> AutoIncrementAllocator::CreateNewAllocator() {
+  std::lock_guard<std::mutex> guard(mtx_);
+  auto old_size = allocator_num_.load();
+  PADDLE_ENFORCE_LT(old_size, underlying_allocators_.size(),
+                    "Allocator number exceeds capacity %d",
+                    underlying_allocators_.size());
+  underlying_allocators_[old_size] = creator_();
+  prev_success_allocator_ = old_size;
+  ++allocator_num_;
+  PADDLE_ENFORCE(
+      underlying_allocators_[old_size]->IsAllocThreadSafe(),
+      "the underlying allocator must be thread safe. This is a program "
+      "bug.");
+  return underlying_allocators_[old_size];
+}
+Allocation *AutoIncrementAllocator::AllocateImpl(size_t size,
+                                                 Allocator::Attr attr) {
+  auto cur = prev_success_allocator_.load();
+  size_t retry_count = allocator_num_.load();
+  size_t allocator_num = retry_count;
+  while (retry_count-- > 0) {  // until the retry count is zero
+    try {
+      auto res = underlying_allocators_[cur]->Allocate(size, attr);
+      prev_success_allocator_ = cur;
+      return res.release();
+    } catch (BadAlloc &) {
+      if (++cur >= allocator_num) {
+        cur = 0;
+      }
+    } catch (...) {
+      // if there is another type of exception, just rethrow it.
+      throw;
+    }
+  }
+
+  // This happens when the first allocator is exhausted and there is more
+  // than one allocation request. In this situation, the first allocation
+  // request would succeed and the second one would fail if we did not use
+  // the allocator newly created by the first request.
+  for (cur = allocator_num; cur < allocator_num_; ++cur) {
+    try {
+      auto ret = underlying_allocators_[cur]->Allocate(size, attr);
+      prev_success_allocator_ = cur;
+      return ret.release();
+    } catch (BadAlloc &) {
+    } catch (...) {
+      throw;
+    }
+  }
+  // No suitable allocator
+  return CreateNewAllocator()->Allocate(size, attr).release();
+}
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/auto_increment_allocator.h b/paddle/fluid/memory/allocation/auto_increment_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..382588f17a9748b1b0a356c0469c683f6c904778
--- /dev/null
+++ b/paddle/fluid/memory/allocation/auto_increment_allocator.h
@@ -0,0 +1,79 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include <atomic>  // NOLINT
+#include <functional>
+#include <mutex>   // NOLINT
+#include <thread>  // NOLINT
+#include <vector>
+#include "paddle/fluid/memory/allocation/allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+// The AutoIncrementAllocator manages many underlying allocators. If none of
+// them can allocate the requested memory, a new allocator is created and its
+// `Allocate` method invoked.
+//
+// NOTE(yy): The AutoIncrementAllocator will prefer to allocate memory from
+// the latest successful allocator.
+//
+// NOTE(yy): We may need to release an underlying allocator if it allocates
+// nothing. However, it is generally not useful, since it would make
+// performance nondeterministic.
+//
+// NOTE(yy): This allocator is only locked when creating a new underlying
+// allocator. The allocation requests from many threads may be dispatched
+// to the same underlying allocator. So the underlying allocator must be
+// thread safe.
+//
+// NOTE(zjl): Add a capacity parameter to the constructor. A high-performance
+// thread-safe std::vector with varying size is hard to implement.
+// Fortunately, we can get the total GPU memory and each chunk size.
+// Therefore, we can get the suitable capacity of AutoIncrementAllocator.
+class AutoIncrementAllocator : public Allocator {
+ public:
+  // Creator is the method to create a new underlying Allocator.
+  using AllocatorCreator = std::function<std::shared_ptr<Allocator>()>;
+
+  explicit AutoIncrementAllocator(AllocatorCreator&& creator, size_t capacity)
+      : creator_(std::move(creator)), underlying_allocators_(capacity) {}
+
+  bool IsAllocThreadSafe() const override;
+
+ private:
+  std::shared_ptr<Allocator> CreateNewAllocator();
+
+ protected:
+  Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override;
+
+ private:
+  AllocatorCreator creator_;
+
+  std::vector<std::shared_ptr<Allocator>> underlying_allocators_;
+  std::atomic<size_t> allocator_num_{0};
+
+  // Use std::atomic rather than std::mutex, since std::atomic is usually
+  // lock-free
+  std::atomic<size_t> prev_success_allocator_{0};
+
+  std::mutex mtx_;
+};
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/best_fit_allocator.cc b/paddle/fluid/memory/allocation/best_fit_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..6f3e512fb0b68df5e86eba3e50a255c18f75214f
--- /dev/null
+++ b/paddle/fluid/memory/allocation/best_fit_allocator.cc
@@ -0,0 +1,168 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
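+
+// Walk-through (illustrative): to allocate 100 bytes from a fresh 4 KiB
+// arena, AllocateImpl scans the bins from free_chunks_[HighestBitPos(100)]
+// upward, finds the single 4 KiB free chunk, and SplitChunk turns it into a
+// used [0, 100) chunk plus a free [100, 4096) chunk; freeing the allocation
+// later merges the two back into one chunk.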
+
+#include "paddle/fluid/memory/allocation/best_fit_allocator.h"
+#include <cmath>
+#include <list>
+#include <map>
+#include <string>
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+static int HighestBitPos(size_t N) {
+  if (UNLIKELY(N == 0)) {
+    return 0;
+  } else {
+#ifdef __GNUC__
+    return sizeof(unsigned long long) * 8 - __builtin_clzll(N);
+#else
+    return static_cast<int>(std::log2(N) + 1);
+#endif
+  }
+}
+
+BestFitAllocator::BestFitAllocator(Allocation* allocation)
+    : allocation_(allocation) {
+  details::Chunk chunk;
+  chunk.size_ = allocation_->size();
+  chunk.offset_ = 0;
+  chunk.is_free = true;
+  chunks_.emplace_back(chunk);
+  free_chunks_[HighestBitPos(chunk.size_)].insert(
+      {chunk.size_, chunks_.begin()});
+}
+
+size_t BestFitAllocator::FreeSize() const {
+  size_t acc = 0;
+  for (auto& array_item : free_chunks_) {
+    for (auto& pair : array_item) {
+      acc += pair.second->size_;
+    }
+  }
+  return acc;
+}
+
+BestFitAllocator::ListIt BestFitAllocator::SplitChunk(size_t request_size,
+                                                      size_t free_chunk_offset,
+                                                      MapIt bin_iterator) {
+  auto to_split_it = bin_iterator->second;
+  free_chunks_[free_chunk_offset].erase(bin_iterator);
+
+  PADDLE_ENFORCE(to_split_it->is_free);
+  PADDLE_ENFORCE_GE(to_split_it->size_, request_size);
+
+  auto remaining_size = to_split_it->size_ - request_size;
+  details::Chunk to_use;
+  details::Chunk remaining;
+  to_use.size_ = request_size;
+  to_use.is_free = false;
+  remaining.size_ = remaining_size;
+  remaining.is_free = true;
+
+  // calc offsets
+  to_use.offset_ = to_split_it->offset_;
+  remaining.offset_ = to_use.offset_ + to_use.size_;
+
+  // insert to chunk list
+  auto to_use_it = chunks_.insert(to_split_it, to_use);
+  if (remaining.size_ != 0) {
+    auto bit_size = static_cast<size_t>(HighestBitPos(remaining.size_));
+    free_chunks_[bit_size].insert(
+        {remaining.size_, chunks_.insert(to_split_it, remaining)});
+  }
+  chunks_.erase(to_split_it);
+  return to_use_it;
+}
+
+void BestFitAllocator::InsertFreeNode(const ListIt& it) {
+  auto pos = static_cast<size_t>(HighestBitPos(it->size_));
+  auto& free_map = free_chunks_[pos];
+  free_map.insert({it->size_, it});
+}
+void BestFitAllocator::EraseFreeNode(const ListIt& it) {
+  size_t pos = static_cast<size_t>(HighestBitPos(it->size_));
+  auto& free_map = free_chunks_[pos];
+  auto map_it = free_map.find(it->size_);
+  while (map_it != free_map.end() && map_it->second != it) {
+    ++map_it;
+  }
+  PADDLE_ENFORCE(map_it != free_map.end());
+  free_map.erase(map_it);
+}
+size_t BestFitAllocator::NumFreeChunks() const {
+  size_t num = 0;
+  for (auto& array_item : free_chunks_) {
+    num += array_item.size();
+  }
+  return num;
+}
+void BestFitAllocator::Free(Allocation* allocation) {
+  auto* bf_allocation = dynamic_cast<BestFitAllocation*>(allocation);
+  auto chunk_it = bf_allocation->ChunkIterator();
+  PADDLE_ENFORCE(!chunk_it->is_free);
+  chunk_it->is_free = true;
+  if (chunk_it != chunks_.begin()) {
+    auto prev_it = chunk_it;
+    --prev_it;
+
+    if (prev_it->is_free) {
+      // Merge Left.
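+      // e.g., a free chunk covering [0, 64) followed by this chunk at
+      // [64, 96) collapses into a single free chunk covering [0, 96).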
+      EraseFreeNode(prev_it);
+      prev_it->size_ += chunk_it->size_;
+      chunks_.erase(chunk_it);
+      chunk_it = prev_it;
+    }
+  }
+
+  auto next_it = chunk_it;
+  ++next_it;
+  if (next_it != chunks_.end() && next_it->is_free) {
+    EraseFreeNode(next_it);
+    chunk_it->size_ += next_it->size_;
+    chunks_.erase(next_it);
+  }
+
+  InsertFreeNode(chunk_it);
+  delete allocation;
+}
+Allocation* BestFitAllocator::AllocateImpl(size_t size, Allocator::Attr attr) {
+  auto highest_set_bit = static_cast<size_t>(HighestBitPos(size));
+  MapIt map_it;
+  for (; highest_set_bit < free_chunks_.size(); ++highest_set_bit) {
+    map_it = free_chunks_[highest_set_bit].lower_bound(size);
+    if (map_it != free_chunks_[highest_set_bit].end()) {
+      break;
+    }
+  }
+  if (UNLIKELY(highest_set_bit == free_chunks_.size())) {
+    throw BadAlloc(string::Sprintf(
+        "Cannot allocate %d, all free fragments total %d", size, FreeSize()));
+  }
+  auto chunk_it = SplitChunk(size, highest_set_bit, map_it);
+  return new BestFitAllocation(this, chunk_it);
+}
+
+BestFitAllocation::BestFitAllocation(
+    paddle::memory::allocation::BestFitAllocator* allocator,
+    typename details::ChunkList::iterator chunk_it)
+    : Allocation(reinterpret_cast<void*>(
+                     reinterpret_cast<uintptr_t>(allocator->BasePtr()) +
+                     chunk_it->offset_),
+                 chunk_it->size_, allocator->Place()),
+      chunk_it_(chunk_it) {}
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/best_fit_allocator.h b/paddle/fluid/memory/allocation/best_fit_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..4f10f2b53e8543d4197097f1cae8de765bceeb0f
--- /dev/null
+++ b/paddle/fluid/memory/allocation/best_fit_allocator.h
@@ -0,0 +1,132 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+#include <array>
+#include <list>
+#include <map>
+#include "paddle/fluid/memory/allocation/allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+namespace details {
+struct Chunk {
+  bool is_free{true};
+  // Offset to the base allocation.
+  uintptr_t offset_;
+  size_t size_;
+};
+
+// Here we use std::list to maintain the chunk list.
+// NOTE(yy): The traditional implementation of ChunkList is to add
+// `prev`/`next` pointers in `Chunk`, and to split the allocation as
+// `ChunkHeader` and `Payload`. Such as
+//   *-------*---------------*---------------*--------------*
+//   | Chunk | prev_ pointer | next_ pointer | payload .... |
+//   *-------*---------------*---------------*--------------*
+// This implementation can just return a raw pointer, and we can get the list
+// structure from the raw pointer. However, we cannot use the same code on
+// GPU since CPU cannot access GPU memory directly.
+//
+// So we choose to use `std::list` and return an allocation instance, which
+// contains the list node iterator, then we can unify the CPU/GPU code.
+//
+// To return an allocation is not a bad idea, since Tensor/Vector should hold
+// an allocation instead of a raw pointer directly.
+using ChunkList = std::list<Chunk>;
+
+// Here we use a multi-level map of free chunks.
+// The map is
+//     MSB offset --> size --> [ChunkList::iterator]
+//
+// The time complexities:
+//   find a free chunk:
+//     O(logN),
+//     where N is the number of free nodes with the same MSB offset.
+//   find the position of a chunk iterator:
+//     O(logN + K),
+//     where N is the number of free nodes with the same MSB offset,
+//     and K is the number of free nodes with the same size.
+//   insert a free chunk:
+//     O(logN),
+//     where N is the number of free nodes with the same MSB offset.
+//   erase a free chunk:
+//     O(1)
+using FreeChunkBin =
+    std::array<std::multimap<size_t, ChunkList::iterator>, sizeof(size_t) * 8>;
+}  // namespace details
+
+class BestFitAllocator;
+
+// The BestFitAllocation maintains the list node iterator.
+class BestFitAllocation : public Allocation {
+ private:
+  using ListIt = typename details::ChunkList::iterator;
+
+ public:
+  BestFitAllocation(BestFitAllocator* allocator, ListIt chunk_it);
+
+  const ListIt& ChunkIterator() const { return chunk_it_; }
+
+ private:
+  typename details::ChunkList::iterator chunk_it_;
+};
+
+// TODO(yy): The current BestFitAllocator is not thread-safe. To make it
+// thread-safe, we must wrap it in a LockedAllocator. However, we could
+// implement a thread-safe allocator by locking each bin and the chunk list
+// independently. That would make BestFitAllocator faster in multi-threaded
+// situations.
+//
+// This allocator implements a best-fit algorithm that merges free nodes.
+//
+// To allocate a buffer, it finds the best-fit chunk. If the best-fit chunk
+// is larger than the requested size, the original chunk is split into two
+// chunks. The first chunk is used and the second chunk is put back into the
+// free chunks.
+//
+// To free an allocation, it marks the allocation's chunk as free and merges
+// it with the previous and the next chunk when possible.
+class BestFitAllocator : public Allocator {
+ public:
+  explicit BestFitAllocator(Allocation* allocation);
+
+  void* BasePtr() const { return allocation_->ptr(); }
+
+  const platform::Place& Place() const { return allocation_->place(); }
+
+  size_t NumFreeChunks() const;
+
+ private:
+  size_t FreeSize() const;
+  using MapIt = typename details::FreeChunkBin::value_type::iterator;
+  using ListIt = typename details::ChunkList::iterator;
+
+  ListIt SplitChunk(size_t request_size, size_t free_chunk_offset,
+                    MapIt bin_iterator);
+  void EraseFreeNode(const ListIt& it);
+  void InsertFreeNode(const ListIt& it);
+
+ protected:
+  void Free(Allocation* allocation) override;
+  Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override;
+
+ private:
+  Allocation* allocation_;  // not owned
+  details::ChunkList chunks_;
+  details::FreeChunkBin free_chunks_;
+};
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/best_fit_allocator_test.cc b/paddle/fluid/memory/allocation/best_fit_allocator_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..20748a23a1951383c888d9b8d7a360ec941e50cb
--- /dev/null
+++ b/paddle/fluid/memory/allocation/best_fit_allocator_test.cc
@@ -0,0 +1,138 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/best_fit_allocator.h"
+#include <random>
+#include <thread>  // NOLINT
+#include <vector>
+#include "gtest/gtest.h"
+#include "paddle/fluid/memory/allocation/cpu_allocator.h"
+#include "paddle/fluid/memory/allocation/locked_allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+class StubAllocation : public Allocation {
+ public:
+  explicit StubAllocation(size_t size)
+      : Allocation(0, size, platform::CPUPlace()) {}
+};
+
+TEST(BestFitAllocator, test_allocation) {
+  StubAllocation stub(4UL * 1024 * 1024 * 1024);
+  BestFitAllocator allocator(&stub);
+  { auto allocation = allocator.Allocate(64, allocator.kDefault); }
+
+  {
+    auto allocation = allocator.Allocate(80, allocator.kDefault);
+
+    {
+      auto best_fit_allocation =
+          dynamic_cast<BestFitAllocation*>(allocation.get());
+      ASSERT_NE(best_fit_allocation, nullptr);
+      ASSERT_FALSE(best_fit_allocation->ChunkIterator()->is_free);
+      ASSERT_EQ(best_fit_allocation->ChunkIterator()->offset_, 0);
+      ASSERT_EQ(allocation->size(), 80);
+      ASSERT_EQ(allocation->ptr(), nullptr);
+    }
+
+    auto allocation2 = allocator.Allocate(60, allocator.kDefault);
+    auto allocation3 = allocator.Allocate(90, allocator.kDefault);
+    allocation2.reset();
+    allocation2 = allocator.Allocate(30, allocator.kDefault);
+
+    {
+      auto best_fit_allocation =
+          dynamic_cast<BestFitAllocation*>(allocation2.get());
+      ASSERT_EQ(best_fit_allocation->ChunkIterator()->offset_, 80);
+    }
+    allocation2.reset();
+    allocation2 = allocator.Allocate(60, allocator.kDefault);
+
+    {
+      auto best_fit_allocation =
+          dynamic_cast<BestFitAllocation*>(allocation2.get());
+      ASSERT_EQ(best_fit_allocation->ChunkIterator()->offset_, 80);
+    }
+
+    allocation.reset();
+    allocation2.reset();
+
+    allocation = allocator.Allocate(80 + 60, allocator.kDefault);
+    {
+      auto best_fit_allocation =
+          dynamic_cast<BestFitAllocation*>(allocation.get());
+      ASSERT_EQ(best_fit_allocation->ChunkIterator()->offset_, 0);
+    }
+
+    allocation.reset();
+
+    allocation = allocator.Allocate(80, allocator.kDefault);
+    allocation2 = allocator.Allocate(60, allocator.kDefault);
+    allocation = nullptr;
+    allocation2 = nullptr;
+    allocation3 = nullptr;
+
+    ASSERT_EQ(allocator.NumFreeChunks(), 1U);
+  }
+}
+
+TEST(BestFitAllocator, test_concurrent_cpu_allocation) {
+  CPUAllocator allocator;
+  auto global_allocation =
+      allocator.Allocate(256UL * 1024 * 1024, allocator.kDefault);
+
+  std::unique_ptr<Allocator> best_fit_allocator(
+      new BestFitAllocator(global_allocation.get()));
+
+  LockedAllocator locked_allocator(std::move(best_fit_allocator));
+
+  auto th_main = [&] {
+    std::random_device dev;
+    std::default_random_engine engine(dev());
+    std::uniform_int_distribution<size_t> dist(1U, 1024U);
+
+    for (size_t i = 0; i < 128; ++i) {
+      size_t allocate_size = dist(engine);
+
+      auto allocation = locked_allocator.Allocate(
+          sizeof(size_t) * allocate_size, locked_allocator.kDefault);
+
+      size_t* data = reinterpret_cast<size_t*>(allocation->ptr());
+
+      for (size_t j = 0; j < allocate_size; ++j) {
+        data[j] = j;
+      }
+      std::this_thread::yield();
+
+      for (size_t j = 0; j < allocate_size; ++j) {
+        ASSERT_EQ(data[j], j);
+      }
+    }
+  };
+  {
+    std::vector<std::thread> threads;
+    for (size_t i = 0; i < 1024; ++i) {
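+      // Spawn 1024 threads, each doing 128 allocate/write/verify rounds
+      // against the shared LockedAllocator-wrapped best-fit arena.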
+      threads.emplace_back(th_main);
+    }
+    for (auto& th : threads) {
+      th.join();
+    }
+  }
+}
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/best_fit_allocator_test.cu b/paddle/fluid/memory/allocation/best_fit_allocator_test.cu
new file mode 100644
index 0000000000000000000000000000000000000000..f7f17e1d36e0adef0b0eb7a43715836db4b7927d
--- /dev/null
+++ b/paddle/fluid/memory/allocation/best_fit_allocator_test.cu
@@ -0,0 +1,88 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include <array>
+#include <random>
+#include <thread>  // NOLINT
+#include <vector>
+#include "gtest/gtest.h"
+#include "paddle/fluid/memory/allocation/best_fit_allocator.h"
+#include "paddle/fluid/memory/allocation/cuda_allocator.h"
+#include "paddle/fluid/memory/allocation/locked_allocator.h"
+#include "paddle/fluid/memory/memcpy.h"
+#include "paddle/fluid/platform/for_range.h"
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+struct ForEachFill {
+  size_t* ptr_;
+
+  explicit ForEachFill(size_t* ptr) : ptr_(ptr) {}
+
+  __device__ void operator()(size_t i) { ptr_[i] = i; }
+};
+
+TEST(BestFitAllocator, concurrent_cuda) {
+  CUDAAllocator allocator(platform::CUDAPlace(0));
+  // 256 MB
+  auto cuda_allocation =
+      allocator.Allocate(256U * 1024 * 1024, allocator.kDefault);
+  LockedAllocator concurrent_allocator(
+      std::unique_ptr<Allocator>(new BestFitAllocator(cuda_allocation.get())));
+
+  auto th_main = [&] {
+    std::random_device dev;
+    std::default_random_engine engine(dev());
+    std::uniform_int_distribution<size_t> dist(1U, 1024U);
+    platform::CUDAPlace gpu(0);
+    platform::CUDADeviceContext dev_ctx(gpu);
+    std::array<size_t, 1024> buf;
+    for (size_t i = 0; i < 128; ++i) {
+      size_t allocate_size = dist(engine);
+
+      auto allocation = concurrent_allocator.Allocate(
+          sizeof(size_t) * allocate_size, concurrent_allocator.kDefault);
+
+      size_t* data = reinterpret_cast<size_t*>(allocation->ptr());
+
+      ForEachFill fill(data);
+      platform::ForRange<platform::CUDADeviceContext> for_range(dev_ctx,
+                                                                allocate_size);
+      for_range(fill);
+
+      memory::Copy(platform::CPUPlace(), buf.data(), gpu, data,
+                   sizeof(size_t) * allocate_size, dev_ctx.stream());
+
+      dev_ctx.Wait();
+      for (size_t j = 0; j < allocate_size; ++j) {
+        ASSERT_EQ(buf[j], j);
+      }
+      allocation = nullptr;
+    }
+  };
+
+  {
+    std::vector<std::thread> threads;
+    for (size_t i = 0; i < 1024; ++i) {
+      threads.emplace_back(th_main);
+    }
+    for (auto& th : threads) {
+      th.join();
+    }
+  }
+}
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/buffered_allocator.cc b/paddle/fluid/memory/allocation/buffered_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..fc75abc9dfee6c9df5bc87faa493002cc1fe6298
--- /dev/null
+++ b/paddle/fluid/memory/allocation/buffered_allocator.cc
@@ -0,0 +1,80 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/buffered_allocator.h"
+#include <algorithm>
+#include <limits>
+#include <utility>
+#include "paddle/fluid/memory/allocation/allocation_with_underlying.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+BufferedAllocator::BufferedAllocator(std::unique_ptr<Allocator> &&allocator)
    : underlying_allocator_(std::move(allocator)) {
+  PADDLE_ENFORCE_NOT_NULL(
+      underlying_allocator_,
+      "Underlying allocator of BufferedAllocator must not be null");
+  if (underlying_allocator_->IsAllocThreadSafe()) {
+    mtx_.reset(new std::mutex());
+  }
+}
+
+BufferedAllocator::~BufferedAllocator() { FreeCache(-1UL); }
+
+void BufferedAllocator::FreeCache(size_t size) {
+  platform::LockGuardPtr<std::mutex> guard(mtx_);
+  if (UNLIKELY(size == 0)) return;
+  size_t cur = 0;
+  while (!allocations_.empty()) {  // free the largest
+    auto it = --allocations_.end();
+    cur += it->second->size();
+    delete it->second.release();
+    allocations_.erase(it);
+    if (cur >= size) return;
+  }
+}
+
+bool BufferedAllocator::IsAllocThreadSafe() const {
+  return this->underlying_allocator_->IsAllocThreadSafe();
+}
+void BufferedAllocator::Free(Allocation *allocation) {
+  platform::LockGuardPtr<std::mutex> guard(mtx_);
+  allocations_.emplace(allocation->size(), AllocationPtr(allocation));
+}
+Allocation *BufferedAllocator::AllocateImpl(size_t size, Allocator::Attr attr) {
+  {
+    platform::LockGuardPtr<std::mutex> guard(mtx_);
+    auto it = allocations_.lower_bound(size);
+    if (it != allocations_.end() && it->first < size * 2) {
+      AllocationPtr result(std::move(it->second));
+      allocations_.erase(it);
+      return new AllocationWithUnderlying(std::move(result));
+    }
+  }
+
+  try {
+    return new AllocationWithUnderlying(
+        underlying_allocator_->Allocate(size, attr));
+  } catch (BadAlloc &) {
+    FreeCache(size);
+    return new AllocationWithUnderlying(
+        underlying_allocator_->Allocate(size, attr));
+  }
+}
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/buffered_allocator.h b/paddle/fluid/memory/allocation/buffered_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..d44a3f85beba712b1e735ba14008689bce7d0d64
--- /dev/null
+++ b/paddle/fluid/memory/allocation/buffered_allocator.h
@@ -0,0 +1,58 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include <map>
+#include <memory>
+#include <mutex>
+#include <vector>
+#include "paddle/fluid/memory/allocation/allocator.h"
+#include "paddle/fluid/platform/lock_guard_ptr.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+// NOTE(zjl): BufferedAllocator maintains a memory pool to accelerate
+// memory allocation and reuse memory.
+// BufferedAllocator provides the same thread-safety level as
+// underlying_allocator_.
+class BufferedAllocator : public Allocator {
+ public:
+  explicit BufferedAllocator(std::unique_ptr<Allocator> &&allocator);
+
+  ~BufferedAllocator();
+
+  bool IsAllocThreadSafe() const override;
+
+  // only used in unittest
+  inline void ClearCache() { FreeCache(-1UL); }
+
+ private:
+  void FreeCache(size_t size);
+
+ protected:
+  void Free(Allocation *allocation) override;
+  Allocation *AllocateImpl(size_t size, Allocator::Attr attr) override;
+
+ private:
+  std::unique_ptr<Allocator> underlying_allocator_;
+  std::multimap<size_t, AllocationPtr> allocations_;
+  std::unique_ptr<std::mutex> mtx_;
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/buffered_allocator_test.cc b/paddle/fluid/memory/allocation/buffered_allocator_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..41ebb9dbeaf36eafe3dff4ae294b84427f660cbf
--- /dev/null
+++ b/paddle/fluid/memory/allocation/buffered_allocator_test.cc
@@ -0,0 +1,144 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
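+
+// Behaviour sketch (matches the tests below): Free() does not return memory
+// to the underlying allocator; it parks the block in a size-keyed cache.
+// A later Allocate(size) may reuse a cached block whose size falls in
+// [size, 2 * size), and ClearCache() or destruction flushes everything.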
diff --git a/paddle/fluid/memory/allocation/buffered_allocator_test.cc b/paddle/fluid/memory/allocation/buffered_allocator_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..41ebb9dbeaf36eafe3dff4ae294b84427f660cbf
--- /dev/null
+++ b/paddle/fluid/memory/allocation/buffered_allocator_test.cc
@@ -0,0 +1,144 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/buffered_allocator.h"
+#include <gtest/gtest.h>
+#include "paddle/fluid/memory/allocation/best_fit_allocator.h"
+#include "paddle/fluid/memory/allocation/cpu_allocator.h"
+#include "paddle/fluid/memory/allocation/locked_allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+inline std::unique_ptr<BufferedAllocator> GetBufferedAllocator(
+    Allocation *allocation, bool thread_safe) {
+  std::unique_ptr<Allocator> allocator(new BestFitAllocator(allocation));
+  if (thread_safe) {
+    allocator.reset(new LockedAllocator(std::move(allocator)));
+  }
+
+  return std::unique_ptr<BufferedAllocator>(
+      new BufferedAllocator(std::move(allocator)));
+}
+
+TEST(buffered_allocator, thread_safety) {
+  std::unique_ptr<CPUAllocator> allocator(new CPUAllocator());
+  auto chunk = allocator->Allocate(1 << 20, allocator->kDefault);
+  {
+    auto buf_allocator = GetBufferedAllocator(chunk.get(), true);
+    ASSERT_EQ(buf_allocator->IsAllocThreadSafe(), true);
+  }
+
+  {
+    auto buf_allocator = GetBufferedAllocator(chunk.get(), false);
+    ASSERT_EQ(buf_allocator->IsAllocThreadSafe(), false);
+  }
+}
+
+class StubAllocation : public Allocation {
+ public:
+  using Allocation::Allocation;
+};
+
+class StubAllocator : public Allocator {
+ public:
+  void ResetCounter() {
+    construct_count_ = 0;
+    destruct_count_ = 0;
+  }
+
+  size_t GetAllocCount() const { return construct_count_; }
+
+  size_t GetFreeCount() const { return destruct_count_; }
+
+ protected:
+  void Free(Allocation *allocation) override {
+    auto *alloc = dynamic_cast<StubAllocation *>(allocation);
+    PADDLE_ENFORCE_NOT_NULL(alloc);
+    if (alloc->ptr()) delete[] static_cast<uint8_t *>(alloc->ptr());
+    ++destruct_count_;
+    delete allocation;
+  }
+  Allocation *AllocateImpl(size_t size, Allocator::Attr attr) override {
+    ++construct_count_;
+    if (size == 0) {
+      return new StubAllocation(nullptr, 0, platform::CPUPlace());
+    } else {
+      return new StubAllocation(new uint8_t[size], size, platform::CPUPlace());
+    }
+  }
+
+ private:
+  size_t construct_count_ = 0;
+  size_t destruct_count_ = 0;
+};
+
+constexpr size_t kZero = 0;
+constexpr size_t kOne = 1;
+constexpr size_t kTwo = 2;
+
+TEST(buffered_allocator, lazy_free) {
+  std::unique_ptr<StubAllocator> stub_allocator(new StubAllocator());
+  auto *underlying_allocator = stub_allocator.get();
+  std::unique_ptr<BufferedAllocator> allocator(
+      new BufferedAllocator(std::move(stub_allocator)));
+
+  {
+    underlying_allocator->ResetCounter();
+    auto x = allocator->Allocate(1025, allocator->kDefault);
+    ASSERT_EQ(underlying_allocator->GetAllocCount(), kOne);
+    ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero);
+    x = nullptr;
+    ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero);
+  }
+
+  {
+    underlying_allocator->ResetCounter();
+    auto x = allocator->Allocate(900, allocator->kDefault);
+    ASSERT_EQ(underlying_allocator->GetAllocCount(), kZero);
+    ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero);
+    auto y = allocator->Allocate(2048, allocator->kDefault);
+    ASSERT_EQ(underlying_allocator->GetAllocCount(), kOne);
+    ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero);
+    x = nullptr;
+    ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero);
+    y = nullptr;
+    ASSERT_EQ(underlying_allocator->GetFreeCount(), kZero);
+  }
+
+  {
+    underlying_allocator->ResetCounter();
+    allocator->ClearCache();
+    ASSERT_EQ(underlying_allocator->GetAllocCount(), kZero);
+    ASSERT_EQ(underlying_allocator->GetFreeCount(), kTwo);
+  }
+}
+
+TEST(buffered_allocator, garbage_collection) {
+  std::unique_ptr<CPUAllocator> cpu_allocator(new CPUAllocator());
+  auto chunk = cpu_allocator->Allocate(2048, cpu_allocator->kDefault);
+  auto allocator = GetBufferedAllocator(chunk.get(), false);
+  auto x1 = allocator->Allocate(1600, allocator->kDefault);
+  auto x2 = allocator->Allocate(400, allocator->kDefault);
+  x1 = nullptr;
+  x2 = nullptr;
+  auto x3 = allocator->Allocate(1600, allocator->kDefault);
+  ASSERT_NE(x3, nullptr);
+  ASSERT_NE(x3->ptr(), nullptr);
+}
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
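The GetBufferedAllocator helper above layers wrappers by re-seating a unique_ptr over its own moved-from value; the argument is fully constructed before reset() releases the old pointer, so the pattern is well defined. A minimal standalone sketch of the same trick, with generic stand-in types rather than the patch's classes:

#include <cstdio>
#include <memory>
#include <utility>

struct Layer {
  virtual ~Layer() = default;
  virtual int Depth() const { return 0; }
};

// Stands in for decorators such as LockedAllocator / BufferedAllocator.
struct Wrapper : Layer {
  explicit Wrapper(std::unique_ptr<Layer> inner) : inner_(std::move(inner)) {}
  int Depth() const override { return 1 + inner_->Depth(); }
  std::unique_ptr<Layer> inner_;
};

int main() {
  std::unique_ptr<Layer> layer(new Layer());
  // The new Wrapper takes ownership via std::move *before* reset() runs,
  // so wrapping a pointer "over itself" is safe.
  layer.reset(new Wrapper(std::move(layer)));
  layer.reset(new Wrapper(std::move(layer)));
  std::printf("depth = %d\n", layer->Depth());  // prints: depth = 2
  return 0;
}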
diff --git a/paddle/fluid/memory/allocation/conditional_allocator.cc b/paddle/fluid/memory/allocation/conditional_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..96a818e03e507c6de720344288312dc2af2ae647
--- /dev/null
+++ b/paddle/fluid/memory/allocation/conditional_allocator.cc
@@ -0,0 +1,48 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/conditional_allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+ConditionalAllocator& ConditionalAllocator::AddAllocator(
+    std::function<bool(size_t, Allocator::Attr)> func,
+    std::shared_ptr<Allocator> allocator) {
+  underlying_allocators_.emplace_back(std::move(func), std::move(allocator));
+  return *this;
+}
+
+bool ConditionalAllocator::IsAllocThreadSafe() const {
+  return std::all_of(underlying_allocators_.begin(),
+                     underlying_allocators_.end(),
+                     [](const AllocatorWithCond& allocatorWithCond) {
+                       return allocatorWithCond.second->IsAllocThreadSafe();
+                     });
+}
+
+Allocation* ConditionalAllocator::AllocateImpl(size_t size,
+                                               Allocator::Attr attr) {
+  for (auto& pair : underlying_allocators_) {
+    if (pair.first(size, attr)) {
+      return pair.second->Allocate(size, attr).release();
+    }
+  }
+  throw BadAlloc("No suitable allocator");
+}
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/conditional_allocator.h b/paddle/fluid/memory/allocation/conditional_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..94cba4432ed4f72c0a75da9b31d48611a8404ad3
--- /dev/null
+++ b/paddle/fluid/memory/allocation/conditional_allocator.h
@@ -0,0 +1,61 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+#include <functional>
+#include <utility>
+#include <vector>
+#include "paddle/fluid/memory/allocation/allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+// A composite allocator that dispatches each allocation request according to
+// the registered conditions.
+//
+// For example:
+//
+//   auto* cond_allocator = new ConditionalAllocator();
+//   cond_allocator->AddAllocator([](size_t size, Attr attr){
+//     // if size > 10
+//     return size > 10;
+//   }, allocator_a).AddAllocator([](size_t size, Attr attr){
+//     // elif attr is kDefault
+//     return attr == kDefault;
+//   }, allocator_b).AddAllocator([](size_t size, Attr attr){
+//     // else
+//     return true;
+//   }, allocator_c);
+class ConditionalAllocator : public Allocator {
+ public:
+  ConditionalAllocator() = default;
+
+  ConditionalAllocator& AddAllocator(
+      std::function<bool(size_t, Allocator::Attr)> func,
+      std::shared_ptr<Allocator> allocator);
+
+  bool IsAllocThreadSafe() const override;
+
+ protected:
+  Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override;
+
+ private:
+  using AllocatorWithCond =
+      std::pair<std::function<bool(size_t, Allocator::Attr)>,
+                std::shared_ptr<Allocator>>;
+  std::vector<AllocatorWithCond> underlying_allocators_;
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
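ConditionalAllocator walks its registered (predicate, allocator) pairs in order and the first predicate that accepts the request wins. The same first-match dispatch in a self-contained sketch, with hypothetical Predicate/Handler names rather than the patch's API:

#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

using Predicate = std::function<bool(size_t)>;
using Handler = std::function<void(size_t)>;

int main() {
  // Register conditions in priority order; the last one acts as "else".
  std::vector<std::pair<Predicate, Handler>> chain;
  chain.emplace_back([](size_t n) { return n > 10; },
                     [](size_t n) { std::printf("large: %zu\n", n); });
  chain.emplace_back([](size_t) { return true; },
                     [](size_t n) { std::printf("small: %zu\n", n); });

  for (size_t request : {4, 64}) {
    for (auto &p : chain) {
      if (p.first(request)) {  // first matching condition wins
        p.second(request);
        break;
      }
    }
  }
  return 0;
}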
diff --git a/paddle/fluid/memory/allocation/cpu_allocator.cc b/paddle/fluid/memory/allocation/cpu_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..cc81a6f7b8b1950b07b6fb1571b53d9b5ddb1b9f
--- /dev/null
+++ b/paddle/fluid/memory/allocation/cpu_allocator.cc
@@ -0,0 +1,45 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/cpu_allocator.h"
+#include <stdlib.h>
+#include <string>
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+CPUAllocation::CPUAllocation(void *ptr, size_t size)
+    : Allocation(ptr, size, platform::CPUPlace()) {}
+
+bool CPUAllocator::IsAllocThreadSafe() const { return true; }
+
+void CPUAllocator::Free(Allocation *allocation) {
+  PADDLE_ENFORCE_NOT_NULL(dynamic_cast<CPUAllocation *>(allocation));
+  free(allocation->ptr());
+  delete allocation;
+}
+
+Allocation *CPUAllocator::AllocateImpl(size_t size, Allocator::Attr attr) {
+  void *ptr;
+  auto status = posix_memalign(&ptr, kAlignment, size);
+  if (UNLIKELY(status != 0)) {
+    throw BadAlloc(string::Sprintf("Cannot allocate cpu memory %d. Errno is %d",
+                                   size, status));
+  }
+  return new CPUAllocation(ptr, size);
+}
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
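AllocateImpl above relies on posix_memalign reporting failure through its return status rather than through errno, so the status must be checked before the pointer is used. A minimal standalone illustration of that contract (the 64-byte alignment mirrors CPUAllocator::kAlignment):

#include <stdio.h>
#include <stdlib.h>

int main() {
  void *ptr = nullptr;
  const size_t kAlignment = 64;  // must be a power of two multiple of void*
  const size_t kSize = 1 << 20;

  int status = posix_memalign(&ptr, kAlignment, kSize);
  if (status != 0) {  // e.g. ENOMEM, or EINVAL for a bad alignment
    fprintf(stderr, "posix_memalign failed with status %d\n", status);
    return 1;
  }
  printf("allocated %zu bytes at %p\n", kSize, ptr);
  free(ptr);  // memory from posix_memalign is released with free()
  return 0;
}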
diff --git a/paddle/fluid/memory/allocation/cpu_allocator.h b/paddle/fluid/memory/allocation/cpu_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..26d3643f4edff1f2d71b1c761e915a6dacb485ad
--- /dev/null
+++ b/paddle/fluid/memory/allocation/cpu_allocator.h
@@ -0,0 +1,51 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+#include "paddle/fluid/memory/allocation/allocator.h"
+
+#ifdef _WIN32
+#define posix_memalign_free _aligned_free
+#define posix_memalign(p, a, s) \
+  (((*(p)) = _aligned_malloc((s), (a))), *(p) ? 0 : errno)
+#endif
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+// CPU system allocator and allocation.
+//
+// NOTE(yy): Should we just use `malloc` here, since there is an
+// aligned_allocator?
+//
+// NOTE(yy): There is no need to use `BestFitAllocator` on CPU. We can import
+// an open-sourced allocator into Paddle.
+class CPUAllocator;
+class CPUAllocation : public Allocation {
+ public:
+  CPUAllocation(void* ptr, size_t size);
+};
+
+class CPUAllocator : public Allocator {
+ public:
+  constexpr static size_t kAlignment = 64u;
+  bool IsAllocThreadSafe() const override;
+
+ protected:
+  void Free(Allocation* allocation) override;
+  Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override;
+};
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/cuda_allocator.cc b/paddle/fluid/memory/allocation/cuda_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..430bf0be98e08787ac4412a8b6e0fcc310ffe2b4
--- /dev/null
+++ b/paddle/fluid/memory/allocation/cuda_allocator.cc
@@ -0,0 +1,48 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/cuda_allocator.h"
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include <string>
+#include "paddle/fluid/platform/cuda_device_guard.h"
+#include "paddle/fluid/platform/gpu_info.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+bool CUDAAllocator::IsAllocThreadSafe() const { return true; }
+void CUDAAllocator::Free(Allocation* allocation) {
+  platform::CUDADeviceGuard guard(place_.device);
+  auto* cuda_allocation = dynamic_cast<CUDAAllocation*>(allocation);
+  PADDLE_ENFORCE_NOT_NULL(cuda_allocation);
+  PADDLE_ENFORCE_EQ(boost::get<platform::CUDAPlace>(cuda_allocation->place()),
+                    place_);
+  PADDLE_ENFORCE(cudaFree(allocation->ptr()));
+  delete allocation;
+}
+Allocation* CUDAAllocator::AllocateImpl(size_t size, Allocator::Attr attr) {
+  platform::CUDADeviceGuard guard(place_.device);
+  void* ptr;
+  auto status = cudaMalloc(&ptr, size);
+  if (UNLIKELY(status != cudaSuccess)) {
+    throw BadAlloc(string::Sprintf(
+        "Cannot allocate %d on GPU %d, cuda status %d, %s", size, place_.device,
+        status, cudaGetErrorString(status)));
+  }
+  return new CUDAAllocation(ptr, size, platform::Place(place_));
+}
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/cuda_allocator.h b/paddle/fluid/memory/allocation/cuda_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..63726f5820b1c81565117c7a9bf798c17c9681f6
--- /dev/null
+++ b/paddle/fluid/memory/allocation/cuda_allocator.h
@@ -0,0 +1,47 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+#include "paddle/fluid/memory/allocation/allocator.h"
+#include "paddle/fluid/platform/place.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+// CUDA system allocator and allocation.
+// Just a flag type.
+class CUDAAllocation : public Allocation {
+ public:
+  using Allocation::Allocation;
+};
+
+class CUDAAllocator : public Allocator {
+ public:
+  explicit CUDAAllocator(const platform::CUDAPlace& place) : place_(place) {}
+  explicit CUDAAllocator(const platform::Place& place)
+      : place_(boost::get<platform::CUDAPlace>(place)) {}
+  bool IsAllocThreadSafe() const override;
+
+ protected:
+  void Free(Allocation* allocation) override;
+  Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override;
+
+ private:
+  platform::CUDAPlace place_;
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
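The CUDA path above converts a failed cudaMalloc status into a BadAlloc message via cudaGetErrorString. A minimal standalone sketch of the same status-checking pattern against the CUDA runtime API (not the patch's code; the device-pinning line stands in for CUDADeviceGuard):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int device = 0;
  cudaSetDevice(device);  // analogous to CUDADeviceGuard pinning the device

  void *ptr = nullptr;
  size_t size = size_t(1) << 30;  // 1 GiB; may well fail on a busy GPU
  cudaError_t status = cudaMalloc(&ptr, size);
  if (status != cudaSuccess) {
    // Same ingredients as the BadAlloc message: size, device, status, text.
    std::fprintf(stderr, "Cannot allocate %zu on GPU %d, cuda status %d, %s\n",
                 size, device, static_cast<int>(status),
                 cudaGetErrorString(status));
    return 1;
  }
  cudaFree(ptr);
  return 0;
}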
diff --git a/paddle/fluid/memory/allocation/legacy_allocator.cc b/paddle/fluid/memory/allocation/legacy_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e66537272340e89fe1075325323909213bbe97b8
--- /dev/null
+++ b/paddle/fluid/memory/allocation/legacy_allocator.cc
@@ -0,0 +1,307 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/legacy_allocator.h"
+#include <string>
+#include "glog/logging.h"
+#include "paddle/fluid/memory/detail/buddy_allocator.h"
+#include "paddle/fluid/memory/detail/system_allocator.h"
+#include "paddle/fluid/platform/gpu_info.h"
+#include "paddle/fluid/string/printf.h"
+
+DEFINE_bool(init_allocated_mem, false,
+            "It is a mistake that the values of the memory allocated by "
+            "BuddyAllocator are always zeroed in some op's implementation. "
+            "To find this error in time, we use init_allocated_mem to indicate "
+            "that initializing the allocated memory with a small value "
+            "during unit testing.");
+DECLARE_double(fraction_of_gpu_memory_to_use);
+
+namespace paddle {
+namespace memory {
+namespace legacy {
+template <typename Place>
+void *Alloc(const Place &place, size_t size);
+
+template <typename Place>
+void Free(const Place &place, void *p);
+
+template <typename Place>
+size_t Used(const Place &place);
+
+struct Usage : public boost::static_visitor<size_t> {
+  size_t operator()(const platform::CPUPlace &cpu) const;
+  size_t operator()(const platform::CUDAPlace &gpu) const;
+  size_t operator()(const platform::CUDAPinnedPlace &cuda_pinned) const;
+};
+
+size_t memory_usage(const platform::Place &p);
+
+using BuddyAllocator = detail::BuddyAllocator;
+
+BuddyAllocator *GetCPUBuddyAllocator() {
+  // We tried thread_local for the inference::RNN1 model, but it does not
+  // help much in the multi-thread test.
+  static std::once_flag init_flag;
+  static detail::BuddyAllocator *a = nullptr;
+
+  std::call_once(init_flag, []() {
+    a = new detail::BuddyAllocator(
+        std::unique_ptr<detail::SystemAllocator>(new detail::CPUAllocator),
+        platform::CpuMinChunkSize(), platform::CpuMaxChunkSize());
+  });
+
+  return a;
+}
+
+// We compared the NaiveAllocator with BuddyAllocator for CPU memory
+// allocation; their overhead is almost the same.
+struct NaiveAllocator {
+  void *Alloc(size_t size) { return malloc(size); }
+
+  void Free(void *p) {
+    PADDLE_ENFORCE(p);
+    free(p);
+  }
+
+  static NaiveAllocator *Instance() {
+    static NaiveAllocator x;
+    return &x;
+  }
+
+ private:
+  std::mutex lock_;
+};
+
+template <>
+void *Alloc<platform::CPUPlace>(const platform::CPUPlace &place, size_t size) {
+  VLOG(10) << "Allocate " << size << " bytes on " << platform::Place(place);
+  void *p = GetCPUBuddyAllocator()->Alloc(size);
+  if (FLAGS_init_allocated_mem) {
+    memset(p, 0xEF, size);
+  }
+  VLOG(100) << " pointer=" << p;
+  return p;
+}
+
+template <>
+void Free<platform::CPUPlace>(const platform::CPUPlace &place, void *p) {
+  VLOG(10) << "Free pointer=" << p << " on " << platform::Place(place);
+  GetCPUBuddyAllocator()->Free(p);
+}
+
+template <>
+size_t Used<platform::CPUPlace>(const platform::CPUPlace &place) {
+  return GetCPUBuddyAllocator()->Used();
+}
+
+#ifdef PADDLE_WITH_CUDA
+BuddyAllocator *GetGPUBuddyAllocator(int gpu_id) {
+  static std::once_flag init_flag;
+  static detail::BuddyAllocator **a_arr = nullptr;
+
+  std::call_once(init_flag, [gpu_id]() {
+    int gpu_num = platform::GetCUDADeviceCount();
+    PADDLE_ENFORCE(gpu_id < gpu_num, "gpu_id:%d should < gpu_num:%d", gpu_id,
+                   gpu_num);
+
+    a_arr = new BuddyAllocator *[gpu_num];
+    for (int i = 0; i < gpu_num; i++) {
+      a_arr[i] = nullptr;
+      platform::SetDeviceId(i);
+      a_arr[i] = new BuddyAllocator(
+          std::unique_ptr<detail::SystemAllocator>(new detail::GPUAllocator(i)),
+          platform::GpuMinChunkSize(), platform::GpuMaxChunkSize());
+
+      VLOG(100) << "\n\nNOTE: each GPU device use "
+                << FLAGS_fraction_of_gpu_memory_to_use * 100
+                << "% of GPU memory.\n"
+                << "You can set GFlags environment variable '"
+                << "FLAGS_fraction_of_gpu_memory_to_use"
+                << "' to change the fraction of GPU usage.\n\n";
+    }
+  });
+
+  platform::SetDeviceId(gpu_id);
+  return a_arr[gpu_id];
+}
+#endif
+
+template <>
+size_t Used<platform::CUDAPlace>(const platform::CUDAPlace &place) {
+#ifdef PADDLE_WITH_CUDA
+  return GetGPUBuddyAllocator(place.device)->Used();
+#else
+  PADDLE_THROW("'CUDAPlace' is not supported in CPU only device.");
+#endif
+}
+
+template <>
+void *Alloc<platform::CUDAPlace>(const platform::CUDAPlace &place,
+                                 size_t size) {
+#ifdef PADDLE_WITH_CUDA
+  auto *buddy_allocator = GetGPUBuddyAllocator(place.device);
+  auto *ptr = buddy_allocator->Alloc(size);
+  if (ptr == nullptr) {
+    int cur_dev = platform::GetCurrentDeviceId();
+    platform::SetDeviceId(place.device);
+    size_t avail, total;
+    platform::GpuMemoryUsage(&avail, &total);
+    LOG(WARNING) << "Cannot allocate " << string::HumanReadableSize(size)
+                 << " in GPU " << place.device << ", available "
+                 << string::HumanReadableSize(avail);
+    LOG(WARNING) << "total " << total;
+    LOG(WARNING) << "GpuMinChunkSize "
+                 << string::HumanReadableSize(
+                        buddy_allocator->GetMinChunkSize());
+    LOG(WARNING) << "GpuMaxChunkSize "
+                 << string::HumanReadableSize(
+                        buddy_allocator->GetMaxChunkSize());
+    LOG(WARNING) << "GPU memory used: "
+                 << string::HumanReadableSize(Used(place));
+    platform::SetDeviceId(cur_dev);
+  }
+  if (FLAGS_init_allocated_mem) {
+    cudaMemset(ptr, 0xEF, size);
+  }
+  return ptr;
+#else
+  PADDLE_THROW("'CUDAPlace' is not supported in CPU only device.");
+#endif
+}
+
+template <>
+void Free<platform::CUDAPlace>(const platform::CUDAPlace &place, void *p) {
+#ifdef PADDLE_WITH_CUDA
+  GetGPUBuddyAllocator(place.device)->Free(p);
+#else
+  PADDLE_THROW("'CUDAPlace' is not supported in CPU only device.");
+#endif
+}
+
+#ifdef PADDLE_WITH_CUDA
+BuddyAllocator *GetCUDAPinnedBuddyAllocator() {
+  static std::once_flag init_flag;
+  static BuddyAllocator *ba = nullptr;
+
+  std::call_once(init_flag, []() {
+    ba = new BuddyAllocator(std::unique_ptr<detail::SystemAllocator>(
+                                new detail::CUDAPinnedAllocator),
+                            platform::CUDAPinnedMinChunkSize(),
+                            platform::CUDAPinnedMaxChunkSize());
+  });
+
+  return ba;
+}
+#endif
+template <>
+size_t Used<platform::CUDAPinnedPlace>(
+    const platform::CUDAPinnedPlace &place) {
+#ifdef PADDLE_WITH_CUDA
+  return GetCUDAPinnedBuddyAllocator()->Used();
+#else
+  PADDLE_THROW("'CUDAPinnedPlace' is not supported in CPU only device.");
+#endif
+}
+
+template <>
+void *Alloc<platform::CUDAPinnedPlace>(const platform::CUDAPinnedPlace &place,
+                                       size_t size) {
+#ifdef PADDLE_WITH_CUDA
+  auto *buddy_allocator = GetCUDAPinnedBuddyAllocator();
+  void *ptr = buddy_allocator->Alloc(size);
+
+  if (ptr == nullptr) {
+    LOG(WARNING) << "cudaMallocHost Cannot allocate " << size
+                 << " bytes in CUDAPinnedPlace";
+  }
+  if (FLAGS_init_allocated_mem) {
+    memset(ptr, 0xEF, size);
+  }
+  return ptr;
+#else
+  PADDLE_THROW("'CUDAPinnedPlace' is not supported in CPU only device.");
+#endif
+}
+
+template <>
+void Free<platform::CUDAPinnedPlace>(const platform::CUDAPinnedPlace &place,
+                                     void *p) {
+#ifdef PADDLE_WITH_CUDA
+  GetCUDAPinnedBuddyAllocator()->Free(p);
+#else
+  PADDLE_THROW("'CUDAPinnedPlace' is not supported in CPU only device.");
+#endif
+}
+
+struct AllocVisitor : public boost::static_visitor<void *> {
+  inline explicit AllocVisitor(size_t size) : size_(size) {}
+
+  template <typename Place>
+  inline void *operator()(const Place &place) const {
+    return Alloc<Place>(place, size_);
+  }
+
+ private:
+  size_t size_;
+};
+
+struct FreeVisitor : public boost::static_visitor<void> {
+  inline explicit FreeVisitor(void *ptr) : ptr_(ptr) {}
+
+  template <typename Place>
+  inline void operator()(const Place &place) const {
+    Free<Place>(place, ptr_);
+  }
+
+ private:
+  void *ptr_;
+};
+
+size_t Usage::operator()(const platform::CPUPlace &cpu) const {
+  return Used(cpu);
+}
+
+size_t Usage::operator()(const platform::CUDAPlace &gpu) const {
+#ifdef PADDLE_WITH_CUDA
+  return Used(gpu);
+#else
+  PADDLE_THROW("'CUDAPlace' is not supported in CPU only device.");
+#endif
+}
+
+size_t Usage::operator()(const platform::CUDAPinnedPlace &cuda_pinned) const {
+#ifdef PADDLE_WITH_CUDA
+  return Used(cuda_pinned);
+#else
+  PADDLE_THROW("'CUDAPinnedPlace' is not supported in CPU only device.");
+#endif
+}
+}  // namespace legacy
+
+namespace allocation {
+
+Allocation *LegacyAllocator::AllocateImpl(size_t size, Allocator::Attr attr) {
+  void *ptr = boost::apply_visitor(legacy::AllocVisitor(size), place_);
+  return new Allocation(ptr, size, place_);
+}
+
+void LegacyAllocator::Free(Allocation *allocation) {
+  boost::apply_visitor(legacy::FreeVisitor(allocation->ptr()),
+                       allocation->place());
+  delete allocation;
+}
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
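legacy_allocator.cc dispatches on the place type with boost::static_visitor and boost::apply_visitor. For readers more familiar with the standard library, the same pattern with C++17 std::variant/std::visit, which is only an analogue of what the patch uses, with toy place types:

#include <cstdio>
#include <variant>

struct CPUPlace {};
struct GPUPlace { int device; };
using Place = std::variant<CPUPlace, GPUPlace>;

// One visitor routes each "place" to the allocator that owns it.
struct AllocVisitor {
  size_t size;
  void *operator()(const CPUPlace &) const {
    std::printf("alloc %zu bytes on CPU\n", size);
    return nullptr;  // a real visitor would call the CPU allocator here
  }
  void *operator()(const GPUPlace &p) const {
    std::printf("alloc %zu bytes on GPU %d\n", size, p.device);
    return nullptr;  // ... and the per-device GPU allocator here
  }
};

int main() {
  Place place = GPUPlace{0};
  std::visit(AllocVisitor{1024}, place);  // dispatches on the held alternative
  return 0;
}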
diff --git a/paddle/fluid/memory/allocation/legacy_allocator.h b/paddle/fluid/memory/allocation/legacy_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..503a7a685cb9d8dbbbbd6c23b5b82c383893e3d8
--- /dev/null
+++ b/paddle/fluid/memory/allocation/legacy_allocator.h
@@ -0,0 +1,37 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+#include "paddle/fluid/memory/allocation/allocator.h"
+#include "paddle/fluid/platform/place.h"
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+class LegacyAllocatorPrivate;
+class LegacyAllocator : public Allocator {
+ public:
+  explicit LegacyAllocator(const platform::Place &p) : place_(p) {}
+
+ protected:
+  Allocation *AllocateImpl(size_t size, Allocator::Attr attr) override;
+  void Free(Allocation *allocation) override;
+
+ private:
+  platform::Place place_;
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/locked_allocator.cc b/paddle/fluid/memory/allocation/locked_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..835f6527c8a1d83340167bd9079f7cee25ad24cf
--- /dev/null
+++ b/paddle/fluid/memory/allocation/locked_allocator.cc
@@ -0,0 +1,48 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/locked_allocator.h"
+#include <mutex>  // NOLINT
+#include "paddle/fluid/memory/allocation/allocation_with_underlying.h"
+#include "paddle/fluid/platform/lock_guard_ptr.h"
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+bool LockedAllocator::IsAllocThreadSafe() const { return true; }
+
+LockedAllocator::LockedAllocator(
+    std::unique_ptr<Allocator> &&underlying_allocator)
+    : underlying_allocator_(std::move(underlying_allocator)) {
+  PADDLE_ENFORCE_NOT_NULL(underlying_allocator_);
+  if (!underlying_allocator_->IsAllocThreadSafe()) {
+    mtx_.reset(new std::mutex());
+  }
+}
+void LockedAllocator::Free(Allocation *allocation) {
+  {
+    platform::LockGuardPtr<std::mutex> guard(mtx_);
+    reinterpret_cast<AllocationWithUnderlying *>(allocation)
+        ->allocation_.reset();  // Destroy inner allocation
+  }
+  delete allocation;
+}
+Allocation *LockedAllocator::AllocateImpl(size_t size, Allocator::Attr attr) {
+  platform::LockGuardPtr<std::mutex> guard(mtx_);
+  return new AllocationWithUnderlying(
+      underlying_allocator_->Allocate(size, attr));
+}
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
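LockedAllocator only creates its mutex when the wrapped allocator is not already thread safe, and platform::LockGuardPtr locks only if a mutex actually exists. A simplified stand-in for that optional-locking guard (a hypothetical OptionalLockGuard, not Paddle's class):

#include <memory>
#include <mutex>

class OptionalLockGuard {
 public:
  explicit OptionalLockGuard(const std::unique_ptr<std::mutex> &mtx)
      : mtx_(mtx.get()) {
    if (mtx_) mtx_->lock();  // no-op when no mutex was ever created
  }
  ~OptionalLockGuard() {
    if (mtx_) mtx_->unlock();
  }
  OptionalLockGuard(const OptionalLockGuard &) = delete;
  OptionalLockGuard &operator=(const OptionalLockGuard &) = delete;

 private:
  std::mutex *mtx_;
};

int main() {
  std::unique_ptr<std::mutex> mtx;  // left null: the guard does nothing
  // mtx.reset(new std::mutex());   // created only when locking is needed
  OptionalLockGuard guard(mtx);
  return 0;
}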
diff --git a/paddle/fluid/memory/allocation/locked_allocator.h b/paddle/fluid/memory/allocation/locked_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..4967b9bb8d3ad101cff4657b0a45b49b76e2deb2
--- /dev/null
+++ b/paddle/fluid/memory/allocation/locked_allocator.h
@@ -0,0 +1,41 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#pragma once
+#include <memory>
+#include <mutex>   // NOLINT
+#include <thread>  // NOLINT
+#include "paddle/fluid/memory/allocation/allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+// An allocator that makes its underlying allocator thread safe.
+class LockedAllocator : public Allocator {
+ public:
+  explicit LockedAllocator(std::unique_ptr<Allocator> &&underlying_allocator);
+  bool IsAllocThreadSafe() const override;
+
+ protected:
+  void Free(Allocation *allocation) override;
+  Allocation *AllocateImpl(size_t size, Allocator::Attr attr) override;
+
+ private:
+  std::unique_ptr<Allocator> underlying_allocator_;
+  std::unique_ptr<std::mutex> mtx_;
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/pinned_allocator.cc b/paddle/fluid/memory/allocation/pinned_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..6ac3aefdd18d6d9a21dc7ce66511013dfb78bc5b
--- /dev/null
+++ b/paddle/fluid/memory/allocation/pinned_allocator.cc
@@ -0,0 +1,40 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/pinned_allocator.h"
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+bool CPUPinnedAllocator::IsAllocThreadSafe() const { return true; }
+void CPUPinnedAllocator::Free(Allocation *allocation) {
+  PADDLE_ENFORCE_NOT_NULL(dynamic_cast<CPUPinnedAllocation *>(allocation));
+  PADDLE_ENFORCE(cudaFreeHost(allocation->ptr()));
+  delete allocation;
+}
+Allocation *CPUPinnedAllocator::AllocateImpl(size_t size,
+                                             Allocator::Attr attr) {
+  // PADDLE_ENFORCE_EQ(
+  //     attr, kCrossDevice,
+  //     "CPUPinnedAllocator should be used for Cross-Device Communication");
+
+  void *ptr;
+  PADDLE_ENFORCE(cudaMallocHost(&ptr, size));
+  return new CPUPinnedAllocation(ptr, size);
+}
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
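CPUPinnedAllocator hands out page-locked host memory intended for cross-device transfers. A minimal standalone sketch of the underlying CUDA calls, assuming the standard cudaMallocHost/cudaFreeHost pairing used above:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  void *ptr = nullptr;
  size_t size = 1 << 20;
  // Pinned host memory cannot be paged out, so the GPU can DMA to and from
  // it faster than pageable malloc'd memory.
  cudaError_t status = cudaMallocHost(&ptr, size);
  if (status != cudaSuccess) {
    std::fprintf(stderr, "cudaMallocHost failed: %s\n",
                 cudaGetErrorString(status));
    return 1;
  }
  // ... stage host<->device transfers through `ptr` here ...
  cudaFreeHost(ptr);  // pinned memory is released with cudaFreeHost
  return 0;
}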
diff --git a/paddle/fluid/memory/allocation/pinned_allocator.h b/paddle/fluid/memory/allocation/pinned_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..26d12dd91c7fda31802226a84d883b6a6e9abbe4
--- /dev/null
+++ b/paddle/fluid/memory/allocation/pinned_allocator.h
@@ -0,0 +1,40 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+#include "paddle/fluid/memory/allocation/allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+// Allocator that uses `cudaMallocHost`.
+class CPUPinnedAllocation : public Allocation {
+ public:
+  CPUPinnedAllocation(void *ptr, size_t size)
+      : Allocation(ptr, size, platform::CUDAPinnedPlace()) {}
+};
+
+class CPUPinnedAllocator : public Allocator {
+ public:
+  bool IsAllocThreadSafe() const override;
+
+ protected:
+  void Free(Allocation *allocation) override;
+  Allocation *AllocateImpl(size_t size, Allocator::Attr attr) override;
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/retry_allocator.cc b/paddle/fluid/memory/allocation/retry_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..981705051b449e6a35c2dcce9138dc2efae52920
--- /dev/null
+++ b/paddle/fluid/memory/allocation/retry_allocator.cc
@@ -0,0 +1,75 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/retry_allocator.h"
+#include "paddle/fluid/memory/allocation/allocation_with_underlying.h"
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+bool RetryAllocator::IsAllocThreadSafe() const {
+  return underlying_allocator_->IsAllocThreadSafe();
+}
+
+void RetryAllocator::Free(Allocation* allocation) {
+  // Delete the underlying allocation first.
+  reinterpret_cast<AllocationWithUnderlying*>(allocation)->allocation_.reset();
+  {
+    // Notify all waiting threads: they can retry allocation after this free.
+    std::lock_guard<std::mutex> lock(mutex_);
+    cv_.notify_all();
+  }
+  delete allocation;
+}
+
+Allocation* RetryAllocator::AllocateImpl(size_t size, Allocator::Attr attr) {
+  auto alloc_func = [&, this]() {
+    return new AllocationWithUnderlying(
+        underlying_allocator_->Allocate(size, attr));
+  };
+  // We could unify the code paths for success and failure, but that would
+  // take the lock even when the first allocation succeeds.
+  try {
+    return alloc_func();
+  } catch (BadAlloc& bad_alloc) {
+    {
+      // The retry could be written inside the predicate passed to
+      // wait_until, but the predicate would then run while holding the
+      // lock. For better performance, we use a loop here instead.
+      auto end_time = std::chrono::high_resolution_clock::now() + retry_time_;
+      auto wait_until = [&, this] {
+        std::unique_lock<std::mutex> lock(mutex_);
+        return cv_.wait_until(lock, end_time);
+      };
+      while (wait_until() != std::cv_status::timeout) {
+        try {
+          return alloc_func();
+        } catch (BadAlloc& ex) {
+          bad_alloc = ex;
+        } catch (...) {
+          throw;
+        }
+      }
+
+      throw;  // rethrow the original exception or the latest bad_alloc
+    }
+  } catch (...) {
+    throw;
+  }
+}
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
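RetryAllocator::AllocateImpl sleeps on a condition variable that Free signals, retrying until the deadline passes. The standalone sketch below shows the same producer/consumer shape; note it uses the predicate form of wait_until (which cannot miss a wakeup), whereas the patch deliberately loops on the plain form so the allocation attempt runs outside the lock:

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex mu;
std::condition_variable cv;
bool capacity_available = false;  // set by the "free" path

// Stand-in for the underlying Allocate call that may fail.
bool TryAcquire() {
  std::lock_guard<std::mutex> lock(mu);
  return capacity_available;
}

int main() {
  // Producer: models RetryAllocator::Free notifying waiting allocators.
  std::thread freer([] {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    {
      std::lock_guard<std::mutex> lock(mu);
      capacity_available = true;
    }
    cv.notify_all();
  });

  auto deadline =
      std::chrono::steady_clock::now() + std::chrono::milliseconds(200);
  bool ok = TryAcquire();
  while (!ok) {
    {
      std::unique_lock<std::mutex> lock(mu);
      // Wait for a free event or the deadline; re-check under the lock.
      if (!cv.wait_until(lock, deadline, [] { return capacity_available; }))
        break;  // deadline reached without any free
    }
    ok = TryAcquire();  // retry outside the lock, as in the patch
  }
  std::printf("allocation %s\n", ok ? "succeeded" : "timed out");
  freer.join();
  return 0;
}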
diff --git a/paddle/fluid/memory/allocation/retry_allocator.h b/paddle/fluid/memory/allocation/retry_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..5efcac8b108002a2a2da920173d237096de4fffa
--- /dev/null
+++ b/paddle/fluid/memory/allocation/retry_allocator.h
@@ -0,0 +1,66 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include <chrono>              // NOLINT
+#include <condition_variable>  // NOLINT
+#include <memory>
+#include <mutex>  // NOLINT
+#include "paddle/fluid/memory/allocation/allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+class RetryAllocator;
+
+class RetryAllocator : public Allocator {
+ public:
+  RetryAllocator(std::unique_ptr<Allocator>&& allocator, size_t retry_ms)
+      : underlying_allocator_(std::move(allocator)), retry_time_(retry_ms) {
+    EnforceCheck();
+  }
+
+  bool IsAllocThreadSafe() const override;
+
+ private:
+  void EnforceCheck() {
+    PADDLE_ENFORCE_NOT_NULL(
+        underlying_allocator_.get(),
+        "UnderlyingAllocator of RetryAllocator must be UnmanagedAllocator");
+    PADDLE_ENFORCE(underlying_allocator_->IsAllocThreadSafe(),
+                   "UnderlyingAllocator of RetryAllocator must be thread-safe");
+  }
+
+ protected:
+  void Free(Allocation* allocation) override;
+  Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override;
+
+ private:
+  std::unique_ptr<Allocator> underlying_allocator_;
+  std::chrono::milliseconds retry_time_;
+  std::mutex mutex_;
+  std::condition_variable cv_;
+
+  // For debugging, an atomic integer could record how many bytes are
+  // currently waiting to be allocated:
+  // std::atomic<size_t> waited_allocate_size_{0};
+
+  friend class RetryAllocation;
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/retry_allocator_test.cc b/paddle/fluid/memory/allocation/retry_allocator_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..f0b215dac252475217a403e680a23559280b0e8d
--- /dev/null
+++ b/paddle/fluid/memory/allocation/retry_allocator_test.cc
@@ -0,0 +1,98 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/retry_allocator.h"
+#include <algorithm>
+#include <chrono>              // NOLINT
+#include <condition_variable>  // NOLINT
+#include <mutex>               // NOLINT
+#include <thread>              // NOLINT
+#include <vector>
+#include "gtest/gtest.h"
+#include "paddle/fluid/memory/allocation/best_fit_allocator.h"
+#include "paddle/fluid/memory/allocation/cpu_allocator.h"
+#include "paddle/fluid/memory/allocation/locked_allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+TEST(RetryAllocator, RetryAllocator) {
+  CPUAllocator cpu_allocator;
+
+  size_t size = (1 << 20);
+  auto cpu_allocation = cpu_allocator.Allocate(size, cpu_allocator.kDefault);
+
+  std::unique_ptr<BestFitAllocator> best_fit_allocator(
+      new BestFitAllocator(cpu_allocation.get()));
+  std::unique_ptr<LockedAllocator> locked_allocator(
+      new LockedAllocator(std::move(best_fit_allocator)));
+
+  size_t thread_num = 32;
+  size_t sleep_time = 40;
+  size_t extra_time = 10;
+
+  // Reserve to perform more tests in the future
+  std::vector<std::shared_ptr<Allocator>> allocators;
+  {
+    std::unique_ptr<BestFitAllocator> best_fit_allocator(
+        new BestFitAllocator(cpu_allocation.get()));
+    std::unique_ptr<LockedAllocator> locked_allocator(
+        new LockedAllocator(std::move(best_fit_allocator)));
+    allocators.push_back(std::make_shared<RetryAllocator>(
+        std::move(locked_allocator),
+        (thread_num - 1) * (sleep_time + extra_time)));
+  }
+
+  for (auto &allocator : allocators) {
+    std::vector<std::thread> threads(thread_num);
+    std::vector<void *> addresses(threads.size(), nullptr);
+
+    std::mutex mutex;
+    std::condition_variable cv;
+    bool flag = false;
+
+    for (size_t i = 0; i < threads.size(); ++i) {
+      threads[i] = std::thread([&, i]() {
+        {
+          std::unique_lock<std::mutex> lock(mutex);
+          cv.wait(lock, [&] { return flag; });
+        }
+
+        auto ret = allocator->Allocate(size - 1);
+        addresses[i] = ret->ptr();
+        std::this_thread::sleep_for(std::chrono::milliseconds(sleep_time));
+      });
+    }
+
+    {
+      std::lock_guard<std::mutex> lock(mutex);
+      flag = true;
+      cv.notify_all();
+    }
+
+    for (auto &th : threads) {
+      th.join();
+    }
+
+    void *val = cpu_allocation->ptr();
+    bool is_all_equal = std::all_of(addresses.begin(), addresses.end(),
+                                    [val](void *p) { return p == val; });
+    ASSERT_TRUE(is_all_equal);
+  }
+}
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/allocation/zero_size_allocator.cc b/paddle/fluid/memory/allocation/zero_size_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..cb2df1a029815478bbc9d3b09425f3ef145c5fb3
--- /dev/null
+++ b/paddle/fluid/memory/allocation/zero_size_allocator.cc
@@ -0,0 +1,34 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/memory/allocation/zero_size_allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+bool ZeroSizeAllocator::IsAllocThreadSafe() const {
+  return underlying_allocator_->IsAllocThreadSafe();
+}
+
+Allocation *ZeroSizeAllocator::AllocateImpl(size_t size, Allocator::Attr attr) {
+  if (size == 0) {
+    return new ZeroSizeAllocation(place_);
+  } else {
+    return underlying_allocator_->Allocate(size, attr).release();
+  }
+}
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
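ZeroSizeAllocator's contract is that a zero-byte request still returns a usable Allocation object whose ptr() is null, and the underlying allocator is never consulted. The rule in isolation, with simplified types rather than the patch's classes:

#include <cassert>
#include <cstddef>

struct Allocation {
  void *ptr;
  size_t size;
};

Allocation Allocate(size_t size) {
  if (size == 0) {
    return Allocation{nullptr, 0};  // no underlying allocation at all
  }
  return Allocation{new char[size], size};  // normal path
}

int main() {
  Allocation a = Allocate(0);
  assert(a.ptr == nullptr && a.size == 0);  // valid object, null pointer
  Allocation b = Allocate(64);
  assert(b.ptr != nullptr);
  delete[] static_cast<char *>(b.ptr);
  return 0;
}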
diff --git a/paddle/fluid/memory/allocation/zero_size_allocator.h b/paddle/fluid/memory/allocation/zero_size_allocator.h
new file mode 100644
index 0000000000000000000000000000000000000000..6b80245a34e7a6834aa75a90218845cc92036881
--- /dev/null
+++ b/paddle/fluid/memory/allocation/zero_size_allocator.h
@@ -0,0 +1,50 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+#include <utility>
+#include "paddle/fluid/memory/allocation/allocator.h"
+
+namespace paddle {
+namespace memory {
+namespace allocation {
+
+// An allocator that handles zero-size requests: it always returns a valid
+// allocation even when the requested size is zero, but in that case
+// allocation.ptr() is nullptr.
+class ZeroSizeAllocation : public Allocation {
+ public:
+  explicit ZeroSizeAllocation(const platform::Place& p)
+      : Allocation(nullptr, 0, p) {}
+};
+
+class ZeroSizeAllocator : public Allocator {
+ public:
+  ZeroSizeAllocator(std::shared_ptr<Allocator> underlying_allocator,
+                    const platform::Place& p)
+      : underlying_allocator_(std::move(underlying_allocator)), place_(p) {}
+
+  bool IsAllocThreadSafe() const override;
+
+ protected:
+  Allocation* AllocateImpl(size_t size, Allocator::Attr attr) override;
+
+ private:
+  std::shared_ptr<Allocator> underlying_allocator_;
+  const platform::Place& place_;
+};
+
+}  // namespace allocation
+}  // namespace memory
+}  // namespace paddle
diff --git a/paddle/fluid/memory/detail/buddy_allocator.cc b/paddle/fluid/memory/detail/buddy_allocator.cc
index 26ef27c3caafadb4801b0ae52133f6175655ce0a..dd7ffaa26426edebd47ec3f6fb275ad5a2d23322 100644
--- a/paddle/fluid/memory/detail/buddy_allocator.cc
+++ b/paddle/fluid/memory/detail/buddy_allocator.cc
@@ -32,11 +32,11 @@ BuddyAllocator::BuddyAllocator(
       system_allocator_(std::move(system_allocator)) {}
 
 BuddyAllocator::~BuddyAllocator() {
-  VLOG(10) << "BuddyAllocator Disconstructor makes sure that all of these "
-              "have actually been freed";
+  VLOG(100) << "BuddyAllocator Disconstructor makes sure that all of these "
+               "have actually been freed";
   while (!pool_.empty()) {
     auto block = static_cast<MemoryBlock*>(std::get<2>(*pool_.begin()));
-    VLOG(10) << "Free from block (" << block << ", " << max_chunk_size_ << ")";
+    VLOG(100) << "Free from block (" << block << ", " << max_chunk_size_ << ")";
 
     system_allocator_->Free(block, max_chunk_size_, block->index(cache_));
     cache_.invalidate(block);
@@ -57,12 +57,12 @@ void* BuddyAllocator::Alloc(size_t unaligned_size) {
// acquire the allocator lock std::lock_guard lock(mutex_); - VLOG(10) << "Allocate " << unaligned_size << " bytes from chunk size " - << size; + VLOG(100) << "Allocate " << unaligned_size << " bytes from chunk size " + << size; // if the allocation is huge, send directly to the system allocator if (size > max_chunk_size_) { - VLOG(10) << "Allocate from system allocator."; + VLOG(100) << "Allocate from system allocator."; return SystemAlloc(size); } @@ -77,9 +77,9 @@ void* BuddyAllocator::Alloc(size_t unaligned_size) { return nullptr; } } else { - VLOG(10) << "Allocation from existing memory block " << std::get<2>(*it) - << " at address " - << reinterpret_cast(std::get<2>(*it))->data(); + VLOG(100) << "Allocation from existing memory block " << std::get<2>(*it) + << " at address " + << reinterpret_cast(std::get<2>(*it))->data(); } total_used_ += size; @@ -96,10 +96,10 @@ void BuddyAllocator::Free(void* p) { // Acquire the allocator lock std::lock_guard lock(mutex_); - VLOG(10) << "Free from address " << block; + VLOG(100) << "Free from address " << block; if (block->type(cache_) == MemoryBlock::HUGE_CHUNK) { - VLOG(10) << "Free directly from system allocator"; + VLOG(100) << "Free directly from system allocator"; system_allocator_->Free(block, block->total_size(cache_), block->index(cache_)); @@ -116,8 +116,8 @@ void BuddyAllocator::Free(void* p) { // Trying to merge the right buddy if (block->has_right_buddy(cache_)) { - VLOG(10) << "Merging this block " << block << " with its right buddy " - << block->right_buddy(cache_); + VLOG(100) << "Merging this block " << block << " with its right buddy " + << block->right_buddy(cache_); auto right_buddy = block->right_buddy(cache_); @@ -134,8 +134,8 @@ void BuddyAllocator::Free(void* p) { // Trying to merge the left buddy if (block->has_left_buddy(cache_)) { - VLOG(10) << "Merging this block " << block << " with its left buddy " - << block->left_buddy(cache_); + VLOG(100) << "Merging this block " << block << " with its left buddy " + << block->left_buddy(cache_); auto left_buddy = block->left_buddy(cache_); @@ -151,8 +151,8 @@ void BuddyAllocator::Free(void* p) { } // Dumping this block into pool - VLOG(10) << "Inserting free block (" << block << ", " - << block->total_size(cache_) << ")"; + VLOG(100) << "Inserting free block (" << block << ", " + << block->total_size(cache_) << ")"; pool_.insert( IndexSizeAddress(block->index(cache_), block->total_size(cache_), block)); @@ -174,7 +174,7 @@ void* BuddyAllocator::SystemAlloc(size_t size) { size_t index = 0; void* p = system_allocator_->Alloc(&index, size); - VLOG(10) << "Allocated " << p << " from system allocator."; + VLOG(100) << "Allocated " << p << " from system allocator."; if (p == nullptr) return nullptr; @@ -200,8 +200,8 @@ BuddyAllocator::PoolSet::iterator BuddyAllocator::RefillPool() { if (p == nullptr) return pool_.end(); - VLOG(10) << "Creating and inserting new block " << p - << " from system allocator"; + VLOG(100) << "Creating and inserting new block " << p + << " from system allocator"; static_cast(p)->init(&cache_, MemoryBlock::FREE_CHUNK, index, max_chunk_size_, nullptr, nullptr); @@ -245,19 +245,19 @@ void* BuddyAllocator::SplitToAlloc(BuddyAllocator::PoolSet::iterator it, auto block = static_cast(std::get<2>(*it)); pool_.erase(it); - VLOG(10) << "Split block (" << block << ", " << block->total_size(cache_) - << ") into"; + VLOG(100) << "Split block (" << block << ", " << block->total_size(cache_) + << ") into"; block->split(&cache_, size); - VLOG(10) << "Left block (" << block 
<< ", " << block->total_size(cache_) - << ")"; + VLOG(100) << "Left block (" << block << ", " << block->total_size(cache_) + << ")"; block->set_type(&cache_, MemoryBlock::ARENA_CHUNK); // the rest of memory if exist if (block->has_right_buddy(cache_)) { if (block->right_buddy(cache_)->type(cache_) == MemoryBlock::FREE_CHUNK) { - VLOG(10) << "Insert right block (" << block->right_buddy(cache_) << ", " - << block->right_buddy(cache_)->total_size(cache_) << ")"; + VLOG(100) << "Insert right block (" << block->right_buddy(cache_) << ", " + << block->right_buddy(cache_)->total_size(cache_) << ")"; pool_.insert( IndexSizeAddress(block->right_buddy(cache_)->index(cache_), @@ -284,7 +284,7 @@ void BuddyAllocator::CleanIdleFallBackAlloc() { return; } - VLOG(10) << "Return block " << block << " to fallback allocator."; + VLOG(100) << "Return block " << block << " to fallback allocator."; system_allocator_->Free(block, max_chunk_size_, block->index(cache_)); cache_.invalidate(block); @@ -320,7 +320,7 @@ void BuddyAllocator::CleanIdleNormalAlloc() { MemoryBlock* block = static_cast(std::get<2>(*pool)); - VLOG(10) << "Return block " << block << " to base allocator."; + VLOG(100) << "Return block " << block << " to base allocator."; system_allocator_->Free(block, max_chunk_size_, block->index(cache_)); cache_.invalidate(block); diff --git a/paddle/fluid/memory/detail/meta_cache.cc b/paddle/fluid/memory/detail/meta_cache.cc index b86e4f38c42a26e155f276f9b73cbed1d0d83f7d..152e4e7f9fa2e18a2b3e5b4042089660d291badf 100644 --- a/paddle/fluid/memory/detail/meta_cache.cc +++ b/paddle/fluid/memory/detail/meta_cache.cc @@ -29,7 +29,7 @@ MemoryBlock::Desc MetadataCache::load(const MemoryBlock* block) const { return existing_desc->second; } else { auto* desc = reinterpret_cast(block); - VLOG(10) << "Load MemoryBlock::Desc type=" << desc->type; + VLOG(100) << "Load MemoryBlock::Desc type=" << desc->type; PADDLE_ASSERT(desc->check_guards()); return *reinterpret_cast(block); } diff --git a/paddle/fluid/memory/detail/system_allocator.cc b/paddle/fluid/memory/detail/system_allocator.cc index 1b96798d23cec34a1863f56c1e4027ce32b2eec5..2019d1a14f6dd5ed09c251f26c6ca352faa594ae 100644 --- a/paddle/fluid/memory/detail/system_allocator.cc +++ b/paddle/fluid/memory/detail/system_allocator.cc @@ -30,12 +30,7 @@ limitations under the License. */ #include "paddle/fluid/platform/enforce.h" #include "paddle/fluid/platform/gpu_info.h" -// If use_pinned_memory is true, CPUAllocator calls mlock, which -// returns pinned and locked memory as staging areas for data exchange -// between host and device. Allocates too much would reduce the amount -// of memory available to the system for paging. So, by default, we -// should set false to use_pinned_memory. -DEFINE_bool(use_pinned_memory, true, "If set, allocate cpu pinned memory."); +DECLARE_bool(use_pinned_memory); DECLARE_double(fraction_of_gpu_memory_to_use); namespace paddle { namespace memory { diff --git a/paddle/fluid/memory/malloc.cc b/paddle/fluid/memory/malloc.cc index 0f13a4ea9c1af175771f5cc201ea5c0a8a0f7555..e414ad657a9447142d6e3a42fc7efc86f01e9c9f 100644 --- a/paddle/fluid/memory/malloc.cc +++ b/paddle/fluid/memory/malloc.cc @@ -12,213 +12,22 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include - #include "paddle/fluid/memory/malloc.h" - -#include "glog/logging.h" - -#include "paddle/fluid/memory/detail/buddy_allocator.h" -#include "paddle/fluid/memory/detail/system_allocator.h" -#include "paddle/fluid/platform/gpu_info.h" - -DEFINE_bool(init_allocated_mem, false, - "It is a mistake that the values of the memory allocated by " - "BuddyAllocator are always zeroed in some op's implementation. " - "To find this error in time, we use init_allocated_mem to indicate " - "that initializing the allocated memory with a small value " - "during unit testing."); -DECLARE_double(fraction_of_gpu_memory_to_use); - +#include +#include +#include "paddle/fluid/memory/allocation/allocator_facade.h" +#include "paddle/fluid/memory/allocation/allocator_strategy.h" +#include "paddle/fluid/platform/place.h" namespace paddle { namespace memory { - -using BuddyAllocator = detail::BuddyAllocator; - -BuddyAllocator* GetCPUBuddyAllocator() { - // We tried thread_local for inference::RNN1 model, but that not works much - // for multi-thread test. - static std::once_flag init_flag; - static detail::BuddyAllocator* a = nullptr; - - std::call_once(init_flag, []() { - a = new detail::BuddyAllocator( - std::unique_ptr(new detail::CPUAllocator), - platform::CpuMinChunkSize(), platform::CpuMaxChunkSize()); - }); - - return a; -} - -// We compared the NaiveAllocator with BuddyAllocator in CPU memory allocation, -// seems they are almost the same overhead. -struct NaiveAllocator { - void* Alloc(size_t size) { return malloc(size); } - - void Free(void* p) { - PADDLE_ENFORCE(p); - free(p); - } - - static NaiveAllocator* Instance() { - static NaiveAllocator x; - return &x; - } - - private: - std::mutex lock_; -}; - -template <> -void* Alloc(platform::CPUPlace place, size_t size) { - VLOG(10) << "Allocate " << size << " bytes on " << platform::Place(place); - void* p = GetCPUBuddyAllocator()->Alloc(size); - if (FLAGS_init_allocated_mem) { - memset(p, 0xEF, size); - } - VLOG(10) << " pointer=" << p; - return p; -} - -template <> -void Free(platform::CPUPlace place, void* p) { - VLOG(10) << "Free pointer=" << p << " on " << platform::Place(place); - GetCPUBuddyAllocator()->Free(p); -} - -template <> -size_t Used(platform::CPUPlace place) { - return GetCPUBuddyAllocator()->Used(); -} - -#ifdef PADDLE_WITH_CUDA - -BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) { - static std::once_flag init_flag; - static detail::BuddyAllocator** a_arr = nullptr; - - std::call_once(init_flag, [gpu_id]() { - int gpu_num = platform::GetCUDADeviceCount(); - PADDLE_ENFORCE(gpu_id < gpu_num, "gpu_id:%d should < gpu_num:%d", gpu_id, - gpu_num); - - a_arr = new BuddyAllocator*[gpu_num]; - for (int i = 0; i < gpu_num; i++) { - a_arr[i] = nullptr; - platform::SetDeviceId(i); - a_arr[i] = new BuddyAllocator( - std::unique_ptr(new detail::GPUAllocator(i)), - platform::GpuMinChunkSize(), platform::GpuMaxChunkSize()); - - VLOG(10) << "\n\nNOTE: each GPU device use " - << FLAGS_fraction_of_gpu_memory_to_use * 100 - << "% of GPU memory.\n" - << "You can set GFlags environment variable '" - << "FLAGS_fraction_of_gpu_memory_to_use" - << "' to change the fraction of GPU usage.\n\n"; - } - }); - - platform::SetDeviceId(gpu_id); - return a_arr[gpu_id]; -} - -template <> -size_t Used(platform::CUDAPlace place) { - return GetGPUBuddyAllocator(place.device)->Used(); -} - -template <> -void* Alloc(platform::CUDAPlace place, size_t size) { - auto* buddy_allocator = GetGPUBuddyAllocator(place.device); - auto* ptr = buddy_allocator->Alloc(size); - if 
(ptr == nullptr) { - int cur_dev = platform::GetCurrentDeviceId(); - platform::SetDeviceId(place.device); - size_t avail, total; - platform::GpuMemoryUsage(&avail, &total); - LOG(WARNING) << "Cannot allocate " << size << " bytes in GPU " - << place.device << ", available " << avail << " bytes"; - LOG(WARNING) << "total " << total; - LOG(WARNING) << "GpuMinChunkSize " << buddy_allocator->GetMinChunkSize(); - LOG(WARNING) << "GpuMaxChunkSize " << buddy_allocator->GetMaxChunkSize(); - LOG(WARNING) << "GPU memory used: " << Used(place); - platform::SetDeviceId(cur_dev); - } - if (FLAGS_init_allocated_mem) { - cudaMemset(ptr, 0xEF, size); - } - return ptr; -} - -template <> -void Free(platform::CUDAPlace place, void* p) { - GetGPUBuddyAllocator(place.device)->Free(p); -} - -BuddyAllocator* GetCUDAPinnedBuddyAllocator() { - static std::once_flag init_flag; - static BuddyAllocator* ba = nullptr; - - std::call_once(init_flag, []() { - ba = new BuddyAllocator(std::unique_ptr( - new detail::CUDAPinnedAllocator), - platform::CUDAPinnedMinChunkSize(), - platform::CUDAPinnedMaxChunkSize()); - }); - - return ba; -} - -template <> -size_t Used(platform::CUDAPinnedPlace place) { - return GetCUDAPinnedBuddyAllocator()->Used(); -} - -template <> -void* Alloc(platform::CUDAPinnedPlace place, - size_t size) { - auto* buddy_allocator = GetCUDAPinnedBuddyAllocator(); - void* ptr = buddy_allocator->Alloc(size); - - if (ptr == nullptr) { - LOG(WARNING) << "cudaMallocHost Cannot allocate " << size - << " bytes in CUDAPinnedPlace"; - } - if (FLAGS_init_allocated_mem) { - memset(ptr, 0xEF, size); - } - return ptr; -} - -template <> -void Free(platform::CUDAPinnedPlace place, void* p) { - GetCUDAPinnedBuddyAllocator()->Free(p); -} -#endif - -size_t Usage::operator()(const platform::CPUPlace& cpu) const { - return Used(cpu); -} - -size_t Usage::operator()(const platform::CUDAPlace& gpu) const { -#ifdef PADDLE_WITH_CUDA - return Used(gpu); -#else - PADDLE_THROW("'CUDAPlace' is not supported in CPU only device."); -#endif -} - -size_t Usage::operator()(const platform::CUDAPinnedPlace& cuda_pinned) const { -#ifdef PADDLE_WITH_CUDA - return Used(cuda_pinned); -#else - PADDLE_THROW("'CUDAPinnedPlace' is not supported in CPU only device."); -#endif +std::shared_ptr AllocShared(const platform::Place& place, + size_t size, Allocator::Attr attr) { + return allocation::AllocatorFacade::Instance().AllocShared(place, size, attr); } -size_t memory_usage(const platform::Place& p) { - return boost::apply_visitor(Usage(), p); +AllocationPtr Alloc(const platform::Place& place, size_t size, + Allocator::Attr attr) { + return allocation::AllocatorFacade::Instance().Alloc(place, size, attr); } } // namespace memory diff --git a/paddle/fluid/memory/malloc.h b/paddle/fluid/memory/malloc.h index 3e6bfddd69cb16edf323d040ea5369cd551f299e..916538b2a659d7d9503fdc337a4ba84fa21f77f9 100644 --- a/paddle/fluid/memory/malloc.h +++ b/paddle/fluid/memory/malloc.h @@ -14,91 +14,21 @@ limitations under the License. */ #pragma once +#include +#include "paddle/fluid/memory/allocation/allocator.h" #include "paddle/fluid/platform/place.h" - namespace paddle { namespace memory { +using allocation::Allocation; +using allocation::Allocator; +using allocation::AllocationPtr; -/** - * \brief Allocate memory block in one place. - * - * \param[in] place Allocation place (CPU or GPU). - * \param[in] size Allocation size. - * - * \return Allocated memory block address. 
- * - * \note If return nullptr, it indicates memory allocation failed - * because insufficient memory in current system. When Alloc - * function is invoked, you must check the returned memory - * address is valid or not. - */ -template -void* Alloc(Place place, size_t size); - -/** - * \brief Free memory block in one place. - * - * \param[in] place Allocation place (CPU or GPU). - * \param[in] ptr Memory block address to free. - * - */ -template -void Free(Place place, void* ptr); - -/** - * \brief Total size of used memory in one place. - * - * \param[in] place Allocation place (CPU or GPU). - * - */ -template -size_t Used(Place place); - -struct Usage : public boost::static_visitor { - size_t operator()(const platform::CPUPlace& cpu) const; - size_t operator()(const platform::CUDAPlace& gpu) const; - size_t operator()(const platform::CUDAPinnedPlace& cuda_pinned) const; -}; - -size_t memory_usage(const platform::Place& p); - -/** - * \brief Free memory block in one place. - * - * \note In some cases, custom deleter is used to - * deallocate the memory automatically for - * std::unique_ptr in tensor.h. - * - */ -template -class PODDeleter { - static_assert(std::is_pod::value, "T must be POD"); - - public: - explicit PODDeleter(Place place) : place_(place) {} - void operator()(T* ptr) { Free(place_, static_cast(ptr)); } - - private: - Place place_; -}; - -/** - * \brief Free memory block in one place does not meet POD - * - * \note In some cases, custom deleter is used to - * deallocate the memory automatically for - * std::unique_ptr in tensor.h. - * - */ -template -class PlainDeleter { - public: - explicit PlainDeleter(Place place) : place_(place) {} - void operator()(T* ptr) { Free(place_, reinterpret_cast(ptr)); } +extern std::shared_ptr AllocShared( + const platform::Place& place, size_t size, + Allocator::Attr attr = Allocator::kDefault); - private: - Place place_; -}; +extern AllocationPtr Alloc(const platform::Place& place, size_t size, + Allocator::Attr attr = Allocator::kDefault); } // namespace memory } // namespace paddle diff --git a/paddle/fluid/memory/malloc_test.cc b/paddle/fluid/memory/malloc_test.cc deleted file mode 100644 index d39466ef60c3750600dea726a6570397423d42f6..0000000000000000000000000000000000000000 --- a/paddle/fluid/memory/malloc_test.cc +++ /dev/null @@ -1,198 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/memory/malloc.h" - -#include - -#include "gtest/gtest.h" -#include "paddle/fluid/memory/detail/memory_block.h" -#include "paddle/fluid/platform/cpu_info.h" -#include "paddle/fluid/platform/gpu_info.h" -#include "paddle/fluid/platform/place.h" - -inline bool is_aligned(void const *p) { - return 0 == (reinterpret_cast(p) & 0x3); -} - -size_t align(size_t size, paddle::platform::CPUPlace place) { - size += sizeof(paddle::memory::detail::MemoryBlock::Desc); - size_t alignment = paddle::platform::CpuMinChunkSize(); - size_t remaining = size % alignment; - return remaining == 0 ? 
size : size + (alignment - remaining); -} - -TEST(BuddyAllocator, CPUAllocation) { - void *p = nullptr; - - EXPECT_EQ(p, nullptr); - - paddle::platform::CPUPlace cpu; - p = paddle::memory::Alloc(cpu, 4096); - - EXPECT_NE(p, nullptr); - - paddle::platform::Place place = cpu; - EXPECT_EQ(paddle::memory::Used(cpu), paddle::memory::memory_usage(place)); - - paddle::memory::Free(cpu, p); -} - -TEST(BuddyAllocator, CPUMultAlloc) { - paddle::platform::CPUPlace cpu; - - std::unordered_map ps; - - size_t total_size = paddle::memory::Used(cpu); - EXPECT_EQ(total_size, 0UL); - - for (auto size : - {0, 128, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304}) { - ps[paddle::memory::Alloc(cpu, size)] = size; - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(cpu) == total_size) continue; - - size_t aligned_size = align(size, cpu); - total_size += aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(cpu)); - } - - for (auto p : ps) { - EXPECT_EQ(is_aligned(p.first), true); - paddle::memory::Free(cpu, p.first); - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(cpu) == total_size) continue; - - size_t aligned_size = align(p.second, cpu); - total_size -= aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(cpu)); - } -} - -#ifdef PADDLE_WITH_CUDA - -size_t align(size_t size, paddle::platform::CUDAPlace place) { - size += sizeof(paddle::memory::detail::MemoryBlock::Desc); - size_t alignment = paddle::platform::GpuMinChunkSize(); - size_t remaining = size % alignment; - return remaining == 0 ? size : size + (alignment - remaining); -} - -TEST(BuddyAllocator, GPUAllocation) { - void *p = nullptr; - - EXPECT_EQ(p, nullptr); - - paddle::platform::CUDAPlace gpu(0); - p = paddle::memory::Alloc(gpu, 4096); - - EXPECT_NE(p, nullptr); - - paddle::platform::Place place = gpu; - EXPECT_EQ(paddle::memory::Used(gpu), paddle::memory::memory_usage(place)); - - paddle::memory::Free(gpu, p); -} - -TEST(BuddyAllocator, GPUMultAlloc) { - paddle::platform::CUDAPlace gpu; - - std::unordered_map ps; - - size_t total_size = paddle::memory::Used(gpu); - EXPECT_EQ(total_size, 0UL); - - for (auto size : - {0, 128, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304}) { - ps[paddle::memory::Alloc(gpu, size)] = size; - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(gpu) == total_size) continue; - - size_t aligned_size = align(size, gpu); - total_size += aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(gpu)); - } - - for (auto p : ps) { - EXPECT_EQ(is_aligned(p.first), true); - paddle::memory::Free(gpu, p.first); - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(gpu) == total_size) continue; - - size_t aligned_size = align(p.second, gpu); - total_size -= aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(gpu)); - } -} - -size_t align(size_t size, paddle::platform::CUDAPinnedPlace place) { - size += sizeof(paddle::memory::detail::MemoryBlock::Desc); - size_t alignment = paddle::platform::CUDAPinnedMinChunkSize(); - size_t remaining = size % alignment; - return remaining == 0 ? 
size : size + (alignment - remaining); -} - -TEST(BuddyAllocator, CUDAPinnedAllocator) { - void *p = nullptr; - - EXPECT_EQ(p, nullptr); - - paddle::platform::CUDAPinnedPlace cpu; - p = paddle::memory::Alloc(cpu, 4096); - - EXPECT_NE(p, nullptr); - - paddle::platform::Place place = cpu; - EXPECT_EQ(paddle::memory::Used(cpu), paddle::memory::memory_usage(place)); - - paddle::memory::Free(cpu, p); -} - -TEST(BuddyAllocator, CUDAPinnedMultAllocator) { - paddle::platform::CUDAPinnedPlace cpu; - - std::unordered_map ps; - - size_t total_size = paddle::memory::Used(cpu); - EXPECT_EQ(total_size, 0UL); - - for (auto size : - {0, 128, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304}) { - ps[paddle::memory::Alloc(cpu, size)] = size; - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(cpu) == total_size) continue; - - size_t aligned_size = align(size, cpu); - total_size += aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(cpu)); - } - - for (auto p : ps) { - EXPECT_EQ(is_aligned(p.first), true); - paddle::memory::Free(cpu, p.first); - - // Buddy Allocator doesn't manage too large memory chunk - if (paddle::memory::Used(cpu) == total_size) continue; - - size_t aligned_size = align(p.second, cpu); - total_size -= aligned_size; - EXPECT_EQ(total_size, paddle::memory::Used(cpu)); - } -} -#endif diff --git a/paddle/fluid/memory/memcpy.cc b/paddle/fluid/memory/memcpy.cc index a177d4985fd0e2cca983b6873af89c60f526b811..2a6f70a01e303aa1b608248cbeb8dcfa24837a0c 100644 --- a/paddle/fluid/memory/memcpy.cc +++ b/paddle/fluid/memory/memcpy.cc @@ -27,6 +27,8 @@ void Copy(platform::CPUPlace, void* dst, } #ifdef PADDLE_WITH_CUDA +static constexpr size_t kMaxGpuAsyncCopyBytes = 64 * 1024; // 64K + template <> void Copy( platform::CPUPlace dst_place, void* dst, platform::CUDAPlace src_place, @@ -36,6 +38,10 @@ void Copy( platform::GpuMemcpyAsync(dst, src, num, cudaMemcpyDeviceToHost, stream); } else { platform::GpuMemcpySync(dst, src, num, cudaMemcpyDeviceToHost); + // FIXME(zjl): do we really need it? + if (num <= kMaxGpuAsyncCopyBytes) { + cudaStreamSynchronize(0); + } } } @@ -48,6 +54,10 @@ void Copy( platform::GpuMemcpyAsync(dst, src, num, cudaMemcpyHostToDevice, stream); } else { platform::GpuMemcpySync(dst, src, num, cudaMemcpyHostToDevice); + // FIXME(zjl): do we really need it? + if (num <= kMaxGpuAsyncCopyBytes) { + cudaStreamSynchronize(0); + } } } diff --git a/paddle/fluid/operators/CMakeLists.txt b/paddle/fluid/operators/CMakeLists.txt index df3e3fcd9c75f03f4d9b0a7c12788f06bfdefd7f..de4f23515d8591f28b80ad00322365f8cdce768b 100644 --- a/paddle/fluid/operators/CMakeLists.txt +++ b/paddle/fluid/operators/CMakeLists.txt @@ -1,352 +1,91 @@ -file(GLOB GENERAL_OPS RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "*_op.cc") -string(REPLACE "_mkldnn" "" GENERAL_OPS "${GENERAL_OPS}") -string(REPLACE ".cc" "" GENERAL_OPS "${GENERAL_OPS}") -list(REMOVE_DUPLICATES GENERAL_OPS) -set(DEPS_OPS "") -set(pybind_file ${PADDLE_BINARY_DIR}/paddle/fluid/pybind/pybind.h) -file(WRITE ${pybind_file} "// Generated by the paddle/fluid/operator/CMakeLists.txt. DO NOT EDIT!\n\n") -function(op_library TARGET) - # op_library is a function to create op library. The interface is same as - # cc_library. But it handle split GPU/CPU code and link some common library - # for ops. 
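
In effect, this deleted op_library helper did two things: it compiled each operator's sources, classified by filename suffix (*_op.cc, *.cu.cc, *.cu, *_cudnn_op.cu.cc, *_mkldnn_op.cc, *_miopen_op.hip.cc), into one library target, and it appended one registration macro per operator to the generated pybind.h. A hypothetical excerpt of that generated header (the exact lines depend on build flags such as WITH_GPU and WITH_MKLDNN):

// Generated by the paddle/fluid/operator/CMakeLists.txt. DO NOT EDIT!

USE_OP(relu);                         // op with ordinary CPU/GPU kernels
USE_CPU_ONLY_OP(fc);                  // no CUDA sources, CPU-only kernel
USE_NO_KERNEL_OP(read_from_array);    // operator without a compute kernel
USE_OP_DEVICE_KERNEL(relu, MKLDNN);   // extra MKLDNN kernel registered
USE_OP_DEVICE_KERNEL(conv2d, CUDNN);  // extra cuDNN kernel registered
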
- set(cc_srcs) - set(cu_srcs) - set(hip_cu_srcs) - set(miopen_hip_cc_srcs) - set(cu_cc_srcs) - set(cudnn_cu_cc_srcs) - set(CUDNN_FILE) - set(mkldnn_cc_srcs) - set(MKLDNN_FILE) - set(op_common_deps operator op_registry math_function) - set(options "") - set(oneValueArgs "") - set(multiValueArgs SRCS DEPS) - set(pybind_flag 0) - cmake_parse_arguments(op_library "${options}" "${oneValueArgs}" - "${multiValueArgs}" ${ARGN}) - - list(LENGTH op_library_SRCS op_library_SRCS_len) - if (${op_library_SRCS_len} EQUAL 0) - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cc) - list(APPEND cc_srcs ${TARGET}.cc) - endif() - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cu.cc) - list(APPEND cu_cc_srcs ${TARGET}.cu.cc) - endif() - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cu) - list(APPEND cu_srcs ${TARGET}.cu) - endif() - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.hip.cu) - list(APPEND hip_cu_srcs ${TARGET}.hip.cu) - endif() - string(REPLACE "_op" "_cudnn_op" CUDNN_FILE "${TARGET}") - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${CUDNN_FILE}.cu.cc) - list(APPEND cudnn_cu_cc_srcs ${CUDNN_FILE}.cu.cc) - endif() - if(WITH_AMD_GPU) - string(REPLACE "_op" "_miopen_op" MIOPEN_FILE "${TARGET}") - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${MIOPEN_FILE}.hip.cc) - list(APPEND miopen_hip_cc_srcs ${MIOPEN_FILE}.hip.cc) - endif() - endif() - if(WITH_MKLDNN) - string(REPLACE "_op" "_mkldnn_op" MKLDNN_FILE "${TARGET}") - if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${MKLDNN_FILE}.cc) - list(APPEND mkldnn_cc_srcs ${MKLDNN_FILE}.cc) - endif() - endif() - else() - foreach(src ${op_library_SRCS}) - if (${src} MATCHES ".*\\.hip.cu$") - list(APPEND hip_cu_srcs ${src}) - elseif (${src} MATCHES ".*\\.cu$") - list(APPEND cu_srcs ${src}) - elseif(${src} MATCHES ".*_cudnn_op.cu.cc$") - list(APPEND cudnn_cu_cc_srcs ${src}) - elseif(WITH_AMD_GPU AND ${src} MATCHES ".*_miopen_op.hip.cc$") - list(APPEND miopen_hip_cc_srcs ${src}) - elseif(WITH_MKLDNN AND ${src} MATCHES ".*_mkldnn_op.cc$") - list(APPEND mkldnn_cc_srcs ${src}) - elseif(${src} MATCHES ".*\\.cu.cc$") - list(APPEND cu_cc_srcs ${src}) - elseif(${src} MATCHES ".*\\.cc$") - list(APPEND cc_srcs ${src}) - else() - message(FATAL_ERROR "${TARGET} Source file ${src} should only be .cc or .cu") - endif() - endforeach() - endif() - - list(LENGTH cc_srcs cc_srcs_len) - if (${cc_srcs_len} EQUAL 0) - message(FATAL_ERROR "The op library ${TARGET} should contains at least one .cc file") - endif() - if (WIN32) - # remove windows unsupported op, because windows has no nccl, no warpctc such ops. 
- foreach(windows_unsupport_op "nccl_op" "gen_nccl_id_op" "warpctc_op" "hierarchical_sigmoid_op" - "crf_decoding_op" "select_op" "lstmp_op" "gru_op" "fusion_gru_op" "lstm_op" "fusion_lstm_op" "cumsum_op" - "channel_send_op" "channel_create_op" "channel_close_op" "channel_recv_op") - if ("${TARGET}" STREQUAL "${windows_unsupport_op}") - return() - endif() - endforeach() - endif(WIN32) - set(OP_LIBRARY ${TARGET} ${OP_LIBRARY} PARENT_SCOPE) - - list(LENGTH op_library_DEPS op_library_DEPS_len) - if (${op_library_DEPS_len} GREATER 0) - set(DEPS_OPS ${TARGET} ${DEPS_OPS} PARENT_SCOPE) - endif() - if (WITH_GPU) - nv_library(${TARGET} SRCS ${cc_srcs} ${cu_cc_srcs} ${cudnn_cu_cc_srcs} ${mkldnn_cc_srcs} ${cu_srcs} DEPS ${op_library_DEPS} - ${op_common_deps}) - elseif (WITH_AMD_GPU) - hip_library(${TARGET} SRCS ${cc_srcs} ${hip_cu_srcs} ${miopen_hip_cc_srcs} ${mkldnn_cc_srcs} DEPS ${op_library_DEPS} - ${op_common_deps}) - else() - cc_library(${TARGET} SRCS ${cc_srcs} ${mkldnn_cc_srcs} DEPS ${op_library_DEPS} - ${op_common_deps}) - endif() - - # Define operators that don't need pybind here. - foreach(manual_pybind_op "compare_op" "logical_op" "nccl_op" -"tensor_array_read_write_op" "tensorrt_engine_op") - if ("${TARGET}" STREQUAL "${manual_pybind_op}") - set(pybind_flag 1) - endif() - endforeach() - - # The registration of USE_OP, please refer to paddle/fluid/framework/op_registry.h. - # Note that it's enough to just adding one operator to pybind in a *_op.cc file. - # And for detail pybind information, please see generated paddle/pybind/pybind.h. - file(READ ${TARGET}.cc TARGET_CONTENT) - string(REGEX MATCH "REGISTER_OPERATOR\\(.*REGISTER_OPERATOR\\(" multi_register "${TARGET_CONTENT}") - string(REGEX MATCH "REGISTER_OPERATOR\\([a-z0-9_]*," one_register "${multi_register}") - if (one_register STREQUAL "") - string(REPLACE "_op" "" TARGET "${TARGET}") - else () - string(REPLACE "REGISTER_OPERATOR(" "" TARGET "${one_register}") - string(REPLACE "," "" TARGET "${TARGET}") - endif() - - # pybind USE_NO_KERNEL_OP - # HACK: if REGISTER_OP_CPU_KERNEL presents the operator must have kernel - string(REGEX MATCH "REGISTER_OP_CPU_KERNEL" regex_result "${TARGET_CONTENT}") - string(REPLACE "_op" "" TARGET "${TARGET}") - if (${pybind_flag} EQUAL 0 AND regex_result STREQUAL "") - file(APPEND ${pybind_file} "USE_NO_KERNEL_OP(${TARGET});\n") - set(pybind_flag 1) - endif() +include(operators) - # pybind USE_CPU_ONLY_OP - list(LENGTH cu_srcs cu_srcs_len) - list(LENGTH cu_cc_srcs cu_cc_srcs_len) - list(LENGTH mkldnn_cc_srcs mkldnn_cc_srcs_len) - list(LENGTH hip_cu_srcs hip_cu_srcs_len) - list(LENGTH miopen_hip_cc_srcs miopen_hip_cc_srcs_len) - if (${pybind_flag} EQUAL 0 AND ${mkldnn_cc_srcs_len} EQUAL 0 AND ${cu_srcs_len} EQUAL 0 AND ${cu_cc_srcs_len} EQUAL 0 AND - ${hip_cu_srcs_len} EQUAL 0 AND ${miopen_hip_cc_srcs_len} EQUAL 0) - file(APPEND ${pybind_file} "USE_CPU_ONLY_OP(${TARGET});\n") - set(pybind_flag 1) - endif() - - # pybind USE_OP_DEVICE_KERNEL for CUDNN - list(LENGTH cudnn_cu_cc_srcs cudnn_cu_cc_srcs_len) - if (WITH_GPU AND ${cudnn_cu_cc_srcs_len} GREATER 0) - file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${TARGET}, CUDNN);\n") - endif() - - # pybind USE_OP_DEVICE_KERNEL for MIOPEN - if (WITH_AMD_GPU AND ${miopen_hip_cc_srcs_len} GREATER 0) - file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${TARGET}, MIOPEN);\n") - endif() - - # pybind USE_OP_DEVICE_KERNEL for MKLDNN - if (WITH_MKLDNN AND ${mkldnn_cc_srcs_len} GREATER 0) - # Append first implemented MKLDNN activation operator - if (${MKLDNN_FILE} STREQUAL 
"activation_mkldnn_op") - file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(relu, MKLDNN);\n") - else() - file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${TARGET}, MKLDNN);\n") - endif() - endif() - - # pybind USE_OP - if (${pybind_flag} EQUAL 0) - # NOTE(*): activation use macro to regist the kernels, set use_op manually. - if(${TARGET} STREQUAL "activation") - file(APPEND ${pybind_file} "USE_OP(relu);\n") - elseif(${TARGET} STREQUAL "fake_dequantize") - file(APPEND ${pybind_file} "USE_OP(fake_dequantize_max_abs);\n") - elseif(${TARGET} STREQUAL "fake_quantize") - file(APPEND ${pybind_file} "USE_OP(fake_quantize_abs_max);\n") - elseif(${TARGET} STREQUAL "tensorrt_engine_op") - message(STATUS "Pybind skips [tensorrt_engine_op], for this OP is only used in inference") - elseif(${TARGET} STREQUAL "fc") - # HACK: fc only have mkldnn and cpu, which would mismatch the cpu only condition - file(APPEND ${pybind_file} "USE_CPU_ONLY_OP(${TARGET});\n") - else() - file(APPEND ${pybind_file} "USE_OP(${TARGET});\n") - endif() - endif() -endfunction() +# clean cache and pybind_file content first when rebuild +unset(GLOB_OP_LIB CACHE) +unset(OP_LIBRARY CACHE) +set(pybind_file ${PADDLE_BINARY_DIR}/paddle/fluid/pybind/pybind.h CACHE INTERNAL "pybind.h file") +file(WRITE ${pybind_file} "// Generated by the paddle/fluid/operator/CMakeLists.txt. DO NOT EDIT!\n\n") add_subdirectory(math) -if (NOT WIN32) -add_subdirectory(nccl) -if(WITH_GPU) - op_library(nccl_op DEPS nccl_common) - file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(ncclAllReduce);\n") -else() - set(DEPS_OPS ${DEPS_OPS} nccl_op) -endif() -endif() # NOT WIN32 +add_subdirectory(controlflow) +add_subdirectory(csp) +add_subdirectory(detection) +add_subdirectory(elementwise) +add_subdirectory(fused) +add_subdirectory(metrics) +add_subdirectory(optimizers) +add_subdirectory(reduce_ops) +add_subdirectory(sequence_ops) -set(DISTRIBUTE_DEPS "") if(WITH_DISTRIBUTE) add_subdirectory(distributed) - set(DISTRIBUTE_DEPS "") - if(WITH_GRPC) - set(DISTRIBUTE_DEPS sendrecvop_grpc grpc++_unsecure grpc_unsecure gpr cares zlib protobuf node) - else() - set(DISTRIBUTE_DEPS sendrecvop_brpc brpc leveldb snappystream snappy protobuf ssl crypto zlib node) - if(WITH_BRPC_RDMA) - find_library(IBVERBS_LIBRARY NAMES ibverbs) - ADD_LIBRARY(ibverbs SHARED IMPORTED GLOBAL) - SET_PROPERTY(TARGET ibverbs PROPERTY IMPORTED_LOCATION ${IBVERBS_LIBRARY}) - - - find_library(RDMACM_LIBRARY NAMES rdmacm) - ADD_LIBRARY(rdmacm SHARED IMPORTED GLOBAL) - SET_PROPERTY(TARGET rdmacm PROPERTY IMPORTED_LOCATION ${RDMACM_LIBRARY}) - - set(DISTRIBUTE_DEPS ${DISTRIBUTE_DEPS} ibverbs rdmacm) - endif() - endif() - - set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") - foreach(dist_op "prefetch_op" "checkpoint_notify_op" "listen_and_serv_op" "send_op" "recv_op" "send_barrier_op" "fetch_barrier_op") - op_library(${dist_op} DEPS ${DISTRIBUTE_DEPS}) - set_source_files_properties(${dist_op}.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) - endforeach() - - #set_source_files_properties(send_recv_op_test.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) - #cc_test(test_send_recv SRCS send_recv_op_test.cc DEPS prefetch_op send_op - # listen_and_serv_op sum_op executor SERIAL) - if(WITH_GPU AND NOT WIN32) - set_source_files_properties(test_send_nccl_id.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) - cc_test(test_send_nccl_id SRCS test_send_nccl_id.cc DEPS listen_and_serv_op ${DISTRIBUTE_DEPS} executor SERIAL) - 
if(WITH_GRPC) - op_library(gen_nccl_id_op DEPS nccl_common sendrecvop_grpc) - else() - op_library(gen_nccl_id_op DEPS nccl_common sendrecvop_brpc) - endif() - set_source_files_properties(gen_nccl_id_op.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) - else() - set(DEPS_OPS ${DEPS_OPS} gen_nccl_id_op) - endif() # WITH_GPU AND NOT WIN32 -else() - set(DEPS_OPS ${DEPS_OPS} checkpoint_notify_op prefetch_op recv_op listen_and_serv_op send_op send_barrier_op fetch_barrier_op gen_nccl_id_op) + add_subdirectory(distributed_ops) endif() -op_library(cross_entropy_op DEPS cross_entropy) -if(WITH_GPU) - op_library(softmax_with_cross_entropy_op DEPS cross_entropy softmax cub) - op_library(sequence_softmax_op DEPS cub) -else() - op_library(softmax_with_cross_entropy_op DEPS cross_entropy softmax) +add_subdirectory(reader) + +if (NOT WIN32) + add_subdirectory(nccl) endif() -op_library(softmax_op DEPS softmax) if (WITH_GPU AND TENSORRT_FOUND) - op_library(tensorrt_engine_op DEPS tensorrt_engine tensorrt_converter) - file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(tensorrt_engine);\n") - nv_test(test_tensorrt_engine_op SRCS tensorrt_engine_op_test.cc - DEPS tensorrt_engine_op - analysis) -else() - set(DEPS_OPS ${DEPS_OPS} tensorrt_engine_op) + add_subdirectory(tensorrt) endif() -op_library(clip_by_norm_op DEPS selected_rows_functor selected_rows) -op_library(sum_op DEPS selected_rows_functor) -op_library(sgd_op DEPS selected_rows_functor) -op_library(print_op DEPS lod_tensor) -op_library(adagrad_op DEPS selected_rows_functor) -op_library(maxout_op DEPS maxouting) -op_library(unpool_op DEPS unpooling) -op_library(pool_op DEPS pooling) -op_library(pool_with_index_op DEPS pooling) -op_library(lod_rank_table_op DEPS lod_rank_table) -op_library(lod_tensor_to_array_op DEPS lod_rank_table_op) -op_library(array_to_lod_tensor_op DEPS lod_rank_table_op) -op_library(max_sequence_len_op DEPS lod_rank_table) -op_library(sequence_conv_op DEPS context_project) -op_library(sequence_pool_op DEPS sequence_pooling) -if (NOT WIN32) -op_library(lstm_op DEPS sequence2batch lstm_compute) -op_library(hierarchical_sigmoid_op DEPS matrix_bit_code) -op_library(lstmp_op DEPS sequence2batch lstm_compute) -op_library(gru_op DEPS sequence2batch gru_compute) -endif(NOT WIN32) -op_library(recurrent_op DEPS executor) -op_library(warpctc_op DEPS dynload_warpctc sequence_padding sequence_scale) -op_library(cos_sim_op DEPS cos_sim_functor) -op_library(parallel_do_op DEPS executor) -op_library(unsqueeze_op DEPS reshape_op) -op_library(squeeze_op DEPS reshape_op) -op_library(extract_rows_op DEPS memory) -op_library(flatten_op DEPS reshape_op) -op_library(sequence_pad_op DEPS sequence_padding) -op_library(unstack_op DEPS stack_op) -op_library(fake_quantize_op DEPS memory) -op_library(fusion_lstm_op DEPS jit_kernel) + +SET(OP_HEADER_DEPS xxhash) if (WITH_GPU) - op_library(conv_op DEPS vol2col depthwise_conv im2col) - op_library(layer_norm_op DEPS cub) - op_library(reduce_mean_op DEPS cub) -else() - op_library(conv_op DEPS vol2col im2col) + SET(OP_HEADER_DEPS ${OP_HEADER_DEPS} cub) endif() -op_library(conv_transpose_op DEPS vol2col im2col) -# FIXME(typhoonzero): save/load depends lodtensor serialization functions -op_library(save_op DEPS lod_tensor) -op_library(load_op DEPS lod_tensor) -op_library(save_combine_op DEPS lod_tensor) -op_library(load_combine_op DEPS lod_tensor) -op_library(concat_op DEPS concat) +register_operators(EXCLUDES warpctc_op conv_fusion_op DEPS ${OP_HEADER_DEPS}) -list(REMOVE_ITEM GENERAL_OPS ${DEPS_OPS}) +# warpctc_op 
needs cudnn 7 above +if (WITH_GPU AND NOT WIN32) + if (${CUDNN_MAJOR_VERSION} VERSION_LESS 7) + op_library(warpctc_op DEPS dynload_warpctc sequence_padding sequence_scale SRCS warpctc_op.cc warpctc_op.cu.cc) + else() + op_library(warpctc_op DEPS dynload_warpctc sequence_padding sequence_scale) + endif() + # conv_fusion_op needs cudnn 7 above + if (NOT ${CUDNN_MAJOR_VERSION} VERSION_LESS 7) + op_library(conv_fusion_op) + file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(conv2d_fusion);\n") + endif() +else() + op_library(warpctc_op DEPS dynload_warpctc sequence_padding sequence_scale) +endif() -foreach(src ${GENERAL_OPS}) - op_library(${src}) -endforeach() +set(COMMON_OP_DEPS ${OP_HEADER_DEPS}) -file(APPEND ${pybind_file} "USE_OP(less_than);\nUSE_OP(logical_and);\nUSE_NO_KERNEL_OP(read_from_array);\n") +set(COMMON_OP_DEPS ${COMMON_OP_DEPS} selected_rows_functor selected_rows lod_tensor maxouting unpooling pooling lod_rank_table context_project sequence_pooling executor) if (NOT WIN32) -add_subdirectory(reader) -endif(NOT WIN32) -foreach(src ${READER_LIBRARY}) - set(OP_LIBRARY ${src} ${OP_LIBRARY}) -endforeach() + set(COMMON_OP_DEPS ${COMMON_OP_DEPS} dynload_warpctc) +endif() +set(COMMON_OP_DEPS ${COMMON_OP_DEPS} sequence_padding sequence_scale cos_sim_functor memory jit_kernel concat_and_split cross_entropy softmax vol2col im2col sampler) +set(COMMON_OP_DEPS ${COMMON_OP_DEPS} sequence2batch lstm_compute matrix_bit_code gru_compute activation_functions) +if (WITH_GPU) + set(COMMON_OP_DEPS ${COMMON_OP_DEPS} depthwise_conv) +endif() -add_subdirectory(detection) -foreach(src ${DETECTION_LIBRARY}) - set(OP_LIBRARY ${src} ${OP_LIBRARY}) -endforeach() +# FIXME(typhoonzero): operator deps may not needed. +# op_library(lod_tensor_to_array_op DEPS lod_rank_table_op) +# op_library(array_to_lod_tensor_op DEPS lod_rank_table_op) +# op_library(unsqueeze_op DEPS reshape_op) +# op_library(squeeze_op DEPS reshape_op) +# op_library(flatten_op DEPS reshape_op) +# op_library(unstack_op DEPS stack_op) +# op_library(tensor_array_to_tensor_op DEPS concat_op) -set(GLOB_OP_LIB ${OP_LIBRARY} CACHE INTERNAL "Global OP library") -set(GLOB_DISTRIBUTE_DEPS ${DISTRIBUTE_DEPS} CACHE INTERNAL "distributed dependency") +set(OPERATOR_DEPS ${OPERATOR_DEPS} ${COMMON_OP_DEPS}) +set(GLOB_OPERATOR_DEPS ${OPERATOR_DEPS} CACHE INTERNAL "Global Op dependencies") cc_test(gather_test SRCS gather_test.cc DEPS tensor) -cc_test(scatter_test SRCS scatter_test.cc DEPS tensor) +cc_test(scatter_test SRCS scatter_test.cc DEPS tensor math_function) cc_test(beam_search_decode_op_test SRCS beam_search_decode_op_test.cc DEPS lod_tensor) cc_test(beam_search_op_test SRCS beam_search_op_test.cc DEPS lod_tensor beam_search_op) cc_test(strided_memcpy_test SRCS strided_memcpy_test.cc DEPS tensor memory) cc_test(save_load_op_test SRCS save_load_op_test.cc DEPS save_op load_op) cc_test(save_load_combine_op_test SRCS save_load_combine_op_test.cc DEPS save_combine_op load_combine_op) -if(NOT WIN32) -nv_test(nccl_op_test SRCS nccl_op_test.cu.cc DEPS nccl_op gpu_info device_context) -endif() nv_test(dropout_op_test SRCS dropout_op_test.cc DEPS dropout_op tensor) + +set(GLOB_OP_LIB ${OP_LIBRARY} CACHE INTERNAL "Global OP library") diff --git a/paddle/fluid/operators/activation_mkldnn_op.cc b/paddle/fluid/operators/activation_mkldnn_op.cc index 137bca5e2b8e2754aed274970e08b03ee816a7f2..64649b1a5e471a30f435e2b1c1a9db03d35dbd8a 100644 --- a/paddle/fluid/operators/activation_mkldnn_op.cc +++ b/paddle/fluid/operators/activation_mkldnn_op.cc @@ -71,6 +71,10 @@ class 
MKLDNNActivationGradKernel diff_y->format() != memory::format::format_undef, "Wrong layout/format set for Input OutGrad tensor"); + PADDLE_ENFORCE( + !ctx.Attr("is_test"), + "is_test attribute should be set to False in training phase."); + Functor functor; auto attrs = functor.GetAttrs(); @@ -115,11 +119,15 @@ void eltwise_forward(const framework::ExecutionContext &ctx, const std::string key_fwd = key_with_layout + "@eltwise_fwd"; const std::string key_fwd_pd = key_with_layout + "@eltwise_fwd_pd"; + bool is_test = ctx.Attr("is_test"); + // save input data and layout to be referred in backward path auto p_src_data = std::make_shared(x_data); - dev_ctx.SetBlob(key_src_data, p_src_data); auto p_src_layout = std::make_shared(src_format); - dev_ctx.SetBlob(key_src_layout, p_src_layout); + if (!is_test) { + dev_ctx.SetBlob(key_src_data, p_src_data); + dev_ctx.SetBlob(key_src_layout, p_src_layout); + } auto p_fwd = std::static_pointer_cast( dev_ctx.GetBlob(key_fwd)); @@ -136,14 +144,17 @@ void eltwise_forward(const framework::ExecutionContext &ctx, dev_ctx.SetBlob(key_src_mem, src_memory); // create primitive descriptor for activation forward and save it + auto mkldnn_forward_prop_kind = is_test + ? mkldnn::prop_kind::forward_inference + : mkldnn::prop_kind::forward_training; auto forward_desc = mkldnn::eltwise_forward::desc( - mkldnn::prop_kind::forward_training, algorithm, + mkldnn_forward_prop_kind, algorithm, src_memory->get_primitive_desc().desc(), alpha, beta); auto forward_pd = std::make_shared( forward_desc, mkldnn_engine); // save prim desc into global device context to be referred in backward path - dev_ctx.SetBlob(key_fwd_pd, forward_pd); + if (!is_test) dev_ctx.SetBlob(key_fwd_pd, forward_pd); // create mkldnn memory for output y dst_memory = diff --git a/paddle/fluid/operators/activation_op.cc b/paddle/fluid/operators/activation_op.cc index bbf52bea1358c32596ab6f14eeaa419735d19fc6..bb9ea3f3ba08753dd23b2b2a776b7d2960e5e00e 100644 --- a/paddle/fluid/operators/activation_op.cc +++ b/paddle/fluid/operators/activation_op.cc @@ -22,18 +22,23 @@ namespace operators { using paddle::framework::Tensor; -#define REGISTER_ACTIVATION_OP_MAKER(OP_NAME, OP_COMMENT) \ - class OP_NAME##OpMaker \ - : public ::paddle::framework::OpProtoAndCheckerMaker { \ - public: \ - void Make() override { \ - AddInput("X", "Input of " #OP_NAME " operator"); \ - AddOutput("Out", "Output of " #OP_NAME " operator").Reuse("X"); \ - AddAttr("use_mkldnn", \ - "(bool, default false) Only used in mkldnn kernel") \ - .SetDefault(false); \ - AddComment(#OP_COMMENT); \ - } \ +#define REGISTER_ACTIVATION_OP_MAKER(OP_NAME, OP_COMMENT) \ + class OP_NAME##OpMaker \ + : public ::paddle::framework::OpProtoAndCheckerMaker { \ + public: \ + void Make() override { \ + AddInput("X", "Input of " #OP_NAME " operator"); \ + AddOutput("Out", "Output of " #OP_NAME " operator"); \ + AddAttr("use_mkldnn", \ + "(bool, default false) Only used in mkldnn kernel") \ + .SetDefault(false); \ + AddAttr( \ + "is_test", \ + "(bool, default false) Set to true for inference only, false " \ + "for training. 
Some layers may run faster when this is true.") \ + .SetDefault(false); \ + AddComment(#OP_COMMENT); \ + } \ } #define REGISTER_ACTIVATION_OP_GRAD_MAKER(OP_NAME, KERNEL_TYPE) \ @@ -91,16 +96,12 @@ class ActivationOp : public framework::OperatorWithKernel { } }; -class ActivationOpInferVarType : public framework::VarTypeInference { - public: - void operator()(const framework::OpDesc& op_desc, - framework::BlockDesc* block) const override { - auto x_name = op_desc.Input("X")[0]; - auto out_name = op_desc.Output("Out")[0]; - auto& x = block->FindRecursiveOrCreateVar(x_name); - auto& out = block->FindRecursiveOrCreateVar(out_name); - out.SetType(x.GetType()); - out.SetDataType(x.GetDataType()); +class ActivationOpInferVarType + : public framework::PassInDtypeAndVarTypeToOutput { + protected: + std::unordered_map GetInputOutputWithSameType() + const override { + return std::unordered_map{{"X", /*->*/ "Out"}}; } }; @@ -273,7 +274,7 @@ class SoftShrinkOpMaker : public framework::OpProtoAndCheckerMaker { :strong:`Softshrink Activation Operator` .. math:: - out = \begin{cases} + out = \begin{cases} x - \lambda, \text{if } x > \lambda \\ x + \lambda, \text{if } x < -\lambda \\ 0, \text{otherwise} @@ -439,7 +440,7 @@ class HardSigmoidOpMaker : public framework::OpProtoAndCheckerMaker { AddComment(R"DOC( HardSigmoid Activation Operator. -Segment-wise linear approximation of sigmoid(https://arxiv.org/abs/1603.00391), +Segment-wise linear approximation of sigmoid(https://arxiv.org/abs/1603.00391), which is much faster than sigmoid. $out = \max(0, \min(1, slope * x + shift))$ diff --git a/paddle/fluid/operators/activation_op.cu b/paddle/fluid/operators/activation_op.cu index 27487b396ccf63d962defa6b270063ccb409164e..d3a7ceed466a9b5e4d773f1531d198adff97eac2 100644 --- a/paddle/fluid/operators/activation_op.cu +++ b/paddle/fluid/operators/activation_op.cu @@ -26,6 +26,8 @@ namespace plat = paddle::platform; act_type##_grad, ops::ActivationGradKernel>, \ ops::ActivationGradKernel>); + ops::grad_functor>, \ + ops::ActivationGradKernel>); FOR_EACH_KERNEL_FUNCTOR(REGISTER_ACTIVATION_CUDA_KERNEL); diff --git a/paddle/fluid/operators/activation_op.h b/paddle/fluid/operators/activation_op.h index 2e31d1c9c708225135e27c93ba94722794c4b282..4ffc7f364bcb9bda5f94be5fe071c73bd5c40ca7 100644 --- a/paddle/fluid/operators/activation_op.h +++ b/paddle/fluid/operators/activation_op.h @@ -95,7 +95,7 @@ class ActivationGradKernel auto x = framework::EigenVector::Flatten(*X); functor(*place, x, out, dout, dx); } else { - VLOG(10) << " Inplace activation "; + VLOG(100) << " Inplace activation "; auto x = framework::EigenVector::Flatten(*dX); functor(*place, x, out, dout, dx); } @@ -333,8 +333,7 @@ struct SqrtGradFunctor : public BaseActivationFunctor { template void operator()(Device d, X x, Out out, dOut dout, dX dx) const { - const Out out_conj = Eigen::numext::conj(out); - dx.device(d) = static_cast(0.5) * dout / out_conj; + dx.device(d) = static_cast(0.5) * dout / out; } }; @@ -740,7 +739,7 @@ struct PowGradFunctor : public BaseActivationFunctor { typename dX> void operator()(Device d, X x, Out out, dOut dout, dX dx) const { dx.device(d) = dout * static_cast(factor) * - x.pow(static_cast(factor - static_cast(1))); + x.pow(static_cast(factor) - static_cast(1)); } }; diff --git a/paddle/fluid/operators/add_position_encoding_op.cc b/paddle/fluid/operators/add_position_encoding_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..8127e554bed1aae7a5ce8837bcadf1b7f13f1ac2 --- /dev/null +++ 
b/paddle/fluid/operators/add_position_encoding_op.cc @@ -0,0 +1,97 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/add_position_encoding_op.h" + +namespace paddle { +namespace operators { + +class AddPositionEncodingOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "X(Input) of add_position_encoding_op should not be null."); + PADDLE_ENFORCE( + ctx->HasOutput("Out"), + "Out(Output) of add_position_encoding_op should not be null."); + + auto x_dims = ctx->GetInputDim("X"); + ctx->SetOutputDim("Out", x_dims); + ctx->ShareLoD("X", /*->*/ "Out"); + } +}; + +class AddPositionEncodingOpGrad : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), "X(Input) must not be null."); + PADDLE_ENFORCE(ctx->HasInput("Out"), "Out must not be null."); + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")), + "Out@GRAD must not be null."); + + auto out_dims = ctx->GetInputDim("Out"); + if (ctx->HasOutput(framework::GradVarName("X"))) { + ctx->SetOutputDim(framework::GradVarName("X"), out_dims); + } + } +}; + +class AddPositionEncodingOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "Input of AddPositionEncoding operator"); + AddOutput("Out", "Output of AddPositionEncoding operator"); + AddAttr("alpha", "The scale of Original Embedding.") + .SetDefault(1.0f) + .AddCustomChecker([](const float& alpha) { + PADDLE_ENFORCE(alpha >= 0.0f, "'alpha' must be above 0.0."); + }); + AddAttr("beta", "The scale of Position Embedding.") + .SetDefault(1.0f) + .AddCustomChecker([](const float& beta) { + PADDLE_ENFORCE(beta >= 0.0f, "'beta' must be between 0.0."); + }); + AddComment(R"DOC( + Add Position Encoding Operator. + + The add position encoding calculates the output based on the input, alpha, beta. + The size of each dimension of the parameters checked in the infer-shape. 
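
For reference, the compute kernel in the companion header (added below) implements the standard sinusoidal encoding. With h = enc_size / 2, position index j, and channel k in [0, h), it computes

$$val_{j,k} = j / 10000^{k/(h-1)}$$
$$Out_{j,k} = alpha * X_{j,k} + beta * sin(val_{j,k})$$
$$Out_{j,h+k} = alpha * X_{j,h+k} + beta * cos(val_{j,k})$$

with val_{j,k} degenerating to j / 10000 when h = 1.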
+ )DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +namespace plt = paddle::platform; + +REGISTER_OPERATOR(add_position_encoding, ops::AddPositionEncodingOp, + ops::AddPositionEncodingOpMaker, + paddle::framework::DefaultGradOpDescMaker); +REGISTER_OPERATOR(add_position_encoding_grad, ops::AddPositionEncodingOpGrad); + +REGISTER_OP_CPU_KERNEL( + add_position_encoding, + ops::AddPositionEncodingKernel, + ops::AddPositionEncodingKernel); + +REGISTER_OP_CPU_KERNEL( + add_position_encoding_grad, + ops::AddPositionEncodingGradKernel, + ops::AddPositionEncodingGradKernel); diff --git a/paddle/fluid/operators/add_position_encoding_op.h b/paddle/fluid/operators/add_position_encoding_op.h new file mode 100644 index 0000000000000000000000000000000000000000..0b40d3de890a02a9dbec2328f9f6388ffa35561b --- /dev/null +++ b/paddle/fluid/operators/add_position_encoding_op.h @@ -0,0 +1,106 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/detail/safe_ref.h" + +namespace paddle { +namespace operators { + +template +class AddPositionEncodingKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + auto* X = context.Input("X"); + auto& x_lod = X->lod(); + auto* src_ptr = X->data(); + + auto* Out = context.Output("Out"); + auto* dst_ptr = Out->mutable_data(context.GetPlace()); + + float alpha = context.Attr("alpha"); + float beta = context.Attr("beta"); + + auto x_dim = X->dims(); + int batch_size = 0; + int max_seq_len = 0; + int enc_size = 0; + + if (x_lod.empty()) { + PADDLE_ENFORCE( + x_dim.size() == 3UL, + "The input X of Add Position Encoding should be 3-D Tensor!"); + batch_size = x_dim[0]; + max_seq_len = x_dim[1]; + enc_size = x_dim[2]; + } else { + PADDLE_ENFORCE( + x_dim.size() == 2UL, + "The input X of Add Position Encoding should be 2-D LoDTensor!"); + PADDLE_ENFORCE( + x_lod.size() == 1UL, + "The Add Position Encoding Op only supports lod_level == 1!"); + batch_size = x_lod[0].size() - 1; + max_seq_len = -1; + enc_size = x_dim[1]; + } + + PADDLE_ENFORCE(enc_size % 2 == 0, "Only support even encode size!"); + + const int half_size = enc_size / 2; + for (int i = 0; i < batch_size; ++i) { + const int max_length = + x_lod.empty() ? max_seq_len : x_lod[0][i + 1] - x_lod[0][i]; + for (int j = 0; j < max_length; ++j) { + for (int k = 0; k < half_size; ++k) { + const double val = + (half_size > 1) + ? 
j / pow(10000.0, static_cast(k) / (half_size - 1)) + : j / 10000.0; + dst_ptr[k] = src_ptr[k] * alpha + sin(val) * beta; + dst_ptr[half_size + k] = + src_ptr[half_size + k] * alpha + cos(val) * beta; + } + src_ptr += enc_size; + dst_ptr += enc_size; + } + } + } +}; + +template +class AddPositionEncodingGradKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + auto* dOut = + context.Input(framework::GradVarName("Out")); + auto dout = framework::EigenVector::Flatten(*dOut); + + auto* dX = + context.Output(framework::GradVarName("X")); + dX->mutable_data(context.GetPlace()); + auto dx = framework::EigenVector::Flatten(*dX); + + float alpha = context.Attr("alpha"); + + auto* place = + context.template device_context().eigen_device(); + dx.device(*place) = dout * static_cast(alpha); + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/affine_channel_op.cc b/paddle/fluid/operators/affine_channel_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..8944a749674c3ba6c83526e4d66f449075716f43 --- /dev/null +++ b/paddle/fluid/operators/affine_channel_op.cc @@ -0,0 +1,255 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +Indicesou may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/data_layout.h" +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +class AffineChannelOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", + "(Tensor) Feature map input can be a 4D tensor with order NCHW " + "or NHWC. It also can be a 2D tensor and C is the second " + "dimension."); + AddInput("Scale", + "(Tensor) 1D input of shape (C), the c-th element " + "is the scale factor of the affine transformation " + "for the c-th channel of the input."); + AddInput("Bias", + "(Tensor) 1D input of shape (C), the c-th element " + "is the bias of the affine transformation for the " + "c-th channel of the input."); + AddAttr( + "data_layout", + "(string, default NCHW) Only used in " + "An optional string from: \"NHWC\", \"NCHW\". " + "Defaults to \"NHWC\". Specify the data format of the output data, " + "the input will be transformed automatically. ") + .SetDefault("AnyLayout"); + AddOutput("Out", "(Tensor) A tensor of the same shape and order with X."); + AddComment(R"DOC( + +Applies a separate affine transformation to each channel of the input. Useful +for replacing spatial batch norm with its equivalent fixed transformation. +The input also can be 2D tensor and applies a affine transformation in second +dimension. 
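
For the kernels that follow, a plain-loop sketch of the NCHW case may help (AffineChannelRef is an illustrative name, not part of the operator; the real kernels vectorize this with Eigen):

void AffineChannelRef(const float* x, const float* scale, const float* bias,
                      int N, int C, int HxW, float* y) {
  for (int n = 0; n < N; ++n) {
    for (int c = 0; c < C; ++c) {
      for (int i = 0; i < HxW; ++i) {
        const int idx = (n * C + c) * HxW + i;  // NCHW linear index
        y[idx] = scale[c] * x[idx] + bias[c];   // one (scale, bias) per channel
      }
    }
  }
}

which is the elementwise form of the documented transform: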
+ +$$Out = Scale*X + Bias$$ + +)DOC"); + } +}; + +class AffineChannelOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of AffineChannelOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Scale"), + "Input(Scale) of AffineChannelOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Bias"), + "Input(Bias) of AffineChannelOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Out"), + "Output(Out) of AffineChannelOp should not be null."); + ctx->SetOutputDim("Out", ctx->GetInputDim("X")); + ctx->ShareLoD("X", "Out"); + } +}; + +class AffineChannelOpGrad : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")), + "Input(Out@GRAD) should not be null."); + if (ctx->HasOutput(framework::GradVarName("X"))) { + PADDLE_ENFORCE(ctx->HasInput("Scale"), + "Input(Scale) should not be null."); + ctx->SetOutputDim(framework::GradVarName("X"), + ctx->GetInputDim(framework::GradVarName("Out"))); + } + if (ctx->HasOutput(framework::GradVarName("Scale"))) { + // Scale@GRAD and Bias@GRAD must exist at the same time. + PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("Bias")), + "Output(Scale@GRAD) should not be null."); + PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null."); + ctx->SetOutputDim(framework::GradVarName("Scale"), + ctx->GetInputDim("Scale")); + ctx->SetOutputDim(framework::GradVarName("Bias"), + ctx->GetInputDim("Scale")); + } + } +}; + +template +using EigenArrayMap = + Eigen::Map>; +template +using ConstEigenArrayMap = + Eigen::Map>; +template +using EigenVectorArrayMap = Eigen::Map>; +template +using ConstEigenVectorArrayMap = + Eigen::Map>; + +template +class AffineChannelKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* x = ctx.Input("X"); + auto* scale = ctx.Input("Scale"); + auto* bias = ctx.Input("Bias"); + + auto* y = ctx.Output("Out"); + y->mutable_data(ctx.GetPlace()); + + const framework::DataLayout layout = + framework::StringToDataLayout(ctx.Attr("data_layout")); + + auto dims = x->dims(); + int N = dims[0]; + int C = layout == framework::DataLayout::kNCHW ? 
dims[1] + : dims[dims.size() - 1]; + int HxW = x->numel() / N / C; + + auto* scale_d = scale->data(); + auto* bias_d = bias->data(); + ConstEigenVectorArrayMap a_e(scale_d, C); + ConstEigenVectorArrayMap b_e(bias_d, C); + + auto* x_d = x->data(); + auto* y_d = y->data(); + if (layout == framework::DataLayout::kNCHW) { + int stride = C * HxW; + for (int i = 0; i < N; i++) { + ConstEigenArrayMap x_e(x_d, HxW, C); + EigenArrayMap y_e(y_d, HxW, C); + y_e = (x_e.rowwise() * a_e.transpose()).rowwise() + b_e.transpose(); + x_d += stride; + y_d += stride; + } + } else { + int num = N * HxW; + ConstEigenArrayMap x_e(x_d, C, num); + EigenArrayMap y_e(y_d, C, num); + y_e = (x_e.colwise() * a_e).colwise() + b_e; + } + } +}; + +template +class AffineChannelGradKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* x = ctx.Input("X"); + auto* scale = ctx.Input("Scale"); + auto* dy = ctx.Input(framework::GradVarName("Out")); + + auto* dx = ctx.Output(framework::GradVarName("X")); + auto* dscale = + ctx.Output(framework::GradVarName("Scale")); + auto* dbias = ctx.Output(framework::GradVarName("Bias")); + + const framework::DataLayout layout = + framework::StringToDataLayout(ctx.Attr("data_layout")); + + auto dims = x->dims(); + int N = dims[0]; + int C = layout == framework::DataLayout::kNCHW ? dims[1] + : dims[dims.size() - 1]; + int HxW = x->numel() / N / C; + + auto* x_d = x->data(); + auto* dy_d = dy->data(); + auto* scale_d = scale->data(); + ConstEigenVectorArrayMap scale_e(scale_d, C); + + T* dx_d = dx ? dx->mutable_data(ctx.GetPlace()) : nullptr; + T* dscale_d = dscale ? dscale->mutable_data(ctx.GetPlace()) : nullptr; + T* dbias_d = dbias ? dbias->mutable_data(ctx.GetPlace()) : nullptr; + EigenVectorArrayMap dscale_e(dscale_d, C); + EigenVectorArrayMap dbias_e(dbias_d, C); + + if (layout == framework::DataLayout::kNCHW) { + // compute dx + int stride = C * HxW; + if (dx) { + for (int i = 0; i < N; i++) { + ConstEigenArrayMap dy_e(dy_d, HxW, C); + EigenArrayMap dx_e(dx_d, HxW, C); + dx_e = dy_e.rowwise() * scale_e.transpose(); + dy_d += stride; + dx_d += stride; + } + } + // compute dscale and dbias + if (dscale && dbias) { + dy_d = dy->data(); + for (int i = 0; i < N; i++) { + ConstEigenArrayMap x_e(x_d, HxW, C); + ConstEigenArrayMap dy_e(dy_d, HxW, C); + if (i == 0) { + dscale_e = (x_e * dy_e).colwise().sum(); + } else { + dscale_e += (x_e * dy_e).colwise().sum(); + } + if (i == 0) { + dbias_e = dy_e.colwise().sum(); + } else { + dbias_e += dy_e.colwise().sum(); + } + x_d += stride; + dy_d += stride; + } + } + } else { + int num = N * HxW; + ConstEigenArrayMap dy_e(dy_d, C, num); + // compute dx + if (dx) { + EigenArrayMap dx_e(dx_d, C, num); + dx_e = dy_e.colwise() * scale_e; + } + // compute dscale and dbias + if (dscale && dbias) { + ConstEigenArrayMap x_e(x_d, C, num); + dscale_e = (x_e * dy_e).rowwise().sum(); + dbias_e = dy_e.rowwise().sum(); + } + } + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +using CPU = paddle::platform::CPUDeviceContext; + +REGISTER_OPERATOR(affine_channel, ops::AffineChannelOp, + ops::AffineChannelOpMaker, + paddle::framework::DefaultGradOpDescMaker); +REGISTER_OPERATOR(affine_channel_grad, ops::AffineChannelOpGrad); + +REGISTER_OP_CPU_KERNEL(affine_channel, ops::AffineChannelKernel, + ops::AffineChannelKernel); +REGISTER_OP_CPU_KERNEL(affine_channel_grad, + ops::AffineChannelGradKernel, + ops::AffineChannelGradKernel); diff --git 
a/paddle/fluid/operators/affine_channel_op.cu b/paddle/fluid/operators/affine_channel_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..2bebdb345ab324eb0a2dafd54c74833dd21bdb6d --- /dev/null +++ b/paddle/fluid/operators/affine_channel_op.cu @@ -0,0 +1,187 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +Indicesou may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "cub/cub.cuh" +#include "paddle/fluid/framework/data_layout.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/cuda_primitives.h" + +namespace paddle { +namespace operators { + +template +__global__ void KeAffineChannelCUDA(const T* x, const T* scale, const T* bias, + const int C, const int HxW, const int num, + T* y) { + int gid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (int i = gid; i < num; i += stride) { + const int c = layout == framework::DataLayout::kNCHW ? i / HxW % C : i % C; + if (HasBias) { + y[i] = scale[c] * x[i] + bias[c]; + } else { + y[i] = scale[c] * x[i]; + } + } +} + +template +class AffineChannelCUDAKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* x = ctx.Input("X"); + auto* scale = ctx.Input("Scale"); + auto* bias = ctx.Input("Bias"); + + auto* y = ctx.Output("Out"); + y->mutable_data(ctx.GetPlace()); + + const framework::DataLayout layout = + framework::StringToDataLayout(ctx.Attr("data_layout")); + auto& dev_ctx = ctx.template device_context(); + + auto dims = x->dims(); + const int num = x->numel(); + int N = dims[0]; + int C = layout == framework::DataLayout::kNCHW ? dims[1] + : dims[dims.size() - 1]; + int HxW = num / N / C; + + const T* x_d = x->data(); + const T* scale_d = scale->data(); + const T* bias_d = bias->data(); + T* y_d = y->data(); + + int block = 1024; + int grid = (num + block - 1) / block; + if (layout == framework::DataLayout::kNCHW) { + KeAffineChannelCUDA<<>>( + x_d, scale_d, bias_d, C, HxW, num, y_d); + } else { + KeAffineChannelCUDA<<>>( + x_d, scale_d, bias_d, C, HxW, num, y_d); + } + } +}; + +template +__global__ void AffineChannelScaleBiasGradientCUDAKernel( + const T* dy, const T* x, const int N, const int C, const int HxW, T* dscale, + T* dbias) { + const int outer_size = C; + const int inner_size = N * HxW; + typedef cub::BlockReduce BlockReduce; + __shared__ typename BlockReduce::TempStorage ds_storage; + __shared__ typename BlockReduce::TempStorage db_storage; + + for (int i = blockIdx.x; i < outer_size; i += gridDim.x) { + T ds_sum = 0; + T db_sum = 0; + for (int j = threadIdx.x; j < inner_size; j += blockDim.x) { + const int index = layout == framework::DataLayout::kNCHW + ? 
(j / HxW * C + i) * HxW + j % HxW + : j * outer_size + i; + ds_sum += dy[index] * x[index]; + db_sum += dy[index]; + } + ds_sum = BlockReduce(ds_storage).Reduce(ds_sum, cub::Sum()); + db_sum = BlockReduce(db_storage).Reduce(db_sum, cub::Sum()); + if (threadIdx.x == 0) { + dscale[i] = ds_sum; + dbias[i] = db_sum; + } + __syncthreads(); + } +} + +template +class AffineChannelGradCUDAKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* x = ctx.Input("X"); + auto* scale = ctx.Input("Scale"); + auto* bias = ctx.Input("Bias"); + auto* dy = ctx.Input(framework::GradVarName("Out")); + + auto* dx = ctx.Output(framework::GradVarName("X")); + auto* dscale = + ctx.Output(framework::GradVarName("Scale")); + auto* dbias = ctx.Output(framework::GradVarName("Bias")); + + const framework::DataLayout layout = + framework::StringToDataLayout(ctx.Attr("data_layout")); + auto& dev_ctx = ctx.template device_context(); + + auto dims = x->dims(); + const int num = x->numel(); + int N = dims[0]; + int C = layout == framework::DataLayout::kNCHW ? dims[1] + : dims[dims.size() - 1]; + int HxW = num / N / C; + + const T* x_d = x->data(); + const T* dy_d = dy->data(); + const T* s_d = scale->data(); + + T* dx_d = dx ? dx->mutable_data(ctx.GetPlace()) : nullptr; + T* ds_d = dscale ? dscale->mutable_data(ctx.GetPlace()) : nullptr; + T* db_d = dbias ? dbias->mutable_data(ctx.GetPlace()) : nullptr; + + const int block = 1024; + int max_threads = dev_ctx.GetMaxPhysicalThreadCount(); + const int max_blocks = std::max(max_threads / block, 1); + int grid1 = (num + block - 1) / block; + int grid2 = std::min(C, max_blocks); + if (layout == framework::DataLayout::kNCHW) { + if (dx) { + KeAffineChannelCUDA<<>>( + dy_d, s_d, nullptr, C, HxW, num, dx_d); + } + if (dscale && dbias) { + AffineChannelScaleBiasGradientCUDAKernel< + T, block, framework::DataLayout::kNCHW><<>>( + dy_d, x_d, N, C, HxW, ds_d, db_d); + } + } else { + if (dx) { + KeAffineChannelCUDA<<>>( + dy_d, s_d, nullptr, C, HxW, num, dx_d); + } + if (dscale && dbias) { + AffineChannelScaleBiasGradientCUDAKernel< + T, block, framework::DataLayout::kNHWC><<>>( + dy_d, x_d, N, C, HxW, ds_d, db_d); + } + } + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +using CUDA = paddle::platform::CUDADeviceContext; + +REGISTER_OP_CUDA_KERNEL(affine_channel, + ops::AffineChannelCUDAKernel, + ops::AffineChannelCUDAKernel); +REGISTER_OP_CUDA_KERNEL(affine_channel_grad, + ops::AffineChannelGradCUDAKernel, + ops::AffineChannelGradCUDAKernel); diff --git a/paddle/fluid/operators/affine_grid_cudnn_op.cu.cc b/paddle/fluid/operators/affine_grid_cudnn_op.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..ed71594ba5781590f3291d56c4ba1a4443003bd5 --- /dev/null +++ b/paddle/fluid/operators/affine_grid_cudnn_op.cu.cc @@ -0,0 +1,112 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/cudnn_helper.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using ScopedSpatialTransformerDescriptor = + platform::ScopedSpatialTransformerDescriptor; + +template +class CUDNNAffineGridOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "It must use CUDAPlace."); + auto& dev_ctx = ctx.template device_context(); + auto handle = dev_ctx.cudnn_handle(); + auto* theta = ctx.Input("Theta"); + auto* output = ctx.Output("Output"); + const T* theta_data = theta->data(); + + int n = theta->dims()[0]; + auto size_attr = ctx.Attr>("output_shape"); + Tensor h_sizes; + int* h_size_data; + if (size_attr.size() == 0) { + auto* output_shape = ctx.Input("OutputShape"); + framework::TensorCopy(*output_shape, platform::CPUPlace(), &h_sizes); + h_size_data = h_sizes.data(); + } else { + h_size_data = h_sizes.mutable_data({4}, platform::CPUPlace()); + h_size_data[0] = n; + h_size_data[1] = size_attr[1]; + h_size_data[2] = size_attr[2]; + h_size_data[3] = size_attr[3]; + } + + T* output_data = output->mutable_data( + {n, h_size_data[2], h_size_data[3], 2}, ctx.GetPlace()); + ScopedSpatialTransformerDescriptor st_desc; + cudnnSpatialTransformerDescriptor_t cudnn_st_desc = + st_desc.descriptor(4, h_size_data); + + PADDLE_ENFORCE(platform::dynload::cudnnSpatialTfGridGeneratorForward( + handle, cudnn_st_desc, theta_data, output_data)); + } +}; + +template +class CUDNNAffineGridGradOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "It must use CUDAPlace."); + auto& dev_ctx = ctx.template device_context(); + auto handle = dev_ctx.cudnn_handle(); + auto output_grad = ctx.Input(framework::GradVarName("Output")); + auto theta_grad = ctx.Output(framework::GradVarName("Theta")); + + int n = output_grad->dims()[0]; + auto size_attr = ctx.Attr>("output_shape"); + Tensor h_sizes; + int* h_size_data; + if (size_attr.size() == 0) { + auto* output_shape = ctx.Input("OutputShape"); + framework::TensorCopy(*output_shape, platform::CPUPlace(), &h_sizes); + h_size_data = h_sizes.data(); + } else { + h_size_data = h_sizes.mutable_data({4}, platform::CPUPlace()); + h_size_data[0] = n; + h_size_data[1] = size_attr[1]; + h_size_data[2] = size_attr[2]; + h_size_data[3] = size_attr[3]; + } + + ScopedSpatialTransformerDescriptor st_desc; + cudnnSpatialTransformerDescriptor_t cudnn_st_desc = + st_desc.descriptor(4, h_size_data); + + const T* output_grad_data = output_grad->data(); + T* theta_grad_data = theta_grad->mutable_data(ctx.GetPlace()); + + PADDLE_ENFORCE(platform::dynload::cudnnSpatialTfGridGeneratorBackward( + handle, cudnn_st_desc, output_grad_data, theta_grad_data)); + } +}; + +} // namespace operators +} // namespace paddle + +namespace plat = paddle::platform; +REGISTER_OP_KERNEL(affine_grid, CUDNN, plat::CUDAPlace, + paddle::operators::CUDNNAffineGridOpKernel, + paddle::operators::CUDNNAffineGridOpKernel); +REGISTER_OP_KERNEL(affine_grid_grad, CUDNN, plat::CUDAPlace, + paddle::operators::CUDNNAffineGridGradOpKernel, + paddle::operators::CUDNNAffineGridGradOpKernel); diff --git a/paddle/fluid/operators/affine_grid_op.cc b/paddle/fluid/operators/affine_grid_op.cc new file mode 100644 index 
0000000000000000000000000000000000000000..6f7da445fc84fc1f14b01a633af0e886aec6f8ed --- /dev/null +++ b/paddle/fluid/operators/affine_grid_op.cc @@ -0,0 +1,231 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/affine_grid_op.h" +#include +#include "paddle/fluid/framework/op_registry.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/cudnn_helper.h" +#endif + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +template +struct Linspace { + void operator()(T start, T end, int count, framework::Tensor* numbers, + const framework::ExecutionContext& ctx) { + T* number_data = numbers->mutable_data({count}, platform::CPUPlace()); + T slice = (end - start) / (T)(count - 1); + for (int i = 0; i < count; ++i) { + number_data[i] = start + (T)i * slice; + } + } +}; + +class AffineGridOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Theta"), + "Input(Theta) of AffineGridOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Output"), + "Output(Output) of AffineGridOp should not be null."); + auto theta_dims = ctx->GetInputDim("Theta"); + PADDLE_ENFORCE(theta_dims.size() == 3, + "AffineGrid's Input(Theta) should be 3-D tensor."); + + auto output_shape = ctx->Attrs().Get>("output_shape"); + if (output_shape.size() == 0) { + PADDLE_ENFORCE(ctx->HasInput("OutputShape"), + "Input(OutputShape) of AffineGridOp should not be null if " + "attr(output_shape) is not configured."); + auto output_shape_dims = ctx->GetInputDim("OutputShape"); + PADDLE_ENFORCE(output_shape_dims.size() == 1, + "AffineGrid's Input(OutputShape) should be 1-D tensor."); + } else { + PADDLE_ENFORCE(output_shape.size() == 4, + "The size of attr(output_shape) should be 4."); + } + + PADDLE_ENFORCE(theta_dims[1] == 2, "Input(theta) dims[1] should be 2."); + PADDLE_ENFORCE(theta_dims[2] == 3, "Input(theta) dims[2] should be 3."); + // N * H * W * 2 + ctx->SetOutputDim("Output", + framework::make_ddim({theta_dims[0], -1, -1, 2})); + ctx->ShareLoD("Theta", "Output"); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + framework::LibraryType library{framework::LibraryType::kPlain}; +#ifdef PADDLE_WITH_CUDA + if (platform::CanCUDNNBeUsed(ctx)) { + library = framework::LibraryType::kCUDNN; + } +#endif + auto data_type = framework::ToDataType(ctx.Input("Theta")->type()); + return framework::OpKernelType(data_type, ctx.GetPlace(), + framework::DataLayout::kAnyLayout, library); + } +}; + +class AffineGridOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput( + "Theta", + "(Tensor) A batch of affine transform parameters with shape [N, 2, 3]. 
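The Linspace functor above fills `count` evenly spaced values from `start` to `end`, inclusive of both endpoints. A standalone CPU equivalent follows; the assert makes explicit the count > 1 precondition implied by the (count - 1) divisor.

```cpp
// CPU reference for the Linspace functor: evenly spaced values over
// [start, end], both endpoints included.
#include <cassert>
#include <vector>

std::vector<float> Linspace(float start, float end, int count) {
  assert(count >= 2 && "step is (end-start)/(count-1); count==1 divides by 0");
  std::vector<float> numbers(count);
  float step = (end - start) / static_cast<float>(count - 1);
  for (int i = 0; i < count; ++i) numbers[i] = start + step * i;
  return numbers;
}
// Linspace(-1.f, 1.f, 5) -> {-1.0, -0.5, 0.0, 0.5, 1.0}, matching the
// relative coordinates shown in the operator documentation below.
```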
" + "It is used to transform coordinate (x_0, y_0) to coordinate (x_1, " + "y_1)."); + AddInput("OutputShape", + "(Tensor) The shape of target image with format [N, C, H, W].") + .AsDispensable(); + AddOutput("Output", "(Tensor) Output Tensor with shape [N, H, W, 2]."); + AddAttr( + "use_cudnn", + "(bool, default false) Only used in cudnn kernel, need install cudnn") + .SetDefault(true); + AddAttr>( + "output_shape", + "The target output image shape with format [N, C, H, W].") + .SetDefault(std::vector()); + + AddComment(R"DOC( + It generates a grid of (x,y) coordinates using the parameters of the + affine transformation that correspond to a set of points where the input + feature map should be sampled to produce the transformed output feature map. + + Given: + Theta = [[[x_11, x_12, x_13] + [x_14, x_15, x_16]] + [[x_21, x_22, x_23] + [x_24, x_25, x_26]]] + + OutputShape = [2, 3, 5, 5] + + Step 1: + + Generate relative coordinates according to OutputShape. + The values of relative coordinates are in the interval between -1 and 1. + The shape of the relative coordinates is [2, H, W] as below: + + C = [[[-1. -1. -1. -1. -1. ] + [-0.5 -0.5 -0.5 -0.5 -0.5] + [ 0. 0. 0. 0. 0. ] + [ 0.5 0.5 0.5 0.5 0.5] + [ 1. 1. 1. 1. 1. ]] + [[-1. -0.5 0. 0.5 1. ] + [-1. -0.5 0. 0.5 1. ] + [-1. -0.5 0. 0.5 1. ] + [-1. -0.5 0. 0.5 1. ] + [-1. -0.5 0. 0.5 1. ]]] + C[0] is the coordinates in height axis and C[1] is the coordinates in width axis. + + Step2: + Tanspose and reshape C to shape [H * W, 2] and append ones to last dimension. The we get: + C_ = [[-1. -1. 1. ] + [-0.5 -1. 1. ] + [ 0. -1. 1. ] + [ 0.5 -1. 1. ] + [ 1. -1. 1. ] + [-1. -0.5 1. ] + [-0.5 -0.5 1. ] + [ 0. -0.5 1. ] + [ 0.5 -0.5 1. ] + [ 1. -0.5 1. ] + [-1. 0. 1. ] + [-0.5 0. 1. ] + [ 0. 0. 1. ] + [ 0.5 0. 1. ] + [ 1. 0. 1. ] + [-1. 0.5 1. ] + [-0.5 0.5 1. ] + [ 0. 0.5 1. ] + [ 0.5 0.5 1. ] + [ 1. 0.5 1. ] + [-1. 1. 1. ] + [-0.5 1. 1. ] + [ 0. 1. 1. ] + [ 0.5 1. 1. ] + [ 1. 1. 1. 
]] + Step3: + Compute output by equation $$Output[i] = C_ * Theta[i]^T$$ + )DOC"); + } +}; + +class AffineGridOpGrad : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + auto theta_dims = ctx->GetInputDim("Theta"); + if (ctx->HasOutput(framework::GradVarName("Theta"))) { + ctx->SetOutputDim(framework::GradVarName("Theta"), theta_dims); + } + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + framework::LibraryType library_{framework::LibraryType::kPlain}; +#ifdef PADDLE_WITH_CUDA + if (platform::CanCUDNNBeUsed(ctx)) { + library_ = framework::LibraryType::kCUDNN; + } +#endif + return framework::OpKernelType( + framework::ToDataType(ctx.Input("Theta")->type()), + ctx.GetPlace(), framework::DataLayout::kAnyLayout, library_); + } +}; + +class AffineGridGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + auto* op = new framework::OpDesc(); + op->SetType("affine_grid_grad"); + op->SetInput("Theta", Input("Theta")); + op->SetInput("OutputShape", Input("OutputShape")); + op->SetInput(framework::GradVarName("Output"), OutputGrad("Output")); + + op->SetAttrMap(Attrs()); + + op->SetOutput(framework::GradVarName("Theta"), InputGrad("Theta")); + return std::unique_ptr(op); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(affine_grid, ops::AffineGridOp, ops::AffineGridOpMaker, + ops::AffineGridGradMaker); +REGISTER_OPERATOR(affine_grid_grad, ops::AffineGridOpGrad); + +REGISTER_OP_CPU_KERNEL( + affine_grid, + ops::AffineGridOpKernel, + ops::AffineGridOpKernel); +REGISTER_OP_CPU_KERNEL( + affine_grid_grad, + ops::AffineGridGradOpKernel, + ops::AffineGridGradOpKernel); diff --git a/paddle/fluid/operators/affine_grid_op.h b/paddle/fluid/operators/affine_grid_op.h new file mode 100644 index 0000000000000000000000000000000000000000..87d23831486e658374d4c011412fdef57be1b994 --- /dev/null +++ b/paddle/fluid/operators/affine_grid_op.h @@ -0,0 +1,174 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/blas.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +template +using EigenTensor = framework::EigenTensor; + +using Array1 = Eigen::DSizes; +using Array2 = Eigen::DSizes; +using Array3 = Eigen::DSizes; +using Array4 = Eigen::DSizes; + +/** + *Return a tensor with evenly spaced numbers over a specified interval. 
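A tiny standalone check of the Step 3 equation above; the translation-only theta is a made-up example, and every output grid point is a [x, y, 1] row of C_ times Theta^T.

```cpp
// One grid point of Output[i] = C_ * Theta[i]^T: a 1x3 row times a
// 3x2 matrix (Theta stored as 2x3, so we contract against its rows).
#include <array>
#include <cstdio>

int main() {
  // Hypothetical theta: identity rotation plus translation (0.5, -0.25).
  const float theta[2][3] = {{1.f, 0.f, 0.5f}, {0.f, 1.f, -0.25f}};
  const std::array<float, 3> c_row = {-1.f, -1.f, 1.f};  // one row of C_
  float out[2] = {0.f, 0.f};
  for (int r = 0; r < 2; ++r)
    for (int k = 0; k < 3; ++k) out[r] += c_row[k] * theta[r][k];
  std::printf("(%g, %g)\n", out[0], out[1]);  // prints (-0.5, -1.25)
  return 0;
}
```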
+ */ +template +struct Linspace { + void operator()(T start, T end, int count, framework::Tensor* numbers, + const framework::ExecutionContext& ctx); +}; + +template +inline void GetIdxMap(int n, int h, int w, Tensor* grid, + const framework::ExecutionContext& ctx) { + auto& place = *ctx.template device_context().eigen_device(); + grid->mutable_data({n, h, w, 3}, ctx.GetPlace()); + auto grid_t = EigenTensor::From(*grid); + // Get indexes of height with shape [height, width, 1] + Tensor h_idx; + Linspace linspace; + linspace((T)-1, (T)1, h, &h_idx, ctx); + auto h_idx_t = EigenTensor::From(h_idx); + // Get indexes of width with shape [height, width, 1] + Tensor w_idx; + linspace((T)-1, (T)1, w, &w_idx, ctx); + auto w_idx_t = EigenTensor::From(w_idx); + // Get constant ones tensor with shape [height, width, 1] + Tensor ones; + ones.mutable_data({h, w, 1}, ctx.GetPlace()); + auto ones_t = EigenTensor::From(ones).setConstant((T)1); + // Get grid tensor with shape [n, h, w, 3] by concatenating h_idx, w_idx and + // ones + Tensor w_idx_map; + w_idx_map.mutable_data({h, w, 1}, ctx.GetPlace()); + auto w_idx_map_t = EigenTensor::From(w_idx_map); + Tensor h_idx_map; + h_idx_map.mutable_data({h, w, 1}, ctx.GetPlace()); + auto h_idx_map_t = EigenTensor::From(h_idx_map); + Tensor w_h_idx_map; + w_h_idx_map.mutable_data({h, w, 2}, ctx.GetPlace()); + auto w_h_idx_map_t = EigenTensor::From(w_h_idx_map); + Tensor w_h_one_idx_map; + w_h_one_idx_map.mutable_data({h, w, 3}, ctx.GetPlace()); + auto w_h_one_idx_map_t = EigenTensor::From(w_h_one_idx_map); + + w_idx_map_t.device(place) = w_idx_t.reshape(Array2(1, w)) + .broadcast(Array2(h, 1)) + .reshape(Array3(h, w, 1)); + + h_idx_map_t.device(place) = h_idx_t.reshape(Array2(1, h)) + .broadcast(Array2(w, 1)) + .shuffle(Array2(1, 0)) + .reshape(Array3(h, w, 1)); + + w_h_idx_map_t.device(place) = w_idx_map_t.concatenate(h_idx_map_t, 2); + w_h_one_idx_map_t.device(place) = w_h_idx_map_t.concatenate(ones_t, 2); + grid_t.device(place) = w_h_one_idx_map_t.reshape(Array4(1, h, w, 3)) + .broadcast(Array4(n, 1, 1, 1)); +} + +template +class AffineGridOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* theta = ctx.Input("Theta"); + int n = theta->dims()[0]; + auto size_attr = ctx.Attr>("output_shape"); + int h = 0; + int w = 0; + if (size_attr.size() == 0) { + auto* output_shape = ctx.Input("OutputShape"); + Tensor h_sizes; + framework::TensorCopy(*output_shape, platform::CPUPlace(), &h_sizes); + const int* h_size_data = h_sizes.data(); + h = h_size_data[2]; + w = h_size_data[3]; + } else { + h = size_attr[2]; + w = size_attr[3]; + } + auto* output = ctx.Output("Output"); + output->mutable_data({n, h, w, 2}, ctx.GetPlace()); + math::SetConstant()( + ctx.template device_context(), output, + static_cast(0)); + Tensor grid; + GetIdxMap(n, h, w, &grid, ctx); + // output = grid * theta.T + // TODO(wanghaoshuang): Refine batched matrix multiply + auto blas = math::GetBlas(ctx); + for (int i = 0; i < n; ++i) { + Tensor sliced_grid = grid.Slice(i, i + 1).Resize({h * w, 3}); + Tensor sliced_theta = theta->Slice(i, i + 1).Resize({2, 3}); + Tensor sliced_out = output->Slice(i, i + 1).Resize({h * w, 2}); + blas.MatMul(sliced_grid, false, sliced_theta, true, T(1), &sliced_out, + T(0)); + } + } +}; + +template +class AffineGridGradOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto output_grad = 
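GetIdxMap above builds the homogeneous grid by broadcasting two linspaces and a ones plane with Eigen, concatenating them along the last axis into shape [n, h, w, 3]; the forward kernel then multiplies each sample's [h*w, 3] grid slice by Theta[i]^T. Below is a dependency-free CPU sketch of the same forward computation; the names are illustrative.

```cpp
// CPU sketch of affine_grid forward: build the homogeneous grid row
// [x, y, 1] per pixel, then Output[i] = grid * Theta[i]^T per sample.
#include <vector>

void AffineGridForward(const std::vector<float>& theta,  // [n, 2, 3]
                       int n, int h, int w,
                       std::vector<float>* output) {     // [n, h, w, 2]
  output->assign(static_cast<size_t>(n) * h * w * 2, 0.f);
  for (int i = 0; i < n; ++i) {
    for (int r = 0; r < h; ++r) {
      for (int c = 0; c < w; ++c) {
        // Normalized coordinates in [-1, 1], plus the homogeneous 1.
        float wx = (w > 1) ? -1.f + 2.f * c / (w - 1) : -1.f;
        float hy = (h > 1) ? -1.f + 2.f * r / (h - 1) : -1.f;
        const float g[3] = {wx, hy, 1.f};
        for (int d = 0; d < 2; ++d) {
          float v = 0.f;
          for (int k = 0; k < 3; ++k)
            v += g[k] * theta[(i * 2 + d) * 3 + k];  // row d of Theta[i]
          (*output)[((static_cast<size_t>(i) * h + r) * w + c) * 2 + d] = v;
        }
      }
    }
  }
}
```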
ctx.Input(framework::GradVarName("Output")); + auto theta_grad = ctx.Output(framework::GradVarName("Theta")); + int n = output_grad->dims()[0]; + auto size_attr = ctx.Attr>("output_shape"); + int h = 0; + int w = 0; + if (size_attr.size() == 0) { + auto* output_shape = ctx.Input("OutputShape"); + Tensor h_sizes; + framework::TensorCopy(*output_shape, platform::CPUPlace(), &h_sizes); + const int* h_size_data = h_sizes.data(); + h = h_size_data[2]; + w = h_size_data[3]; + } else { + h = size_attr[2]; + w = size_attr[3]; + } + theta_grad->mutable_data({n, 2, 3}, ctx.GetPlace()); + math::SetConstant()( + ctx.template device_context(), theta_grad, + static_cast(0)); + Tensor grid; + GetIdxMap(n, h, w, &grid, ctx); + // output = grid * theta.T + // TODO(wanghaoshuang): Refine batched matrix multiply + auto blas = math::GetBlas(ctx); + for (int i = 0; i < n; ++i) { + Tensor sliced_grid = grid.Slice(i, i + 1).Resize({h * w, 3}); + Tensor sliced_out_grad = output_grad->Slice(i, i + 1).Resize({h * w, 2}); + Tensor sliced_theta_grad = theta_grad->Slice(i, i + 1).Resize({2, 3}); + blas.MatMul(sliced_out_grad, true, sliced_grid, false, T(1), + &sliced_theta_grad, T(0)); + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/array_operator.h b/paddle/fluid/operators/array_operator.h index 4309f0a5497456065e5c43bc8f7b265fa711f699..eddf34494bdab18c9d4ae1fb3d1e5d1a71fe590e 100644 --- a/paddle/fluid/operators/array_operator.h +++ b/paddle/fluid/operators/array_operator.h @@ -49,7 +49,7 @@ class ArrayOp : public framework::OperatorBase { } else { offset = static_cast(*i_tensor.data()); } - VLOG(10) << " Offset = " << offset; + VLOG(100) << " Offset = " << offset; return offset; } }; diff --git a/paddle/fluid/operators/array_to_lod_tensor_op.cc b/paddle/fluid/operators/array_to_lod_tensor_op.cc index b8b8b2290a0f002fd379032e28590b84a1da38e9..3c40135eca00f4e0bbff9b0f0f7cf2a4c85ec556 100644 --- a/paddle/fluid/operators/array_to_lod_tensor_op.cc +++ b/paddle/fluid/operators/array_to_lod_tensor_op.cc @@ -11,7 +11,7 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include +#include #include #include "paddle/fluid/framework/lod_rank_table.h" @@ -148,8 +148,8 @@ class ArrayToLoDTensorOp : public framework::OperatorBase { size_t start_offset = lod_and_offset.second.first; size_t end_offset = lod_and_offset.second.second; - VLOG(10) << "idx=" << idx << " x_idx=" << x_idx << " [" - << ", " << end_offset << "]"; + VLOG(100) << "idx=" << idx << " x_idx=" << x_idx << " [" + << ", " << end_offset << "]"; // Copy data PADDLE_ENFORCE_GE(end_offset, start_offset); size_t len = end_offset - start_offset; diff --git a/paddle/fluid/operators/batch_norm_op.cc b/paddle/fluid/operators/batch_norm_op.cc index 5912a1a17cbd29c3ebd83f37133c044f0905c8bd..2463c939bc5d19500ba36ba3c73db176bb82c62a 100644 --- a/paddle/fluid/operators/batch_norm_op.cc +++ b/paddle/fluid/operators/batch_norm_op.cc @@ -113,7 +113,10 @@ class BatchNormOp : public framework::OperatorWithKernel { class BatchNormOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { - AddAttr("is_test", "").SetDefault(false); + AddAttr("is_test", + "(bool, default false) Set to true for inference only, false " + "for training. 
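Since the forward pass computes Output[i] = grid * Theta[i]^T, the backward MatMul above (first operand transposed) is dTheta[i] = dOutput[i]^T * grid, a [2, h*w] x [h*w, 3] product. A plain-C++ check of that contraction, with illustrative names:

```cpp
// CPU sketch of affine_grid backward: accumulate the grid against the
// upstream gradient to recover the [2, 3] gradient of each theta.
#include <vector>

void AffineGridBackward(const std::vector<float>& dout,  // [n, h*w, 2]
                        const std::vector<float>& grid,  // [n, h*w, 3]
                        int n, int hw,
                        std::vector<float>* dtheta) {    // [n, 2, 3]
  dtheta->assign(static_cast<size_t>(n) * 2 * 3, 0.f);
  for (int i = 0; i < n; ++i)
    for (int d = 0; d < 2; ++d)
      for (int k = 0; k < 3; ++k) {
        float acc = 0.f;
        for (int p = 0; p < hw; ++p)
          acc += dout[(static_cast<size_t>(i) * hw + p) * 2 + d] *
                 grid[(static_cast<size_t>(i) * hw + p) * 3 + k];
        (*dtheta)[(static_cast<size_t>(i) * 2 + d) * 3 + k] = acc;
      }
}
```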
Some layers may run faster when this is true.") + .SetDefault(false); AddAttr("momentum", "").SetDefault(0.9); AddAttr("epsilon", "") .SetDefault(1e-5) @@ -135,15 +138,13 @@ class BatchNormOpMaker : public framework::OpProtoAndCheckerMaker { AddInput("Variance", "The global variance (for training) " "or estimated Variance (for testing)"); - AddOutput("Y", "result after normalization").Reuse("X"); + AddOutput("Y", "result after normalization"); AddOutput("MeanOut", "Share memory with Mean. " - "Store the global mean when training") - .Reuse("Mean"); + "Store the global mean when training"); AddOutput("VarianceOut", "Share memory with Variance. " - "Store the global Variance when training") - .Reuse("Variance"); + "Store the global Variance when training"); AddOutput("SavedMean", "Mean of the current mini batch, " "will apply to output when training") @@ -172,6 +173,15 @@ The required data format for this layer is one of the following: } }; +class BatchNormOpInferVarType + : public framework::PassInDtypeAndVarTypeToOutput { + protected: + std::unordered_map GetInputOutputWithSameType() + const override { + return std::unordered_map{{"X", /*->*/ "Y"}}; + } +}; + template class BatchNormKernel : public framework::OpKernel { @@ -527,7 +537,7 @@ class BatchNormGradMaker : public framework::SingleGradOpDescMaker { namespace ops = paddle::operators; REGISTER_OPERATOR(batch_norm, ops::BatchNormOp, ops::BatchNormOpMaker, - ops::BatchNormGradMaker); + ops::BatchNormOpInferVarType, ops::BatchNormGradMaker); REGISTER_OPERATOR(batch_norm_grad, ops::BatchNormGradOp); REGISTER_OP_CPU_KERNEL( diff --git a/paddle/fluid/operators/batch_norm_op.cu.cc b/paddle/fluid/operators/batch_norm_op.cu.cc index ca6cd8669352fd5814f25a04433ca97fe4abe9ff..0609027c6940533483173209176f3243ccb36f8f 100644 --- a/paddle/fluid/operators/batch_norm_op.cu.cc +++ b/paddle/fluid/operators/batch_norm_op.cu.cc @@ -96,7 +96,7 @@ class BatchNormKernel mode_ = CUDNN_BATCHNORM_SPATIAL; #endif - VLOG(3) << "Setting descriptors."; + VLOG(30) << "Setting descriptors."; std::vector dims; std::vector strides; if (data_layout == DataLayout::kNCHW) { @@ -219,8 +219,8 @@ class BatchNormGradKernel auto *d_bias = ctx.Output(framework::GradVarName("Bias")); d_x->mutable_data(ctx.GetPlace()); - d_scale->mutable_data(ctx.GetPlace()); - d_bias->mutable_data(ctx.GetPlace()); + d_scale->mutable_data>(ctx.GetPlace()); + d_bias->mutable_data>(ctx.GetPlace()); auto &dev_ctx = ctx.template device_context(); if ((N * H * W * D) == 1) { @@ -272,8 +272,10 @@ class BatchNormGradKernel const auto *saved_mean = ctx.Input("SavedMean"); const auto *saved_var = ctx.Input("SavedVariance"); - const void *saved_mean_data = saved_mean->template data(); - const void *saved_var_data = saved_var->template data(); + const void *saved_mean_data = + saved_mean->template data>(); + const void *saved_var_data = + saved_var->template data>(); CUDNN_ENFORCE(platform::dynload::cudnnBatchNormalizationBackward( dev_ctx.cudnn_handle(), mode_, CudnnDataType::kOne(), @@ -281,10 +283,10 @@ class BatchNormGradKernel CudnnDataType::kZero(), data_desc_, x->template data(), data_desc_, d_y->template data(), data_desc_, d_x->template mutable_data(ctx.GetPlace()), bn_param_desc_, - scale->template data(), - d_scale->template mutable_data(ctx.GetPlace()), - d_bias->template mutable_data(ctx.GetPlace()), epsilon, - saved_mean_data, saved_var_data)); + scale->template data>(), + d_scale->template mutable_data>(ctx.GetPlace()), + d_bias->template mutable_data>(ctx.GetPlace()), + epsilon, 
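The BatchNormParamType changes above switch the scale/bias gradients and the saved statistics to cuDNN's parameter type, which stays float32 even when the data type T is float16. A sketch of that trait follows, assuming it mirrors the BatchNormParamType alias derived from platform::CudnnDataType; the float16 stand-in here is illustrative only.

```cpp
// Sketch: decouple the batch-norm parameter type from the data type, so
// fp16 inputs still accumulate scale/bias (and their grads) in fp32.
#include <cstdint>

struct float16 { uint16_t x; };  // stand-in for platform::float16

template <typename T>
struct BatchNormParamTypeTrait {
  using type = T;  // float/double parameters match the data type
};
template <>
struct BatchNormParamTypeTrait<float16> {
  using type = float;  // fp16 data still uses fp32 parameters
};

template <typename T>
using BatchNormParamType = typename BatchNormParamTypeTrait<T>::type;

static_assert(sizeof(BatchNormParamType<float16>) == sizeof(float),
              "fp16 batch norm accumulates parameters in fp32");
```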
saved_mean_data, saved_var_data)); // clean when exit. CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(data_desc_)); @@ -304,4 +306,5 @@ REGISTER_OP_CUDA_KERNEL( ops::BatchNormKernel); REGISTER_OP_CUDA_KERNEL( batch_norm_grad, ops::BatchNormGradKernel, - ops::BatchNormGradKernel); + ops::BatchNormGradKernel, + ops::BatchNormGradKernel); diff --git a/paddle/fluid/operators/beam_search_decode_op.cc b/paddle/fluid/operators/beam_search_decode_op.cc index b6cb935814e25b31d4104f9ce24fe952680cb491..0d32cae0e1e5ff274793df50e854283d8e2f7bf8 100644 --- a/paddle/fluid/operators/beam_search_decode_op.cc +++ b/paddle/fluid/operators/beam_search_decode_op.cc @@ -79,6 +79,9 @@ struct BeamSearchDecodeFunctor { bool tensor_on_gpu_; size_t beam_size_; int end_id_; + // TODO(Superjomn) Here might result serious performance issue in the + // concurrency + // scenarios. const LoDTensorArray& step_ids_origin_; const LoDTensorArray& step_scores_origin_; LoDTensorArray step_ids_ = LoDTensorArray(); diff --git a/paddle/fluid/operators/beam_search_op.cc b/paddle/fluid/operators/beam_search_op.cc index 62771d09f112785ca1ba741a0ba239b1f0234633..791f8a4d3be6780c584997113b7ffcfb7ab35667 100644 --- a/paddle/fluid/operators/beam_search_op.cc +++ b/paddle/fluid/operators/beam_search_op.cc @@ -33,11 +33,11 @@ void BeamSearch::operator()(const framework::LoDTensor &pre_ids, auto items = SelectTopBeamSizeItems(pre_ids, pre_scores); auto selected_items = ToMap(items, high_level.back()); - VLOG(3) << "selected_items:"; + VLOG(30) << "selected_items:"; for (size_t i = 0; i < selected_items.size(); ++i) { - VLOG(3) << "offset:" << i; + VLOG(30) << "offset:" << i; for (auto &item : selected_items[i]) { - VLOG(3) << ItemToString(item); + VLOG(30) << ItemToString(item); } } @@ -138,11 +138,11 @@ std::vector> BeamSearch::SelectTopBeamSizeItems( } result.emplace_back(items); } - VLOG(3) << "SelectTopBeamSizeItems result size " << result.size(); + VLOG(30) << "SelectTopBeamSizeItems result size " << result.size(); for (auto &items : result) { - VLOG(3) << "item set:"; + VLOG(30) << "item set:"; for (auto &item : items) { - VLOG(3) << ItemToString(item); + VLOG(30) << ItemToString(item); } } diff --git a/paddle/fluid/operators/beam_search_op_test.cc b/paddle/fluid/operators/beam_search_op_test.cc index c4f4b478fbfc87e4178155132781214575c1e6b0..40b46781daa989fcd89887a3c01e97e39ea71255 100644 --- a/paddle/fluid/operators/beam_search_op_test.cc +++ b/paddle/fluid/operators/beam_search_op_test.cc @@ -46,7 +46,7 @@ void CreateInput(LoDTensor* ids, LoDTensor* scores) { auto* scores_data = scores->mutable_data(place); vector _ids({4, 2, 5, 2, 1, 3, 3, 5, 2, 8, 2, 1}); vector _scores( - {0.5, 0.3, 0.2, 0.6, 0.3, 0.1, 0.9, 0.5, 0.1, 0.7, 0.5, 0.1}); + {0.5f, 0.3f, 0.2f, 0.6f, 0.3f, 0.1f, 0.9f, 0.5f, 0.1f, 0.7f, 0.5f, 0.1f}); for (int i = 0; i < 12; i++) { ids_data[i] = _ids[i]; @@ -54,7 +54,8 @@ void CreateInput(LoDTensor* ids, LoDTensor* scores) { } } -TEST(beam_search_op, run) { +// It seems that beam_search_op has bugs. 
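Two conventions show up in the test changes just below: the `f` suffixes keep the float literals from being double constants that narrow on assignment, and renaming the test with a `DISABLED_` prefix is gtest's standard way to keep a buggy test compiled but skipped by default. A minimal example of the latter:

```cpp
// A gtest whose suite name starts with DISABLED_ is built but skipped.
// It can still be run explicitly:
//   ./test_binary --gtest_also_run_disabled_tests --gtest_filter='*example*'
#include <gtest/gtest.h>

TEST(DISABLED_example, skipped_by_default) {
  // Runs only when --gtest_also_run_disabled_tests is given.
  EXPECT_EQ(2 + 2, 4);
}
```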
+TEST(DISABLED_beam_search_op, run) { CPUPlace place; LoDTensor ids, scores; CreateInput(&ids, &scores); @@ -79,7 +80,7 @@ TEST(beam_search_op, run) { ASSERT_EQ(sids.lod(), sscores.lod()); vector tids({4, 2, 3, 8}); - vector tscores({0.5, 0.6, 0.9, 0.7}); + vector tscores({0.5f, 0.6f, 0.9f, 0.7f}); for (int i = 0; i < 4; i++) { ASSERT_EQ(tids[i], sids.data()[i]); diff --git a/paddle/fluid/operators/bilinear_interp_op.cu b/paddle/fluid/operators/bilinear_interp_op.cu deleted file mode 100644 index 4c1971538495c6f111e9db18f4014786f6f0dd58..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/bilinear_interp_op.cu +++ /dev/null @@ -1,207 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -#include "paddle/fluid/operators/bilinear_interp_op.h" -#include "paddle/fluid/platform/cuda_primitives.h" - -namespace paddle { -namespace operators { - -using framework::Tensor; - -template -__global__ void KeBilinearInterpFw( - const T* in, const size_t in_img_h, const size_t in_img_w, - const size_t input_h, const size_t input_w, T* out, const size_t out_img_h, - const size_t out_img_w, const size_t output_h, const size_t output_w, - const size_t num_channels, const T ratio_h, const T ratioW) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - if (tid < nthreads) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - int channel_id = out_id_w / out_img_size; - - int out_img_idy = (out_id_w % out_img_size) / out_img_w; - int in_img_idy = ratio_h * out_img_idy; - int h_id = (in_img_idy < in_img_h - 1) ? 1 : 0; - T h1lambda = ratio_h * out_img_idy - in_img_idy; - T h2lambda = 1.f - h1lambda; - - int out_img_idx = tid % out_img_w; - int in_img_idx = ratioW * out_img_idx; - int w_id = (in_img_idx < in_img_w - 1) ? 
1 : 0; - T w1lambda = ratioW * out_img_idx - in_img_idx; - T w2lambda = 1.f - w1lambda; - - const T* in_pos = &in[out_id_h * input_w + channel_id * in_img_size + - in_img_idy * in_img_w + in_img_idx]; - - // bilinear interpolation - out[out_id_h * output_w + out_id_w] = - h2lambda * (w2lambda * in_pos[0] + w1lambda * in_pos[w_id]) + - h1lambda * (w2lambda * in_pos[h_id * in_img_w] + - w1lambda * in_pos[h_id * in_img_w + w_id]); - } -} - -template -__global__ void KeBilinearInterpBw( - T* in, const size_t in_img_h, const size_t in_img_w, const size_t input_h, - const size_t input_w, const T* out, const size_t out_img_h, - const size_t out_img_w, const size_t output_h, const size_t output_w, - const size_t num_channels, const T ratio_h, const T ratioW) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - if (tid < nthreads) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - int channel_id = out_id_w / out_img_size; - - int out_img_idy = (out_id_w % out_img_size) / out_img_w; - int in_img_idy = ratio_h * out_img_idy; - int h_id = (in_img_idy < in_img_h - 1) ? 1 : 0; - T h1lambda = ratio_h * out_img_idy - in_img_idy; - T h2lambda = 1.f - h1lambda; - - int out_img_idx = tid % out_img_w; - int in_img_idx = ratioW * out_img_idx; - int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; - T w1lambda = ratioW * out_img_idx - in_img_idx; - T w2lambda = 1.f - w1lambda; - - T* in_pos = &in[out_id_h * input_w + channel_id * in_img_size + - in_img_idy * in_img_w + in_img_idx]; - const T* out_pos = &out[out_id_h * output_w + out_id_w]; - atomicAdd(&in_pos[0], h2lambda * w2lambda * out_pos[0]); - atomicAdd(&in_pos[w_id], h2lambda * w1lambda * out_pos[0]); - atomicAdd(&in_pos[h_id * in_img_w], h1lambda * w2lambda * out_pos[0]); - atomicAdd(&in_pos[h_id * in_img_w + w_id], - h1lambda * w1lambda * out_pos[0]); - } -} - -template -class BilinearInterpOpCUDAKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), - "This kernel only runs on GPU device."); - auto* input_t = ctx.Input("X"); // float tensor - auto* output_t = ctx.Output("Out"); // float tensor - auto* input = input_t->data(); - - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - auto out_dims = output_t->dims(); - auto out_size_t = ctx.Input("OutSize"); - if (out_size_t != nullptr) { - Tensor sizes; - framework::TensorCopy(*out_size_t, platform::CPUPlace(), &sizes); - auto size_data = sizes.data(); - out_h = size_data[0]; - out_w = size_data[1]; - } - auto* output = output_t->mutable_data( - {out_dims[0], out_dims[1], out_h, out_w}, ctx.GetPlace()); - - int batch_size = input_t->dims()[0]; - int channels = input_t->dims()[1]; - int in_h = input_t->dims()[2]; - int in_w = input_t->dims()[3]; - - int in_hw = in_h * in_w; - int out_hw = out_h * out_w; - int in_chw = channels * in_hw; - int out_chw = channels * out_hw; - - T ratio_h = (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; - T ratio_w = (out_w > 1) ? 
static_cast(in_w - 1) / (out_w - 1) : 0.f; - - if (in_h == out_h && in_w == out_w) { - memcpy(output, input, input_t->numel() * sizeof(T)); - } else { - int threadNum = batch_size * out_chw; - int blocks = (threadNum + 1024 - 1) / 1024; - - KeBilinearInterpFw< - T><<>>( - input, in_h, in_w, batch_size, in_chw, output, out_h, out_w, - batch_size, out_chw, channels, ratio_h, ratio_w); - } - } -}; - -template -class BilinearInterpGradOpCUDAKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - auto* d_input_t = ctx.Output(framework::GradVarName("X")); - auto* d_output_t = ctx.Input(framework::GradVarName("Out")); - auto* d_output = d_output_t->data(); - auto* d_input = d_input_t->mutable_data(ctx.GetPlace()); - - auto& device_ctx = - ctx.template device_context(); - math::SetConstant zero; - zero(device_ctx, d_input_t, static_cast(0.0)); - - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - - auto out_size_t = ctx.Input("OutSize"); - if (out_size_t != nullptr) { - Tensor sizes; - framework::TensorCopy(*out_size_t, platform::CPUPlace(), &sizes); - auto size_data = sizes.data(); - out_h = size_data[0]; - out_w = size_data[1]; - } - - int batch_size = d_input_t->dims()[0]; - int channels = d_input_t->dims()[1]; - int in_h = d_input_t->dims()[2]; - int in_w = d_input_t->dims()[3]; - - int in_hw = in_h * in_w; - int out_hw = out_h * out_w; - int in_chw = channels * in_hw; - int out_chw = channels * out_hw; - - T ratio_h = (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; - T ratio_w = (out_w > 1) ? static_cast(in_w - 1) / (out_w - 1) : 0.f; - - if (in_h == out_h && in_w == out_w) { - memcpy(d_input, d_output, d_input_t->numel() * sizeof(T)); - } else { - int threadNum = batch_size * out_chw; - int blocks = (threadNum + 1024 - 1) / 1024; - - KeBilinearInterpBw< - T><<>>( - d_input, in_h, in_w, batch_size, in_chw, d_output, out_h, out_w, - batch_size, out_chw, channels, ratio_h, ratio_w); - } - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_CUDA_KERNEL(bilinear_interp, - ops::BilinearInterpOpCUDAKernel); -REGISTER_OP_CUDA_KERNEL(bilinear_interp_grad, - ops::BilinearInterpGradOpCUDAKernel); diff --git a/paddle/fluid/operators/bilinear_interp_op.h b/paddle/fluid/operators/bilinear_interp_op.h deleted file mode 100644 index 70847cb8c1abe2e94bc844ab8117d1f23fea533b..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/bilinear_interp_op.h +++ /dev/null @@ -1,163 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. 
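For reference, the deleted kernels implement classic bilinear interpolation: source coordinates are i * ratio_h and j * ratio_w with ratio = (in - 1) / (out - 1), and the four neighbouring input pixels are blended by the lambda weights; the backward kernel scatters the same four weights back with atomicAdd. A standalone sketch of one forward sample, with illustrative names:

```cpp
// Bilinear interpolation of one output pixel (i, j) from a single-channel
// image, matching the lambda-weight scheme of the removed kernels.
#include <vector>

float BilinearSample(const std::vector<float>& img, int in_h, int in_w,
                     float ratio_h, float ratio_w, int i, int j) {
  int h = static_cast<int>(ratio_h * i);      // top-left source corner
  int w = static_cast<int>(ratio_w * j);
  int hid = (h < in_h - 1) ? 1 : 0;           // clamp at the image border
  int wid = (w < in_w - 1) ? 1 : 0;
  float h1 = ratio_h * i - h, h2 = 1.f - h1;  // vertical blend weights
  float w1 = ratio_w * j - w, w2 = 1.f - w1;  // horizontal blend weights
  const float* p = &img[h * in_w + w];
  return h2 * (w2 * p[0] + w1 * p[wid]) +
         h1 * (w2 * p[hid * in_w] + w1 * p[hid * in_w + wid]);
}
```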
*/ - -#pragma once -#include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/math/math_function.h" - -namespace paddle { -namespace operators { - -using Tensor = framework::Tensor; - -template -class BilinearInterpKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - auto* input_t = ctx.Input("X"); // float tensor - auto* output_t = ctx.Output("Out"); // float tensor - auto out_dims = output_t->dims(); - auto* input = input_t->data(); - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - auto out_size_t = ctx.Input("OutSize"); - if (out_size_t != nullptr) { - auto out_size_data = out_size_t->data(); - out_h = out_size_data[0]; - out_w = out_size_data[1]; - } - auto* output = output_t->mutable_data( - {out_dims[0], out_dims[1], out_h, out_w}, ctx.GetPlace()); - int batch_size = input_t->dims()[0]; - int channels = input_t->dims()[1]; - int in_h = input_t->dims()[2]; - int in_w = input_t->dims()[3]; - - int in_hw = in_h * in_w; - int out_hw = out_h * out_w; - int in_chw = channels * in_hw; - int out_chw = channels * out_hw; - - float ratio_h = - (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; - float ratio_w = - (out_w > 1) ? static_cast(in_w - 1) / (out_w - 1) : 0.f; - - if (in_h == out_h && in_w == out_w) { - memcpy(output, input, input_t->numel() * sizeof(T)); - } else { - for (int k = 0; k < batch_size; ++k) { // loop for batches - for (int i = 0; i < out_h; ++i) { // loop for images - int h = ratio_h * i; - int hid = (h < in_h - 1) ? 1 : 0; - float h1lambda = ratio_h * i - h; - float h2lambda = 1.f - h1lambda; - - for (int j = 0; j < out_w; ++j) { - int w = ratio_w * j; - int wid = (w < in_w - 1) ? 1 : 0; - float w1lambda = ratio_w * j - w; - float w2lambda = 1.f - w1lambda; - // calculate four position for bilinear interpolation - const T* in_pos = &input[k * in_chw + h * in_w + w]; - T* out_pos = &output[k * out_chw + i * out_w + j]; - - for (int c = 0; c < channels; ++c) { // loop for channels - // bilinear interpolation - out_pos[0] = static_cast( - h2lambda * (w2lambda * in_pos[0] + w1lambda * in_pos[wid]) + - h1lambda * (w2lambda * in_pos[hid * in_w] + - w1lambda * in_pos[hid * in_w + wid])); - in_pos += in_hw; - out_pos += out_hw; - } - } - } - } - } - } -}; - -template -class BilinearInterpGradKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - auto* d_input_t = ctx.Output(framework::GradVarName("X")); - auto* d_output_t = ctx.Input(framework::GradVarName("Out")); - auto* d_output = d_output_t->data(); - auto* d_input = d_input_t->mutable_data(ctx.GetPlace()); - auto& device_ctx = - ctx.template device_context(); - math::SetConstant zero; - zero(device_ctx, d_input_t, static_cast(0.0)); - - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - - auto out_size_t = ctx.Input("OutSize"); - if (out_size_t != nullptr) { - auto out_size_data = out_size_t->data(); - out_h = out_size_data[0]; - out_w = out_size_data[1]; - } - - int batch_size = d_input_t->dims()[0]; - int channels = d_input_t->dims()[1]; - int in_h = d_input_t->dims()[2]; - int in_w = d_input_t->dims()[3]; - - int in_hw = in_h * in_w; - int out_hw = out_h * out_w; - int in_chw = channels * in_hw; - int out_chw = channels * out_hw; - - float ratio_h = - (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; - float ratio_w = - (out_w > 1) ? 
static_cast(in_w - 1) / (out_w - 1) : 0.f; - - if (in_h == out_h && in_w == out_w) { - memcpy(d_input, d_output, d_input_t->numel() * sizeof(T)); - } else { - for (int k = 0; k < batch_size; ++k) { // loop for batches - for (int i = 0; i < out_h; ++i) { // loop for images - int h = ratio_h * i; - int hid = (h < in_h - 1) ? 1 : 0; - float h1lambda = ratio_h * i - h; - float h2lambda = 1 - h1lambda; - - for (int j = 0; j < out_w; ++j) { - int w = ratio_w * j; - int wid = (w < in_w - 1) ? 1 : 0; - float w1lambda = ratio_w * j - w; - float w2lambda = 1 - w1lambda; - T* in_pos = &d_input[k * in_chw + h * in_w + w]; - const T* out_pos = &d_output[k * out_chw + i * out_w + j]; - - for (int c = 0; c < channels; ++c) { // loop for channels - in_pos[0] += static_cast(h2lambda * w2lambda * out_pos[0]); - in_pos[wid] += static_cast(h2lambda * w1lambda * out_pos[0]); - in_pos[hid * in_w] += - static_cast(h1lambda * w2lambda * out_pos[0]); - in_pos[hid * in_w + wid] += - static_cast(h1lambda * w1lambda * out_pos[0]); - in_pos += in_hw; - out_pos += out_hw; - } - } - } - } - } - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/concat_op.cc b/paddle/fluid/operators/concat_op.cc index 57817da71adfd80faad29a48b05ba2f326de6c07..093b0a9a1f9ac05cf4d72fc748fac827387e5dbe 100644 --- a/paddle/fluid/operators/concat_op.cc +++ b/paddle/fluid/operators/concat_op.cc @@ -37,7 +37,7 @@ class ConcatOp : public framework::OperatorWithKernel { PADDLE_ENFORCE_GT(n, 0, "Input tensors count should > 0."); if (n == 1) { - VLOG(3) << "Warning: concat op have only one input, may waste memory"; + VLOG(30) << "Warning: concat op have only one input, may waste memory"; } auto out_dims = ins[0]; diff --git a/paddle/fluid/operators/concat_op.h b/paddle/fluid/operators/concat_op.h index b2c6495c442cd02679825425becc2160c303dcc6..bd474be0facb349c53a8766412311296383a86c5 100644 --- a/paddle/fluid/operators/concat_op.h +++ b/paddle/fluid/operators/concat_op.h @@ -17,7 +17,7 @@ limitations under the License. */ #include #include #include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/math/concat.h" +#include "paddle/fluid/operators/math/concat_and_split.h" #include "paddle/fluid/operators/strided_memcpy.h" namespace paddle { @@ -89,29 +89,17 @@ class ConcatGradKernel : public framework::OpKernel { outputs.push_back(nullptr); } } + auto& dev_ctx = ctx.template device_context(); // Sometimes direct copies will be faster, this maybe need deeply analysis. 
if (axis == 0 && outs.size() < 10) { - size_t input_offset = 0; - const auto in_stride = framework::stride_numel(out_grad->dims()); - - for (size_t i = 0; i < outs.size(); ++i) { - auto out_stride = framework::stride_numel(ins[i]->dims()); - auto* out = outputs[i]; - if (out != nullptr) { - StridedNumelCopyWithAxis( - ctx.device_context(), axis, out->data(), out_stride, - out_grad->data() + input_offset, in_stride, out_stride[axis]); - } - input_offset += out_stride[axis]; - } + std::vector ref_shape; + ref_shape.insert(ref_shape.begin(), ins.begin(), ins.end()); + StridedMemcpyWithAxis0(dev_ctx, *out_grad, ref_shape, &outputs); } else { - auto& dev_ctx = ctx.template device_context(); - paddle::operators::math::ConcatGradFunctor - concat_grad_functor; - concat_grad_functor(dev_ctx, *out_grad, - ctx.MultiInput("X"), - static_cast(axis), &outputs); + math::SplitFunctor split_functor; + split_functor(dev_ctx, *out_grad, ctx.MultiInput("X"), + static_cast(axis), &outputs); } } }; diff --git a/paddle/fluid/operators/controlflow/CMakeLists.txt b/paddle/fluid/operators/controlflow/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..b1c2ee22951a3881b4ce5b82f9ff7eb01fde6e9e --- /dev/null +++ b/paddle/fluid/operators/controlflow/CMakeLists.txt @@ -0,0 +1,4 @@ +include(operators) +register_operators() + +file(APPEND ${pybind_file} "USE_OP(less_than);\nUSE_OP(logical_and);\nUSE_NO_KERNEL_OP(read_from_array);\n") diff --git a/paddle/fluid/operators/compare_op.cc b/paddle/fluid/operators/controlflow/compare_op.cc similarity index 98% rename from paddle/fluid/operators/compare_op.cc rename to paddle/fluid/operators/controlflow/compare_op.cc index f40b1ba338d429c248103eeb930ac7e1bb690218..488ca7fe95f5119c59b861011993a379d08008ba 100644 --- a/paddle/fluid/operators/compare_op.cc +++ b/paddle/fluid/operators/controlflow/compare_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/compare_op.h" +#include "paddle/fluid/operators/controlflow/compare_op.h" #include #include "paddle/fluid/framework/op_registry.h" diff --git a/paddle/fluid/operators/compare_op.cu b/paddle/fluid/operators/controlflow/compare_op.cu similarity index 94% rename from paddle/fluid/operators/compare_op.cu rename to paddle/fluid/operators/controlflow/compare_op.cu index 1bf85c64fb5b4d79c62118959fd72b13ed1c63ed..b1f306358359764b919f9e570cf44f9733a7d178 100644 --- a/paddle/fluid/operators/compare_op.cu +++ b/paddle/fluid/operators/controlflow/compare_op.cu @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
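The rewritten gradient path above treats concat's backward as a split of Out@GRAD along the concat axis (math::SplitFunctor) instead of hand-rolled strided copies. A CPU sketch of the idea for the axis-0 fast path, with illustrative names:

```cpp
// Concat backward as a split: contiguous slices of the upstream gradient
// are copied back as the gradient of each concatenated input (axis 0).
#include <cstring>
#include <vector>

void SplitAxis0(const std::vector<float>& dout,
                const std::vector<size_t>& rows, size_t row_numel,
                std::vector<std::vector<float>>* dins) {
  size_t offset = 0;
  dins->clear();
  for (size_t r : rows) {
    std::vector<float> d(r * row_numel);
    std::memcpy(d.data(), dout.data() + offset, d.size() * sizeof(float));
    offset += d.size();
    dins->push_back(std::move(d));  // gradient slice for one input
  }
}
```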
*/ -#include "paddle/fluid/operators/compare_op.h" +#include "paddle/fluid/operators/controlflow/compare_op.h" REGISTER_COMPARE_KERNEL(less_than, CUDA, paddle::operators::LessThanFunctor); REGISTER_COMPARE_KERNEL(less_equal, CUDA, paddle::operators::LessEqualFunctor); diff --git a/paddle/fluid/operators/compare_op.h b/paddle/fluid/operators/controlflow/compare_op.h similarity index 97% rename from paddle/fluid/operators/compare_op.h rename to paddle/fluid/operators/controlflow/compare_op.h index 1cbabdaf6767815c1fedba0eabec9b5de678e047..b7529e4ae632d31524846d9d5aa4b1883f4509a1 100644 --- a/paddle/fluid/operators/compare_op.h +++ b/paddle/fluid/operators/controlflow/compare_op.h @@ -16,7 +16,7 @@ limitations under the License. */ #include #include #include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" #include "paddle/fluid/platform/transform.h" namespace paddle { diff --git a/paddle/fluid/operators/conditional_block_op.cc b/paddle/fluid/operators/controlflow/conditional_block_op.cc similarity index 100% rename from paddle/fluid/operators/conditional_block_op.cc rename to paddle/fluid/operators/controlflow/conditional_block_op.cc diff --git a/paddle/fluid/operators/feed_op.cc b/paddle/fluid/operators/controlflow/feed_op.cc similarity index 96% rename from paddle/fluid/operators/feed_op.cc rename to paddle/fluid/operators/controlflow/feed_op.cc index dc7ef664958238ddbd48745bd59cc7db28e49f5b..5da0a536d96e5184d51638bc6b374d2263b5e9eb 100644 --- a/paddle/fluid/operators/feed_op.cc +++ b/paddle/fluid/operators/controlflow/feed_op.cc @@ -47,8 +47,8 @@ class FeedOp : public framework::OperatorBase { auto col = Attr("col"); - VLOG(3) << "Feed Var " << feed_var_name << "'s " << col << " column to var " - << out_name; + VLOG(30) << "Feed Var " << feed_var_name << "'s " << col + << " column to var " << out_name; auto &feed_list = feed_var->Get(); auto &feed_item = feed_list.at(static_cast(col)); diff --git a/paddle/fluid/operators/fetch_op.cc b/paddle/fluid/operators/controlflow/fetch_op.cc similarity index 97% rename from paddle/fluid/operators/fetch_op.cc rename to paddle/fluid/operators/controlflow/fetch_op.cc index c197b45e8196a47def6465128e8ca39d8daefed6..c9e759ebff63948046e67def7fb94e0241029581 100644 --- a/paddle/fluid/operators/fetch_op.cc +++ b/paddle/fluid/operators/controlflow/fetch_op.cc @@ -57,7 +57,7 @@ class FetchOp : public framework::OperatorBase { TensorCopySync(src_item, platform::CPUPlace(), &dst_item); dst_item.set_lod(src_item.lod()); - VLOG(3) << "Fetch variable " << fetch_var_name << " to " << out_name; + VLOG(30) << "Fetch variable " << fetch_var_name << " to " << out_name; } }; diff --git a/paddle/fluid/operators/get_places_op.cc b/paddle/fluid/operators/controlflow/get_places_op.cc similarity index 100% rename from paddle/fluid/operators/get_places_op.cc rename to paddle/fluid/operators/controlflow/get_places_op.cc diff --git a/paddle/fluid/operators/logical_op.cc b/paddle/fluid/operators/controlflow/logical_op.cc similarity index 99% rename from paddle/fluid/operators/logical_op.cc rename to paddle/fluid/operators/controlflow/logical_op.cc index 26970db8d2af62bb06fce4eb1a1f21fd41617bd1..6446cab5ec5f889dccaef90484476e55c4852dee 100644 --- a/paddle/fluid/operators/logical_op.cc +++ b/paddle/fluid/operators/controlflow/logical_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/logical_op.h" +#include "paddle/fluid/operators/controlflow/logical_op.h" #include #include "paddle/fluid/framework/op_registry.h" diff --git a/paddle/fluid/operators/logical_op.cu b/paddle/fluid/operators/controlflow/logical_op.cu similarity index 94% rename from paddle/fluid/operators/logical_op.cu rename to paddle/fluid/operators/controlflow/logical_op.cu index 7ffe4dfc268b1ad3894dd54cb13c2f424818aa05..7ca54b488bfbb260c422941b82145f092a150be7 100644 --- a/paddle/fluid/operators/logical_op.cu +++ b/paddle/fluid/operators/controlflow/logical_op.cu @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/logical_op.h" +#include "paddle/fluid/operators/controlflow/logical_op.h" REGISTER_BINARY_LOGICAL_KERNEL(logical_and, CUDA, paddle::operators::LogicalAndFunctor); diff --git a/paddle/fluid/operators/logical_op.h b/paddle/fluid/operators/controlflow/logical_op.h similarity index 100% rename from paddle/fluid/operators/logical_op.h rename to paddle/fluid/operators/controlflow/logical_op.h diff --git a/paddle/fluid/operators/parallel_do_op.cc b/paddle/fluid/operators/controlflow/parallel_do_op.cc similarity index 98% rename from paddle/fluid/operators/parallel_do_op.cc rename to paddle/fluid/operators/controlflow/parallel_do_op.cc index ab25628d45699dbcfc1fc5792958bae9e42e72a3..c795d4bdd10c0ffbf30a4849fc773335036e34c2 100644 --- a/paddle/fluid/operators/parallel_do_op.cc +++ b/paddle/fluid/operators/controlflow/parallel_do_op.cc @@ -48,7 +48,7 @@ static void SplitTensorAndMoveTensorToScopes( auto lod_tensors = tensor.SplitLoDTensor(places); for (auto &lod : lod_tensors) { - VLOG(3) << lod.dims(); + VLOG(30) << lod.dims(); } if (num_sub_scopes == 0) { num_sub_scopes = lod_tensors.size(); @@ -263,7 +263,7 @@ class ParallelDoGradOp : public framework::OperatorBase { if (s == framework::kEmptyVarName) { continue; } - VLOG(3) << "Moving " << s; + VLOG(30) << "Moving " << s; CopyOrShare(*sub_scopes[0]->FindVar(s), place, scope.FindVar(s)); } WaitOnPlaces(places); @@ -277,7 +277,7 @@ class ParallelDoGradOp : public framework::OperatorBase { if (s == framework::kEmptyVarName) { continue; } - VLOG(3) << "Accumulating " << s; + VLOG(30) << "Accumulating " << s; if (s == framework::kEmptyVarName) continue; std::string tmp_name; auto *tmp = sub_scopes[0]->Var(&tmp_name); @@ -289,7 +289,7 @@ class ParallelDoGradOp : public framework::OperatorBase { auto sum_op = framework::OpRegistry::CreateOp( "sum", {{"X", {s, tmp_name}}}, {{"Out", {s}}}, framework::AttributeMap{{"use_mkldnn", {false}}}); - VLOG(10) << sum_op->DebugStringEx(sub_scopes[0]); + VLOG(100) << sum_op->DebugStringEx(sub_scopes[0]); sum_op->Run(*sub_scopes[0], places[0]); WaitOnPlace(places[0]); } @@ -316,7 +316,7 @@ class ParallelDoGradOpDescMaker : public framework::SingleGradOpDescMaker { auto *grad = new framework::OpDesc(); grad->SetType("parallel_do_grad"); for (auto &input_param : this->InputNames()) { - VLOG(3) << input_param; + VLOG(30) << input_param; grad->SetInput(input_param, this->Input(input_param)); if (input_param != kPlaces) { grad->SetOutput(framework::GradVarName(input_param), diff --git a/paddle/fluid/operators/tensor_array_read_write_op.cc b/paddle/fluid/operators/controlflow/tensor_array_read_write_op.cc similarity index 
94% rename from paddle/fluid/operators/tensor_array_read_write_op.cc rename to paddle/fluid/operators/controlflow/tensor_array_read_write_op.cc index a2d44284e9de1ace42cabbce82e0b45929432d7b..484160aeb8de573c6a6c1bb2ea5da23600d2d287 100644 --- a/paddle/fluid/operators/tensor_array_read_write_op.cc +++ b/paddle/fluid/operators/controlflow/tensor_array_read_write_op.cc @@ -34,8 +34,8 @@ class WriteToArrayOp : public ArrayOp { auto *out = scope.FindVar(Output("Out"))->GetMutable(); if (offset >= out->size()) { - VLOG(10) << "Resize " << Output("Out") << " from " << out->size() - << " to " << offset + 1; + VLOG(100) << "Resize " << Output("Out") << " from " << out->size() + << " to " << offset + 1; out->resize(offset + 1); } auto *out_tensor = &out->at(offset); @@ -47,9 +47,9 @@ class WriteToArrayOp : public ArrayOp { TensorCopy(x_tensor, place, dev_ctx, out_tensor); } else { - VLOG(10) << "WARNING: The input tensor 'x_tensor' holds no memory, so " - "nothing has been written to output array[" - << offset << "]."; + VLOG(100) << "WARNING: The input tensor 'x_tensor' holds no memory, so " + "nothing has been written to output array[" + << offset << "]."; } } }; @@ -104,7 +104,7 @@ class WriteToArrayInferVarType : public framework::VarTypeInference { framework::BlockDesc *block) const override { auto x_name = op_desc.Input("X")[0]; auto out_name = op_desc.Output("Out")[0]; - VLOG(10) << "Set Variable " << out_name << " as LOD_TENSOR_ARRAY"; + VLOG(100) << "Set Variable " << out_name << " as LOD_TENSOR_ARRAY"; auto &out = block->FindRecursiveOrCreateVar(out_name); out.SetType(framework::proto::VarType::LOD_TENSOR_ARRAY); auto *x = block->FindVarRecursive(x_name); @@ -139,7 +139,7 @@ class ReadFromArrayOp : public ArrayOp { framework::TensorCopy(x_array[offset], place, dev_ctx, out_tensor); out_tensor->set_lod(x_array[offset].lod()); } else { - VLOG(10) << "offset " << offset << " >= " << x_array.size(); + VLOG(100) << "offset " << offset << " >= " << x_array.size(); } } }; diff --git a/paddle/fluid/operators/while_op.cc b/paddle/fluid/operators/controlflow/while_op.cc similarity index 95% rename from paddle/fluid/operators/while_op.cc rename to paddle/fluid/operators/controlflow/while_op.cc index 3c8a01b6e47459760b05b5ca7fa4fa5e1d37d112..2b56514fe086dd411fcf842e7e7acba4edf98990 100644 --- a/paddle/fluid/operators/while_op.cc +++ b/paddle/fluid/operators/controlflow/while_op.cc @@ -92,7 +92,10 @@ class WhileOpMaker : public framework::OpProtoAndCheckerMaker { "variables generated in the i'th step."); AddAttr(kStepBlock, "The step block inside WhileOp"); - AddAttr("is_test", "True if in test phase.").SetDefault(false); + AddAttr("is_test", + "(bool, default false) Set to true for inference only, false " + "for training. 
Some layers may run faster when this is true.") + .SetDefault(false); AddComment(R"DOC( )DOC"); } @@ -129,15 +132,15 @@ class WhileGradOp : public framework::OperatorBase { for (auto cur_scope_iter = step_scopes->rbegin(); cur_scope_iter != step_scopes->rend(); ++cur_scope_iter) { - VLOG(3) << "Start backward at time_step " - << cur_scope_iter - step_scopes->rbegin(); + VLOG(30) << "Start backward at time_step " + << cur_scope_iter - step_scopes->rbegin(); framework::Scope &cur_scope = **cur_scope_iter; // Link OG from outside to inside for (size_t i = 0; i < outside_og_names.size(); ++i) { auto outside_og_name = outside_og_names[i]; auto inside_og_name = inside_og_names[i]; - VLOG(8) << "Linking outside " << outside_og_name << " --> inside " - << inside_og_name; + VLOG(80) << "Linking outside " << outside_og_name << " --> inside " + << inside_og_name; if (scope.FindVar(outside_og_name) == nullptr) { continue; } @@ -159,11 +162,11 @@ class WhileGradOp : public framework::OperatorBase { auto &outside_array = og_outside.Get(); auto &inside_array = detail::Ref(og_inside.GetMutable()); - VLOG(8) << outside_og_name << " size = " << outside_array.size(); + VLOG(80) << outside_og_name << " size = " << outside_array.size(); inside_array.resize(outside_array.size()); for (size_t j = 0; j < inside_array.size(); ++j) { - VLOG(8) << j << " " << outside_array[j].numel(); + VLOG(80) << j << " " << outside_array[j].numel(); if (outside_array[j].numel() != 0) { inside_array[j].set_lod(outside_array[j].lod()); inside_array[j].ShareDataWith(outside_array[j]); @@ -289,7 +292,7 @@ class WhileGradOpDescMaker : public framework::SingleGradOpDescMaker { auto igs = InputGrad(kX, /*do not drop empty gradient*/ false); for (auto &each_ig : igs) { if (inner_op_outputs.find(each_ig) == inner_op_outputs.end()) { - VLOG(8) << "Ignore " << each_ig; + VLOG(80) << "Ignore " << each_ig; each_ig = framework::kEmptyVarName; } } @@ -353,8 +356,8 @@ class WhileGradOpVarTypeInference : public framework::VarTypeInference { auto &p_var = detail::Ref(block->FindVarRecursive(p_names[i])); auto *g_var = block->FindVarRecursive(pg_ig_names[i]); if (g_var != nullptr) { // Gradient could be @EMPTY@ - VLOG(5) << "Setting " << pg_ig_names[i] << " following " << p_names[i] - << " type: " << p_var.GetType(); + VLOG(50) << "Setting " << pg_ig_names[i] << " following " << p_names[i] + << " type: " << p_var.GetType(); g_var->SetType(p_var.GetType()); g_var->SetDataType(p_var.GetDataType()); } diff --git a/paddle/fluid/operators/conv_cudnn_op.cu.cc b/paddle/fluid/operators/conv_cudnn_op.cu.cc index 4a7a6bcf7154d5680de751e3c933be46fb09fd74..42c2b3a24c116f92f4dd6ad0966dcb963ec702d6 100644 --- a/paddle/fluid/operators/conv_cudnn_op.cu.cc +++ b/paddle/fluid/operators/conv_cudnn_op.cu.cc @@ -15,15 +15,22 @@ limitations under the License. */ #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/memory/memory.h" +#include "paddle/fluid/operators/conv_cudnn_op_cache.h" #include "paddle/fluid/operators/conv_op.h" #include "paddle/fluid/platform/assert.h" #include "paddle/fluid/platform/cudnn_helper.h" #include "paddle/fluid/platform/float16.h" +#include "paddle/fluid/platform/profiler.h" DEFINE_bool(cudnn_deterministic, false, "Whether allow using an autotuning algorithm for convolution " "operator. The autotuning algorithm may be non-deterministic. 
If " "true, the algorithm is deterministic."); +DEFINE_uint64(conv_workspace_size_limit, 4096, + "cuDNN convolution workspace limit in MB unit."); +DEFINE_bool(cudnn_exhaustive_search, false, + "Whether enable exhaustive search for cuDNN convolution or " + "not, defalut is False."); namespace paddle { namespace operators { @@ -36,13 +43,11 @@ using DataLayout = platform::DataLayout; template using ScalingParamType = typename platform::CudnnDataType::ScalingParamType; -static constexpr size_t kCONV_CUDNN_WORKSPACE_LIMIT_BYTES = - static_cast(1024) * 1024 * 1024; - template class CUDNNConvOpKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { + auto& dev_ctx = ctx.template device_context(); PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), "It must use CUDAPlace."); auto* input = ctx.Input("Input"); @@ -55,6 +60,8 @@ class CUDNNConvOpKernel : public framework::OpKernel { int groups = ctx.Attr("groups"); int64_t user_workspace_size = static_cast(ctx.Attr("workspace_size_MB")); + bool exhaustive_search = + FLAGS_cudnn_exhaustive_search || ctx.Attr("exhaustive_search"); const T* input_data = input->data(); const T* filter_data = filter->data(); @@ -120,19 +127,19 @@ class CUDNNConvOpKernel : public framework::OpKernel { // ------------------- cudnn conv workspace --------------------- size_t workspace_size_in_bytes; // final workspace to allocate. size_t workspace_size_limit = kCONV_CUDNN_WORKSPACE_LIMIT_BYTES; - if (user_workspace_size > 0) { - workspace_size_limit = user_workspace_size * 1024 * 1024; + if (FLAGS_conv_workspace_size_limit > 0 || user_workspace_size > 0) { + int64_t max_user_size = + std::max(static_cast(FLAGS_conv_workspace_size_limit), + user_workspace_size); + workspace_size_limit = max_user_size * 1024 * 1024; } + // ------------------- cudnn conv algorithm --------------------- cudnnConvolutionFwdAlgo_t algo; - auto& dev_ctx = ctx.template device_context(); auto handle = dev_ctx.cudnn_handle(); + auto workspace_handle = dev_ctx.cudnn_workspace_handle(); - CUDNN_ENFORCE(platform::dynload::cudnnGetConvolutionForwardAlgorithm( - handle, cudnn_input_desc, cudnn_filter_desc, cudnn_conv_desc, - cudnn_output_desc, CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT, - workspace_size_limit, &algo)); - + bool half_float = false; #if CUDA_VERSION >= 9000 && CUDNN_VERSION_MIN(7, 0, 1) // Tensor core is supported since the volta GPU and // is only enabled when input and filter data are float16 @@ -143,12 +150,66 @@ class CUDNNConvOpKernel : public framework::OpKernel { cudnn_conv_desc, CUDNN_TENSOR_OP_MATH)); // Currently tensor core is only enabled using this algo algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM; + half_float = true; + VLOG(50) << "use cudnn_tensor_op_math"; } else { CUDNN_ENFORCE(platform::dynload::cudnnSetConvolutionMathType( cudnn_conv_desc, CUDNN_DEFAULT_MATH)); + VLOG(50) << "NOT use cudnn_tensor_op_math"; } #endif + auto x_dims = framework::vectorize(input->dims()); + auto f_dims = framework::vectorize(filter->dims()); + if ((!exhaustive_search) && (!half_float)) { + CUDNN_ENFORCE(platform::dynload::cudnnGetConvolutionForwardAlgorithm( + handle, cudnn_input_desc, cudnn_filter_desc, cudnn_conv_desc, + cudnn_output_desc, CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT, + workspace_size_limit, &algo)); + VLOG(3) << "cuDNN forward algo " << algo; + } else if (exhaustive_search && (!half_float)) { + AlgorithmsCache* algo_cache = nullptr; + if (ctx.scope().FindVar(kCUDNNFwdAlgoCache)) { + algo_cache = 
+ ctx.scope() + .FindVar(kCUDNNFwdAlgoCache) + ->GetMutable>(); + } else { + algo_cache = + const_cast(ctx.scope()) + .Var(kCUDNNFwdAlgoCache) + ->GetMutable>(); + } + algo = algo_cache->GetAlgorithm( + x_dims, f_dims, strides, paddings, dilations, 0, [&]() { + int returned_algo_count; + std::array + fwd_perf_stat; + auto cudnn_find_func = [&](void* cudnn_workspace) { + CUDNN_ENFORCE( + platform::dynload::cudnnFindConvolutionForwardAlgorithmEx( + handle, cudnn_input_desc, input_data, cudnn_filter_desc, + filter_data, cudnn_conv_desc, cudnn_output_desc, + output_data, kNUM_CUDNN_FWD_ALGS, &returned_algo_count, + fwd_perf_stat.data(), cudnn_workspace, + workspace_size_limit)); + }; + workspace_handle.RunFunc(cudnn_find_func, workspace_size_limit); + + VLOG(3) << "Perf result: (algo: stat, time, memory)"; + for (int i = 0; i < returned_algo_count; ++i) { + const auto& stat = fwd_perf_stat[i]; + VLOG(3) << stat.algo << ": " << stat.status << " " << stat.time + << " " << stat.memory; + } + return fwd_perf_stat[0].algo; + }); + VLOG(3) << "choose algo " << algo; + } else { + PADDLE_ENFORCE(half_float, + "cuDNN exhaustive search doesn't support half float."); + } + // get workspace size able to allocate CUDNN_ENFORCE(platform::dynload::cudnnGetConvolutionForwardWorkspaceSize( handle, cudnn_input_desc, cudnn_filter_desc, cudnn_conv_desc, @@ -168,7 +229,7 @@ class CUDNNConvOpKernel : public framework::OpKernel { cudnn_conv_desc, algo, cudnn_workspace, workspace_size_in_bytes, &beta, cudnn_output_desc, output_data + i * group_offset_out)); }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); } } }; @@ -177,6 +238,7 @@ template class CUDNNConvGradOpKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { + auto& dev_ctx = ctx.template device_context(); PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), "It must use CUDAPlace."); auto input = ctx.Input("Input"); @@ -195,6 +257,13 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { int groups = ctx.Attr("groups"); int64_t user_workspace_size = static_cast(ctx.Attr("workspace_size_MB")); + bool exhaustive_search = + FLAGS_cudnn_exhaustive_search || ctx.Attr("exhaustive_search"); + if (exhaustive_search && FLAGS_cudnn_deterministic) { + PADDLE_THROW( + "Cann't set exhaustive_search True and " + "FLAGS_cudnn_deterministic True at same time."); + } // ------------------- cudnn descriptors --------------------- ScopedTensorDescriptor input_desc; @@ -262,14 +331,66 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { cudnnConvolutionBwdFilterAlgo_t filter_algo; size_t workspace_size_in_bytes = 0, tmp_size = 0; size_t workspace_size_limit = kCONV_CUDNN_WORKSPACE_LIMIT_BYTES; - if (user_workspace_size > 0) { - workspace_size_limit = user_workspace_size * 1024 * 1024; + if (FLAGS_conv_workspace_size_limit > 0 || user_workspace_size > 0) { + int64_t max_user_size = + std::max(static_cast(FLAGS_conv_workspace_size_limit), + user_workspace_size); + workspace_size_limit = max_user_size * 1024 * 1024; } - auto& dev_ctx = ctx.template device_context(); + auto x_dims = framework::vectorize(input->dims()); + auto f_dims = framework::vectorize(filter->dims()); auto handle = dev_ctx.cudnn_handle(); + auto workspace_handle = dev_ctx.cudnn_workspace_handle(); if (input_grad) { - if (!FLAGS_cudnn_deterministic) { + T* input_grad_data = input_grad->mutable_data(ctx.GetPlace()); + if (exhaustive_search) { + 
AlgorithmsCache* data_algo_cache; + if (ctx.scope().FindVar(kCUDNNBwdDataAlgoCache)) { + data_algo_cache = + ctx.scope() + .FindVar(kCUDNNBwdDataAlgoCache) + ->GetMutable< + AlgorithmsCache>(); + } else { + data_algo_cache = + const_cast(ctx.scope()) + .Var(kCUDNNBwdDataAlgoCache) + ->GetMutable< + AlgorithmsCache>(); + } + data_algo = data_algo_cache->GetAlgorithm( + x_dims, f_dims, strides, paddings, dilations, 0, [&]() { + int returned_algo_count; + std::array + data_perf_stat; + auto cudnn_find_bd_data_func = [&](void* cudnn_workspace) { + CUDNN_ENFORCE( + platform::dynload:: + cudnnFindConvolutionBackwardDataAlgorithmEx( + handle, cudnn_filter_desc, filter_data, + cudnn_output_grad_desc, output_grad_data, + cudnn_conv_desc, cudnn_input_desc, input_grad_data, + kNUM_CUDNN_BWD_DATA_ALGS, &returned_algo_count, + data_perf_stat.data(), cudnn_workspace, + workspace_size_limit)); + }; + workspace_handle.RunFunc(cudnn_find_bd_data_func, + workspace_size_limit); + + VLOG(3) << "Perf result: (algo: stat, time, memory)"; + for (int i = 0; i < returned_algo_count; ++i) { + const auto& stat = data_perf_stat[i]; + VLOG(3) << stat.algo << ": " << stat.status << " " << stat.time + << " " << stat.memory; + } + return data_perf_stat[0].algo; + }); + VLOG(3) << "cuDNN backward data algo " << data_algo; + } else if (FLAGS_cudnn_deterministic) { + data_algo = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1; + } else { CUDNN_ENFORCE( platform::dynload::cudnnGetConvolutionBackwardDataAlgorithm( handle, cudnn_filter_desc, @@ -282,10 +403,7 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { cudnn_input_desc, CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT, workspace_size_limit, &data_algo)); - } else { - data_algo = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1; } - CUDNN_ENFORCE( platform::dynload::cudnnGetConvolutionBackwardDataWorkspaceSize( handle, cudnn_filter_desc, cudnn_output_grad_desc, @@ -294,17 +412,54 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { } if (filter_grad) { - if (!FLAGS_cudnn_deterministic) { + T* filter_grad_data = filter_grad->mutable_data(ctx.GetPlace()); + if (exhaustive_search) { + AlgorithmsCache* f_algo_cache; + if (ctx.scope().FindVar(kCUDNNBwdFilterAlgoCache)) { + f_algo_cache = + ctx.scope() + .FindVar(kCUDNNBwdFilterAlgoCache) + ->GetMutable< + AlgorithmsCache>(); + } else { + f_algo_cache = + const_cast(ctx.scope()) + .Var(kCUDNNBwdFilterAlgoCache) + ->GetMutable< + AlgorithmsCache>(); + } + filter_algo = f_algo_cache->GetAlgorithm( + x_dims, f_dims, strides, paddings, dilations, 0, [&]() { + int returned_algo_count; + std::array + filter_perf_stat; + auto cudnn_find_bd_f_func = [&](void* cudnn_workspace) { + CUDNN_ENFORCE( + platform::dynload:: + cudnnFindConvolutionBackwardFilterAlgorithmEx( + handle, cudnn_input_desc, input_data, + cudnn_output_grad_desc, output_grad_data, + cudnn_conv_desc, cudnn_filter_desc, + filter_grad_data, kNUM_CUDNN_BWD_FILTER_ALGS, + &returned_algo_count, filter_perf_stat.data(), + cudnn_workspace, workspace_size_limit)); + }; + workspace_handle.RunFunc(cudnn_find_bd_f_func, + workspace_size_limit); + return filter_perf_stat[0].algo; + }); + VLOG(3) << "cuDNN backward filter algo " << filter_algo; + } else if (FLAGS_cudnn_deterministic) { + filter_algo = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1; + } else { CUDNN_ENFORCE( platform::dynload::cudnnGetConvolutionBackwardFilterAlgorithm( handle, cudnn_input_desc, cudnn_output_grad_desc, cudnn_conv_desc, cudnn_filter_desc, CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT, workspace_size_limit, 
&filter_algo)); - } else { - filter_algo = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1; } - CUDNN_ENFORCE( platform::dynload::cudnnGetConvolutionBackwardFilterWorkspaceSize( handle, cudnn_input_desc, cudnn_output_grad_desc, cudnn_conv_desc, @@ -327,7 +482,7 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { data_algo, cudnn_workspace, workspace_size_in_bytes, &beta, cudnn_input_desc, input_grad_data + i * group_offset_in)); }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); } } // ------------------- cudnn conv backward filter --------------------- @@ -343,7 +498,7 @@ class CUDNNConvGradOpKernel : public framework::OpKernel { filter_algo, cudnn_workspace, workspace_size_in_bytes, &beta, cudnn_filter_desc, filter_grad_data + i * group_offset_filter)); }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); } } } @@ -359,7 +514,8 @@ REGISTER_OP_KERNEL(conv2d, CUDNN, plat::CUDAPlace, paddle::operators::CUDNNConvOpKernel); REGISTER_OP_KERNEL(conv2d_grad, CUDNN, plat::CUDAPlace, paddle::operators::CUDNNConvGradOpKernel, - paddle::operators::CUDNNConvGradOpKernel); + paddle::operators::CUDNNConvGradOpKernel, + paddle::operators::CUDNNConvGradOpKernel); REGISTER_OP_KERNEL(conv3d, CUDNN, plat::CUDAPlace, paddle::operators::CUDNNConvOpKernel, diff --git a/paddle/fluid/operators/conv_cudnn_op_cache.h b/paddle/fluid/operators/conv_cudnn_op_cache.h new file mode 100644 index 0000000000000000000000000000000000000000..92d394eb3c5aeb84605179cb2b5106f56a13f88e --- /dev/null +++ b/paddle/fluid/operators/conv_cudnn_op_cache.h @@ -0,0 +1,111 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include "paddle/fluid/platform/cudnn_helper.h" + +namespace paddle { +namespace operators { + +static constexpr char kCUDNNFwdAlgoCache[] = "kCUDNNFwdAlgoCache"; +static constexpr char kCUDNNBwdDataAlgoCache[] = "kCUDNNBwdDataAlgoCache"; +static constexpr char kCUDNNBwdFilterAlgoCache[] = "kCUDNNBwdFilterAlgoCache"; + +static constexpr size_t kCONV_CUDNN_WORKSPACE_LIMIT_BYTES = + static_cast(1024) * 1024 * 1024; + +#if CUDNN_VERSION_MIN(6, 0, 5) +static constexpr size_t kNUM_CUDNN_FWD_ALGS = CUDNN_CONVOLUTION_FWD_ALGO_COUNT; +static constexpr size_t kNUM_CUDNN_BWD_FILTER_ALGS = + CUDNN_CONVOLUTION_BWD_FILTER_ALGO_COUNT; +static constexpr size_t kNUM_CUDNN_BWD_DATA_ALGS = + CUDNN_CONVOLUTION_BWD_DATA_ALGO_COUNT; +#else +// cuDNN v5 has no CUDNN_CONVOLUTION_FWD_ALGO_COUNT etc. +static constexpr size_t kNUM_CUDNN_FWD_ALGS = 7; +static constexpr size_t kNUM_CUDNN_BWD_FILTER_ALGS = 4; +static constexpr size_t kNUM_CUDNN_BWD_DATA_ALGS = 5; +#endif + +template +class AlgorithmsCache { + public: + // Caches the best algorithm for a given + // combination of tensor dimensions & compute data type. 
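+ // A scope-level cache shared across iterations: on a miss, the supplied + // gen_func runs once to produce the algorithm, which is then stored under + // a combined hash of the convolution configuration (see GetAlgorithm below).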
+ TAlgorithm GetAlgorithm( + const std::vector& dims1, const std::vector& dims2, + const std::vector& strides, const std::vector& paddings, + const std::vector& dilations, + int algorithmFlags, // can be set per data type + std::function gen_func); + + private: + std::unordered_map hash_; + std::mutex mutex_; +}; + +template +TAlgorithm AlgorithmsCache::GetAlgorithm( + const std::vector& dims1, const std::vector& dims2, + const std::vector& strides, const std::vector& paddings, + const std::vector& dilations, int algorithmFlags, + std::function gen_func) { + std::lock_guard lock(mutex_); + int64_t seed = 0; + // Hash all of the inputs and use the result to look up a previously + // discovered algorithm, or fall back to generating a new one. + std::hash hashFn; + // combine hashes in the style of boost::hash_combine + // https://stackoverflow.com/questions/2590677/how-do-i-combine-hash-values-in-c0x + for (const auto num : dims1) { + seed ^= hashFn(num) + 0x9e3779b9 + (seed << 6) + (seed >> 2); + } + + for (const auto num : dims2) { + seed ^= hashFn(num) + 0x9e3779b9 + (seed << 6) + (seed >> 2) + 1; + } + + for (const auto num : strides) { + seed ^= hashFn(static_cast(num)) + 0x9e3779b9 + (seed << 6) + + (seed >> 2) + 2; + } + + for (const auto num : paddings) { + seed ^= hashFn(static_cast(num)) + 0x9e3779b9 + (seed << 6) + + (seed >> 2) + 3; + } + + for (const auto num : dilations) { + seed ^= hashFn(static_cast(num)) + 0x9e3779b9 + (seed << 6) + + (seed >> 2) + 4; + } + + seed ^= hashFn(static_cast(algorithmFlags)) + 0x9e3779b9 + + (seed << 6) + (seed >> 2) + 5; + + if (seed == 0) return gen_func(); + + if (hash_.find(seed) == hash_.end()) { + TAlgorithm value = gen_func(); + hash_[seed] = value; + } + return hash_[seed]; +} + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/conv_fusion_op.cc b/paddle/fluid/operators/conv_fusion_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..9bdedb10e0b1bc2d45c084bbc070875117675b75 --- /dev/null +++ b/paddle/fluid/operators/conv_fusion_op.cc @@ -0,0 +1,48 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "paddle/fluid/operators/conv_op.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/cudnn_helper.h" +#endif + +namespace paddle { +namespace operators { + +// This fused conv follows the equation: +// y = act ( alpha1 * conv(x) + alpha2 * z + bias ).
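+// i.e. a single cuDNN call fuses the convolution with a scaled residual add, +// a bias add, and the activation;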
+// here, y is Output, +// x is Input, +// z is ResidualData, +// bias is Bias +class Conv2DFusionOpMaker : public Conv2DOpMaker { + protected: + void Apply() override { + AddAttr( + "activation", + "The activation type can be 'identity', 'sigmoid', 'relu', 'relu6', " + "'relux', 'tanh', 'band_pass'") + .SetDefault("relu"); + } +}; +// TODO(qingqing): add gradient operator for conv2d_fusion + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(conv2d_fusion, ops::ConvOp, ops::Conv2DFusionOpMaker, + ops::ConvOpInferVarType, paddle::framework::EmptyGradOpMaker); diff --git a/paddle/fluid/operators/conv_fusion_op.cu.cc b/paddle/fluid/operators/conv_fusion_op.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..2c09ee7394ad605f7a324d021ce0468a79bb71ca --- /dev/null +++ b/paddle/fluid/operators/conv_fusion_op.cu.cc @@ -0,0 +1,191 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/conv_cudnn_op_cache.h" +#include "paddle/fluid/platform/cudnn_helper.h" + +DECLARE_uint64(conv_workspace_size_limit); +DECLARE_bool(cudnn_exhaustive_search); + +namespace paddle { +namespace operators { + +#if CUDNN_VERSION >= 7001 +using Tensor = framework::Tensor; +using ScopedTensorDescriptor = platform::ScopedTensorDescriptor; +using ScopedFilterDescriptor = platform::ScopedFilterDescriptor; +using ScopedConvolutionDescriptor = platform::ScopedConvolutionDescriptor; +using ScopedActivationDescriptor = platform::ScopedActivationDescriptor; +using DataLayout = platform::DataLayout; +template +using ScalingParamType = typename platform::CudnnDataType::ScalingParamType; + +template +class CUDNNConvFusionOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto& dev_ctx = ctx.template device_context(); + auto* input = ctx.Input("Input"); + auto* filter = ctx.Input("Filter"); + auto* bias = ctx.Input("Bias"); + PADDLE_ENFORCE(bias, "The bias should not be null."); + auto* residual = ctx.Input("ResidualData"); + auto* output = ctx.Output("Output"); + + std::vector strides = ctx.Attr>("strides"); + std::vector paddings = ctx.Attr>("paddings"); + std::vector dilations = ctx.Attr>("dilations"); + const std::string activation = ctx.Attr("activation"); + int groups = ctx.Attr("groups"); + int64_t user_workspace_size = + static_cast(ctx.Attr("workspace_size_MB")); + bool exhaustive_search = + FLAGS_cudnn_exhaustive_search || ctx.Attr("exhaustive_search"); + + const T* input_data = input->data(); + const T* filter_data = filter->data(); + const T* bias_data = bias->data(); + T* output_data = output->mutable_data(ctx.GetPlace()); + const T* residual_data = residual ?
residual->data() : output_data; + + // ------------------- cudnn descriptors --------------------- + ScopedTensorDescriptor input_desc; + ScopedTensorDescriptor output_desc; + ScopedFilterDescriptor filter_desc; + ScopedTensorDescriptor bias_desc; + ScopedConvolutionDescriptor conv_desc; + ScopedActivationDescriptor act_desc; + DataLayout layout = DataLayout::kNCHW; + if (input->dims().size() == 5) { + layout = DataLayout::kNCDHW; + } + + cudnnConvolutionDescriptor_t cudnn_conv_desc = + conv_desc.descriptor(paddings, strides, dilations); + CUDNN_ENFORCE(platform::dynload::cudnnSetConvolutionGroupCount( + cudnn_conv_desc, groups)); + + cudnnTensorDescriptor_t cudnn_input_desc = input_desc.descriptor( + layout, framework::vectorize2int(input->dims())); + cudnnTensorDescriptor_t cudnn_output_desc = output_desc.descriptor( + layout, framework::vectorize2int(output->dims())); + cudnnFilterDescriptor_t cudnn_filter_desc = filter_desc.descriptor( + layout, framework::vectorize2int(filter->dims())); + // Now only support NCHW + std::vector bias_dim = {1, static_cast(output->dims()[1]), 1, 1}; + cudnnTensorDescriptor_t cudnn_bias_desc = + bias_desc.descriptor(layout, bias_dim); + cudnnActivationDescriptor_t cudnn_act_desc = + act_desc.descriptor(activation); + + // ------------------- cudnn conv workspace --------------------- + size_t workspace_size_in_bytes; // final workspace to allocate. + size_t workspace_size_limit = kCONV_CUDNN_WORKSPACE_LIMIT_BYTES; + if (FLAGS_conv_workspace_size_limit > 0 || user_workspace_size > 0) { + int64_t max_user_size = + std::max(static_cast(FLAGS_conv_workspace_size_limit), + user_workspace_size); + workspace_size_limit = max_user_size * 1024 * 1024; + } + + // ------------------- cudnn conv algorithm --------------------- + cudnnConvolutionFwdAlgo_t algo; + auto handle = dev_ctx.cudnn_handle(); + auto workspace_handle = dev_ctx.cudnn_workspace_handle(); + + CUDNN_ENFORCE(platform::dynload::cudnnSetConvolutionMathType( + cudnn_conv_desc, CUDNN_DEFAULT_MATH)); + + auto x_dims = framework::vectorize(input->dims()); + auto f_dims = framework::vectorize(filter->dims()); + if (activation == "identity") { + // Only the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algo is + // enabled with CUDNN_ACTIVATION_IDENTITY in cuDNN lib. 
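+ // Hence, for identity activation the algorithm-selection branches below + // are skipped and this algo is used directly.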
+ algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM; + } else if (!exhaustive_search) { + CUDNN_ENFORCE(platform::dynload::cudnnGetConvolutionForwardAlgorithm( + handle, cudnn_input_desc, cudnn_filter_desc, cudnn_conv_desc, + cudnn_output_desc, CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT, + workspace_size_limit, &algo)); + VLOG(3) << "cuDNN forward algo " << algo; + } else { + AlgorithmsCache* algo_cache = nullptr; + if (ctx.scope().FindVar(kCUDNNFwdAlgoCache)) { + algo_cache = + ctx.scope() + .FindVar(kCUDNNFwdAlgoCache) + ->GetMutable>(); + } else { + algo_cache = + const_cast(ctx.scope()) + .Var(kCUDNNFwdAlgoCache) + ->GetMutable>(); + } + algo = algo_cache->GetAlgorithm( + x_dims, f_dims, strides, paddings, dilations, 0, [&]() { + int returned_algo_count; + std::array + fwd_perf_stat; + auto cudnn_find_func = [&](void* cudnn_workspace) { + CUDNN_ENFORCE( + platform::dynload::cudnnFindConvolutionForwardAlgorithmEx( + handle, cudnn_input_desc, input_data, cudnn_filter_desc, + filter_data, cudnn_conv_desc, cudnn_output_desc, + output_data, kNUM_CUDNN_FWD_ALGS, &returned_algo_count, + fwd_perf_stat.data(), cudnn_workspace, + workspace_size_limit)); + }; + workspace_handle.RunFunc(cudnn_find_func, workspace_size_limit); + VLOG(3) << "Perf result: (algo: stat, time, memory)"; + for (int i = 0; i < returned_algo_count; ++i) { + const auto& stat = fwd_perf_stat[i]; + VLOG(3) << stat.algo << ": " << stat.status << " " << stat.time + << " " << stat.memory; + } + return fwd_perf_stat[0].algo; + }); + VLOG(3) << "choose algo " << algo; + } + + CUDNN_ENFORCE(platform::dynload::cudnnGetConvolutionForwardWorkspaceSize( + handle, cudnn_input_desc, cudnn_filter_desc, cudnn_conv_desc, + cudnn_output_desc, algo, &workspace_size_in_bytes)); + PADDLE_ENFORCE_LE(workspace_size_in_bytes, workspace_size_limit, + "workspace_size to be allocated exceeds the limit"); + + // ------------------- cudnn conv+bias+act forward -------------------- + ScalingParamType alpha1 = 1.0f; + ScalingParamType alpha2 = residual ? 1.0f : 0.0f; + auto cudnn_func = [&](void* cudnn_workspace) { + CUDNN_ENFORCE(platform::dynload::cudnnConvolutionBiasActivationForward( + handle, &alpha1, cudnn_input_desc, input_data, cudnn_filter_desc, + filter_data, cudnn_conv_desc, algo, cudnn_workspace, + workspace_size_in_bytes, &alpha2, cudnn_output_desc, residual_data, + cudnn_bias_desc, bias_data, cudnn_act_desc, cudnn_output_desc, + output_data)); + }; + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); + } +}; +#endif + +} // namespace operators +} // namespace paddle + +#if CUDNN_VERSION >= 7001 +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(conv2d_fusion, ops::CUDNNConvFusionOpKernel, + ops::CUDNNConvFusionOpKernel); +#endif diff --git a/paddle/fluid/operators/conv_mkldnn_op.cc b/paddle/fluid/operators/conv_mkldnn_op.cc index eae65968285703f5882d910e29bc5d8e1511cba6..9e2e2cf818000d9181447a0aa6b4ac4878781f35 100644 --- a/paddle/fluid/operators/conv_mkldnn_op.cc +++ b/paddle/fluid/operators/conv_mkldnn_op.cc @@ -12,6 +12,8 @@ See the License for the specific language governing permissions and limitations under the License. 
*/ +#include "paddle/fluid/framework/data_layout_transform.h" +#include "paddle/fluid/memory/malloc.h" #include "paddle/fluid/operators/conv_op.h" #include "paddle/fluid/platform/mkldnn_helper.h" @@ -57,6 +59,11 @@ class ConvMKLDNNHandler : public platform::MKLDNNHandler { return conv_pd_->dst_primitive_desc().get_size(); } + mkldnn::memory::format GetDstFormat() const { + return static_cast( + conv_pd_->dst_primitive_desc().desc().data.format); + } + size_t GetDiffWeightsMemorySize() const { return conv_bwd_weights_pd_->diff_weights_primitive_desc().get_size(); } @@ -108,6 +115,20 @@ class ConvMKLDNNHandler : public platform::MKLDNNHandler { "@data-weights_mem_p", pipeline); } + std::shared_ptr AcquireResidualDataMemory( + const mkldnn::memory::desc& md, void* ptr) { + return this->AcquireMemory(md, ptr, "@user_residual_data_mem_p"); + } + + std::shared_ptr AcquireDstMemoryFromResidualDataMemory( + const std::shared_ptr& user_residual_memory_p, + void* dst_ptr, + std::vector& pipeline) { // NOLINT + return this->AcquireMemory(user_residual_memory_p, + this->AcquireDstMemoryFromPrimitive(dst_ptr), + "@residual_data_mem_p", pipeline); + } + std::shared_ptr AcquireDiffSrcMemoryFromDataPrimitive( void* ptr) { return this->AcquireMemoryFromPrimitive( @@ -300,10 +321,10 @@ class ConvMKLDNNOpKernel : public paddle::framework::OpKernel { std::vector paddings = ctx.Attr>("paddings"); std::vector dilations = ctx.Attr>("dilations"); bool fuse_relu = ctx.Attr("fuse_relu"); - bool fuse_eltwise = ctx.Attr("fuse_eltwise"); + bool fuse_residual_conn = ctx.Attr("fuse_residual_connection"); int groups = ctx.Attr("groups"); - // TODO: add support for dilation + // TODO(tpatejko): add support for dilation PADDLE_ENFORCE( dilations.size() == 2 && dilations[0] == 1 && dilations[1] == 1, "dilation in convolution is not implemented yet"); @@ -354,8 +375,7 @@ class ConvMKLDNNOpKernel : public paddle::framework::OpKernel { auto src_md = platform::MKLDNNMemDesc( src_tz, platform::MKLDNNGetDataType(), chosen_memory_format); auto weights_md = platform::MKLDNNMemDesc( - weights_tz, platform::MKLDNNGetDataType(), - (g == 1) ? chosen_memory_format : mkldnn::memory::format::goihw); + weights_tz, platform::MKLDNNGetDataType(), chosen_memory_format); std::vector bias_tz; // TODO(mgallus): avoid empty vector creation. // Currently used whenever bias is != nullptr. auto dst_md = platform::MKLDNNMemDesc( @@ -363,20 +383,22 @@ class ConvMKLDNNOpKernel : public paddle::framework::OpKernel { // create a conv primitive descriptor and save it for usage in backward std::shared_ptr conv_pd; + auto fwd_prop_kind = is_test ? 
mkldnn::prop_kind::forward_inference + : mkldnn::prop_kind::forward_training; if (bias) { bias_tz = paddle::framework::vectorize2int(bias->dims()); auto bias_md = platform::MKLDNNMemDesc( bias_tz, platform::MKLDNNGetDataType(), memory::format::x); - conv_pd = ConvFwdPrimitiveDesc(src_md, weights_md, bias_md, dst_md, - strides, paddings, mkldnn_engine, - fuse_relu, fuse_eltwise); + conv_pd = ConvFwdPrimitiveDesc( + src_md, weights_md, bias_md, dst_md, strides, paddings, mkldnn_engine, + fuse_relu, fuse_residual_conn, fwd_prop_kind); } else { - conv_pd = - ConvFwdPrimitiveDesc(src_md, weights_md, dst_md, strides, paddings, - mkldnn_engine, fuse_relu, fuse_eltwise); + conv_pd = ConvFwdPrimitiveDesc(src_md, weights_md, dst_md, strides, + paddings, mkldnn_engine, fuse_relu, + fuse_residual_conn, fwd_prop_kind); } // Save conv_pd/src_memory/weights_memory for backward pass - dev_ctx.SetBlob(key_conv_pd, conv_pd); + if (!is_test) dev_ctx.SetBlob(key_conv_pd, conv_pd); ConvMKLDNNHandler handler(conv_pd, dev_ctx, mkldnn_engine, key); @@ -386,15 +408,54 @@ class ConvMKLDNNOpKernel : public paddle::framework::OpKernel { auto user_weights_memory_p = handler.AcquireWeightsMemory( user_weights_md, to_void_cast(filter_data)); - T* output_data = - output->mutable_data(ctx.GetPlace(), handler.GetDstMemorySize()); // create reorder primitive if the input format is not the preferred one auto src_memory_p = handler.AcquireSrcMemoryFromPrimitive(user_src_memory_p, pipeline); auto weights_memory_p = handler.AcquireWeightsMemoryFromPrimitive( user_weights_memory_p, pipeline, is_test); - auto dst_memory_p = - handler.AcquireDstMemoryFromPrimitive(to_void_cast(output_data)); + + std::shared_ptr dst_memory_p; + + if (fuse_residual_conn) { + auto residual_param = ctx.Input("ResidualData"); + auto residual_param_data = residual_param->data(); + + PADDLE_ENFORCE( + residual_param_data != nullptr, + "Provide data if you want MKLDNN conv+elementwise_add fusion"); + PADDLE_ENFORCE_EQ(output->dims(), residual_param->dims(), + "Output and elementwise parameter need to have the " + "same dimension sizes"); + + if (residual_param->format() != handler.GetDstFormat()) { + auto output_data = output->mutable_data( + ctx.GetPlace(), ::paddle::memory::Allocator::kDefault, + handler.GetDstMemorySize()); + auto residual_data_tz = + paddle::framework::vectorize2int(residual_param->dims()); + auto residual_data_type = + paddle::framework::ToMKLDNNDataType(residual_param->type()); + + auto user_residual_md = platform::MKLDNNMemDesc( + residual_data_tz, residual_data_type, residual_param->format()); + auto user_residual_memory_p = handler.AcquireResidualDataMemory( + user_residual_md, to_void_cast(residual_param_data)); + + dst_memory_p = handler.AcquireDstMemoryFromResidualDataMemory( + user_residual_memory_p, to_void_cast(output_data), pipeline); + } else { + output->ShareDataWith(*residual_param); + auto output_data = output->mutable_data(ctx.GetPlace()); + dst_memory_p = + handler.AcquireDstMemoryFromPrimitive(to_void_cast(output_data)); + } + } else { + auto output_data = output->mutable_data( + ctx.GetPlace(), paddle::memory::Allocator::kDefault, + handler.GetDstMemorySize()); + dst_memory_p = + handler.AcquireDstMemoryFromPrimitive(to_void_cast(output_data)); + } // create convolution op primitive std::shared_ptr conv_p; @@ -424,14 +485,15 @@ class ConvMKLDNNOpKernel : public paddle::framework::OpKernel { private: mkldnn::primitive_attr CreatePostOps(bool fuse_relu, - bool fuse_eltwise) const { + bool fuse_residual_conn) const { 
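+ // Build the MKL-DNN post-op chain applied after the convolution itself: + // first the optional sum (residual fusion), then the optional ReLU.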
mkldnn::primitive_attr conv_attr; mkldnn::post_ops post_operations; // Fusion with Elementwise layer relies on adding a sum post-operation with - // the scale parameter. It is assumed that when fuse_eltwise is true, the - // Output tensor contains the data coming from residual connection. The - // result of this post_op is: Output = scale * Output + Conv_Out. - if (fuse_eltwise) { + // the scale parameter. It is assumed that when fuse_residual_connection is + // true, the output tensor contains the data coming from residual + // connection. The result of this post_op is: + // Output = scale * Output + Conv_Out. + if (fuse_residual_conn) { post_operations.append_sum(1.0f); } // Fusion with ReLU layer is executed through the PostOps feature. Create a @@ -452,16 +514,17 @@ class ConvMKLDNNOpKernel : public paddle::framework::OpKernel { const memory::desc& dst, const std::vector& strides, const std::vector& paddings, const mkldnn::engine& engine, const bool fuse_relu, - const bool fuse_eltwise) const { + const bool fuse_residual_conn, + mkldnn::prop_kind fwd_prop_kind) const { memory::dims stride_dims = {strides[0], strides[1]}; memory::dims padding_dims = {paddings[0], paddings[1]}; auto conv_desc = mkldnn::convolution_forward::desc( - mkldnn::prop_kind::forward, mkldnn::convolution_direct, src, weights, - dst, stride_dims, padding_dims, padding_dims, - mkldnn::padding_kind::zero); + fwd_prop_kind, mkldnn::convolution_direct, src, weights, dst, + stride_dims, padding_dims, padding_dims, mkldnn::padding_kind::zero); - mkldnn::primitive_attr conv_attr = CreatePostOps(fuse_relu, fuse_eltwise); + mkldnn::primitive_attr conv_attr = + CreatePostOps(fuse_relu, fuse_residual_conn); auto p_conv_pd = new mkldnn::convolution_forward::primitive_desc( conv_desc, conv_attr, engine); @@ -476,16 +539,17 @@ class ConvMKLDNNOpKernel : public paddle::framework::OpKernel { const std::vector& strides, const std::vector& paddings, const mkldnn::engine& engine, const bool fuse_relu, - const bool fuse_eltwise) const { + const bool fuse_residual_conn, + mkldnn::prop_kind fwd_prop_kind) const { memory::dims stride_dims = {strides[0], strides[1]}; memory::dims padding_dims = {paddings[0], paddings[1]}; auto conv_desc = mkldnn::convolution_forward::desc( - mkldnn::prop_kind::forward, mkldnn::convolution_direct, src, weights, - bias, dst, stride_dims, padding_dims, padding_dims, - mkldnn::padding_kind::zero); + fwd_prop_kind, mkldnn::convolution_direct, src, weights, bias, dst, + stride_dims, padding_dims, padding_dims, mkldnn::padding_kind::zero); - mkldnn::primitive_attr conv_attr = CreatePostOps(fuse_relu, fuse_eltwise); + mkldnn::primitive_attr conv_attr = + CreatePostOps(fuse_relu, fuse_residual_conn); auto p_conv_pd = new mkldnn::convolution_forward::primitive_desc( conv_desc, conv_attr, engine); @@ -527,6 +591,10 @@ class ConvMKLDNNGradOpKernel : public paddle::framework::OpKernel { output_grad->format() != memory::format::format_undef, "Wrong layout/format set for output_grad tensor"); + PADDLE_ENFORCE( + !ctx.Attr("is_test"), + "is_test attribute should be set to False in training phase."); + if (!input_grad && !filter_grad) return; std::vector strides = ctx.Attr>("strides"); @@ -626,7 +694,8 @@ class ConvMKLDNNGradOpKernel : public paddle::framework::OpKernel { user_diff_dst_memory_p, pipeline); const size_t size = handler.GetDiffWeightsMemorySize(); - filter_grad_data = filter_grad->mutable_data(ctx.GetPlace(), size); + filter_grad_data = filter_grad->mutable_data( + ctx.GetPlace(), 
paddle::memory::Allocator::kDefault, size); auto diff_weights_memory_p = handler.AcquireDiffWeightsMemoryFromWeightsPrimitive( @@ -651,7 +720,8 @@ class ConvMKLDNNGradOpKernel : public paddle::framework::OpKernel { pipeline); const size_t size = handler.GetDiffSourceMemorySize(); - input_grad_data = input_grad->mutable_data(ctx.GetPlace(), size); + input_grad_data = input_grad->mutable_data( + ctx.GetPlace(), paddle::memory::Allocator::kDefault, size); auto diff_src_memory_p = handler.AcquireDiffSrcMemoryFromDataPrimitive( reinterpret_cast(input_grad_data)); diff --git a/paddle/fluid/operators/conv_op.cc b/paddle/fluid/operators/conv_op.cc index 8f84bf71a7f77606bed6672f0830e3fc80165a42..342525be49e28f1785e25d4daad38c3c81b4774f 100644 --- a/paddle/fluid/operators/conv_op.cc +++ b/paddle/fluid/operators/conv_op.cc @@ -109,7 +109,10 @@ framework::OpKernelType ConvOp::GetExpectedKernelType( } void Conv2DOpMaker::Make() { - AddAttr("is_test", "").SetDefault(false); + AddAttr("is_test", + "(bool, default false) Set to true for inference only, false " + "for training. Some layers may run faster when this is true.") + .SetDefault(false); AddInput( "Input", "(Tensor) The input tensor of convolution operator. " @@ -130,8 +133,12 @@ void Conv2DOpMaker::Make() { .AsDispensable(); AddOutput("Output", "(Tensor) The output tensor of convolution operator. " - "The format of output tensor is also NCHW.") - .Reuse("Input"); + "The format of output tensor is also NCHW."); + AddInput("ResidualData", + "(Tensor) Tensor with residual data " + "to which convolution output will be added. " + "Used with fuse_residual_connection fusion.") + .AsDispensable(); AddAttr>("strides", "(vector default:{1, 1}), the " "strides(h_stride, w_stride) of " @@ -164,10 +171,10 @@ void Conv2DOpMaker::Make() { .SetDefault(false); AddAttr("fuse_relu", "(bool, default false) Only used in mkldnn kernel") .SetDefault(false); - AddAttr("fuse_eltwise", + AddAttr("fuse_residual_connection", "(bool, default false) Only used in mkldnn kernel. Used " - "whenever convolution output is connected via skip connection " - "to a previous layer.") + "whenever the convolution output serves as an input to a " + "residual connection.") .SetDefault(false); AddAttr( "data_format", @@ -185,6 +192,11 @@ void Conv2DOpMaker::Make() { "workspace size can increase performance but also requires " "better hardware. This size should be chosen carefully.") .SetDefault(4096); + AddAttr("exhaustive_search", + "(bool, default false) cuDNN has many algorithms to compute " + "convolution; whether to enable an exhaustive search for the " + "fastest cuDNN convolution algorithm. Default is False.") + .SetDefault(false); AddComment(R"DOC( Convolution Operator. @@ -213,6 +225,7 @@ $$ W_{out}= \frac{(W_{in} + 2 * paddings[1] - (dilations[1] * (W_f - 1) + 1))}{strides[1]}+ 1 $$ )DOC"); + Apply(); } void Conv3DOpMaker::Make() { @@ -233,8 +246,7 @@ void Conv3DOpMaker::Make() { "input image channels divided by the groups."); AddOutput("Output", "(Tensor) The output tensor of convolution operator." - "The format of output tensor is also NCDHW.") - .Reuse("Input"); + "The format of output tensor is also NCDHW."); AddAttr>("strides", "(vector, default:{1, 1, 1}), the " "strides(d_stride, h_stride, w_stride) of " @@ -280,7 +292,11 @@ void Conv3DOpMaker::Make() { "workspace size can increase performance but also requires " "better hardware. 
This size should be chosen carefully.") .SetDefault(4096); - + AddAttr("exhaustive_search", + "(bool, default false) cuDNN has many algorithms to compute " + "convolution; whether to enable an exhaustive search for the " + "fastest cuDNN convolution algorithm. Default is False.") + .SetDefault(false); AddComment(R"DOC( Convolution3D Operator. @@ -310,6 +326,7 @@ Example: W_{out}= \frac{(W_{in} + 2 * paddings[2] - (dilations[2] * (W_f - 1) + 1))}{ strides[2]}+ 1 $$ )DOC"); + Apply(); } void ConvOpGrad::InferShape(framework::InferShapeContext* ctx) const { @@ -353,6 +370,7 @@ framework::OpKernelType ConvOpGrad::GetExpectedKernelType( namespace ops = paddle::operators; REGISTER_OPERATOR(conv2d, ops::ConvOp, ops::Conv2DOpMaker, + ops::ConvOpInferVarType, paddle::framework::DefaultGradOpDescMaker); REGISTER_OPERATOR(conv2d_grad, ops::ConvOpGrad); @@ -360,7 +378,9 @@ REGISTER_OPERATOR(conv2d_grad, ops::ConvOpGrad); REGISTER_OPERATOR(depthwise_conv2d, ops::ConvOp, ops::Conv2DOpMaker, paddle::framework::DefaultGradOpDescMaker); REGISTER_OPERATOR(depthwise_conv2d_grad, ops::ConvOpGrad); + REGISTER_OPERATOR(conv3d, ops::ConvOp, ops::Conv3DOpMaker, + ops::ConvOpInferVarType, paddle::framework::DefaultGradOpDescMaker); REGISTER_OPERATOR(conv3d_grad, ops::ConvOpGrad); diff --git a/paddle/fluid/operators/conv_op.h b/paddle/fluid/operators/conv_op.h index ef76106f17218a03d24ebc0eca43dbb0ae935093..e69814001e4da5d10e51ee57c1dbe291338b8b49 100644 --- a/paddle/fluid/operators/conv_op.h +++ b/paddle/fluid/operators/conv_op.h @@ -14,6 +14,7 @@ limitations under the License. */ #pragma once +#include #include #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" @@ -60,12 +61,27 @@ inline bool IsExpand(const std::vector& filter_dim, // operator implementations can reuse the code.
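// Make() is final: it adds the attributes common to all conv makers and // finishes by calling the protected Apply() hook, which subclasses such as // Conv2DFusionOpMaker override to append their own attributes.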
class Conv2DOpMaker : public framework::OpProtoAndCheckerMaker { public: - void Make() override; + void Make() final; + + protected: + virtual void Apply() {} }; class Conv3DOpMaker : public framework::OpProtoAndCheckerMaker { public: - void Make() override; + void Make() final; + + protected: + virtual void Apply() {} +}; + +class ConvOpInferVarType : public framework::PassInDtypeAndVarTypeToOutput { + protected: + std::unordered_map GetInputOutputWithSameType() + const override { + return std::unordered_map{ + {"Input", /*->*/ "Output"}}; + } }; class ConvOp : public framework::OperatorWithKernel { diff --git a/paddle/fluid/operators/conv_transpose_cudnn_op.cu.cc b/paddle/fluid/operators/conv_transpose_cudnn_op.cu.cc index 73831611d01b8c5b8d2d9f7f15634a0094e4a608..f44094ca6b7b7f23f2e7593ad79e4e2a6f0d3070 100644 --- a/paddle/fluid/operators/conv_transpose_cudnn_op.cu.cc +++ b/paddle/fluid/operators/conv_transpose_cudnn_op.cu.cc @@ -104,6 +104,7 @@ class CUDNNConvTransposeOpKernel : public framework::OpKernel { int output_offset = output->numel() / output->dims()[0] / groups; int filter_offset = filter->numel() / groups; T alpha = 1.0f, beta = 0.0f; + auto workspace_handle = dev_ctx.cudnn_workspace_handle(); for (int g = 0; g < groups; g++) { auto cudnn_func = [&](void* cudnn_workspace) { CUDNN_ENFORCE(platform::dynload::cudnnConvolutionBackwardData( @@ -112,7 +113,7 @@ class CUDNNConvTransposeOpKernel : public framework::OpKernel { algo, cudnn_workspace, workspace_size_in_bytes, &beta, cudnn_output_desc, output_data + output_offset * g)); }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); } } }; @@ -208,6 +209,7 @@ class CUDNNConvTransposeGradOpKernel : public framework::OpKernel { output_grad->numel() / output_grad->dims()[0] / groups; int filter_offset = filter->numel() / groups; T alpha = 1.0f, beta = 0.0f; + auto workspace_handle = dev_ctx.cudnn_workspace_handle(); if (input_grad) { T* input_grad_data = input_grad->mutable_data(ctx.GetPlace()); // Because beta is zero, it is unnecessary to reset input_grad. @@ -220,7 +222,7 @@ class CUDNNConvTransposeGradOpKernel : public framework::OpKernel { cudnn_workspace, workspace_size_in_bytes, &beta, cudnn_input_desc, input_grad_data + input_offset * g)); }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); } } @@ -238,7 +240,7 @@ class CUDNNConvTransposeGradOpKernel : public framework::OpKernel { cudnn_workspace, workspace_size_in_bytes, &beta, cudnn_filter_desc, filter_grad_data + filter_offset * g)); }; - dev_ctx.RunCudnnFuncWithWorkspace(cudnn_func, workspace_size_in_bytes); + workspace_handle.RunFunc(cudnn_func, workspace_size_in_bytes); } } } diff --git a/paddle/fluid/operators/crf_decoding_op.h b/paddle/fluid/operators/crf_decoding_op.h index 8181897c3d3844bda5574e85a08b2af038fcd664..e9d2e84a434d7084c526a6e75363a65577197262 100644 --- a/paddle/fluid/operators/crf_decoding_op.h +++ b/paddle/fluid/operators/crf_decoding_op.h @@ -16,6 +16,7 @@ limitations under the License. 
*/ #include #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/jit_kernel.h" #include "paddle/fluid/operators/math/math_function.h" namespace paddle { @@ -69,9 +70,6 @@ class CRFDecodingOpKernel : public framework::OpKernel { auto emission_dims = emission_weights.dims(); const size_t seq_len = emission_dims[0]; const size_t tag_num = emission_dims[1]; - - const size_t state_trans_base_idx = 2; - const T* x = emission_weights.data(); const T* w = transition_weights.data(); int64_t* path = decoded_path->data(); @@ -84,221 +82,10 @@ class CRFDecodingOpKernel : public framework::OpKernel { Tensor track; int* track_value = track.mutable_data(emission_dims, platform::CPUPlace()); - -#ifdef __AVX__ -// It use the AVX or AVX512 instruction to deal the data as the vector of 8 or -// 16 elements per iteration. Then it can implement the parallel processing. -// Only optimize for float type. -#ifdef __AVX512F__ - size_t step_size = 16; -#else - size_t step_size = 8; -#endif - if (std::is_same::value && (tag_num >= step_size)) { - size_t steps = tag_num / step_size; - size_t remain = tag_num % step_size; - int last_offset = static_cast(remain) - static_cast(step_size); - - // Setup the alpha initial value. - size_t i_offset = 0; - for (size_t i = 0; i <= steps; ++i) { -#ifdef __AVX512F__ - // Declare the variable for the content of weights, input and alpha - // values. - __m512 w_content, x_content, alpha_content; - - // Load the relevant data into the variables from un-aligned address. - w_content = _mm512_loadu_ps((const float*)(w + i_offset)); - x_content = _mm512_loadu_ps((const float*)(x + i_offset)); - alpha_content = _mm512_add_ps(w_content, x_content); - - // Save the alpha value. - _mm512_storeu_ps(reinterpret_cast(alpha_value + i_offset), - alpha_content); -#else - // Declare the variable for the content of weights, input and alpha - // values. - __m256 w_content, x_content, alpha_content; - - // Load the relevant data into the variables from un-aligned address. - w_content = _mm256_loadu_ps((const float*)(w + i_offset)); - x_content = _mm256_loadu_ps((const float*)(x + i_offset)); - alpha_content = _mm256_add_ps(w_content, x_content); - - // Save the alpha value. - _mm256_storeu_ps(reinterpret_cast(alpha_value + i_offset), - alpha_content); -#endif - i_offset += step_size; - if (i == steps - 1) { - if (remain > 0) { - i_offset += last_offset; - } else { - break; - } - } - } - - // Use the column-major strategy to get the location of maximum score. - size_t seq_offset = 0; - for (size_t k = 1; k < seq_len; ++k) { - size_t j_offset = 0; - for (size_t j = 0; j <= steps; ++j) { -#ifdef __AVX512F__ - // Initialize the variables of maximum score and location. - __m512 max_score = _mm512_set1_ps(-std::numeric_limits::max()); - __m512i max_j = _mm512_setzero_si512(); -#else - // Initialize the variables of maximum score and location. - __m256 max_score = _mm256_set1_ps(-std::numeric_limits::max()); - __m256i max_j = _mm256_set1_epi32(0); -#endif - // Calculate the offset of transition_weights. - size_t trans_offset = state_trans_base_idx * tag_num + j_offset; - for (size_t i = 0; i < tag_num; ++i) { -#ifdef __AVX512F__ - // Initalize the content of alpha variable with related offset. - __m512 alpha_content = - _mm512_set1_ps(*(const float*)(alpha_value + seq_offset + i)); - // Obtain the content of weights from un-aligned address. 
- __m512 w_content = - _mm512_loadu_ps((const float*)(w + trans_offset)); - - __m512 score_v = _mm512_add_ps(alpha_content, w_content); - - __mmask16 mask = _mm512_cmp_ps_mask(score_v, max_score, _CMP_GT_OS); - - // According to the mask value, it update the index of the max_score - // location. - max_j = _mm512_mask_set1_epi32(max_j, mask, i); - - // Update the max_score value. - max_score = _mm512_max_ps(max_score, score_v); -#else - // Initalize the content of alpha variable with related offset. - __m256 alpha_content = _mm256_broadcast_ss( - (const float*)(alpha_value + seq_offset + i)); - // Obtain the content of weights from un-aligned address. - __m256 w_content = - _mm256_loadu_ps((const float*)(w + trans_offset)); - __m256 score_v = _mm256_add_ps(alpha_content, w_content); - - __m256 mask = _mm256_cmp_ps(score_v, max_score, _CMP_GT_OS); - -#ifdef __AVX2__ - // According to the mask value, it update the index of the max_score - // location. - max_j = _mm256_or_si256( - _mm256_andnot_si256((__m256i)mask, max_j), - _mm256_and_si256((__m256i)mask, _mm256_set1_epi32(i))); -#else - __m128i lo_max_j = _mm256_extractf128_si256(max_j, 0); - __m128i hi_max_j = _mm256_extractf128_si256(max_j, 1); - __m128i lo_mask = _mm256_extractf128_si256((__m256i)mask, 0); - __m128i hi_mask = _mm256_extractf128_si256((__m256i)mask, 1); - - lo_max_j = _mm_andnot_si128(lo_mask, lo_max_j); - hi_max_j = _mm_andnot_si128(hi_mask, hi_max_j); - lo_mask = _mm_and_si128(lo_mask, _mm_set1_epi32(i)); - hi_mask = _mm_and_si128(hi_mask, _mm_set1_epi32(i)); - - lo_max_j = _mm_or_si128(lo_mask, lo_max_j); - hi_max_j = _mm_or_si128(hi_mask, hi_max_j); - - // According to the mask value, it update the index of the max_score - // location. - max_j = _mm256_insertf128_si256(max_j, lo_max_j, 0); - max_j = _mm256_insertf128_si256(max_j, hi_max_j, 1); -#endif - - // Update the max_score value. - max_score = _mm256_max_ps(max_score, score_v); -#endif - trans_offset += tag_num; - } - -#ifdef __AVX512F__ - // Update the alpha and track values. - __m512 x_content = _mm512_loadu_ps( - (const float*)(x + seq_offset + tag_num + j_offset)); - max_score = _mm512_add_ps(max_score, x_content); - _mm512_storeu_ps(reinterpret_cast(alpha_value + seq_offset + - tag_num + j_offset), - max_score); - _mm512_storeu_si512( - reinterpret_cast<__m512i*>(track_value + seq_offset + tag_num + - j_offset), - max_j); -#else - // Update the alpha and track values. 
- __m256 x_content = _mm256_loadu_ps( - (const float*)(x + seq_offset + tag_num + j_offset)); - max_score = _mm256_add_ps(max_score, x_content); - _mm256_storeu_ps(reinterpret_cast(alpha_value + seq_offset + - tag_num + j_offset), - max_score); - _mm256_storeu_si256( - reinterpret_cast<__m256i*>(track_value + seq_offset + tag_num + - j_offset), - max_j); -#endif - - // Calculate the offset of next step - j_offset += step_size; - if (j == steps - 1) { - if (remain > 0) { - j_offset += last_offset; - } else { - break; - } - } - } - - seq_offset += tag_num; - } - } else { - for (size_t i = 0; i < tag_num; ++i) alpha_value[i] = w[i] + x[i]; - - for (size_t k = 1; k < seq_len; ++k) { - for (size_t i = 0; i < tag_num; ++i) { - T max_score = -std::numeric_limits::max(); - int max_j = 0; - for (size_t j = 0; j < tag_num; ++j) { - T score = alpha_value[(k - 1) * tag_num + j] + - w[(j + state_trans_base_idx) * tag_num + i]; - if (score > max_score) { - max_score = score; - max_j = j; - } - } - - alpha_value[k * tag_num + i] = max_score + x[k * tag_num + i]; - track_value[k * tag_num + i] = max_j; - } - } - } -#else - for (size_t i = 0; i < tag_num; ++i) alpha_value[i] = w[i] + x[i]; - - for (size_t k = 1; k < seq_len; ++k) { - for (size_t i = 0; i < tag_num; ++i) { - T max_score = -std::numeric_limits::max(); - int max_j = 0; - for (size_t j = 0; j < tag_num; ++j) { - T score = alpha_value[(k - 1) * tag_num + j] + - w[(j + state_trans_base_idx) * tag_num + i]; - if (score > max_score) { - max_score = score; - max_j = j; - } - } - - alpha_value[k * tag_num + i] = max_score + x[k * tag_num + i]; - track_value[k * tag_num + i] = max_j; - } - } - -#endif + const auto& ker = math::jitkernel::KernelPool::Instance() + .template Get>( + static_cast(tag_num)); + ker->Compute(static_cast(seq_len), x, w, alpha_value, track_value); T max_score = -std::numeric_limits::max(); int max_i = 0; for (size_t i = 0; i < tag_num; ++i) { diff --git a/paddle/fluid/operators/cross_entropy_op.cc b/paddle/fluid/operators/cross_entropy_op.cc index 66f19fe7ecfa51b2ce917f0c5fcb6d486f1a7307..a904dd91302c951560dc32ac107d4d73b6024c25 100644 --- a/paddle/fluid/operators/cross_entropy_op.cc +++ b/paddle/fluid/operators/cross_entropy_op.cc @@ -13,6 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/cross_entropy_op.h" +#include namespace paddle { namespace operators { @@ -179,6 +180,15 @@ or not. But the output only shares the LoD information with input X. 
)DOC"); } }; + +class CrossEntropyOpInferVarType + : public framework::PassInDtypeAndVarTypeToOutput { + protected: + std::unordered_map GetInputOutputWithSameType() + const override { + return std::unordered_map{{"X", /*->*/ "Y"}}; + } +}; } // namespace operators } // namespace paddle @@ -186,6 +196,7 @@ namespace ops = paddle::operators; using CPUCtx = paddle::platform::CPUDeviceContext; REGISTER_OPERATOR(cross_entropy, ops::CrossEntropyOp, ops::CrossEntropyOpMaker, + ops::CrossEntropyOpInferVarType, paddle::framework::DefaultGradOpDescMaker); REGISTER_OPERATOR(cross_entropy_grad, ops::CrossEntropyGradientOp); REGISTER_OP_CPU_KERNEL(cross_entropy, ops::CrossEntropyOpKernel, diff --git a/paddle/fluid/operators/cross_entropy_op.cu b/paddle/fluid/operators/cross_entropy_op.cu index 30dbd5bd3d39dd2992c3dd91364003bb7715a2eb..fcd34383a85f6984a8f27ce0625364f8fd5e31d6 100644 --- a/paddle/fluid/operators/cross_entropy_op.cu +++ b/paddle/fluid/operators/cross_entropy_op.cu @@ -13,12 +13,17 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/cross_entropy_op.h" +#include "paddle/fluid/platform/float16.h" +namespace plat = paddle::platform; namespace ops = paddle::operators; using CUDACtx = paddle::platform::CUDADeviceContext; REGISTER_OP_CUDA_KERNEL(cross_entropy, ops::CrossEntropyOpKernel, - ops::CrossEntropyOpKernel); -REGISTER_OP_CUDA_KERNEL(cross_entropy_grad, - ops::CrossEntropyGradientOpKernel, - ops::CrossEntropyGradientOpKernel); + ops::CrossEntropyOpKernel, + ops::CrossEntropyOpKernel); + +REGISTER_OP_CUDA_KERNEL( + cross_entropy_grad, ops::CrossEntropyGradientOpKernel, + ops::CrossEntropyGradientOpKernel, + ops::CrossEntropyGradientOpKernel); diff --git a/paddle/fluid/operators/csp/CMakeLists.txt b/paddle/fluid/operators/csp/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..5d468316e8eacb73c4a4ce81c784880bb5e46c2d --- /dev/null +++ b/paddle/fluid/operators/csp/CMakeLists.txt @@ -0,0 +1,2 @@ +include(operators) +register_operators() diff --git a/paddle/fluid/operators/go_op.cc b/paddle/fluid/operators/csp/go_op.cc similarity index 100% rename from paddle/fluid/operators/go_op.cc rename to paddle/fluid/operators/csp/go_op.cc diff --git a/paddle/fluid/operators/delete_var_op.cc b/paddle/fluid/operators/delete_var_op.cc index d7a9bfbc437dbf4c723b9c87ff62ec6b62c38638..89416f7ab5d07ddac5b540b9bb361f831c1ef360 100644 --- a/paddle/fluid/operators/delete_var_op.cc +++ b/paddle/fluid/operators/delete_var_op.cc @@ -32,6 +32,11 @@ class DeleteVarOp : public framework::OperatorBase { } }; +class DeleteVarOpShapeInference : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext *ctx) const override {} +}; + class DeleteVarOpInfoMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { @@ -48,4 +53,5 @@ It should not be configured by users directly. 
REGISTER_OPERATOR(delete_var, paddle::operators::DeleteVarOp, paddle::framework::EmptyGradOpMaker, - paddle::operators::DeleteVarOpInfoMaker); + paddle::operators::DeleteVarOpInfoMaker, + paddle::operators::DeleteVarOpShapeInference); diff --git a/paddle/fluid/operators/detection/CMakeLists.txt b/paddle/fluid/operators/detection/CMakeLists.txt index aa8ed502fc94bd0970dfe5dbf00ef090e799ad30..6c85f1577e0c49d00f4ccf7fa7be0974eb62bdf3 100644 --- a/paddle/fluid/operators/detection/CMakeLists.txt +++ b/paddle/fluid/operators/detection/CMakeLists.txt @@ -20,8 +20,9 @@ detection_library(box_coder_op SRCS box_coder_op.cc box_coder_op.cu) detection_library(iou_similarity_op SRCS iou_similarity_op.cc iou_similarity_op.cu) detection_library(mine_hard_examples_op SRCS mine_hard_examples_op.cc) -detection_library(multiclass_nms_op SRCS multiclass_nms_op.cc) +detection_library(multiclass_nms_op SRCS multiclass_nms_op.cc poly_util.cc gpc.cc) detection_library(prior_box_op SRCS prior_box_op.cc prior_box_op.cu) +detection_library(density_prior_box_op SRCS density_prior_box_op.cc density_prior_box_op.cu) detection_library(anchor_generator_op SRCS anchor_generator_op.cc anchor_generator_op.cu) detection_library(target_assign_op SRCS target_assign_op.cc @@ -39,4 +40,8 @@ endif() detection_library(roi_perspective_transform_op SRCS roi_perspective_transform_op.cc roi_perspective_transform_op.cu) #Export local libraries to parent -set(DETECTION_LIBRARY ${LOCAL_DETECTION_LIBS} PARENT_SCOPE) +# set(DETECTION_LIBRARY ${LOCAL_DETECTION_LIBS} PARENT_SCOPE) + +foreach(src ${LOCAL_DETECTION_LIBS}) + set(OP_LIBRARY ${src} ${OP_LIBRARY} CACHE INTERNAL "op libs") +endforeach() diff --git a/paddle/fluid/operators/detection/box_coder_op.cc b/paddle/fluid/operators/detection/box_coder_op.cc index d0f95f727fdbc82777147e3e8ada6ad4f7a35e60..06fbb9815c52ea69e3aa9e893512e039853b9514 100644 --- a/paddle/fluid/operators/detection/box_coder_op.cc +++ b/paddle/fluid/operators/detection/box_coder_op.cc @@ -30,27 +30,30 @@ class BoxCoderOp : public framework::OperatorWithKernel { auto prior_box_dims = ctx->GetInputDim("PriorBox"); auto target_box_dims = ctx->GetInputDim("TargetBox"); - PADDLE_ENFORCE_EQ(prior_box_dims.size(), 2, - "The rank of Input of PriorBoxVar must be 2"); - PADDLE_ENFORCE_EQ(prior_box_dims[1], 4, "The shape of PriorBox is [N, 4]"); - if (ctx->HasInput("PriorBoxVar")) { - auto prior_box_var_dims = ctx->GetInputDim("PriorBoxVar"); - PADDLE_ENFORCE_EQ(prior_box_dims, prior_box_var_dims); + if (ctx->IsRuntime()) { + PADDLE_ENFORCE_EQ(prior_box_dims.size(), 2, + "The rank of Input of PriorBoxVar must be 2"); + PADDLE_ENFORCE_EQ(prior_box_dims[1], 4, + "The shape of PriorBox is [N, 4]"); + if (ctx->HasInput("PriorBoxVar")) { + auto prior_box_var_dims = ctx->GetInputDim("PriorBoxVar"); + PADDLE_ENFORCE_EQ(prior_box_dims, prior_box_var_dims); + } + + auto code_type = + GetBoxCodeType(ctx->Attrs().Get("code_type")); + if (code_type == BoxCodeType::kEncodeCenterSize) { + PADDLE_ENFORCE_EQ(target_box_dims.size(), 2, + "The rank of Input of TargetBox must be 2"); + PADDLE_ENFORCE_EQ(target_box_dims[1], 4, + "The shape of TargetBox is [M, 4]"); + } else if (code_type == BoxCodeType::kDecodeCenterSize) { + PADDLE_ENFORCE_EQ(target_box_dims.size(), 3, + "The rank of Input of TargetBox must be 3"); + PADDLE_ENFORCE_EQ(target_box_dims[1], prior_box_dims[0]); + PADDLE_ENFORCE_EQ(target_box_dims[2], prior_box_dims[1]); + } } - - auto code_type = GetBoxCodeType(ctx->Attrs().Get("code_type")); - if (code_type == 
BoxCodeType::kEncodeCenterSize) { - PADDLE_ENFORCE_EQ(target_box_dims.size(), 2, - "The rank of Input of TargetBox must be 2"); - PADDLE_ENFORCE_EQ(target_box_dims[1], 4, - "The shape of TargetBox is [M, 4]"); - } else if (code_type == BoxCodeType::kDecodeCenterSize) { - PADDLE_ENFORCE_EQ(target_box_dims.size(), 3, - "The rank of Input of TargetBox must be 3"); - PADDLE_ENFORCE_EQ(target_box_dims[1], prior_box_dims[0]); - PADDLE_ENFORCE_EQ(target_box_dims[2], prior_box_dims[1]); - } - ctx->SetOutputDim( "OutputBox", framework::make_ddim({target_box_dims[0], prior_box_dims[0], 4})); diff --git a/paddle/fluid/operators/detection/density_prior_box_op.cc b/paddle/fluid/operators/detection/density_prior_box_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..1012ba3652ddfb06af9292e8061684481f1dbef3 --- /dev/null +++ b/paddle/fluid/operators/detection/density_prior_box_op.cc @@ -0,0 +1,181 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/detection/density_prior_box_op.h" + +namespace paddle { +namespace operators { + +class DensityPriorBoxOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Input"), + "Input(Input) of DensityPriorBoxOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Image"), + "Input(Image) of DensityPriorBoxOp should not be null."); + + auto image_dims = ctx->GetInputDim("Image"); + auto input_dims = ctx->GetInputDim("Input"); + PADDLE_ENFORCE(image_dims.size() == 4, "The layout of image is NCHW."); + PADDLE_ENFORCE(input_dims.size() == 4, "The layout of input is NCHW."); + + PADDLE_ENFORCE_LT(input_dims[2], image_dims[2], + "The height of input must be smaller than that of image."); + + PADDLE_ENFORCE_LT(input_dims[3], image_dims[3], + "The width of input must be smaller than that of image."); + auto variances = ctx->Attrs().Get>("variances"); + + auto fixed_sizes = ctx->Attrs().Get>("fixed_sizes"); + auto fixed_ratios = ctx->Attrs().Get>("fixed_ratios"); + auto densities = ctx->Attrs().Get>("densities"); + bool flatten = ctx->Attrs().Get("flatten_to_2d"); + + PADDLE_ENFORCE_EQ(fixed_sizes.size(), densities.size(), + "The number of fixed_sizes and densities must be equal."); + size_t num_priors = 0; + for (size_t i = 0; i < densities.size(); ++i) { + num_priors += (fixed_ratios.size()) * (pow(densities[i], 2)); + } + if (!flatten) { + std::vector dim_vec(4); + dim_vec[0] = input_dims[2]; + dim_vec[1] = input_dims[3]; + dim_vec[2] = num_priors; + dim_vec[3] = 4; + ctx->SetOutputDim("Boxes", framework::make_ddim(dim_vec)); + ctx->SetOutputDim("Variances", framework::make_ddim(dim_vec)); + } else { + int64_t dim0 = input_dims[2] * input_dims[3] * num_priors; + ctx->SetOutputDim("Boxes", {dim0, 4}); + ctx->SetOutputDim("Variances", {dim0, 4}); + } + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const
framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + framework::ToDataType(ctx.Input("Input")->type()), + ctx.GetPlace()); + } +}; + +class DensityPriorBoxOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput( + "Input", + "(Tensor, default Tensor), " + "the input feature data of DensityPriorBoxOp, the layout is NCHW."); + AddInput("Image", + "(Tensor, default Tensor), " + "the input image data of DensityPriorBoxOp, the layout is NCHW."); + AddOutput("Boxes", + "(Tensor, default Tensor), the output prior boxes of " + "DensityPriorBoxOp. The layout is [H, W, num_priors, 4]. " + "H is the height of input, W is the width of input, num_priors " + "is the box count of each position."); + AddOutput("Variances", + "(Tensor, default Tensor), the expanded variances of " + "DensityPriorBoxOp. The layout is [H, W, num_priors, 4]. " + "H is the height of input, W is the width of input, num_priors " + "is the box count of each position."); + AddAttr>("variances", + "(vector) List of variances to be " + "encoded in density prior boxes.") + .AddCustomChecker([](const std::vector& variances) { + PADDLE_ENFORCE_EQ(variances.size(), 4, + "Exactly 4 variances must be provided."); + for (size_t i = 0; i < variances.size(); ++i) { + PADDLE_ENFORCE_GT(variances[i], 0.0, + "variance[%d] must be greater than 0.", i); + } + }); + AddAttr("clip", "(bool) Whether to clip out-of-boundary boxes.") + .SetDefault(true); + AddAttr("flatten_to_2d", + "(bool) Whether to flatten to 2D and " + "the second dim is 4.") + .SetDefault(false); + AddAttr( + "step_w", + "Density prior boxes step across width, 0.0 for auto calculation.") + .SetDefault(0.0) + .AddCustomChecker([](const float& step_w) { + PADDLE_ENFORCE_GE(step_w, 0.0, "step_w should not be less than 0."); + }); + AddAttr( + "step_h", + "Density prior boxes step across height, 0.0 for auto calculation.") + .SetDefault(0.0) + .AddCustomChecker([](const float& step_h) { + PADDLE_ENFORCE_GE(step_h, 0.0, "step_h should not be less than 0."); + }); + + AddAttr("offset", + "(float) " + "Density prior boxes center offset.") + .SetDefault(0.5); + AddAttr>("fixed_sizes", + "(vector) List of fixed sizes " + "of generated density prior boxes.") + .SetDefault(std::vector{}) + .AddCustomChecker([](const std::vector& fixed_sizes) { + for (size_t i = 0; i < fixed_sizes.size(); ++i) { + PADDLE_ENFORCE_GT(fixed_sizes[i], 0.0, + "fixed_sizes[%d] should be larger than 0.", i); + } + }); + + AddAttr>("fixed_ratios", + "(vector) List of fixed ratios " + "of generated density prior boxes.") + .SetDefault(std::vector{}) + .AddCustomChecker([](const std::vector& fixed_ratios) { + for (size_t i = 0; i < fixed_ratios.size(); ++i) { + PADDLE_ENFORCE_GT(fixed_ratios[i], 0.0, + "fixed_ratios[%d] should be larger than 0.", i); + } + }); + + AddAttr>("densities", + "(vector) List of densities " + "of generated density prior boxes.") + .SetDefault(std::vector{}) + .AddCustomChecker([](const std::vector& densities) { + for (size_t i = 0; i < densities.size(); ++i) { + PADDLE_ENFORCE_GT(densities[i], 0, + "densities[%d] should be larger than 0.", i); + } + }); + AddComment(R"DOC( + Density Prior box operator. + Each position of the input produces N density prior boxes, where N is + determined by the counts of fixed_ratios and densities. N is computed as + follows: + for density in densities: + N += size(fixed_ratios)*density^2 + )DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators;
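+// density_prior_box is a forward-only op: its outputs depend only on input +// shapes and attributes, so EmptyGradOpMaker is registered instead of a +// gradient op.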
+REGISTER_OPERATOR(density_prior_box, ops::DensityPriorBoxOp,
+                  ops::DensityPriorBoxOpMaker,
+                  paddle::framework::EmptyGradOpMaker);
+
+REGISTER_OP_CPU_KERNEL(density_prior_box, ops::DensityPriorBoxOpKernel<float>,
+                       ops::DensityPriorBoxOpKernel<double>);
diff --git a/paddle/fluid/operators/detection/density_prior_box_op.cu b/paddle/fluid/operators/detection/density_prior_box_op.cu
new file mode 100644
index 0000000000000000000000000000000000000000..3b7c781795f02b9d9c9f2ead51034193ceb2a745
--- /dev/null
+++ b/paddle/fluid/operators/detection/density_prior_box_op.cu
@@ -0,0 +1,170 @@
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/fluid/operators/detection/density_prior_box_op.h"
+
+namespace paddle {
+namespace operators {
+
+template <typename T>
+static __device__ inline T Clip(T in) {
+  return min(max(in, 0.), 1.);
+}
+
+template <typename T>
+static __global__ void GenDensityPriorBox(
+    const int height, const int width, const int im_height, const int im_width,
+    const T offset, const T step_width, const T step_height,
+    const int num_priors, const T* ratios_shift, bool is_clip, const T var_xmin,
+    const T var_ymin, const T var_xmax, const T var_ymax, T* out, T* var) {
+  int gidx = blockIdx.x * blockDim.x + threadIdx.x;
+  int gidy = blockIdx.y * blockDim.y + threadIdx.y;
+  int step_x = blockDim.x * gridDim.x;
+  int step_y = blockDim.y * gridDim.y;
+
+  const T* width_ratio = ratios_shift;
+  const T* height_ratio = ratios_shift + num_priors;
+  const T* width_shift = ratios_shift + 2 * num_priors;
+  const T* height_shift = ratios_shift + 3 * num_priors;
+
+  for (int j = gidy; j < height; j += step_y) {
+    for (int i = gidx; i < width * num_priors; i += step_x) {
+      int h = j;
+      int w = i / num_priors;
+      int k = i % num_priors;
+
+      T center_x = (w + offset) * step_width;
+      T center_y = (h + offset) * step_height;
+
+      T center_x_temp = center_x + width_shift[k];
+      T center_y_temp = center_y + height_shift[k];
+
+      T box_width_ratio = width_ratio[k] / 2.;
+      T box_height_ratio = height_ratio[k] / 2.;
+
+      T xmin = max((center_x_temp - box_width_ratio) / im_width, 0.);
+      T ymin = max((center_y_temp - box_height_ratio) / im_height, 0.);
+      T xmax = min((center_x_temp + box_width_ratio) / im_width, 1.);
+      T ymax = min((center_y_temp + box_height_ratio) / im_height, 1.);
+
+      int out_offset = (j * width * num_priors + i) * 4;
+      out[out_offset] = is_clip ? Clip<T>(xmin) : xmin;
+      out[out_offset + 1] = is_clip ? Clip<T>(ymin) : ymin;
+      out[out_offset + 2] = is_clip ? Clip<T>(xmax) : xmax;
+      out[out_offset + 3] = is_clip ? Clip<T>(ymax) : ymax;
+
+      var[out_offset] = var_xmin;
+      var[out_offset + 1] = var_ymin;
+      var[out_offset + 2] = var_xmax;
+      var[out_offset + 3] = var_ymax;
+    }
+  }
+}
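Each thread in the kernel above walks a flattened x-range of `width * num_priors` items; the split `i / num_priors`, `i % num_priors` recovers the feature-map column and the prior index. A host-side restatement (illustrative only; the names are invented):

```cpp
// Map a flat index onto (feature-map column, prior index), mirroring
// the kernel's `w = i / num_priors; k = i % num_priors` above.
struct ColPrior {
  int w;  // feature-map column
  int k;  // prior index within the cell
};

inline ColPrior DecodeFlatIndex(int i, int num_priors) {
  return {i / num_priors, i % num_priors};
}
// With num_priors = 3: i = 0,1,2 -> (0,0),(0,1),(0,2); i = 3 -> (1,0).
```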
+
+template <typename T>
+class DensityPriorBoxOpCUDAKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* input = ctx.Input<framework::Tensor>("Input");
+    auto* image = ctx.Input<framework::Tensor>("Image");
+    auto* boxes = ctx.Output<framework::Tensor>("Boxes");
+    auto* vars = ctx.Output<framework::Tensor>("Variances");
+
+    auto variances = ctx.Attr<std::vector<float>>("variances");
+    auto is_clip = ctx.Attr<bool>("clip");
+
+    auto fixed_sizes = ctx.Attr<std::vector<float>>("fixed_sizes");
+    auto fixed_ratios = ctx.Attr<std::vector<float>>("fixed_ratios");
+    auto densities = ctx.Attr<std::vector<int>>("densities");
+
+    T step_w = static_cast<T>(ctx.Attr<float>("step_w"));
+    T step_h = static_cast<T>(ctx.Attr<float>("step_h"));
+    T offset = static_cast<T>(ctx.Attr<float>("offset"));
+
+    auto img_width = image->dims()[3];
+    auto img_height = image->dims()[2];
+
+    auto feature_width = input->dims()[3];
+    auto feature_height = input->dims()[2];
+
+    T step_width, step_height;
+    if (step_w == 0 || step_h == 0) {
+      step_width = static_cast<T>(img_width) / feature_width;
+      step_height = static_cast<T>(img_height) / feature_height;
+    } else {
+      step_width = step_w;
+      step_height = step_h;
+    }
+
+    int num_priors = 0;
+    for (size_t i = 0; i < densities.size(); ++i) {
+      num_priors += (fixed_ratios.size()) * (pow(densities[i], 2));
+    }
+    int step_average = static_cast<int>((step_width + step_height) * 0.5);
+
+    framework::Tensor h_temp;
+    T* tdata = h_temp.mutable_data<T>({num_priors * 4}, platform::CPUPlace());
+    int idx = 0;
+    for (size_t s = 0; s < fixed_sizes.size(); ++s) {
+      auto fixed_size = fixed_sizes[s];
+      int density = densities[s];
+      for (size_t r = 0; r < fixed_ratios.size(); ++r) {
+        float ar = fixed_ratios[r];
+        int shift = step_average / density;
+        float box_width_ratio = fixed_size * sqrt(ar);
+        float box_height_ratio = fixed_size / sqrt(ar);
+        for (int di = 0; di < density; ++di) {
+          for (int dj = 0; dj < density; ++dj) {
+            float center_x_temp = shift / 2. + dj * shift - step_average / 2.;
+            float center_y_temp = shift / 2. + di * shift - step_average / 2.;
+            tdata[idx] = box_width_ratio;
+            tdata[num_priors + idx] = box_height_ratio;
+            tdata[2 * num_priors + idx] = center_x_temp;
+            tdata[3 * num_priors + idx] = center_y_temp;
+            idx++;
+          }
+        }
+      }
+    }
+
+    boxes->mutable_data<T>(ctx.GetPlace());
+    vars->mutable_data<T>(ctx.GetPlace());
+
+    framework::Tensor d_temp;
+    framework::TensorCopySync(h_temp, ctx.GetPlace(), &d_temp);
+
+    // At least use 32 threads, at most 512 threads.
+    // blockx is multiple of 32.
+    int blockx = std::min(((feature_width * num_priors + 31) >> 5) << 5, 512L);
+    int gridx = (feature_width * num_priors + blockx - 1) / blockx;
+    dim3 threads(blockx, 1);
+    dim3 grids(gridx, feature_height);
+
+    auto stream =
+        ctx.template device_context<platform::CUDADeviceContext>().stream();
+    GenDensityPriorBox<T><<<grids, threads, 0, stream>>>(
+        feature_height, feature_width, img_height, img_width, offset,
+        step_width, step_height, num_priors, d_temp.data<T>(), is_clip,
+        variances[0], variances[1], variances[2], variances[3],
+        boxes->data<T>(), vars->data<T>());
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP_CUDA_KERNEL(density_prior_box,
+                        ops::DensityPriorBoxOpCUDAKernel<float>,
+                        ops::DensityPriorBoxOpCUDAKernel<double>);
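The launch configuration above rounds the per-row work size up to a warp multiple and caps it at 512 threads, as the preceding comment states. The same arithmetic in isolation (illustrative, not part of the patch):

```cpp
#include <algorithm>
#include <cstdint>

// ((n + 31) >> 5) << 5 rounds n up to the next multiple of 32 (one warp);
// the result is clamped to the 512-thread cap used by the kernel launch.
inline int BlockX(int64_t work_per_row) {
  return static_cast<int>(
      std::min(((work_per_row + 31) >> 5) << 5, int64_t{512}));
}
// BlockX(1) == 32, BlockX(70) == 96, BlockX(100000) == 512.
```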
diff --git a/paddle/fluid/operators/detection/density_prior_box_op.h b/paddle/fluid/operators/detection/density_prior_box_op.h
new file mode 100644
index 0000000000000000000000000000000000000000..ed2f5df80cf4d7a5a44af9b09f3b048b1b14cdb9
--- /dev/null
+++ b/paddle/fluid/operators/detection/density_prior_box_op.h
@@ -0,0 +1,143 @@
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include <algorithm>
+#include <vector>
+#include "paddle/fluid/operators/detection/prior_box_op.h"
+
+namespace paddle {
+namespace operators {
+
+template <typename T>
+class DensityPriorBoxOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* input = ctx.Input<framework::Tensor>("Input");
+    auto* image = ctx.Input<framework::Tensor>("Image");
+    auto* boxes = ctx.Output<framework::Tensor>("Boxes");
+    auto* vars = ctx.Output<framework::Tensor>("Variances");
+
+    auto variances = ctx.Attr<std::vector<float>>("variances");
+    auto clip = ctx.Attr<bool>("clip");
+
+    auto fixed_sizes = ctx.Attr<std::vector<float>>("fixed_sizes");
+    auto fixed_ratios = ctx.Attr<std::vector<float>>("fixed_ratios");
+    auto densities = ctx.Attr<std::vector<int>>("densities");
+
+    T step_w = static_cast<T>(ctx.Attr<float>("step_w"));
+    T step_h = static_cast<T>(ctx.Attr<float>("step_h"));
+    T offset = static_cast<T>(ctx.Attr<float>("offset"));
+
+    auto img_width = image->dims()[3];
+    auto img_height = image->dims()[2];
+
+    auto feature_width = input->dims()[3];
+    auto feature_height = input->dims()[2];
+
+    T step_width, step_height;
+    if (step_w == 0 || step_h == 0) {
+      step_width = static_cast<T>(img_width) / feature_width;
+      step_height = static_cast<T>(img_height) / feature_height;
+    } else {
+      step_width = step_w;
+      step_height = step_h;
+    }
+    int num_priors = 0;
+    for (size_t i = 0; i < densities.size(); ++i) {
+      num_priors += (fixed_ratios.size()) * (pow(densities[i], 2));
+    }
+
+    boxes->mutable_data<T>(ctx.GetPlace());
+    vars->mutable_data<T>(ctx.GetPlace());
+
+    auto box_dim = vars->dims();
+    boxes->Resize({feature_height, feature_width, num_priors, 4});
+    auto e_boxes = framework::EigenTensor<T, 4>::From(*boxes).setConstant(0.0);
+    int step_average = static_cast<int>((step_width + step_height) * 0.5);
+
+    for (int h = 0; h < feature_height; ++h) {
+      for (int w = 0; w < feature_width; ++w) {
+        T center_x = (w + offset) * step_width;
+        T center_y = (h + offset) * step_height;
+        int idx = 0;
+        // Generate density prior boxes with fixed sizes.
+        for (size_t s = 0; s < fixed_sizes.size(); ++s) {
+          auto fixed_size = fixed_sizes[s];
+          int density = densities[s];
+          // Generate density prior boxes with fixed ratios.
+          for (size_t r = 0; r < fixed_ratios.size(); ++r) {
+            float ar = fixed_ratios[r];
+            int shift = step_average / density;
+            float box_width_ratio = fixed_size * sqrt(ar);
+            float box_height_ratio = fixed_size / sqrt(ar);
+            for (int di = 0; di < density; ++di) {
+              for (int dj = 0; dj < density; ++dj) {
+                float center_x_temp =
+                    center_x - step_average / 2. + shift / 2. + dj * shift;
+                float center_y_temp =
+                    center_y - step_average / 2. + shift / 2. + di * shift;
+                e_boxes(h, w, idx, 0) =
+                    (center_x_temp - box_width_ratio / 2.) / img_width >= 0
+                        ? (center_x_temp - box_width_ratio / 2.) / img_width
+                        : 0;
+                e_boxes(h, w, idx, 1) =
+                    (center_y_temp - box_height_ratio / 2.) / img_height >= 0
+                        ? (center_y_temp - box_height_ratio / 2.) / img_height
+                        : 0;
+                e_boxes(h, w, idx, 2) =
+                    (center_x_temp + box_width_ratio / 2.) / img_width <= 1
+                        ? (center_x_temp + box_width_ratio / 2.) / img_width
+                        : 1;
+                e_boxes(h, w, idx, 3) =
+                    (center_y_temp + box_height_ratio / 2.) / img_height <= 1
+                        ? (center_y_temp + box_height_ratio / 2.) / img_height
+                        : 1;
+                idx++;
+              }
+            }
+          }
+        }
+      }
+    }
+    if (clip) {
+      platform::Transform<platform::CPUDeviceContext> trans;
+      ClipFunctor<T> clip_func;
+      trans(ctx.template device_context<platform::CPUDeviceContext>(),
+            boxes->data<T>(), boxes->data<T>() + boxes->numel(),
+            boxes->data<T>(), clip_func);
+    }
+    framework::Tensor var_t;
+    var_t.mutable_data<T>(
+        framework::make_ddim({1, static_cast<int>(variances.size())}),
+        ctx.GetPlace());
+
+    auto var_et = framework::EigenTensor<T, 2>::From(var_t);
+
+    for (size_t i = 0; i < variances.size(); ++i) {
+      var_et(0, i) = variances[i];
+    }
+
+    int box_num = feature_height * feature_width * num_priors;
+    auto var_dim = vars->dims();
+    vars->Resize({box_num, static_cast<int>(variances.size())});
+
+    auto e_vars = framework::EigenMatrix<T, Eigen::RowMajor>::From(*vars);
+
+    e_vars = var_et.broadcast(Eigen::DSizes<int, 2>(box_num, 1));
+
+    vars->Resize(var_dim);
+    boxes->Resize(box_dim);
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/fluid/operators/detection/generate_proposal_labels_op.cc b/paddle/fluid/operators/detection/generate_proposal_labels_op.cc
index d7a53f1bef98ecda3ba7b36323678a11a632a15c..fddd6884017c35112ba48f245759f5d846b55f9a 100644
--- a/paddle/fluid/operators/detection/generate_proposal_labels_op.cc
+++ b/paddle/fluid/operators/detection/generate_proposal_labels_op.cc
@@ -16,7 +16,7 @@ limitations under the License.
*/ #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/detection/bbox_util.h" #include "paddle/fluid/operators/gather.h" -#include "paddle/fluid/operators/math/concat.h" +#include "paddle/fluid/operators/math/concat_and_split.h" #include "paddle/fluid/operators/math/math_function.h" namespace paddle { @@ -439,31 +439,88 @@ class GenerateProposalLabelsKernel : public framework::OpKernel { class GenerateProposalLabelsOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { - // TODO(buxingyuan): Add Document - AddInput("RpnRois", "RpnRois."); - AddInput("GtClasses", "GtClasses."); - AddInput("IsCrowd", "IsCrowd."); - AddInput("GtBoxes", "GtBoxes."); - AddInput("ImInfo", "ImInfo."); - - AddOutput("Rois", "Rois."); - AddOutput("LabelsInt32", "LabelsInt32."); - AddOutput("BboxTargets", "BboxTargets."); - AddOutput("BboxInsideWeights", "BboxInsideWeights."); - AddOutput("BboxOutsideWeights", "BboxOutsideWeights."); - - AddAttr("batch_size_per_im", "batch_size_per_im"); - AddAttr("fg_fraction", "fg_fraction"); - AddAttr("fg_thresh", "fg_thresh"); - AddAttr("bg_thresh_hi", "bg_thresh_hi"); - AddAttr("bg_thresh_lo", "bg_thresh_lo"); - AddAttr>("bbox_reg_weights", "bbox_reg_weights"); - AddAttr("class_nums", "class_nums"); - AddAttr("use_random", "use_random").SetDefault(true); + AddInput( + "RpnRois", + "(LoDTensor), This input is a 2D LoDTensor with shape [N, 4]. " + "N is the number of the GenerateProposalOp's output, " + "each element is a bounding box with [xmin, ymin, xmax, ymax] format."); + AddInput("GtClasses", + "(LoDTensor), This input is a 2D LoDTensor with shape [M, 1]. " + "M is the number of groundtruth, " + "each element is a class label of groundtruth."); + AddInput( + "IsCrowd", + "(LoDTensor), This input is a 2D LoDTensor with shape [M, 1]. " + "M is the number of groundtruth, " + "each element is a flag indicates whether a groundtruth is crowd."); + AddInput( + "GtBoxes", + "(LoDTensor), This input is a 2D LoDTensor with shape [M, 4]. " + "M is the number of groundtruth, " + "each element is a bounding box with [xmin, ymin, xmax, ymax] format."); + AddInput("ImInfo", + "(Tensor), This input is a 2D Tensor with shape [B, 3]. " + "B is the number of input images, " + "each element consists of im_height, im_width, im_scale."); + + AddOutput( + "Rois", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 4]. 
" + "P usuall equal to batch_size_per_im * batch_size, " + "each element is a bounding box with [xmin, ymin, xmax, ymax] format."); + AddOutput("LabelsInt32", + "(LoDTensor), This output is a 2D LoDTensor with shape [P], " + "each element repersents a class label of a roi"); + AddOutput("BboxTargets", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 4 * " + "class_nums], " + "each element repersents a box label of a roi"); + AddOutput( + "BboxInsideWeights", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 4 * " + "class_nums], " + "each element indicates whether a box should contribute to loss."); + AddOutput( + "BboxOutsideWeights", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 4 * " + "class_nums], " + "each element indicates whether a box should contribute to loss."); + + AddAttr("batch_size_per_im", "Batch size of rois per images."); + AddAttr("fg_fraction", + "Foreground fraction in total batch_size_per_im."); + AddAttr( + "fg_thresh", + "Overlap threshold which is used to chose foreground sample."); + AddAttr("bg_thresh_hi", + "Overlap threshold upper bound which is used to chose " + "background sample."); + AddAttr("bg_thresh_lo", + "Overlap threshold lower bound which is used to chose " + "background sample."); + AddAttr>("bbox_reg_weights", "Box regression weights."); + AddAttr("class_nums", "Class number."); + AddAttr( + "use_random", + "Use random sampling to choose foreground and background boxes.") + .SetDefault(true); AddComment(R"DOC( -Generate Proposals Labels Operator. -)DOC"); +This operator can be, for given the GenerateProposalOp output bounding boxes and groundtruth, +to sample foreground boxes and background boxes, and compute loss target. + +RpnRois is the output boxes of RPN and was processed by generate_proposal_op, these boxes +were combined with groundtruth boxes and sampled according to batch_size_per_im and fg_fraction, +If an instance with a groundtruth overlap greater than fg_thresh, then it was considered as a foreground sample. +If an instance with a groundtruth overlap greater than bg_thresh_lo and lower than bg_thresh_hi, +then it was considered as a background sample. +After all foreground and background boxes are chosen (so called Rois), +then we apply random sampling to make sure +the number of foreground boxes is no more than batch_size_per_im * fg_fraction. + +For each box in Rois, we assign the classification (class label) and regression targets (box label) to it. +Finally BboxInsideWeights and BboxOutsideWeights are used to specify whether it would contribute to training loss. + )DOC"); } }; diff --git a/paddle/fluid/operators/detection/generate_proposals_op.cc b/paddle/fluid/operators/detection/generate_proposals_op.cc index 818d58ea9ee327fd99182ad2f8cbeed07e6aaea2..709c2dfc4b7c67d7d04074c58ce6da85b6e790fe 100644 --- a/paddle/fluid/operators/detection/generate_proposals_op.cc +++ b/paddle/fluid/operators/detection/generate_proposals_op.cc @@ -12,10 +12,12 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ +#include +#include #include #include #include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/framework/var_type.h" +#include "paddle/fluid/operators/detail/safe_ref.h" #include "paddle/fluid/operators/gather.h" #include "paddle/fluid/operators/math/math_function.h" @@ -25,21 +27,17 @@ namespace operators { using Tensor = framework::Tensor; using LoDTensor = framework::LoDTensor; -struct AppendProposalsFunctor { - LoDTensor *out_; - int64_t offset_; - Tensor *to_add_; +static const double kBBoxClipDefault = std::log(1000.0 / 16.0); - AppendProposalsFunctor(LoDTensor *out, int64_t offset, Tensor *to_add) - : out_(out), offset_(offset), to_add_(to_add) {} - - template - void apply() const { - auto *out_data = out_->data(); - auto *to_add_data = to_add_->data(); - memcpy(out_data + offset_, to_add_data, to_add_->numel() * sizeof(T)); - } -}; +static void AppendProposals(Tensor *dst, int64_t offset, const Tensor &src) { + auto *out_data = dst->data(); + auto *to_add_data = src.data(); + size_t size_of_t = framework::SizeOfType(src.type()); + offset *= size_of_t; + std::memcpy( + reinterpret_cast(reinterpret_cast(out_data) + offset), + to_add_data, src.numel() * size_of_t); +} class GenerateProposalsOp : public framework::OperatorWithKernel { public: @@ -75,8 +73,9 @@ class GenerateProposalsOp : public framework::OperatorWithKernel { }; template -void BoxCoder(const platform::DeviceContext &ctx, Tensor *all_anchors, - Tensor *bbox_deltas, Tensor *variances, Tensor *proposals) { +static inline void BoxCoder(const platform::DeviceContext &ctx, + Tensor *all_anchors, Tensor *bbox_deltas, + Tensor *variances, Tensor *proposals) { T *proposals_data = proposals->mutable_data(ctx.GetPlace()); int64_t row = all_anchors->dims()[0]; @@ -108,11 +107,11 @@ void BoxCoder(const platform::DeviceContext &ctx, Tensor *all_anchors, anchor_center_y; bbox_width = std::exp(std::min(variances_data[i * len + 2] * bbox_deltas_data[i * len + 2], - std::log(1000.0 / 16.0))) * + kBBoxClipDefault)) * anchor_width; bbox_height = std::exp(std::min(variances_data[i * len + 3] * bbox_deltas_data[i * len + 3], - std::log(1000.0 / 16.0))) * + kBBoxClipDefault)) * anchor_height; } else { bbox_center_x = @@ -120,10 +119,10 @@ void BoxCoder(const platform::DeviceContext &ctx, Tensor *all_anchors, bbox_center_y = bbox_deltas_data[i * len + 1] * anchor_height + anchor_center_y; bbox_width = std::exp(std::min(bbox_deltas_data[i * len + 2], - std::log(1000.0 / 16.0))) * + kBBoxClipDefault)) * anchor_width; bbox_height = std::exp(std::min(bbox_deltas_data[i * len + 3], - std::log(1000.0 / 16.0))) * + kBBoxClipDefault)) * anchor_height; } @@ -136,30 +135,32 @@ void BoxCoder(const platform::DeviceContext &ctx, Tensor *all_anchors, } template -void ClipTiledBoxes(const platform::DeviceContext &ctx, const Tensor &im_info, - Tensor *boxes) { +static inline void ClipTiledBoxes(const platform::DeviceContext &ctx, + const Tensor &im_info, Tensor *boxes) { T *boxes_data = boxes->mutable_data(ctx.GetPlace()); const T *im_info_data = im_info.data(); + T zero(0); for (int64_t i = 0; i < boxes->numel(); ++i) { if (i % 4 == 0) { boxes_data[i] = - std::max(std::min(boxes_data[i], im_info_data[1] - 1), 0.0f); + std::max(std::min(boxes_data[i], im_info_data[1] - 1), zero); } else if (i % 4 == 1) { boxes_data[i] = - std::max(std::min(boxes_data[i], im_info_data[0] - 1), 0.0f); + std::max(std::min(boxes_data[i], im_info_data[0] - 1), zero); } else if (i % 4 == 2) { boxes_data[i] = - std::max(std::min(boxes_data[i], im_info_data[1] - 
1), 0.0f); + std::max(std::min(boxes_data[i], im_info_data[1] - 1), zero); } else { boxes_data[i] = - std::max(std::min(boxes_data[i], im_info_data[0] - 1), 0.0f); + std::max(std::min(boxes_data[i], im_info_data[0] - 1), zero); } } } template -void FilterBoxes(const platform::DeviceContext &ctx, Tensor *boxes, - float min_size, const Tensor &im_info, Tensor *keep) { +static inline void FilterBoxes(const platform::DeviceContext &ctx, + Tensor *boxes, float min_size, + const Tensor &im_info, Tensor *keep) { const T *im_info_data = im_info.data(); T *boxes_data = boxes->mutable_data(ctx.GetPlace()); T im_scale = im_info_data[2]; @@ -185,24 +186,24 @@ void FilterBoxes(const platform::DeviceContext &ctx, Tensor *boxes, keep->Resize({keep_len}); } -bool SortScorePairDescend(const std::pair &pair1, - const std::pair &pair2) { - return pair1.first > pair2.first; -} - template -void GetMaxScoreIndex(const std::vector &scores, - std::vector> *sorted_indices) { +static inline std::vector> GetSortedScoreIndex( + const std::vector &scores) { + std::vector> sorted_indices; + sorted_indices.reserve(scores.size()); for (size_t i = 0; i < scores.size(); ++i) { - sorted_indices->push_back(std::make_pair(scores[i], i)); + sorted_indices.emplace_back(scores[i], i); } // Sort the score pair according to the scores in descending order - std::stable_sort(sorted_indices->begin(), sorted_indices->end(), - SortScorePairDescend); + std::stable_sort(sorted_indices.begin(), sorted_indices.end(), + [](const std::pair &a, const std::pair &b) { + return a.first < b.first; + }); + return sorted_indices; } template -T BBoxArea(const T *box, const bool normalized) { +static inline T BBoxArea(const T *box, bool normalized) { if (box[2] < box[0] || box[3] < box[1]) { // If coordinate values are is invalid // (e.g. xmax < xmin or ymax < ymin), return 0. 
@@ -220,7 +221,7 @@ T BBoxArea(const T *box, const bool normalized) { } template -T JaccardOverlap(const T *box1, const T *box2, const bool normalized) { +static inline T JaccardOverlap(const T *box1, const T *box2, bool normalized) { if (box2[0] > box1[2] || box2[2] < box1[0] || box2[1] > box1[3] || box2[3] < box1[1]) { return static_cast(0.); @@ -229,8 +230,8 @@ T JaccardOverlap(const T *box1, const T *box2, const bool normalized) { const T inter_ymin = std::max(box1[1], box2[1]); const T inter_xmax = std::min(box1[2], box2[2]); const T inter_ymax = std::min(box1[3], box2[3]); - const T inter_w = std::max(0.0f, inter_xmax - inter_xmin + 1); - const T inter_h = std::max(0.0f, inter_ymax - inter_ymin + 1); + const T inter_w = std::max(T(0), inter_xmax - inter_xmin + 1); + const T inter_h = std::max(T(0), inter_ymax - inter_ymin + 1); const T inter_area = inter_w * inter_h; const T bbox1_area = BBoxArea(box1, normalized); const T bbox2_area = BBoxArea(box2, normalized); @@ -238,9 +239,21 @@ T JaccardOverlap(const T *box1, const T *box2, const bool normalized) { } } +template +static inline Tensor VectorToTensor(const std::vector &selected_indices, + int selected_num) { + Tensor keep_nms; + keep_nms.Resize({selected_num}); + auto *keep_data = keep_nms.mutable_data(platform::CPUPlace()); + for (int i = 0; i < selected_num; ++i) { + keep_data[i] = selected_indices[i]; + } + return keep_nms; +} + template -Tensor NMS(const platform::DeviceContext &ctx, Tensor *bbox, Tensor *scores, - const T nms_threshold, const float eta) { +static inline Tensor NMS(const platform::DeviceContext &ctx, Tensor *bbox, + Tensor *scores, T nms_threshold, float eta) { PADDLE_ENFORCE_NOT_NULL(bbox); int64_t num_boxes = bbox->dims()[0]; // 4: [xmin ymin xmax ymax] @@ -248,20 +261,18 @@ Tensor NMS(const platform::DeviceContext &ctx, Tensor *bbox, Tensor *scores, std::vector scores_data(num_boxes); std::copy_n(scores->data(), num_boxes, scores_data.begin()); - std::vector> sorted_indices; - GetMaxScoreIndex(scores_data, &sorted_indices); + std::vector> sorted_indices = + GetSortedScoreIndex(scores_data); std::vector selected_indices; int selected_num = 0; T adaptive_threshold = nms_threshold; const T *bbox_data = bbox->data(); - bool flag; while (sorted_indices.size() != 0) { - int idx = sorted_indices.front().second; - flag = true; - for (size_t k = 0; k < selected_indices.size(); ++k) { + int idx = sorted_indices.back().second; + bool flag = true; + for (int kept_idx : selected_indices) { if (flag) { - const int kept_idx = selected_indices[k]; T overlap = JaccardOverlap(bbox_data + idx * box_size, bbox_data + kept_idx * box_size, false); flag = (overlap <= adaptive_threshold); @@ -271,32 +282,29 @@ Tensor NMS(const platform::DeviceContext &ctx, Tensor *bbox, Tensor *scores, } if (flag) { selected_indices.push_back(idx); - selected_num++; + ++selected_num; } - sorted_indices.erase(sorted_indices.begin()); + sorted_indices.erase(sorted_indices.end() - 1); if (flag && eta < 1 && adaptive_threshold > 0.5) { adaptive_threshold *= eta; } } - Tensor keep_nms; - keep_nms.Resize({selected_num}); - int *keep_data = keep_nms.mutable_data(ctx.GetPlace()); - for (int i = 0; i < selected_num; ++i) { - keep_data[i] = selected_indices[i]; - } - - return keep_nms; + return VectorToTensor(selected_indices, selected_num); } -template +template class GenerateProposalsKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext &context) const override { auto *scores = context.Input("Scores"); auto 
*bbox_deltas = context.Input("BboxDeltas"); auto *im_info = context.Input("ImInfo"); - auto *anchors = context.Input("Anchors"); - auto *variances = context.Input("Variances"); + auto anchors = detail::Ref(context.Input("Anchors"), + "Cannot find input Anchors(%s) in scope", + context.Inputs("Anchors")[0]); + auto variances = detail::Ref(context.Input("Variances"), + "Cannot find input Variances(%s) in scope", + context.Inputs("Variances")[0]); auto *rpn_rois = context.Output("RpnRois"); auto *rpn_roi_probs = context.Output("RpnRoiProbs"); @@ -307,15 +315,16 @@ class GenerateProposalsKernel : public framework::OpKernel { float min_size = context.Attr("min_size"); float eta = context.Attr("eta"); - auto &dev_ctx = context.template device_context(); + auto &dev_ctx = + context.template device_context(); - auto scores_dim = scores->dims(); + auto &scores_dim = scores->dims(); int64_t num = scores_dim[0]; int64_t c_score = scores_dim[1]; int64_t h_score = scores_dim[2]; int64_t w_score = scores_dim[3]; - auto bbox_dim = bbox_deltas->dims(); + auto &bbox_dim = bbox_deltas->dims(); int64_t c_bbox = bbox_dim[1]; int64_t h_bbox = bbox_dim[2]; int64_t w_bbox = bbox_dim[3]; @@ -330,17 +339,17 @@ class GenerateProposalsKernel : public framework::OpKernel { scores_swap.mutable_data({num, h_score, w_score, c_score}, dev_ctx.GetPlace()); - math::Transpose trans; + math::Transpose trans; std::vector axis = {0, 2, 3, 1}; trans(dev_ctx, *bbox_deltas, &bbox_deltas_swap, axis); trans(dev_ctx, *scores, &scores_swap, axis); framework::LoD lod; - std::vector lod0(1, 0); - Tensor *anchor = const_cast(anchors); - anchor->Resize({anchors->numel() / 4, 4}); - Tensor *var = const_cast(variances); - var->Resize({var->numel() / 4, 4}); + lod.resize(1); + auto &lod0 = lod[0]; + lod0.push_back(0); + anchors.Resize({anchors.numel() / 4, 4}); + variances.Resize({variances.numel() / 4, 4}); int64_t num_proposals = 0; for (int64_t i = 0; i < num; ++i) { @@ -352,24 +361,17 @@ class GenerateProposalsKernel : public framework::OpKernel { scores_slice.Resize({h_score * w_score * c_score, 1}); std::pair tensor_pair = - ProposalForOneImage(dev_ctx, im_info_slice, *anchor, *var, + ProposalForOneImage(dev_ctx, im_info_slice, anchors, variances, bbox_deltas_slice, scores_slice, pre_nms_top_n, post_nms_top_n, nms_thresh, min_size, eta); - Tensor proposals = tensor_pair.first; - Tensor scores = tensor_pair.second; - - framework::VisitDataType( - framework::ToDataType(rpn_rois->type()), - AppendProposalsFunctor(rpn_rois, 4 * num_proposals, &proposals)); - framework::VisitDataType( - framework::ToDataType(rpn_roi_probs->type()), - AppendProposalsFunctor(rpn_roi_probs, num_proposals, &scores)); + Tensor &proposals = tensor_pair.first; + Tensor &scores = tensor_pair.second; + AppendProposals(rpn_rois, 4 * num_proposals, proposals); + AppendProposals(rpn_roi_probs, num_proposals, scores); num_proposals += proposals.dims()[0]; - lod0.emplace_back(num_proposals); + lod0.push_back(num_proposals); } - - lod.emplace_back(lod0); rpn_rois->set_lod(lod); rpn_roi_probs->set_lod(lod); rpn_rois->Resize({num_proposals, 4}); @@ -377,7 +379,7 @@ class GenerateProposalsKernel : public framework::OpKernel { } std::pair ProposalForOneImage( - const DeviceContext &ctx, const Tensor &im_info_slice, + const platform::CPUDeviceContext &ctx, const Tensor &im_info_slice, const Tensor &anchors, const Tensor &variances, const Tensor &bbox_deltas_slice, // [M, 4] const Tensor &scores_slice, // [N, 1] @@ -392,10 +394,9 @@ class GenerateProposalsKernel : public 
framework::OpKernel<T> {
     for (int i = 0; i < scores_slice.numel(); ++i) {
       index[i] = i;
     }
-    std::function<bool(const int64_t &, const int64_t &)> compare =
-        [scores_data](const int64_t &i, const int64_t &j) {
-          return scores_data[i] > scores_data[j];
-        };
+    auto compare = [scores_data](const int64_t &i, const int64_t &j) {
+      return scores_data[i] > scores_data[j];
+    };
 
     if (pre_nms_top_n <= 0 || pre_nms_top_n >= scores_slice.numel()) {
       std::sort(index, index + scores_slice.numel(), compare);
@@ -452,33 +453,45 @@ class GenerateProposalsOpMaker : public framework::OpProtoAndCheckerMaker {
  public:
   void Make() override {
-    AddInput("Scores", "The scores of anchors should be foreground.");
-    AddInput("BboxDeltas", "bbox_deltas.");
-    AddInput("ImInfo", "Information for image reshape.");
-    AddInput("Anchors", "All anchors.");
-    AddInput("Variances", " variances");
-
-    AddOutput("RpnRois", "Anchors.");
-    AddOutput("RpnRoiProbs", "Anchors.");
-    AddAttr<int>("pre_nms_topN", "pre_nms_topN");
-    AddAttr<int>("post_nms_topN", "post_nms_topN");
-    AddAttr<float>("nms_thresh", "nms_thres");
-    AddAttr<float>("min_size", "min size");
+    AddInput("Scores",
+             "(Tensor) The scores from conv are in shape (N, A, H, W), "
+             "N is batch size, A is number of anchors, "
+             "H and W are height and width of the feature map");
+    AddInput("BboxDeltas",
+             "(Tensor) Bounding box deltas from conv are in "
+             "shape (N, 4*A, H, W).");
+    AddInput("ImInfo",
+             "(Tensor) Information for image reshape is in shape (N, 3), "
+             "in format (height, width, scale)");
+    AddInput("Anchors",
+             "(Tensor) Bounding box anchors from anchor_generator_op "
+             "are in shape (A, H, W, 4).");
+    AddInput("Variances",
+             "(Tensor) Bounding box variances with same shape as `Anchors`.");
+
+    AddOutput("RpnRois",
+              "(LoDTensor), Output proposals with shape (rois_num, 4).");
+    AddOutput("RpnRoiProbs",
+              "(LoDTensor) Scores of proposals with shape (rois_num, 1).");
+    AddAttr<int>("pre_nms_topN",
+                 "Number of top scoring RPN proposals to keep before "
+                 "applying NMS.");
+    AddAttr<int>("post_nms_topN",
+                 "Number of top scoring RPN proposals to keep after "
+                 "applying NMS");
+    AddAttr<float>("nms_thresh", "NMS threshold used on RPN proposals.");
+    AddAttr<float>("min_size",
+                   "Proposal height and width both need to be greater "
+                   "than this min_size.");
     AddAttr<float>("eta", "The parameter for adaptive NMS.");
     AddComment(R"DOC(
-Generate Proposals OP
-
-This operator proposes rois according to each box with their probability to be a foreground object and
-the box can be calculated by anchors. Bbox_deltais and scores are the output of RPN. Final proposals
-could be used to train detection net.
-
-Scores is the probability for each box to be an object. In format of (N, A, H, W) where N is batch size, A is number
-of anchors, H and W are height and width of the feature map.
-BboxDeltas is the differece between predicted box locatoin and anchor location. In format of (N, 4*A, H, W)
+This operator generates bounding box proposals for Faster RCNN.
+The proposals are generated for a list of images based on the image
+scores 'Scores', the bounding box regression results 'BboxDeltas',
+as well as predefined bounding box shapes 'anchors'. Greedy
+non-maximum suppression is applied to generate the final bounding
+boxes.
-For generating proposals, this operator transposes and resizes scores and bbox_deltas in size of (H*W*A, 1) and (H*W*A, 4) and
- calculate box locations as proposals candidates. Then clip boxes to image and remove predicted boxes with small area.
-Finally, apply nms to get final proposals as output.
 )DOC");
   }
 };
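The new DOC above describes greedy NMS only in words. A compact sketch of the loop (illustrative, not the operator's implementation; it mirrors the IoU convention with the inclusive +1 pixel offset used elsewhere in this patch, and assumes boxes arrive sorted by descending score):

```cpp
#include <algorithm>
#include <array>
#include <vector>

using Box = std::array<float, 4>;  // {xmin, ymin, xmax, ymax}

// Intersection-over-union with the inclusive (+1) pixel convention.
inline float Iou(const Box& a, const Box& b) {
  float left = std::max(a[0], b[0]), right = std::min(a[2], b[2]);
  float top = std::max(a[1], b[1]), bottom = std::min(a[3], b[3]);
  float w = std::max(right - left + 1, 0.f);
  float h = std::max(bottom - top + 1, 0.f);
  float inter = w * h;
  float sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);
  float sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);
  return inter / (sa + sb - inter);
}

// Keep a box unless it overlaps an already-kept box above the threshold.
std::vector<int> GreedyNms(const std::vector<Box>& boxes, float nms_thresh) {
  std::vector<int> keep;
  for (int i = 0; i < static_cast<int>(boxes.size()); ++i) {
    bool suppressed = false;
    for (int k : keep) {
      if (Iou(boxes[i], boxes[k]) > nms_thresh) {
        suppressed = true;
        break;
      }
    }
    if (!suppressed) keep.push_back(i);
  }
  return keep;
}
```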
)DOC"); } }; @@ -490,6 +503,5 @@ namespace ops = paddle::operators; REGISTER_OPERATOR(generate_proposals, ops::GenerateProposalsOp, ops::GenerateProposalsOpMaker, paddle::framework::EmptyGradOpMaker); -REGISTER_OP_CPU_KERNEL( - generate_proposals, - ops::GenerateProposalsKernel); +REGISTER_OP_CPU_KERNEL(generate_proposals, ops::GenerateProposalsKernel, + ops::GenerateProposalsKernel); diff --git a/paddle/fluid/operators/detection/generate_proposals_op.cu b/paddle/fluid/operators/detection/generate_proposals_op.cu index 6146ff509d768c0317a5c65ed22af1a3075977a2..a0b99377109aef4776fadd68101d011a9191b1cc 100644 --- a/paddle/fluid/operators/detection/generate_proposals_op.cu +++ b/paddle/fluid/operators/detection/generate_proposals_op.cu @@ -12,14 +12,18 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ +#include #include #include #include #include "cub/cub.cuh" +#include "paddle/fluid/framework/mixed_vector.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/memory/memory.h" +#include "paddle/fluid/operators/detail/safe_ref.h" #include "paddle/fluid/operators/gather.cu.h" #include "paddle/fluid/operators/math/math_function.h" +#include "paddle/fluid/platform/for_range.h" namespace paddle { namespace operators { @@ -36,62 +40,67 @@ namespace { int const kThreadsPerBlock = sizeof(uint64_t) * 8; -template -__global__ void RangeInitKernel(const T start, const T delta, const int size, - T *out) { - CUDA_1D_KERNEL_LOOP(i, size) { out[i] = start + i * delta; } -} +static const double kBBoxClipDefault = std::log(1000.0 / 16.0); + +struct RangeInitFunctor { + int start_; + int delta_; + int *out_; + __device__ void operator()(size_t i) { out_[i] = start_ + i * delta_; } +}; template -void SortDescending(const platform::CUDADeviceContext &ctx, const Tensor &value, - Tensor *value_out, Tensor *index_out) { - int num = value.numel(); +static void SortDescending(const platform::CUDADeviceContext &ctx, + const Tensor &value, Tensor *value_out, + Tensor *index_out) { + int num = static_cast(value.numel()); Tensor index_in_t; int *idx_in = index_in_t.mutable_data({num}, ctx.GetPlace()); - int block = 512; - auto stream = ctx.stream(); - RangeInitKernel<<>>(0, 1, num, idx_in); + platform::ForRange for_range(ctx, num); + for_range(RangeInitFunctor{0, 1, idx_in}); + int *idx_out = index_out->mutable_data({num}, ctx.GetPlace()); const T *keys_in = value.data(); T *keys_out = value_out->mutable_data({num}, ctx.GetPlace()); // Determine temporary device storage requirements - void *d_temp_storage = NULL; size_t temp_storage_bytes = 0; cub::DeviceRadixSort::SortPairsDescending( - d_temp_storage, temp_storage_bytes, keys_in, keys_out, idx_in, idx_out, - num); - + nullptr, temp_storage_bytes, keys_in, keys_out, idx_in, idx_out, num); // Allocate temporary storage auto place = boost::get(ctx.GetPlace()); - d_temp_storage = memory::Alloc(place, temp_storage_bytes); + auto d_temp_storage = + memory::Alloc(place, temp_storage_bytes, memory::Allocator::kScratchpad); // Run sorting operation cub::DeviceRadixSort::SortPairsDescending( - d_temp_storage, temp_storage_bytes, keys_in, keys_out, idx_in, idx_out, - num); - - memory::Free(place, d_temp_storage); -} - -template -__device__ __forceinline__ T Min(T x, T y) { - return x < y ? 
x : y; + d_temp_storage->ptr(), temp_storage_bytes, keys_in, keys_out, idx_in, + idx_out, num); } template -__device__ __forceinline__ T Max(T x, T y) { - return x > y ? x : y; -} - -template -__global__ void BoxDecodeAndClipKernel(const T *anchor, const T *deltas, - const T *var, const int *index, - const T *im_info, const int num, - T *proposals) { - T kBBoxClipDefault = log(1000.0 / 16.0); - CUDA_1D_KERNEL_LOOP(i, num) { +struct BoxDecodeAndClipFunctor { + const T *anchor; + const T *deltas; + const T *var; + const int *index; + const T *im_info; + + T *proposals; + + BoxDecodeAndClipFunctor(const T *anchor, const T *deltas, const T *var, + const int *index, const T *im_info, T *proposals) + : anchor(anchor), + deltas(deltas), + var(var), + index(index), + im_info(im_info), + proposals(proposals) {} + + T bbox_clip_default{static_cast(kBBoxClipDefault)}; + + __device__ void operator()(size_t i) { int k = index[i] * 4; T axmin = anchor[k]; T aymin = anchor[k + 1]; @@ -108,17 +117,17 @@ __global__ void BoxDecodeAndClipKernel(const T *anchor, const T *deltas, T dxmax = deltas[k + 2]; T dymax = deltas[k + 3]; - T d_cx = 0., d_cy = 0., d_w = 0., d_h = 0.; + T d_cx, d_cy, d_w, d_h; if (var) { d_cx = cx + dxmin * w * var[k]; d_cy = cy + dymin * h * var[k + 1]; - d_w = exp(Min(dxmax * var[k + 2], kBBoxClipDefault)) * w; - d_h = exp(Min(dymax * var[k + 3], kBBoxClipDefault)) * h; + d_w = exp(Min(dxmax * var[k + 2], bbox_clip_default)) * w; + d_h = exp(Min(dymax * var[k + 3], bbox_clip_default)) * h; } else { d_cx = cx + dxmin * w; d_cy = cy + dymin * h; - d_w = exp(Min(dxmax, kBBoxClipDefault)) * w; - d_h = exp(Min(dymax, kBBoxClipDefault)) * h; + d_w = exp(Min(dxmax, bbox_clip_default)) * w; + d_h = exp(Min(dymax, bbox_clip_default)) * h; } T oxmin = d_cx - d_w * 0.5; @@ -126,17 +135,21 @@ __global__ void BoxDecodeAndClipKernel(const T *anchor, const T *deltas, T oxmax = d_cx + d_w * 0.5 - 1.; T oymax = d_cy + d_h * 0.5 - 1.; - proposals[i * 4] = Max(Min(oxmin, im_info[1] - 1.), 0.); - proposals[i * 4 + 1] = Max(Min(oymin, im_info[0] - 1.), 0.); - proposals[i * 4 + 2] = Max(Min(oxmax, im_info[1] - 1.), 0.); - proposals[i * 4 + 3] = Max(Min(oymax, im_info[0] - 1.), 0.); + proposals[i * 4] = Max(Min(oxmin, im_info[1] - 1.), 0.); + proposals[i * 4 + 1] = Max(Min(oymin, im_info[0] - 1.), 0.); + proposals[i * 4 + 2] = Max(Min(oxmax, im_info[1] - 1.), 0.); + proposals[i * 4 + 3] = Max(Min(oymax, im_info[0] - 1.), 0.); } -} + + __device__ __forceinline__ T Min(T a, T b) const { return a > b ? b : a; } + + __device__ __forceinline__ T Max(T a, T b) const { return a > b ? 
a : b; } +}; template -__global__ void FilterBBoxes(const T *bboxes, const T *im_info, - const T min_size, const int num, int *keep_num, - int *keep) { +static __global__ void FilterBBoxes(const T *bboxes, const T *im_info, + const T min_size, const int num, + int *keep_num, int *keep) { T im_h = im_info[0]; T im_w = im_info[1]; T im_scale = im_info[2]; @@ -181,7 +194,7 @@ __global__ void FilterBBoxes(const T *bboxes, const T *im_info, } } -__device__ inline float IoU(const float *a, const float *b) { +static __device__ inline float IoU(const float *a, const float *b) { float left = max(a[0], b[0]), right = min(a[2], b[2]); float top = max(a[1], b[1]), bottom = min(a[3], b[3]); float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f); @@ -191,8 +204,9 @@ __device__ inline float IoU(const float *a, const float *b) { return inter_s / (s_a + s_b - inter_s); } -__global__ void NMSKernel(const int n_boxes, const float nms_overlap_thresh, - const float *dev_boxes, uint64_t *dev_mask) { +static __global__ void NMSKernel(const int n_boxes, + const float nms_overlap_thresh, + const float *dev_boxes, uint64_t *dev_mask) { const int row_start = blockIdx.y; const int col_start = blockIdx.x; @@ -234,9 +248,9 @@ __global__ void NMSKernel(const int n_boxes, const float nms_overlap_thresh, } template -void NMS(const platform::CUDADeviceContext &ctx, const Tensor &proposals, - const Tensor &sorted_indices, const T nms_threshold, - Tensor *keep_out) { +static void NMS(const platform::CUDADeviceContext &ctx, const Tensor &proposals, + const Tensor &sorted_indices, const T nms_threshold, + Tensor *keep_out) { int boxes_num = proposals.dims()[0]; PADDLE_ENFORCE_EQ(boxes_num, sorted_indices.dims()[0]); @@ -247,13 +261,10 @@ void NMS(const platform::CUDADeviceContext &ctx, const Tensor &proposals, const T *boxes = proposals.data(); auto place = boost::get(ctx.GetPlace()); - int size_bytes = boxes_num * col_blocks * sizeof(uint64_t); - uint64_t *d_mask = - reinterpret_cast(memory::Alloc(place, size_bytes)); - NMSKernel<<>>(boxes_num, nms_threshold, boxes, d_mask); - uint64_t *h_mask = reinterpret_cast( - memory::Alloc(platform::CPUPlace(), size_bytes)); - memory::Copy(platform::CPUPlace(), h_mask, place, d_mask, size_bytes, 0); + framework::Vector mask(boxes_num * col_blocks); + NMSKernel<<>>( + boxes_num, nms_threshold, boxes, + mask.CUDAMutableData(boost::get(ctx.GetPlace()))); std::vector remv(col_blocks); memset(&remv[0], 0, sizeof(uint64_t) * col_blocks); @@ -267,7 +278,7 @@ void NMS(const platform::CUDADeviceContext &ctx, const Tensor &proposals, if (!(remv[nblock] & (1ULL << inblock))) { ++num_to_keep; keep_vec.push_back(i); - uint64_t *p = &h_mask[0] + i * col_blocks; + uint64_t *p = &mask[0] + i * col_blocks; for (int j = nblock; j < col_blocks; j++) { remv[j] |= p[j]; } @@ -276,12 +287,10 @@ void NMS(const platform::CUDADeviceContext &ctx, const Tensor &proposals, int *keep = keep_out->mutable_data({num_to_keep}, ctx.GetPlace()); memory::Copy(place, keep, platform::CPUPlace(), keep_vec.data(), sizeof(int) * num_to_keep, 0); - memory::Free(place, d_mask); - memory::Free(platform::CPUPlace(), h_mask); } template -std::pair ProposalForOneImage( +static std::pair ProposalForOneImage( const platform::CUDADeviceContext &ctx, const Tensor &im_info, const Tensor &anchors, const Tensor &variances, const Tensor &bbox_deltas, // [M, 4] @@ -300,18 +309,20 @@ std::pair ProposalForOneImage( // 2. 
box decode and clipping Tensor proposals; proposals.mutable_data({pre_nms_num, 4}, ctx.GetPlace()); - int block = 512; - auto stream = ctx.stream(); - BoxDecodeAndClipKernel<<>>( - anchors.data(), bbox_deltas.data(), variances.data(), - index_sort.data(), im_info.data(), pre_nms_num, - proposals.data()); + + { + platform::ForRange for_range(ctx, pre_nms_num); + for_range(BoxDecodeAndClipFunctor{ + anchors.data(), bbox_deltas.data(), variances.data(), + index_sort.data(), im_info.data(), proposals.data()}); + } // 3. filter Tensor keep_index, keep_num_t; keep_index.mutable_data({pre_nms_num}, ctx.GetPlace()); keep_num_t.mutable_data({1}, ctx.GetPlace()); min_size = std::max(min_size, 1.0f); + auto stream = ctx.stream(); FilterBBoxes<<<1, 512, 0, stream>>>( proposals.data(), im_info.data(), min_size, pre_nms_num, keep_num_t.data(), keep_index.data()); @@ -355,8 +366,12 @@ class CUDAGenerateProposalsKernel : public framework::OpKernel { auto *scores = context.Input("Scores"); auto *bbox_deltas = context.Input("BboxDeltas"); auto *im_info = context.Input("ImInfo"); - auto *anchors = context.Input("Anchors"); - auto *variances = context.Input("Variances"); + auto anchors = detail::Ref(context.Input("Anchors"), + "Cannot find input Anchors(%s) in scope", + context.Inputs("Anchors")[0]); + auto variances = detail::Ref(context.Input("Variances"), + "Cannot find input Variances(%s) in scope", + context.Inputs("Variances")[0]); auto *rpn_rois = context.Output("RpnRois"); auto *rpn_roi_probs = context.Output("RpnRoiProbs"); @@ -392,10 +407,8 @@ class CUDAGenerateProposalsKernel : public framework::OpKernel { trans(dev_ctx, *bbox_deltas, &bbox_deltas_swap, axis); trans(dev_ctx, *scores, &scores_swap, axis); - Tensor *anchor = const_cast(anchors); - anchor->Resize({anchors->numel() / 4, 4}); - Tensor *var = const_cast(variances); - var->Resize({var->numel() / 4, 4}); + anchors.Resize({anchors.numel() / 4, 4}); + variances.Resize({variances.numel() / 4, 4}); rpn_rois->mutable_data({bbox_deltas->numel() / 4, 4}, context.GetPlace()); @@ -417,12 +430,12 @@ class CUDAGenerateProposalsKernel : public framework::OpKernel { scores_slice.Resize({h_score * w_score * c_score, 1}); std::pair box_score_pair = - ProposalForOneImage(dev_ctx, im_info_slice, *anchor, *var, + ProposalForOneImage(dev_ctx, im_info_slice, anchors, variances, bbox_deltas_slice, scores_slice, pre_nms_top_n, post_nms_top_n, nms_thresh, min_size, eta); - Tensor proposals = box_score_pair.first; - Tensor scores = box_score_pair.second; + Tensor &proposals = box_score_pair.first; + Tensor &scores = box_score_pair.second; memory::Copy(place, rpn_rois_data + num_proposals * 4, place, proposals.data(), sizeof(T) * proposals.numel(), 0); diff --git a/paddle/fluid/operators/detection/gpc.cc b/paddle/fluid/operators/detection/gpc.cc new file mode 100644 index 0000000000000000000000000000000000000000..7c0823c0487d39eece5be08322e7d182b931ba3c --- /dev/null +++ b/paddle/fluid/operators/detection/gpc.cc @@ -0,0 +1,2201 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+// See the License for the specific language governing permissions and +// limitations under the License. + +/** + * @file src/gpc.cpp + * @author huhan02(com@baidu.com) + * @date 2015/12/18 14:17:30 + * @brief + * + * @modified by sunyipeng + * @email sunyipeng@baidu.com + * @date 2018/6/12 + **/ + +#include "paddle/fluid/operators/detection/gpc.h" + +namespace gpc { + +typedef struct lmt_shape { /* Local minima table */ + double y; /* Y coordinate at local minimum */ + edge_node *first_bound; /* Pointer to bound list */ + struct lmt_shape *next; /* Pointer to next local minimum */ +} lmt_node; + +typedef struct sbt_t_shape { /* Scanbeam tree */ + double y; /* Scanbeam node y value */ + struct sbt_t_shape *less; /* Pointer to nodes with lower y */ + struct sbt_t_shape *more; /* Pointer to nodes with higher y */ +} sb_tree; + +typedef struct it_shape { /* Intersection table */ + edge_node *ie[2]; /* Intersecting edge (bundle) pair */ + gpc_vertex point; /* Point of intersection */ + struct it_shape *next; /* The next intersection table node */ +} it_node; + +typedef struct st_shape { /* Sorted edge table */ + edge_node *edge; /* Pointer to AET edge */ + double xb; /* Scanbeam bottom x coordinate */ + double xt; /* Scanbeam top x coordinate */ + double dx; /* Change in x for a unit y increase */ + struct st_shape *prev; /* Previous edge in sorted list */ +} st_node; + +typedef struct bbox_shape { /* Contour axis-aligned bounding box */ + double xmin; /* Minimum x coordinate */ + double ymin; /* Minimum y coordinate */ + double xmax; /* Maximum x coordinate */ + double ymax; /* Maximum y coordinate */ +} bbox; + +/* +=========================================================================== + Global Data +=========================================================================== +*/ + +/* Horizontal edge state transitions within scanbeam boundary */ +const h_state next_h_state[3][6] = { + /* ABOVE BELOW CROSS */ + /* L R L R L R */ + /* NH */ + {BH, TH, TH, BH, NH, NH}, + /* BH */ + {NH, NH, NH, NH, TH, TH}, + /* TH */ + {NH, NH, NH, NH, BH, BH}}; + +/* +=========================================================================== + Private Functions +=========================================================================== +*/ + +static void reset_it(it_node **it) { + it_node *itn; + + while (*it) { + itn = (*it)->next; + gpc_free(*it); + *it = itn; + } +} + +static void reset_lmt(lmt_node **lmt) { + lmt_node *lmtn; + + while (*lmt) { + lmtn = (*lmt)->next; + gpc_free(*lmt); + *lmt = lmtn; + } +} + +static void insert_bound(edge_node **b, edge_node *e) { + edge_node *existing_bound = NULL; + + if (!*b) { + /* Link node e to the tail of the list */ + *b = e; + } else { + /* Do primary sort on the x field */ + if (e[0].bot.x < (*b)[0].bot.x) { + /* Insert a new node mid-list */ + existing_bound = *b; + *b = e; + (*b)->next_bound = existing_bound; + } else { + if (e[0].bot.x == (*b)[0].bot.x) { + /* Do secondary sort on the dx field */ + if (e[0].dx < (*b)[0].dx) { + /* Insert a new node mid-list */ + existing_bound = *b; + *b = e; + (*b)->next_bound = existing_bound; + } else { + /* Head further down the list */ + insert_bound(&((*b)->next_bound), e); + } + } else { + /* Head further down the list */ + insert_bound(&((*b)->next_bound), e); + } + } + } +} + +static edge_node **bound_list(lmt_node **lmt, double y) { + lmt_node *existing_node; + + if (!*lmt) { + /* Add node onto the tail end of the LMT */ + gpc_malloc(*lmt, sizeof(lmt_node), + const_cast("LMT insertion")); + (*lmt)->y = y; + 
(*lmt)->first_bound = NULL; + (*lmt)->next = NULL; + return &((*lmt)->first_bound); + } else if (y < (*lmt)->y) { + /* Insert a new LMT node before the current node */ + existing_node = *lmt; + gpc_malloc(*lmt, sizeof(lmt_node), + const_cast("LMT insertion")); + (*lmt)->y = y; + (*lmt)->first_bound = NULL; + (*lmt)->next = existing_node; + return &((*lmt)->first_bound); + } else { + if (y > (*lmt)->y) { + /* Head further up the LMT */ + return bound_list(&((*lmt)->next), y); + } else { + /* Use this existing LMT node */ + return &((*lmt)->first_bound); + } + } +} + +static void add_to_sbtree(int *entries, sb_tree **sbtree, double y) { + if (!*sbtree) { + /* Add a new tree node here */ + gpc_malloc(*sbtree, sizeof(sb_tree), + const_cast("scanbeam tree insertion")); + (*sbtree)->y = y; + (*sbtree)->less = NULL; + (*sbtree)->more = NULL; + (*entries)++; + } else { + if ((*sbtree)->y > y) { + /* Head into the 'less' sub-tree */ + add_to_sbtree(entries, &((*sbtree)->less), y); + } else { + if ((*sbtree)->y < y) { + /* Head into the 'more' sub-tree */ + add_to_sbtree(entries, &((*sbtree)->more), y); + } + } + } +} + +static void build_sbt(int *entries, double *sbt, sb_tree *sbtree) { + if (sbtree->less) { + build_sbt(entries, sbt, sbtree->less); + } + sbt[*entries] = sbtree->y; + (*entries)++; + if (sbtree->more) { + build_sbt(entries, sbt, sbtree->more); + } +} + +static void free_sbtree(sb_tree **sbtree) { + if (*sbtree) { + free_sbtree(&((*sbtree)->less)); + free_sbtree(&((*sbtree)->more)); + gpc_free(*sbtree); + } +} + +static int count_optimal_vertices(gpc_vertex_list c) { + int result = 0; + int i = 0; + + /* Ignore non-contributing contours */ + if (c.num_vertices > 0) { + for (i = 0; i < c.num_vertices; i++) { + /* Ignore superfluous vertices embedded in horizontal edges */ + if (gpc_optimal(c.vertex, i, c.num_vertices)) { + result++; + } + } + } + return result; +} + +static edge_node *build_lmt(lmt_node **lmt, sb_tree **sbtree, int *sbt_entries, + gpc_polygon *p, int type, gpc_op op) { + int c = 0; + int i = 0; + int min = 0; + int max = 0; + int num_edges = 0; + int v = 0; + int num_vertices = 0; + int total_vertices = 0; + int e_index = 0; + edge_node *e = NULL; + edge_node *edge_table = NULL; + + for (c = 0; c < p->num_contours; c++) { + total_vertices += count_optimal_vertices(p->contour[c]); + } + + /* Create the entire input polygon edge table in one go */ + gpc_malloc(edge_table, total_vertices * sizeof(edge_node), + const_cast("edge table creation")); + + for (c = 0; c < p->num_contours; c++) { + if (p->contour[c].num_vertices < 0) { + /* Ignore the non-contributing contour and repair the vertex count */ + p->contour[c].num_vertices = -p->contour[c].num_vertices; + } else { + /* Perform contour optimisation */ + num_vertices = 0; + for (i = 0; i < p->contour[c].num_vertices; i++) { + if (gpc_optimal(p->contour[c].vertex, i, p->contour[c].num_vertices)) { + edge_table[num_vertices].vertex.x = p->contour[c].vertex[i].x; + edge_table[num_vertices].vertex.y = p->contour[c].vertex[i].y; + + /* Record vertex in the scanbeam table */ + add_to_sbtree(sbt_entries, sbtree, edge_table[num_vertices].vertex.y); + + num_vertices++; + } + } + + /* Do the contour forward pass */ + for (min = 0; min < num_vertices; min++) { + /* If a forward local minimum... */ + if (gpc_fwd_min(edge_table, min, num_vertices)) { + /* Search for the next local maximum... 
*/ + num_edges = 1; + max = gpc_next_index(min, num_vertices); + while (gpc_not_fmax(edge_table, max, num_vertices)) { + num_edges++; + max = gpc_next_index(max, num_vertices); + } + + /* Build the next edge list */ + e = &edge_table[e_index]; + e_index += num_edges; + v = min; + e[0].bstate[BELOW] = UNBUNDLED; + e[0].bundle[BELOW][CLIP] = 0; + e[0].bundle[BELOW][SUBJ] = 0; + for (i = 0; i < num_edges; i++) { + e[i].xb = edge_table[v].vertex.x; + e[i].bot.x = edge_table[v].vertex.x; + e[i].bot.y = edge_table[v].vertex.y; + + v = gpc_next_index(v, num_vertices); + + e[i].top.x = edge_table[v].vertex.x; + e[i].top.y = edge_table[v].vertex.y; + e[i].dx = (edge_table[v].vertex.x - e[i].bot.x) / + (e[i].top.y - e[i].bot.y); + e[i].type = type; + e[i].outp[ABOVE] = NULL; + e[i].outp[BELOW] = NULL; + e[i].next = NULL; + e[i].prev = NULL; + e[i].succ = + ((num_edges > 1) && (i < (num_edges - 1))) ? &(e[i + 1]) : NULL; + e[i].pred = ((num_edges > 1) && (i > 0)) ? &(e[i - 1]) : NULL; + e[i].next_bound = NULL; + e[i].bside[CLIP] = (op == GPC_DIFF) ? RIGHT : LEFT; + e[i].bside[SUBJ] = LEFT; + } + insert_bound(bound_list(lmt, edge_table[min].vertex.y), e); + } + } + + /* Do the contour reverse pass */ + for (min = 0; min < num_vertices; min++) { + /* If a reverse local minimum... */ + if (gpc_rev_min(edge_table, min, num_vertices)) { + /* Search for the previous local maximum... */ + num_edges = 1; + max = gpc_prev_index(min, num_vertices); + while (gpc_not_rmax(edge_table, max, num_vertices)) { + num_edges++; + max = gpc_prev_index(max, num_vertices); + } + + /* Build the previous edge list */ + e = &edge_table[e_index]; + e_index += num_edges; + v = min; + e[0].bstate[BELOW] = UNBUNDLED; + e[0].bundle[BELOW][CLIP] = 0; + e[0].bundle[BELOW][SUBJ] = 0; + for (i = 0; i < num_edges; i++) { + e[i].xb = edge_table[v].vertex.x; + e[i].bot.x = edge_table[v].vertex.x; + e[i].bot.y = edge_table[v].vertex.y; + + v = gpc_prev_index(v, num_vertices); + + e[i].top.x = edge_table[v].vertex.x; + e[i].top.y = edge_table[v].vertex.y; + e[i].dx = (edge_table[v].vertex.x - e[i].bot.x) / + (e[i].top.y - e[i].bot.y); + e[i].type = type; + e[i].outp[ABOVE] = NULL; + e[i].outp[BELOW] = NULL; + e[i].next = NULL; + e[i].prev = NULL; + e[i].succ = + ((num_edges > 1) && (i < (num_edges - 1))) ? &(e[i + 1]) : NULL; + e[i].pred = ((num_edges > 1) && (i > 0)) ? &(e[i - 1]) : NULL; + e[i].next_bound = NULL; + e[i].bside[CLIP] = (op == GPC_DIFF) ? 
RIGHT : LEFT; + e[i].bside[SUBJ] = LEFT; + } + insert_bound(bound_list(lmt, edge_table[min].vertex.y), e); + } + } + } + } + return edge_table; +} // NOLINT + +static void add_edge_to_aet(edge_node **aet, edge_node *edge, edge_node *prev) { + if (!*aet) { + /* Append edge onto the tail end of the AET */ + *aet = edge; + edge->prev = prev; + edge->next = NULL; + } else { + /* Do primary sort on the xb field */ + if (edge->xb < (*aet)->xb) { + /* Insert edge here (before the AET edge) */ + edge->prev = prev; + edge->next = *aet; + (*aet)->prev = edge; + *aet = edge; + } else { + if (edge->xb == (*aet)->xb) { + /* Do secondary sort on the dx field */ + if (edge->dx < (*aet)->dx) { + /* Insert edge here (before the AET edge) */ + edge->prev = prev; + edge->next = *aet; + (*aet)->prev = edge; + *aet = edge; + } else { + /* Head further into the AET */ + add_edge_to_aet(&((*aet)->next), edge, *aet); + } + } else { + /* Head further into the AET */ + add_edge_to_aet(&((*aet)->next), edge, *aet); + } + } + } +} + +static void add_intersection(it_node **it, edge_node *edge0, edge_node *edge1, + double x, double y) { + it_node *existing_node; + + if (!*it) { + /* Append a new node to the tail of the list */ + gpc_malloc(*it, sizeof(it_node), + const_cast("IT insertion")); + (*it)->ie[0] = edge0; + (*it)->ie[1] = edge1; + (*it)->point.x = x; + (*it)->point.y = y; + (*it)->next = NULL; + } else { + if ((*it)->point.y > y) { + /* Insert a new node mid-list */ + existing_node = *it; + gpc_malloc(*it, sizeof(it_node), + const_cast("IT insertion")); + (*it)->ie[0] = edge0; + (*it)->ie[1] = edge1; + (*it)->point.x = x; + (*it)->point.y = y; + (*it)->next = existing_node; + } else { + /* Head further down the list */ + add_intersection(&((*it)->next), edge0, edge1, x, y); + } + } +} + +static void add_st_edge(st_node **st, it_node **it, edge_node *edge, + double dy) { + st_node *existing_node; + double den = 0.0; + double r = 0.0; + double x = 0.0; + double y = 0.0; + + if (!*st) { + /* Append edge onto the tail end of the ST */ + gpc_malloc(*st, sizeof(st_node), + const_cast("ST insertion")); + (*st)->edge = edge; + (*st)->xb = edge->xb; + (*st)->xt = edge->xt; + (*st)->dx = edge->dx; + (*st)->prev = NULL; + } else { + den = ((*st)->xt - (*st)->xb) - (edge->xt - edge->xb); + + /* If new edge and ST edge don't cross */ + if ((edge->xt >= (*st)->xt) || (edge->dx == (*st)->dx) || + (fabs(den) <= DBL_EPSILON)) { + /* No intersection - insert edge here (before the ST edge) */ + existing_node = *st; + gpc_malloc(*st, sizeof(st_node), + const_cast("ST insertion")); + (*st)->edge = edge; + (*st)->xb = edge->xb; + (*st)->xt = edge->xt; + (*st)->dx = edge->dx; + (*st)->prev = existing_node; + } else { + /* Compute intersection between new edge and ST edge */ + r = (edge->xb - (*st)->xb) / den; + x = (*st)->xb + r * ((*st)->xt - (*st)->xb); + y = r * dy; + + /* Insert the edge pointers and the intersection point in the IT */ + add_intersection(it, (*st)->edge, edge, x, y); + + /* Head further into the ST */ + add_st_edge(&((*st)->prev), it, edge, dy); + } + } +} + +static void build_intersection_table(it_node **it, edge_node *aet, double dy) { + st_node *st; + st_node *stp; + edge_node *edge = NULL; + + /* Build intersection table for the current scanbeam */ + reset_it(it); + st = NULL; + + /* Process each AET edge */ + for (edge = aet; edge; edge = edge->next) { + if ((edge->bstate[ABOVE] == BUNDLE_HEAD) || edge->bundle[ABOVE][CLIP] || + edge->bundle[ABOVE][SUBJ]) { + add_st_edge(&st, it, edge, dy); + } + } + + /* 
+
+static void build_intersection_table(it_node **it, edge_node *aet, double dy) {
+ st_node *st;
+ st_node *stp;
+ edge_node *edge = NULL;
+
+ /* Build intersection table for the current scanbeam */
+ reset_it(it);
+ st = NULL;
+
+ /* Process each AET edge */
+ for (edge = aet; edge; edge = edge->next) {
+ if ((edge->bstate[ABOVE] == BUNDLE_HEAD) || edge->bundle[ABOVE][CLIP] ||
+ edge->bundle[ABOVE][SUBJ]) {
+ add_st_edge(&st, it, edge, dy);
+ }
+ }
+
+ /* Free the sorted edge table */
+ while (st) {
+ stp = st->prev;
+ gpc_free(st);
+ st = stp;
+ }
+}
+
+static int count_contours(polygon_node *polygon) {
+ int nc = 0;
+ int nv = 0;
+ vertex_node *v = NULL;
+ vertex_node *nextv = NULL;
+
+ for (nc = 0; polygon; polygon = polygon->next) {
+ if (polygon->active) {
+ /* Count the vertices in the current contour */
+ nv = 0;
+ for (v = polygon->proxy->v[LEFT]; v; v = v->next) {
+ nv++;
+ }
+
+ /* Record valid vertex counts in the active field */
+ if (nv > 2) {
+ polygon->active = nv;
+ nc++;
+ } else {
+ /* Invalid contour: just free the heap */
+ for (v = polygon->proxy->v[LEFT]; v; v = nextv) {
+ nextv = v->next;
+ gpc_free(v);
+ }
+ polygon->active = 0;
+ }
+ }
+ }
+ return nc;
+}
+
+static void add_left(polygon_node *p, double x, double y) {
+ vertex_node *nv = NULL;
+
+ /* Create a new vertex node and set its fields */
+ gpc_malloc<vertex_node>(nv, sizeof(vertex_node),
+ const_cast<char *>("vertex node creation"));
+ nv->x = x;
+ nv->y = y;
+
+ /* Add vertex nv to the left end of the polygon's vertex list */
+ nv->next = p->proxy->v[LEFT];
+
+ /* Update proxy->v[LEFT] to point to nv */
+ p->proxy->v[LEFT] = nv;
+}
+
+static void merge_left(polygon_node *p, polygon_node *q, polygon_node *list) {
+ polygon_node *target = NULL;
+
+ /* Label contour as a hole */
+ q->proxy->hole = 1;
+
+ if (p->proxy != q->proxy) {
+ /* Assign p's vertex list to the left end of q's list */
+ p->proxy->v[RIGHT]->next = q->proxy->v[LEFT];
+ q->proxy->v[LEFT] = p->proxy->v[LEFT];
+
+ /* Redirect any p->proxy references to q->proxy */
+
+ for (target = p->proxy; list; list = list->next) {
+ if (list->proxy == target) {
+ list->active = 0;
+ list->proxy = q->proxy;
+ }
+ }
+ }
+}
+
+static void add_right(polygon_node *p, double x, double y) {
+ vertex_node *nv = NULL;
+
+ /* Create a new vertex node and set its fields */
+ gpc_malloc<vertex_node>(nv, sizeof(vertex_node),
+ const_cast<char *>("vertex node creation"));
+ nv->x = x;
+ nv->y = y;
+ nv->next = NULL;
+
+ /* Add vertex nv to the right end of the polygon's vertex list */
+ p->proxy->v[RIGHT]->next = nv;
+
+ /* Update proxy->v[RIGHT] to point to nv */
+ p->proxy->v[RIGHT] = nv;
+}
+
+static void merge_right(polygon_node *p, polygon_node *q, polygon_node *list) {
+ polygon_node *target = NULL;
+
+ /* Label contour as external */
+ q->proxy->hole = 0;
+
+ if (p->proxy != q->proxy) {
+ /* Assign p's vertex list to the right end of q's list */
+ q->proxy->v[RIGHT]->next = p->proxy->v[LEFT];
+ q->proxy->v[RIGHT] = p->proxy->v[RIGHT];
+
+ /* Redirect any p->proxy references to q->proxy */
+ for (target = p->proxy; list; list = list->next) {
+ if (list->proxy == target) {
+ list->active = 0;
+ list->proxy = q->proxy;
+ }
+ }
+ }
+}
+
+static void add_local_min(polygon_node **p, edge_node *edge, double x,
+ double y) {
+ polygon_node *existing_min = NULL;
+ vertex_node *nv = NULL;
+
+ existing_min = *p;
+
+ gpc_malloc<polygon_node>(*p, sizeof(polygon_node),
+ const_cast<char *>("polygon node creation"));
+
+ /* Create a new vertex node and set its fields */
+ gpc_malloc<vertex_node>(nv, sizeof(vertex_node),
+ const_cast<char *>("vertex node creation"));
+ nv->x = x;
+ nv->y = y;
+ nv->next = NULL;
+
+ /* Initialise proxy to point to p itself */
+ (*p)->proxy = (*p);
+ (*p)->active = 1;
+ (*p)->next = existing_min;
+
+ /* Make v[LEFT] and v[RIGHT] point to new vertex nv */
+ (*p)->v[LEFT] = nv;
+ (*p)->v[RIGHT] = nv;
+
+ /* Assign polygon p to the edge */
+ edge->outp[ABOVE] = *p;
+}
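+
+/* Editorial note (not in the original GPC sources): merge_left/merge_right
+ * above splice two partial contours by linking their vertex lists and then
+ * retargeting every polygon_node whose proxy still references the absorbed
+ * contour, e.g. if p->proxy == P and q->proxy == Q, after merge_right the
+ * vertices of P hang off Q and every node with proxy == P is deactivated
+ * and rewritten to proxy == Q, so later vertices land on the survivor. */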
+
+static int count_tristrips(polygon_node *tn) {
+ int total = 0;
+
+ for (total = 0; tn; tn = tn->next) {
+ if (tn->active > 2) {
+ total++;
+ }
+ }
+ return total;
+}
+
+void add_vertex(vertex_node **t, double x, double y) {
+ if (!(*t)) {
+ gpc_malloc<vertex_node>(*t, sizeof(vertex_node),
+ const_cast<char *>("tristrip vertex creation"));
+ (*t)->x = x;
+ (*t)->y = y;
+ (*t)->next = NULL;
+ } else {
+ /* Head further down the list */
+ add_vertex(&((*t)->next), x, y);
+ }
+}
+
+void gpc_vertex_create(edge_node *e, int p, int s, double x, double y) {
+ add_vertex(&(e->outp[p]->v[s]), x, y);
+ e->outp[p]->active++;
+}
+
+static void new_tristrip(polygon_node **tn, edge_node *edge, double x,
+ double y) {
+ if (!(*tn)) {
+ gpc_malloc<polygon_node>(*tn, sizeof(polygon_node),
+ const_cast<char *>("tristrip node creation"));
+ (*tn)->next = NULL;
+ (*tn)->v[LEFT] = NULL;
+ (*tn)->v[RIGHT] = NULL;
+ (*tn)->active = 1;
+ add_vertex(&((*tn)->v[LEFT]), x, y);
+ edge->outp[ABOVE] = *tn;
+ } else {
+ /* Head further down the list */
+ new_tristrip(&((*tn)->next), edge, x, y);
+ }
+}
+
+static bbox *create_contour_bboxes(gpc_polygon *p) {
+ bbox *box;
+ int c = 0;
+ int v = 0;
+
+ gpc_malloc<bbox>(box, p->num_contours * sizeof(bbox),
+ const_cast<char *>("Bounding box creation"));
+
+ /* Construct contour bounding boxes */
+ for (c = 0; c < p->num_contours; c++) {
+ /* Initialise bounding box extent */
+ box[c].xmin = DBL_MAX;
+ box[c].ymin = DBL_MAX;
+ box[c].xmax = -DBL_MAX;
+ box[c].ymax = -DBL_MAX;
+
+ for (v = 0; v < p->contour[c].num_vertices; v++) {
+ /* Adjust bounding box */
+ if (p->contour[c].vertex[v].x < box[c].xmin) {
+ box[c].xmin = p->contour[c].vertex[v].x;
+ }
+ if (p->contour[c].vertex[v].y < box[c].ymin) {
+ box[c].ymin = p->contour[c].vertex[v].y;
+ }
+ if (p->contour[c].vertex[v].x > box[c].xmax) {
+ box[c].xmax = p->contour[c].vertex[v].x;
+ }
+ if (p->contour[c].vertex[v].y > box[c].ymax) {
+ box[c].ymax = p->contour[c].vertex[v].y;
+ }
+ }
+ }
+ return box;
+}
+
+static void minimax_test(gpc_polygon *subj, gpc_polygon *clip, gpc_op op) {
+ bbox *s_bbox;
+ bbox *c_bbox;
+ int s = 0;
+ int c = 0;
+ int *o_table = NULL;
+ int overlap = 0;
+
+ s_bbox = create_contour_bboxes(subj);
+ c_bbox = create_contour_bboxes(clip);
+
+ gpc_malloc<int>(o_table,
+ subj->num_contours * clip->num_contours * sizeof(int),
+ const_cast<char *>("overlap table creation"));
+
+ /* Check all subject contour bounding boxes against clip boxes */
+ for (s = 0; s < subj->num_contours; s++) {
+ for (c = 0; c < clip->num_contours; c++) {
+ o_table[c * subj->num_contours + s] =
+ (!((s_bbox[s].xmax < c_bbox[c].xmin) ||
+ (s_bbox[s].xmin > c_bbox[c].xmax))) &&
+ (!((s_bbox[s].ymax < c_bbox[c].ymin) ||
+ (s_bbox[s].ymin > c_bbox[c].ymax)));
+ }
+ }
+
+ /* For each clip contour, search for any subject contour overlaps */
+ for (c = 0; c < clip->num_contours; c++) {
+ overlap = 0;
+ for (s = 0; (!overlap) && (s < subj->num_contours); s++) {
+ overlap = o_table[c * subj->num_contours + s];
+ }
+
+ if (!overlap) {
+ /* Flag non contributing status by negating vertex count */
+ clip->contour[c].num_vertices = -clip->contour[c].num_vertices;
+ }
+ }
+
+ if (op == GPC_INT) {
+ /* For each subject contour, search for any clip contour overlaps */
+ for (s = 0; s < subj->num_contours; s++) {
+ overlap = 0;
+ for (c = 0; (!overlap) && (c < clip->num_contours); c++) {
+ overlap = o_table[c * subj->num_contours + s];
+ }
+
+ if (!overlap) {
+ /* Flag non contributing status by negating vertex count */
+ subj->contour[s].num_vertices = -subj->contour[s].num_vertices;
+ }
+ }
+ }
+
+ gpc_free(s_bbox);
+ gpc_free(c_bbox);
+ gpc_free(o_table);
+}
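+
+/* Editorial note (not in the original GPC sources): the overlap table built
+ * in minimax_test is a flat [clip][subject] matrix; with 2 subject and 3
+ * clip contours the flag for subject s against clip c lives at
+ * o_table[c * 2 + s]. A contour whose row or column is all zero cannot
+ * contribute to the result, so its vertex count is negated as a skip
+ * marker that the LMT construction understands. */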
+
+/*
+===========================================================================
+ Public Functions
+===========================================================================
+*/
+
+void gpc_free_polygon(gpc_polygon *p) {
+ int c = 0;
+
+ for (c = 0; c < p->num_contours; c++) {
+ gpc_free(p->contour[c].vertex);
+ }
+ gpc_free(p->hole);
+ gpc_free(p->contour);
+ p->num_contours = 0;
+}
+
+/*
+void gpc_read_polygon(FILE *fp, int read_hole_flags, gpc_polygon *p) {
+ int c = 0;
+ int v = 0;
+
+ fscanf(fp, "%d", &(p->num_contours));
+ gpc_malloc<int>(p->hole, p->num_contours * sizeof(int),
+ (char *)"hole flag array creation");
+ gpc_malloc<gpc_vertex_list>(p->contour,
+ p->num_contours * sizeof(gpc_vertex_list),
+ (char *)"contour creation");
+ for (c = 0; c < p->num_contours; c++) {
+ fscanf(fp, "%d", &(p->contour[c].num_vertices));
+
+ if (read_hole_flags) {
+ fscanf(fp, "%d", &(p->hole[c]));
+ } else {
+ p->hole[c] = 0; // Assume all contours to be external
+ }
+
+ gpc_malloc<gpc_vertex>(p->contour[c].vertex,
+ p->contour[c].num_vertices * sizeof(gpc_vertex),
+ (char *)"vertex creation");
+ for (v = 0; v < p->contour[c].num_vertices; v++) {
+ fscanf(fp, "%lf %lf", &(p->contour[c].vertex[v].x),
+ &(p->contour[c].vertex[v].y));
+ }
+ }
+}
+
+void gpc_write_polygon(FILE *fp, int write_hole_flags, gpc_polygon *p) {
+ int c = 0;
+ int v = 0;
+
+ fprintf(fp, "%d\n", p->num_contours);
+ for (c = 0; c < p->num_contours; c++) {
+ fprintf(fp, "%d\n", p->contour[c].num_vertices);
+
+ if (write_hole_flags) {
+ fprintf(fp, "%d\n", p->hole[c]);
+ }
+
+ for (v = 0; v < p->contour[c].num_vertices; v++) {
+ fprintf(fp, "% .*lf % .*lf\n", DBL_DIG, p->contour[c].vertex[v].x,
+ DBL_DIG, p->contour[c].vertex[v].y);
+ }
+ }
+}
+*/
+
+void gpc_add_contour(gpc_polygon *p, gpc_vertex_list *new_contour, int hole) {
+ int *extended_hole = NULL;
+ int c = 0;
+ int v = 0;
+ gpc_vertex_list *extended_contour = NULL;
+
+ /* Create an extended hole array */
+ gpc_malloc<int>(extended_hole, (p->num_contours + 1) * sizeof(int),
+ const_cast<char *>("contour hole addition"));
+
+ /* Create an extended contour array */
+ gpc_malloc<gpc_vertex_list>(extended_contour,
+ (p->num_contours + 1) * sizeof(gpc_vertex_list),
+ const_cast<char *>("contour addition"));
+
+ /* Copy the old contour and hole data into the extended arrays */
+ for (c = 0; c < p->num_contours; c++) {
+ extended_hole[c] = p->hole[c];
+ extended_contour[c] = p->contour[c];
+ }
+
+ /* Copy the new contour and hole onto the end of the extended arrays */
+ c = p->num_contours;
+ extended_hole[c] = hole;
+ extended_contour[c].num_vertices = new_contour->num_vertices;
+ gpc_malloc<gpc_vertex>(extended_contour[c].vertex,
+ new_contour->num_vertices * sizeof(gpc_vertex),
+ const_cast<char *>("contour addition"));
+ for (v = 0; v < new_contour->num_vertices; v++) {
+ extended_contour[c].vertex[v] = new_contour->vertex[v];
+ }
+
+ /* Dispose of the old contour */
+ gpc_free(p->contour);
+ gpc_free(p->hole);
+
+ /* Update the polygon information */
+ p->num_contours++;
+ p->hole = extended_hole;
+ p->contour = extended_contour;
+}
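+
+#ifdef GPC_USAGE_EXAMPLE
+/* Editorial sketch (not part of the original sources; compiled only when
+ * GPC_USAGE_EXAMPLE is defined): intersecting two squares with the public
+ * API declared in gpc.h. The expected result is one contour covering the
+ * square (2,2)-(4,4). */
+static void gpc_example_intersect_squares() {
+ gpc_vertex subj_pts[4] = {{0, 0}, {4, 0}, {4, 4}, {0, 4}};
+ gpc_vertex clip_pts[4] = {{2, 2}, {6, 2}, {6, 6}, {2, 6}};
+ gpc_vertex_list subj_list = {4, subj_pts};
+ gpc_vertex_list clip_list = {4, clip_pts};
+ gpc_polygon subj = {0, NULL, NULL};
+ gpc_polygon clip = {0, NULL, NULL};
+ gpc_polygon result;
+ gpc_add_contour(&subj, &subj_list, 0);
+ gpc_add_contour(&clip, &clip_list, 0);
+ gpc_polygon_clip(GPC_INT, &subj, &clip, &result);
+ gpc_free_polygon(&subj);
+ gpc_free_polygon(&clip);
+ gpc_free_polygon(&result);
+}
+#endif /* GPC_USAGE_EXAMPLE */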
+
+// gpc_polygon_clip
+void gpc_polygon_clip(gpc_op op, gpc_polygon *subj, gpc_polygon *clip,
+ gpc_polygon *result) {
+ sb_tree *sbtree = NULL;
+ it_node *it = NULL;
+ it_node *intersect = NULL;
+ edge_node *edge = NULL;
+ edge_node *prev_edge = NULL;
+ edge_node *next_edge = NULL;
+ edge_node *succ_edge = NULL;
+ edge_node *e0 = NULL;
+ edge_node *e1 = NULL;
+ edge_node *aet = NULL;
+ edge_node *c_heap = NULL;
+ edge_node *s_heap = NULL;
+ lmt_node *lmt = NULL;
+ lmt_node *local_min = NULL;
+ polygon_node *out_poly = NULL;
+ polygon_node *p = NULL;
+ polygon_node *q = NULL;
+ polygon_node *poly = NULL;
+ polygon_node *npoly = NULL;
+ polygon_node *cf = NULL;
+ vertex_node *vtx = NULL;
+ vertex_node *nv = NULL;
+ h_state horiz[2];
+ int in[2];
+ int exists[2];
+ int parity[2] = {LEFT, LEFT};
+ int c = 0;
+ int v = 0;
+ int contributing = 0;
+ int search = 0;
+ int scanbeam = 0;
+ int sbt_entries = 0;
+ int vclass = 0;
+ int bl = 0;
+ int br = 0;
+ int tl = 0;
+ int tr = 0;
+ double *sbt = NULL;
+ double xb = 0.0;
+ double px = 0.0;
+ double yb = 0.0;
+ double yt = 0.0;
+ double dy = 0.0;
+ double ix = 0.0;
+ double iy = 0.0;
+
+ /* Test for trivial NULL result cases */
+ if (((subj->num_contours == 0) && (clip->num_contours == 0)) ||
+ ((subj->num_contours == 0) && ((op == GPC_INT) || (op == GPC_DIFF))) ||
+ ((clip->num_contours == 0) && (op == GPC_INT))) {
+ result->num_contours = 0;
+ result->hole = NULL;
+ result->contour = NULL;
+ return;
+ }
+ /* Identify potentially contributing contours */
+ if (((op == GPC_INT) || (op == GPC_DIFF)) && (subj->num_contours > 0) &&
+ (clip->num_contours > 0)) {
+ minimax_test(subj, clip, op);
+ }
+ /* Build LMT */
+ if (subj->num_contours > 0) {
+ s_heap = build_lmt(&lmt, &sbtree, &sbt_entries, subj, SUBJ, op);
+ }
+ if (clip->num_contours > 0) {
+ c_heap = build_lmt(&lmt, &sbtree, &sbt_entries, clip, CLIP, op);
+ }
+ /* Return a NULL result if no contours contribute */
+ if (lmt == NULL) {
+ result->num_contours = 0;
+ result->hole = NULL;
+ result->contour = NULL;
+ reset_lmt(&lmt);
+ gpc_free(s_heap);
+ gpc_free(c_heap);
+ return;
+ }
+
+ /* Build scanbeam table from scanbeam tree */
+ gpc_malloc<double>(sbt, sbt_entries * sizeof(double),
+ const_cast<char *>("sbt creation"));
+ build_sbt(&scanbeam, sbt, sbtree);
+ scanbeam = 0;
+ free_sbtree(&sbtree);
+ /* Allow pointer re-use without causing memory leak */
+ if (subj == result) {
+ gpc_free_polygon(subj);
+ }
+ if (clip == result) {
+ gpc_free_polygon(clip);
+ }
+ /* Invert clip polygon for difference operation */
+ if (op == GPC_DIFF) {
+ parity[CLIP] = RIGHT;
+ }
+ local_min = lmt;
+
+ // Process each scanbeam
+ while (scanbeam < sbt_entries) {
+ /* Set yb and yt to the bottom and top of the scanbeam */
+ yb = sbt[scanbeam++];
+ if (scanbeam < sbt_entries) {
+ yt = sbt[scanbeam];
+ dy = yt - yb;
+ }
+ /* === SCANBEAM BOUNDARY PROCESSING ================================ */
+ /* If LMT node corresponding to yb exists */
+ if (local_min) {
+ if (local_min->y == yb) {
+ /* Add edges starting at this local minimum to the AET */
+ for (edge = local_min->first_bound; edge; edge = edge->next_bound) {
+ add_edge_to_aet(&aet, edge, NULL);
+ }
+ local_min = local_min->next;
+ }
+ }
+ /* Set dummy previous x value */
+ px = -DBL_MAX;
+ /* Create bundles within AET */
+ e0 = aet;
+ e1 = aet;
+ /* Set up bundle fields of first edge */
+ aet->bundle[ABOVE][aet->type] = (aet->top.y != yb);
+ aet->bundle[ABOVE][!aet->type] = 0;
+ aet->bstate[ABOVE] = UNBUNDLED;
+
+ for (next_edge = aet->next; next_edge; next_edge = next_edge->next) {
+ /* Set up bundle fields of next edge */
+ next_edge->bundle[ABOVE][next_edge->type] = (next_edge->top.y != yb);
+ next_edge->bundle[ABOVE][!next_edge->type] = 0;
+ next_edge->bstate[ABOVE] = UNBUNDLED;
+ /* Bundle edges above the scanbeam boundary if they coincide */
+ if (next_edge->bundle[ABOVE][next_edge->type]) {
+ if (gpc_eq(e0->xb, next_edge->xb) && gpc_eq(e0->dx, next_edge->dx) &&
+ (e0->top.y != yb)) {
+ next_edge->bundle[ABOVE][next_edge->type] ^=
+ e0->bundle[ABOVE][next_edge->type];
+ 
next_edge->bundle[ABOVE][!next_edge->type] = + e0->bundle[ABOVE][!next_edge->type]; + next_edge->bstate[ABOVE] = BUNDLE_HEAD; + e0->bundle[ABOVE][CLIP] = 0; + e0->bundle[ABOVE][SUBJ] = 0; + e0->bstate[ABOVE] = BUNDLE_TAIL; + } + e0 = next_edge; + } + } + horiz[CLIP] = NH; + horiz[SUBJ] = NH; + + // Process each edge at this scanbeam boundary + for (edge = aet; edge; edge = edge->next) { + exists[CLIP] = + edge->bundle[ABOVE][CLIP] + (edge->bundle[BELOW][CLIP] << 1); + exists[SUBJ] = + edge->bundle[ABOVE][SUBJ] + (edge->bundle[BELOW][SUBJ] << 1); + if (exists[CLIP] || exists[SUBJ]) { + /* Set bundle side */ + edge->bside[CLIP] = parity[CLIP]; + edge->bside[SUBJ] = parity[SUBJ]; + /* Determine contributing status and quadrant occupancies */ + switch (op) { + case GPC_DIFF: + case GPC_INT: + contributing = (exists[CLIP] && (parity[SUBJ] || horiz[SUBJ])) || + (exists[SUBJ] && (parity[CLIP] || horiz[CLIP])) || + (exists[CLIP] && exists[SUBJ] && + (parity[CLIP] == parity[SUBJ])); + br = (parity[CLIP]) && (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) && + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) && + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) && + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + case GPC_XOR: + contributing = exists[CLIP] || exists[SUBJ]; + br = (parity[CLIP]) ^ (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) ^ + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) ^ + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) ^ + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + case GPC_UNION: + contributing = (exists[CLIP] && (!parity[SUBJ] || horiz[SUBJ])) || + (exists[SUBJ] && (!parity[CLIP] || horiz[CLIP])) || + (exists[CLIP] && exists[SUBJ] && + (parity[CLIP] == parity[SUBJ])); + br = (parity[CLIP]) || (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) || + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) || + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) || + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + } + // Update parity + parity[CLIP] ^= edge->bundle[ABOVE][CLIP]; + parity[SUBJ] ^= edge->bundle[ABOVE][SUBJ]; + /* Update horizontal state */ + if (exists[CLIP]) { + horiz[CLIP] = next_h_state[horiz[CLIP]] + [((exists[CLIP] - 1) << 1) + parity[CLIP]]; + } + if (exists[SUBJ]) { + horiz[SUBJ] = next_h_state[horiz[SUBJ]] + [((exists[SUBJ] - 1) << 1) + parity[SUBJ]]; + } + vclass = tr + (tl << 1) + (br << 2) + (bl << 3); + if (contributing) { + xb = edge->xb; + switch (vclass) { + case EMN: + case IMN: + add_local_min(&out_poly, edge, xb, yb); + px = xb; + cf = edge->outp[ABOVE]; + break; + case ERI: + if (xb != px) { + add_right(cf, xb, yb); + px = xb; + } + edge->outp[ABOVE] = cf; + cf = NULL; + break; + case ELI: + add_left(edge->outp[BELOW], xb, yb); + px = xb; + cf = edge->outp[BELOW]; + break; + case EMX: + if (xb != px) { + add_left(cf, xb, yb); + px = xb; + } + merge_right(cf, edge->outp[BELOW], out_poly); + cf = NULL; + break; + case ILI: + if (xb != px) { + add_left(cf, xb, yb); + px = xb; + } + edge->outp[ABOVE] = cf; + cf = NULL; + break; + case IRI: + add_right(edge->outp[BELOW], xb, yb); + px = xb; + cf = 
edge->outp[BELOW]; + edge->outp[BELOW] = NULL; + break; + case IMX: + if (xb != px) { + add_right(cf, xb, yb); + px = xb; + } + merge_left(cf, edge->outp[BELOW], out_poly); + cf = NULL; + edge->outp[BELOW] = NULL; + break; + case IMM: + if (xb != px) { + add_right(cf, xb, yb); + px = xb; + } + merge_left(cf, edge->outp[BELOW], out_poly); + edge->outp[BELOW] = NULL; + add_local_min(&out_poly, edge, xb, yb); + cf = edge->outp[ABOVE]; + break; + case EMM: + if (xb != px) { + add_left(cf, xb, yb); + px = xb; + } + merge_right(cf, edge->outp[BELOW], out_poly); + edge->outp[BELOW] = NULL; + add_local_min(&out_poly, edge, xb, yb); + cf = edge->outp[ABOVE]; + break; + case LED: + if (edge->bot.y == yb) { + add_left(edge->outp[BELOW], xb, yb); + } + edge->outp[ABOVE] = edge->outp[BELOW]; + px = xb; + break; + case RED: + if (edge->bot.y == yb) { + add_right(edge->outp[BELOW], xb, yb); + } + edge->outp[ABOVE] = edge->outp[BELOW]; + px = xb; + break; + default: + break; + } /* End of switch */ + } /* End of contributing conditional */ + } /* End of edge exists conditional */ + } // End of AET loop + + /* Delete terminating edges from the AET, otherwise compute xt */ + for (edge = aet; edge; edge = edge->next) { + if (edge->top.y == yb) { + prev_edge = edge->prev; + next_edge = edge->next; + if (prev_edge) { + prev_edge->next = next_edge; + } else { + aet = next_edge; + } + if (next_edge) { + next_edge->prev = prev_edge; + } + /* Copy bundle head state to the adjacent tail edge if required */ + if ((edge->bstate[BELOW] == BUNDLE_HEAD) && prev_edge) { + if (prev_edge->bstate[BELOW] == BUNDLE_TAIL) { + prev_edge->outp[BELOW] = edge->outp[BELOW]; + prev_edge->bstate[BELOW] = UNBUNDLED; + if (prev_edge->prev) { + if (prev_edge->prev->bstate[BELOW] == BUNDLE_TAIL) { + prev_edge->bstate[BELOW] = BUNDLE_HEAD; + } + } + } + } + } else { + if (edge->top.y == yt) { + edge->xt = edge->top.x; + } else { + edge->xt = edge->bot.x + edge->dx * (yt - edge->bot.y); + } + } + } + + if (scanbeam < sbt_entries) { + /* === SCANBEAM INTERIOR PROCESSING ============================== */ + build_intersection_table(&it, aet, dy); + /* Process each node in the intersection table */ + for (intersect = it; intersect; intersect = intersect->next) { + e0 = intersect->ie[0]; + e1 = intersect->ie[1]; + /* Only generate output for contributing intersections */ + if ((e0->bundle[ABOVE][CLIP] || e0->bundle[ABOVE][SUBJ]) && + (e1->bundle[ABOVE][CLIP] || e1->bundle[ABOVE][SUBJ])) { + p = e0->outp[ABOVE]; + q = e1->outp[ABOVE]; + ix = intersect->point.x; + iy = intersect->point.y + yb; + + in[CLIP] = (e0->bundle[ABOVE][CLIP] && !e0->bside[CLIP]) || + (e1->bundle[ABOVE][CLIP] && e1->bside[CLIP]) || + (!e0->bundle[ABOVE][CLIP] && !e1->bundle[ABOVE][CLIP] && + e0->bside[CLIP] && e1->bside[CLIP]); + in[SUBJ] = (e0->bundle[ABOVE][SUBJ] && !e0->bside[SUBJ]) || + (e1->bundle[ABOVE][SUBJ] && e1->bside[SUBJ]) || + (!e0->bundle[ABOVE][SUBJ] && !e1->bundle[ABOVE][SUBJ] && + e0->bside[SUBJ] && e1->bside[SUBJ]); + + // Determine quadrant occupancies + switch (op) { + case GPC_DIFF: + case GPC_INT: + tr = (in[CLIP]) && (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); + bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + case GPC_XOR: + tr = (in[CLIP]) ^ (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) ^ + 
(in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) ^ + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); + bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) ^ + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + case GPC_UNION: + tr = (in[CLIP]) || (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); + bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + } + vclass = tr + (tl << 1) + (br << 2) + (bl << 3); + switch (vclass) { + case EMN: + add_local_min(&out_poly, e0, ix, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + break; + case ERI: + if (p) { + add_right(p, ix, iy); + e1->outp[ABOVE] = p; + e0->outp[ABOVE] = NULL; + } + break; + case ELI: + if (q) { + add_left(q, ix, iy); + e0->outp[ABOVE] = q; + e1->outp[ABOVE] = NULL; + } + break; + case EMX: + if (p && q) { + add_left(p, ix, iy); + merge_right(p, q, out_poly); + e0->outp[ABOVE] = NULL; + e1->outp[ABOVE] = NULL; + } + break; + case IMN: + add_local_min(&out_poly, e0, ix, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + break; + case ILI: + if (p) { + add_left(p, ix, iy); + e1->outp[ABOVE] = p; + e0->outp[ABOVE] = NULL; + } + break; + case IRI: + if (q) { + add_right(q, ix, iy); + e0->outp[ABOVE] = q; + e1->outp[ABOVE] = NULL; + } + break; + case IMX: + if (p && q) { + add_right(p, ix, iy); + merge_left(p, q, out_poly); + e0->outp[ABOVE] = NULL; + e1->outp[ABOVE] = NULL; + } + break; + case IMM: + if (p && q) { + add_right(p, ix, iy); + merge_left(p, q, out_poly); + add_local_min(&out_poly, e0, ix, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + } + break; + case EMM: + if (p && q) { + add_left(p, ix, iy); + merge_right(p, q, out_poly); + add_local_min(&out_poly, e0, ix, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + } + break; + default: + break; + } // End of switch + } /* End of contributing intersection conditional */ + + /* Swap bundle sides in response to edge crossing */ + if (e0->bundle[ABOVE][CLIP]) { + e1->bside[CLIP] = !e1->bside[CLIP]; + } + if (e1->bundle[ABOVE][CLIP]) { + e0->bside[CLIP] = !e0->bside[CLIP]; + } + if (e0->bundle[ABOVE][SUBJ]) { + e1->bside[SUBJ] = !e1->bside[SUBJ]; + } + if (e1->bundle[ABOVE][SUBJ]) { + e0->bside[SUBJ] = !e0->bside[SUBJ]; + } + + /* Swap e0 and e1 bundles in the AET */ + prev_edge = e0->prev; + next_edge = e1->next; + if (next_edge) { + next_edge->prev = e0; + } + if (e0->bstate[ABOVE] == BUNDLE_HEAD) { + search = 1; + while (search) { + prev_edge = prev_edge->prev; + if (prev_edge) { + if (prev_edge->bstate[ABOVE] != BUNDLE_TAIL) { + search = 0; + } + } else { + search = 0; + } + } + } + if (!prev_edge) { + aet->prev = e1; + e1->next = aet; + aet = e0->next; + } else { + prev_edge->next->prev = e1; + e1->next = prev_edge->next; + prev_edge->next = e0->next; + } + e0->next->prev = prev_edge; + e1->next->prev = e1; + e0->next = next_edge; + } /* End of IT loop*/ + + // Prepare for next scanbeam + for (edge = aet; edge; edge = next_edge) { + next_edge = edge->next; + succ_edge = edge->succ; + if ((edge->top.y == yt) && succ_edge) { + /* Replace AET edge by its successor */ + succ_edge->outp[BELOW] = edge->outp[ABOVE]; + succ_edge->bstate[BELOW] = edge->bstate[ABOVE]; + succ_edge->bundle[BELOW][CLIP] = edge->bundle[ABOVE][CLIP]; + succ_edge->bundle[BELOW][SUBJ] = edge->bundle[ABOVE][SUBJ]; + prev_edge = 
edge->prev;
+ if (prev_edge) {
+ prev_edge->next = succ_edge;
+ } else {
+ aet = succ_edge;
+ }
+ if (next_edge) {
+ next_edge->prev = succ_edge;
+ }
+ succ_edge->prev = prev_edge;
+ succ_edge->next = next_edge;
+ } else {
+ /* Update this edge */
+ edge->outp[BELOW] = edge->outp[ABOVE];
+ edge->bstate[BELOW] = edge->bstate[ABOVE];
+ edge->bundle[BELOW][CLIP] = edge->bundle[ABOVE][CLIP];
+ edge->bundle[BELOW][SUBJ] = edge->bundle[ABOVE][SUBJ];
+ edge->xb = edge->xt;
+ }
+ edge->outp[ABOVE] = NULL;
+ }
+ }
+ } /* === END OF SCANBEAM PROCESSING ================================== */
+ // Generate result polygon from out_poly
+ result->contour = NULL;
+ result->hole = NULL;
+ result->num_contours = count_contours(out_poly);
+ if (result->num_contours > 0) {
+ gpc_malloc<int>(result->hole, result->num_contours * sizeof(int),
+ const_cast<char *>("hole flag table creation"));
+ gpc_malloc<gpc_vertex_list>(result->contour,
+ result->num_contours * sizeof(gpc_vertex_list),
+ const_cast<char *>("contour creation"));
+
+ c = 0;
+ for (poly = out_poly; poly; poly = npoly) {
+ npoly = poly->next;
+ if (poly->active) {
+ result->hole[c] = poly->proxy->hole;
+ result->contour[c].num_vertices = poly->active;
+ gpc_malloc<gpc_vertex>(
+ result->contour[c].vertex,
+ result->contour[c].num_vertices * sizeof(gpc_vertex),
+ const_cast<char *>("vertex creation"));
+
+ v = result->contour[c].num_vertices - 1;
+ for (vtx = poly->proxy->v[LEFT]; vtx; vtx = nv) {
+ nv = vtx->next;
+ result->contour[c].vertex[v].x = vtx->x;
+ result->contour[c].vertex[v].y = vtx->y;
+ gpc_free(vtx);
+ v--;
+ }
+ c++;
+ }
+ gpc_free(poly);
+ }
+ } else {
+ for (poly = out_poly; poly; poly = npoly) {
+ npoly = poly->next;
+ gpc_free(poly);
+ }
+ }
+
+ // Tidy up
+ reset_it(&it);
+ reset_lmt(&lmt);
+ gpc_free(c_heap);
+ gpc_free(s_heap);
+ gpc_free(sbt);
+} // NOLINT
+
+void gpc_free_tristrip(gpc_tristrip *t) {
+ int s = 0;
+ for (s = 0; s < t->num_strips; s++) {
+ gpc_free(t->strip[s].vertex);
+ }
+ gpc_free(t->strip);
+ t->num_strips = 0;
+}
+
+void gpc_polygon_to_tristrip(gpc_polygon *s, gpc_tristrip *t) {
+ gpc_polygon c;
+ c.num_contours = 0;
+ c.hole = NULL;
+ c.contour = NULL;
+ gpc_tristrip_clip(GPC_DIFF, s, &c, t);
+}
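+
+/* Editorial note (not in the original GPC sources): the conversion above
+ * relies on the identity P \ empty == P, so clipping against an empty clip
+ * polygon with GPC_DIFF triangulates the input as a side effect, e.g.
+ *   gpc_tristrip t;
+ *   gpc_polygon_to_tristrip(&poly, &t);  // t now holds tristrips of poly
+ *   gpc_free_tristrip(&t);
+ */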
+
+// gpc_tristrip_clip
+void gpc_tristrip_clip(gpc_op op, gpc_polygon *subj, gpc_polygon *clip,
+ gpc_tristrip *result) {
+ sb_tree *sbtree = NULL;
+ it_node *it = NULL;
+ it_node *intersect = NULL;
+ edge_node *edge = NULL;
+ edge_node *prev_edge = NULL;
+ edge_node *next_edge = NULL;
+ edge_node *succ_edge = NULL;
+ edge_node *e0 = NULL;
+ edge_node *e1 = NULL;
+ edge_node *aet = NULL;
+ edge_node *c_heap = NULL;
+ edge_node *s_heap = NULL;
+ edge_node *cf = NULL;
+ lmt_node *lmt = NULL;
+ lmt_node *local_min = NULL;
+ polygon_node *tlist = NULL;
+ polygon_node *tn = NULL;
+ polygon_node *tnn = NULL;
+ polygon_node *p = NULL;
+ polygon_node *q = NULL;
+ vertex_node *lt = NULL;
+ vertex_node *ltn = NULL;
+ vertex_node *rt = NULL;
+ vertex_node *rtn = NULL;
+ h_state horiz[2];
+ vertex_type cft = NUL;
+ int in[2];
+ int exists[2];
+ int parity[2] = {LEFT, LEFT};
+ int s = 0;
+ int v = 0;
+ int contributing = 0;
+ int search = 0;
+ int scanbeam = 0;
+ int sbt_entries = 0;
+ int vclass = 0;
+ int bl = 0;
+ int br = 0;
+ int tl = 0;
+ int tr = 0;
+ double *sbt = NULL;
+ double xb = 0.0;
+ double px = 0.0;
+ double nx = 0.0;
+ double yb = 0.0;
+ double yt = 0.0;
+ double dy = 0.0;
+ double ix = 0.0;
+ double iy = 0.0;
+
+ /* Test for trivial NULL result cases */
+ if (((subj->num_contours == 0) && (clip->num_contours == 0)) ||
+ ((subj->num_contours == 0) && ((op == GPC_INT) || (op == GPC_DIFF))) ||
+ ((clip->num_contours == 0) && (op == GPC_INT))) {
+ result->num_strips = 0;
+ result->strip = NULL;
+ return;
+ }
+
+ /* Identify potentially contributing contours */
+ if (((op == GPC_INT) || (op == GPC_DIFF)) && (subj->num_contours > 0) &&
+ (clip->num_contours > 0)) {
+ minimax_test(subj, clip, op);
+ }
+ /* Build LMT */
+ if (subj->num_contours > 0) {
+ s_heap = build_lmt(&lmt, &sbtree, &sbt_entries, subj, SUBJ, op);
+ }
+ if (clip->num_contours > 0) {
+ c_heap = build_lmt(&lmt, &sbtree, &sbt_entries, clip, CLIP, op);
+ }
+ /* Return a NULL result if no contours contribute */
+ if (lmt == NULL) {
+ result->num_strips = 0;
+ result->strip = NULL;
+ reset_lmt(&lmt);
+ gpc_free(s_heap);
+ gpc_free(c_heap);
+ return;
+ }
+
+ /* Build scanbeam table from scanbeam tree */
+ gpc_malloc<double>(sbt, sbt_entries * sizeof(double),
+ const_cast<char *>("sbt creation"));
+ build_sbt(&scanbeam, sbt, sbtree);
+ scanbeam = 0;
+ free_sbtree(&sbtree);
+
+ /* Invert clip polygon for difference operation */
+ if (op == GPC_DIFF) {
+ parity[CLIP] = RIGHT;
+ }
+ local_min = lmt;
+
+ // Process each scanbeam
+ while (scanbeam < sbt_entries) {
+ /* Set yb and yt to the bottom and top of the scanbeam */
+ yb = sbt[scanbeam++];
+ if (scanbeam < sbt_entries) {
+ yt = sbt[scanbeam];
+ dy = yt - yb;
+ }
+
+ /* === SCANBEAM BOUNDARY PROCESSING ================================ */
+ /* If LMT node corresponding to yb exists */
+ if (local_min) {
+ if (local_min->y == yb) {
+ /* Add edges starting at this local minimum to the AET */
+ for (edge = local_min->first_bound; edge; edge = edge->next_bound) {
+ add_edge_to_aet(&aet, edge, NULL);
+ }
+ local_min = local_min->next;
+ }
+ }
+ /* Set dummy previous x value */
+ /* Create bundles within AET */
+ px = -DBL_MAX;
+ e0 = aet;
+ e1 = aet;
+
+ /* Set up bundle fields of first edge */
+ aet->bundle[ABOVE][aet->type] = (aet->top.y != yb);
+ aet->bundle[ABOVE][!aet->type] = 0;
+ aet->bstate[ABOVE] = UNBUNDLED;
+
+ for (next_edge = aet->next; next_edge; next_edge = next_edge->next) {
+ /* Set up bundle fields of next edge */
+ next_edge->bundle[ABOVE][next_edge->type] = (next_edge->top.y != yb);
+ next_edge->bundle[ABOVE][!next_edge->type] = 0;
+ next_edge->bstate[ABOVE] = UNBUNDLED;
+
+ /* Bundle edges above the scanbeam boundary if they coincide */
+ if (next_edge->bundle[ABOVE][next_edge->type]) {
+ if (gpc_eq(e0->xb, next_edge->xb) && gpc_eq(e0->dx, next_edge->dx) &&
+ (e0->top.y != yb)) {
+ next_edge->bundle[ABOVE][next_edge->type] ^=
+ e0->bundle[ABOVE][next_edge->type];
+ next_edge->bundle[ABOVE][!next_edge->type] =
+ e0->bundle[ABOVE][!next_edge->type];
+ next_edge->bstate[ABOVE] = BUNDLE_HEAD;
+ e0->bundle[ABOVE][CLIP] = 0;
+ e0->bundle[ABOVE][SUBJ] = 0;
+ e0->bstate[ABOVE] = BUNDLE_TAIL;
+ }
+ e0 = next_edge;
+ }
+ }
+ horiz[CLIP] = NH;
+ horiz[SUBJ] = NH;
+
+ /* Process each edge at this scanbeam boundary */
+ for (edge = aet; edge; edge = edge->next) {
+ exists[CLIP] =
+ edge->bundle[ABOVE][CLIP] + (edge->bundle[BELOW][CLIP] << 1);
+ exists[SUBJ] =
+ edge->bundle[ABOVE][SUBJ] + (edge->bundle[BELOW][SUBJ] << 1);
+
+ if (exists[CLIP] || exists[SUBJ]) {
+ /* Set bundle side */
+ edge->bside[CLIP] = parity[CLIP];
+ edge->bside[SUBJ] = parity[SUBJ];
+
+ /* Determine contributing status and quadrant occupancies */
+ switch (op) {
+ case GPC_DIFF:
+ case GPC_INT:
+ contributing = (exists[CLIP] && (parity[SUBJ] || horiz[SUBJ])) ||
+ (exists[SUBJ] && (parity[CLIP] || horiz[CLIP])) ||
+ (exists[CLIP] && exists[SUBJ] &&
+ (parity[CLIP] == 
parity[SUBJ])); + br = (parity[CLIP]) && (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) && + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) && + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) && + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + case GPC_XOR: + contributing = exists[CLIP] || exists[SUBJ]; + br = (parity[CLIP]) ^ (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) ^ + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) ^ + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) ^ + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + case GPC_UNION: + contributing = (exists[CLIP] && (!parity[SUBJ] || horiz[SUBJ])) || + (exists[SUBJ] && (!parity[CLIP] || horiz[CLIP])) || + (exists[CLIP] && exists[SUBJ] && + (parity[CLIP] == parity[SUBJ])); + br = (parity[CLIP]) || (parity[SUBJ]); + bl = (parity[CLIP] ^ edge->bundle[ABOVE][CLIP]) || + (parity[SUBJ] ^ edge->bundle[ABOVE][SUBJ]); + tr = (parity[CLIP] ^ (horiz[CLIP] != NH)) || + (parity[SUBJ] ^ (horiz[SUBJ] != NH)); + tl = (parity[CLIP] ^ (horiz[CLIP] != NH) ^ + edge->bundle[BELOW][CLIP]) || + (parity[SUBJ] ^ (horiz[SUBJ] != NH) ^ + edge->bundle[BELOW][SUBJ]); + break; + } + + // Update parity + parity[CLIP] ^= edge->bundle[ABOVE][CLIP]; + parity[SUBJ] ^= edge->bundle[ABOVE][SUBJ]; + + /* Update horizontal state */ + if (exists[CLIP]) { + horiz[CLIP] = next_h_state[horiz[CLIP]] + [((exists[CLIP] - 1) << 1) + parity[CLIP]]; + } + if (exists[SUBJ]) { + horiz[SUBJ] = next_h_state[horiz[SUBJ]] + [((exists[SUBJ] - 1) << 1) + parity[SUBJ]]; + } + vclass = tr + (tl << 1) + (br << 2) + (bl << 3); + + if (contributing) { + xb = edge->xb; + switch (vclass) { + case EMN: + new_tristrip(&tlist, edge, xb, yb); + cf = edge; + break; + case ERI: + edge->outp[ABOVE] = cf->outp[ABOVE]; + if (xb != cf->xb) { + gpc_vertex_create(edge, ABOVE, RIGHT, xb, yb); + } + cf = NULL; + break; + case ELI: + gpc_vertex_create(edge, BELOW, LEFT, xb, yb); + edge->outp[ABOVE] = NULL; + cf = edge; + break; + case EMX: + if (xb != cf->xb) { + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + } + edge->outp[ABOVE] = NULL; + cf = NULL; + break; + case IMN: + if (cft == LED) { + if (cf->bot.y != yb) { + gpc_vertex_create(cf, BELOW, LEFT, cf->xb, yb); + } + new_tristrip(&tlist, cf, cf->xb, yb); + } + edge->outp[ABOVE] = cf->outp[ABOVE]; + gpc_vertex_create(edge, ABOVE, RIGHT, xb, yb); + break; + case ILI: + new_tristrip(&tlist, edge, xb, yb); + cf = edge; + cft = ILI; + break; + case IRI: + if (cft == LED) { + if (cf->bot.y != yb) { + gpc_vertex_create(cf, BELOW, LEFT, cf->xb, yb); + } + new_tristrip(&tlist, cf, cf->xb, yb); + } + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + edge->outp[ABOVE] = NULL; + break; + case IMX: + gpc_vertex_create(edge, BELOW, LEFT, xb, yb); + edge->outp[ABOVE] = NULL; + cft = IMX; + break; + case IMM: + gpc_vertex_create(edge, BELOW, LEFT, xb, yb); + edge->outp[ABOVE] = cf->outp[ABOVE]; + if (xb != cf->xb) { + gpc_vertex_create(cf, ABOVE, RIGHT, xb, yb); + } + cf = edge; + break; + case EMM: + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + edge->outp[ABOVE] = NULL; + new_tristrip(&tlist, edge, xb, yb); + cf = edge; + break; + case LED: + if (edge->bot.y == yb) { + gpc_vertex_create(edge, BELOW, LEFT, xb, yb); + } + edge->outp[ABOVE] = edge->outp[BELOW]; 
+ cf = edge; + cft = LED; + break; + case RED: + edge->outp[ABOVE] = cf->outp[ABOVE]; + if (cft == LED) { + if (cf->bot.y == yb) { + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + } else { + if (edge->bot.y == yb) { + gpc_vertex_create(cf, BELOW, LEFT, cf->xb, yb); + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + } + } + } else { + gpc_vertex_create(edge, BELOW, RIGHT, xb, yb); + gpc_vertex_create(edge, ABOVE, RIGHT, xb, yb); + } + cf = NULL; + break; + default: + break; + } /* End of switch */ + } /* End of contributing conditional */ + } /* End of edge exists conditional */ + } // End of AET loop + + /* Delete terminating edges from the AET, otherwise compute xt */ + for (edge = aet; edge; edge = edge->next) { + if (edge->top.y == yb) { + prev_edge = edge->prev; + next_edge = edge->next; + if (prev_edge) { + prev_edge->next = next_edge; + } else { + aet = next_edge; + } + if (next_edge) { + next_edge->prev = prev_edge; + } + + /* Copy bundle head state to the adjacent tail edge if required */ + if ((edge->bstate[BELOW] == BUNDLE_HEAD) && prev_edge) { + if (prev_edge->bstate[BELOW] == BUNDLE_TAIL) { + prev_edge->outp[BELOW] = edge->outp[BELOW]; + prev_edge->bstate[BELOW] = UNBUNDLED; + if (prev_edge->prev) { + if (prev_edge->prev->bstate[BELOW] == BUNDLE_TAIL) { + prev_edge->bstate[BELOW] = BUNDLE_HEAD; + } + } + } + } + } else { + if (edge->top.y == yt) { + edge->xt = edge->top.x; + } else { + edge->xt = edge->bot.x + edge->dx * (yt - edge->bot.y); + } + } + } + + if (scanbeam < sbt_entries) { + /* === SCANBEAM INTERIOR PROCESSING ============================== */ + build_intersection_table(&it, aet, dy); + /* Process each node in the intersection table */ + for (intersect = it; intersect; intersect = intersect->next) { + e0 = intersect->ie[0]; + e1 = intersect->ie[1]; + + /* Only generate output for contributing intersections */ + if ((e0->bundle[ABOVE][CLIP] || e0->bundle[ABOVE][SUBJ]) && + (e1->bundle[ABOVE][CLIP] || e1->bundle[ABOVE][SUBJ])) { + p = e0->outp[ABOVE]; + q = e1->outp[ABOVE]; + ix = intersect->point.x; + iy = intersect->point.y + yb; + + in[CLIP] = (e0->bundle[ABOVE][CLIP] && !e0->bside[CLIP]) || + (e1->bundle[ABOVE][CLIP] && e1->bside[CLIP]) || + (!e0->bundle[ABOVE][CLIP] && !e1->bundle[ABOVE][CLIP] && + e0->bside[CLIP] && e1->bside[CLIP]); + in[SUBJ] = (e0->bundle[ABOVE][SUBJ] && !e0->bside[SUBJ]) || + (e1->bundle[ABOVE][SUBJ] && e1->bside[SUBJ]) || + (!e0->bundle[ABOVE][SUBJ] && !e1->bundle[ABOVE][SUBJ] && + e0->bside[SUBJ] && e1->bside[SUBJ]); + + switch (op) { // Determine quadrant occupancies + case GPC_DIFF: + case GPC_INT: + tr = (in[CLIP]) && (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); + bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) && + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + case GPC_XOR: + tr = (in[CLIP]) ^ (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) ^ + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) ^ + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); + bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) ^ + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + case GPC_UNION: + tr = (in[CLIP]) || (in[SUBJ]); + tl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ]); + br = (in[CLIP] ^ e0->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e0->bundle[ABOVE][SUBJ]); 
+ bl = (in[CLIP] ^ e1->bundle[ABOVE][CLIP] ^ + e0->bundle[ABOVE][CLIP]) || + (in[SUBJ] ^ e1->bundle[ABOVE][SUBJ] ^ + e0->bundle[ABOVE][SUBJ]); + break; + } + + vclass = tr + (tl << 1) + (br << 2) + (bl << 3); + switch (vclass) { + case EMN: + new_tristrip(&tlist, e1, ix, iy); + e0->outp[ABOVE] = e1->outp[ABOVE]; + break; + case ERI: + if (p) { + gpc_p_edge(prev_edge, e0, ABOVE); + gpc_vertex_create(prev_edge, ABOVE, LEFT, px, iy); + gpc_vertex_create(e0, ABOVE, RIGHT, ix, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + e0->outp[ABOVE] = NULL; + } + break; + case ELI: + if (q) { + gpc_n_edge(next_edge, e1, ABOVE); + gpc_vertex_create(e1, ABOVE, LEFT, ix, iy); + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + e0->outp[ABOVE] = e1->outp[ABOVE]; + e1->outp[ABOVE] = NULL; + } + break; + case EMX: + if (p && q) { + gpc_vertex_create(e0, ABOVE, LEFT, ix, iy); + e0->outp[ABOVE] = NULL; + e1->outp[ABOVE] = NULL; + } + break; + case IMN: + gpc_p_edge(prev_edge, e0, ABOVE); + gpc_vertex_create(prev_edge, ABOVE, LEFT, px, iy); + gpc_n_edge(next_edge, e1, ABOVE); + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + new_tristrip(&tlist, prev_edge, px, iy); + e1->outp[ABOVE] = prev_edge->outp[ABOVE]; + gpc_vertex_create(e1, ABOVE, RIGHT, ix, iy); + new_tristrip(&tlist, e0, ix, iy); + next_edge->outp[ABOVE] = e0->outp[ABOVE]; + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + break; + case ILI: + if (p) { + gpc_vertex_create(e0, ABOVE, LEFT, ix, iy); + gpc_n_edge(next_edge, e1, ABOVE); + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + e1->outp[ABOVE] = e0->outp[ABOVE]; + e0->outp[ABOVE] = NULL; + } + break; + case IRI: + if (q) { + gpc_vertex_create(e1, ABOVE, RIGHT, ix, iy); + gpc_p_edge(prev_edge, e0, ABOVE); + gpc_vertex_create(prev_edge, ABOVE, LEFT, px, iy); + e0->outp[ABOVE] = e1->outp[ABOVE]; + e1->outp[ABOVE] = NULL; + } + break; + case IMX: + if (p && q) { + gpc_vertex_create(e0, ABOVE, RIGHT, ix, iy); + gpc_vertex_create(e1, ABOVE, LEFT, ix, iy); + e0->outp[ABOVE] = NULL; + e1->outp[ABOVE] = NULL; + gpc_p_edge(prev_edge, e0, ABOVE); + gpc_vertex_create(prev_edge, ABOVE, LEFT, px, iy); + new_tristrip(&tlist, prev_edge, px, iy); + gpc_n_edge(next_edge, e1, ABOVE); + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + next_edge->outp[ABOVE] = prev_edge->outp[ABOVE]; + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + } + break; + case IMM: + if (p && q) { + gpc_vertex_create(e0, ABOVE, RIGHT, ix, iy); + gpc_vertex_create(e1, ABOVE, LEFT, ix, iy); + gpc_p_edge(prev_edge, e0, ABOVE); + gpc_vertex_create(prev_edge, ABOVE, LEFT, px, iy); + new_tristrip(&tlist, prev_edge, px, iy); + gpc_n_edge(next_edge, e1, ABOVE); + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + e1->outp[ABOVE] = prev_edge->outp[ABOVE]; + gpc_vertex_create(e1, ABOVE, RIGHT, ix, iy); + new_tristrip(&tlist, e0, ix, iy); + next_edge->outp[ABOVE] = e0->outp[ABOVE]; + gpc_vertex_create(next_edge, ABOVE, RIGHT, nx, iy); + } + break; + case EMM: + if (p && q) { + gpc_vertex_create(e0, ABOVE, LEFT, ix, iy); + new_tristrip(&tlist, e1, ix, iy); + e0->outp[ABOVE] = e1->outp[ABOVE]; + } + break; + default: + break; + } /* End of switch */ + } /* End of contributing intersection conditional */ + + // Swap bundle sides in response to edge crossing + if (e0->bundle[ABOVE][CLIP]) { + e1->bside[CLIP] = !e1->bside[CLIP]; + } + if (e1->bundle[ABOVE][CLIP]) { + e0->bside[CLIP] = !e0->bside[CLIP]; + } + if (e0->bundle[ABOVE][SUBJ]) { + e1->bside[SUBJ] = !e1->bside[SUBJ]; + } + if (e1->bundle[ABOVE][SUBJ]) { + e0->bside[SUBJ] = 
!e0->bside[SUBJ];
+ }
+
+ /* Swap e0 and e1 bundles in the AET */
+ prev_edge = e0->prev;
+ next_edge = e1->next;
+ if (e1->next) {
+ e1->next->prev = e0;
+ }
+
+ if (e0->bstate[ABOVE] == BUNDLE_HEAD) {
+ search = 1;
+ while (search) {
+ prev_edge = prev_edge->prev;
+ if (prev_edge) {
+ if (prev_edge->bundle[ABOVE][CLIP] ||
+ prev_edge->bundle[ABOVE][SUBJ] ||
+ (prev_edge->bstate[ABOVE] == BUNDLE_HEAD)) {
+ search = 0;
+ }
+ } else {
+ search = 0;
+ }
+ }
+ }
+ if (!prev_edge) {
+ e1->next = aet;
+ aet = e0->next;
+ } else {
+ e1->next = prev_edge->next;
+ prev_edge->next = e0->next;
+ }
+ e0->next->prev = prev_edge;
+ e1->next->prev = e1;
+ e0->next = next_edge;
+ } /* End of IT loop */
+
+ /* Prepare for next scanbeam */
+ for (edge = aet; edge; edge = next_edge) {
+ next_edge = edge->next;
+ succ_edge = edge->succ;
+
+ if ((edge->top.y == yt) && succ_edge) {
+ /* Replace AET edge by its successor */
+ succ_edge->outp[BELOW] = edge->outp[ABOVE];
+ succ_edge->bstate[BELOW] = edge->bstate[ABOVE];
+ succ_edge->bundle[BELOW][CLIP] = edge->bundle[ABOVE][CLIP];
+ succ_edge->bundle[BELOW][SUBJ] = edge->bundle[ABOVE][SUBJ];
+ prev_edge = edge->prev;
+ if (prev_edge) {
+ prev_edge->next = succ_edge;
+ } else {
+ aet = succ_edge;
+ }
+ if (next_edge) {
+ next_edge->prev = succ_edge;
+ }
+ succ_edge->prev = prev_edge;
+ succ_edge->next = next_edge;
+ } else {
+ /* Update this edge */
+ edge->outp[BELOW] = edge->outp[ABOVE];
+ edge->bstate[BELOW] = edge->bstate[ABOVE];
+ edge->bundle[BELOW][CLIP] = edge->bundle[ABOVE][CLIP];
+ edge->bundle[BELOW][SUBJ] = edge->bundle[ABOVE][SUBJ];
+ edge->xb = edge->xt;
+ }
+ edge->outp[ABOVE] = NULL;
+ }
+ }
+ } /* === END OF SCANBEAM PROCESSING ================================== */
+
+ // Generate result tristrip from tlist
+ result->strip = NULL;
+ result->num_strips = count_tristrips(tlist);
+ if (result->num_strips > 0) {
+ gpc_malloc<gpc_vertex_list>(result->strip,
+ result->num_strips * sizeof(gpc_vertex_list),
+ const_cast<char *>("tristrip list creation"));
+
+ s = 0;
+ for (tn = tlist; tn; tn = tnn) {
+ tnn = tn->next;
+ if (tn->active > 2) {
+ /* Valid tristrip: copy the vertices and free the heap */
+ result->strip[s].num_vertices = tn->active;
+ gpc_malloc<gpc_vertex>(result->strip[s].vertex,
+ tn->active * sizeof(gpc_vertex),
+ const_cast<char *>("tristrip creation"));
+ v = 0;
+ if (0) {
+ lt = tn->v[RIGHT];
+ rt = tn->v[LEFT];
+ } else {
+ lt = tn->v[LEFT];
+ rt = tn->v[RIGHT];
+ }
+ while (lt || rt) {
+ if (lt) {
+ ltn = lt->next;
+ result->strip[s].vertex[v].x = lt->x;
+ result->strip[s].vertex[v].y = lt->y;
+ v++;
+ gpc_free(lt);
+ lt = ltn;
+ }
+ if (rt) {
+ rtn = rt->next;
+ result->strip[s].vertex[v].x = rt->x;
+ result->strip[s].vertex[v].y = rt->y;
+ v++;
+ gpc_free(rt);
+ rt = rtn;
+ }
+ }
+ s++;
+ } else {
+ /* Invalid tristrip: just free the heap */
+ for (lt = tn->v[LEFT]; lt; lt = ltn) {
+ ltn = lt->next;
+ gpc_free(lt);
+ }
+ for (rt = tn->v[RIGHT]; rt; rt = rtn) {
+ rtn = rt->next;
+ gpc_free(rt);
+ }
+ }
+ gpc_free(tn);
+ }
+ }
+ // Tidy up
+ reset_it(&it);
+ reset_lmt(&lmt);
+ gpc_free(c_heap);
+ gpc_free(s_heap);
+ gpc_free(sbt);
+} // NOLINT
+
+} // namespace gpc
+
+/* vim: set expandtab ts=4 sw=4 sts=4 tw=100: */
diff --git a/paddle/fluid/operators/detection/gpc.h b/paddle/fluid/operators/detection/gpc.h
new file mode 100644
index 0000000000000000000000000000000000000000..ee86262ef2c486e4eaeeeaf56c2392d2a1c5851b
--- /dev/null
+++ b/paddle/fluid/operators/detection/gpc.h
@@ -0,0 +1,246 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+/***************************************************************************
+ *
+ * Copyright (c) 2015 Baidu.com, Inc. All Rights Reserved
+ *
+ **************************************************************************/
+
+/**
+ * @file include/gpc.h
+ * @author huhan02(com@baidu.com)
+ * @date 2015/12/18 13:52:10
+ * @brief
+ *
+ * @modified by sunyipeng
+ * @email sunyipeng@baidu.com
+ * @date 2018/6/12
+ **/
+
+#ifndef PADDLE_FLUID_OPERATORS_DETECTION_GPC_H_ // GPC_H_
+#define PADDLE_FLUID_OPERATORS_DETECTION_GPC_H_ // GPC_H_
+
+#include <float.h>
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+namespace gpc {
+
+typedef enum { // Set operation type
+ GPC_DIFF, // Difference
+ GPC_INT, // Intersection
+ GPC_XOR, // Exclusive or
+ GPC_UNION // Union
+} gpc_op;
+
+typedef struct { // Polygon vertex structure
+ double x; // Vertex x component
+ double y; // Vertex y component
+} gpc_vertex;
+
+typedef struct { // Vertex list structure
+ int num_vertices; // Number of vertices in list
+ gpc_vertex *vertex; // Vertex array pointer
+} gpc_vertex_list;
+
+typedef struct { // Polygon set structure
+ int num_contours; // Number of contours in polygon
+ int *hole; // Hole / external contour flags
+ gpc_vertex_list *contour; // Contour array pointer
+} gpc_polygon;
+
+typedef struct { // Tristrip set structure
+ int num_strips; // Number of tristrips
+ gpc_vertex_list *strip; // Tristrip array pointer
+} gpc_tristrip;
+
+typedef enum { LEFT, RIGHT } gpc_left_right;
+
+typedef enum { ABOVE, BELOW } gpc_above_below;
+
+typedef enum { CLIP, SUBJ } gpc_clip_subj;
+
+typedef enum { /* Edge intersection classes */
+ NUL, /* Empty non-intersection */
+ EMX, /* External maximum */
+ ELI, /* External left intermediate */
+ TED, /* Top edge */
+ ERI, /* External right intermediate */
+ RED, /* Right edge */
+ IMM, /* Internal maximum and minimum */
+ IMN, /* Internal minimum */
+ EMN, /* External minimum */
+ EMM, /* External maximum and minimum */
+ LED, /* Left edge */
+ ILI, /* Internal left intermediate */
+ BED, /* Bottom edge */
+ IRI, /* Internal right intermediate */
+ IMX, /* Internal maximum */
+ FUL /* Full non-intersection */
+} vertex_type;
+
+typedef enum { /* Horizontal edge states */
+ NH, /* No horizontal edge */
+ BH, /* Bottom horizontal edge */
+ TH /* Top horizontal edge */
+} h_state;
+
+typedef enum { /* Edge bundle state */
+ UNBUNDLED, /* Isolated edge not within a bundle */
+ BUNDLE_HEAD, /* Bundle head node */
+ BUNDLE_TAIL /* Passive bundle tail node */
+} bundle_state;
+
+typedef struct v_shape { /* Internal vertex list datatype */
+ double x; /* X coordinate component */
+ double y; /* Y coordinate component */
+ struct v_shape *next; /* Pointer to next vertex in list */
+} vertex_node;
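+
+/* Editorial note (not in the original GPC sources): contours under
+ * construction are kept as singly-linked vertex lists addressed from both
+ * ends; in the p_shape type below, v[LEFT] points at the head and v[RIGHT]
+ * at the tail, so add_left/add_right in gpc.cc both run in O(1):
+ *   v[LEFT] -> (x0,y0) -> (x1,y1) -> ... -> (xn,yn) <- v[RIGHT] */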
+
+typedef struct p_shape { /* Internal contour / tristrip type */
+ int active; /* Active flag / vertex count */
+ int hole; /* Hole / external contour flag */
+ vertex_node *v[2]; /* Left and right vertex list ptrs */
+ struct p_shape *next; /* Pointer to next polygon contour */
+ struct p_shape *proxy; /* Pointer to actual structure used */
+} polygon_node;
+
+typedef struct edge_shape {
+ gpc_vertex vertex; /* Piggy-backed contour vertex data */
+ gpc_vertex bot; /* Edge lower (x, y) coordinate */
+ gpc_vertex top; /* Edge upper (x, y) coordinate */
+ double xb; /* Scanbeam bottom x coordinate */
+ double xt; /* Scanbeam top x coordinate */
+ double dx; /* Change in x for a unit y increase */
+ int type; /* Clip / subject edge flag */
+ int bundle[2][2]; /* Bundle edge flags */
+ int bside[2]; /* Bundle left / right indicators */
+ bundle_state bstate[2]; /* Edge bundle state */
+ polygon_node *outp[2]; /* Output polygon / tristrip pointer */
+ struct edge_shape *prev; /* Previous edge in the AET */
+ struct edge_shape *next; /* Next edge in the AET */
+ struct edge_shape *pred; /* Edge connected at the lower end */
+ struct edge_shape *succ; /* Edge connected at the upper end */
+ struct edge_shape *next_bound; /* Pointer to next bound in LMT */
+} edge_node;
+
+inline bool gpc_eq(float a, float b) { return (fabs(a - b) <= 1e-6); }
+
+inline int gpc_prev_index(int i, int n) { return ((i - 1 + n) % n); }
+
+inline int gpc_next_index(int i, int n) { return ((i + 1) % n); }
+
+inline int gpc_optimal(gpc_vertex *v, int i, int n) {
+ return (v[(i + 1) % n].y != v[i].y || v[(i - 1 + n) % n].y != v[i].y);
+}
+
+inline int gpc_fwd_min(edge_node *v, int i, int n) {
+ return (v[(i + 1) % n].vertex.y > v[i].vertex.y &&
+ v[(i - 1 + n) % n].vertex.y >= v[i].vertex.y);
+}
+
+inline int gpc_not_fmax(edge_node *v, int i, int n) {
+ return (v[(i + 1) % n].vertex.y > v[i].vertex.y);
+}
+
+inline int gpc_rev_min(edge_node *v, int i, int n) {
+ return (v[(i + 1) % n].vertex.y >= v[i].vertex.y &&
+ v[(i - 1 + n) % n].vertex.y > v[i].vertex.y);
+}
+
+inline int gpc_not_rmax(edge_node *v, int i, int n) {
+ return (v[(i - 1 + n) % n].vertex.y > v[i].vertex.y);
+}
+
+// inline void gpc_p_edge(edge_node *d, edge_node *e, int p, double i, double j)
+// {
+inline void gpc_p_edge(edge_node *&d, edge_node *e, int p) {
+ d = e;
+ do {
+ d = d->prev;
+ } while (!d->outp[p]);
+ // i = d->bot.x + d->dx * (j - d->bot.y);
+}
+
+// inline void gpc_n_edge(edge_node *d, edge_node *e, int p, double i, double j)
+// {
+inline void gpc_n_edge(edge_node *&d, edge_node *e, int p) {
+ d = e;
+ do {
+ d = d->next;
+ } while (!d->outp[p]);
+ // i = d->bot.x + d->dx * (j - d->bot.y);
+}
+
+template <typename T>
+void gpc_malloc(T *&p, int b, char *s) {
+ if (b > 0) {
+ p = (T *)malloc(b);
+
+ if (!p) {
+ fprintf(stderr, "gpc malloc failure: %s\n", s);
+ exit(0);
+ }
+ } else {
+ p = NULL;
+ }
+}
+
+template <typename T>
+void gpc_free(T *&p) {
+ if (p) {
+ free(p);
+ p = NULL;
+ }
+}
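+
+/* Editorial sketch (not in the original sources): typical use of the typed
+ * allocation helpers above, mirroring the call sites in gpc.cc:
+ *   double *sbt = NULL;
+ *   gpc_malloc<double>(sbt, n * sizeof(double),
+ *                      const_cast<char *>("sbt creation"));
+ *   ...use sbt...
+ *   gpc_free<double>(sbt);  // frees the block and resets sbt to NULL
+ */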
+
+/*
+===========================================================================
+ Public Function Prototypes
+===========================================================================
+*/
+
+void add_vertex(vertex_node **t, double x, double y);
+
+void gpc_vertex_create(edge_node *e, int p, int s, double x, double y);
+
+/*
+void gpc_read_polygon(FILE *infile_ptr, int read_hole_flags,
+ gpc_polygon *polygon);
+
+void gpc_write_polygon(FILE *outfile_ptr, int write_hole_flags,
+ gpc_polygon *polygon);
+*/
+void gpc_add_contour(gpc_polygon *polygon, gpc_vertex_list *contour, int hole);
+
+void gpc_polygon_clip(gpc_op set_operation, gpc_polygon *subject_polygon,
+ gpc_polygon *clip_polygon, gpc_polygon *result_polygon);
+
+void gpc_tristrip_clip(gpc_op set_operation, gpc_polygon *subject_polygon,
+ gpc_polygon *clip_polygon,
+ gpc_tristrip *result_tristrip);
+
+void gpc_polygon_to_tristrip(gpc_polygon *polygon, gpc_tristrip *tristrip);
+
+void gpc_free_polygon(gpc_polygon *polygon);
+
+void gpc_free_tristrip(gpc_tristrip *tristrip);
+
+} // namespace gpc
+
+#endif // PADDLE_FLUID_OPERATORS_DETECTION_GPC_H_
+/* vim: set expandtab ts=4 sw=4 sts=4 tw=100: */
diff --git a/paddle/fluid/operators/detection/multiclass_nms_op.cc b/paddle/fluid/operators/detection/multiclass_nms_op.cc
index 60b93efdce810f8552374449fe5a6fc79b1a92c1..f0f8851be0ec2b532c570dc82b8ed5c290981aab 100644
--- a/paddle/fluid/operators/detection/multiclass_nms_op.cc
+++ b/paddle/fluid/operators/detection/multiclass_nms_op.cc
@@ -9,10 +9,11 @@ http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
 #include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/operators/detection/poly_util.h"
 
 namespace paddle {
 namespace operators {
@@ -20,9 +21,6 @@ namespace operators {
 using Tensor = framework::Tensor;
 using LoDTensor = framework::LoDTensor;
 
-constexpr int64_t kOutputDim = 6;
-constexpr int64_t kBBoxSize = 4;
-
 class MultiClassNMSOp : public framework::OperatorWithKernel {
 public:
 using framework::OperatorWithKernel::OperatorWithKernel;
@@ -38,22 +36,29 @@ class MultiClassNMSOp : public framework::OperatorWithKernel {
 auto box_dims = ctx->GetInputDim("BBoxes");
 auto score_dims = ctx->GetInputDim("Scores");
 
- PADDLE_ENFORCE_EQ(box_dims.size(), 3,
- "The rank of Input(BBoxes) must be 3.");
- PADDLE_ENFORCE_EQ(score_dims.size(), 3,
- "The rank of Input(Scores) must be 3.");
- PADDLE_ENFORCE_EQ(box_dims[2], 4,
- "The 2nd dimension of Input(BBoxes) must be 4, "
- "represents the layout of coordinate "
- "[xmin, ymin, xmax, ymax]");
- PADDLE_ENFORCE_EQ(box_dims[1], score_dims[2],
- "The 1st dimensiong of Input(BBoxes) must be equal to "
- "3rd dimension of Input(Scores), which represents the "
- "predicted bboxes.");
-
+ if (ctx->IsRuntime()) {
+ PADDLE_ENFORCE_EQ(box_dims.size(), 3,
+ "The rank of Input(BBoxes) must be 3.");
+ PADDLE_ENFORCE_EQ(score_dims.size(), 3,
+ "The rank of Input(Scores) must be 3.");
+ PADDLE_ENFORCE(box_dims[2] == 4 || box_dims[2] == 8 ||
+ box_dims[2] == 16 || box_dims[2] == 24 ||
+ box_dims[2] == 32,
+ "The 2nd dimension of Input(BBoxes) must be 4, 8, 16, 24 "
+ "or 32, representing the layout of the coordinates: "
+ "[xmin, ymin, xmax, ymax] for 4, or "
+ "[x1, y1, ..., xn, yn] for n = 4, 8, 12 or 16 points.");
+ PADDLE_ENFORCE_EQ(box_dims[1], score_dims[2],
+ "The 1st dimension of Input(BBoxes) must be equal to "
+ "the 3rd dimension of Input(Scores), which represents "
+ "the predicted bboxes.");
+ }
 // Here the box_dims[0] is not the real dimension of output.
 // It will be rewritten in the computing kernel.
- ctx->SetOutputDim("Out", {box_dims[1], 6});
+ ctx->SetOutputDim("Out", {box_dims[1], box_dims[2] + 2});
 }
 
 protected:
@@ -128,6 +133,21 @@ static inline T JaccardOverlap(const T* box1, const T* box2,
 }
 }
 
+template <class T>
+T PolyIoU(const T* box1, const T* box2, const size_t box_size,
+ const bool normalized) {
+ T bbox1_area = PolyArea<T>(box1, box_size, normalized);
+ T bbox2_area = PolyArea<T>(box2, box_size, normalized);
+ T inter_area = PolyOverlapArea<T>(box1, box2, box_size, normalized);
+ if (bbox1_area == 0 || bbox2_area == 0 || inter_area == 0) {
+ // If the coordinate values are invalid
+ // or the area is <= 0, return 0.
+ return T(0.);
+ } else {
+ return inter_area / (bbox1_area + bbox2_area - inter_area);
+ }
+}
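+// Editorial worked example (not part of the operator): for two axis-aligned
+// unit squares whose intersection area is 0.5, PolyIoU returns
+// 0.5 / (1.0 + 1.0 - 0.5) = 1/3; any degenerate input (zero area or empty
+// overlap) short-circuits to 0 above.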
out_dim}, ctx.GetPlace()); for (int64_t i = 0; i < batch_size; ++i) { Tensor ins_score = scores->Slice(i, i + 1); ins_score.Resize({class_num, predict_dim}); @@ -311,10 +347,11 @@ class MultiClassNMSOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { AddInput("BBoxes", - "(Tensor) A 3-D Tensor with shape [N, M, 4] represents the " + "(Tensor) A 3-D Tensor with shape " + "[N, M, 4 or 8 16 24 32] represents the " "predicted locations of M bounding bboxes, N is the batch size. " "Each bounding box has four coordinate values and the layout is " - "[xmin, ymin, xmax, ymax]."); + "[xmin, ymin, xmax, ymax], when box size equals to 4."); AddInput("Scores", "(Tensor) A 3-D Tensor with shape [N, C, M] represents the " "predicted confidence predictions. N is the batch size, C is the " @@ -351,8 +388,12 @@ class MultiClassNMSOpMaker : public framework::OpProtoAndCheckerMaker { AddOutput("Out", "(LoDTensor) A 2-D LoDTensor with shape [No, 6] represents the " "detections. Each row has 6 values: " - "[label, confidence, xmin, ymin, xmax, ymax], No is the total " - "number of detections in this mini-batch. For each instance, " + "[label, confidence, xmin, ymin, xmax, ymax] or " + "(LoDTensor) A 2-D LoDTensor with shape [No, 10] represents the " + "detections. Each row has 10 values: " + "[label, confidence, x1, y1, x2, y2, x3, y3, x4, y4]. No is the " + "total number of detections in this mini-batch." + "For each instance, " "the offsets in first dimension are called LoD, the number of " "offset is N + 1, if LoD[i + 1] - LoD[i] == 0, means there is " "no detected bbox."); diff --git a/paddle/fluid/operators/detection/poly_util.cc b/paddle/fluid/operators/detection/poly_util.cc new file mode 100644 index 0000000000000000000000000000000000000000..1af2c95c6cf526d651b196b54614a21a9cddde8c --- /dev/null +++ b/paddle/fluid/operators/detection/poly_util.cc @@ -0,0 +1,132 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#ifndef POLY_UTIL_CC_ +#define POLY_UTIL_CC_ + +#include "paddle/fluid/operators/detection/poly_util.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using gpc::gpc_polygon_clip; +using gpc::gpc_free_polygon; + +template +void Array2PointVec(const T*& box, const size_t box_size, + std::vector>& vec) { + size_t pts_num = box_size / 2; + vec.resize(pts_num); + for (size_t i = 0; i < pts_num; i++) { + vec.at(i).x = box[2 * i]; + vec.at(i).y = box[2 * i + 1]; + } +} + +template +void Array2Poly(const T*& box, const size_t box_size, gpc::gpc_polygon& poly) { + size_t pts_num = box_size / 2; + poly.num_contours = 1; + poly.hole = (int*)malloc(sizeof(int)); + poly.hole[0] = 0; + poly.contour = (gpc::gpc_vertex_list*)malloc(sizeof(gpc::gpc_vertex_list)); + poly.contour->num_vertices = pts_num; + poly.contour->vertex = + (gpc::gpc_vertex*)malloc(sizeof(gpc::gpc_vertex) * pts_num); + for (size_t i = 0; i < pts_num; ++i) { + poly.contour->vertex[i].x = box[2 * i]; + poly.contour->vertex[i].y = box[2 * i + 1]; + } +} + +template +void PointVec2Poly(const std::vector>& vec, gpc::gpc_polygon& poly) { + int pts_num = vec.size(); + poly.num_contours = 1; + poly.hole = (int*)malloc(sizeof(int)); + poly.hole[0] = 0; + poly.contour = (gpc::gpc_vertex_list*)malloc(sizeof(gpc::gpc_vertex_list)); + poly.contour->num_vertices = pts_num; + poly.contour->vertex = + (gpc::gpc_vertex*)malloc(sizeof(gpc::gpc_vertex) * pts_num); + for (size_t i = 0; i < pts_num; ++i) { + poly.contour->vertex[i].x = vec[i].x; + poly.contour->vertex[i].y = vec[i].y; + } +} + +template +void Poly2PointVec(const gpc::gpc_vertex_list& contour, + std::vector>& vec) { + int pts_num = contour.num_vertices; + vec.resize(pts_num); + for (int i = 0; i < pts_num; i++) { + vec.at(i).x = contour.vertex[i].x; + vec.at(i).y = contour.vertex[i].y; + } +} + +template +T GetContourArea(std::vector>& vec) { + size_t pts_num = vec.size(); + if (pts_num < 3) return T(0.); + T area = T(0.); + for (size_t i = 0; i < pts_num; ++i) { + area += vec[i].x * vec[(i + 1) % pts_num].y - + vec[i].y * vec[(i + 1) % pts_num].x; + } + return std::fabs(area / 2.0); +} + +template +T PolyArea(const T* box, const size_t box_size, const bool normalized) { + // If coordinate values are is invalid + // if area size <= 0, return 0. 
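+ // Note: GetContourArea above is the shoelace formula,
+ //   area = |sum_i (x_i * y_{i+1} - y_i * x_{i+1})| / 2, indices mod pts_num.
+ // For example, the unit square (0,0),(1,0),(1,1),(0,1) gives
+ // |0 + 1 + 1 + 0| / 2 = 1.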
+ std::vector<Point_<T>> vec; + Array2PointVec(box, box_size, vec); + return GetContourArea(vec); +} + +template <class T> +T PolyOverlapArea(const T* box1, const T* box2, const size_t box_size, + const bool normalized) { + gpc::gpc_polygon poly1; + gpc::gpc_polygon poly2; + Array2Poly(box1, box_size, poly1); + Array2Poly(box2, box_size, poly2); + gpc::gpc_polygon respoly; + gpc::gpc_op op = gpc::GPC_INT; + gpc::gpc_polygon_clip(op, &poly2, &poly1, &respoly); + + T inter_area = T(0.); + int contour_num = respoly.num_contours; + for (int i = 0; i < contour_num; ++i) { + std::vector<Point_<T>> resvec; + Poly2PointVec(respoly.contour[i], resvec); + // inter_area += std::fabs(cv::contourArea(resvec)) + 0.5f * + // (cv::arcLength(resvec, true)); + inter_area += GetContourArea(resvec); + } + + gpc::gpc_free_polygon(&poly1); + gpc::gpc_free_polygon(&poly2); + gpc::gpc_free_polygon(&respoly); + return inter_area; +} + +} // namespace operators +} // namespace paddle + +#endif diff --git a/paddle/fluid/operators/detection/poly_util.h b/paddle/fluid/operators/detection/poly_util.h new file mode 100644 index 0000000000000000000000000000000000000000..f07baf72d9ff07b8fcb45dcfb2a35741fb1aeed0 --- /dev/null +++ b/paddle/fluid/operators/detection/poly_util.h @@ -0,0 +1,73 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License.
*/ + +#ifndef POLY_UTIL_H_ +#define POLY_UTIL_H_ + +#include <vector> +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/detection/gpc.h" + +namespace paddle { +namespace operators { + +template <class T> +class Point_ { + public: + // default constructor + Point_() {} + Point_(T _x, T _y) : x(_x), y(_y) {} + Point_(const Point_& pt) : x(pt.x), y(pt.y) {} + + Point_& operator=(const Point_& pt); + // conversion to another data type + // template<typename _T> operator Point_<_T>() const; + // conversion to the old-style C structures + // operator Vec<T, 2>() const; + + // checks whether the point is inside the specified rectangle + // bool inside(const Rect_<T>& r) const; + T x; //!< x coordinate of the point + T y; //!< y coordinate of the point +}; + +template <class T> +void Array2PointVec(const T*& box, const size_t box_size, + std::vector<Point_<T>>& vec); + +template <class T> +void Array2Poly(const T*& box, const size_t box_size, gpc::gpc_polygon& poly); + +template <class T> +void PointVec2Poly(const std::vector<Point_<T>>& vec, gpc::gpc_polygon& poly); + +template <class T> +void Poly2PointVec(const gpc::gpc_vertex_list& contour, + std::vector<Point_<T>>& vec); + +template <class T> +T GetContourArea(std::vector<Point_<T>>& vec); + +template <class T> +T PolyArea(const T* box, const size_t box_size, const bool normalized); + +template <class T> +T PolyOverlapArea(const T* box1, const T* box2, const size_t box_size, + const bool normalized); +} // namespace operators +} // namespace paddle + +#include "paddle/fluid/operators/detection/poly_util.cc" + +#endif // POLY_UTIL_H_ diff --git a/paddle/fluid/operators/detection/polygon_box_transform_op.cc b/paddle/fluid/operators/detection/polygon_box_transform_op.cc index 568d50d457d838d5f11605710c0d3b987af01d10..4b3bc2edb58fe23393d906094c41b6ad62c71155 100644 --- a/paddle/fluid/operators/detection/polygon_box_transform_op.cc +++ b/paddle/fluid/operators/detection/polygon_box_transform_op.cc @@ -41,9 +41,9 @@ class PolygonBoxTransformCPUKernel : public framework::OpKernel<T> { for (int id_w = 0; id_w < width; ++id_w) { id = id_n * height * width + width * id_h + id_w; if (id_n % 2 == 0) { - out_data[id] = id_w - in_data[id]; + out_data[id] = id_w * 4 - in_data[id]; } else { - out_data[id] = id_h - in_data[id]; + out_data[id] = id_h * 4 - in_data[id]; } } } diff --git a/paddle/fluid/operators/detection/polygon_box_transform_op.cu b/paddle/fluid/operators/detection/polygon_box_transform_op.cu index 6187ac6622c65d2bbc525c3fe2cb397cf74ac612..e1eaf084a3413dd1d13514e2d7b22572d21dd119 100644 --- a/paddle/fluid/operators/detection/polygon_box_transform_op.cu +++ b/paddle/fluid/operators/detection/polygon_box_transform_op.cu @@ -32,9 +32,9 @@ __global__ void PolygonBoxTransformKernel(const int n, const int h, const int w, if (id_n < n && id_h < h && id_w < w) { int id = id_n * h * w + w * id_h + id_w; if (id_n % 2 == 0) { - output[id] = id_w - input[id]; + output[id] = id_w * 4 - input[id]; } else { - output[id] = id_h - input[id]; + output[id] = id_h * 4 - input[id]; } } } diff --git a/paddle/fluid/operators/detection/roi_perspective_transform_op.cu b/paddle/fluid/operators/detection/roi_perspective_transform_op.cu index c82930cc4994c3854e60f40ae9909a90d82cbff6..2d262f932aed9761143f7983c9a38f7a97c374ea 100644 --- a/paddle/fluid/operators/detection/roi_perspective_transform_op.cu +++ b/paddle/fluid/operators/detection/roi_perspective_transform_op.cu @@ -15,6 +15,10 @@ limitations under the License.
*/ #include #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/platform/cuda_primitives.h" +#include "paddle/fluid/platform/float16.h" + +using paddle::platform::PADDLE_CUDA_NUM_THREADS; +using paddle::platform::float16; namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/detection/rpn_target_assign_op.cc b/paddle/fluid/operators/detection/rpn_target_assign_op.cc index dda423efd35b96f5e1d7c55389818f46ef3d8694..46fff9d338b7759496faaf6dd9960d34887755ba 100644 --- a/paddle/fluid/operators/detection/rpn_target_assign_op.cc +++ b/paddle/fluid/operators/detection/rpn_target_assign_op.cc @@ -52,6 +52,9 @@ class RpnTargetAssignOp : public framework::OperatorWithKernel { PADDLE_ENFORCE( ctx->HasOutput("TargetBBox"), "Output(TargetBBox) of RpnTargetAssignOp should not be null"); + PADDLE_ENFORCE( + ctx->HasOutput("BBoxInsideWeight"), + "Output(BBoxInsideWeight) of RpnTargetAssignOp should not be null"); auto anchor_dims = ctx->GetInputDim("Anchor"); auto gt_boxes_dims = ctx->GetInputDim("GtBoxes"); @@ -68,6 +71,7 @@ class RpnTargetAssignOp : public framework::OperatorWithKernel { ctx->SetOutputDim("ScoreIndex", {-1}); ctx->SetOutputDim("TargetLabel", {-1, 1}); ctx->SetOutputDim("TargetBBox", {-1, 4}); + ctx->SetOutputDim("BBoxInsideWeight", {-1, 4}); } protected: @@ -169,6 +173,7 @@ void ScoreAssign(const T* anchor_by_gt_overlap_data, const float rpn_positive_overlap, const float rpn_negative_overlap, std::vector* fg_inds, std::vector* bg_inds, std::vector* tgt_lbl, + std::vector* fg_fake, std::vector* bbox_inside_weight, std::minstd_rand engine, bool use_random) { float epsilon = 0.00001; int anchor_num = anchor_to_gt_max.dims()[0]; @@ -201,12 +206,12 @@ void ScoreAssign(const T* anchor_by_gt_overlap_data, // Reservoir Sampling int fg_num = static_cast(rpn_fg_fraction * rpn_batch_size_per_im); ReservoirSampling(fg_num, &fg_inds_fake, engine, use_random); - fg_num = static_cast(fg_inds_fake.size()); - for (int64_t i = 0; i < fg_num; ++i) { + int fg_fake_num = static_cast(fg_inds_fake.size()); + for (int64_t i = 0; i < fg_fake_num; ++i) { target_label[fg_inds_fake[i]] = 1; } - int bg_num = rpn_batch_size_per_im - fg_num; + int bg_num = rpn_batch_size_per_im - fg_fake_num; for (int64_t i = 0; i < anchor_num; ++i) { if (anchor_to_gt_max_data[i] < rpn_negative_overlap) { bg_inds_fake.push_back(i); @@ -214,12 +219,28 @@ void ScoreAssign(const T* anchor_by_gt_overlap_data, } ReservoirSampling(bg_num, &bg_inds_fake, engine, use_random); bg_num = static_cast(bg_inds_fake.size()); + int fake_num = 0; for (int64_t i = 0; i < bg_num; ++i) { + // fg fake found + if (target_label[bg_inds_fake[i]] == 1) { + fake_num++; + fg_fake->emplace_back(fg_inds_fake[0]); + for (int j = 0; j < 4; ++j) { + bbox_inside_weight->emplace_back(T(0.)); + } + } target_label[bg_inds_fake[i]] = 0; } + for (int64_t i = 0; i < (fg_fake_num - fake_num) * 4; ++i) { + bbox_inside_weight->emplace_back(T(1.)); + } + for (int64_t i = 0; i < anchor_num; ++i) { - if (target_label[i] == 1) fg_inds->emplace_back(i); + if (target_label[i] == 1) { + fg_inds->emplace_back(i); + fg_fake->emplace_back(i); + } if (target_label[i] == 0) bg_inds->emplace_back(i); } fg_num = fg_inds->size(); @@ -248,7 +269,8 @@ std::vector SampleRpnFgBgGt(const platform::CPUDeviceContext& ctx, std::vector bg_inds; std::vector gt_inds; std::vector tgt_lbl; - + std::vector fg_fake; + std::vector bbox_inside_weight; // Calculate the max IoU between anchors and gt boxes // Map from anchor to gt box that has highest overlap auto 
place = ctx.GetPlace(); @@ -275,32 +297,37 @@ std::vector SampleRpnFgBgGt(const platform::CPUDeviceContext& ctx, // Follow the Faster RCNN's implementation ScoreAssign(anchor_by_gt_overlap_data, anchor_to_gt_max, gt_to_anchor_max, rpn_batch_size_per_im, rpn_fg_fraction, rpn_positive_overlap, - rpn_negative_overlap, &fg_inds, &bg_inds, &tgt_lbl, engine, - use_random); + rpn_negative_overlap, &fg_inds, &bg_inds, &tgt_lbl, &fg_fake, + &bbox_inside_weight, engine, use_random); int fg_num = fg_inds.size(); int bg_num = bg_inds.size(); - gt_inds.reserve(fg_num); - for (int i = 0; i < fg_num; ++i) { - gt_inds.emplace_back(argmax[fg_inds[i]]); + int fg_fake_num = fg_fake.size(); + gt_inds.reserve(fg_fake_num); + for (int i = 0; i < fg_fake_num; ++i) { + gt_inds.emplace_back(argmax[fg_fake[i]]); } - - Tensor loc_index_t, score_index_t, tgt_lbl_t, gt_inds_t; - int* loc_index_data = loc_index_t.mutable_data({fg_num}, place); + Tensor loc_index_t, score_index_t, tgt_lbl_t, gt_inds_t, bbox_inside_weight_t; + int* loc_index_data = loc_index_t.mutable_data({fg_fake_num}, place); int* score_index_data = score_index_t.mutable_data({fg_num + bg_num}, place); int* tgt_lbl_data = tgt_lbl_t.mutable_data({fg_num + bg_num}, place); - int* gt_inds_data = gt_inds_t.mutable_data({fg_num}, place); - std::copy(fg_inds.begin(), fg_inds.end(), loc_index_data); + int* gt_inds_data = gt_inds_t.mutable_data({fg_fake_num}, place); + T* bbox_inside_weight_data = + bbox_inside_weight_t.mutable_data({fg_fake_num, 4}, place); + std::copy(fg_fake.begin(), fg_fake.end(), loc_index_data); std::copy(fg_inds.begin(), fg_inds.end(), score_index_data); std::copy(bg_inds.begin(), bg_inds.end(), score_index_data + fg_num); std::copy(tgt_lbl.begin(), tgt_lbl.end(), tgt_lbl_data); std::copy(gt_inds.begin(), gt_inds.end(), gt_inds_data); + std::copy(bbox_inside_weight.begin(), bbox_inside_weight.end(), + bbox_inside_weight_data); std::vector loc_score_tgtlbl_gt; loc_score_tgtlbl_gt.emplace_back(loc_index_t); loc_score_tgtlbl_gt.emplace_back(score_index_t); loc_score_tgtlbl_gt.emplace_back(tgt_lbl_t); loc_score_tgtlbl_gt.emplace_back(gt_inds_t); + loc_score_tgtlbl_gt.emplace_back(bbox_inside_weight_t); return loc_score_tgtlbl_gt; } @@ -318,6 +345,7 @@ class RpnTargetAssignKernel : public framework::OpKernel { auto* score_index = context.Output("ScoreIndex"); auto* tgt_bbox = context.Output("TargetBBox"); auto* tgt_lbl = context.Output("TargetLabel"); + auto* bbox_inside_weight = context.Output("BBoxInsideWeight"); PADDLE_ENFORCE_EQ(gt_boxes->lod().size(), 1UL, "RpnTargetAssignOp gt_boxes needs 1 level of LoD"); @@ -340,7 +368,7 @@ class RpnTargetAssignKernel : public framework::OpKernel { score_index->mutable_data({max_num}, place); tgt_bbox->mutable_data({max_num, 4}, place); tgt_lbl->mutable_data({max_num, 1}, place); - + bbox_inside_weight->mutable_data({max_num, 4}, place); auto& dev_ctx = context.device_context(); std::random_device rnd; @@ -394,6 +422,7 @@ class RpnTargetAssignKernel : public framework::OpKernel { Tensor sampled_score_index = loc_score_tgtlbl_gt[1]; Tensor sampled_tgtlbl = loc_score_tgtlbl_gt[2]; Tensor sampled_gt_index = loc_score_tgtlbl_gt[3]; + Tensor sampled_bbox_inside_weight = loc_score_tgtlbl_gt[4]; int loc_num = sampled_loc_index.dims()[0]; int score_num = sampled_score_index.dims()[0]; @@ -432,6 +461,8 @@ class RpnTargetAssignKernel : public framework::OpKernel { AppendRpns(score_index, total_score_num, &sampled_score_index_unmap); AppendRpns(tgt_bbox, total_loc_num * 4, &sampled_tgt_bbox); 
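// (Illustrative note, inferred from ScoreAssign above: each BBoxInsideWeight
// row is 1.0 for a genuinely sampled foreground anchor and 0.0 for a "fake"
// foreground entry that was later re-sampled as background, so the latter is
// masked out of the bbox regression loss.)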
AppendRpns(tgt_lbl, total_score_num, &sampled_tgtlbl); + AppendRpns(bbox_inside_weight, total_loc_num * 4, + &sampled_bbox_inside_weight); total_loc_num += loc_num; total_score_num += score_num; @@ -448,10 +479,12 @@ class RpnTargetAssignKernel : public framework::OpKernel<T> { score_index->set_lod(loc_score); tgt_bbox->set_lod(lod_loc); tgt_lbl->set_lod(loc_score); + bbox_inside_weight->set_lod(lod_loc); loc_index->Resize({total_loc_num}); score_index->Resize({total_score_num}); tgt_bbox->Resize({total_loc_num, 4}); tgt_lbl->Resize({total_score_num, 1}); + bbox_inside_weight->Resize({total_loc_num, 4}); } }; @@ -514,6 +547,9 @@ class RpnTargetAssignOpMaker : public framework::OpProtoAndCheckerMaker { "TargetLabel", "(Tensor), The target labels of each anchor with shape " "[F + B, 1], F and B are sampled foreground and background number."); + AddOutput("BBoxInsideWeight", + "(Tensor), The bbox inside weights with shape " + "[F, 4], where F is the sampled foreground number."); AddComment(R"DOC( Given a set of ground-truth bboxes and anchors, this operator assigns classification and regression targets to each prediction. diff --git a/paddle/fluid/operators/distributed/CMakeLists.txt b/paddle/fluid/operators/distributed/CMakeLists.txt index 56734b81e8716a0c0c37a11e35c9118ee7b55020..21db93958a4a586c74a1e060f1f04b5af1dcd889 100644 --- a/paddle/fluid/operators/distributed/CMakeLists.txt +++ b/paddle/fluid/operators/distributed/CMakeLists.txt @@ -20,7 +20,7 @@ if(WITH_GRPC) DEPS grpc++_unsecure grpc_unsecure gpr cares zlib protobuf sendrecvop_grpc scope profiler math_function SERIAL) cc_test(rpc_server_test SRCS rpc_server_test.cc DEPS sendrecvop_grpc grpc++_unsecure grpc_unsecure gpr cares zlib protobuf executor proto_desc lookup_sparse_table_op SERIAL) - cc_test(varhandle_test SRCS varhandle_test.cc) + cc_test(varhandle_test SRCS varhandle_test.cc DEPS profiler) return() endif() diff --git a/paddle/fluid/operators/distributed/brpc_server.cc b/paddle/fluid/operators/distributed/brpc_server.cc index 862167f02084cfe81db1c0936bbfb0415fa85721..47a06dd0f378f6cc4f79aee52052717188d72420 100644 --- a/paddle/fluid/operators/distributed/brpc_server.cc +++ b/paddle/fluid/operators/distributed/brpc_server.cc @@ -133,10 +133,10 @@ void AsyncBRPCServer::StartServer() { void AsyncBRPCServer::ShutDownImpl() { server_.Stop(1000); } void AsyncBRPCServer::WaitServerReady() { - VLOG(3) << "AsyncGRPCServer is wait server ready"; + VLOG(30) << "AsyncBRPCServer is waiting for the server to be ready"; std::unique_lock<std::mutex> lock(this->mutex_ready_); condition_ready_.wait(lock, [=] { return this->ready_ == 1; }); - VLOG(3) << "AsyncGRPCServer WaitSeverReady"; + VLOG(30) << "AsyncBRPCServer WaitServerReady"; } }; // namespace distributed diff --git a/paddle/fluid/operators/distributed/grpc_client.cc b/paddle/fluid/operators/distributed/grpc_client.cc index 13682b78f0eccf049daa315f3a26aafd22e42a41..7ce9669517bf115274e314e1116f8a5e69277b50 100644 --- a/paddle/fluid/operators/distributed/grpc_client.cc +++ b/paddle/fluid/operators/distributed/grpc_client.cc @@ -12,18 +12,18 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
*/ -#include "paddle/fluid/operators/distributed/grpc_client.h" - -#include - #include #include "glog/logging.h" // For VLOG #include "paddle/fluid/framework/threadpool.h" +#include "paddle/fluid/operators/distributed/grpc_client.h" #include "paddle/fluid/operators/distributed/grpc_serde.h" #include "paddle/fluid/operators/distributed/request_handler.h" +#include "paddle/fluid/platform/port.h" #include "paddle/fluid/platform/profiler.h" +DECLARE_bool(rpc_disable_reuse_port); + namespace paddle { namespace operators { namespace distributed { @@ -40,7 +40,7 @@ void GRPCClient::SendComplete() { std::unique_lock lk(completed_mutex_); if (!completed_) { for (auto& it : channels_) { - VLOG(3) << "send complete message to " << it.first; + VLOG(30) << "send complete message to " << it.first; this->AsyncSendComplete(it.first); } PADDLE_ENFORCE(this->Wait(), "internal grpc error"); @@ -73,24 +73,31 @@ VarHandlePtr GRPCClient::AsyncSendVar(const std::string& ep, const framework::Scope* p_scope = &scope; const auto ch = GetChannel(ep_val); SendProcessor* s = new SendProcessor(ch); - VarHandlePtr h(new VarHandle(ep, "Send", var_name_val, p_ctx, p_scope)); + const std::string method = "SendRPC"; + VarHandlePtr h(new VarHandle(ep, method, var_name_val, p_ctx, p_scope)); s->Prepare(h, time_out); - framework::AsyncIO([var_name_val, p_scope, p_ctx, s, this] { + framework::AsyncIO([var_name_val, p_scope, p_ctx, s, method, h, this] { auto* var = p_scope->FindVar(var_name_val); ::grpc::ByteBuffer req; - SerializeToByteBuffer(var_name_val, var, *p_ctx, &req); + SerializeToByteBuffer(var_name_val, var, *p_ctx, &req, "", trainer_id_); - VLOG(3) << s->GetVarHandlePtr()->String() << " begin"; + VLOG(30) << s->GetVarHandlePtr()->String() << " begin"; // stub context s->response_call_back_ = nullptr; + platform::RecordRPCEvent record_event(method, p_ctx); + auto call = s->stub_g_.PrepareUnaryCall( s->context_.get(), "/sendrecv.SendRecvService/SendVariable", req, &cq_); call->StartCall(); call->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } }); req_count_++; @@ -100,7 +107,10 @@ VarHandlePtr GRPCClient::AsyncSendVar(const std::string& ep, void ProcGetResponse(const VarHandle& var_h, const ::grpc::ByteBuffer& ret_msg) { framework::Variable* outvar = nullptr; - DeserializeFromByteBuffer(ret_msg, *var_h.ctx(), var_h.scope(), &outvar); + // get response's trainer_id is not used + int trainer_id; + DeserializeFromByteBuffer(ret_msg, *var_h.ctx(), var_h.scope(), &outvar, + &trainer_id); } template @@ -122,25 +132,33 @@ VarHandlePtr GRPCClient::AsyncGetVar(const std::string& ep, const framework::Scope* p_scope = &scope; const auto ch = GetChannel(ep_val); GetProcessor* s = new GetProcessor(ch); - VarHandlePtr h(new VarHandle(ep, "Get", var_name_val, p_ctx, p_scope)); + const std::string method = "GetRPC"; + VarHandlePtr h(new VarHandle(ep, method, var_name_val, p_ctx, p_scope)); s->Prepare(h, time_out); - framework::AsyncIO([var_name_val, s, this] { + framework::AsyncIO([var_name_val, s, method, p_ctx, h, this] { // prepare input sendrecv::VariableMessage req; req.set_varname(var_name_val); + req.set_trainer_id(trainer_id_); ::grpc::ByteBuffer buf; RequestToByteBuffer(req, &buf); - VLOG(3) << s->GetVarHandlePtr()->String() << " begin"; + VLOG(30) << s->GetVarHandlePtr()->String() << " begin"; // stub context s->response_call_back_ = ProcGetResponse; + platform::RecordRPCEvent record_event(method, p_ctx); + auto call = s->stub_g_.PrepareUnaryCall( 
s->context_.get(), "/sendrecv.SendRecvService/GetVariable", buf, &cq_); call->StartCall(); call->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } }); req_count_++; @@ -161,27 +179,35 @@ VarHandlePtr GRPCClient::AsyncPrefetchVar(const std::string& ep, const framework::Scope* p_scope = &scope; const auto ch = GetChannel(ep_val); GetProcessor* s = new GetProcessor(ch); - VarHandlePtr h( - new VarHandle(ep, "Prefetch", out_var_name_val, p_ctx, p_scope)); + + const std::string method = "PrefetchRPC"; + + VarHandlePtr h(new VarHandle(ep, method, out_var_name_val, p_ctx, p_scope)); s->Prepare(h, time_out); framework::AsyncIO([in_var_name_val, out_var_name_val, ep_val, p_scope, p_ctx, - s, this] { + s, method, h, this] { auto* var = p_scope->FindVar(in_var_name_val); ::grpc::ByteBuffer req; SerializeToByteBuffer(in_var_name_val, var, *p_ctx, &req, out_var_name_val); - VLOG(3) << s->GetVarHandlePtr()->String() << " begin"; + VLOG(30) << s->GetVarHandlePtr()->String() << " begin"; // stub context s->response_call_back_ = ProcGetResponse; + platform::RecordRPCEvent record_event(method, p_ctx); + auto call = s->stub_g_.PrepareUnaryCall( s->context_.get(), "/sendrecv.SendRecvService/PrefetchVariable", req, &cq_); call->StartCall(); call->Finish(&s->reply_, &s->status_, static_cast(s)); + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } }); req_count_++; @@ -193,15 +219,24 @@ VarHandlePtr GRPCClient::AsyncSendBatchBarrier(const std::string& ep, const auto ch = GetChannel(ep); BatchBarrierProcessor* s = new BatchBarrierProcessor(ch); - VarHandlePtr h(new VarHandle(ep, "BatchBarrier", BATCH_BARRIER_MESSAGE, - nullptr, nullptr)); + const std::string method = "BatchBarrierRPC"; + VarHandlePtr h( + new VarHandle(ep, method, BATCH_BARRIER_MESSAGE, nullptr, nullptr)); s->Prepare(h, time_out); sendrecv::VariableMessage req; req.set_varname(BATCH_BARRIER_MESSAGE); + + platform::RecordRPCEvent record_event(method, nullptr); + auto rpc = s->stub_->AsyncSendVariable(s->context_.get(), req, &cq_); rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); req_count_++; + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + return h; } @@ -209,15 +244,24 @@ VarHandlePtr GRPCClient::AsyncSendFetchBarrier(const std::string& ep, int64_t time_out) { const auto ch = GetChannel(ep); FetchBarrierProcessor* s = new FetchBarrierProcessor(ch); - VarHandlePtr h(new VarHandle(ep, "FetchBarrier", FETCH_BARRIER_MESSAGE, - nullptr, nullptr)); + const std::string method = "FetchBarrierRPC"; + VarHandlePtr h( + new VarHandle(ep, method, FETCH_BARRIER_MESSAGE, nullptr, nullptr)); s->Prepare(h, time_out); sendrecv::VariableMessage req; req.set_varname(FETCH_BARRIER_MESSAGE); + + platform::RecordRPCEvent record_event(method, nullptr); + auto rpc = s->stub_->AsyncGetVariable(s->context_.get(), req, &cq_); rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); req_count_++; + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + return h; } @@ -226,15 +270,23 @@ VarHandlePtr GRPCClient::AsyncSendComplete(const std::string& ep, const auto ch = GetChannel(ep); BatchBarrierProcessor* s = new BatchBarrierProcessor(ch); - VarHandlePtr h( - new VarHandle(ep, "SendComplete", COMPLETE_MESSAGE, nullptr, nullptr)); + const std::string method = "SendCompleteRPC"; + VarHandlePtr h(new VarHandle(ep, method, COMPLETE_MESSAGE, nullptr, nullptr)); s->Prepare(h, time_out); sendrecv::VariableMessage req; req.set_varname(COMPLETE_MESSAGE); 
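+ // (Illustrative note: the same pattern recurs in every Async* call in this
+ // file -- a RecordRPCEvent scopes the RPC for the profiler, and the handle
+ // is waited on when profiling is enabled so the recorded event covers the
+ // full round trip.)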
+ + platform::RecordRPCEvent record_event(method, nullptr); + auto rpc = s->stub_->AsyncSendVariable(s->context_.get(), req, &cq_); rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); req_count_++; + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + return h; } @@ -244,17 +296,27 @@ VarHandlePtr GRPCClient::AsyncCheckpointNotify(const std::string& ep, const auto ch = GetChannel(ep); CheckpointNotifyProcessor* s = new CheckpointNotifyProcessor(ch); - VarHandlePtr h(new VarHandle(ep, "CheckPointNotify", CHECKPOINT_SAVE_MESSAGE, - nullptr, nullptr)); + + const std::string method = "CheckPointNotifyRPC"; + + VarHandlePtr h( + new VarHandle(ep, method, CHECKPOINT_SAVE_MESSAGE, nullptr, nullptr)); s->Prepare(h, time_out); sendrecv::VariableMessage req; req.set_varname(CHECKPOINT_SAVE_MESSAGE); req.set_out_varname(dir); + platform::RecordRPCEvent record_event(method, nullptr); + auto rpc = s->stub_->AsyncCheckpointNotify(s->context_.get(), req, &cq_); rpc->Finish(&s->reply_, &s->status_, reinterpret_cast(s)); req_count_++; + + if (UNLIKELY(platform::IsProfileEnabled())) { + h->Wait(); + } + return h; } @@ -268,17 +330,21 @@ void GRPCClient::Proceed() { void* tag = nullptr; bool ok = false; - VLOG(3) << "GRPCClient Proceed begin"; + VLOG(30) << "GRPCClient Proceed begin"; while (!stopped_ && cq_.Next(&tag, &ok)) { BaseProcessor* c = static_cast(tag); GPR_ASSERT(ok); PADDLE_ENFORCE(c); + if (c->status_.ok()) { - VLOG(3) << c->GetVarHandlePtr()->String() << " process"; + VLOG(30) << c->GetVarHandlePtr()->String() << " process"; c->Process(); } else if (c->status_.error_code() == grpc::StatusCode::DEADLINE_EXCEEDED) { + // FIXME(gongwb): parse error_details? LOG(ERROR) << c->GetVarHandlePtr()->String() - << " meets grpc error:" << c->status_.error_message(); + << " meets grpc error, error_code:" << c->status_.error_code() + << " error_message:" << c->status_.error_message() + << " error_details:" << c->status_.error_details(); { std::lock_guard lk(sync_mutex_); ok_ = false; @@ -286,7 +352,10 @@ void GRPCClient::Proceed() { c->Finish(false); } else { LOG(FATAL) << c->GetVarHandlePtr()->String() - << " meets grpc error:" << c->status_.error_message(); + << " meets grpc error, error_code:" << c->status_.error_code() + << " error_message:" << c->status_.error_message() + << " error_details:" << c->status_.error_details(); + c->Finish(false); } @@ -303,7 +372,7 @@ void GRPCClient::Proceed() { sync_cond_.notify_all(); } } - VLOG(3) << "GRPCClient Proceed end"; + VLOG(30) << "GRPCClient Proceed end"; } std::shared_ptr GRPCClient::GetChannel(const std::string& ep) { @@ -316,6 +385,9 @@ std::shared_ptr GRPCClient::GetChannel(const std::string& ep) { // Channel configurations: grpc::ChannelArguments args; args.SetInt(GRPC_ARG_MAX_RECONNECT_BACKOFF_MS, 2000); + if (FLAGS_rpc_disable_reuse_port) { + args.SetInt(GRPC_ARG_ALLOW_REUSEPORT, 0); + } args.SetCompressionAlgorithm(GRPC_COMPRESS_NONE); args.SetMaxSendMessageSize(std::numeric_limits::max()); args.SetMaxReceiveMessageSize(std::numeric_limits::max()); diff --git a/paddle/fluid/operators/distributed/grpc_serde.cc b/paddle/fluid/operators/distributed/grpc_serde.cc index 3f8796713a6b89a308113981614673e07e8d367f..e6856676d49e867214801810949076151e34356a 100644 --- a/paddle/fluid/operators/distributed/grpc_serde.cc +++ b/paddle/fluid/operators/distributed/grpc_serde.cc @@ -15,7 +15,6 @@ limitations under the License. 
*/ #ifdef PADDLE_WITH_CUDA #include #endif -#include #include // NOLINT #include "google/protobuf/io/coded_stream.h" @@ -26,24 +25,30 @@ limitations under the License. */ #include "paddle/fluid/operators/distributed/grpc_variable_response.h" #include "paddle/fluid/operators/distributed/proto_encoder_helper.h" #include "paddle/fluid/operators/distributed/sendrecvop_utils.h" +#include "paddle/fluid/platform/port.h" #include "paddle/fluid/platform/profiler.h" namespace paddle { namespace operators { namespace distributed { +static void SerializeDestroyCallback(void* payload) { + if (payload != nullptr) { + auto* shared_payload = reinterpret_cast(payload); + delete shared_payload; + } +} + void SerializeToByteBuffer(const std::string& name, framework::Variable* var, const platform::DeviceContext& ctx, - ::grpc::ByteBuffer* msg, - const std::string& out_name) { - // Default DestroyCallback does nothing, When using GPU - // the CPU buffer need to be freed. - DestroyCallback destroy_callback = [](void* backing) {}; + ::grpc::ByteBuffer* msg, const std::string& out_name, + const int trainer_id) { + platform::RecordRPCEvent record_event("serial", &ctx); VarMsg request; - void* payload = nullptr; - size_t payload_size; + TensorPayload* payload = nullptr; request.set_varname(name); + request.set_trainer_id(trainer_id); // Note: normally the profiler is enabled in 1 trainer, hence only // 1 trainer returns true for ShouldSendProfileState(). It tells PS // servers the trainer's profiling state so that PS can follow the @@ -60,10 +65,10 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var, } if (var->IsType()) { request.set_type(::sendrecv::LOD_TENSOR); - GetTensorPayload(var, ctx, &request, &payload, &payload_size); + payload = new TensorPayload(GetTensorPayload(var, ctx, &request)); } else if (var->IsType()) { request.set_type(::sendrecv::SELECTED_ROWS); - GetSelectedRowsPayload(var, ctx, &request, &payload, &payload_size); + payload = new TensorPayload(GetSelectedRowsPayload(var, ctx, &request)); #ifdef PADDLE_WITH_CUDA } else if (var->IsType()) { request.set_type(::sendrecv::NCCL_ID); @@ -73,17 +78,6 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var, typeid(var->Type()).name()); } - if (platform::is_gpu_place(ctx.GetPlace())) { -#ifdef PADDLE_WITH_CUDA - // GPU data is copied to CPU buffer when sending, - // free the buffer when possible. 
- destroy_callback = [](void* backing) { - platform::CUDAPinnedPlace cuda_pinned; - memory::Free(cuda_pinned, backing); - }; -#endif - } - std::string header; request.AppendToString(&header); auto buffer = std::unique_ptr(new char[1024]); @@ -107,16 +101,18 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var, return; } #endif + PADDLE_ENFORCE_NOT_NULL(payload); - e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, payload_size); + e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, + payload->memory_size()); // steal reference of tensor data ::grpc::Slice slices[4]; // metadata, tensor, rows meta, rows int num_slices = 2; // only SelectedRows have rows buffer slices[0] = ::grpc::Slice(e.size()); memcpy(const_cast(slices[0].begin()), e.data(), e.size()); slices[1] = ::grpc::Slice( - grpc_slice_new_with_user_data(payload, payload_size, destroy_callback, - static_cast(payload)), + grpc_slice_new_with_user_data(payload->ptr(), payload->memory_size(), + SerializeDestroyCallback, payload), ::grpc::Slice::STEAL_REF); if (var->IsType()) { @@ -146,10 +142,12 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var, void DeserializeFromByteBuffer(const ::grpc::ByteBuffer& msg, const platform::DeviceContext& ctx, const framework::Scope* scope, - framework::Variable** var) { + framework::Variable** var, int* trainer_id) { + platform::RecordRPCEvent record_event("deserial", &ctx); operators::distributed::GRPCVariableResponse resp(scope, &ctx); PADDLE_ENFORCE(resp.Parse(msg) == 0, "parse bytebuffer to tensor error!"); *var = resp.GetVar(); + *trainer_id = resp.GetTrainerId(); } } // namespace distributed diff --git a/paddle/fluid/operators/distributed/grpc_serde.h b/paddle/fluid/operators/distributed/grpc_serde.h index 450c41dcd6b1bf9a33d3bbef3a1c94a2f83ff322..17290d3fb4478191c59623913a82d4142d3c49f9 100644 --- a/paddle/fluid/operators/distributed/grpc_serde.h +++ b/paddle/fluid/operators/distributed/grpc_serde.h @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once -#include + #include #include #include @@ -25,6 +25,7 @@ limitations under the License. */ #include "paddle/fluid/framework/tensor_util.h" #include "paddle/fluid/framework/var_type.h" #include "paddle/fluid/operators/distributed/sendrecvop_utils.h" +#include "paddle/fluid/platform/port.h" #include "paddle/fluid/operators/distributed/send_recv.grpc.pb.h" #include "paddle/fluid/operators/distributed/send_recv.pb.h" @@ -38,12 +39,13 @@ typedef void (*DestroyCallback)(void*); void SerializeToByteBuffer(const std::string& name, framework::Variable* var, const platform::DeviceContext& ctx, ::grpc::ByteBuffer* msg, - const std::string& out_varname = std::string()); + const std::string& out_varname = std::string(), + const int trainer_id = 0); void DeserializeFromByteBuffer(const ::grpc::ByteBuffer& msg, const platform::DeviceContext& ctx, const framework::Scope* scope, - framework::Variable** var); + framework::Variable** var, int* trainer_id); } // namespace distributed } // namespace operators diff --git a/paddle/fluid/operators/distributed/grpc_server.cc b/paddle/fluid/operators/distributed/grpc_server.cc index 8edb00276df3ade1b320fbf2873e8b54ff3e1464..77bf67be25780b1fe18c0b2c56680b549637d547 100644 --- a/paddle/fluid/operators/distributed/grpc_server.cc +++ b/paddle/fluid/operators/distributed/grpc_server.cc @@ -20,6 +20,8 @@ limitations under the License. 
*/ using ::grpc::ServerAsyncResponseWriter; +DECLARE_bool(rpc_disable_reuse_port); + namespace paddle { namespace operators { namespace distributed { @@ -98,13 +100,14 @@ class RequestSend final : public RequestBase { void Process() override { std::string varname = GetReqName(); - VLOG(4) << "RequestSend var_name:" << varname; + VLOG(40) << "RequestSend var_name:" << varname; auto scope = request_->GetMutableLocalScope(); auto invar = request_->GetVar(); + int trainer_id = request_->GetTrainerId(); framework::Variable* outvar = nullptr; - request_handler_->Handle(varname, scope, invar, &outvar); + request_handler_->Handle(varname, scope, invar, &outvar, trainer_id); Finish(reply_, &responder_); } @@ -133,13 +136,14 @@ class RequestGet final : public RequestBase { void Process() override { // proc request. std::string varname = request_.varname(); - VLOG(4) << "RequestGet " << varname; + int trainer_id = request_.trainer_id(); + VLOG(40) << "RequestGet " << varname; auto scope = request_handler_->scope(); auto invar = scope->FindVar(varname); framework::Variable* outvar = nullptr; - request_handler_->Handle(varname, scope, invar, &outvar); + request_handler_->Handle(varname, scope, invar, &outvar, trainer_id); if (outvar) { SerializeToByteBuffer(varname, outvar, *request_handler_->dev_ctx(), @@ -179,15 +183,17 @@ class RequestPrefetch final : public RequestBase { // prefetch process... std::string in_var_name = request_->Varname(); std::string out_var_name = request_->OutVarname(); - VLOG(4) << "RequestPrefetch, in_var_name: " << in_var_name - << " out_var_name: " << out_var_name; + int trainer_id = request_->GetTrainerId(); + VLOG(40) << "RequestPrefetch, in_var_name: " << in_var_name + << " out_var_name: " << out_var_name; auto scope = request_->GetMutableLocalScope(); auto invar = scope->FindVar(in_var_name); // out var must be created in local scope! framework::Variable* outvar = scope->Var(out_var_name); - request_handler_->Handle(in_var_name, scope, invar, &outvar, out_var_name); + request_handler_->Handle(in_var_name, scope, invar, &outvar, trainer_id, + out_var_name); SerializeToByteBuffer(out_var_name, outvar, *request_handler_->dev_ctx(), &reply_); @@ -225,12 +231,13 @@ class RequestCheckpointNotify final : public RequestBase { std::string checkpoint_notify = request_->Varname(); std::string checkpoint_dir = request_->OutVarname(); + int trainer_id = request_->GetTrainerId(); - VLOG(4) << "RequestCheckpointNotify notify: " << checkpoint_notify - << ", dir: " << checkpoint_dir; + VLOG(40) << "RequestCheckpointNotify notify: " << checkpoint_notify + << ", dir: " << checkpoint_dir; request_handler_->Handle(checkpoint_notify, scope, nullptr, nullptr, - checkpoint_dir); + trainer_id, checkpoint_dir); Finish(reply_, &responder_); } @@ -241,12 +248,26 @@ class RequestCheckpointNotify final : public RequestBase { }; void AsyncGRPCServer::WaitServerReady() { - VLOG(4) << "AsyncGRPCServer is wait server ready"; + VLOG(40) << "AsyncGRPCServer is wait server ready"; std::unique_lock lock(this->mutex_ready_); condition_ready_.wait(lock, [=] { return this->ready_ == 1; }); - VLOG(4) << "AsyncGRPCServer WaitSeverReady"; + VLOG(40) << "AsyncGRPCServer WaitSeverReady"; } +// Define an option subclass in order to disable SO_REUSEPORT for the +// server socket. 
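+// (Illustrative note: the client side mirrors this in GRPCClient::GetChannel,
+// which sets GRPC_ARG_ALLOW_REUSEPORT to 0 on the channel arguments when
+// FLAGS_rpc_disable_reuse_port is set.)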
+// Come from: +// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc +class NoReusePortOption : public ::grpc::ServerBuilderOption { + public: + void UpdateArguments(::grpc::ChannelArguments* args) override { + args->SetInt(GRPC_ARG_ALLOW_REUSEPORT, 0); + } + + void UpdatePlugins(std::vector>* + plugins) override {} +}; + void AsyncGRPCServer::StartServer() { ::grpc::ServerBuilder builder; builder.AddListeningPort(bind_address_, ::grpc::InsecureServerCredentials(), @@ -254,6 +275,10 @@ void AsyncGRPCServer::StartServer() { builder.SetMaxSendMessageSize(std::numeric_limits::max()); builder.SetMaxReceiveMessageSize(std::numeric_limits::max()); + if (FLAGS_rpc_disable_reuse_port) { + builder.SetOption( + std::unique_ptr<::grpc::ServerBuilderOption>(new NoReusePortOption)); + } builder.RegisterService(&service_); for (auto t : rpc_call_map_) { @@ -277,14 +302,15 @@ void AsyncGRPCServer::StartServer() { reqs.reserve(kRequestBufSize); for (int i = 0; i < kRequestBufSize; i++) { - VLOG(6) << "TryToRegisterNewOne on RPC NAME: " << rpc_name << " I: " << i; + VLOG(60) << "TryToRegisterNewOne on RPC NAME: " << rpc_name + << " I: " << i; TryToRegisterNewOne(rpc_name, i); } for (int i = 0; i < threadnum; i++) { rpc_threads_[rpc_name].emplace_back(new std::thread(std::bind( &AsyncGRPCServer::HandleRequest, this, cq.get(), rpc_name, f))); - VLOG(4) << t.first << " creates threads!"; + VLOG(40) << t.first << " creates threads!"; } } @@ -301,7 +327,7 @@ void AsyncGRPCServer::StartServer() { auto& threads = t.second; for (size_t i = 0; i < threads.size(); ++i) { threads[i]->join(); - VLOG(4) << t.first << " threads ends!"; + VLOG(40) << t.first << " threads ends!"; } } } @@ -309,7 +335,7 @@ void AsyncGRPCServer::StartServer() { void AsyncGRPCServer::ShutdownQueue() { for (auto& t : rpc_cq_) { t.second->Shutdown(); - VLOG(4) << t.first << " queue shutdown!"; + VLOG(40) << t.first << " queue shutdown!"; } } @@ -318,7 +344,7 @@ void AsyncGRPCServer::ShutDownImpl() { is_shut_down_ = true; ShutdownQueue(); - VLOG(4) << "server_ shutdown!"; + VLOG(40) << "server_ shutdown!"; server_->Shutdown(); } @@ -326,12 +352,12 @@ void AsyncGRPCServer::TryToRegisterNewOne(const std::string& rpc_name, int req_id) { std::unique_lock lock(cq_mutex_); if (is_shut_down_) { - VLOG(4) << "shutdown, do not TryToRegisterNewSendOne"; + VLOG(40) << "shutdown, do not TryToRegisterNewSendOne"; return; } - VLOG(4) << "TryToRegisterNewOne on RPC NAME: " << rpc_name - << " REQ ID: " << req_id; + VLOG(40) << "TryToRegisterNewOne on RPC NAME: " << rpc_name + << " REQ ID: " << req_id; auto& reqs = rpc_reqs_[rpc_name]; auto& handler = rpc_call_map_[rpc_name]; @@ -352,7 +378,7 @@ void AsyncGRPCServer::TryToRegisterNewOne(const std::string& rpc_name, reqs[req_id] = b; - VLOG(4) << "Create RequestSend status:" << b->Status(); + VLOG(40) << "Create RequestSend status:" << b->Status(); } void AsyncGRPCServer::HandleRequest( @@ -362,15 +388,15 @@ void AsyncGRPCServer::HandleRequest( bool ok = false; while (true) { - VLOG(4) << "HandleRequest " << rpc_name << " wait next"; + VLOG(40) << "HandleRequest " << rpc_name << " wait next"; if (!cq->Next(&tag, &ok)) { - VLOG(3) << "CompletionQueue " << rpc_name << " shutdown!"; + VLOG(30) << "CompletionQueue " << rpc_name << " shutdown!"; break; } int req_id = static_cast(reinterpret_cast(tag)); - VLOG(4) << "HandleRequest " << rpc_name << ", req_id:" << req_id - << " get next"; + VLOG(40) << "HandleRequest " << rpc_name << ", req_id:" << req_id + 
<< " get next"; auto& reqs = rpc_reqs_[rpc_name]; RequestBase* base = nullptr; @@ -380,7 +406,7 @@ void AsyncGRPCServer::HandleRequest( base = reqs[req_id]; } - VLOG(3) << base->Status2String(rpc_name); + VLOG(30) << base->Status2String(rpc_name); // reference: // https://github.com/tensorflow/tensorflow/issues/5596 diff --git a/paddle/fluid/operators/distributed/grpc_variable_response.cc b/paddle/fluid/operators/distributed/grpc_variable_response.cc index 34d47f3ec0f3025109447b66078b724607d2953a..d6d219d4369ba785e5c369538d4a18dc682952c1 100644 --- a/paddle/fluid/operators/distributed/grpc_variable_response.cc +++ b/paddle/fluid/operators/distributed/grpc_variable_response.cc @@ -286,13 +286,21 @@ int GRPCVariableResponse::Parse(Source* source) { platform::EnableProfiler(platform::ProfilerState::kCPU); } else if (profiling == platform::kDisableProfiler && platform::IsProfileEnabled()) { - // TODO(panyx0718): Should we allow to customize file dir. platform::DisableProfiler( platform::EventSortingKey::kDefault, - string::Sprintf("/tmp/profile_ps_%lld", listener_id)); + string::Sprintf("%s_%lld", FLAGS_rpc_server_profile_path, + listener_id)); } break; } + case sendrecv::VariableMessage::kTrainerIdFieldNumber: { + uint64_t trainer_id = 0; + if (!input.ReadVarint64(&trainer_id)) { + return tag; + } + meta_.set_trainer_id(trainer_id); + break; + } default: { // Unknown tag, return unknown error. return -1; diff --git a/paddle/fluid/operators/distributed/request_handler.h b/paddle/fluid/operators/distributed/request_handler.h index 5be7095acd3c5ac6f880a8a26c246f60a93643b5..3bcc59a47ba5f52da1374f220828a0f392e13d27 100644 --- a/paddle/fluid/operators/distributed/request_handler.h +++ b/paddle/fluid/operators/distributed/request_handler.h @@ -75,7 +75,7 @@ class VarHandle { wait_cond_.wait(lk, [this] { return status_ != kDefaultState; }); ret = status_; } - VLOG(7) << "VarHandle wait:" << ret; + VLOG(70) << "VarHandle wait:" << ret; return ret != kErrorState; } @@ -84,7 +84,7 @@ class VarHandle { std::unique_lock lk(sync_mutex_); status_ = ok ? 
kFinishState : kErrorState; } - VLOG(7) << "VarHandle finish:" << ok; + VLOG(70) << "VarHandle finish:" << ok; wait_cond_.notify_all(); } @@ -190,6 +190,7 @@ class RequestHandler { // } virtual bool Handle(const std::string& varname, framework::Scope* scope, framework::Variable* var, framework::Variable** outvar, + const int trainer_id, const std::string& out_var_name = "") = 0; protected: diff --git a/paddle/fluid/operators/distributed/request_handler_impl.cc b/paddle/fluid/operators/distributed/request_handler_impl.cc index 849e412504eb9180b746db65fd4fa353ed0c05a1..dae56cc8436c2241bfc8ae37ba3cad4069a054bf 100644 --- a/paddle/fluid/operators/distributed/request_handler_impl.cc +++ b/paddle/fluid/operators/distributed/request_handler_impl.cc @@ -36,21 +36,21 @@ bool RequestSendHandler::Handle(const std::string& varname, framework::Scope* scope, framework::Variable* invar, framework::Variable** outvar, + const int trainer_id, const std::string& out_var_name) { - VLOG(4) << "RequestSendHandler:" << varname; + VLOG(40) << "RequestSendHandler:" << varname; // Sync if (varname == BATCH_BARRIER_MESSAGE) { - VLOG(3) << "sync: recv BATCH_BARRIER_MESSAGE"; + VLOG(30) << "sync: recv BATCH_BARRIER_MESSAGE"; rpc_server_->IncreaseBatchBarrier(kRequestSend); } else if (varname == COMPLETE_MESSAGE) { - VLOG(3) << "sync: recv complete message"; + VLOG(30) << "sync: recv complete message"; rpc_server_->Complete(); } else { // Async if (!sync_mode_) { - VLOG(3) << "async process var: " << varname; - rpc_server_->Profiler().OneStep(); + VLOG(30) << "async process var: " << varname; try { executor_->RunPreparedContext((*grad_to_prepared_ctx_)[varname].get(), scope); @@ -61,7 +61,7 @@ bool RequestSendHandler::Handle(const std::string& varname, return true; } else { // sync rpc_server_->WaitCond(kRequestSend); - VLOG(3) << "sync: processing received var: " << varname; + VLOG(30) << "sync: processing received var: " << varname; if (invar == nullptr) { LOG(FATAL) << "sync: Can not find server side var: " << varname; @@ -76,11 +76,12 @@ bool RequestGetHandler::Handle(const std::string& varname, framework::Scope* scope, framework::Variable* invar, framework::Variable** outvar, + const int trainer_id, const std::string& out_var_name) { - VLOG(4) << "RequestGetHandler:" << varname; + VLOG(40) << "RequestGetHandler:" << varname; if (sync_mode_) { if (varname == FETCH_BARRIER_MESSAGE) { - VLOG(3) << "sync: recv fetch barrier message"; + VLOG(30) << "sync: recv fetch barrier message"; rpc_server_->IncreaseBatchBarrier(kRequestGet); } else { rpc_server_->WaitCond(kRequestGet); @@ -88,6 +89,20 @@ bool RequestGetHandler::Handle(const std::string& varname, } } else { if (varname != FETCH_BARRIER_MESSAGE && varname != COMPLETE_MESSAGE) { + if (enable_dc_asgd_) { + // NOTE: the format is determined by distributed_transpiler.py + std::string param_bak_name = + string::Sprintf("%s.trainer_%d_bak", varname, trainer_id); + VLOG(30) << "getting " << param_bak_name << " trainer_id " + << trainer_id; + auto var = scope_->FindVar(varname); + auto t_orig = var->Get(); + auto param_bak = scope_->Var(param_bak_name); + auto t = param_bak->GetMutable(); + t->mutable_data(dev_ctx_->GetPlace(), t_orig.type()); + VLOG(30) << "copying " << varname << " to " << param_bak_name; + framework::TensorCopy(t_orig, dev_ctx_->GetPlace(), t); + } *outvar = scope_->FindVar(varname); } } @@ -98,8 +113,9 @@ bool RequestPrefetchHandler::Handle(const std::string& varname, framework::Scope* scope, framework::Variable* invar, framework::Variable** outvar, + 
const int trainer_id, const std::string& out_var_name) { - VLOG(4) << "RequestPrefetchHandler " << varname; + VLOG(40) << "RequestPrefetchHandler " << varname; auto var_desc = program_->Block(0).FindVar(out_var_name); InitializeVariable(*outvar, var_desc->GetType()); @@ -113,6 +129,7 @@ bool RequestCheckpointHandler::Handle(const std::string& varname, framework::Scope* scope, framework::Variable* invar, framework::Variable** outvar, + const int trainer_id, const std::string& out_var_name) { PADDLE_ENFORCE( checkpoint_notify_id != -1, @@ -122,8 +139,8 @@ bool RequestCheckpointHandler::Handle(const std::string& varname, auto* lt_var = scope_->FindVar(LOOKUP_TABLE_PATH)->GetMutable(); lt_var->clear(); lt_var->append(out_var_name); - VLOG(4) << "RequestCheckpointHandler update var kLookupTablePath to: " - << out_var_name; + VLOG(40) << "RequestCheckpointHandler update var kLookupTablePath to: " + << out_var_name; executor_->RunPreparedContext(checkpoint_prepared_ctx_.get(), scope_); return true; } diff --git a/paddle/fluid/operators/distributed/request_handler_impl.h b/paddle/fluid/operators/distributed/request_handler_impl.h index 8be5b21bb89a580f4091de19186fd2d7e5802478..c1afda9dd2445e492d8b93659c9ff13e6e1030b8 100644 --- a/paddle/fluid/operators/distributed/request_handler_impl.h +++ b/paddle/fluid/operators/distributed/request_handler_impl.h @@ -36,20 +36,34 @@ namespace distributed { class RequestSendHandler final : public RequestHandler { public: - explicit RequestSendHandler(bool sync_mode) : RequestHandler(sync_mode) {} + explicit RequestSendHandler(bool sync_mode, bool enable_dc_asgd = false) + : RequestHandler(sync_mode) { + enable_dc_asgd_ = enable_dc_asgd; + } virtual ~RequestSendHandler() {} bool Handle(const std::string& varname, framework::Scope* scope, framework::Variable* var, framework::Variable** outvar, + const int trainer_id, const std::string& out_var_name = "") override; + + private: + bool enable_dc_asgd_; }; class RequestGetHandler final : public RequestHandler { public: - explicit RequestGetHandler(bool sync_mode) : RequestHandler(sync_mode) {} + explicit RequestGetHandler(bool sync_mode, bool enable_dc_asgd = false) + : RequestHandler(sync_mode) { + enable_dc_asgd_ = enable_dc_asgd; + } virtual ~RequestGetHandler() {} bool Handle(const std::string& varname, framework::Scope* scope, framework::Variable* var, framework::Variable** outvar, + const int trainer_id, const std::string& out_var_name = "") override; + + private: + bool enable_dc_asgd_; }; class RequestPrefetchHandler final : public RequestHandler { @@ -58,6 +72,7 @@ class RequestPrefetchHandler final : public RequestHandler { virtual ~RequestPrefetchHandler() {} bool Handle(const std::string& varname, framework::Scope* scope, framework::Variable* var, framework::Variable** outvar, + const int trainer_id, const std::string& out_var_name = "") override; }; @@ -70,6 +85,7 @@ class RequestCheckpointHandler final : public RequestHandler { virtual ~RequestCheckpointHandler() {} bool Handle(const std::string& varname, framework::Scope* scope, framework::Variable* var, framework::Variable** outvar, + const int trainer_id, const std::string& out_var_name = "") override; private: diff --git a/paddle/fluid/operators/distributed/rpc_client.cc b/paddle/fluid/operators/distributed/rpc_client.cc index b5ec9fe5367beb97b3cc7298102deff1e8ca4ec9..390e9af0f38c920f39e8cd3e1b3d28fd89b239fe 100644 --- a/paddle/fluid/operators/distributed/rpc_client.cc +++ b/paddle/fluid/operators/distributed/rpc_client.cc @@ -24,6 +24,7 @@ 
namespace distributed { std::once_flag RPCClient::init_flag_; std::unique_ptr<RPCClient> RPCClient::rpc_client_(nullptr); +int RPCClient::trainer_id_ = 0; } // namespace distributed } // namespace operators diff --git a/paddle/fluid/operators/distributed/rpc_client.h b/paddle/fluid/operators/distributed/rpc_client.h index 3539ee5e459d6dfe0b6510806464bcc6817910bb..1983802e49506c79041112ac87d429e4c084ddfd 100644 --- a/paddle/fluid/operators/distributed/rpc_client.h +++ b/paddle/fluid/operators/distributed/rpc_client.h @@ -72,14 +72,15 @@ class RPCClient { virtual bool Wait() = 0; template <typename T> - static RPCClient* GetInstance() { - std::call_once(init_flag_, &RPCClient::Init<T>); + static RPCClient* GetInstance(int trainer_id) { + std::call_once(init_flag_, &RPCClient::Init<T>, trainer_id); return rpc_client_.get(); } // Init is called by GetInstance. template <typename T> - static void Init() { + static void Init(int trainer_id) { + trainer_id_ = trainer_id; if (rpc_client_.get() == nullptr) { rpc_client_.reset(new T()); rpc_client_->InitImpl(); @@ -88,6 +89,8 @@ class RPCClient { protected: virtual void InitImpl() {} + // each trainer has exactly one trainer id; it should be static + static int trainer_id_; private: static std::once_flag init_flag_; diff --git a/paddle/fluid/operators/distributed/rpc_server.cc b/paddle/fluid/operators/distributed/rpc_server.cc index 084480ae48b8b9267ade1a840f6a70519cb28e48..4055091104f2f96070d0c4e806c6908da691d732 100644 --- a/paddle/fluid/operators/distributed/rpc_server.cc +++ b/paddle/fluid/operators/distributed/rpc_server.cc @@ -20,42 +20,10 @@ #include "paddle/fluid/operators/distributed/rpc_server.h" #include "paddle/fluid/platform/profiler.h" -DEFINE_int32(rpc_server_profile_period, 0, - "the period of listen_and_serv to do profile"); -DEFINE_string(rpc_server_profile_path, "/dev/null", - "the profile log file path"); - namespace paddle { namespace operators { namespace distributed { -RPCServerProfiler::RPCServerProfiler(int profile_period, - const std::string& profile_log_path) - : profile_period_(profile_period), profile_log_path_(profile_log_path) { - step_ = 0; -} - -void RPCServerProfiler::OneStep() { - PADDLE_ENFORCE_LE(step_, profile_period_, - "step_ should not be larger then " - "profile_period_"); - if (profile_period_ <= 0) { - return; - } - - if (step_ == 0) { - auto pf_state = paddle::platform::ProfilerState::kCPU; - paddle::platform::EnableProfiler(pf_state); - } - if (step_ == profile_period_) { - paddle::platform::DisableProfiler(paddle::platform::EventSortingKey::kTotal, - profile_log_path_); - step_ = 0; - } else { - step_++; - } -} - void RPCServer::ShutDown() { LOG(INFO) << "RPCServer ShutDown "; ShutDownImpl(); @@ -71,7 +39,7 @@ void RPCServer::SavePort() const { port_file.open(file_path); port_file << selected_port_; port_file.close(); - VLOG(4) << "selected port written to " << file_path; + VLOG(40) << "selected port written to " << file_path; } void RPCServer::WaitBarrier(const std::string& rpc_name) { @@ -81,12 +49,12 @@ void RPCServer::WaitBarrier(const std::string& rpc_name) { exit_flag_.load()); }); - VLOG(3) << "batch_barrier_: " << rpc_name << " " - << barrier_counter_[rpc_name]; + VLOG(30) << "batch_barrier_: " << rpc_name << " " - wait - + << barrier_counter_[rpc_name]; } void RPCServer::IncreaseBatchBarrier(const std::string rpc_name) { - VLOG(4) << "RPCServer begin IncreaseBatchBarrier " << rpc_name; + VLOG(40) << "RPCServer begin IncreaseBatchBarrier " << rpc_name; int b = 0; std::unique_lock<std::mutex> lock(mutex_); b = ++barrier_counter_[rpc_name]; @@ -103,7 +71,7 @@
void RPCServer::Complete() { client_num_--; need_reset_all_vars_ = true; - VLOG(4) << "decrease client_num to: " << client_num_; + VLOG(40) << "decrease client_num to: " << client_num_; if (cur_cond_.load() == rpc_cond_map_[kRequestGet]) { barrier_counter_[kRequestGet]--; } @@ -122,7 +90,7 @@ int RPCServer::GetClientNum() { } void RPCServer::ResetBarrierCounter() { - VLOG(3) << "RPCServer ResetBarrierCounter "; + VLOG(30) << "RPCServer ResetBarrierCounter "; std::unique_lock lock(mutex_); for (auto& t : barrier_counter_) { t.second = 0; @@ -137,12 +105,12 @@ void RPCServer::RegisterRPC(const std::string& rpc_name, static int cond = -1; rpc_cond_map_[rpc_name] = ++cond; - VLOG(4) << "RegisterRPC rpc_name:" << rpc_name << ", handler:" << handler - << ", cond:" << rpc_cond_map_[rpc_name]; + VLOG(40) << "RegisterRPC rpc_name:" << rpc_name << ", handler:" << handler + << ", cond:" << rpc_cond_map_[rpc_name]; } void RPCServer::SetCond(const std::string& rpc_name) { - VLOG(3) << "RPCServer SetCond " << rpc_name; + VLOG(30) << "RPCServer SetCond " << rpc_name; { std::unique_lock lock(mutex_); cur_cond_ = rpc_cond_map_[rpc_name]; @@ -152,7 +120,7 @@ void RPCServer::SetCond(const std::string& rpc_name) { } void RPCServer::WaitCond(const std::string& rpc_name) { - VLOG(4) << "RPCServer WaitCond " << rpc_name; + VLOG(40) << "RPCServer WaitCond " << rpc_name; int cond = 0; { std::unique_lock lock(mutex_); diff --git a/paddle/fluid/operators/distributed/rpc_server.h b/paddle/fluid/operators/distributed/rpc_server.h index f3e61e1575ced0b9ffbad23e6973121daca9751b..c78c5007a7f262f15305b6c284e8c4fbddef42a0 100644 --- a/paddle/fluid/operators/distributed/rpc_server.h +++ b/paddle/fluid/operators/distributed/rpc_server.h @@ -23,30 +23,14 @@ #include "paddle/fluid/operators/distributed/request_handler.h" -DECLARE_int32(rpc_server_profile_period); -DECLARE_string(rpc_server_profile_path); - namespace paddle { namespace operators { namespace distributed { -class RPCServerProfiler { - public: - RPCServerProfiler(int profile_period, const std::string& profile_log_path); - void OneStep(); - - private: - const int profile_period_; - std::string profile_log_path_; - int step_; -}; - class RPCServer { public: explicit RPCServer(const std::string& address, int client_num) : cur_cond_(0), - profiler_(FLAGS_rpc_server_profile_period, - FLAGS_rpc_server_profile_path), bind_address_(address), exit_flag_(false), selected_port_(0), @@ -86,7 +70,6 @@ class RPCServer { void Complete(); void ResetBarrierCounter(); - RPCServerProfiler& Profiler() { return profiler_; } bool NeedResetAllVars(); @@ -101,7 +84,6 @@ class RPCServer { std::unordered_map rpc_cond_map_; std::atomic cur_cond_; std::condition_variable rpc_cond_; - RPCServerProfiler profiler_; protected: std::string bind_address_; diff --git a/paddle/fluid/operators/distributed/rpc_server_test.cc b/paddle/fluid/operators/distributed/rpc_server_test.cc index d6176e1443d2a441af7878e5efe99796d486bb7a..c3dd459fc4e8c4bd304c09d7a3ed4f456c4dc69f 100644 --- a/paddle/fluid/operators/distributed/rpc_server_test.cc +++ b/paddle/fluid/operators/distributed/rpc_server_test.cc @@ -125,7 +125,7 @@ TEST(PREFETCH, CPU) { g_req_handler.reset(new distributed::RequestPrefetchHandler(true)); g_rpc_service.reset(new RPCSERVER_T("127.0.0.1:0", 1)); distributed::RPCClient* client = - distributed::RPCClient::GetInstance(); + distributed::RPCClient::GetInstance(0); std::thread server_thread(StartServer, distributed::kRequestPrefetch); g_rpc_service->WaitServerReady(); @@ -165,7 +165,7 @@ 
TEST(COMPLETE, CPU) { g_req_handler.reset(new distributed::RequestSendHandler(true)); g_rpc_service.reset(new RPCSERVER_T("127.0.0.1:0", 2)); distributed::RPCClient* client = - distributed::RPCClient::GetInstance(); + distributed::RPCClient::GetInstance(0); PADDLE_ENFORCE(client != nullptr); std::thread server_thread(StartServer, distributed::kRequestSend); g_rpc_service->WaitServerReady(); diff --git a/paddle/fluid/operators/distributed/send_recv.proto.in b/paddle/fluid/operators/distributed/send_recv.proto.in index 8b0a09abe1d05dda10eda0030eb91cb9ca40683e..55820c980e8139625c1b589f9d2d68dfee74a212 100644 --- a/paddle/fluid/operators/distributed/send_recv.proto.in +++ b/paddle/fluid/operators/distributed/send_recv.proto.in @@ -79,6 +79,7 @@ message VariableMessage { // server stops profiling and generates a profile to /tmp/profile_ps_* // when profile switches from 1 to 2. int64 profile = 11; + int64 trainer_id = 12; } message VoidMessage {} diff --git a/paddle/fluid/operators/distributed/sendrecvop_utils.cc b/paddle/fluid/operators/distributed/sendrecvop_utils.cc index 6a3f8fd544bc5d669b725765a863b42ec069a7b6..6ba883ba01f73bd941e7b1b58afb777e2588764b 100644 --- a/paddle/fluid/operators/distributed/sendrecvop_utils.cc +++ b/paddle/fluid/operators/distributed/sendrecvop_utils.cc @@ -15,12 +15,14 @@ limitations under the License. */ #ifdef PADDLE_WITH_CUDA #include #endif -#include #include // NOLINT #include "paddle/fluid/framework/data_type.h" #include "paddle/fluid/operators/distributed/sendrecvop_utils.h" #include "paddle/fluid/operators/distributed/variable_response.h" +#include "paddle/fluid/platform/port.h" + +DEFINE_bool(rpc_disable_reuse_port, false, "Disable SO_REUSEPORT or not."); namespace paddle { namespace operators { @@ -28,16 +30,34 @@ namespace distributed { using VarMsg = sendrecv::VariableMessage; +static TensorPayload GetCommunicationAllocationFromTensor( + const platform::DeviceContext& ctx, const framework::Tensor& tensor) { + if (is_gpu_place(ctx.GetPlace())) { #ifdef PADDLE_WITH_CUDA -void* GetVarPayLoad(const std::string varname, int64_t size) { - platform::CUDAPinnedPlace cuda_pinned; - return memory::Alloc(cuda_pinned, size); -} -#endif + PADDLE_ENFORCE(is_gpu_place(tensor.place())); + auto& gpu_dev_ctx = + reinterpret_cast(ctx); + auto copy_size = tensor.numel() * framework::SizeOfType(tensor.type()); + platform::CUDAPinnedPlace cuda_pinned; + auto result = memory::AllocShared( + cuda_pinned, copy_size, memory::allocation::Allocator::kCrossDevice); + + memory::Copy(cuda_pinned, result->ptr(), + boost::get(tensor.place()), + tensor.data(), copy_size, gpu_dev_ctx.stream()); -void GetTensorPayload(framework::Variable* var, - const platform::DeviceContext& ctx, VarMsg* request, - void** payload, size_t* payload_size) { + ctx.Wait(); + return TensorPayload(result); +#else + PADDLE_THROW("This situation should not be happened"); +#endif + } else { + return TensorPayload(tensor); + } +} +TensorPayload GetTensorPayload(framework::Variable* var, + const platform::DeviceContext& ctx, + VarMsg* request) { auto tensor = var->Get(); // FIXME(wuyi): data types in send_recv.proto is copied from // framework.proto @@ -56,31 +76,12 @@ void GetTensorPayload(framework::Variable* var, } } } - if (platform::is_gpu_place(ctx.GetPlace())) { -#ifdef PADDLE_WITH_CUDA - PADDLE_ENFORCE(platform::is_gpu_place(tensor.place())); - // platform::CUDAPinnedPlace cuda_pinned; - auto& gpu_dev_ctx = static_cast(ctx); - auto copy_size = tensor.numel() * framework::SizeOfType(tensor.type()); - 
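GetCommunicationAllocationFromTensor centralizes the decision the two payload getters used to duplicate: CPU tensors are handed to the RPC layer zero-copy, while GPU tensors are first staged into CUDA pinned host memory (memory::Copy on the device stream, then ctx.Wait()). A heavily simplified host-only sketch of that branch, with stand-in types for Tensor and the place check:

#include <cstring>
#include <memory>
#include <vector>

struct BufferView {
  const void* ptr;
  size_t bytes;
  bool on_gpu;  // is_gpu_place(tensor.place()) in the real code
};

// Returns nullptr when the buffer is already host-readable (zero-copy path);
// otherwise returns a staged host copy. The real code issues an async copy
// into pinned pages and waits on the CUDA stream instead of std::memcpy.
std::shared_ptr<std::vector<char>> StageIfOnDevice(const BufferView& v) {
  if (!v.on_gpu) return nullptr;
  auto staged = std::make_shared<std::vector<char>>(v.bytes);
  std::memcpy(staged->data(), v.ptr, v.bytes);  // device->host stand-in
  return staged;
}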
*payload = GetVarPayLoad(request->varname(), copy_size); - - platform::CUDAPinnedPlace cuda_pinned; - memory::Copy(cuda_pinned, *payload, - boost::get(tensor.place()), - reinterpret_cast(tensor.data()), copy_size, - gpu_dev_ctx.stream()); - - ctx.Wait(); -#endif - } else { - *payload = tensor.data(); - } - *payload_size = tensor.numel() * framework::SizeOfType(tensor.type()); + return GetCommunicationAllocationFromTensor(ctx, tensor); } -void GetSelectedRowsPayload(framework::Variable* var, - const platform::DeviceContext& ctx, VarMsg* request, - void** payload, size_t* payload_size) { +TensorPayload GetSelectedRowsPayload(framework::Variable* var, + const platform::DeviceContext& ctx, + VarMsg* request) { auto* slr = var->GetMutable(); request->set_data_type( static_cast(framework::ToDataType(slr->value().type()))); @@ -92,25 +93,20 @@ void GetSelectedRowsPayload(framework::Variable* var, } auto* tensor = slr->mutable_value(); - if (platform::is_gpu_place(ctx.GetPlace())) { -#ifdef PADDLE_WITH_CUDA - auto& gpu_dev_ctx = static_cast(ctx); - auto copy_size = tensor->numel() * framework::SizeOfType(tensor->type()); - *payload = GetVarPayLoad(request->varname(), copy_size); - - platform::CUDAPinnedPlace cuda_pinned; - memory::Copy(cuda_pinned, *payload, - boost::get(tensor->place()), - reinterpret_cast(tensor->data()), copy_size, - gpu_dev_ctx.stream()); - ctx.Wait(); -#endif - } else { - *payload = slr->mutable_value()->data(); - } - *payload_size = tensor->numel() * framework::SizeOfType(tensor->type()); + return GetCommunicationAllocationFromTensor(ctx, *tensor); } +TensorPayload::TensorPayload(std::shared_ptr allocation) + : allocation_(allocation), offset_(0), memory_size_(allocation->size()) {} +TensorPayload::TensorPayload(const framework::Tensor& tensor) + : allocation_(tensor.Holder()), + offset_(tensor.offset()), + memory_size_(tensor.numel() * framework::SizeOfType(tensor.type())) {} +void* TensorPayload::ptr() const { + return reinterpret_cast( + reinterpret_cast(allocation_->ptr()) + offset_); +} +size_t TensorPayload::memory_size() const { return memory_size_; } } // namespace distributed } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/distributed/sendrecvop_utils.h b/paddle/fluid/operators/distributed/sendrecvop_utils.h index 4d08d3c77afa3c1f2b4d7602f7199558bb5a79c0..523e56fe3e4eaf9d4faa8f0bf91d0b6075e0b0e5 100644 --- a/paddle/fluid/operators/distributed/sendrecvop_utils.h +++ b/paddle/fluid/operators/distributed/sendrecvop_utils.h @@ -13,7 +13,6 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once -#include #include #include #include @@ -24,6 +23,7 @@ limitations under the License. 
*/ #include "paddle/fluid/framework/selected_rows.h" #include "paddle/fluid/framework/tensor_util.h" #include "paddle/fluid/framework/var_type.h" +#include "paddle/fluid/platform/port.h" #include "paddle/fluid/operators/distributed/send_recv.pb.h" @@ -33,13 +33,30 @@ namespace distributed { using VarMsg = sendrecv::VariableMessage; -void GetTensorPayload(framework::Variable* var, - const platform::DeviceContext& ctx, VarMsg* request, - void** payload, size_t* payload_size); +class TensorPayload final { + public: + explicit TensorPayload(const framework::Tensor& tensor); + explicit TensorPayload(std::shared_ptr allocation); -void GetSelectedRowsPayload(framework::Variable* var, - const platform::DeviceContext& ctx, VarMsg* request, - void** payload, size_t* payload_size); + TensorPayload(const TensorPayload& o) = default; + TensorPayload& operator=(const TensorPayload& o) = default; + + void* ptr() const; + size_t memory_size() const; + + private: + std::shared_ptr allocation_; + size_t offset_; + size_t memory_size_; +}; + +TensorPayload GetTensorPayload(framework::Variable* var, + const platform::DeviceContext& ctx, + VarMsg* request); + +TensorPayload GetSelectedRowsPayload(framework::Variable* var, + const platform::DeviceContext& ctx, + VarMsg* request); inline std::type_index ToTypeIndex(sendrecv::VariableMessage::Type type) { switch (type) { diff --git a/paddle/fluid/operators/distributed/variable_response.cc b/paddle/fluid/operators/distributed/variable_response.cc index c4854d50b6371064003a10e18efc9e5f160d9a42..f831793e9b2aeedb6a073013494a86fcd3246b38 100644 --- a/paddle/fluid/operators/distributed/variable_response.cc +++ b/paddle/fluid/operators/distributed/variable_response.cc @@ -16,6 +16,9 @@ #include #include "paddle/fluid/operators/distributed/sendrecvop_utils.h" +DEFINE_string(rpc_server_profile_path, "./profile_ps", + "the profile log file path"); + namespace paddle { namespace operators { namespace distributed { @@ -47,7 +50,7 @@ bool VariableResponse::ReadRaw(::google::protobuf::io::CodedInputStream* input, size_to_write = length - total_written; } // This log is useful to see how long a internal block size is of rpc. - VLOG(7) << "copy " << size_to_write << " data to CUDAPlace"; + VLOG(70) << "copy " << size_to_write << " data to CUDAPlace"; memory::Copy(boost::get(place), reinterpret_cast(p), cpu, data, size_to_write, gpu_dev_ctx.stream()); @@ -76,7 +79,7 @@ bool VariableResponse::ReadRaw(::google::protobuf::io::CodedInputStream* input, // TODO(gongwb): can we avoid copy? platform::CPUPlace cpu; // This log is useful to see how long a internal block size is of rpc. 
- VLOG(7) << "copy " << size_to_write << " data to CPUPlace"; + VLOG(70) << "copy " << size_to_write << " data to CPUPlace"; memory::Copy(cpu, reinterpret_cast(p), cpu, data, size_to_write); p += size_to_write; @@ -112,11 +115,11 @@ bool VariableResponse::CopyLodTensorData( void* tensor_data = tensor->mutable_data(ctx.GetPlace(), ToTypeIndex(meta_.data_type())); - if (!ReadRaw(input, ctx, tensor->place(), tensor_data, length)) { - return false; - } - return true; + VLOG(6) << "Tensor.memory_size = " << tensor->memory_size() + << ", Buffer Size = " << length; + PADDLE_ENFORCE_EQ(tensor->memory_size(), length); + return ReadRaw(input, ctx, tensor->place(), tensor_data, length); } inline framework::DDim GetDims( @@ -195,8 +198,8 @@ bool VariableResponse::ProcSerializedField( #endif } - VLOG(7) << "ProcSerializedField:" << meta_.varname() - << ", type:" << meta_.type() << std::endl; + VLOG(70) << "ProcSerializedField:" << meta_.varname() + << ", type:" << meta_.type() << std::endl; framework::DDim dims = GetDims(meta_.dims()); if (meta_.type() == sendrecv::LOD_TENSOR) { PADDLE_ENFORCE(meta_.lod_size() >= 0, "lod info should be got first!"); diff --git a/paddle/fluid/operators/distributed/variable_response.h b/paddle/fluid/operators/distributed/variable_response.h index 6aec52ca00f59a42ecca01da8df1680ce4eda432..4c7fcbbdfb305ce6b4fc9d1edd9738899b200ec6 100644 --- a/paddle/fluid/operators/distributed/variable_response.h +++ b/paddle/fluid/operators/distributed/variable_response.h @@ -27,6 +27,8 @@ #include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/operators/distributed/send_recv.pb.h" +DECLARE_string(rpc_server_profile_path); + namespace paddle { namespace operators { namespace distributed { @@ -92,6 +94,8 @@ class VariableResponse { return scope_->FindVar(meta_.varname()); } + int GetTrainerId() { return static_cast(meta_.trainer_id()); } + protected: bool ReadRaw(::google::protobuf::io::CodedInputStream* input, const platform::DeviceContext& dev_ctx, platform::Place place, diff --git a/paddle/fluid/operators/distributed_ops/CMakeLists.txt b/paddle/fluid/operators/distributed_ops/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..28bb90af5675b2fe14813675ad001c0cf1d71e12 --- /dev/null +++ b/paddle/fluid/operators/distributed_ops/CMakeLists.txt @@ -0,0 +1,40 @@ +include(operators) + +set(DISTRIBUTE_DEPS "") +if(WITH_GRPC) + set(DISTRIBUTE_DEPS sendrecvop_grpc grpc++_unsecure grpc_unsecure gpr cares zlib protobuf node) +else() + set(DISTRIBUTE_DEPS sendrecvop_brpc brpc leveldb snappystream snappy protobuf ssl crypto zlib node) + if(WITH_BRPC_RDMA) + find_library(IBVERBS_LIBRARY NAMES ibverbs) + ADD_LIBRARY(ibverbs SHARED IMPORTED GLOBAL) + SET_PROPERTY(TARGET ibverbs PROPERTY IMPORTED_LOCATION ${IBVERBS_LIBRARY}) + + + find_library(RDMACM_LIBRARY NAMES rdmacm) + ADD_LIBRARY(rdmacm SHARED IMPORTED GLOBAL) + SET_PROPERTY(TARGET rdmacm PROPERTY IMPORTED_LOCATION ${RDMACM_LIBRARY}) + + set(DISTRIBUTE_DEPS ${DISTRIBUTE_DEPS} ibverbs rdmacm) + endif() +endif() + +set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") + + +file(GLOB OPS RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "*_op.cc") +list(REMOVE_DUPLICATES OPS) + +foreach(src ${OPS}) + set_source_files_properties(${src} PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) +endforeach() + +register_operators(EXCLUDES gen_nccl_id_op DEPS ${DISTRIBUTE_DEPS}) + +if(WITH_GPU AND NOT WIN32) + set(DISTRIBUTE_DEPS ${DISTRIBUTE_DEPS} nccl_common) + 
op_library(gen_nccl_id_op ${DISTRIBUTE_DEPS} nccl_common) +endif() + +set(OPERATOR_DEPS ${OPERATOR_DEPS} ${DISTRIBUTE_DEPS} PARENT_SCOPE) +set(GLOB_DISTRIBUTE_DEPS ${DISTRIBUTE_DEPS} CACHE INTERNAL "distributed dependency") diff --git a/paddle/fluid/operators/checkpoint_notify_op.cc b/paddle/fluid/operators/distributed_ops/checkpoint_notify_op.cc similarity index 88% rename from paddle/fluid/operators/checkpoint_notify_op.cc rename to paddle/fluid/operators/distributed_ops/checkpoint_notify_op.cc index 3a2527e407bb179c4873fa3ffe2e8f22fb47faf7..ed4dced51356515d5910e2962c9ee91a1997dbf0 100644 --- a/paddle/fluid/operators/checkpoint_notify_op.cc +++ b/paddle/fluid/operators/distributed_ops/checkpoint_notify_op.cc @@ -19,7 +19,7 @@ limitations under the License. */ #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/detail/macros.h" -#include "paddle/fluid/operators/send_recv_util.h" +#include "paddle/fluid/operators/distributed_ops/send_recv_util.h" #include "paddle/fluid/string/printf.h" namespace paddle { @@ -38,15 +38,16 @@ class CheckpointNotifyOp : public framework::OperatorBase { std::vector epmap = Attr>("epmap"); std::string dir = Attr("dir"); std::string lookup_table_name = Attr("lookup_table"); + int trainer_id = Attr("trainer_id"); distributed::RPCClient* rpc_client = - distributed::RPCClient::GetInstance(); + distributed::RPCClient::GetInstance(trainer_id); for (size_t i = 0; i < epmap.size(); i++) { auto lookup_table_save_dir = string::Sprintf("%s/%s_%d", dir, lookup_table_name, i); rpc_client->AsyncCheckpointNotify(epmap[i], lookup_table_save_dir); - VLOG(3) << "checkpoint notify sending lookup table: " << lookup_table_name - << " and dir:" << dir << " to " << epmap[i]; + VLOG(30) << "checkpoint notify sending lookup table: " + << lookup_table_name << " and dir:" << dir << " to " << epmap[i]; } PADDLE_ENFORCE(rpc_client->Wait(), "internal error in RPCClient"); } @@ -63,6 +64,7 @@ class CheckpointNotifyOpMaker : public framework::OpProtoAndCheckerMaker { "dir", "(string, default '') indicate the folder checkpoint will use"); AddAttr("lookup_table", "(string, default '') the lookup table name"); + AddAttr("trainer_id", "trainer id from 0 ~ worker_num.").SetDefault(0); AddComment(R"DOC( CheckpointNotify operator diff --git a/paddle/fluid/operators/distributed_ops/fake_init_op.cc b/paddle/fluid/operators/distributed_ops/fake_init_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..28ebdcb03ea83f3ec701106111a7cc5f0f7ed7dc --- /dev/null +++ b/paddle/fluid/operators/distributed_ops/fake_init_op.cc @@ -0,0 +1,86 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +class FakeInitInferShape : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE(ctx->HasOutput("Out"), + "Output(Out) of FakeInitOp should not be null."); + auto &shape = ctx->Attrs().Get>("shape"); + ctx->SetOutputDim("Out", framework::make_ddim(shape)); + } +}; + +class FakeInitOp : public framework::OperatorBase { + public: + using framework::OperatorBase::OperatorBase; + + private: + void RunImpl(const framework::Scope &scope, + const platform::Place &dev_place) const override { + framework::Tensor *tensor = nullptr; + + auto &out_var = *scope.FindVar(Output("Out")); + + if (out_var.IsType()) { + tensor = out_var.GetMutable(); + tensor->Resize(framework::make_ddim(Attr>("shape"))); + } else if (out_var.IsType()) { + tensor = out_var.GetMutable()->mutable_value(); + tensor->Resize(framework::make_ddim(Attr>("shape"))); + } else { + PADDLE_THROW( + "fake init op's output only" + "supports SelectedRows and LoDTensor"); + } + } +}; + +class FakeInitOpVarTypeInference : public framework::VarTypeInference { + public: + void operator()(const framework::OpDesc &op_desc, + framework::BlockDesc *block) const override {} +}; + +class FakeInitOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddAttr>("shape", + "(vector) The shape of the output"); + AddOutput("Out", + "(Tensor) Tensor of specified shape will be filled " + "with the specified value"); + AddComment(R"DOC( +FakeInit Operator. + +Init an variable but not alloc memory for it, it is used for init the +table parameter at trainer side in distributed lookup table. + +)DOC"); + } +}; +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(fake_init, ops::FakeInitOp, ops::FakeInitInferShape, + ops::FakeInitOpMaker, paddle::framework::EmptyGradOpMaker, + ops::FakeInitOpVarTypeInference); diff --git a/paddle/fluid/operators/fetch_barrier_op.cc b/paddle/fluid/operators/distributed_ops/fetch_barrier_op.cc similarity index 92% rename from paddle/fluid/operators/fetch_barrier_op.cc rename to paddle/fluid/operators/distributed_ops/fetch_barrier_op.cc index 9d7ac7ab6194593747548fac3cefc8d4ed3058d8..88a5e59ce7d6c0d14e480922bd328d632c9178e5 100644 --- a/paddle/fluid/operators/fetch_barrier_op.cc +++ b/paddle/fluid/operators/distributed_ops/fetch_barrier_op.cc @@ -37,12 +37,13 @@ class FetchBarrierOp : public framework::OperatorBase { const platform::Place& place) const override { std::vector eps = Attr>("endpoints"); distributed::RPCClient* rpc_client = - distributed::RPCClient::GetInstance(); + distributed::RPCClient::GetInstance( + Attr("trainer_id")); PADDLE_ENFORCE(rpc_client->Wait(), "internal error in RPCClient"); for (auto& ep : eps) { - VLOG(3) << "fetch barrier, ep: " << ep; + VLOG(30) << "fetch barrier, ep: " << ep; rpc_client->AsyncSendFetchBarrier(ep); } PADDLE_ENFORCE(rpc_client->Wait(), "internal error in RPCClient"); @@ -61,6 +62,7 @@ This operator will send a send barrier signal to list_and_serv op, so that the Parameter Server would knew all variables have been sent. 
)DOC"); + AddAttr("trainer_id", "trainer id from 0 ~ worker_num.").SetDefault(0); AddAttr>("endpoints", "(string vector, default 127.0.0.1:6164)" "Server endpoints to send variables to.") diff --git a/paddle/fluid/operators/gen_nccl_id_op.cc b/paddle/fluid/operators/distributed_ops/gen_nccl_id_op.cc similarity index 94% rename from paddle/fluid/operators/gen_nccl_id_op.cc rename to paddle/fluid/operators/distributed_ops/gen_nccl_id_op.cc index 697c239e59d158428ae9ba9f7feded19637dff28..56ea165ff84291babc0e9ee56ada669cbbbe79fe 100644 --- a/paddle/fluid/operators/gen_nccl_id_op.cc +++ b/paddle/fluid/operators/distributed_ops/gen_nccl_id_op.cc @@ -61,10 +61,10 @@ class GenNCCLIdOp : public framework::OperatorBase { std::vector endpoint_list = Attr>("endpoint_list"); distributed::RPCClient* client = - distributed::RPCClient::GetInstance(); + distributed::RPCClient::GetInstance(0); for (auto& ep : endpoint_list) { - VLOG(3) << "sending nccl id to " << ep; + VLOG(30) << "sending nccl id to " << ep; client->AsyncSendVar(ep, dev_ctx, *scope, NCCL_ID_VARNAME); } client->Wait(); @@ -72,7 +72,7 @@ class GenNCCLIdOp : public framework::OperatorBase { client->AsyncSendBatchBarrier(ep); } client->Wait(); - VLOG(3) << "sending completed..."; + VLOG(30) << "sending completed..."; } void GetIdByServer(framework::Scope* scope, @@ -99,11 +99,11 @@ class GenNCCLIdOp : public framework::OperatorBase { std::bind(&distributed::RPCServer::StartServer, rpc_service.get())); rpc_service->SetCond(distributed::kRequestSend); - VLOG(3) << "start getting nccl id from trainer 0..."; + VLOG(30) << "start getting nccl id from trainer 0..."; rpc_service->WaitBarrier(distributed::kRequestSend); - VLOG(3) << "got nccl id and stop server..."; + VLOG(30) << "got nccl id and stop server..."; rpc_service->ShutDown(); - VLOG(3) << "rpc server stopped"; + VLOG(30) << "rpc server stopped"; server_thread.join(); } }; diff --git a/paddle/fluid/operators/listen_and_serv_op.cc b/paddle/fluid/operators/distributed_ops/listen_and_serv_op.cc similarity index 85% rename from paddle/fluid/operators/listen_and_serv_op.cc rename to paddle/fluid/operators/distributed_ops/listen_and_serv_op.cc index dc008d16971bc762b401ddece56f9ec56f7a47d6..9f0c7db0e1133f6d73e73a9d162a945ba4c17dc6 100644 --- a/paddle/fluid/operators/listen_and_serv_op.cc +++ b/paddle/fluid/operators/distributed_ops/listen_and_serv_op.cc @@ -25,14 +25,18 @@ limitations under the License. 
*/ #include "paddle/fluid/operators/math/math_function.h" #include "paddle/fluid/operators/distributed/request_handler_impl.h" -#include "paddle/fluid/operators/listen_and_serv_op.h" +#include "paddle/fluid/operators/distributed_ops/listen_and_serv_op.h" + +DEFINE_int32(rpc_send_thread_num, 5, "number of threads for rpc send"); +DEFINE_int32(rpc_get_thread_num, 5, "number of threads for rpc get"); +DEFINE_int32(rpc_prefetch_thread_num, 5, "number of threads for rpc prefetch"); namespace paddle { namespace operators { void RunServer(std::shared_ptr service) { service->StartServer(); - VLOG(4) << "RunServer thread end"; + VLOG(40) << "RunServer thread end"; } static void split(const std::string &str, char sep, std::vector *pieces) { @@ -62,11 +66,11 @@ static void ParallelExecuteBlocks( fs.push_back(framework::Async([&executor, &prepared, &scope, idx]() { int run_block = idx; // thread local try { - VLOG(3) << "running server block: " << run_block - << "pointer: " << prepared[run_block].get(); + VLOG(30) << "running server block: " << run_block + << "pointer: " << prepared[run_block].get(); executor->RunPreparedContext(prepared[run_block].get(), scope); } catch (const std::exception &e) { - LOG(ERROR) << "run sub program error " << e.what(); + LOG(FATAL) << "run sub program:" << idx << " error " << e.what(); } })); } @@ -104,7 +108,7 @@ void ListenAndServOp::RunSyncLoop( framework::Scope *recv_scope, platform::DeviceContext *dev_ctx, const std::vector &prefetch_block_id_list, const int checkpoint_point_block_id) const { - VLOG(2) << "RunSyncLoop"; + VLOG(20) << "RunSyncLoop"; size_t num_blocks = program->Size(); auto optimize_blocks = Attr>(kOptimizeBlocks); @@ -130,7 +134,6 @@ void ListenAndServOp::RunSyncLoop( rpc_service_->ResetBarrierCounter(); while (true) { - rpc_service_->Profiler().OneStep(); // Get from multiple trainers, we don't care about the order in which // the gradients arrives, just add suffix 0~n and merge the gradient. 
rpc_service_->SetCond(distributed::kRequestSend); @@ -164,7 +167,7 @@ void ListenAndServOp::RunSyncLoop( } ParallelExecuteBlocks(parallel_blkids, executor, optimize_prepared, program, recv_scope); - VLOG(2) << "run all blocks spent " << GetTimestamp() - ts << "(ms)"; + VLOG(20) << "run all blocks spent " << GetTimestamp() - ts << "(ms)"; ResetReceivedVars(recv_scope, dev_ctx, rpc_service_->NeedResetAllVars()); @@ -180,11 +183,11 @@ void ListenAndServOp::ResetReceivedVars(framework::Scope *recv_scope, for (auto &varname : sparse_vars_) { auto var = recv_scope->FindVar(varname); if (var == nullptr) { - VLOG(2) << "can not find var " << varname << " in received scope"; + VLOG(20) << "can not find var " << varname << " in received scope"; continue; } if (var->IsType()) { - VLOG(3) << "reset sparse var: " << varname; + VLOG(30) << "reset sparse var: " << varname; var->GetMutable()->mutable_rows()->clear(); } else { PADDLE_THROW("The type of sparse var should be SelectedRows"); @@ -194,7 +197,7 @@ void ListenAndServOp::ResetReceivedVars(framework::Scope *recv_scope, for (auto &varname : dense_vars_) { auto var = recv_scope->FindVar(varname); if (var == nullptr) { - VLOG(2) << "can not find var " << varname << " in received scope"; + VLOG(20) << "can not find var " << varname << " in received scope"; continue; } if (var->IsType()) { @@ -213,24 +216,27 @@ void ListenAndServOp::ResetReceivedVars(framework::Scope *recv_scope, void ListenAndServOp::RunAsyncLoop(framework::Executor *executor, framework::ProgramDesc *program, framework::Scope *recv_scope) const { - VLOG(2) << "RunAsyncLoop"; - // grad name to block id - std::unordered_map grad_to_block_id; - std::unordered_map id_to_grad; - + VLOG(20) << "RunAsyncLoop"; auto grad_to_block_id_str = Attr>("grad_to_block_id"); - for (const auto &grad_and_id : grad_to_block_id_str) { + DoubleFindMap grad_to_block_id; + + auto append_block_maps = [](DoubleFindMap *out_map, + const std::string &grad_and_id) { std::vector pieces; split(grad_and_id, ':', &pieces); - VLOG(3) << "after split, grad = " << pieces[0] << ", id=" << pieces[1]; + VLOG(30) << "after split, key = " << pieces[0] << ", id=" << pieces[1]; PADDLE_ENFORCE_EQ(pieces.size(), 2); - PADDLE_ENFORCE_EQ(grad_to_block_id.count(pieces[0]), 0); + PADDLE_ENFORCE_EQ(out_map->count(pieces[0]), 0); int block_id = std::stoi(pieces[1]); - grad_to_block_id[pieces[0]] = block_id; - id_to_grad[block_id] = pieces[0]; + (*out_map)[pieces[0]] = block_id; + }; + + for (const auto &grad_and_id : grad_to_block_id_str) { + append_block_maps(&grad_to_block_id, grad_and_id); } + size_t num_blocks = program->Size(); PADDLE_ENFORCE_GE(num_blocks, 2, "server program should have at least 2 blocks"); @@ -240,15 +246,22 @@ void ListenAndServOp::RunAsyncLoop(framework::Executor *executor, block_list.push_back(blkid); } auto optimize_prepared = executor->Prepare(*program, block_list); - // execute global block if needed - if (block_list[0] == 1 && id_to_grad.count(1) == 0) { + // execute global block if needed, block id 1 in the program is global + // block if it's not bind to a grad var for it's update. 
+ if (block_list[0] == 1 && + grad_to_block_id.find_value(static_cast(1)) == + grad_to_block_id.end()) { executor->RunPreparedContext(optimize_prepared[0].get(), recv_scope); } std::unordered_map> - grad_to_prepared_ctx; + grad_to_prepared_ctx, param_to_prepared_ctx; for (size_t i = 0; i < block_list.size(); ++i) { - grad_to_prepared_ctx[id_to_grad[block_list[i]]] = optimize_prepared[i]; + auto blkid = block_list[i]; + auto it = grad_to_block_id.find_value(blkid); + if (it != grad_to_block_id.end()) { + grad_to_prepared_ctx[it->first] = optimize_prepared[i]; + } } request_send_handler_->SetGradToPreparedCtx(&grad_to_prepared_ctx); @@ -257,7 +270,7 @@ void ListenAndServOp::RunAsyncLoop(framework::Executor *executor, while (true) { if (rpc_service_->IsExit()) { - VLOG(4) << "get exit!rpc_processor break!"; + VLOG(40) << "get exit!rpc_processor break!"; break; } @@ -311,6 +324,7 @@ void ListenAndServOp::RunImpl(const framework::Scope &scope, framework::Scope &recv_scope = scope.NewScope(); bool sync_mode = Attr("sync_mode"); + bool dc_sgd = Attr("dc_asgd"); auto fan_in = Attr("Fanin"); auto inputs = Inputs("X"); @@ -318,25 +332,30 @@ void ListenAndServOp::RunImpl(const framework::Scope &scope, std::string endpoint = Attr("endpoint"); int checkpoint_block_id = Attr(kCheckpointBlockId); - VLOG(4) << "sync_mode:" << sync_mode << ", fan_in:" << fan_in - << ", end_point:" << endpoint - << ", checkpoint_block_id: " << checkpoint_block_id; + VLOG(40) << "sync_mode:" << sync_mode << ", fan_in:" << fan_in + << ", end_point:" << endpoint + << ", checkpoint_block_id: " << checkpoint_block_id; rpc_service_.reset(new RPCSERVER_T(endpoint, fan_in)); - request_send_handler_.reset(new distributed::RequestSendHandler(sync_mode)); - request_get_handler_.reset(new distributed::RequestGetHandler(sync_mode)); + request_send_handler_.reset( + new distributed::RequestSendHandler(sync_mode, dc_sgd)); + request_get_handler_.reset( + new distributed::RequestGetHandler(sync_mode, dc_sgd)); request_prefetch_handler_.reset( new distributed::RequestPrefetchHandler(sync_mode)); request_checkpoint_handler_.reset(new distributed::RequestCheckpointHandler( sync_mode, checkpoint_block_id)); rpc_service_->RegisterRPC(distributed::kRequestSend, - request_send_handler_.get()); + request_send_handler_.get(), + FLAGS_rpc_send_thread_num); rpc_service_->RegisterRPC(distributed::kRequestGet, - request_get_handler_.get()); + request_get_handler_.get(), + FLAGS_rpc_get_thread_num); rpc_service_->RegisterRPC(distributed::kRequestPrefetch, - request_prefetch_handler_.get()); + request_prefetch_handler_.get(), + FLAGS_rpc_prefetch_thread_num); rpc_service_->RegisterRPC(distributed::kRequestCheckpoint, request_checkpoint_handler_.get()); @@ -364,8 +383,8 @@ void ListenAndServOp::RunImpl(const framework::Scope &scope, prefetch_var_name_to_block_id_str) { std::vector pieces; split(prefetch_var_name_and_id, ':', &pieces); - VLOG(3) << "after split, prefetch_var = " << pieces[0] - << ", id=" << pieces[1]; + VLOG(30) << "after split, prefetch_var = " << pieces[0] + << ", id=" << pieces[1]; PADDLE_ENFORCE_EQ(pieces.size(), 2); int block_id = std::stoi(pieces[1]); @@ -396,7 +415,7 @@ void ListenAndServOp::RunImpl(const framework::Scope &scope, // start the server listening after all member initialized. 
server_thread_.reset(new std::thread(RunServer, rpc_service_)); - VLOG(3) << "wait server thread to become ready..."; + VLOG(30) << "wait server thread to become ready..."; rpc_service_->WaitServerReady(); // register SIGINT(from ctrl+C) and SIGTERM(from kill) signal handlers @@ -436,6 +455,8 @@ class ListenAndServOpMaker : public framework::OpProtoAndCheckerMaker { "a map from grad name to it's optimize block id") .SetDefault({}); AddAttr("sync_mode", "if works at sync_mode or not").SetDefault(true); + AddAttr("dc_asgd", "set to true will enable DC-ASGD training.") + .SetDefault(false); AddAttr>( kOptimizeBlocks, "Optimize blocks to run on server side.") .SetDefault({}); diff --git a/paddle/fluid/operators/listen_and_serv_op.h b/paddle/fluid/operators/distributed_ops/listen_and_serv_op.h similarity index 89% rename from paddle/fluid/operators/listen_and_serv_op.h rename to paddle/fluid/operators/distributed_ops/listen_and_serv_op.h index 5f889793ab16249a4e06801090db087a089dbed1..9431978df836121baacc12ed4e1ee6b218cc7d7a 100644 --- a/paddle/fluid/operators/listen_and_serv_op.h +++ b/paddle/fluid/operators/distributed_ops/listen_and_serv_op.h @@ -18,6 +18,7 @@ limitations under the License. */ #include #include #include +#include #include #include "paddle/fluid/framework/executor.h" @@ -37,6 +38,17 @@ constexpr char kCheckpointBlockId[] = "checkpint_block_id"; void RunServer(std::shared_ptr service); +template +class DoubleFindMap : public std::unordered_map { + public: + typename std::unordered_map::iterator find_value(TValue v) { + return std::find_if(this->begin(), this->end(), + [&v](const std::pair p) { + return p.second == v; + }); + } +}; + class ListenAndServOp : public framework::OperatorBase { public: ListenAndServOp(const std::string& type, diff --git a/paddle/fluid/operators/merge_ids_op.cc b/paddle/fluid/operators/distributed_ops/merge_ids_op.cc similarity index 80% rename from paddle/fluid/operators/merge_ids_op.cc rename to paddle/fluid/operators/distributed_ops/merge_ids_op.cc index c6ec4ab047d5e91625e646fd26108d2e477cdce5..252a63cb605f65e8572281a05e884fb8b020a820 100644 --- a/paddle/fluid/operators/merge_ids_op.cc +++ b/paddle/fluid/operators/distributed_ops/merge_ids_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/merge_ids_op.h" +#include "paddle/fluid/operators/distributed_ops/merge_ids_op.h" namespace paddle { namespace operators { @@ -20,13 +20,16 @@ namespace operators { class MergeIdsOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { - AddInput("Ids", "(LoDTensor) the input ids with shape{batch_num, 1}"); - AddInput( - "X", - "(LoDTensors) multi input tensor with shape{batch_num, N}, N is the " - "size of embedding table") + AddInput("Ids", "(LoDTensor) the input ids with shape{batch_num, 1}") + .AsDuplicable(); + AddInput("Rows", "(LoDTensor) the input ids with shape{row_size, 1}, ") + .AsDuplicable(); + AddInput("X", + "(LoDTensors) multi input tensor with shape{Rows, N}, N is the " + "size of embedding table") + .AsDuplicable(); + AddOutput("Out", "(LoDTensor) The merged outputs of the input tensors.") .AsDuplicable(); - AddOutput("Out", "(LoDTensor) The merged outputs of the input tensors."); AddComment(R"DOC( Merge multi LoDTensor's into one according to Ids's shard num. 
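DoubleFindMap adds a value-to-iterator lookup on top of std::unordered_map, which RunAsyncLoop uses to ask whether block id 1 is already bound to some gradient name. A compilable usage sketch of the same class (template parameters spelled out here, since the diff's angle brackets were stripped in transit):

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>

template <typename TKey, typename TValue>
class DoubleFindMap : public std::unordered_map<TKey, TValue> {
 public:
  typename std::unordered_map<TKey, TValue>::iterator find_value(TValue v) {
    return std::find_if(
        this->begin(), this->end(),
        [&v](const std::pair<const TKey, TValue>& p) { return p.second == v; });
  }
};

int main() {
  DoubleFindMap<std::string, int> grad_to_block_id;
  grad_to_block_id["w@GRAD"] = 2;
  grad_to_block_id["b@GRAD"] = 3;
  // Block 1 is bound to no gradient, so it is treated as the global block.
  bool block1_is_global =
      grad_to_block_id.find_value(1) == grad_to_block_id.end();
  std::cout << std::boolalpha << block1_is_global << "\n";  // true
}

find_value is a linear scan, which is acceptable here because the map holds at most a handful of optimize blocks.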
@@ -79,15 +82,19 @@ class MergeIdsOp : public framework::OperatorWithKernel { using framework::OperatorWithKernel::OperatorWithKernel; void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Ids"), "MergeIdsOp must has input Ids."); - PADDLE_ENFORCE(ctx->HasInputs("X"), "MergeIdsOp must has input X."); - PADDLE_ENFORCE(ctx->HasOutput("Out"), "MergeIdsOp must has output Out."); + PADDLE_ENFORCE(ctx->HasInputs("Ids"), + "MergeIdsOp must has multi input Ids."); + PADDLE_ENFORCE(ctx->HasInputs("Rows"), + "MergeIdsOp must has multi input Rows."); + PADDLE_ENFORCE(ctx->HasInputs("X"), "MergeIdsOp must has multi input X."); + PADDLE_ENFORCE(ctx->HasOutputs("Out"), + "MergeIdsOp must has multi output Out."); auto ids_var_type = ctx->GetInputsVarType("Ids").front(); - auto ids_dims = ctx->GetInputDim("Ids"); + auto ids_dims = ctx->GetInputsDim("Ids"); if (ids_var_type == framework::proto::VarType::LOD_TENSOR) { - PADDLE_ENFORCE_EQ(ids_dims.size(), 2); - PADDLE_ENFORCE_EQ(ids_dims[1], 1); + PADDLE_ENFORCE_EQ(ids_dims[0].size(), 2); + PADDLE_ENFORCE_EQ(ids_dims[0][1], 1); } auto x_var_type = ctx->GetInputsVarType("X"); for (auto &var_type : x_var_type) { diff --git a/paddle/fluid/operators/distributed_ops/merge_ids_op.h b/paddle/fluid/operators/distributed_ops/merge_ids_op.h new file mode 100644 index 0000000000000000000000000000000000000000..99c57590191d58a12760fb335df76037685d1ced --- /dev/null +++ b/paddle/fluid/operators/distributed_ops/merge_ids_op.h @@ -0,0 +1,105 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/operators/math/selected_rows_functor.h" + +namespace paddle { +namespace operators { + +template +class MergeIdsOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &ctx) const override { + auto place = ctx.GetPlace(); + if (!platform::is_cpu_place(place)) { + PADDLE_THROW("MergeIds do not support GPU kernel"); + } + + const auto ids = ctx.MultiInput("Ids"); + const auto row_ids = ctx.MultiInput("Rows"); + const auto x_tensors = ctx.MultiInput("X"); + auto outs = ctx.MultiOutput("Out"); + + PADDLE_ENFORCE_EQ(row_ids.size(), x_tensors.size(), + "the number of Rows and X should be the same"); + PADDLE_ENFORCE_EQ(ids.size(), outs.size(), + "the number of Ids and Out should be the same"); + + size_t row_ids_size = 0; + int row_size = 0; + int embedding_size = 0; + + for (size_t i = 0; i < x_tensors.size(); ++i) { + const auto *x_tensor = x_tensors[i]; + const auto *row_id = row_ids[i]; + + if (embedding_size == 0) { + embedding_size = x_tensor->dims()[1]; + } + PADDLE_ENFORCE_EQ(embedding_size, x_tensor->dims()[1], + "embedding size of all input should be the same"); + row_size += x_tensor->dims()[0]; + row_ids_size += row_id->dims()[0]; + } + + PADDLE_ENFORCE_EQ( + row_size, row_ids_size, + "the merged X dim[0] and merged Rows dim[0] should be the same"); + + std::unordered_map> + selected_rows_idx_map; + for (size_t i = 0; i < x_tensors.size(); ++i) { + const auto *row_id = row_ids[i]; + + for (int j = 0; j < row_id->numel(); ++j) { + int64_t key = row_id->data()[j]; + std::tuple val = std::make_tuple(i, j); + selected_rows_idx_map.insert(std::make_pair(key, val)); + } + } + PADDLE_ENFORCE_EQ(row_ids_size, selected_rows_idx_map.size(), + "the rows and tensor map size should be the same"); + + for (size_t i = 0; i < outs.size(); ++i) { + auto *out_ids = ids[i]; + auto *out = outs[i]; + + out->set_lod(out_ids->lod()); + + int nums = static_cast(out_ids->dims()[0]); + auto *out_data = out->mutable_data( + framework::make_ddim({nums, embedding_size}), place); + for (int j = 0; j < nums; ++j) { + int id = out_ids->data()[j]; + auto row_tuple = selected_rows_idx_map[id]; + int64_t row_idx = std::get<1>(row_tuple); + const auto *x_tensor = x_tensors[std::get<0>(row_tuple)]; + + memcpy(out_data + embedding_size * j, + x_tensor->data() + row_idx * embedding_size, + sizeof(T) * embedding_size); + } + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/prefetch_op.cc b/paddle/fluid/operators/distributed_ops/prefetch_op.cc similarity index 87% rename from paddle/fluid/operators/prefetch_op.cc rename to paddle/fluid/operators/distributed_ops/prefetch_op.cc index 0519c15e13aac99802ff0f95b975712b36b44246..faa67a28d86235625a87b8bd7b87685e09c75f0b 100644 --- a/paddle/fluid/operators/prefetch_op.cc +++ b/paddle/fluid/operators/distributed_ops/prefetch_op.cc @@ -19,7 +19,7 @@ limitations under the License. 
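The MergeIds kernel builds a map from row id to (shard index, row index), then gathers embedding rows back into the order the original Ids arrived. The same logic on plain containers (shapes and types simplified; the real kernel memcpys rows of width embedding_size):

#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::vector<float>;

std::vector<Row> MergeByIds(const std::vector<std::vector<long>>& shard_rows,
                            const std::vector<std::vector<Row>>& shard_vals,
                            const std::vector<long>& ids) {
  // id -> (which shard tensor, which row inside it)
  std::unordered_map<long, std::pair<size_t, size_t>> where;
  for (size_t s = 0; s < shard_rows.size(); ++s)
    for (size_t r = 0; r < shard_rows[s].size(); ++r)
      where[shard_rows[s][r]] = {s, r};

  std::vector<Row> out;
  out.reserve(ids.size());
  for (long id : ids) {
    const auto& pos = where.at(id);  // every id must come back from some shard
    out.push_back(shard_vals[pos.first][pos.second]);
  }
  return out;
}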
*/ #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/detail/macros.h" -#include "paddle/fluid/operators/send_recv_util.h" +#include "paddle/fluid/operators/distributed_ops/send_recv_util.h" namespace paddle { namespace operators { @@ -42,17 +42,18 @@ class PrefetchOp : public framework::OperatorBase { auto& ctx = *pool.Get(place); distributed::RPCClient* rpc_client = - distributed::RPCClient::GetInstance(); + distributed::RPCClient::GetInstance( + Attr("trainer_id")); std::vector rets; for (size_t i = 0; i < ins.size(); i++) { if (NeedSend(scope, ins[i])) { - VLOG(3) << "sending " << ins[i] << " to " << epmap[i] << " to get " - << outs[i] << " back"; + VLOG(30) << "sending " << ins[i] << " to " << epmap[i] << " to get " + << outs[i] << " back"; rets.push_back(rpc_client->AsyncPrefetchVar(epmap[i], ctx, scope, ins[i], outs[i])); } else { - VLOG(3) << "don't send no-initialied variable: " << ins[i]; + VLOG(30) << "don't send no-initialied variable: " << ins[i]; } } for (size_t i = 0; i < rets.size(); i++) { @@ -69,6 +70,7 @@ class PrefetchOpMaker : public framework::OpProtoAndCheckerMaker { "(LoDTensor) result " "to be fetched from parameter server") .AsDuplicable(); + AddAttr("trainer_id", "trainer id from 0 ~ worker_num.").SetDefault(0); AddAttr>( "epmap", "(string vector, default 127.0.0.1:6164)" diff --git a/paddle/fluid/operators/recv_op.cc b/paddle/fluid/operators/distributed_ops/recv_op.cc similarity index 92% rename from paddle/fluid/operators/recv_op.cc rename to paddle/fluid/operators/distributed_ops/recv_op.cc index 4d34b8a1686efb1fc30020f0d27e9a3c3a6c0866..fbbd86502bfc61c004f88971526195f6a083d5a9 100644 --- a/paddle/fluid/operators/recv_op.cc +++ b/paddle/fluid/operators/distributed_ops/recv_op.cc @@ -42,11 +42,12 @@ class RecvOp : public framework::OperatorBase { auto& ctx = *pool.Get(place); distributed::RPCClient* rpc_client = - distributed::RPCClient::GetInstance(); + distributed::RPCClient::GetInstance( + Attr("trainer_id")); std::vector rets; for (size_t i = 0; i < outs.size(); i++) { - VLOG(3) << "getting " << outs[i] << " from " << epmap[i]; + VLOG(30) << "getting " << outs[i] << " from " << epmap[i]; rets.push_back(rpc_client->AsyncGetVar(epmap[i], ctx, scope, outs[i])); } if (sync_mode) { @@ -73,6 +74,7 @@ This operator can get variables from server side. "Server endpoints in the order of input " "variables for mapping") .SetDefault({}); + AddAttr("trainer_id", "trainer id from 0 ~ worker_num.").SetDefault(0); AddAttr("sync_mode", "(int, default 0)" "sync recv or async recv.") diff --git a/paddle/fluid/operators/distributed_ops/ref_by_trainer_id_op.cc b/paddle/fluid/operators/distributed_ops/ref_by_trainer_id_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..98b0af7688b928f21573247b327bee1d22a73f17 --- /dev/null +++ b/paddle/fluid/operators/distributed_ops/ref_by_trainer_id_op.cc @@ -0,0 +1,79 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/distributed_ops/ref_by_trainer_id_op.h" +#include + +namespace paddle { +namespace operators { + +class RefByTrainerIdOp : public framework::OperatorWithKernel { + public: + RefByTrainerIdOp(const std::string &type, + const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : OperatorWithKernel(type, inputs, outputs, attrs) {} + + void InferShape(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE(ctx->HasInputs("X"), + "Input(X) of RefByTrainerIdOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("TrainerId"), + "Input(TrainerId) of RefByTrainerIdOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Out"), + "Output(Out) of RefByTrainerIdOp should not be null."); + PADDLE_ENFORCE_EQ(ctx->GetInputDim("TrainerId").size(), 1, + "TrainerId should be a scalar."); + // Out's shape is determined at runtime. + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override { + return framework::OpKernelType( + framework::ToDataType( + ctx.MultiInput("X")[0]->type()), + ctx.GetPlace()); + } +}; + +class RefByTrainerIdOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "(Tensor) Input tensor list.").AsDuplicable(); + AddInput("TrainerId", "(Tensor) Scalar int, the trainer id runtime value."); + AddOutput("Out", "(Tensor) Return one tensor reference of X[trainer_id]"); + AddComment(R"DOC( +**RefByTrainerId operator** + +Return a reference of a tensor, using trainer_id as the index to find from the input. + +$$Out = X[TrainerId]$$ +)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; + +REGISTER_OP_WITHOUT_GRADIENT(ref_by_trainer_id, ops::RefByTrainerIdOp, + ops::RefByTrainerIdOpMaker); +REGISTER_OP_CPU_KERNEL( + ref_by_trainer_id, + ops::RefByTrainerIdKernel, + ops::RefByTrainerIdKernel, + ops::RefByTrainerIdKernel, + ops::RefByTrainerIdKernel); diff --git a/paddle/fluid/operators/distributed_ops/ref_by_trainer_id_op.cu.cc b/paddle/fluid/operators/distributed_ops/ref_by_trainer_id_op.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..168cd51355de56c2e2a83ba73d7eb14f6ba6e533 --- /dev/null +++ b/paddle/fluid/operators/distributed_ops/ref_by_trainer_id_op.cu.cc @@ -0,0 +1,26 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/fluid/operators/distributed_ops/ref_by_trainer_id_op.h" + +REGISTER_OP_CUDA_KERNEL( + ref_by_trainer_id, + paddle::operators::RefByTrainerIdKernel, + paddle::operators::RefByTrainerIdKernel, + paddle::operators::RefByTrainerIdKernel, + paddle::operators::RefByTrainerIdKernel); diff --git a/paddle/fluid/operators/distributed_ops/ref_by_trainer_id_op.h b/paddle/fluid/operators/distributed_ops/ref_by_trainer_id_op.h new file mode 100644 index 0000000000000000000000000000000000000000..34192278d84758d720e021215c14a54349ba0c62 --- /dev/null +++ b/paddle/fluid/operators/distributed_ops/ref_by_trainer_id_op.h @@ -0,0 +1,48 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { +template +class RefByTrainerIdKernel : public framework::OpKernel { + public: + virtual void Compute(const framework::ExecutionContext& context) const { + auto* out = context.Output("Out"); + auto in_list = context.MultiInput("X"); + auto* trainer_id_t = context.Input("TrainerId"); + int64_t trainer_id = 0; + auto* trainer_id_data = trainer_id_t->data(); + if (platform::is_gpu_place(context.GetPlace())) { +#ifdef PADDLE_WITH_CUDA + auto stream = context.cuda_device_context().stream(); + memory::Copy<>(platform::CPUPlace(), &trainer_id, + boost::get(context.GetPlace()), + trainer_id_data, sizeof(int64_t), stream); +#endif + } else { + trainer_id = *trainer_id_data; + } + PADDLE_ENFORCE_LT((size_t)trainer_id, in_list.size()); + out->mutable_data(context.GetPlace()); + out->ShareDataWith(*(in_list[trainer_id])); + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/send_barrier_op.cc b/paddle/fluid/operators/distributed_ops/send_barrier_op.cc similarity index 91% rename from paddle/fluid/operators/send_barrier_op.cc rename to paddle/fluid/operators/distributed_ops/send_barrier_op.cc index 40404295266899c6ac2f7b1e08fdf7db40958794..02ca107ca35348df1827805e40730acd39f39e87 100644 --- a/paddle/fluid/operators/send_barrier_op.cc +++ b/paddle/fluid/operators/distributed_ops/send_barrier_op.cc @@ -39,14 +39,15 @@ class SendBarrierOp : public framework::OperatorBase { std::vector eps = Attr>("endpoints"); distributed::RPCClient* rpc_client = - distributed::RPCClient::GetInstance(); + distributed::RPCClient::GetInstance( + Attr("trainer_id")); - VLOG(3) << "SendBarrierOp sync"; + VLOG(30) << "SendBarrierOp sync"; // need to wait before sending send_barrier message PADDLE_ENFORCE(rpc_client->Wait(), "internal error in RPCClient"); for (auto& ep : eps) { - VLOG(3) << "send barrier, ep: " << ep; + VLOG(30) << "send barrier, ep: " << ep; rpc_client->AsyncSendBatchBarrier(ep); } PADDLE_ENFORCE(rpc_client->Wait(), "internal error in RPCClient"); @@ -67,6 +68,7 @@ This operator will send a send barrier signal to list_and_serv op, so that the Parameter Server would knew all variables have been sent. 
)DOC"); + AddAttr("trainer_id", "trainer id from 0 ~ worker_num.").SetDefault(0); AddAttr>("endpoints", "(string vector, default 127.0.0.1:6164)" "Server endpoints to send variables to.") diff --git a/paddle/fluid/operators/send_op.cc b/paddle/fluid/operators/distributed_ops/send_op.cc similarity index 89% rename from paddle/fluid/operators/send_op.cc rename to paddle/fluid/operators/distributed_ops/send_op.cc index 48322ac7fd54a2e4cc3405a2c4dcddfc273f5a66..be53a1a32b59d7c0235382f5db18d2203b4a035a 100644 --- a/paddle/fluid/operators/send_op.cc +++ b/paddle/fluid/operators/distributed_ops/send_op.cc @@ -20,7 +20,7 @@ limitations under the License. */ #include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/detail/macros.h" -#include "paddle/fluid/operators/send_recv_util.h" +#include "paddle/fluid/operators/distributed_ops/send_recv_util.h" #include "paddle/fluid/platform/profiler.h" namespace paddle { @@ -44,15 +44,16 @@ class SendOp : public framework::OperatorBase { auto& ctx = *pool.Get(place); distributed::RPCClient* rpc_client = - distributed::RPCClient::GetInstance(); + distributed::RPCClient::GetInstance( + Attr("trainer_id")); std::vector rets; for (size_t i = 0; i < ins.size(); i++) { if (NeedSend(scope, ins[i])) { - VLOG(3) << "sending " << ins[i] << " to " << epmap[i]; + VLOG(30) << "sending " << ins[i] << " to " << epmap[i]; rets.push_back(rpc_client->AsyncSendVar(epmap[i], ctx, scope, ins[i])); } else { - VLOG(3) << "don't send no-initialied variable: " << ins[i]; + VLOG(30) << "don't send no-initialied variable: " << ins[i]; } } if (sync_send) { @@ -79,6 +80,7 @@ This operator will send variables to listen_and_serve op at the parameter server "(int, default 0)" "sync send or async send.") .SetDefault(0); + AddAttr("trainer_id", "trainer id from 0 ~ worker_num.").SetDefault(0); AddAttr>("epmap", "(string vector, default 127.0.0.1:6164)" "Server endpoints in the order of input " diff --git a/paddle/fluid/operators/send_recv_op_test.cc b/paddle/fluid/operators/distributed_ops/send_recv_op_test.cc similarity index 98% rename from paddle/fluid/operators/send_recv_op_test.cc rename to paddle/fluid/operators/distributed_ops/send_recv_op_test.cc index aee6180add5708d31f7ce927b37c4524a291fe3c..bf798a8251fcb4148db486f26d32525b59299c81 100644 --- a/paddle/fluid/operators/send_recv_op_test.cc +++ b/paddle/fluid/operators/distributed_ops/send_recv_op_test.cc @@ -20,7 +20,7 @@ limitations under the License. 
*/ #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/operator.h" #include "paddle/fluid/framework/program_desc.h" -#include "paddle/fluid/operators/listen_and_serv_op.h" +#include "paddle/fluid/operators/distributed_ops/listen_and_serv_op.h" #include "paddle/fluid/operators/math/math_function.h" #include "paddle/fluid/operators/math/selected_rows_functor.h" #include "paddle/fluid/string/printf.h" @@ -120,7 +120,7 @@ void AddOp(const std::string &type, const f::VariableNameMap &inputs, void StartServerNet(bool is_sparse, std::atomic *initialized) { f::Scope scope; p::CPUPlace place; - VLOG(4) << "before init tensor"; + VLOG(40) << "before init tensor"; if (is_sparse) { InitSelectedRowsInScope(place, &scope); } else { @@ -146,7 +146,7 @@ void StartServerNet(bool is_sparse, std::atomic *initialized) { attrs.insert({"PrefetchBlock", prefetch_block}); attrs.insert({"grad_to_block_id", std::vector({""})}); attrs.insert({"sync_mode", true}); - VLOG(4) << "before init op"; + VLOG(40) << "before init op"; listen_and_serv_op = f::OpRegistry::CreateOp("listen_and_serv", {{"X", {"x1"}}}, {}, attrs); *initialized = true; diff --git a/paddle/fluid/operators/send_recv_util.h b/paddle/fluid/operators/distributed_ops/send_recv_util.h similarity index 100% rename from paddle/fluid/operators/send_recv_util.h rename to paddle/fluid/operators/distributed_ops/send_recv_util.h diff --git a/paddle/fluid/operators/split_byref_op.cc b/paddle/fluid/operators/distributed_ops/split_byref_op.cc similarity index 98% rename from paddle/fluid/operators/split_byref_op.cc rename to paddle/fluid/operators/distributed_ops/split_byref_op.cc index bc998e1abbd7131a7497288cc9d66315a6fedc85..d65e7ffe5a492fe5df038bb6bd469e09de6f95ca 100644 --- a/paddle/fluid/operators/split_byref_op.cc +++ b/paddle/fluid/operators/distributed_ops/split_byref_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/split_byref_op.h" +#include "paddle/fluid/operators/distributed_ops/split_byref_op.h" #include "paddle/fluid/operators/split_op.h" namespace paddle { diff --git a/paddle/fluid/operators/split_byref_op.cu.cc b/paddle/fluid/operators/distributed_ops/split_byref_op.cu.cc similarity index 91% rename from paddle/fluid/operators/split_byref_op.cu.cc rename to paddle/fluid/operators/distributed_ops/split_byref_op.cu.cc index 5ee6186f3541b7dcb845ce0c6d28081685925da0..056659c3ea61f6233a6dda56ca1e272e72770d4a 100644 --- a/paddle/fluid/operators/split_byref_op.cu.cc +++ b/paddle/fluid/operators/distributed_ops/split_byref_op.cu.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/split_byref_op.h" +#include "paddle/fluid/operators/distributed_ops/split_byref_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( split_byref, diff --git a/paddle/fluid/operators/split_byref_op.h b/paddle/fluid/operators/distributed_ops/split_byref_op.h similarity index 94% rename from paddle/fluid/operators/split_byref_op.h rename to paddle/fluid/operators/distributed_ops/split_byref_op.h index fedd7218dd6cc9481e94a92a3820cafbe4157bd0..3b7ae6fc91e0a9e08406e38b9a557cab442c2560 100644 --- a/paddle/fluid/operators/split_byref_op.h +++ b/paddle/fluid/operators/distributed_ops/split_byref_op.h @@ -32,7 +32,7 @@ class SplitByrefOpKernel : public framework::OpKernel { for (size_t i = 0; i < outs.size(); ++i) { // NOTE: no need to call mutable_data here to allocate memory. auto* out = outs[i]; - VLOG(3) << "spliting by ref: " << row_offset << " " << out->dims()[0]; + VLOG(30) << "spliting by ref: " << row_offset << " " << out->dims()[0]; *out = in->Slice(row_offset, row_offset + out->dims()[0]); row_offset += out->dims()[0]; } diff --git a/paddle/fluid/operators/split_ids_op.cc b/paddle/fluid/operators/distributed_ops/split_ids_op.cc similarity index 60% rename from paddle/fluid/operators/split_ids_op.cc rename to paddle/fluid/operators/distributed_ops/split_ids_op.cc index c867c46873ae7ddbdbda280351e4ab28235bcc08..f61d387fbef636298c412c227bf7a56a04f69c63 100644 --- a/paddle/fluid/operators/split_ids_op.cc +++ b/paddle/fluid/operators/distributed_ops/split_ids_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/split_ids_op.h" +#include "paddle/fluid/operators/distributed_ops/split_ids_op.h" namespace paddle { namespace operators { @@ -20,20 +20,27 @@ namespace operators { class SplitIdsOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { - AddInput("Ids", "(LoDTensor) the input ids with shape{batch_num, 1}"); - AddOutput("Out", "(LoDTensor) The outputs of the input Ids.") + AddInput("Ids", "(LoDTensor) the input ids with shape{batch_num, 1}") + .AsDuplicable(); + + AddOutput("Out", "(LoDTensors) The outputs of the input Ids.") .AsDuplicable(); AddComment(R"DOC( Split a LoDTensor of Ids into multi LoDTensors, the number is pserver's number Example: Input: - X = [1,2,3,4,5,6] + X = [[1,2,3,4,5,6],[2,3]] Out(3 output): - out0 = [3, 6] - out1 = [1, 4] - out2 = [2, 5] + if compress is True: + out0 = [3, 3, 6] + out1 = [1, 4] + out2 = [2, 2, 5] + else: + out0 = [3, 6] + out1 = [1, 4] + out2 = [2, 5] )DOC"); } }; @@ -43,16 +50,23 @@ class SplitIdsOp : public framework::OperatorWithKernel { using framework::OperatorWithKernel::OperatorWithKernel; void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Ids"), "SplitIdsOp must has input Ids."); + PADDLE_ENFORCE(ctx->HasInputs("Ids"), "SplitIdsOp must has input Ids."); PADDLE_ENFORCE(ctx->HasOutputs("Out"), "SplitIdsOp must has output Out."); auto ids_var_type = ctx->GetInputsVarType("Ids").front(); - auto ids_dims = ctx->GetInputDim("Ids"); + auto ids_dims = ctx->GetInputsDim("Ids"); if (ids_var_type == framework::proto::VarType::LOD_TENSOR) { - PADDLE_ENFORCE_EQ(ids_dims.size(), 2); - PADDLE_ENFORCE_EQ(ids_dims[1], 1); + PADDLE_ENFORCE_EQ(ids_dims[0].size(), 2); } } + + protected: + framework::OpKernelType GetExpectedKernelType( + const 
+    return framework::OpKernelType(
+        framework::GetDataTypeOfVar(ctx.MultiInputVar("Ids").front()),
+        ctx.GetPlace());
+  }
 };
 class SplitIdsOpInferVarType : public framework::VarTypeInference {
@@ -66,12 +80,28 @@ class SplitIdsOpInferVarType : public framework::VarTypeInference {
  }
 };
+class SplitIdsOpGradMaker : public framework::SingleGradOpDescMaker {
+ public:
+  using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
+
+ protected:
+  std::unique_ptr<framework::OpDesc> Apply() const override {
+    auto grad = new framework::OpDesc();
+    grad->SetType("concat");
+    grad->SetInput("X", OutputGrad("Out"));
+    grad->SetOutput("Out", InputGrad("Ids"));
+    grad->SetAttr("axis", 0);
+    return std::unique_ptr<framework::OpDesc>(grad);
+  }
+};
+
 }  // namespace operators
 }  // namespace paddle
 namespace ops = paddle::operators;
 REGISTER_OPERATOR(split_ids, ops::SplitIdsOp, ops::SplitIdsOpMaker,
-                  ops::SplitIdsOpInferVarType);
+                  ops::SplitIdsOpGradMaker, ops::SplitIdsOpInferVarType);
+
 REGISTER_OP_CPU_KERNEL(
     split_ids,
     ops::SplitIdsOpKernel<paddle::platform::CPUDeviceContext, int64_t>,
     ops::SplitIdsOpKernel<paddle::platform::CPUDeviceContext, float>);
diff --git a/paddle/fluid/operators/split_ids_op.h b/paddle/fluid/operators/distributed_ops/split_ids_op.h
similarity index 72%
rename from paddle/fluid/operators/split_ids_op.h
rename to paddle/fluid/operators/distributed_ops/split_ids_op.h
index c4af5a65fc5f81c1af7c1fdcca637ca37c940637..f5d6d85d7d75507f82de212812ecee0a650d3aad 100644
--- a/paddle/fluid/operators/split_ids_op.h
+++ b/paddle/fluid/operators/distributed_ops/split_ids_op.h
@@ -14,6 +14,8 @@ limitations under the License. */
 #pragma once
+#include <memory>
+#include <set>
 #include <unordered_map>
 #include <vector>
 #include "paddle/fluid/framework/op_registry.h"
@@ -31,19 +33,39 @@ class SplitIdsOpKernel : public framework::OpKernel<T> {
      PADDLE_THROW("SplitIds do not support GPU kernel");
    }
-    const auto *ids_var = ctx.InputVar("Ids");
+    const auto ids_vars = ctx.MultiInputVar("Ids");
+
+    PADDLE_ENFORCE_GT(ids_vars.size(), 0,
+                      "The number of Ids should be greater than 0");
+    auto *ids_var = ids_vars[0];
+
    if (ids_var->IsType<framework::LoDTensor>()) {
-      const auto &ids_dims = ctx.Input<framework::LoDTensor>("Ids")->dims();
-      const T *ids = ctx.Input<framework::LoDTensor>("Ids")->data<T>();
+      int batch_size = 0;
+      const auto ids_tensors = ctx.MultiInput<framework::LoDTensor>("Ids");
+      for (size_t i = 0; i < ids_tensors.size(); ++i) {
+        batch_size += ids_tensors[i]->dims()[0];
+      }
+      VLOG(40) << "Total batch size is: " << batch_size;
+
+      std::vector<T> all_ids(batch_size);
+      int offset = 0;
+      for (size_t i = 0; i < ids_tensors.size(); ++i) {
+        const auto *ids = ids_tensors[i];
+        std::memcpy(all_ids.data() + offset, ids->data<T>(),
+                    ids->numel() * sizeof(T));
+        offset += ids->numel();
+      }
+
+      std::set<T> st(all_ids.begin(), all_ids.end());
+      all_ids.assign(st.begin(), st.end());
+
      auto outs = ctx.MultiOutput<framework::LoDTensor>("Out");
      const size_t shard_num = outs.size();
-
      std::vector<std::vector<T>> out_ids;
      out_ids.resize(outs.size());
      // split id by their shard_num.
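The rewritten LoDTensor branch above concatenates every duplicable "Ids" input into one buffer, de-duplicates it through a std::set, and the loop that follows routes each surviving id to shard id % shard_num. A standalone sketch of that flow with plain vectors in place of LoDTensors (hypothetical values):

#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

int main() {
  // Two duplicable "Ids" inputs, mirroring X = [[1,2,3,4,5,6],[2,3]].
  std::vector<std::vector<int64_t>> inputs = {{1, 2, 3, 4, 5, 6}, {2, 3}};
  const size_t shard_num = 3;  // one shard per pserver / "Out" output

  // 1) Concatenate all inputs (the kernel does this with std::memcpy).
  std::vector<int64_t> all_ids;
  for (const auto& in : inputs)
    all_ids.insert(all_ids.end(), in.begin(), in.end());

  // 2) De-duplicate through an ordered set, as the kernel does.
  std::set<int64_t> st(all_ids.begin(), all_ids.end());
  all_ids.assign(st.begin(), st.end());

  // 3) Route each id to shard id % shard_num.
  std::vector<std::vector<int64_t>> out_ids(shard_num);
  for (int64_t id : all_ids)
    out_ids[static_cast<size_t>(id) % shard_num].push_back(id);

  // Prints out0: 3 6 / out1: 1 4 / out2: 2 5, matching the op comment.
  for (size_t s = 0; s < shard_num; ++s) {
    std::cout << "out" << s << ":";
    for (int64_t id : out_ids[s]) std::cout << " " << id;
    std::cout << "\n";
  }
  return 0;
}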
-    for (int i = 0; i < ids_dims[0]; ++i) {
-      T id = ids[i];
+    for (size_t i = 0; i < all_ids.size(); ++i) {
+      T id = all_ids[i];
      size_t shard_id = static_cast<size_t>(id) % shard_num;
      out_ids[shard_id].push_back(id);
    }
@@ -64,7 +86,7 @@ class SplitIdsOpKernel : public framework::OpKernel<T> {
      PADDLE_ENFORCE_EQ(ids_dims[0],
                        static_cast<int64_t>(ids_selected_rows->rows().size()),
                        "");
-      const T *ids = ids_selected_rows->value().data<T>();
+      const T *ids_data = ids_selected_rows->value().data<T>();
      const auto &ids_rows = ids_selected_rows->rows();
      auto outs = ctx.MultiOutput<framework::SelectedRows>("Out");
      const size_t shard_num = outs.size();
@@ -87,10 +109,14 @@ class SplitIdsOpKernel : public framework::OpKernel<T> {
        T *output = out->mutable_value()->mutable_data<T>(ddim, place);
        for (int64_t i = 0; i < ddim[0]; ++i) {
          memcpy(output + i * row_width,
-                 ids + id_to_index[out->rows()[i]] * row_width,
+                 ids_data + id_to_index[out->rows()[i]] * row_width,
                 row_width * sizeof(T));
        }
      }
+    } else {
+      PADDLE_THROW(
+          "%s should be LoDTensor or SelectedRows, but the received type is %s",
+          ctx.Inputs("Ids")[0], ids_var->Type().name());
    }
  }
};
diff --git a/paddle/fluid/operators/test_send_nccl_id.cc b/paddle/fluid/operators/distributed_ops/test_send_nccl_id.cc
similarity index 94%
rename from paddle/fluid/operators/test_send_nccl_id.cc
rename to paddle/fluid/operators/distributed_ops/test_send_nccl_id.cc
index e2b7b6b8e447381229e4ad594b7974bc0aa159d5..a73cb08eca272b044501d48e7b8c5b7dc8553a50 100644
--- a/paddle/fluid/operators/test_send_nccl_id.cc
+++ b/paddle/fluid/operators/distributed_ops/test_send_nccl_id.cc
@@ -22,14 +22,14 @@ limitations under the License. */
 #include "paddle/fluid/framework/program_desc.h"
 #include "paddle/fluid/operators/detail/macros.h"
 #include "paddle/fluid/operators/distributed/request_handler_impl.h"
-#include "paddle/fluid/operators/listen_and_serv_op.h"
+#include "paddle/fluid/operators/distributed_ops/listen_and_serv_op.h"
 #include "paddle/fluid/operators/math/math_function.h"
 #include "paddle/fluid/operators/math/selected_rows_functor.h"
 #include "paddle/fluid/platform/nccl_helper.h"
 #include "paddle/fluid/string/printf.h"
 #ifdef PADDLE_WITH_GRPC
-#include "paddle/fluid/operators/send_recv_util.h"
+#include "paddle/fluid/operators/distributed_ops/send_recv_util.h"
 #endif
 USE_NO_KERNEL_OP(listen_and_serv);
@@ -92,7 +92,7 @@ TEST(SendNcclId, RPCServer) {
  std::string ep = string::Sprintf("127.0.0.1:%d", port);
  distributed::RPCClient* client =
-      distributed::RPCClient::GetInstance<RPCCLIENT_T>();
+      distributed::RPCClient::GetInstance<RPCCLIENT_T>(0);
  LOG(INFO) << "connect to server " << ep;
  client->AsyncSendVar(ep, dev_ctx, scope, NCCL_ID_VARNAME);
diff --git a/paddle/fluid/operators/dropout_op.cc b/paddle/fluid/operators/dropout_op.cc
index 07322e720f26213ea777be3cd22f2fead28507f0..dd3474dd2529b5e2cb2cd32aec41fb6357b5d537 100644
--- a/paddle/fluid/operators/dropout_op.cc
+++ b/paddle/fluid/operators/dropout_op.cc
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 #include "paddle/fluid/operators/dropout_op.h"
+#include <string>
 namespace paddle {
 namespace operators {
@@ -48,7 +49,10 @@ class DropoutOpMaker : public framework::OpProtoAndCheckerMaker {
      PADDLE_ENFORCE(drop_p >= 0.0f && drop_p <= 1.0f,
                     "'dropout_prob' must be between 0.0 and 1.0.");
    });
-    AddAttr<bool>("is_test", "True if in test phase.").SetDefault(false);
+    AddAttr<bool>("is_test",
+                  "(bool, default false) Set to true for inference only, false "
+                  "for training. Some layers may run faster when this is true.")
+        .SetDefault(false);
    AddAttr<bool>("fix_seed",
                  "A flag indicating whether to use a fixed seed to generate "
                  "random mask. NOTE: DO NOT set this flag to true in "
@@ -57,6 +61,29 @@
                  "will be dropped.")
        .SetDefault(false);
    AddAttr<int>("seed", "Dropout random seed.").SetDefault(0);
+    AddAttr<std::string>(
+        "dropout_implementation",
+        "[\"downgrade_in_infer\"|\"upscale_in_train\"] "
+        "There are two ways to implement dropout "
+        "(the mask below is a tensor with the same shape as the input; "
+        "each mask value is 0 or 1, and the ratio of 0s is dropout_prob): "
+        "1. downgrade_in_infer(default), downgrade the outcome at inference "
+        "time: "
+        "   train: out = input * mask "
+        "   inference: out = input * (1.0 - dropout_prob) "
+        "2. upscale_in_train, upscale the outcome at training time and do "
+        "nothing at inference: "
+        "   train: out = input * mask / (1.0 - dropout_prob) "
+        "   inference: out = input "
+        "With upscale_in_train the dropout op can be removed from the "
+        "inference program, making the program more efficient.")
+        .SetDefault("downgrade_in_infer")
+        .AddCustomChecker([](const std::string& type) {
+          PADDLE_ENFORCE(
+              type == "downgrade_in_infer" || type == "upscale_in_train",
+              "dropout_implementation can only be downgrade_in_infer or "
+              "upscale_in_train");
+        });
    AddComment(R"DOC(
Dropout Operator.
@@ -104,7 +131,9 @@
 REGISTER_OPERATOR(dropout, ops::DropoutOp, ops::DropoutOpMaker,
                   paddle::framework::DefaultGradOpDescMaker<true>);
 REGISTER_OPERATOR(dropout_grad, ops::DropoutOpGrad);
 REGISTER_OP_CPU_KERNEL(
-    dropout, ops::CPUDropoutKernel<paddle::platform::CPUDeviceContext, float>);
+    dropout, ops::CPUDropoutKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::CPUDropoutKernel<paddle::platform::CPUDeviceContext, double>);
 REGISTER_OP_CPU_KERNEL(
     dropout_grad,
-    ops::DropoutGradKernel<paddle::platform::CPUDeviceContext, float>);
+    ops::DropoutGradKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::DropoutGradKernel<paddle::platform::CPUDeviceContext, double>);
diff --git a/paddle/fluid/operators/dropout_op.cu b/paddle/fluid/operators/dropout_op.cu
index 1dd66e0280c46c0624ff70e822cb6fa6f06b7aa9..e011f47e086183a4ef3a3373c17acd6c21b6cf7e 100644
--- a/paddle/fluid/operators/dropout_op.cu
+++ b/paddle/fluid/operators/dropout_op.cu
@@ -17,6 +17,7 @@ limitations under the License.
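The new dropout_implementation attribute only moves the 1/(1 - dropout_prob) rescaling between training and inference; the expected activation stays the same either way, which is why upscale_in_train lets the op be deleted from an inference program. A small sketch of the two schemes on one element (illustrative numbers, not Paddle API):

#include <iostream>

int main() {
  const float x = 2.0f;
  const float dropout_prob = 0.25f;
  const bool kept = true;  // one Bernoulli draw at training time

  // downgrade_in_infer (default): plain 0/1 mask at train time,
  // downscale the whole activation at inference time.
  float train_a = kept ? x * 1.0f : 0.0f;
  float infer_a = x * (1.0f - dropout_prob);

  // upscale_in_train: scale kept values while training, identity at
  // inference, so the op can be removed from the inference program.
  float train_b = kept ? x / (1.0f - dropout_prob) : 0.0f;
  float infer_b = x;

  // Both schemes keep E[train output] equal to the inference output.
  std::cout << train_a << " " << infer_a << "\n";  // 2 1.5
  std::cout << train_b << " " << infer_b << "\n";  // 2.66667 2
  return 0;
}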
*/ #include #include #include +#include #include "paddle/fluid/operators/dropout_op.h" #include "paddle/fluid/platform/float16.h" @@ -26,7 +27,8 @@ namespace operators { template __global__ void RandomGenerator(const size_t n, const int seed, const float dropout_prob, const T* src, - T* mask_data, T* dst) { + T* mask_data, T* dst, + bool is_upscale_in_train) { thrust::minstd_rand rng; rng.seed(seed); thrust::uniform_real_distribution dist(0, 1); @@ -47,7 +49,11 @@ __global__ void RandomGenerator(const size_t n, const int seed, if (dist(rng) < dropout_prob) { mask = static_cast(0); } else { - mask = static_cast(1); + if (is_upscale_in_train) { + mask = static_cast(1.0f / (1.0f - dropout_prob)); + } else { + mask = static_cast(1); + } } dest = s * mask; mask_data[idx] = mask; @@ -67,6 +73,8 @@ class GPUDropoutKernel : public framework::OpKernel { y->mutable_data(context.GetPlace()); float dropout_prob = context.Attr("dropout_prob"); + auto dropout_implementation = + context.Attr("dropout_implementation"); auto& place = *context.template device_context().eigen_device(); if (!context.Attr("is_test")) { auto* mask = context.Output("Mask"); @@ -83,11 +91,16 @@ class GPUDropoutKernel : public framework::OpKernel { int grid = (x->numel() + threads - 1) / threads; RandomGenerator< T><<>>( - size, seed, dropout_prob, x_data, mask_data, y_data); + size, seed, dropout_prob, x_data, mask_data, y_data, + (dropout_implementation == "upscale_in_train")); } else { auto X = EigenMatrix::Reshape(*x, 1); auto Y = EigenMatrix::Reshape(*y, 1); - Y.device(place) = X * static_cast(1.0f - dropout_prob); + if (dropout_implementation == "upscale_in_train") { + Y.device(place) = X; + } else { + Y.device(place) = X * static_cast(1.0f - dropout_prob); + } } } }; @@ -99,6 +112,8 @@ namespace ops = paddle::operators; namespace plat = paddle::platform; REGISTER_OP_CUDA_KERNEL( dropout, ops::GPUDropoutKernel, - ops::GPUDropoutKernel); -REGISTER_OP_CUDA_KERNEL(dropout_grad, - ops::DropoutGradKernel); + ops::GPUDropoutKernel, + ops::GPUDropoutKernel); +REGISTER_OP_CUDA_KERNEL( + dropout_grad, ops::DropoutGradKernel, + ops::DropoutGradKernel); diff --git a/paddle/fluid/operators/dropout_op.h b/paddle/fluid/operators/dropout_op.h index 0628b4b826d2730a8e3fb4842e4ae550b8c00569..6c629b7b6d255828023ed25680675ca104a33e12 100644 --- a/paddle/fluid/operators/dropout_op.h +++ b/paddle/fluid/operators/dropout_op.h @@ -14,6 +14,7 @@ limitations under the License. 
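Note how RandomGenerator above stores the scale inside mask_data: under upscale_in_train a kept element records 1/(1 - dropout_prob) instead of 1. Because dropout's backward pass computes dX = dOut * Mask, the gradient kernels need no change for the new mode. A sketch of that invariant (standalone, hypothetical values):

#include <iostream>

int main() {
  const float dropout_prob = 0.2f;
  const bool upscale_in_train = true;

  // Forward: the mask entry for a kept element already carries the scale.
  float mask = upscale_in_train ? 1.0f / (1.0f - dropout_prob) : 1.0f;
  float x = 4.0f;
  float y = x * mask;  // 5 when upscaling, 4 otherwise

  // Backward: dX = dOut * Mask, identical for both implementations.
  float dout = 1.0f;
  float dx = dout * mask;

  std::cout << y << " " << dx << "\n";  // 5 1.25
  return 0;
}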
*/ #pragma once #include +#include #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" @@ -36,6 +37,8 @@ class CPUDropoutKernel : public framework::OpKernel { auto* y_data = y->mutable_data(context.GetPlace()); float dropout_prob = context.Attr("dropout_prob"); + auto dropout_implementation = + context.Attr("dropout_implementation"); if (!context.Attr("is_test")) { auto* mask = context.Output("Mask"); auto* mask_data = mask->mutable_data(context.GetPlace()); @@ -49,14 +52,20 @@ class CPUDropoutKernel : public framework::OpKernel { engine.seed(seed); std::uniform_real_distribution dist(0, 1); + size_t size = framework::product(mask->dims()); for (size_t i = 0; i < size; ++i) { if (dist(engine) < dropout_prob) { mask_data[i] = 0; y_data[i] = 0; } else { - mask_data[i] = 1; - y_data[i] = x_data[i]; + if (dropout_implementation == "upscale_in_train") { + mask_data[i] = 1.0f / static_cast(1.0f - dropout_prob); + y_data[i] = x_data[i] / static_cast(1.0f - dropout_prob); + } else { + mask_data[i] = 1; + y_data[i] = x_data[i]; + } } } } else { @@ -64,7 +73,11 @@ class CPUDropoutKernel : public framework::OpKernel { auto Y = EigenMatrix::Reshape(*y, 1); auto& place = *context.template device_context().eigen_device(); - Y.device(place) = X * (1.0f - dropout_prob); + if (dropout_implementation == "upscale_in_train") { + Y.device(place) = X; + } else { + Y.device(place) = X * static_cast(1.0f - dropout_prob); + } } } }; diff --git a/paddle/fluid/operators/elementwise/CMakeLists.txt b/paddle/fluid/operators/elementwise/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..5d468316e8eacb73c4a4ce81c784880bb5e46c2d --- /dev/null +++ b/paddle/fluid/operators/elementwise/CMakeLists.txt @@ -0,0 +1,2 @@ +include(operators) +register_operators() diff --git a/paddle/fluid/operators/elementwise_add_mkldnn_op.cc b/paddle/fluid/operators/elementwise/elementwise_add_mkldnn_op.cc similarity index 97% rename from paddle/fluid/operators/elementwise_add_mkldnn_op.cc rename to paddle/fluid/operators/elementwise/elementwise_add_mkldnn_op.cc index 9ad82aec8182d6ba06b67391d71317a3d0df1833..6a6741d8fc54d22addca91b75dfabf5950c1a35a 100644 --- a/paddle/fluid/operators/elementwise_add_mkldnn_op.cc +++ b/paddle/fluid/operators/elementwise/elementwise_add_mkldnn_op.cc @@ -13,8 +13,8 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/memory/memcpy.h" -#include "paddle/fluid/operators/elementwise_add_op.h" -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/operators/elementwise/elementwise_add_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" #include "paddle/fluid/platform/mkldnn_helper.h" diff --git a/paddle/fluid/operators/elementwise_add_op.cc b/paddle/fluid/operators/elementwise/elementwise_add_op.cc similarity index 92% rename from paddle/fluid/operators/elementwise_add_op.cc rename to paddle/fluid/operators/elementwise/elementwise_add_op.cc index 3c97ac995c649ecd0d196a584240e1e7ac04f08e..7e789cd8d9143164c2346b067855eb904e00075f 100644 --- a/paddle/fluid/operators/elementwise_add_op.cc +++ b/paddle/fluid/operators/elementwise/elementwise_add_op.cc @@ -12,8 +12,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/elementwise_add_op.h" -#include "paddle/fluid/operators/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_add_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" namespace ops = paddle::operators; REGISTER_ELEMWISE_GRAD_MAKER(elementwise_add, Add); REGISTER_ELEMWISE_EXPLICIT_OP(elementwise_add, "Add", "Out = X + Y", "Out", diff --git a/paddle/fluid/operators/elementwise_add_op.cu b/paddle/fluid/operators/elementwise/elementwise_add_op.cu similarity index 90% rename from paddle/fluid/operators/elementwise_add_op.cu rename to paddle/fluid/operators/elementwise/elementwise_add_op.cu index dfff518f170b56d180b6883c363effb8dbd677b6..2fb7eeb4b9e3119a6eea3e69a2a6002a80f6c0f3 100644 --- a/paddle/fluid/operators/elementwise_add_op.cu +++ b/paddle/fluid/operators/elementwise/elementwise_add_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/elementwise_add_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_add_op.h" #include "paddle/fluid/platform/float16.h" namespace ops = paddle::operators; @@ -30,4 +30,5 @@ REGISTER_OP_CUDA_KERNEL( ops::ElementwiseAddGradKernel, ops::ElementwiseAddGradKernel, ops::ElementwiseAddGradKernel, - ops::ElementwiseAddGradKernel); + ops::ElementwiseAddGradKernel, + ops::ElementwiseAddGradKernel); diff --git a/paddle/fluid/operators/elementwise_add_op.h b/paddle/fluid/operators/elementwise/elementwise_add_op.h similarity index 65% rename from paddle/fluid/operators/elementwise_add_op.h rename to paddle/fluid/operators/elementwise/elementwise_add_op.h index c60cb1f92e99329d52f6ed39dccde406a5f83563..69f640ab6649df673f07ac0cef81bf80d16eb98d 100644 --- a/paddle/fluid/operators/elementwise_add_op.h +++ b/paddle/fluid/operators/elementwise/elementwise_add_op.h @@ -15,8 +15,8 @@ limitations under the License. 
*/ #pragma once #include "paddle/fluid/framework/eigen.h" -#include "paddle/fluid/operators/elementwise_op.h" -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" #include "paddle/fluid/operators/math/blas.h" namespace paddle { @@ -28,9 +28,9 @@ struct AddFunctor { }; template -void default_elementwise_add(const framework::ExecutionContext& ctx, - const framework::Tensor* x, - const framework::Tensor* y, framework::Tensor* z) { +void default_elementwise_add(const framework::ExecutionContext &ctx, + const framework::Tensor *x, + const framework::Tensor *y, framework::Tensor *z) { int axis = ctx.Attr("axis"); ElementwiseComputeEx, DeviceContext, T>(ctx, x, y, axis, AddFunctor(), z); @@ -40,9 +40,9 @@ template typename std::enable_if< std::is_floating_point::value && std::is_same::value>::type -elementwise_add(const framework::ExecutionContext& ctx, - const framework::Tensor* x, const framework::Tensor* y, - framework::Tensor* z) { +elementwise_add(const framework::ExecutionContext &ctx, + const framework::Tensor *x, const framework::Tensor *y, + framework::Tensor *z) { auto eigen_x = framework::EigenVector::Flatten(*x); auto eigen_y = framework::EigenVector::Flatten(*y); auto eigen_z = framework::EigenVector::Flatten(*z); @@ -55,21 +55,20 @@ template typename std::enable_if< !std::is_floating_point::value || !std::is_same::value>::type -elementwise_add(const framework::ExecutionContext& ctx, - const framework::Tensor* x, const framework::Tensor* y, - framework::Tensor* z) { +elementwise_add(const framework::ExecutionContext &ctx, + const framework::Tensor *x, const framework::Tensor *y, + framework::Tensor *z) { default_elementwise_add(ctx, x, y, z); } template class ElementwiseAddKernel : public framework::OpKernel { public: - void Compute(const framework::ExecutionContext& ctx) const override { - using Tensor = framework::Tensor; + void Compute(const framework::ExecutionContext &ctx) const override { + auto *x = ctx.Input("X"); + auto *y = ctx.Input("Y"); + auto *z = ctx.Output("Out"); - const auto x = ctx.Input("X"); - const auto y = ctx.Input("Y"); - auto z = ctx.Output("Out"); z->mutable_data(ctx.GetPlace()); auto dims_equal = x->dims() == y->dims(); @@ -87,13 +86,13 @@ struct IdentityGrad { }; template -void default_elementwise_add_grad(const framework::ExecutionContext& ctx, - const framework::Tensor* x, - const framework::Tensor* y, - const framework::Tensor* out, - const framework::Tensor* dout, - framework::Tensor* dx, - framework::Tensor* dy) { +void default_elementwise_add_grad(const framework::ExecutionContext &ctx, + const framework::Tensor *x, + const framework::Tensor *y, + const framework::Tensor *out, + const framework::Tensor *dout, + framework::Tensor *dx, + framework::Tensor *dy) { int axis = ctx.Attr("axis"); ElemwiseExplicitGradCompute, @@ -106,11 +105,11 @@ template typename std::enable_if< std::is_floating_point::value && std::is_same::value>::type -elementwise_add_grad(const framework::ExecutionContext& ctx, - const framework::Tensor* x, const framework::Tensor* y, - const framework::Tensor* out, - const framework::Tensor* dout, framework::Tensor* dx, - framework::Tensor* dy) { +elementwise_add_grad(const framework::ExecutionContext &ctx, + const framework::Tensor *x, const framework::Tensor *y, + const framework::Tensor *out, + const framework::Tensor *dout, framework::Tensor *dx, + framework::Tensor *dy) { auto blas = 
math::GetBlas(ctx); if (dx) { @@ -128,27 +127,27 @@ template typename std::enable_if< !std::is_floating_point::value || !std::is_same::value>::type -elementwise_add_grad(const framework::ExecutionContext& ctx, - const framework::Tensor* x, const framework::Tensor* y, - const framework::Tensor* out, - const framework::Tensor* dout, framework::Tensor* dx, - framework::Tensor* dy) { +elementwise_add_grad(const framework::ExecutionContext &ctx, + const framework::Tensor *x, const framework::Tensor *y, + const framework::Tensor *out, + const framework::Tensor *dout, framework::Tensor *dx, + framework::Tensor *dy) { default_elementwise_add_grad(ctx, x, y, out, dout, dx, dy); } template class ElementwiseAddGradKernel : public ElemwiseGradKernel { public: - void Compute(const framework::ExecutionContext& ctx) const override { + void Compute(const framework::ExecutionContext &ctx) const override { ElemwiseGradKernel::Compute(ctx); using Tensor = framework::Tensor; - auto* dout = ctx.Input(framework::GradVarName("Out")); - auto* dx = ctx.Output(framework::GradVarName("X")); - auto* dy = ctx.Output(framework::GradVarName("Y")); + auto *dout = ctx.Input(framework::GradVarName("Out")); + auto *dx = ctx.Output(framework::GradVarName("X")); + auto *dy = ctx.Output(framework::GradVarName("Y")); // skip out, x, y - auto* out = dout; + auto *out = dout; auto *x = dout, *y = dout; if (platform::is_cpu_place(ctx.GetPlace()) && dx != nullptr && diff --git a/paddle/fluid/operators/elementwise_div_op.cc b/paddle/fluid/operators/elementwise/elementwise_div_op.cc similarity index 91% rename from paddle/fluid/operators/elementwise_div_op.cc rename to paddle/fluid/operators/elementwise/elementwise_div_op.cc index 84c8a65e5f859d276ae6d5f1a3f25c9d713a7a61..85612ba47448a7b0d712e9314e3980019c96e9c3 100644 --- a/paddle/fluid/operators/elementwise_div_op.cc +++ b/paddle/fluid/operators/elementwise/elementwise_div_op.cc @@ -12,8 +12,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/elementwise_div_op.h" -#include "paddle/fluid/operators/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_div_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" namespace ops = paddle::operators; REGISTER_ELEMWISE_OP(elementwise_div, "Div", "Out = X / Y"); diff --git a/paddle/fluid/operators/elementwise_div_op.cu b/paddle/fluid/operators/elementwise/elementwise_div_op.cu similarity index 95% rename from paddle/fluid/operators/elementwise_div_op.cu rename to paddle/fluid/operators/elementwise/elementwise_div_op.cu index 588d1f7420241ba1697e5141e4e4a2870f2dc87c..c5a1a7e08d89f3ef205af4c37246f8fa288189f3 100644 --- a/paddle/fluid/operators/elementwise_div_op.cu +++ b/paddle/fluid/operators/elementwise/elementwise_div_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. 
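For Out = X + Y with matching shapes, d(Out)/dX = d(Out)/dY = 1, so both input gradients are plain element-wise copies of dOut; that is what justifies the IdentityGrad functor and the blas copy fast path in elementwise_add_grad above. In sketch form (standalone, no Paddle types):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
  // For Out = X + Y (same shapes), both gradients are copies of dOut,
  // so the float path can hand the work to a single vector copy.
  std::vector<float> dout = {0.1f, 0.2f, 0.3f};
  std::vector<float> dx(dout.size()), dy(dout.size());
  std::copy(dout.begin(), dout.end(), dx.begin());
  std::copy(dout.begin(), dout.end(), dy.begin());
  std::cout << dx[1] << " " << dy[2] << "\n";  // 0.2 0.3
  return 0;
}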
*/ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/elementwise_div_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_div_op.h" namespace ops = paddle::operators; diff --git a/paddle/fluid/operators/elementwise_div_op.h b/paddle/fluid/operators/elementwise/elementwise_div_op.h similarity index 88% rename from paddle/fluid/operators/elementwise_div_op.h rename to paddle/fluid/operators/elementwise/elementwise_div_op.h index 41a7950bf0c598507c0fda48c6a43f2fd38c41d2..8a07339077aeaa4403ffd1e1e30e0d58a9cc30e7 100644 --- a/paddle/fluid/operators/elementwise_div_op.h +++ b/paddle/fluid/operators/elementwise/elementwise_div_op.h @@ -14,8 +14,8 @@ limitations under the License. */ #pragma once -#include "paddle/fluid/operators/elementwise_op.h" -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" namespace paddle { namespace operators { @@ -28,11 +28,10 @@ template class ElementwiseDivKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { - using Tensor = framework::Tensor; + auto* x = ctx.Input("X"); + auto* y = ctx.Input("Y"); + auto* z = ctx.Output("Out"); - auto* x = ctx.Input("X"); - auto* y = ctx.Input("Y"); - auto* z = ctx.Output("Out"); z->mutable_data(ctx.GetPlace()); int axis = ctx.Attr("axis"); ElementwiseComputeEx, DeviceContext, T>(ctx, x, y, axis, diff --git a/paddle/fluid/operators/elementwise_max_op.cc b/paddle/fluid/operators/elementwise/elementwise_max_op.cc similarity index 91% rename from paddle/fluid/operators/elementwise_max_op.cc rename to paddle/fluid/operators/elementwise/elementwise_max_op.cc index 411671335a19ae2283ca9db8b8f6bcbb6a6b630a..ea0dcd736e5700fb0f341938ac3e3e3b178f29c1 100644 --- a/paddle/fluid/operators/elementwise_max_op.cc +++ b/paddle/fluid/operators/elementwise/elementwise_max_op.cc @@ -12,8 +12,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/elementwise_max_op.h" -#include "paddle/fluid/operators/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_max_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" namespace ops = paddle::operators; REGISTER_ELEMWISE_OP(elementwise_max, "Max", "Out = max(X, Y)"); REGISTER_OP_CPU_KERNEL( diff --git a/paddle/fluid/operators/elementwise_max_op.cu b/paddle/fluid/operators/elementwise/elementwise_max_op.cu similarity index 95% rename from paddle/fluid/operators/elementwise_max_op.cu rename to paddle/fluid/operators/elementwise/elementwise_max_op.cu index 32c99835d66d8b11b72af162230aa383c7e4a57c..a90dcd3ecf0da114110db5946e111a8b3a925e42 100644 --- a/paddle/fluid/operators/elementwise_max_op.cu +++ b/paddle/fluid/operators/elementwise/elementwise_max_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. 
*/ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/elementwise_max_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_max_op.h" namespace ops = paddle::operators; diff --git a/paddle/fluid/operators/elementwise_max_op.h b/paddle/fluid/operators/elementwise/elementwise_max_op.h similarity index 88% rename from paddle/fluid/operators/elementwise_max_op.h rename to paddle/fluid/operators/elementwise/elementwise_max_op.h index bfb5c931958b4ca890ea720af42dad91d5625abb..3ee0c32e0d5d5df02d5d157416918fb4fb3aca92 100644 --- a/paddle/fluid/operators/elementwise_max_op.h +++ b/paddle/fluid/operators/elementwise/elementwise_max_op.h @@ -14,8 +14,8 @@ limitations under the License. */ #pragma once -#include "paddle/fluid/operators/elementwise_op.h" -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" namespace paddle { namespace operators { @@ -29,11 +29,10 @@ template class ElementwiseMaxKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { - using Tensor = framework::Tensor; + auto* x = ctx.Input("X"); + auto* y = ctx.Input("Y"); + auto* z = ctx.Output("Out"); - auto* x = ctx.Input("X"); - auto* y = ctx.Input("Y"); - auto* z = ctx.Output("Out"); z->mutable_data(ctx.GetPlace()); int axis = ctx.Attr("axis"); ElementwiseComputeEx, DeviceContext, T>(ctx, x, y, axis, diff --git a/paddle/fluid/operators/elementwise_min_op.cc b/paddle/fluid/operators/elementwise/elementwise_min_op.cc similarity index 91% rename from paddle/fluid/operators/elementwise_min_op.cc rename to paddle/fluid/operators/elementwise/elementwise_min_op.cc index 816192083d2275b26e6dd9afc76f2c021a01cf73..b263b9addd40cfd329d2cc8588c278df2cb008e9 100644 --- a/paddle/fluid/operators/elementwise_min_op.cc +++ b/paddle/fluid/operators/elementwise/elementwise_min_op.cc @@ -12,8 +12,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/elementwise_min_op.h" -#include "paddle/fluid/operators/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_min_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" namespace ops = paddle::operators; REGISTER_ELEMWISE_OP(elementwise_min, "Min", "Out = min(X, Y)"); REGISTER_OP_CPU_KERNEL( diff --git a/paddle/fluid/operators/elementwise_min_op.cu b/paddle/fluid/operators/elementwise/elementwise_min_op.cu similarity index 95% rename from paddle/fluid/operators/elementwise_min_op.cu rename to paddle/fluid/operators/elementwise/elementwise_min_op.cu index a237c9c503ec998fd74fec50a1d7949279bb38f0..ab77709c28c15a925bd3deac07c43e12b12cb781 100644 --- a/paddle/fluid/operators/elementwise_min_op.cu +++ b/paddle/fluid/operators/elementwise/elementwise_min_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. 
*/ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/elementwise_min_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_min_op.h" namespace ops = paddle::operators; diff --git a/paddle/fluid/operators/elementwise_min_op.h b/paddle/fluid/operators/elementwise/elementwise_min_op.h similarity index 88% rename from paddle/fluid/operators/elementwise_min_op.h rename to paddle/fluid/operators/elementwise/elementwise_min_op.h index db035ffb52e619b337c8190af4ed0e155aaac48d..d04e372faaa4e6296e982afe6155cdde2fec4f81 100644 --- a/paddle/fluid/operators/elementwise_min_op.h +++ b/paddle/fluid/operators/elementwise/elementwise_min_op.h @@ -14,8 +14,8 @@ limitations under the License. */ #pragma once -#include "paddle/fluid/operators/elementwise_op.h" -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" namespace paddle { namespace operators { @@ -28,11 +28,10 @@ template class ElementwiseMinKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { - using Tensor = framework::Tensor; + auto* x = ctx.Input("X"); + auto* y = ctx.Input("Y"); + auto* z = ctx.Output("Out"); - auto* x = ctx.Input("X"); - auto* y = ctx.Input("Y"); - auto* z = ctx.Output("Out"); z->mutable_data(ctx.GetPlace()); int axis = ctx.Attr("axis"); ElementwiseComputeEx, DeviceContext, T>(ctx, x, y, axis, diff --git a/paddle/fluid/operators/elementwise/elementwise_mul_mkldnn_op.cc b/paddle/fluid/operators/elementwise/elementwise_mul_mkldnn_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..10290a4aeff6b6a023fb28961d12728aff891e83 --- /dev/null +++ b/paddle/fluid/operators/elementwise/elementwise_mul_mkldnn_op.cc @@ -0,0 +1,201 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include "paddle/fluid/operators/elementwise/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" + +#include "paddle/fluid/platform/mkldnn_helper.h" + +#include "paddle/fluid/operators/math/jit_kernel.h" +#include "xbyak.h" +#include "xbyak_util.h" + +namespace paddle { +namespace operators { + +using framework::DataLayout; +using mkldnn::memory; + +static mkldnn::memory::format StringToMKLDNNFormat(std::string& format) { + std::transform(format.begin(), format.end(), format.begin(), ::tolower); + + if (!format.compare("nchw")) { + return memory::format::nchw; + } else if (!format.compare("nchw16c")) { + return memory::format::nChw16c; + } else if (!format.compare("nchw8c")) { + return memory::format::nChw8c; + } else if (!format.compare("nhwc")) { + return memory::format::nhwc; + } else { + return memory::format::any; + } +} + +static void UpdateDataFormat(const framework::ExecutionContext& ctx, + framework::Tensor* tensor, const char* attribute) { + if (ctx.op().HasAttr(attribute)) { + auto format_as_string = ctx.Attr(attribute); + auto format = StringToMKLDNNFormat(format_as_string); + if (format != memory::format::any) { + tensor->set_format(format); + } + } +} + +template +static void ReorderInput(framework::Tensor* tensor, + const platform::Place& place, + const mkldnn::engine& engine, bool isFourDim) { + using platform::to_void_cast; + auto dims = paddle::framework::vectorize2int(tensor->dims()); + framework::Tensor out_tensor; + out_tensor.Resize(tensor->dims()); + out_tensor.set_format(isFourDim ? memory::format::nchw : memory::format::nc); + out_tensor.set_layout(tensor->layout()); + mkldnn::memory input_memory = { + {{dims, platform::MKLDNNGetDataType(), tensor->format()}, engine}, + to_void_cast(tensor->data())}; + mkldnn::memory output_memory = { + {{dims, platform::MKLDNNGetDataType(), out_tensor.format()}, engine}, + to_void_cast(out_tensor.mutable_data(place))}; + platform::Reorder(input_memory, output_memory); + tensor->ShareDataWith(out_tensor); +} + +template +class ElementwiseMulMKLDNNKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + using Tensor = framework::Tensor; + + int axis = ctx.Attr("axis"); + auto* x = ctx.Input("X"); + auto* y = ctx.Input("Y"); + auto* z = ctx.Output("Out"); + const T* x_data = x->data(); + const T* y_data = y->data(); + T* z_data = z->mutable_data(ctx.GetPlace()); + + auto x_dims = x->dims(); + auto y_dims_untrimmed = y->dims(); + auto x_int_dims = paddle::framework::vectorize2int(x_dims); + + UpdateDataFormat(ctx, (Tensor*)x, "x_data_format"); + UpdateDataFormat(ctx, (Tensor*)y, "y_data_format"); + + Xbyak::util::Cpu cpu; + const bool is_avx512_enabled = cpu.has(Xbyak::util::Cpu::tAVX512F); + const bool are_dims_divisable = !(x_int_dims[1] % 16); + const bool is_x_format_correct = x->format() == memory::format::nChw16c; + const bool is_y_format_correct = y->format() == memory::format::nc; + if (is_x_format_correct && is_y_format_correct && are_dims_divisable && + is_avx512_enabled) { + int pre, n, post; + get_mid_dims(x_dims, y_dims_untrimmed, axis, &pre, &n, &post); + + if (post == 1) { + PADDLE_THROW("Not implemented when post is 1"); + } else { + // Just check whether it works for RE-Resnext. 
+ PADDLE_ENFORCE_EQ(x_dims.size(), 4, "X should have 4 dimensions"); + + int n = x_dims[0]; + int c = x_dims[1]; + int h = x_dims[2]; + int w = x_dims[3]; + + PADDLE_ENFORCE(y_dims_untrimmed[0] == n && y_dims_untrimmed[1] == c, + "Y should be in nc format"); + + constexpr int simd_width = 16; + int C = c / simd_width; + + const auto& multiply = + math::jitkernel::KernelPool::Instance() + .template Get>(n); + +#pragma omp parallel for collapse(2) + for (int ni = 0; ni < n; ni++) { + for (int ci = 0; ci < C; ci++) { + auto ptr_x = + x_data + ni * C * h * w * simd_width + ci * h * w * simd_width; + + auto ptr_y = y_data + ni * C * simd_width + ci * simd_width; + auto ptr_z = + z_data + ni * C * h * w * simd_width + ci * h * w * simd_width; + + multiply->Compute(ptr_x, ptr_y, ptr_z, h, w); + } + } + } + + z->set_layout(DataLayout::kMKLDNN); + z->set_format(x->format()); + } else { + // Fallback to naive version: + const bool are_inputs_in_same_format = x->format() == y->format(); + const bool is_x_nchw = x->format() == memory::format::nchw; + const bool is_x_nc = x->format() == memory::format::nc; + const bool is_y_nchw = y->format() == memory::format::nchw; + const bool is_y_nc = y->format() == memory::format::nc; + if (!are_inputs_in_same_format) { + using platform::MKLDNNDeviceContext; + auto& dev_ctx = ctx.template device_context(); + const auto& mkldnn_engine = dev_ctx.GetEngine(); + if (!(is_x_nchw || is_x_nc)) + ReorderInput((Tensor*)x, ctx.GetPlace(), mkldnn_engine, + x->dims().size() == 4); + if (!(is_y_nchw || is_y_nc)) + ReorderInput((Tensor*)y, ctx.GetPlace(), mkldnn_engine, + y->dims().size() == 4); + } + + auto mul_func = [](T a, T b) -> T { return a * b; }; + + TransformFunctor + functor( + x, y, z, + ctx.template device_context(), + mul_func); + + axis = (axis == -1 ? x_dims.size() - y_dims_untrimmed.size() : axis); + PADDLE_ENFORCE(axis >= 0 && axis < x_dims.size(), + "Axis should be in range [0, x_dims)"); + + auto y_dims = trim_trailing_singular_dims(y_dims_untrimmed); + axis = (y_dims.size() == 0) ? x_dims.size() : axis; + + int pre, n, post; + get_mid_dims(x_dims, y_dims, axis, &pre, &n, &post); + + if (post == 1) { + functor.RunRowWise(n, pre); + } else { + functor.RunMidWise(n, pre, post); + } + z->set_layout(DataLayout::kMKLDNN); + z->set_format(x->format()); + } + } +}; +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; + +REGISTER_OP_KERNEL(elementwise_mul, MKLDNN, ::paddle::platform::CPUPlace, + ops::ElementwiseMulMKLDNNKernel) diff --git a/paddle/fluid/operators/elementwise_mul_op.cc b/paddle/fluid/operators/elementwise/elementwise_mul_op.cc similarity index 95% rename from paddle/fluid/operators/elementwise_mul_op.cc rename to paddle/fluid/operators/elementwise/elementwise_mul_op.cc index 86a8459a79135d1fbcba6886172acc5a2abdb88b..d5e3300ac954aebf34a9c65fbca8de8fa2685932 100644 --- a/paddle/fluid/operators/elementwise_mul_op.cc +++ b/paddle/fluid/operators/elementwise/elementwise_mul_op.cc @@ -12,9 +12,9 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
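The AVX512 path above assumes x is stored in the nChw16c blocked layout: channels are grouped into C = c/16 blocks of 16 lanes, so element (ni, ci-block, hw, lane) lives at offset ((ni*C + ci)*h*w + hw)*16 + lane, while each y value of the nc-format tensor broadcasts over a whole h*w plane. A scalar sketch of that indexing (no JIT kernel, no MKL-DNN, illustrative shapes):

#include <iostream>
#include <vector>

int main() {
  const int n = 1, c = 32, h = 2, w = 2;  // c must be divisible by 16
  const int simd_width = 16, C = c / simd_width;

  std::vector<float> x(n * c * h * w, 2.0f);  // nChw16c-blocked storage
  std::vector<float> y(n * c, 3.0f);          // nc layout, broadcast over h*w
  std::vector<float> z(x.size(), 0.0f);

  for (int ni = 0; ni < n; ++ni)
    for (int ci = 0; ci < C; ++ci)
      for (int hw = 0; hw < h * w; ++hw)
        for (int lane = 0; lane < simd_width; ++lane) {
          // Same offsets as ptr_x/ptr_y in the kernel, one lane at a time.
          int xi = ((ni * C + ci) * h * w + hw) * simd_width + lane;
          int yi = (ni * C + ci) * simd_width + lane;
          z[xi] = x[xi] * y[yi];  // the JIT kernel does 16 lanes per step
        }

  std::cout << z[0] << "\n";  // 6
  return 0;
}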
*/ -#include "paddle/fluid/operators/elementwise_mul_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_mul_op.h" #include -#include "paddle/fluid/operators/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/elementwise_mul_op.cu b/paddle/fluid/operators/elementwise/elementwise_mul_op.cu similarity index 95% rename from paddle/fluid/operators/elementwise_mul_op.cu rename to paddle/fluid/operators/elementwise/elementwise_mul_op.cu index 2fb1b4bee689c059625e3dbd59f80c541ace83a0..4d16bc38e1d8e4cbbe3afbe08f233e14329e0f2e 100644 --- a/paddle/fluid/operators/elementwise_mul_op.cu +++ b/paddle/fluid/operators/elementwise/elementwise_mul_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/elementwise_mul_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_mul_op.h" namespace ops = paddle::operators; diff --git a/paddle/fluid/operators/elementwise_mul_op.h b/paddle/fluid/operators/elementwise/elementwise_mul_op.h similarity index 92% rename from paddle/fluid/operators/elementwise_mul_op.h rename to paddle/fluid/operators/elementwise/elementwise_mul_op.h index b870d08a1a28fd3e678aeb7211f7e3ec8b2c4c65..dc25bc57103286ce183a4649964fd96c62169b7f 100644 --- a/paddle/fluid/operators/elementwise_mul_op.h +++ b/paddle/fluid/operators/elementwise/elementwise_mul_op.h @@ -13,8 +13,8 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once -#include "paddle/fluid/operators/elementwise_op.h" -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" #include "paddle/fluid/operators/math/blas.h" namespace paddle { @@ -60,11 +60,10 @@ template class ElementwiseMulKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { - using Tensor = framework::Tensor; + auto* x = ctx.Input("X"); + auto* y = ctx.Input("Y"); + auto* z = ctx.Output("Out"); - auto* x = ctx.Input("X"); - auto* y = ctx.Input("Y"); - auto* z = ctx.Output("Out"); z->mutable_data(ctx.GetPlace()); if (x->numel() == y->numel()) { elementwise_mul(ctx, x, y, z); diff --git a/paddle/fluid/operators/elementwise_op.h b/paddle/fluid/operators/elementwise/elementwise_op.h similarity index 83% rename from paddle/fluid/operators/elementwise_op.h rename to paddle/fluid/operators/elementwise/elementwise_op.h index 7e5975ead64ab39a9c618a33e300c4fce55a5b22..85a7817be9b3a82d40853b417d78a7fdf67f6c1f 100644 --- a/paddle/fluid/operators/elementwise_op.h +++ b/paddle/fluid/operators/elementwise/elementwise_op.h @@ -13,10 +13,12 @@ See the License for the specific language governing permissions and limitations under the License. 
 */
 #pragma once
+
 #include <string>
 #include "paddle/fluid/framework/data_layout.h"
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/framework/operator.h"
+
 #ifdef PADDLE_WITH_MKLDNN
 #include "paddle/fluid/platform/mkldnn_helper.h"
 #endif
@@ -29,7 +31,8 @@ class ElementwiseOp : public framework::OperatorWithKernel {
  using framework::OperatorWithKernel::OperatorWithKernel;
  using Tensor = framework::Tensor;
-  void InferShape(framework::InferShapeContext* ctx) const override {
+
+  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of elementwise op should not be null.");
    PADDLE_ENFORCE(ctx->HasInput("Y"),
@@ -37,6 +40,17 @@ class ElementwiseOp : public framework::OperatorWithKernel {
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
                   "Output(Out) of elementwise op should not be null.");
+    PADDLE_ENFORCE(
+        ctx->GetInputsVarType("X").front() ==
+            framework::proto::VarType::LOD_TENSOR,
+        "The input var %s's type should be LoDTensor, but the received is %s",
+        ctx->Inputs("X").front(), ctx->GetInputsVarType("X").front());
+    PADDLE_ENFORCE(
+        ctx->GetInputsVarType("Y").front() ==
+            framework::proto::VarType::LOD_TENSOR,
+        "The input var %s's type should be LoDTensor, but the received is %s",
+        ctx->Inputs("Y").front(), ctx->GetInputsVarType("Y").front());
+
    auto x_dim = ctx->GetInputDim("X");
    auto y_dim = ctx->GetInputDim("Y");
    PADDLE_ENFORCE_GE(x_dim.size(), y_dim.size(),
@@ -47,9 +61,8 @@
  }
  framework::OpKernelType GetExpectedKernelType(
-      const framework::ExecutionContext& ctx) const override {
-    auto input_data_type =
-        framework::ToDataType(ctx.Input<Tensor>("X")->type());
+      const framework::ExecutionContext &ctx) const override {
+    auto input_data_type = framework::GetDataTypeOfVar(ctx.InputVar("X"));
 #ifdef PADDLE_WITH_MKLDNN
    if (platform::CanMKLDNNBeUsed(ctx)) {
@@ -62,16 +75,12 @@
  }
};
-class ElementwiseOpInferVarType : public framework::VarTypeInference {
- public:
-  void operator()(const framework::OpDesc& op_desc,
-                  framework::BlockDesc* block) const override {
-    auto x_name = op_desc.Input("X")[0];
-    auto out_name = op_desc.Output("Out")[0];
-    auto& x = block->FindRecursiveOrCreateVar(x_name);
-    auto& out = block->FindRecursiveOrCreateVar(out_name);
-    out.SetType(x.GetType());
-    out.SetDataType(x.GetDataType());
+class ElementwiseOpInferVarType
+    : public framework::PassInDtypeAndVarTypeToOutput {
+ protected:
+  std::unordered_map<std::string, std::string> GetInputOutputWithSameType()
+      const override {
+    return std::unordered_map<std::string, std::string>{{"X", /*->*/ "Out"}};
  }
};
@@ -80,8 +89,6 @@ class ElementwiseOpMaker : public framework::OpProtoAndCheckerMaker {
  void Make() final {
    AddInput("X", "(Tensor), The first input tensor of elementwise op.");
    AddInput("Y", "(Tensor), The second input tensor of elementwise op.");
-    // AddOutput("SavedShape", "(Tensor), save X, Y shape for grad to save
-    // memory.").AsIntermediate();
    AddOutput("Out", "The output of elementwise op.");
    AddAttr<int>("axis",
                 "(int, default -1). The start dimension index "
@@ -90,6 +97,20 @@
        .EqualGreaterThan(-1);
    AddAttr<bool>("use_mkldnn", "(bool, default false). Used by MKLDNN.")
        .SetDefault(false);
+    AddAttr<std::string>(
+        "x_data_format",
+        "(string, default \"\") Only used in mkldnn. "
+        "An optional string from: \"NHWC\", \"NCHW\", \"NCHW16C\", \"NCHW8C\". "
+        "Defaults to \"\".
Specify the data format of the output data, " + "the input will be transformed automatically. ") + .SetDefault(""); + AddAttr( + "y_data_format", + "(string, default \"\") Only used in mkldnn" + "An optional string from: \"NHWC\", \"NCHW\", \"NCHW16C\", \"NCHW8C\". " + "Defaults to \"\". Specify the data format of the output data, " + "the input will be transformed automatically. ") + .SetDefault(""); AddComment(string::Sprintf(R"DOC( Elementwise %s Operator @@ -129,13 +150,12 @@ But the output only shares the LoD information with the input $X$. )DOC", GetName(), GetEquation())); - SetReuse(); } protected: virtual std::string GetName() const = 0; + virtual std::string GetEquation() const = 0; - virtual void SetReuse() {} }; class ElementwiseOpGrad : public framework::OperatorWithKernel { @@ -143,7 +163,7 @@ class ElementwiseOpGrad : public framework::OperatorWithKernel { using framework::OperatorWithKernel::OperatorWithKernel; using Tensor = framework::Tensor; - void InferShape(framework::InferShapeContext* ctx) const override { + void InferShape(framework::InferShapeContext *ctx) const override { PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null"); PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should not be null"); PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")), @@ -169,7 +189,7 @@ class ElementwiseOpGrad : public framework::OperatorWithKernel { } framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext& ctx) const override { + const framework::ExecutionContext &ctx) const override { auto input_data_type = framework::ToDataType( ctx.Input(framework::GradVarName("Out"))->type()); @@ -191,7 +211,7 @@ class ElementwiseOpExplicitGrad : public ElementwiseOpGrad { using operators::ElementwiseOpGrad::GetExpectedKernelType; using Tensor = framework::Tensor; - void InferShape(framework::InferShapeContext* ctx) const override { + void InferShape(framework::InferShapeContext *ctx) const override { PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")), "Input(Out@GRAD) should not be null"); @@ -213,11 +233,11 @@ class ElementwiseOpExplicitGrad : public ElementwiseOpGrad { template class ElemwiseGradKernel : public framework::OpKernel { public: - void Compute(const framework::ExecutionContext& context) const override { - auto* dx = + void Compute(const framework::ExecutionContext &context) const override { + auto *dx = context.Output(framework::GradVarName("X")); if (dx != nullptr) { - auto& dout = + auto &dout = *context.Input(framework::GradVarName("Out")); dx->set_lod(dout.lod()); } @@ -238,7 +258,7 @@ class ElemwiseGradKernel : public framework::OpKernel { \ protected: \ std::unique_ptr Apply() const override { \ - auto* op = new paddle::framework::OpDesc(); \ + auto *op = new paddle::framework::OpDesc(); \ op->SetType(#kernel_type "_grad"); \ op->SetInput("Y", Input("Y")); \ op->SetInput(::paddle::framework::GradVarName("Out"), \ @@ -269,7 +289,6 @@ class ElemwiseGradKernel : public framework::OpKernel { protected: \ virtual std::string GetName() const { return op_name; } \ virtual std::string GetEquation() const { return equation; } \ - virtual void SetReuse() { Reuse(__VA_ARGS__); } \ }; \ REGISTER_OPERATOR(op_type, ::paddle::operators::ElementwiseOp, \ __ElemwiseOp##op_type##Maker__, \ diff --git a/paddle/fluid/operators/elementwise_op_function.h b/paddle/fluid/operators/elementwise/elementwise_op_function.h similarity index 99% rename from paddle/fluid/operators/elementwise_op_function.h rename to 
paddle/fluid/operators/elementwise/elementwise_op_function.h index 7c84a9d813948ab7347446872643c2e00823a5ad..7bb6934e1496cc989eee8ba82f56959522803bfb 100644 --- a/paddle/fluid/operators/elementwise_op_function.h +++ b/paddle/fluid/operators/elementwise/elementwise_op_function.h @@ -111,6 +111,17 @@ class RowwiseTransformIterator return *this; } + RowwiseTransformIterator &operator+(int n) { + while (n-- > 0) { + ++i_; + if (UNLIKELY(i_ == n_)) { + i_ = 0; + } + } + + return *this; + } + bool operator==(const RowwiseTransformIterator &rhs) const { return (ptr_ + i_) == &(*rhs); @@ -149,6 +160,21 @@ class MidWiseTransformIterator return *this; } + MidWiseTransformIterator &operator+(int n) { + while (n-- > 0) { + ++j_; + if (UNLIKELY(j_ == post_)) { + ++i_; + j_ = 0; + if (UNLIKELY(i_ == n_)) { + i_ = 0; + } + } + } + + return *this; + } + bool operator==(const MidWiseTransformIterator &rhs) const { return (ptr_ + i_) == &(*rhs); @@ -365,7 +391,7 @@ static __global__ void ElemwiseGradBroadcast1CUDAKernel( int j = blockIdx.x; int i = threadIdx.x; int tid = threadIdx.x; - T val = 0; + T val(0); do { int x_offset = i * w + j; @@ -433,7 +459,7 @@ static __global__ void ElemwiseGradBroadcast2CUDAKernel( int tid = threadIdx.x; int j = blockIdx.x; - T val = 0; + T val(0); int ttid = tid; while (true) { diff --git a/paddle/fluid/operators/elementwise_pow_op.cc b/paddle/fluid/operators/elementwise/elementwise_pow_op.cc similarity index 90% rename from paddle/fluid/operators/elementwise_pow_op.cc rename to paddle/fluid/operators/elementwise/elementwise_pow_op.cc index 5fd6bde9ba0930e29f2161f1ff23ff9f5e7dc85d..6335e67a8a48c8702f0cb14ce947275d47e01d17 100644 --- a/paddle/fluid/operators/elementwise_pow_op.cc +++ b/paddle/fluid/operators/elementwise/elementwise_pow_op.cc @@ -12,9 +12,9 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/elementwise_pow_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_pow_op.h" #include -#include "paddle/fluid/operators/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/elementwise_pow_op.cu b/paddle/fluid/operators/elementwise/elementwise_pow_op.cu similarity index 92% rename from paddle/fluid/operators/elementwise_pow_op.cu rename to paddle/fluid/operators/elementwise/elementwise_pow_op.cu index 1f19ebd470973137b465381e498ab07a36323c14..6ee0779f23bc2c734aa1d439abb12f366227e686 100644 --- a/paddle/fluid/operators/elementwise_pow_op.cu +++ b/paddle/fluid/operators/elementwise/elementwise_pow_op.cu @@ -10,7 +10,7 @@ See the License for the specific language governing permissions and limitations under the License. 
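The operator+(int n) overloads added above keep the same wrap-around behaviour as n applications of operator++: a RowwiseTransformIterator cycles through Y every n_ elements, so a short Y can be paired element-by-element with each row of a longer X. The broadcast this enables, written as plain loops rather than the iterator classes (a sketch, not the real RunRowWise):

#include <iostream>
#include <vector>

int main() {
  // X has shape (2, 3); Y has shape (3). get_mid_dims would yield
  // pre = 2, n = 3, post = 1, i.e. the row-wise broadcast case.
  std::vector<float> x = {1, 2, 3, 4, 5, 6};
  std::vector<float> y = {10, 20, 30};
  const int pre = 2, n = 3;

  std::vector<float> z(x.size());
  for (int i = 0; i < pre; ++i)
    for (int j = 0; j < n; ++j)
      z[i * n + j] = x[i * n + j] + y[j];  // y's index wraps every n steps

  for (float v : z) std::cout << v << " ";  // 11 22 33 14 25 36
  std::cout << "\n";
  return 0;
}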
*/ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/elementwise_pow_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_pow_op.h" namespace ops = paddle::operators; diff --git a/paddle/fluid/operators/elementwise_pow_op.h b/paddle/fluid/operators/elementwise/elementwise_pow_op.h similarity index 95% rename from paddle/fluid/operators/elementwise_pow_op.h rename to paddle/fluid/operators/elementwise/elementwise_pow_op.h index 8c1c5f9f98018d8d4368a9333e2004141615775d..dc584b4c32fc3063da0c6de50577d28afcb63b83 100644 --- a/paddle/fluid/operators/elementwise_pow_op.h +++ b/paddle/fluid/operators/elementwise/elementwise_pow_op.h @@ -15,7 +15,7 @@ limitations under the License. */ #pragma once #include -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/elementwise_sub_op.cc b/paddle/fluid/operators/elementwise/elementwise_sub_op.cc similarity index 92% rename from paddle/fluid/operators/elementwise_sub_op.cc rename to paddle/fluid/operators/elementwise/elementwise_sub_op.cc index b7224261e6a7ca82dff92a25f5fe8818c08e676d..efc66374c812cbd07adef6ac25c9616b880ec383 100644 --- a/paddle/fluid/operators/elementwise_sub_op.cc +++ b/paddle/fluid/operators/elementwise/elementwise_sub_op.cc @@ -12,8 +12,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/elementwise_sub_op.h" -#include "paddle/fluid/operators/elementwise_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_sub_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_op.h" namespace ops = paddle::operators; REGISTER_ELEMWISE_GRAD_MAKER(elementwise_sub, Sub); REGISTER_ELEMWISE_EXPLICIT_OP(elementwise_sub, "Sub", "Out = X - Y", "Out", diff --git a/paddle/fluid/operators/elementwise_sub_op.cu b/paddle/fluid/operators/elementwise/elementwise_sub_op.cu similarity index 95% rename from paddle/fluid/operators/elementwise_sub_op.cu rename to paddle/fluid/operators/elementwise/elementwise_sub_op.cu index 8709f686f9af1bf4dacbc2dfc3e2d5dcc1c59b9a..8d9bf7c4d81d49d83b5d1cf0369be5c9957242b4 100644 --- a/paddle/fluid/operators/elementwise_sub_op.cu +++ b/paddle/fluid/operators/elementwise/elementwise_sub_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/elementwise_sub_op.h" +#include "paddle/fluid/operators/elementwise/elementwise_sub_op.h" namespace ops = paddle::operators; diff --git a/paddle/fluid/operators/elementwise_sub_op.h b/paddle/fluid/operators/elementwise/elementwise_sub_op.h similarity index 88% rename from paddle/fluid/operators/elementwise_sub_op.h rename to paddle/fluid/operators/elementwise/elementwise_sub_op.h index 3385df0897700d37d60d8804a01db777ebc02a7e..770323fe5a8fe7c1051b418b2541ab4c669635b4 100644 --- a/paddle/fluid/operators/elementwise_sub_op.h +++ b/paddle/fluid/operators/elementwise/elementwise_sub_op.h @@ -13,8 +13,8 @@ See the License for the specific language governing permissions and limitations under the License. 
 */
 #pragma once
-#include "paddle/fluid/operators/elementwise_op.h"
-#include "paddle/fluid/operators/elementwise_op_function.h"
+#include "paddle/fluid/operators/elementwise/elementwise_op.h"
+#include "paddle/fluid/operators/elementwise/elementwise_op_function.h"
 namespace paddle {
 namespace operators {
@@ -28,11 +28,10 @@ template <typename DeviceContext, typename T>
 class ElementwiseSubKernel : public framework::OpKernel<T> {
  public:
  void Compute(const framework::ExecutionContext& ctx) const override {
-    using Tensor = framework::Tensor;
+    auto* x = ctx.Input<framework::LoDTensor>("X");
+    auto* y = ctx.Input<framework::LoDTensor>("Y");
+    auto* z = ctx.Output<framework::LoDTensor>("Out");
-    auto* x = ctx.Input<Tensor>("X");
-    auto* y = ctx.Input<Tensor>("Y");
-    auto* z = ctx.Output<Tensor>("Out");
    z->mutable_data<T>(ctx.GetPlace());
    int axis = ctx.Attr<int>("axis");
    ElementwiseComputeEx<SubFunctor<T>, DeviceContext, T>(ctx, x, y, axis,
diff --git a/paddle/fluid/operators/expand_op.cc b/paddle/fluid/operators/expand_op.cc
index 5ad0ec251328cc1ba580026bb47bf05316e7dc77..40f7c1c54c861abebc84428f55e2769ac8969f0f 100644
--- a/paddle/fluid/operators/expand_op.cc
+++ b/paddle/fluid/operators/expand_op.cc
@@ -47,6 +47,11 @@ class ExpandOp : public framework::OperatorWithKernel {
      out_shape[i] = x_dims[i] * expand_times[i];
    }
+    // at compile time, keep the first (batch) dim as x_dims[0], which is
+    // usually the -1 placeholder
+    if (!ctx->IsRuntime()) {
+      out_shape[0] = x_dims[0];
+    }
+
    ctx->SetOutputDim("Out", framework::make_ddim(out_shape));
    if (out_shape[0] == x_dims[0]) {
      ctx->ShareLoD("X", "Out");
@@ -109,7 +114,16 @@ class ExpandGradOp : public framework::OperatorWithKernel {
        ctx->Attrs().Get<std::vector<int>>("expand_times");
    auto out_dims = ctx->GetInputDim(framework::GradVarName("Out"));
-    for (size_t i = 0; i < expand_times.size(); ++i) {
+    size_t start_pos = 0u;
+    if (!ctx->IsRuntime()) {
+      PADDLE_ENFORCE_EQ(
+          x_dims[0], out_dims[0],
+          "The first dimension size of Input(Out@GRAD) should be "
+          "equal to the corresponding dimension size of Input(X)");
+      start_pos = 1u;
+    }
+
+    for (size_t i = start_pos; i < expand_times.size(); ++i) {
      PADDLE_ENFORCE_EQ(x_dims[i] * expand_times[i], out_dims[i],
                        "Each dimension size of Input(Out@GRAD) should be "
                        "equal to multiplication of corresponding dimension "
diff --git a/paddle/fluid/operators/extract_rows_op.cc b/paddle/fluid/operators/extract_rows_op.cc
deleted file mode 100644
index 3acae3bcdf4a509ab6e7e19f21c4b2ec4d72b7d7..0000000000000000000000000000000000000000
--- a/paddle/fluid/operators/extract_rows_op.cc
+++ /dev/null
@@ -1,103 +0,0 @@
-/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
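The expand_op change above keeps the leading (batch) dimension symbolic at compile time: x_dims[0] is normally the -1 placeholder, and multiplying it by expand_times[0] would bake in a bogus constant, so out_shape[0] simply mirrors x_dims[0] until runtime. A sketch of the shape arithmetic (hypothetical standalone helper, not the framework API):

#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical mirror of ExpandOp::InferShape's shape arithmetic.
std::vector<int64_t> ExpandOutShape(const std::vector<int64_t>& x_dims,
                                    const std::vector<int>& expand_times,
                                    bool is_runtime) {
  std::vector<int64_t> out(x_dims.size());
  for (size_t i = 0; i < x_dims.size(); ++i)
    out[i] = x_dims[i] * expand_times[i];
  // Compile-time pass: the batch dim is usually the -1 placeholder,
  // so keep x_dims[0] instead of multiplying it.
  if (!is_runtime) out[0] = x_dims[0];
  return out;
}

int main() {
  std::vector<int> times = {2, 3, 1};
  auto compile_shape = ExpandOutShape({-1, 4, 5}, times, /*is_runtime=*/false);
  auto runtime_shape = ExpandOutShape({8, 4, 5}, times, /*is_runtime=*/true);
  std::cout << compile_shape[0] << " " << compile_shape[1] << "\n";  // -1 12
  std::cout << runtime_shape[0] << " " << runtime_shape[1] << "\n";  // 16 12
  return 0;
}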
*/ - -#include -#include -#include "paddle/fluid/framework/op_registry.h" - -namespace paddle { -namespace operators { - -class ExtractRowsOpInferShape : public framework::InferShapeBase { - public: - void operator()(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("X"), - "Input(X) of ExtractRowsOp should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("Out"), - "Output(Out) of ExtractRowsOp should not be null."); - PADDLE_ENFORCE_EQ(ctx->GetInputsVarType("X")[0], - framework::proto::VarType::SELECTED_ROWS, - "The type of input(X) must be SelectedRows."); - auto in_dims = ctx->GetInputDim("X"); - - ctx->SetOutputDim( - "Out", framework::make_ddim(std::vector{in_dims[0], 1})); - } -}; - -class ExtractRowsOp : public framework::OperatorBase { - public: - ExtractRowsOp(const std::string &type, - const framework::VariableNameMap &inputs, - const framework::VariableNameMap &outputs, - const framework::AttributeMap &attrs) - : framework::OperatorBase(type, inputs, outputs, attrs) {} - - private: - void RunImpl(const framework::Scope &scope, - const platform::Place &place) const override { - auto &in = scope.FindVar(Input("X"))->Get(); - auto out = scope.FindVar(Output("Out"))->GetMutable(); - - auto &in_rows = in.rows(); - auto out_dim = framework::make_ddim( - std::vector{static_cast(in_rows.size()), 1}); - auto dst_ptr = out->mutable_data(out_dim, in.place()); - - if (paddle::platform::is_gpu_place(in.place())) { -#ifdef PADDLE_WITH_CUDA - platform::DeviceContextPool &pool = - platform::DeviceContextPool::Instance(); - auto *dev_ctx = pool.Get(in.place()); - auto src_ptr = in_rows.Data(in.place()); - auto stream = - reinterpret_cast(*dev_ctx) - .stream(); - memory::Copy(boost::get(out->place()), dst_ptr, - boost::get(in.place()), src_ptr, - in_rows.size() * sizeof(int64_t), stream); -#else - PADDLE_THROW("Not compiled with CUDA."); -#endif - } else { - memory::Copy(platform::CPUPlace(), dst_ptr, platform::CPUPlace(), - in_rows.data(), in_rows.size() * sizeof(int64_t)); - } - } -}; - -class ExtractRowsOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddInput("X", - "(SelectedRows). The input tensor of extract_rows operator," - " and its type is SelectedRows."); - AddOutput("Out", "(Tensor). The the rows of input(X)."); - - AddComment(R"DOC( - ExtractRows Operator. - -The function of extract_rows_op is extracting the rows from the input(X) -whose type is SelectedRows. 
- - )DOC"); - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OPERATOR(extract_rows, ops::ExtractRowsOp, ops::ExtractRowsOpMaker, - ops::ExtractRowsOpInferShape); diff --git a/paddle/fluid/operators/fake_quantize_op.cc b/paddle/fluid/operators/fake_quantize_op.cc index e608eba05d5680254835f7b25f53d6a59e310e2a..43af83fd693b433337bdc80188bd0568f76b3e66 100644 --- a/paddle/fluid/operators/fake_quantize_op.cc +++ b/paddle/fluid/operators/fake_quantize_op.cc @@ -138,7 +138,7 @@ class FakeQuantizeAbsMaxOpMaker : public framework::OpProtoAndCheckerMaker { AddComment(R"DOC( FakeQuantize operator -$$scale = max(abs(X))$$ +$$scale = max(abs(X))$$ $$range = 2^{bit_length - 1} - 1$$ $$Out = round(X/scale * range)$$ @@ -199,11 +199,14 @@ class FakeQuantizeRangeAbsMaxOpMaker PADDLE_ENFORCE(bit_length >= 1 && bit_length <= 16, "'bit_length' should be between 1 and 16."); }); - AddAttr("is_test", "").SetDefault(false); + AddAttr("is_test", + "(bool, default false) Set to true for inference only, false " + "for training. Some layers may run faster when this is true.") + .SetDefault(false); AddComment(R"DOC( FakeQuantize operator is used in static quantization. -$$scale = max(max(abs(x)), history_abs_max)$$ +$$scale = max(max(abs(x)), history_abs_max)$$ $$range = 2^{bit_length - 1} - 1$$ $$Out = round(X/scale * range)$$ diff --git a/paddle/fluid/operators/fc_op.cc b/paddle/fluid/operators/fc_op.cc index fa4dec9cf118cef9b836943fd4eae90d23e6218a..e80249fc87855311479b35af61f872182292795a 100644 --- a/paddle/fluid/operators/fc_op.cc +++ b/paddle/fluid/operators/fc_op.cc @@ -27,11 +27,9 @@ void FCOp::InferShape(framework::InferShapeContext* ctx) const { "Out(Output) of Fully Connected should not be null."); PADDLE_ENFORCE(ctx->HasInput("W"), "W(Input) of Fully Connected should not be null."); - // NCHW + auto in_dims = ctx->GetInputDim("Input"); - // IO, I=C*H*W auto w_dims = ctx->GetInputDim("W"); - std::vector output_shape({in_dims[0], w_dims[1]}); if (ctx->HasInput("Bias")) { auto bias_dims = ctx->GetInputDim("Bias"); @@ -44,14 +42,32 @@ void FCOp::InferShape(framework::InferShapeContext* ctx) const { "The shape of Bias must be [1, dim]."); } } - PADDLE_ENFORCE(in_dims.size() == 2 || in_dims.size() == 4, - "Fully Connected input should be 2-D or 4-D tensor."); + + if (ctx->Attrs().Get("use_mkldnn")) { + PADDLE_ENFORCE(in_dims.size() == 2 || in_dims.size() == 4, + "Fully Connected input should be 2-D or 4-D tensor."); + } PADDLE_ENFORCE_EQ(w_dims.size(), 2UL, "Fully Connected input should be 2-D tensor."); - PADDLE_ENFORCE_EQ(framework::product(in_dims) / in_dims[0], w_dims[0], - "Fully Connected input and weigth size do not match."); + int in_num_col_dims = ctx->Attrs().Get("in_num_col_dims"); + PADDLE_ENFORCE_GT( + in_dims.size(), in_num_col_dims, + "The input tensor Input's rank of FCOp should be larger than " + "in_num_col_dims."); + + auto in_mat_dims = framework::flatten_to_2d(in_dims, in_num_col_dims); + PADDLE_ENFORCE_EQ( + in_mat_dims[1], w_dims[0], + "Fully Connected input and weigth size do not match. 
%s, %s"); + + std::vector output_dims; + output_dims.reserve(static_cast(in_num_col_dims + 1)); + for (int i = 0; i < in_num_col_dims; ++i) { + output_dims.push_back(in_dims[i]); + } + output_dims.push_back(w_dims[1]); - ctx->SetOutputDim("Out", framework::make_ddim(output_shape)); + ctx->SetOutputDim("Out", framework::make_ddim(output_dims)); ctx->ShareLoD("Input", "Out"); } @@ -101,12 +117,15 @@ framework::OpKernelType FCOpGrad::GetExpectedKernelType( } void FCOpMaker::Make() { - AddInput("Input", - "(Tensor), The input tensor of fully connected operator with format " - "(NCHW). "); + AddInput("Input", "(Tensor), The input tensor of fully connected operator."); AddInput("W", "(Tensor), The weight fc op with shape (I, O)."); AddInput("Bias", "(Tensor, optional) Bias vector with shape (1 x O") .AsDispensable(); + AddAttr("in_num_col_dims", + "(int, default 1), The fc op can take tensors with more than " + "two dimensions as its inputs.") + .SetDefault(1) + .EqualGreaterThan(1); AddOutput("Out", "(Tensor) The output tensor of fully connected operator. "); AddAttr("use_mkldnn", "(bool, default false) Only used in mkldnn kernel") @@ -131,13 +150,15 @@ class FCOpKernel : public framework::OpKernel { auto output = ctx.Output("Out"); auto in_dims = input->dims(); auto w_dims = w->dims(); + auto out_dims = output->dims(); + int M = framework::product(out_dims) / out_dims[out_dims.size() - 1]; const T* input_data = input->data(); const T* w_data = w->data(); T* output_data = output->mutable_data(ctx.GetPlace()); auto blas = math::GetBlas(ctx); math::FCCompute( - blas, in_dims[0], w_dims[1], w_dims[0], input_data, w_data, output_data, + blas, M, w_dims[1], w_dims[0], input_data, w_data, output_data, bias ? bias->data() : NULL); // TODO(TJ): fuse act diff --git a/paddle/fluid/operators/fill_constant_op.cc b/paddle/fluid/operators/fill_constant_op.cc index e04a68717b351ddb0be5a7e70aa9297e5eb0125f..252f313440296bd9e5eebf26f67b08bbe7decce8 100644 --- a/paddle/fluid/operators/fill_constant_op.cc +++ b/paddle/fluid/operators/fill_constant_op.cc @@ -24,7 +24,7 @@ class FillConstantInferShape : public framework::InferShapeBase { void operator()(framework::InferShapeContext *ctx) const override { PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) of FillConstantOp should not be null."); - auto &shape = ctx->Attrs().Get>("shape"); + auto &shape = ctx->Attrs().Get>("shape"); ctx->SetOutputDim("Out", framework::make_ddim(shape)); } }; @@ -47,10 +47,10 @@ class FillConstantOp : public framework::OperatorBase { if (out_var.IsType()) { tensor = out_var.GetMutable(); - tensor->Resize(framework::make_ddim(Attr>("shape"))); + tensor->Resize(framework::make_ddim(Attr>("shape"))); } else if (out_var.IsType()) { tensor = out_var.GetMutable()->mutable_value(); - tensor->Resize(framework::make_ddim(Attr>("shape"))); + tensor->Resize(framework::make_ddim(Attr>("shape"))); } else { PADDLE_THROW( "fill constant op's output only" @@ -83,7 +83,8 @@ class FillConstantOpMaker : public framework::OpProtoAndCheckerMaker { "(int, default 5 (FP32)) " "Output data type") .SetDefault(framework::proto::VarType::FP32); - AddAttr>("shape", "(vector) The shape of the output"); + AddAttr>("shape", + "(vector) The shape of the output"); AddAttr("value", "(float, default 0) The value to be filled") .SetDefault(0.0f); AddAttr("force_cpu", diff --git a/paddle/fluid/operators/fused/CMakeLists.txt b/paddle/fluid/operators/fused/CMakeLists.txt new file mode 100644 index 
0000000000000000000000000000000000000000..a0397acab1267365b8aeba30a63152b61b5b25bb --- /dev/null +++ b/paddle/fluid/operators/fused/CMakeLists.txt @@ -0,0 +1,6 @@ +include(operators) +register_operators(EXCLUDES fusion_transpose_flatten_concat_op) +if (WITH_GPU) + op_library(fusion_transpose_flatten_concat_op) + file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(fusion_transpose_flatten_concat);\n") +endif() diff --git a/paddle/fluid/operators/fused_elemwise_activation_op.cc b/paddle/fluid/operators/fused/fused_elemwise_activation_op.cc similarity index 99% rename from paddle/fluid/operators/fused_elemwise_activation_op.cc rename to paddle/fluid/operators/fused/fused_elemwise_activation_op.cc index d88ef15949da3809bffe41e4bf303d1fee568675..3771aac0dfd98a52dcd8b789e5a6114e977e22f8 100644 --- a/paddle/fluid/operators/fused_elemwise_activation_op.cc +++ b/paddle/fluid/operators/fused/fused_elemwise_activation_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/fused_elemwise_activation_op.h" +#include "paddle/fluid/operators/fused/fused_elemwise_activation_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/fused_elemwise_activation_op.cu b/paddle/fluid/operators/fused/fused_elemwise_activation_op.cu similarity index 94% rename from paddle/fluid/operators/fused_elemwise_activation_op.cu rename to paddle/fluid/operators/fused/fused_elemwise_activation_op.cu index e1d2b16b4b5e3a480777f834c2cbeb6d00a755e4..e10693bae1859307c9cf266965d4ce20e6de1bf9 100644 --- a/paddle/fluid/operators/fused_elemwise_activation_op.cu +++ b/paddle/fluid/operators/fused/fused_elemwise_activation_op.cu @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/fused_elemwise_activation_op.h" +#include "paddle/fluid/operators/fused/fused_elemwise_activation_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/fused_elemwise_activation_op.h b/paddle/fluid/operators/fused/fused_elemwise_activation_op.h similarity index 99% rename from paddle/fluid/operators/fused_elemwise_activation_op.h rename to paddle/fluid/operators/fused/fused_elemwise_activation_op.h index 5ae9aea959c268985c17643f2f47199c852c2bcb..01dc2dbfd61cc88f72174233382aa49f61c9b60f 100644 --- a/paddle/fluid/operators/fused_elemwise_activation_op.h +++ b/paddle/fluid/operators/fused/fused_elemwise_activation_op.h @@ -19,7 +19,7 @@ limitations under the License. 
*/ #include "paddle/fluid/framework/op_desc.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/detail/safe_ref.h" -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" #include "paddle/fluid/operators/math/compound_functors.h" #include "paddle/fluid/operators/math/functors.h" diff --git a/paddle/fluid/operators/fused_embedding_fc_lstm_op.cc b/paddle/fluid/operators/fused/fused_embedding_fc_lstm_op.cc similarity index 99% rename from paddle/fluid/operators/fused_embedding_fc_lstm_op.cc rename to paddle/fluid/operators/fused/fused_embedding_fc_lstm_op.cc index fdc9cb4888b3468b85abfa0c693ed8ac5b0d450b..6d463538d232e1a38f845e7abc3786568ca3bb21 100644 --- a/paddle/fluid/operators/fused_embedding_fc_lstm_op.cc +++ b/paddle/fluid/operators/fused/fused_embedding_fc_lstm_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/fused_embedding_fc_lstm_op.h" +#include "paddle/fluid/operators/fused/fused_embedding_fc_lstm_op.h" #include #include "paddle/fluid/operators/math/blas.h" #include "paddle/fluid/operators/math/cpu_vec.h" diff --git a/paddle/fluid/operators/fused_embedding_fc_lstm_op.h b/paddle/fluid/operators/fused/fused_embedding_fc_lstm_op.h similarity index 100% rename from paddle/fluid/operators/fused_embedding_fc_lstm_op.h rename to paddle/fluid/operators/fused/fused_embedding_fc_lstm_op.h diff --git a/paddle/fluid/operators/fusion_gru_op.cc b/paddle/fluid/operators/fused/fusion_gru_op.cc similarity index 75% rename from paddle/fluid/operators/fusion_gru_op.cc rename to paddle/fluid/operators/fused/fusion_gru_op.cc index a04c1c1263fba659e2d3f623b607e9f476bb40ed..25b7ae7c2828b0919b805a2d0aac8b9cacf7d6e3 100644 --- a/paddle/fluid/operators/fusion_gru_op.cc +++ b/paddle/fluid/operators/fused/fusion_gru_op.cc @@ -12,14 +12,13 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/fusion_gru_op.h" +#include "paddle/fluid/operators/fused/fusion_gru_op.h" #include // for memcpy #include #include "paddle/fluid/operators/math/blas.h" -#include "paddle/fluid/operators/math/cpu_vec.h" #include "paddle/fluid/operators/math/fc_compute.h" +#include "paddle/fluid/operators/math/jit_kernel.h" #include "paddle/fluid/operators/math/sequence2batch.h" -#include "paddle/fluid/platform/cpu_info.h" namespace paddle { namespace operators { @@ -174,58 +173,47 @@ class FusionGRUKernel : public framework::OpKernel { } } -#define INIT_VEC_FUNC \ - std::function act_gate, act_state; \ - std::function cross; \ - auto& act_gate_str = ctx.Attr("gate_activation"); \ - auto& act_state_str = ctx.Attr("activation"); \ - if (platform::jit::MayIUse(platform::jit::avx)) { \ - math::VecActivations act_functor; \ - act_gate = act_functor(act_gate_str); \ - act_state = act_functor(act_state_str); \ - cross = math::vec_cross; \ - } else { \ - math::VecActivations act_functor; \ - act_gate = act_functor(act_gate_str); \ - act_state = act_functor(act_state_str); \ - cross = math::vec_cross; \ - } - -#define INIT_BASE_INPUT_OUTPUT \ - auto* h0 = ctx.Input("H0"); \ - auto* wx = ctx.Input("WeightX"); \ - auto* wh = ctx.Input("WeightH"); \ - auto* bias = ctx.Input("Bias"); \ - auto* xx = ctx.Output("XX"); \ - auto* hidden_out = ctx.Output("Hidden"); \ - bool is_reverse = ctx.Attr("is_reverse"); - -#define INIT_BASE_SIZES \ - auto x_dims = x->dims(); /* T x M*/ \ - auto wh_dims = wh->dims(); /* D x 3D*/ \ - const int total_T = x_dims[0]; \ - const int M = x_dims[1]; \ - const int D = wh_dims[0]; \ - const int D3 = wh_dims[1]; \ - const int D2 = D * 2; +#define INIT_BASE_DEFINES \ + auto* x = ctx.Input("X"); \ + auto* wh = ctx.Input("WeightH"); \ + auto* xx = ctx.Output("XX"); \ + auto x_lod = x->lod(); \ + auto x_dims = x->dims(); /* T x M*/ \ + auto wh_dims = wh->dims(); /* D x 3D*/ \ + const int total_T = x_dims[0]; \ + const int D3 = wh_dims[1] + +#define INIT_OTHER_DEFINES \ + auto* h0 = ctx.Input("H0"); \ + auto* wx = ctx.Input("WeightX"); \ + auto* bias = ctx.Input("Bias"); \ + auto* hidden_out = ctx.Output("Hidden"); \ + bool is_reverse = ctx.Attr("is_reverse"); \ + const int M = x_dims[1]; \ + const int D = wh_dims[0]; \ + const int D2 = D * 2; \ + const math::jitkernel::gru_attr_t attr( \ + D, ctx.Attr("gate_activation"), \ + ctx.Attr("activation")); \ + math::jitkernel::gru_t one_step; \ + const auto& ker = \ + math::jitkernel::KernelPool::Instance() \ + .template Get, \ + const math::jitkernel::gru_attr_t&>(attr); \ + const T* x_data = x->data(); \ + const T* wx_data = wx->data(); \ + const T* wh_data = wh->data(); \ + auto place = ctx.GetPlace(); \ + T* xx_data = xx->mutable_data(place) void SeqCompute(const framework::ExecutionContext& ctx) const { using DeviceContext = paddle::platform::CPUDeviceContext; - auto* x = ctx.Input("X"); - INIT_BASE_INPUT_OUTPUT - INIT_BASE_SIZES - INIT_VEC_FUNC - - auto x_lod = x->lod(); + INIT_BASE_DEFINES; + INIT_OTHER_DEFINES; const int N = x_lod[0].size() - 1; - const T* x_data = x->data(); const T* h0_data = h0 ? 
h0->data() : nullptr; - const T* wx_data = wx->data(); - const T* wh_data = wh->data(); const T* wh_state_data = wh_data + D * D2; - T* xx_data = xx->mutable_data(ctx.GetPlace()); - T* hidden_out_data = hidden_out->mutable_data(ctx.GetPlace()); - + T* hidden_out_data = hidden_out->mutable_data(place); auto blas = math::GetBlas(ctx); math::FCCompute(blas, total_T, D3, M, x_data, wx_data, xx_data, @@ -252,14 +240,9 @@ class FusionGRUKernel : public framework::OpKernel { if (h0_data) { prev_hidden_data = h0_data + bid * D; } else { - // W: {W_update, W_reset; W_state} - // update gate - act_gate(D, xx_data, xx_data); - // state gate - act_state(D, xx_data + D2, xx_data + D2); - // out = a*b - blas.VMUL(D, xx_data, xx_data + D2, hidden_out_data); - // save prev + one_step.gates = xx_data; + one_step.ht = hidden_out_data; + ker->ComputeH1(&one_step, &attr); prev_hidden_data = hidden_out_data; tstart = 1; move_step(); @@ -269,17 +252,15 @@ class FusionGRUKernel : public framework::OpKernel { blas.GEMM(CblasNoTrans, CblasNoTrans, 1, D2, D, static_cast(1), prev_hidden_data, D, wh_data, D2, static_cast(1), xx_data, D3); - act_gate(D2, xx_data, xx_data); - // rt = rt*ht_1 inplace result - blas.VMUL(D, prev_hidden_data, xx_data + D, hidden_out_data); - + one_step.gates = xx_data; + one_step.ht_1 = prev_hidden_data; + one_step.ht = hidden_out_data; + ker->ComputeHtPart1(&one_step, &attr); // gemm rt * Ws blas.GEMM(CblasNoTrans, CblasNoTrans, 1, D, D, static_cast(1), hidden_out_data, D, wh_state_data, D, static_cast(1), xx_data + D2, D3); - act_state(D, xx_data + D2, xx_data + D2); - // out = zt*ht~ + (1-zt)*ht_1 - cross(D, xx_data, xx_data + D2, prev_hidden_data, hidden_out_data); + ker->ComputeHtPart2(&one_step, &attr); // save prev prev_hidden_data = hidden_out_data; move_step(); @@ -289,28 +270,19 @@ class FusionGRUKernel : public framework::OpKernel { void BatchCompute(const framework::ExecutionContext& ctx) const { using DeviceContext = paddle::platform::CPUDeviceContext; - auto* x = ctx.Input("X"); - INIT_BASE_INPUT_OUTPUT - INIT_BASE_SIZES - if (x->lod()[0].size() == 2) { + INIT_BASE_DEFINES; + if (x_lod[0].size() == 2) { xx->Resize({total_T, D3}); SeqCompute(ctx); return; } - INIT_VEC_FUNC - + INIT_OTHER_DEFINES; auto* reordered_h0 = ctx.Output("ReorderedH0"); auto* batched_input = ctx.Output("BatchedInput"); auto* batched_out = ctx.Output("BatchedOut"); - - const T* x_data = x->data(); - const T* wx_data = wx->data(); - const T* wh_data = wh->data(); - T* xx_data = xx->mutable_data(ctx.GetPlace()); - T* batched_input_data = batched_input->mutable_data(ctx.GetPlace()); - T* batched_out_data = batched_out->mutable_data(ctx.GetPlace()); - hidden_out->mutable_data(ctx.GetPlace()); - + T* batched_input_data = batched_input->mutable_data(place); + T* batched_out_data = batched_out->mutable_data(place); + hidden_out->mutable_data(place); auto& dev_ctx = ctx.template device_context(); auto blas = math::GetBlas(dev_ctx); math::LoDTensor2BatchFunctor to_batch; @@ -336,7 +308,7 @@ class FusionGRUKernel : public framework::OpKernel { T* prev_hidden_data = nullptr; if (h0) { // reorder h0 - T* reordered_h0_data = reordered_h0->mutable_data(ctx.GetPlace()); + T* reordered_h0_data = reordered_h0->mutable_data(place); const T* h0_data = h0->data(); prev_hidden_data = reordered_h0_data; size_t sz = sizeof(T) * D; @@ -350,12 +322,9 @@ class FusionGRUKernel : public framework::OpKernel { T* cur_out_data = batched_out_data; // W: {W_update, W_reset; W_state} for (int i = 0; i < max_bs; ++i) { - // update gate - 
act_gate(D, cur_in_data, cur_in_data); - // state gate - act_state(D, cur_in_data + D2, cur_in_data + D2); - // out = a*b - blas.VMUL(D, cur_in_data, cur_in_data + D2, cur_out_data); + one_step.gates = cur_in_data; + one_step.ht = cur_out_data; + ker->ComputeH1(&one_step, &attr); // add offset cur_in_data += D3; cur_out_data += D; @@ -380,9 +349,10 @@ class FusionGRUKernel : public framework::OpKernel { T* cur_out_data = batched_out_data; T* cur_prev_hidden_data = prev_hidden_data; for (int i = 0; i < cur_bs; ++i) { - act_gate(D2, cur_batched_data, cur_batched_data); - // rt = rt*ht_1 inplace result - blas.VMUL(D, cur_prev_hidden_data, cur_batched_data + D, cur_out_data); + one_step.gates = cur_batched_data; + one_step.ht_1 = cur_prev_hidden_data; + one_step.ht = cur_out_data; + ker->ComputeHtPart1(&one_step, &attr); cur_batched_data += D3; cur_prev_hidden_data += D; @@ -397,12 +367,10 @@ class FusionGRUKernel : public framework::OpKernel { cur_prev_hidden_data = prev_hidden_data; for (int i = 0; i < cur_bs; ++i) { - // ht~ = act_state(...) - act_state(D, cur_batched_data + D2, cur_batched_data + D2); - // out = zt*ht~ + (1-zt)*ht_1 - cross(D, cur_batched_data, cur_batched_data + D2, cur_prev_hidden_data, - cur_out_data); - + one_step.gates = cur_batched_data; + one_step.ht_1 = cur_prev_hidden_data; + one_step.ht = cur_out_data; + ker->ComputeHtPart2(&one_step, &attr); cur_batched_data += D3; cur_prev_hidden_data += D; cur_out_data += D; @@ -416,9 +384,8 @@ class FusionGRUKernel : public framework::OpKernel { batched_out->set_lod(batched_lod); to_seq(dev_ctx, *batched_out, hidden_out); } -#undef INIT_VEC_FUNC -#undef INIT_BASE_SIZES -#undef INIT_BASE_INPUT_OUTPUT +#undef INIT_OTHER_DEFINES +#undef INIT_BASE_DEFINES }; } // namespace operators diff --git a/paddle/fluid/operators/fusion_gru_op.h b/paddle/fluid/operators/fused/fusion_gru_op.h similarity index 100% rename from paddle/fluid/operators/fusion_gru_op.h rename to paddle/fluid/operators/fused/fusion_gru_op.h diff --git a/paddle/fluid/operators/fusion_lstm_op.cc b/paddle/fluid/operators/fused/fusion_lstm_op.cc similarity index 93% rename from paddle/fluid/operators/fusion_lstm_op.cc rename to paddle/fluid/operators/fused/fusion_lstm_op.cc index 067e6a3e7cccc1f15ebdd984f3a2441339a989ab..8021a896ceaa808f5754b5d165aa0c8c8cb8034a 100644 --- a/paddle/fluid/operators/fusion_lstm_op.cc +++ b/paddle/fluid/operators/fused/fusion_lstm_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/fusion_lstm_op.h" +#include "paddle/fluid/operators/fused/fusion_lstm_op.h" #include #include "paddle/fluid/operators/math/blas.h" #include "paddle/fluid/operators/math/fc_compute.h" @@ -236,27 +236,31 @@ class FuisonLSTMKernel : public framework::OpKernel { const int D = wh_dims[0]; \ const int D4 = wh_dims[1] -#define INIT_OTHER_DEFINES \ - const T* x_data = x->data(); \ - const T* wx_data = wx->data(); \ - const T* wh_data = wh->data(); \ - /* diagonal weight*/ \ - const T* wp_data = bias->data() + D4; \ - /* for peephole only*/ \ - T* checked_cell_data = nullptr; \ - auto place = ctx.GetPlace(); \ - if (use_peepholes) { \ - /* w_ic * Ct-1, w_fc * Ct-1 ; w_oc * Ct => ih*/ \ - auto* checked_cell = ctx.Output("CheckedCell"); \ - checked_cell_data = checked_cell->mutable_data(place); \ - } \ - const auto& ker = \ - math::jitkernel::KernelPool::Instance() \ - .template Get, const std::string&, \ - const std::string&, const std::string&>( \ - ctx.Attr("gate_activation"), \ - ctx.Attr("candidate_activation"), \ - ctx.Attr("cell_activation"), D, use_peepholes) +#define INIT_OTHER_DEFINES \ + const T* x_data = x->data(); \ + const T* wx_data = wx->data(); \ + const T* wh_data = wh->data(); \ + /* diagonal weight*/ \ + const T* wp_data = bias->data() + D4; \ + /* for peephole only*/ \ + T* checked_cell_data = nullptr; \ + auto place = ctx.GetPlace(); \ + if (use_peepholes) { \ + /* w_ic * Ct-1, w_fc * Ct-1 ; w_oc * Ct => ih*/ \ + auto* checked_cell = ctx.Output("CheckedCell"); \ + checked_cell_data = checked_cell->mutable_data(place); \ + } \ + const math::jitkernel::lstm_attr_t attr( \ + D, ctx.Attr("gate_activation"), \ + ctx.Attr("candidate_activation"), \ + ctx.Attr("cell_activation"), use_peepholes); \ + math::jitkernel::lstm_t one_step; \ + one_step.wp = wp_data; \ + one_step.checked = checked_cell_data; \ + const auto& ker = \ + math::jitkernel::KernelPool::Instance() \ + .template Get, \ + const math::jitkernel::lstm_attr_t&>(attr) // Wh GEMM #define GEMM_WH_ADDON(bs, prev, out) \ @@ -299,7 +303,10 @@ class FuisonLSTMKernel : public framework::OpKernel { prev_h_data = h0_data + bid * D; prev_c_data = c0_data + bid * D; } else { - ker->ComputeC1H1(xx_data, c_out_data, h_out_data, wp_data); + one_step.gates = xx_data; + one_step.ct = c_out_data; + one_step.ht = h_out_data; + ker->ComputeC1H1(&one_step, &attr); tstart = 1; // move one step prev_h_data = h_out_data; @@ -310,8 +317,12 @@ class FuisonLSTMKernel : public framework::OpKernel { } for (int step = tstart; step < seq_len; ++step) { GEMM_WH_ADDON(1, prev_h_data, xx_data); - ker->ComputeCtHt(xx_data, prev_c_data, c_out_data, h_out_data, wp_data, - checked_cell_data); + + one_step.gates = xx_data; + one_step.ct_1 = prev_c_data; + one_step.ct = c_out_data; + one_step.ht = h_out_data; + ker->ComputeCtHt(&one_step, &attr); // move one step prev_h_data = h_out_data; prev_c_data = c_out_data; @@ -388,7 +399,11 @@ class FuisonLSTMKernel : public framework::OpKernel { T* cur_h_out_data = batched_h_out_data; T* cur_c_out_data = batched_c_out_data; for (int i = 0; i < max_bs; ++i) { - ker->ComputeC1H1(cur_in_data, cur_c_out_data, cur_h_out_data, wp_data); + one_step.gates = cur_in_data; + one_step.ct = cur_c_out_data; + one_step.ht = cur_h_out_data; + ker->ComputeC1H1(&one_step, &attr); + cur_in_data += D4; cur_c_out_data += D; cur_h_out_data += D; @@ -413,8 +428,12 @@ class FuisonLSTMKernel : public framework::OpKernel { T* cur_c_out_data = batched_c_out_data; T* cur_h_out_data = batched_h_out_data; for 
(int i = 0; i < cur_bs; ++i) { - ker->ComputeCtHt(cur_in_data, cur_prev_c_data, cur_c_out_data, - cur_h_out_data, wp_data, checked_cell_data); + one_step.gates = cur_in_data; + one_step.ct_1 = cur_prev_c_data; + one_step.ct = cur_c_out_data; + one_step.ht = cur_h_out_data; + ker->ComputeCtHt(&one_step, &attr); + + // move one batch + cur_in_data += D4; + cur_prev_c_data += D; diff --git a/paddle/fluid/operators/fusion_lstm_op.h b/paddle/fluid/operators/fused/fusion_lstm_op.h similarity index 100% rename from paddle/fluid/operators/fusion_lstm_op.h rename to paddle/fluid/operators/fused/fusion_lstm_op.h diff --git a/paddle/fluid/operators/fused/fusion_seqconv_eltadd_relu_op.cc b/paddle/fluid/operators/fused/fusion_seqconv_eltadd_relu_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..40bba09f3ef71021b7daff83b9d63005f7580395 --- /dev/null +++ b/paddle/fluid/operators/fused/fusion_seqconv_eltadd_relu_op.cc @@ -0,0 +1,229 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/fused/fusion_seqconv_eltadd_relu_op.h" +#include // for min, max +#include +#include "paddle/fluid/operators/math/blas.h" +#include "paddle/fluid/operators/math/fc_compute.h" + +namespace paddle { +namespace operators { + +void FusionSeqConvEltAddReluOp::InferShape( + framework::InferShapeContext* ctx) const { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of FusionSeqConvEltAddReluOp should not be null."); + PADDLE_ENFORCE( + ctx->HasInput("Filter"), + "Input(Filter) of FusionSeqConvEltAddReluOp should not be null."); + PADDLE_ENFORCE( + ctx->HasInput("Bias"), + "Input(Bias) of FusionSeqConvEltAddReluOp should not be null."); + PADDLE_ENFORCE( + ctx->HasOutput("Out"), + "Output(Out) of FusionSeqConvEltAddReluOp should not be null."); + PADDLE_ENFORCE( + ctx->HasOutput("ColMat"), + "Output(ColMat) of FusionSeqConvEltAddReluOp should not be null."); + + auto x_dims = ctx->GetInputDim("X"); + auto w_dims = ctx->GetInputDim("Filter"); + int context_length = ctx->Attrs().Get("contextLength"); + PADDLE_ENFORCE( + ctx->Attrs().Get("contextStride") == 1, + "Currently, FusionSeqConvEltAddReluOp only supports contextStride=1."); + PADDLE_ENFORCE(x_dims.size() == 2 && w_dims.size() == 2, + "Input(X, Filter) should be 2-D tensor."); + PADDLE_ENFORCE(w_dims[0] == context_length * x_dims[1], + "Filter's height should be context_length * " + "input_hidden_size."); + PADDLE_ENFORCE_GT(context_length + ctx->Attrs().Get("contextStart"), 0, + "contextStart size should be smaller than contextLength."); + + ctx->SetOutputDim("Out", {x_dims[0], w_dims[1]}); + ctx->SetOutputDim("ColMat", {x_dims[0], w_dims[0]}); + ctx->ShareLoD("X", "Out"); +} + +framework::OpKernelType FusionSeqConvEltAddReluOp::GetExpectedKernelType( + const framework::ExecutionContext& ctx) const { + return framework::OpKernelType( + 
framework::ToDataType(ctx.Input("X")->type()), + ctx.device_context()); +} + +void FusionSeqConvEltAddReluOpMaker::Make() { + AddInput("X", + "(LoDTensor) the input is a LoDTensor, which supports " + "variable-length input sequences. The underlying tensor in " + "this LoDTensor is a matrix with shape (T X M), where T is the " + "total time steps in this mini-batch, M is the dim size of x."); + // PaddingData only support false yet, should be ensured at pass. + AddInput("Filter", + "(Tensor) same as the input(Filter) of sequence conv op, is a " + "learnable parameter. " + "This is a tensor with shape (K, N), where K is the " + "context_length * dim size of x, N is the output feature size."); + AddInput("Bias", + "(Tensor) the learnable bias with shape (1, N), where N is the " + "output feature size"); + AddOutput( + "Out", + "(LoDTensor) the output(Out) is a LoDTensor, which supports " + "variable-length output sequences. The underlying tensor in " + "this LoDTensor is a matrix with shape (T, N), where T is the " + "total time steps in this mini-batch, N is the output feature size."); + AddOutput("ColMat", + "(Tensor) (T, K), where T is the " + "total time steps in this mini-batch and K is the height of Filter") + .AsIntermediate(); + AddAttr("contextLength", + "(int) the contextLength of FusionSeqConvEltAddReluOp is the " + "height of the convolution kernel.") + .GreaterThan(0); + AddAttr("contextStart", + "(int, default:0) the contextStart of FusionSeqConvEltAddReluOp " + "represents the beginning of the convolution of the number of " + "rows of sequence, which can be negative. The negative number " + "means to pad contextStart time-steps of zeros or learnable " + "parameters at the beginning of each instance. The positive " + "number means to skip contextStart time-steps of each " + "instance.") + .SetDefault(0); + AddAttr( + "contextStride", + "(int, default:1) the contextStride of FusionSeqConvEltAddReluOp " + "represents the stride length of convolution kernel. " + "Currently, FusionSeqConvEltAddReluOp only supports " + "contextStride=1.") + .SetDefault(1) + .GreaterThan(0); + AddComment(R"DOC( +Fusion Sequence Conv and ElementwiseAdd Operator. 
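+
+In effect, Out = relu(im2col(X) * Filter + Bias): each row of ColMat
+concatenates contextLength consecutive rows of X starting at offset
+contextStart, zero-padded at sequence boundaries, so a single GEMM plus a
+fused bias-add and relu produces the whole output.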
+)DOC"); +} + +template +class FusionSeqConvEltAddReluKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + using DeviceContext = paddle::platform::CPUDeviceContext; + auto* x = ctx.Input("X"); + auto* w = ctx.Input("Filter"); + auto* b = ctx.Input("Bias"); + auto* y = ctx.Output("Out"); + auto* col = ctx.Output("ColMat"); + + auto x_lod = x->lod(); + auto x_dims = x->dims(); + auto w_dims = w->dims(); + PADDLE_ENFORCE_EQ(b->numel(), w_dims[1], + "bias size should be equal to output feature size."); + PADDLE_ENFORCE_EQ(x_lod.size(), 1UL, + "Only support one level sequence now."); + + const T* x_data = x->data(); + const T* w_data = w->data(); + const T* b_data = b->data(); + T* y_data = y->mutable_data(ctx.GetPlace()); + T* col_data = col->mutable_data(ctx.GetPlace()); + + int context_start = ctx.Attr("contextStart"); + int context_length = ctx.Attr("contextLength"); + int up_pad = std::max(0, -context_start); + int down_pad = std::max(0, context_start + context_length - 1); + // im2col + int src_mat_w = static_cast(x_dims[1]); + int src_mat_w_sz = src_mat_w * sizeof(T); + int col_mat_w = static_cast(w_dims[0]); + int col_mat_w_sz = col_mat_w * sizeof(T); + for (int i = 0; i < static_cast(x_lod[0].size()) - 1; ++i) { + int st = x_lod[0][i]; + int ed = x_lod[0][i + 1]; + const T* src_data = x_data + st * src_mat_w; + T* dst_data = col_data + st * col_mat_w; + int seq_len = ed - st; + if (seq_len > up_pad + down_pad) { + // zero all up_pad and fill data + std::memset(dst_data, 0, up_pad * col_mat_w_sz); + dst_data = dst_data + up_pad * src_mat_w; + int copy_size = col_mat_w_sz - up_pad * src_mat_w_sz; + for (int j = 0; j < up_pad; ++j) { + // blas.VCOPY? + std::memcpy(dst_data, src_data, copy_size); + dst_data += (col_mat_w - src_mat_w); + copy_size += src_mat_w_sz; + } + // fill data + for (int j = 0; j < seq_len - up_pad - down_pad; ++j) { + std::memcpy(dst_data, src_data, copy_size); + dst_data += col_mat_w; + src_data += src_mat_w; + } + // zero all down_pad and fill data + std::memset(dst_data, 0, down_pad * col_mat_w_sz); + copy_size -= src_mat_w_sz; + for (int j = 0; j < down_pad; ++j) { + std::memcpy(dst_data, src_data, copy_size); + dst_data += col_mat_w; + src_data += src_mat_w; + copy_size -= src_mat_w_sz; + } + } else { + PADDLE_ENFORCE_GE(context_length, up_pad + down_pad + 1); + std::memset(dst_data, 0, seq_len * col_mat_w_sz); + dst_data = dst_data + up_pad * src_mat_w; + int zero_sz = up_pad * src_mat_w_sz; + int cur_src_sz = seq_len * src_mat_w_sz; + for (int j = 0; j < std::min(up_pad, seq_len); ++j) { + int copy_size = std::min(cur_src_sz, col_mat_w_sz - zero_sz); + std::memcpy(dst_data, src_data, copy_size); + dst_data += (col_mat_w - src_mat_w); + zero_sz -= src_mat_w_sz; + } + // from bottom + dst_data = col_data + ed * col_mat_w; + src_data = x_data + st * src_mat_w; + zero_sz = down_pad * src_mat_w_sz; + for (int j = 1; j <= std::min(down_pad, seq_len); ++j) { + int copy_size = std::min(cur_src_sz, col_mat_w_sz - zero_sz); + std::memcpy(dst_data - (zero_sz + copy_size) / sizeof(T), + src_data + std::max(seq_len - j - up_pad, 0) * src_mat_w, + copy_size); + dst_data -= col_mat_w; + zero_sz -= src_mat_w_sz; + } + } + } + auto& dev_ctx = ctx.template device_context(); + auto blas = math::GetBlas(dev_ctx); + math::FCCompute(blas, x_dims[0], w_dims[1], w_dims[0], + col_data, w_data, y_data, b_data, true); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; 
+REGISTER_OPERATOR(fusion_seqconv_eltadd_relu, ops::FusionSeqConvEltAddReluOp, + ops::FusionSeqConvEltAddReluOpMaker, + paddle::framework::DefaultGradOpDescMaker); + +REGISTER_OP_CPU_KERNEL(fusion_seqconv_eltadd_relu, + ops::FusionSeqConvEltAddReluKernel, + ops::FusionSeqConvEltAddReluKernel); diff --git a/paddle/fluid/operators/fused/fusion_seqconv_eltadd_relu_op.h b/paddle/fluid/operators/fused/fusion_seqconv_eltadd_relu_op.h new file mode 100644 index 0000000000000000000000000000000000000000..028d79dc2a1ee8d789fe4b8724b320442041a71b --- /dev/null +++ b/paddle/fluid/operators/fused/fusion_seqconv_eltadd_relu_op.h @@ -0,0 +1,42 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using LoDTensor = framework::LoDTensor; +using Tensor = framework::Tensor; + +class FusionSeqConvEltAddReluOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override; + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override; +}; + +class FusionSeqConvEltAddReluOpMaker + : public framework::OpProtoAndCheckerMaker { + public: + void Make() override; +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/fusion_seqexpand_concat_fc_op.cc b/paddle/fluid/operators/fused/fusion_seqexpand_concat_fc_op.cc similarity index 97% rename from paddle/fluid/operators/fusion_seqexpand_concat_fc_op.cc rename to paddle/fluid/operators/fused/fusion_seqexpand_concat_fc_op.cc index 0cd3d3887cf5167c779a8b20442fdb458cd7eab4..288b56fc2485138b20c5b53af3e950f1c1886ba5 100644 --- a/paddle/fluid/operators/fusion_seqexpand_concat_fc_op.cc +++ b/paddle/fluid/operators/fused/fusion_seqexpand_concat_fc_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/fusion_seqexpand_concat_fc_op.h" +#include "paddle/fluid/operators/fused/fusion_seqexpand_concat_fc_op.h" #include #include "paddle/fluid/operators/math/blas.h" #include "paddle/fluid/operators/math/cpu_vec.h" @@ -136,9 +136,9 @@ class FusionSeqExpandConcatFCOpKernel : public framework::OpKernel { // since infershape can not get lod info PADDLE_ENFORCE_EQ(ref_lod.size(), 1UL, "Only support input lod size is 1."); PADDLE_ENFORCE_EQ(in1_lod.size(), 1UL, "Only support input lod size is 1."); - PADDLE_ENFORCE_EQ(in1_lod[0].size() - 1, N, + PADDLE_ENFORCE_EQ(static_cast(in1_lod[0].size() - 1), N, "Batch size of all inputs should be equal."); - PADDLE_ENFORCE_EQ(in1_lod[0][N], N, + PADDLE_ENFORCE_EQ(static_cast(in1_lod[0][N]), N, "Seq_length of other inputs should be 1."); PADDLE_ENFORCE_EQ(in1_dims[0], N, "input height should be batch size."); for (size_t i = 2; i < ins.size(); ++i) { diff --git a/paddle/fluid/operators/fusion_seqexpand_concat_fc_op.h b/paddle/fluid/operators/fused/fusion_seqexpand_concat_fc_op.h similarity index 100% rename from paddle/fluid/operators/fusion_seqexpand_concat_fc_op.h rename to paddle/fluid/operators/fused/fusion_seqexpand_concat_fc_op.h diff --git a/paddle/fluid/operators/fused/fusion_transpose_flatten_concat_op.cc b/paddle/fluid/operators/fused/fusion_transpose_flatten_concat_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..39356c9afccbf9af3eacf99a6bccb15e18f7e485 --- /dev/null +++ b/paddle/fluid/operators/fused/fusion_transpose_flatten_concat_op.cc @@ -0,0 +1,114 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/fluid/operators/fused/fusion_transpose_flatten_concat_op.h" +#include +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; + +class TransposeFlattenConcatFusionOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE_GE(ctx->Inputs("X").size(), 1UL, + "Inputs(X) of ConcatOp should be empty."); + PADDLE_ENFORCE(ctx->HasOutput("Out"), + "Output(Out) of ConcatOp should not be null."); + + auto ins = ctx->GetInputsDim("X"); + const size_t n = ins.size(); + PADDLE_ENFORCE_GT(n, 0, "Input tensors count should > 0."); + + std::vector trans_axis = + ctx->Attrs().Get>("trans_axis"); + int flatten_axis = ctx->Attrs().Get("flatten_axis"); + int concat_axis = ctx->Attrs().Get("concat_axis"); + + size_t x_rank = ins[0].size(); + size_t trans_axis_size = trans_axis.size(); + PADDLE_ENFORCE_EQ(x_rank, trans_axis_size, + "The input tensor's rank(%d) " + "should be equal to the permutation axis's size(%d)", + x_rank, trans_axis_size); + + auto dims0 = + GetFlattenShape(flatten_axis, GetPermuteShape(trans_axis, ins[0])); + std::vector out_dims(dims0); + for (size_t i = 1; i < n; i++) { + auto dimsi = + GetFlattenShape(flatten_axis, GetPermuteShape(trans_axis, ins[i])); + for (int j = 0; j < static_cast(dims0.size()); j++) { + if (j == concat_axis) { + out_dims[concat_axis] += dimsi[j]; + } else { + PADDLE_ENFORCE_EQ(out_dims[j], dimsi[j], + "After flatting, the %d-th dim should be save " + "except the specify axis.", + j); + } + } + } + if (out_dims[concat_axis] < 0) { + out_dims[concat_axis] = -1; + } + ctx->SetOutputDim("Out", framework::make_ddim(out_dims)); + } +}; + +class TransposeFlattenConcatFusionOpMaker + : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput( + "X", + "(Tensor) The input tensor, tensors with rank up to 6 are supported.") + .AsDuplicable(); + AddOutput("Out", "(Tensor)The output tensor."); + AddAttr>( + "trans_axis", + "(vector) A list of values, and the size of the list should be " + "the same with the input tensor rank. This operator permutes the input " + "tensor's axes according to the values given."); + AddAttr("flatten_axis", + "(int)" + "Indicate up to which input dimensions (exclusive) should be" + "flattened to the outer dimension of the output. The value" + "for axis must be in the range [0, R], where R is the rank of" + "the input tensor. When axis = 0, the shape of the output" + "tensor is (1, (d_0 X d_1 ... d_n), where the shape of the" + "input tensor is (d_0, d_1, ... d_n)."); + AddAttr("concat_axis", + "The axis along which the input tensors will be concatenated. 
" + "It should be 0 or 1, since the tensor is 2D after flatting."); + AddComment(R"DOC( + + +)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(fusion_transpose_flatten_concat, + ops::TransposeFlattenConcatFusionOp, + ops::TransposeFlattenConcatFusionOpMaker, + paddle::framework::EmptyGradOpMaker); diff --git a/paddle/fluid/operators/fused/fusion_transpose_flatten_concat_op.cu.cc b/paddle/fluid/operators/fused/fusion_transpose_flatten_concat_op.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..6ccb670d73c803bb1b9827f0f30b99d272bfce79 --- /dev/null +++ b/paddle/fluid/operators/fused/fusion_transpose_flatten_concat_op.cu.cc @@ -0,0 +1,115 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/fused/fusion_transpose_flatten_concat_op.h" +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/cudnn_helper.h" + +namespace paddle { +namespace operators { + +template +using CudnnDataType = platform::CudnnDataType; + +template +class TransposeFlattenConcatFusionKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto ins = ctx.MultiInput("X"); + auto* out = ctx.Output("Out"); + out->mutable_data(ctx.GetPlace()); + auto odims = out->dims(); + + std::vector trans_axis = ctx.Attr>("trans_axis"); + int flatten_axis = ctx.Attr("flatten_axis"); + int concat_axis = ctx.Attr("concat_axis"); + + int rank = ins[0]->dims().size(); + // use at least 4D in cudnnTransformTensor + int max_dim = rank < 4 ? 4 : rank; + std::vector stride_x(max_dim, 0); + std::vector stride_y(max_dim, 0); + std::vector dims_y(max_dim, 0); + + cudnnTensorDescriptor_t in_desc; + cudnnTensorDescriptor_t out_desc; + CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&in_desc)); + CUDNN_ENFORCE(platform::dynload::cudnnCreateTensorDescriptor(&out_desc)); + cudnnDataType_t cudnn_dtype = CudnnDataType::type; + + auto& dev_ctx = ctx.template device_context(); + auto handle = dev_ctx.cudnn_handle(); + + T* odata = out->data(); + for (size_t k = 0; k < ins.size(); ++k) { + auto perm_shape = GetPermuteShape(trans_axis, ins[k]->dims()); + int osize = 1; + auto idims = ins[k]->dims(); + for (int i = 0; i < rank; i++) { + stride_x[i] = 1; + for (int j = trans_axis[i] + 1; j < rank; j++) { + stride_x[i] *= idims[j]; + } + dims_y[i] = perm_shape[i]; + osize *= perm_shape[i]; + } + stride_y[rank - 1] = 1; + for (int i = rank - 2; i >= 0; i--) { + if (((i + 1) == flatten_axis) && (concat_axis == 1)) { + stride_y[i] = odims[1]; + } else { + stride_y[i] = stride_y[i + 1] * perm_shape[i + 1]; + } + } + + // Since concat is aftern flatten, the output is 2D tensor. + // If concat_axis is 0, each input's permutated tensor is continuous. + // If concat_axis is 1, the stride of 0-th dim of each input's + // permutated tensor is odims()[1]. 
+ + for (int i = rank; i < max_dim; i++) { + stride_x[i] = 1; + stride_y[i] = 1; + dims_y[i] = 1; + } + + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + in_desc, cudnn_dtype, max_dim, dims_y.data(), stride_x.data())); + CUDNN_ENFORCE(platform::dynload::cudnnSetTensorNdDescriptor( + out_desc, cudnn_dtype, max_dim, dims_y.data(), stride_y.data())); + + CUDNN_ENFORCE(platform::dynload::cudnnTransformTensor( + handle, CudnnDataType::kOne(), in_desc, + static_cast(ins[k]->data()), + CudnnDataType::kZero(), out_desc, static_cast(odata))); + if (concat_axis == 0) { + odata += osize; + } else { + auto flat_shape = GetFlattenShape(flatten_axis, perm_shape); + odata += flat_shape[1]; + } + } + CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(in_desc)); + CUDNN_ENFORCE(platform::dynload::cudnnDestroyTensorDescriptor(out_desc)); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(fusion_transpose_flatten_concat, + ops::TransposeFlattenConcatFusionKernel, + ops::TransposeFlattenConcatFusionKernel); diff --git a/paddle/fluid/operators/fused/fusion_transpose_flatten_concat_op.h b/paddle/fluid/operators/fused/fusion_transpose_flatten_concat_op.h new file mode 100644 index 0000000000000000000000000000000000000000..66d5bea679fc85ce6b1ba64921107aef987ccaa8 --- /dev/null +++ b/paddle/fluid/operators/fused/fusion_transpose_flatten_concat_op.h @@ -0,0 +1,50 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include +#include "paddle/fluid/framework/ddim.h" + +namespace paddle { +namespace operators { + +inline std::vector GetPermuteShape(const std::vector& axis, + const framework::DDim& in_dims) { + std::vector out_dims(in_dims.size()); + for (size_t i = 0; i < axis.size(); i++) { + out_dims[i] = in_dims[axis[i]]; + } + return out_dims; +} + +inline std::vector GetFlattenShape(const int axis, + const std::vector& in_dims) { + int64_t outer = 1, inner = 1; + for (int i = 0; i < static_cast(in_dims.size()); ++i) { + if (i < axis) { + outer *= in_dims[i]; + } else { + inner *= in_dims[i]; + } + } + std::vector out_shape(2); + out_shape[0] = outer; + out_shape[1] = inner; + return out_shape; +} + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/gather.cu.h b/paddle/fluid/operators/gather.cu.h index d74d4db92528d69492ab7b90a98d3775dadc35d1..e4df59c5d51c390cf593add0c5562665c91f33f6 100644 --- a/paddle/fluid/operators/gather.cu.h +++ b/paddle/fluid/operators/gather.cu.h @@ -50,7 +50,9 @@ void GPUGather(const platform::DeviceContext& ctx, const Tensor& src, const Tensor& index, Tensor* output) { // PADDLE_ENFORCE(platform::is_gpu_place(place)); // check index of shape 1-D - PADDLE_ENFORCE(index.dims().size() == 1); + PADDLE_ENFORCE(index.dims().size() == 1 || + (index.dims().size() == 2 && index.dims()[1] == 1)); + int index_size = index.dims()[0]; auto src_dims = src.dims(); diff --git a/paddle/fluid/operators/gather.h b/paddle/fluid/operators/gather.h index d15cb55647ade2415041b11099974484835f55eb..dc08ee5efacde5e232d751b13aaf11f51237634a 100644 --- a/paddle/fluid/operators/gather.h +++ b/paddle/fluid/operators/gather.h @@ -38,12 +38,11 @@ void CPUGather(const platform::DeviceContext& ctx, const Tensor& src, const Tensor& index, Tensor* output) { PADDLE_ENFORCE(platform::is_cpu_place(ctx.GetPlace())); // check index of shape 1-D - PADDLE_ENFORCE(index.dims().size() == 1); - int index_size = index.dims()[0]; + PADDLE_ENFORCE(index.dims().size() == 1 || + (index.dims().size() == 2 && index.dims()[1] == 1)); + int64_t index_size = index.dims()[0]; auto src_dims = src.dims(); - framework::DDim output_dims(src_dims); - output_dims[0] = index_size; const T* p_src = src.data(); const int* p_index = index.data(); @@ -55,7 +54,7 @@ void CPUGather(const platform::DeviceContext& ctx, const Tensor& src, const size_t slice_bytes = slice_size * sizeof(T); - for (int i = 0; i < index_size; ++i) { + for (int64_t i = 0; i < index_size; ++i) { int index_ = p_index[i]; memcpy(p_output + i * slice_size, p_src + index_ * slice_size, slice_bytes); } diff --git a/paddle/fluid/operators/gather_op.cc b/paddle/fluid/operators/gather_op.cc index 089b541a0a61adb5efda6b2e027c913d5808dff0..95aa9b573c795159079bdb5401b34d7a61252115 100644 --- a/paddle/fluid/operators/gather_op.cc +++ b/paddle/fluid/operators/gather_op.cc @@ -31,7 +31,8 @@ class GatherOp : public framework::OperatorWithKernel { "Output(Out) of GatherOp should not be null."); auto index_dims = ctx->GetInputDim("Index"); - PADDLE_ENFORCE(index_dims.size() == 1); + PADDLE_ENFORCE(index_dims.size() == 1 || + (index_dims.size() == 2 && index_dims[1] == 1)); int batch_size = ctx->GetInputDim("Index")[0]; framework::DDim output_dims(ctx->GetInputDim("X")); output_dims[0] = batch_size; @@ -53,6 +54,7 @@ class GatherGradOp : public framework::OperatorWithKernel { void InferShape(framework::InferShapeContext* ctx) const override { ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X")); + 
ctx->ShareLoD("X", /*-->*/ framework::GradVarName("X")); } protected: @@ -75,7 +77,7 @@ Gather Operator. $Out = X[Index]$ -Out is obtained by gathering entries of the outer-most dimension +Out is obtained by gathering entries of the outer-most dimension of X indexed by Index and concatenate them together. Example: @@ -102,7 +104,9 @@ REGISTER_OPERATOR(gather, ops::GatherOp, ops::GatherOpMaker, paddle::framework::DefaultGradOpDescMaker); REGISTER_OPERATOR(gather_grad, ops::GatherGradOp); REGISTER_OP_CPU_KERNEL(gather, ops::GatherOpKernel, - ops::GatherOpKernel, ops::GatherOpKernel); + ops::GatherOpKernel, ops::GatherOpKernel, + ops::GatherOpKernel); REGISTER_OP_CPU_KERNEL(gather_grad, ops::GatherGradientOpKernel, + ops::GatherGradientOpKernel, ops::GatherGradientOpKernel, - ops::GatherGradientOpKernel); + ops::GatherGradientOpKernel); diff --git a/paddle/fluid/operators/gather_op.cu b/paddle/fluid/operators/gather_op.cu index 7e014dd1cb47ee0575308dc13ba7bc7617baebff..9f4aef08cd58e72ce344a640e6564b9e360ce169 100644 --- a/paddle/fluid/operators/gather_op.cu +++ b/paddle/fluid/operators/gather_op.cu @@ -61,5 +61,11 @@ class GatherGradOpCUDAKernel : public framework::OpKernel { } // namespace paddle namespace ops = paddle::operators; -REGISTER_OP_CUDA_KERNEL(gather, ops::GatherOpCUDAKernel); -REGISTER_OP_CUDA_KERNEL(gather_grad, ops::GatherGradOpCUDAKernel); +REGISTER_OP_CUDA_KERNEL(gather, ops::GatherOpCUDAKernel, + ops::GatherOpCUDAKernel, + ops::GatherOpCUDAKernel, + ops::GatherOpCUDAKernel); +REGISTER_OP_CUDA_KERNEL(gather_grad, ops::GatherGradOpCUDAKernel, + ops::GatherGradOpCUDAKernel, + ops::GatherGradOpCUDAKernel, + ops::GatherGradOpCUDAKernel); diff --git a/paddle/fluid/operators/gaussian_random_op.cc b/paddle/fluid/operators/gaussian_random_op.cc index 1488aab1926b5b4ba7bceed582700f5a11fc6c93..c70d5b8bc7569c38cbc003aca7c62dc503df11cf 100644 --- a/paddle/fluid/operators/gaussian_random_op.cc +++ b/paddle/fluid/operators/gaussian_random_op.cc @@ -52,7 +52,7 @@ class GaussianRandomOp : public framework::OperatorWithKernel { void InferShape(framework::InferShapeContext* ctx) const override { PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) of GaussianRandomOp should not be null."); - auto shape = ctx->Attrs().Get>("shape"); + auto shape = ctx->Attrs().Get>("shape"); std::vector temp; temp.reserve(shape.size()); for (auto dim : shape) { @@ -88,9 +88,9 @@ class GaussianRandomOpMaker : public framework::OpProtoAndCheckerMaker { void Make() override { AddOutput("Out", "Output matrix of gaussian random op"); - AddAttr>("shape", - "(vector) " - "The dimension of random tensor."); + AddAttr>("shape", + "(vector) " + "The dimension of random tensor."); AddAttr("mean", "(float, default 0.0) " "mean of random tensor.") diff --git a/paddle/fluid/operators/grid_sampler_cudnn_op.cu.cc b/paddle/fluid/operators/grid_sampler_cudnn_op.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..7cde7ca462fda9ae6ace7755af0a432afee28bba --- /dev/null +++ b/paddle/fluid/operators/grid_sampler_cudnn_op.cu.cc @@ -0,0 +1,132 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/cudnn_helper.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; +using ScopedTensorDescriptor = platform::ScopedTensorDescriptor; +using DataLayout = platform::DataLayout; +using ScopedSpatialTransformerDescriptor = + platform::ScopedSpatialTransformerDescriptor; +template +using CudnnDataType = platform::CudnnDataType; + +template +class CUDNNGridSampleOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "It must use CUDAPlace"); + auto& dev_ctx = ctx.template device_context(); + auto handle = dev_ctx.cudnn_handle(); + auto* input = ctx.Input("X"); + auto* grid = ctx.Input("Grid"); + auto* output = ctx.Output("Output"); + + int n = input->dims()[0]; + int c = input->dims()[1]; + int h = input->dims()[2]; + int w = input->dims()[3]; + const int size[4] = {n, c, h, w}; + + const T* input_data = input->data(); + const T* grid_data = grid->data(); + T* output_data = output->mutable_data({n, c, h, w}, ctx.GetPlace()); + + ScopedSpatialTransformerDescriptor st_desc; + cudnnSpatialTransformerDescriptor_t cudnn_st_desc = + st_desc.descriptor(4, size); + + ScopedTensorDescriptor input_desc; + ScopedTensorDescriptor output_desc; + cudnnTensorDescriptor_t cudnn_input_desc = input_desc.descriptor( + DataLayout::kNCHW, framework::vectorize2int(input->dims())); + cudnnTensorDescriptor_t cudnn_output_desc = output_desc.descriptor( + DataLayout::kNCHW, framework::vectorize2int(output->dims())); + + CUDNN_ENFORCE(platform::dynload::cudnnSpatialTfSamplerForward( + handle, cudnn_st_desc, CudnnDataType::kOne(), cudnn_input_desc, + input_data, grid_data, CudnnDataType::kZero(), cudnn_output_desc, + output_data)); + } +}; + +template +class CUDNNGridSampleGradOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "It must use CUDAPlace"); + auto& dev_ctx = ctx.template device_context(); + auto handle = dev_ctx.cudnn_handle(); + auto* input = ctx.Input("X"); + auto* grid = ctx.Input("Grid"); + auto* output_grad = ctx.Input(framework::GradVarName("Output")); + auto* input_grad = ctx.Output(framework::GradVarName("X")); + auto* grid_grad = ctx.Output(framework::GradVarName("Grid")); + + auto output_grad_dims = output_grad->dims(); + const int n = output_grad_dims[0]; + const int c = output_grad_dims[1]; + const int h = output_grad_dims[2]; + const int w = output_grad_dims[3]; + const int size[4] = {n, c, h, w}; + + ScopedSpatialTransformerDescriptor st_dest; + cudnnSpatialTransformerDescriptor_t cudnn_st_dest = + st_dest.descriptor(4, size); + + const T* input_data = input->data(); + const T* grid_data = grid->data(); + const T* output_grad_data = output_grad->data(); + T* input_grad_data = + input_grad->mutable_data(output_grad_dims, ctx.GetPlace()); + T* grid_grad_data = + grid_grad->mutable_data({n, h, w, 2}, ctx.GetPlace()); + + ScopedTensorDescriptor 
input_desc; + ScopedTensorDescriptor input_grad_desc; + ScopedTensorDescriptor output_grad_desc; + cudnnTensorDescriptor_t cudnn_input_desc = input_desc.descriptor( + DataLayout::kNCHW, framework::vectorize2int(input->dims())); + cudnnTensorDescriptor_t cudnn_input_grad_desc = + input_grad_desc.descriptor( + DataLayout::kNCHW, framework::vectorize2int(input_grad->dims())); + cudnnTensorDescriptor_t cudnn_output_grad_desc = + output_grad_desc.descriptor( + DataLayout::kNCHW, framework::vectorize2int(output_grad->dims())); + + CUDNN_ENFORCE(platform::dynload::cudnnSpatialTfSamplerBackward( + handle, cudnn_st_dest, CudnnDataType::kOne(), cudnn_input_desc, + input_data, CudnnDataType::kZero(), cudnn_input_grad_desc, + input_grad_data, CudnnDataType::kOne(), cudnn_output_grad_desc, + output_grad_data, grid_data, CudnnDataType::kZero(), + grid_grad_data)); + } +}; + +} // namespace operators +} // namespace paddle + +namespace plat = paddle::platform; +REGISTER_OP_KERNEL(grid_sampler, CUDNN, plat::CUDAPlace, + paddle::operators::CUDNNGridSampleOpKernel, + paddle::operators::CUDNNGridSampleOpKernel); +REGISTER_OP_KERNEL(grid_sampler_grad, CUDNN, plat::CUDAPlace, + paddle::operators::CUDNNGridSampleGradOpKernel, + paddle::operators::CUDNNGridSampleGradOpKernel); diff --git a/paddle/fluid/operators/grid_sampler_op.cc b/paddle/fluid/operators/grid_sampler_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..e76eb6893b1f7b6a965682368860c02fa32f6330 --- /dev/null +++ b/paddle/fluid/operators/grid_sampler_op.cc @@ -0,0 +1,203 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/fluid/operators/grid_sampler_op.h" +#include "paddle/fluid/framework/op_registry.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/cudnn_helper.h" +#endif + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +class GridSampleOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of GridSampleOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Grid"), + "Input(Grid) of GridSampleOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Output"), + "Output(Output) of GridSampleOp should not be null."); + + auto x_dims = ctx->GetInputDim("X"); + auto grid_dims = ctx->GetInputDim("Grid"); + PADDLE_ENFORCE(x_dims.size() == 4, + "Input(X) of GridSampleOp should be 4-D Tensor."); + PADDLE_ENFORCE(grid_dims.size() == 4, + "Input(Grid) of GridSampleOp should be 4-D Tensor."); + PADDLE_ENFORCE(grid_dims[3] == 2, "Input(Grid) dims[3] should be 2."); + PADDLE_ENFORCE_EQ(grid_dims[0], x_dims[0], + "Input(X) and Input(Grid) dims[0] should be equal."); + PADDLE_ENFORCE_EQ( + grid_dims[1], x_dims[2], + "Input(X) dims[2] and Input(Grid) dims[1] should be equal."); + PADDLE_ENFORCE_EQ( + grid_dims[2], x_dims[3], + "Input(X) dims[3] and Input(Grid) dims[2] should be equal."); + + ctx->SetOutputDim("Output", x_dims); + ctx->ShareLoD("X", "Output"); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + framework::LibraryType library_{framework::LibraryType::kPlain}; +#ifdef PADDLE_WITH_CUDA + if (platform::CanCUDNNBeUsed(ctx)) { + library_ = framework::LibraryType::kCUDNN; + } +#endif + return framework::OpKernelType( + framework::ToDataType(ctx.Input("X")->type()), ctx.GetPlace(), + framework::DataLayout::kAnyLayout, library_); + } +}; + +class GridSampleOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", + "(Tensor) The input data of GridSampleOp, " + "This is a 4-D tensor with shape of [N, C, H, W]"); + AddInput( + "Grid", + "(Tensor) The input grid of GridSampleOp generated by AffineGridOp, " + "This is a 4-D tensor with shape of [N, H, W, 2] is the concatenation " + "of x and y coordinates with shape [N, H, W] in last dimention"); + AddOutput("Output", "(Tensor) Output tensor with shape [N, C, H, W]"); + AddAttr( + "use_cudnn", + "(bool, default true) Only used in cudnn kernel, need install cudnn") + .SetDefault(true); + + AddComment(R"DOC( + This operation samples input X by using bilinear interpolation based on + flow field grid, which is usually gennerated by affine_grid. The grid of + shape [N, H, W, 2] is the concatenation of (grid_x, grid_y) coordinates + with shape [N, H, W] each, where grid_x is indexing the 4th dimension + (in width dimension) of input data x and grid_y is indexng the 3rd + dimention (in height dimension), finally results is the bilinear + interpolation value of 4 nearest corner points. + + Step 1: + Get (x, y) grid coordinates and scale to [0, H-1/W-1]. + + grid_x = 0.5 * (grid[:, :, :, 0] + 1) * (W - 1) + grid_y = 0.5 * (grid[:, :, :, 1] + 1) * (H - 1) + + Step 2: + Indices input data X with grid (x, y) in each [H, W] area, and bilinear + interpolate point value by 4 nearest points. 
+
+      wn ------- y_n ------- en
+      |           |           |
+      |          d_n          |
+      |           |           |
+     x_w --d_w-- grid --d_e-- x_e
+      |           |           |
+      |          d_s          |
+      |           |           |
+      ws ------- y_s ------- es
+
+      x_w = floor(x)       // west side x coord
+      x_e = x_w + 1        // east side x coord
+      y_n = floor(y)       // north side y coord
+      y_s = y_n + 1        // south side y coord
+
+      d_w = grid_x - x_w   // distance to west side
+      d_e = x_e - grid_x   // distance to east side
+      d_n = grid_y - y_n   // distance to north side
+      d_s = y_s - grid_y   // distance to south side
+
+      wn = X[:, :, y_n, x_w]   // north-west point value
+      en = X[:, :, y_n, x_e]   // north-east point value
+      ws = X[:, :, y_s, x_w]   // south-west point value
+      es = X[:, :, y_s, x_e]   // south-east point value
+
+      output = wn * d_e * d_s + en * d_w * d_s
+             + ws * d_e * d_n + es * d_w * d_n
+  )DOC");
+  }
+};
+
+class GridSampleOpGrad : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+  void InferShape(framework::InferShapeContext* ctx) const override {
+    auto input_dims = ctx->GetInputDim("X");
+    auto grid_dims = ctx->GetInputDim("Grid");
+    if (ctx->HasOutput(framework::GradVarName("X"))) {
+      ctx->SetOutputDim(framework::GradVarName("X"), input_dims);
+    }
+    if (ctx->HasOutput(framework::GradVarName("Grid"))) {
+      ctx->SetOutputDim(framework::GradVarName("Grid"), grid_dims);
+    }
+  }
+
+ protected:
+  framework::OpKernelType GetExpectedKernelType(
+      const framework::ExecutionContext& ctx) const override {
+    framework::LibraryType library_{framework::LibraryType::kPlain};
+#ifdef PADDLE_WITH_CUDA
+    if (platform::CanCUDNNBeUsed(ctx)) {
+      library_ = framework::LibraryType::kCUDNN;
+    }
+#endif
+    return framework::OpKernelType(
+        framework::ToDataType(ctx.Input<Tensor>("X")->type()), ctx.GetPlace(),
+        framework::DataLayout::kAnyLayout, library_);
+  }
+};
+
+class GridSampleGradMaker : public framework::SingleGradOpDescMaker {
+ public:
+  using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
+
+ protected:
+  std::unique_ptr<framework::OpDesc> Apply() const override {
+    auto* op = new framework::OpDesc();
+    op->SetType("grid_sampler_grad");
+    op->SetInput("X", Input("X"));
+    op->SetInput("Grid", Input("Grid"));
+    op->SetInput(framework::GradVarName("Output"), OutputGrad("Output"));
+
+    op->SetAttrMap(Attrs());
+
+    op->SetOutput(framework::GradVarName("X"), InputGrad("X"));
+    op->SetOutput(framework::GradVarName("Grid"), InputGrad("Grid"));
+    return std::unique_ptr<framework::OpDesc>(op);
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OPERATOR(grid_sampler, ops::GridSampleOp, ops::GridSampleOpMaker,
+                  ops::GridSampleGradMaker);
+REGISTER_OPERATOR(grid_sampler_grad, ops::GridSampleOpGrad);
+
+REGISTER_OP_CPU_KERNEL(
+    grid_sampler,
+    ops::GridSampleOpKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::GridSampleOpKernel<paddle::platform::CPUDeviceContext, double>);
+REGISTER_OP_CPU_KERNEL(
+    grid_sampler_grad,
+    ops::GridSampleGradOpKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::GridSampleGradOpKernel<paddle::platform::CPUDeviceContext, double>);
diff --git a/paddle/fluid/operators/grid_sampler_op.h b/paddle/fluid/operators/grid_sampler_op.h
new file mode 100644
index 0000000000000000000000000000000000000000..08a6043eb07a6e44d46428ee195f6cb28c2ee77c
--- /dev/null
+++ b/paddle/fluid/operators/grid_sampler_op.h
@@ -0,0 +1,330 @@
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/gather.h" +#include "paddle/fluid/operators/math/math_function.h" +#include "paddle/fluid/platform/hostdevice.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +template +using EigenTensor = framework::EigenTensor; + +using Array3 = Eigen::DSizes; +using Array4 = Eigen::DSizes; + +template +static inline bool isInBound(T x, T y, T x_max, T y_max) { + if (x < 0 || x > x_max || y < 0 || y > y_max) { + return false; + } + return true; +} + +template +static void CalcGridLocations(const platform::CPUDeviceContext& ctx, + const Tensor& grid, Tensor* x_w, Tensor* x_e, + Tensor* y_n, Tensor* y_s, Tensor* d_w, + Tensor* d_e, Tensor* d_n, Tensor* d_s) { + auto& place = *ctx.eigen_device(); + const int n = grid.dims()[0]; + const int h = grid.dims()[1]; + const int w = grid.dims()[2]; + const T x_max = static_cast(w - 1); + const T y_max = static_cast(h - 1); + + // split grid with shape (n, h, w, 2) into (x, y) by the 3rd Dim + Tensor grid_x, grid_y; + T* grid_x_data = grid_x.mutable_data({n, h, w}, ctx.GetPlace()); + T* grid_y_data = grid_y.mutable_data({n, h, w}, ctx.GetPlace()); + const T* grid_data = grid.data(); + for (int i = 0; i < n * h * w; i++) { + grid_x_data[i] = grid_data[2 * i]; + grid_y_data[i] = grid_data[(2 * i) + 1]; + } + + Tensor ones; + ones.mutable_data({n, h, w}, ctx.GetPlace()); + auto ones_t = EigenTensor::From(ones).setConstant(1.0); + Tensor half_xmax; + Tensor half_ymax; + half_xmax.mutable_data({n, h, w}, ctx.GetPlace()); + auto half_xmax_t = + EigenTensor::From(half_xmax).setConstant(0.5 * x_max); + half_ymax.mutable_data({n, h, w}, ctx.GetPlace()); + auto half_ymax_t = + EigenTensor::From(half_ymax).setConstant(0.5 * y_max); + + // scale grid to [0, h-1/w-1] + auto grid_x_t = EigenTensor::From(grid_x); + auto grid_y_t = EigenTensor::From(grid_y); + grid_x_t.device(place) = (grid_x_t + ones_t) * half_xmax_t; + grid_y_t.device(place) = (grid_y_t + ones_t) * half_ymax_t; + + // calculate coords of 4 corner points + x_w->mutable_data({n, h, w}, ctx.GetPlace()); + x_e->mutable_data({n, h, w}, ctx.GetPlace()); + y_n->mutable_data({n, h, w}, ctx.GetPlace()); + y_s->mutable_data({n, h, w}, ctx.GetPlace()); + auto x_w_t = EigenTensor::From(*x_w); + auto x_e_t = EigenTensor::From(*x_e); + auto y_n_t = EigenTensor::From(*y_n); + auto y_s_t = EigenTensor::From(*y_s); + x_w_t.device(place) = grid_x_t.floor(); + x_e_t.device(place) = x_w_t + ones_t; + y_n_t.device(place) = grid_y_t.floor(); + y_s_t.device(place) = y_n_t + ones_t; + + // calculate distances to 4 sides + d_w->mutable_data({n, h, w}, ctx.GetPlace()); + d_e->mutable_data({n, h, w}, ctx.GetPlace()); + d_n->mutable_data({n, h, w}, ctx.GetPlace()); + d_s->mutable_data({n, h, w}, ctx.GetPlace()); + auto d_w_t = EigenTensor::From(*d_w); + auto d_e_t = EigenTensor::From(*d_e); + auto d_n_t = EigenTensor::From(*d_n); + auto d_s_t = EigenTensor::From(*d_s); + d_w_t.device(place) = grid_x_t - x_w_t; + d_e_t.device(place) = x_e_t - grid_x_t; + 
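// vertical distances to the north/south neighbor rows, mirroring d_w/d_e above +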
d_n_t.device(place) = grid_y_t - y_n_t; + d_s_t.device(place) = y_s_t - grid_y_t; +} + +template +static void GetGridPointValue(const Tensor& input, Tensor* output, + const Tensor& x, const Tensor& y) { + const int n = input.dims()[0]; + const int c = input.dims()[1]; + const int h = input.dims()[2]; + const int w = input.dims()[3]; + auto x_t = EigenTensor::From(x); + auto y_t = EigenTensor::From(y); + auto output_t = EigenTensor::From(*output).setConstant((T)0); + auto input_t = EigenTensor::From(input); + + for (int i = 0; i < n; i++) { + for (int k = 0; k < h; k++) { + for (int l = 0; l < w; l++) { + if (isInBound(x_t(i, k, l), y_t(i, k, l), (T)(w - 1), (T)(h - 1))) { + for (int j = 0; j < c; j++) { + output_t(i, j, k, l) = + input_t(i, j, static_cast(round(y_t(i, k, l))), + static_cast(round(x_t(i, k, l)))); + } + } + } + } + } +} + +template +static void GatherOutputGradToInputGrad(const Tensor& output_grad, + Tensor* input_grad, const Tensor& x, + const Tensor& y, const Tensor& d1, + const Tensor& d2) { + const int n = output_grad.dims()[0]; + const int c = output_grad.dims()[1]; + const int h = output_grad.dims()[2]; + const int w = output_grad.dims()[3]; + auto x_t = EigenTensor::From(x); + auto y_t = EigenTensor::From(y); + auto d1_t = EigenTensor::From(d1); + auto d2_t = EigenTensor::From(d2); + auto input_grad_t = EigenTensor::From(*input_grad); + auto output_grad_t = EigenTensor::From(output_grad); + + for (int i = 0; i < n; i++) { + for (int k = 0; k < h; k++) { + for (int l = 0; l < w; l++) { + if (isInBound(x_t(i, k, l), y_t(i, k, l), (T)(w - 1), (T)(h - 1))) { + for (int j = 0; j < c; j++) { + input_grad_t(i, j, static_cast(round(y_t(i, k, l))), + static_cast(round(x_t(i, k, l)))) += + output_grad_t(i, j, k, l) * d1_t(i, k, l) * d2_t(i, k, l); + } + } + } + } + } +} + +template +class GridSampleOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto& place = *ctx.template device_context().eigen_device(); + auto* input = ctx.Input("X"); + auto* grid = ctx.Input("Grid"); + + const int n = input->dims()[0]; + const int c = input->dims()[1]; + const int h = input->dims()[2]; + const int w = input->dims()[3]; + + // calc locations and distances of 4 corner points + Tensor x_w, x_e, y_n, y_s; + Tensor d_w, d_e, d_n, d_s; + CalcGridLocations( + ctx.template device_context(), *grid, &x_w, + &x_e, &y_n, &y_s, &d_w, &d_e, &d_n, &d_s); + + auto* output = ctx.Output("Output"); + output->mutable_data({n, c, h, w}, ctx.GetPlace()); + math::SetConstant()( + ctx.template device_context(), output, + static_cast(0)); + + // calc 4 corner points value + Tensor v_wn, v_en, v_ws, v_es; + v_wn.mutable_data({n, c, h, w}, ctx.GetPlace()); + v_en.mutable_data({n, c, h, w}, ctx.GetPlace()); + v_ws.mutable_data({n, c, h, w}, ctx.GetPlace()); + v_es.mutable_data({n, c, h, w}, ctx.GetPlace()); + GetGridPointValue(*input, &v_wn, x_w, y_n); + GetGridPointValue(*input, &v_en, x_e, y_n); + GetGridPointValue(*input, &v_ws, x_w, y_s); + GetGridPointValue(*input, &v_es, x_e, y_s); + + auto d_w_t = EigenTensor::From(d_w); + auto d_e_t = EigenTensor::From(d_e); + auto d_n_t = EigenTensor::From(d_n); + auto d_s_t = EigenTensor::From(d_s); + auto d_w_scaled_t = + d_w_t.reshape(Array4(n, 1, h, w)).broadcast(Array4(1, c, 1, 1)); + auto d_e_scaled_t = + d_e_t.reshape(Array4(n, 1, h, w)).broadcast(Array4(1, c, 1, 1)); + auto d_n_scaled_t = + d_n_t.reshape(Array4(n, 1, h, w)).broadcast(Array4(1, c, 1, 1)); + auto d_s_scaled_t = + 
d_s_t.reshape(Array4(n, 1, h, w)).broadcast(Array4(1, c, 1, 1)); + auto v_wn_t = EigenTensor::From(v_wn); + auto v_en_t = EigenTensor::From(v_en); + auto v_ws_t = EigenTensor::From(v_ws); + auto v_es_t = EigenTensor::From(v_es); + auto output_t = EigenTensor::From(*output); + // bilinear interpolaetion by 4 corner points + output_t.device(place) = v_wn_t * d_e_scaled_t * d_s_scaled_t + + v_en_t * d_w_scaled_t * d_s_scaled_t + + v_ws_t * d_e_scaled_t * d_n_scaled_t + + v_es_t * d_w_scaled_t * d_n_scaled_t; + } +}; + +template +class GridSampleGradOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("X"); + auto* grid = ctx.Input("Grid"); + auto* output_grad = ctx.Input(framework::GradVarName("Output")); + + const int n = input->dims()[0]; + const int c = input->dims()[1]; + const int h = input->dims()[2]; + const int w = input->dims()[3]; + + auto* input_grad = ctx.Output(framework::GradVarName("X")); + input_grad->mutable_data({n, c, h, w}, ctx.GetPlace()); + math::SetConstant()( + ctx.template device_context(), input_grad, + static_cast(0)); + auto* grid_grad = ctx.Output(framework::GradVarName("Grid")); + grid_grad->mutable_data({n, h, w, 2}, ctx.GetPlace()); + math::SetConstant()( + ctx.template device_context(), grid_grad, + static_cast(0)); + + Tensor x_w, x_e, y_n, y_s; + Tensor d_w, d_e, d_n, d_s; + CalcGridLocations( + ctx.template device_context(), *grid, &x_w, + &x_e, &y_n, &y_s, &d_w, &d_e, &d_n, &d_s); + + // gather output grad value to input grad by corner point coords and weight + GatherOutputGradToInputGrad(*output_grad, input_grad, x_w, y_n, d_e, + d_s); + GatherOutputGradToInputGrad(*output_grad, input_grad, x_w, y_s, d_e, + d_n); + GatherOutputGradToInputGrad(*output_grad, input_grad, x_e, y_n, d_w, + d_s); + GatherOutputGradToInputGrad(*output_grad, input_grad, x_e, y_s, d_w, + d_n); + + // calc 4 corner points value + Tensor v_wn, v_en, v_ws, v_es; + v_wn.mutable_data({n, c, h, w}, ctx.GetPlace()); + v_en.mutable_data({n, c, h, w}, ctx.GetPlace()); + v_ws.mutable_data({n, c, h, w}, ctx.GetPlace()); + v_es.mutable_data({n, c, h, w}, ctx.GetPlace()); + GetGridPointValue(*input, &v_wn, x_w, y_n); + GetGridPointValue(*input, &v_en, x_e, y_n); + GetGridPointValue(*input, &v_ws, x_w, y_s); + GetGridPointValue(*input, &v_es, x_e, y_s); + auto v_wn_t = EigenTensor::From(v_wn); + auto v_en_t = EigenTensor::From(v_en); + auto v_ws_t = EigenTensor::From(v_ws); + auto v_es_t = EigenTensor::From(v_es); + + auto d_w_t = EigenTensor::From(d_w); + auto d_e_t = EigenTensor::From(d_e); + auto d_n_t = EigenTensor::From(d_n); + auto d_s_t = EigenTensor::From(d_s); + + auto output_grad_t = EigenTensor::From(*output_grad); + + Tensor grid_grad_x, grid_grad_y; + grid_grad_x.mutable_data({n, h, w}, ctx.GetPlace()); + grid_grad_y.mutable_data({n, h, w}, ctx.GetPlace()); + auto grid_grad_x_t = EigenTensor::From(grid_grad_x).setConstant(0.0); + auto grid_grad_y_t = EigenTensor::From(grid_grad_y).setConstant(0.0); + for (int i = 0; i < n; i++) { + for (int j = 0; j < c; j++) { + for (int k = 0; k < h; k++) { + for (int l = 0; l < w; l++) { + grid_grad_x_t(i, k, l) += + ((v_en_t(i, j, k, l) - v_wn_t(i, j, k, l)) * d_s_t(i, k, l) + + (v_es_t(i, j, k, l) - v_ws_t(i, j, k, l)) * d_n_t(i, k, l)) * + output_grad_t(i, j, k, l); + grid_grad_y_t(i, k, l) += + ((v_ws_t(i, j, k, l) - v_wn_t(i, j, k, l)) * d_e_t(i, k, l) + + (v_es_t(i, j, k, l) - v_en_t(i, j, k, l)) * d_w_t(i, k, l)) * + output_grad_t(i, j, k, l); + 
} + } + } + } + const T x_max = static_cast(w - 1); + const T y_max = static_cast(h - 1); + grid_grad_x_t = grid_grad_x_t * (x_max / (T)2); + grid_grad_y_t = grid_grad_y_t * (y_max / (T)2); + + // gather grid_grad [x, y] in 3rd Dim + T* grid_grad_data = grid_grad->data(); + T* grid_grad_x_data = grid_grad_x.data(); + T* grid_grad_y_data = grid_grad_y.data(); + for (int i = 0; i < n * h * w; i++) { + grid_grad_data[2 * i] = grid_grad_x_data[i]; + grid_grad_data[2 * i + 1] = grid_grad_y_data[i]; + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/group_norm_op.cc b/paddle/fluid/operators/group_norm_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..6322659b67f6aeaeae3e29135fd52e08bf21ead1 --- /dev/null +++ b/paddle/fluid/operators/group_norm_op.cc @@ -0,0 +1,162 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/group_norm_op.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; +using DataLayout = framework::DataLayout; + +class GroupNormOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of GroupNormOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Y"), + "Output(Y) of GroupNormOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Mean"), + "Output(Mean) of GroupNormOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Variance"), + "Output(Variance) of GroupNormOp should not be null."); + + auto x_dim = ctx->GetInputDim("X"); + auto channel_num = x_dim[1]; + auto batch_size = x_dim[0]; + auto groups = ctx->Attrs().Get("groups"); + PADDLE_ENFORCE_LE( + groups, channel_num, + "'groups' must be less equal than the number of channels."); + PADDLE_ENFORCE_GE(groups, 1, "'groups' must be greater equal than 1."); + + if (ctx->HasInput("Scale")) { + PADDLE_ENFORCE_EQ(ctx->GetInputDim("Scale").size(), 1UL); + PADDLE_ENFORCE_EQ(ctx->GetInputDim("Scale")[0], channel_num); + } + if (ctx->HasInput("Bias")) { + PADDLE_ENFORCE_EQ(ctx->GetInputDim("Bias").size(), 1UL); + PADDLE_ENFORCE_EQ(ctx->GetInputDim("Bias")[0], channel_num); + } + + ctx->SetOutputDim("Y", ctx->GetInputDim("X")); + ctx->SetOutputDim("Mean", {batch_size, groups}); + ctx->SetOutputDim("Variance", {batch_size, groups}); + ctx->ShareLoD("X", "Y"); + } +}; + +class GroupNormOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "The input tensor."); + AddInput("Scale", + "Scale is a 1-dimensional tensor of size C" + "that is applied to the output.") + .AsDispensable(); + AddInput("Bias", + "Bias is a 1-dimensional tensor of size C " + "that is applied to the output") + .AsDispensable(); + AddOutput("Y", "Result after normalization."); + AddOutput("Mean", "Mean of each 
group.").AsIntermediate(); + AddOutput("Variance", "Variance of each group.").AsIntermediate(); + + AddAttr("epsilon", + "Constant for numerical stability [default 1e-5].") + .SetDefault(1e-5) + .AddCustomChecker([](const float &epsilon) { + PADDLE_ENFORCE(epsilon >= 0.0f && epsilon <= 1.0f, + "'epsilon' should be between 0.0 and 1.0."); + }); + AddAttr("groups", "The number of groups that divided from channels.") + .AddCustomChecker([](const int &groups) { + PADDLE_ENFORCE_GT(groups, 0, "'groups' should be greater than zero."); + }); + + AddComment(R"DOC( +Group Normalization + +Refer to `Group Normalization `_ +)DOC"); + } +}; + +class GroupNormGradOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext *ctx) const override { + // check input + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of GroupNormOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Mean"), + "Input(Mean) of GroupNormOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Variance"), + "Input(Variance) of GroupNormOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Y")), + "Input(Y@GRAD) of GroupNormOp should not be null."); + + // check output + if (ctx->HasOutput(framework::GradVarName("X"))) { + ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X")); + } + if (ctx->HasOutput(framework::GradVarName("Scale"))) { + ctx->SetOutputDim(framework::GradVarName("Scale"), + ctx->GetInputDim("Scale")); + } + if (ctx->HasOutput(framework::GradVarName("Bias"))) { + ctx->SetOutputDim(framework::GradVarName("Bias"), + ctx->GetInputDim("Bias")); + } + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override { + const auto *var = ctx.InputVar(framework::GradVarName("Y")); + if (var == nullptr) { + PADDLE_THROW("can't find Y@GRAD"); + } + const Tensor *t = nullptr; + if (var->IsType()) { + t = &var->Get(); + } else if (var->IsType()) { + t = &var->Get(); + } + if (t == nullptr) { + PADDLE_THROW("can't find Y@GRAD"); + } + return framework::OpKernelType(framework::ToDataType(t->type()), + ctx.GetPlace()); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(group_norm, ops::GroupNormOp, ops::GroupNormOpMaker, + paddle::framework::DefaultGradOpDescMaker); +REGISTER_OPERATOR(group_norm_grad, ops::GroupNormGradOp); +REGISTER_OP_CPU_KERNEL( + group_norm, ops::GroupNormKernel, + ops::GroupNormKernel); +REGISTER_OP_CPU_KERNEL( + group_norm_grad, + ops::GroupNormGradKernel, + ops::GroupNormGradKernel); diff --git a/paddle/fluid/operators/group_norm_op.cu b/paddle/fluid/operators/group_norm_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..27174630227c8123a31cb1c70d5eb5f5b3ee5107 --- /dev/null +++ b/paddle/fluid/operators/group_norm_op.cu @@ -0,0 +1,292 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/operators/group_norm_op.h" + +namespace paddle { +namespace operators { + +template +__global__ void GroupNormForwardGetMeanAndVar(const T* x, int N, int C, + int imsize, int groups, + int group_size, T* mean, T* var) { + int gid = blockIdx.y; + int cid = blockIdx.x; + int bid = blockIdx.z; + int number = min(group_size, static_cast(C - gid * group_size)); + int ccid = gid * group_size + cid; + if (ccid >= C) return; + T x_mean = 0, x_var = 0; + for (int imid = threadIdx.x; imid < imsize; imid += blockDim.x) { + T val = x[(bid * C + ccid) * imsize + imid]; + x_mean += val; + x_var += val * val; + } + x_mean /= number * imsize; + x_var /= number * imsize; + __shared__ T s_mem[2]; + if (threadIdx.x == 0) { + s_mem[0] = s_mem[1] = 0; + } + __syncthreads(); + paddle::platform::CudaAtomicAdd(&s_mem[0], x_mean); + paddle::platform::CudaAtomicAdd(&s_mem[1], x_var); + __syncthreads(); + if (threadIdx.x == 0) { + paddle::platform::CudaAtomicAdd(&mean[bid * groups + gid], s_mem[0]); + paddle::platform::CudaAtomicAdd(&var[bid * groups + gid], s_mem[1]); + } +} + +template +__global__ void GroupNormForward(const T* x, const T* mean, const T* var, + const T* scale, const T* bias, int N, int C, + int imsize, int groups, int group_size, + T epsilon, T* y, T* real_var) { + int gid = blockIdx.y; + int cid = blockIdx.x; + int bid = blockIdx.z; + int ccid = gid * group_size + cid; + if (ccid >= C) return; + T x_mean = mean[bid * groups + gid]; + T x_var = var[bid * groups + gid]; + x_var = x_var - x_mean * x_mean; + T var_inv = 1.0 / sqrt(x_var + epsilon); + if (cid == 0 && threadIdx.x == 0) real_var[bid * groups + gid] = x_var; + for (int imid = threadIdx.x; imid < imsize; imid += blockDim.x) { + T val = x[(bid * C + ccid) * imsize + imid]; + val = (val - x_mean) * var_inv; + if (scale) val *= scale[gid * group_size + cid]; + if (bias) val += bias[gid * group_size + cid]; + y[(bid * C + ccid) * imsize + imid] = val; + } +} + +template +class GroupNormKernel + : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + const float epsilon = ctx.Attr("epsilon"); + auto* scale = ctx.Input("Scale"); + auto* bias = ctx.Input("Bias"); + auto* x = ctx.Input("X"); + + auto* y = ctx.Output("Y"); + auto* mean = ctx.Output("Mean"); + auto* var = ctx.Output("Variance"); + const auto groups = ctx.Attr("groups"); + + const auto x_dims = x->dims(); + const int group_size = (x_dims[1] - 1) / groups + 1; + + y->mutable_data(ctx.GetPlace()); + mean->mutable_data(ctx.GetPlace()); + var->mutable_data(ctx.GetPlace()); + math::SetConstant set_zero; + auto& dev_ctx = ctx.template device_context(); + Tensor temp_var; + temp_var.mutable_data(var->dims(), ctx.GetPlace()); + + set_zero(dev_ctx, mean, static_cast(0)); + set_zero(dev_ctx, &temp_var, static_cast(0)); + + auto* x_data = x->data(); + auto* y_data = y->data(); + auto* mean_data = mean->data(); + auto* var_data = var->data(); + auto* temp_var_data = temp_var.data(); + + const T* scale_data = nullptr; + if (scale) scale_data = scale->data(); + const T* bias_data = nullptr; + if (bias) bias_data = bias->data(); + + int imsize = x_dims[2] * x_dims[3]; + int block_size = std::min(512, imsize); + dim3 grid(group_size, groups, x_dims[0]); + dim3 threads(block_size, 1, 1); + GroupNormForwardGetMeanAndVar<<>>( + x_data, x_dims[0], x_dims[1], imsize, groups, group_size, mean_data, + 
temp_var_data); + GroupNormForward<<>>( + x_data, mean_data, temp_var_data, scale_data, bias_data, x_dims[0], + x_dims[1], imsize, groups, group_size, epsilon, y_data, var_data); + } +}; + +template +__global__ void GroupNormBackwardGetMeanAndVar( + const T* x, const T* mean, const T* var, const T* scale, const T* d_y, + int N, int C, int imsize, int groups, int group_size, T epsilon, T* d_x, + T* d_mean, T* d_var, T* d_scale, T* d_bias) { + int gid = blockIdx.y; + int cid = blockIdx.x; + int bid = blockIdx.z; + int number = min(group_size, static_cast(C - gid * group_size)); + int ccid = gid * group_size + cid; + if (ccid >= C) return; + T x_mean = mean[bid * groups + gid]; + T x_var = var[bid * groups + gid]; + T var_inv = 1.0 / sqrt(x_var + epsilon); + T d_var_inv = 0, d_x_mean = 0; + T d_mean_data = 0, d_var_data = 0, d_scale_data = 0, d_bias_data = 0; + + for (int imid = threadIdx.x; imid < imsize; imid += blockDim.x) { + T tmp = x[(bid * C + ccid) * imsize + imid]; + T val = (tmp - x_mean) * var_inv; + T dval = d_y[(bid * C + ccid) * imsize + imid]; + if (d_bias) d_bias_data += dval; + if (d_scale) d_scale_data += val * dval; + if (scale) dval = dval * scale[ccid]; + d_var_data += (tmp - x_mean) * dval; + T d_tmp = dval * var_inv; + if (d_x) d_x[(bid * C + ccid) * imsize + imid] = d_tmp; + d_mean_data -= d_tmp; + } + + __shared__ T s_mem[4]; + if (threadIdx.x == 0) { + s_mem[0] = s_mem[1] = 0; + if (d_scale) s_mem[2] = 0; + if (d_bias) s_mem[3] = 0; + } + __syncthreads(); + paddle::platform::CudaAtomicAdd(&s_mem[0], d_mean_data); + paddle::platform::CudaAtomicAdd(&s_mem[1], d_var_data); + if (d_scale) paddle::platform::CudaAtomicAdd(&s_mem[2], d_scale_data); + if (d_bias) paddle::platform::CudaAtomicAdd(&s_mem[3], d_bias_data); + __syncthreads(); + if (threadIdx.x == 0) { + paddle::platform::CudaAtomicAdd(&d_mean[bid * groups + gid], s_mem[0]); + paddle::platform::CudaAtomicAdd(&d_var[bid * groups + gid], s_mem[1]); + if (d_scale) paddle::platform::CudaAtomicAdd(&d_scale[ccid], s_mem[2]); + if (d_bias) paddle::platform::CudaAtomicAdd(&d_bias[ccid], s_mem[3]); + } +} + +template +__global__ void GroupNormBackward(const T* x, const T* mean, const T* var, + const T* d_mean, const T* d_var, int N, int C, + int imsize, int groups, int group_size, + T epsilon, T* d_x) { + int gid = blockIdx.y; + int cid = blockIdx.x; + int bid = blockIdx.z; + int number = min(group_size, static_cast(C - gid * group_size)); + int ccid = gid * group_size + cid; + if (ccid >= C) return; + T x_mean = mean[bid * groups + gid]; + T x_var = var[bid * groups + gid]; + T d_x_mean = d_mean[bid * groups + gid]; + T d_var_inv = d_var[bid * groups + gid]; + + T d_x_var = + -1.0 / (2 * (x_var + epsilon) * sqrt(x_var + epsilon)) * d_var_inv; + d_x_mean -= 2 * d_x_var * x_mean; + d_x_var /= number * imsize; + d_x_mean /= number * imsize; + for (int imid = threadIdx.x; imid < imsize; imid += blockDim.x) { + T tmp = x[(bid * C + ccid) * imsize + imid]; + if (d_x) + d_x[(bid * C + ccid) * imsize + imid] += d_x_mean + tmp * 2 * d_x_var; + } +} + +template +class GroupNormGradKernel + : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + const float epsilon = ctx.Attr("epsilon"); + auto* x = ctx.Input("X"); + auto* mean = ctx.Input("Mean"); + auto* var = ctx.Input("Variance"); + auto* scale = ctx.Input("Scale"); + auto* d_y = ctx.Input(framework::GradVarName("Y")); + const auto groups = ctx.Attr("groups"); + + // init output + auto* d_x = 
ctx.Output(framework::GradVarName("X")); + auto* d_scale = ctx.Output(framework::GradVarName("Scale")); + auto* d_bias = ctx.Output(framework::GradVarName("Bias")); + + const auto& x_dims = x->dims(); + const int group_size = (x_dims[1] - 1) / groups + 1; + + T* d_x_data = nullptr; + if (d_x) { + d_x->mutable_data(ctx.GetPlace()); + d_x_data = d_x->data(); + } + math::SetConstant set_zero; + auto& dev_ctx = ctx.template device_context(); + + Tensor temp_var; + temp_var.mutable_data(var->dims(), ctx.GetPlace()); + set_zero(dev_ctx, &temp_var, static_cast(0)); + T* temp_var_data = temp_var.data(); + + Tensor temp_mean; + temp_mean.mutable_data(var->dims(), ctx.GetPlace()); + set_zero(dev_ctx, &temp_mean, static_cast(0)); + T* temp_mean_data = temp_mean.data(); + + auto* x_data = x->data(); + auto* y_data = d_y->data(); + auto* mean_data = mean->data(); + auto* var_data = var->data(); + T* d_scale_data = nullptr; + if (d_scale) { + d_scale->mutable_data(ctx.GetPlace()); + set_zero(dev_ctx, d_scale, static_cast(0)); + d_scale_data = d_scale->data(); + } + T* d_bias_data = nullptr; + if (d_bias) { + d_bias->mutable_data(ctx.GetPlace()); + set_zero(dev_ctx, d_bias, static_cast(0)); + d_bias_data = d_bias->data(); + } + + const T* scale_data = nullptr; + if (scale) scale_data = scale->data(); + + int imsize = x_dims[2] * x_dims[3]; + int block_size = std::min(512, imsize); + dim3 grid(group_size, groups, x_dims[0]); + dim3 threads(block_size, 1, 1); + GroupNormBackwardGetMeanAndVar<<>>( + x_data, mean_data, var_data, scale_data, y_data, x_dims[0], x_dims[1], + imsize, groups, group_size, epsilon, d_x_data, temp_mean_data, + temp_var_data, d_scale_data, d_bias_data); + GroupNormBackward<<>>( + x_data, mean_data, var_data, temp_mean_data, temp_var_data, x_dims[0], + x_dims[1], imsize, groups, group_size, epsilon, d_x_data); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + group_norm, + ops::GroupNormKernel, + ops::GroupNormKernel); +REGISTER_OP_CUDA_KERNEL( + group_norm_grad, + ops::GroupNormGradKernel, + ops::GroupNormGradKernel); diff --git a/paddle/fluid/operators/group_norm_op.h b/paddle/fluid/operators/group_norm_op.h new file mode 100644 index 0000000000000000000000000000000000000000..3d6c6a46a9662e3b99b4e080b424b4794db7fcc3 --- /dev/null +++ b/paddle/fluid/operators/group_norm_op.h @@ -0,0 +1,197 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once +#include +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" +#include "paddle/fluid/operators/math/blas.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; +using DataLayout = framework::DataLayout; + +template +class GroupNormKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + const float epsilon = ctx.Attr("epsilon"); + auto* scale = ctx.Input("Scale"); + auto* bias = ctx.Input("Bias"); + auto* x = ctx.Input("X"); + + auto* y = ctx.Output("Y"); + auto* mean = ctx.Output("Mean"); + auto* var = ctx.Output("Variance"); + const auto groups = ctx.Attr("groups"); + + const auto x_dims = x->dims(); + const int group_size = (x_dims[1] - 1) / groups + 1; + + y->mutable_data(ctx.GetPlace()); + mean->mutable_data(ctx.GetPlace()); + var->mutable_data(ctx.GetPlace()); + + auto* x_data = x->data(); + auto* y_data = y->data(); + auto* mean_data = mean->data(); + auto* var_data = var->data(); + + const T* scale_data = nullptr; + if (scale) scale_data = scale->data(); + const T* bias_data = nullptr; + if (bias) bias_data = bias->data(); + + int imsize = x_dims[2] * x_dims[3]; + auto* iter_x_data = x_data; + auto* iter_y_data = y_data; + for (int bid = 0; bid < x_dims[0]; bid++) + for (int gid = 0; gid < groups; gid++) { + T x_mean = 0, x_var = 0; + int number = std::min(group_size, + static_cast(x_dims[1] - gid * group_size)); + auto* tmp = iter_x_data; + for (int cid = 0; cid < number; cid++) { + for (int imid = 0; imid < imsize; imid++, iter_x_data++) { + x_mean += iter_x_data[0]; + x_var += iter_x_data[0] * iter_x_data[0]; + } + } + x_mean /= number * imsize; + x_var /= number * imsize; + x_var = x_var - x_mean * x_mean; + T var_inv = 1.0 / sqrt(x_var + epsilon); + mean_data[bid * groups + gid] = x_mean; + var_data[bid * groups + gid] = x_var; + for (int cid = 0; cid < number; cid++) { + for (int imid = 0; imid < imsize; imid++, tmp++, iter_y_data++) { + T val = (tmp[0] - x_mean) * var_inv; + if (scale_data) val *= scale_data[gid * group_size + cid]; + if (bias_data) val += bias_data[gid * group_size + cid]; + iter_y_data[0] = val; + } + } + } + } +}; + +template +class GroupNormGradKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + const float epsilon = ctx.Attr("epsilon"); + auto* x = ctx.Input("X"); + auto* mean = ctx.Input("Mean"); + auto* var = ctx.Input("Variance"); + auto* scale = ctx.Input("Scale"); + auto* d_y = ctx.Input(framework::GradVarName("Y")); + const auto groups = ctx.Attr("groups"); + + // init output + auto* d_x = ctx.Output(framework::GradVarName("X")); + auto* d_scale = ctx.Output(framework::GradVarName("Scale")); + auto* d_bias = ctx.Output(framework::GradVarName("Bias")); + + const auto& x_dims = x->dims(); + const int group_size = (x_dims[1] - 1) / groups + 1; + + // TODO(liangdun): need to check d_x is null + math::SetConstant set_zero; + auto& dev_ctx = ctx.template device_context(); + T* d_x_data = nullptr; + if (d_x) { + d_x->mutable_data(ctx.GetPlace()); + set_zero(dev_ctx, d_x, static_cast(0)); + d_x_data = d_x->data(); + } + + auto* x_data = x->data(); + auto* y_data = d_y->data(); + auto* mean_data = mean->data(); + auto* var_data = var->data(); + T* d_scale_data = 
nullptr; + if (d_scale) { + d_scale->mutable_data(ctx.GetPlace()); + set_zero(dev_ctx, d_scale, static_cast(0)); + d_scale_data = d_scale->data(); + } + T* d_bias_data = nullptr; + if (d_bias) { + d_bias->mutable_data(ctx.GetPlace()); + set_zero(dev_ctx, d_bias, static_cast(0)); + d_bias_data = d_bias->data(); + } + + const T* scale_data = nullptr; + if (scale) scale_data = scale->data(); + + int imsize = x_dims[2] * x_dims[3]; + auto* iter_x_data = x_data; + auto* iter_d_x_data = d_x_data; + auto* iter_y_data = y_data; + for (int bid = 0; bid < x_dims[0]; bid++) + for (int gid = 0; gid < groups; gid++) { + T x_mean = mean_data[bid * groups + gid]; + T x_var = var_data[bid * groups + gid]; + T var_inv = 1.0 / sqrt(x_var + epsilon); + int number = std::min(group_size, + static_cast(x_dims[1] - gid * group_size)); + auto* tmp = iter_x_data; + auto* tmp2 = iter_d_x_data; + T d_var_inv = 0, d_x_mean = 0; + for (int cid = 0; cid < number; cid++) { + for (int imid = 0; imid < imsize; + imid++, tmp++, iter_y_data++, iter_d_x_data++) { + T val = (tmp[0] - x_mean) * var_inv; + T dval = iter_y_data[0]; + if (d_bias_data) d_bias_data[gid * group_size + cid] += dval; + if (d_scale_data) + d_scale_data[gid * group_size + cid] += val * dval; + if (scale_data) dval = scale_data[gid * group_size + cid] * dval; + + d_var_inv += (tmp[0] - x_mean) * dval; + T d_tmp = dval * var_inv; + if (d_x_data) iter_d_x_data[0] += d_tmp; + d_x_mean -= d_tmp; + } + } + + T d_x_var = + -1.0 / (2 * (x_var + epsilon) * sqrt(x_var + epsilon)) * d_var_inv; + d_x_mean -= 2 * d_x_var * x_mean; + d_x_var /= number * imsize; + d_x_mean /= number * imsize; + + iter_d_x_data = tmp2; + + if (d_x_data) { + for (int cid = 0; cid < number; cid++) { + for (int imid = 0; imid < imsize; + imid++, iter_x_data++, iter_d_x_data++) { + iter_d_x_data[0] += d_x_mean; + iter_d_x_data[0] += iter_x_data[0] * 2 * d_x_var; + } + } + } + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/hash_op.cc b/paddle/fluid/operators/hash_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..b2c2c7954b79658e66f1524a81bcad0b7bf22c35 --- /dev/null +++ b/paddle/fluid/operators/hash_op.cc @@ -0,0 +1,74 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/fluid/operators/hash_op.h" +#include +#include + +namespace paddle { +namespace operators { + +class HashOp : public framework::OperatorWithKernel { + public: + HashOp(const std::string &type, const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : OperatorWithKernel(type, inputs, outputs, attrs) {} + + void InferShape(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of HashOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Out"), + "Output(Out) of HashOp should not be null."); + + auto dims = ctx->GetInputDim("X"); + PADDLE_ENFORCE_EQ(dims.size(), 2UL, + "The input of hash_op's dimensions must be 2"); + std::vector out_dims; + out_dims.reserve(dims.size() + 1); + // copy all dims except the last one + for (int i = 0u; i != dims.size() - 1; ++i) { + out_dims.emplace_back(dims[i]); + } + int num_hash = ctx->Attrs().Get("num_hash"); + out_dims.emplace_back(num_hash); + // keep the last dim to 1 + out_dims.emplace_back(1); + + ctx->SetOutputDim("Out", framework::make_ddim(out_dims)); + ctx->ShareLoD("X", /*->*/ "Out"); + } +}; + +class HashOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "(Tensor) Input tensor of scale operator."); + AddOutput("Out", "(Tensor) Output tensor of scale operator."); + AddComment(R"DOC( +**Hash Operator** +$$Out = scale * X$$ +)DOC"); + AddAttr("num_hash", "").SetDefault(1); + AddAttr("mod_by", "").SetDefault(100000); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; + +REGISTER_OP_WITHOUT_GRADIENT(hash, ops::HashOp, ops::HashOpMaker); +REGISTER_OP_CPU_KERNEL(hash, ops::HashKerel, ops::HashKerel); diff --git a/paddle/fluid/operators/hash_op.h b/paddle/fluid/operators/hash_op.h new file mode 100644 index 0000000000000000000000000000000000000000..9781bb0f453642cefb3eb59a05389c339a7de39d --- /dev/null +++ b/paddle/fluid/operators/hash_op.h @@ -0,0 +1,56 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/
+
+#pragma once
+
+extern "C" {
+#include <xxhash.h>
+}
+#include "paddle/fluid/framework/eigen.h"
+#include "paddle/fluid/framework/op_registry.h"
+
+namespace paddle {
+namespace operators {
+// template <typename DeviceContext, typename T>
+template <typename T>
+class HashKernel : public framework::OpKernel<T> {
+ public:
+  virtual void Compute(const framework::ExecutionContext& context) const {
+    auto* out_t = context.Output<framework::LoDTensor>("Out");
+    auto* in_t = context.Input<framework::LoDTensor>("X");
+    int mod_by = context.Attr<int>("mod_by");
+    int num_hash = context.Attr<int>("num_hash");
+    auto* output = out_t->mutable_data<T>(context.GetPlace());
+
+    auto in_dims = in_t->dims();
+    auto in_lod = in_t->lod();
+    PADDLE_ENFORCE_EQ(
+        static_cast<uint64_t>(in_dims[0]), in_lod[0].back(),
+        "The actual input data's size mismatched with LoD information.");
+
+    auto seq_length = in_dims[0];
+    auto last_dim = in_dims[in_dims.size() - 1];
+    auto* input = in_t->data<int>();
+    for (int idx = 0; idx < seq_length; ++idx) {
+      for (int ihash = 0; ihash != num_hash; ++ihash) {
+        // hash one row with seed ihash, then fold the result into [0, mod_by)
+        output[idx * num_hash + ihash] =
+            XXH64(input, sizeof(int) * last_dim, ihash) % mod_by;
+      }
+      input += last_dim;
+    }
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/fluid/operators/hierarchical_sigmoid_op.h b/paddle/fluid/operators/hierarchical_sigmoid_op.h
index 64096a717b12ed231344649f5eb76b7e4b9af4a6..79980cda53befc2bce3cbd79a15da58b39c922ad 100644
--- a/paddle/fluid/operators/hierarchical_sigmoid_op.h
+++ b/paddle/fluid/operators/hierarchical_sigmoid_op.h
@@ -111,7 +111,7 @@ class HierarchicalSigmoidGradOpKernel : public framework::OpKernel<T> {
     auto pre_out_mat = EigenMatrix<T>::From(*pre_out);
     auto pre_out_grad_mat = EigenMatrix<T>::From(pre_out_grad);
     auto out_grad_mat = EigenMatrix<T>::From(*out_grad);
-    Eigen::array<int, 2> bcast({{1, static_cast<int>(pre_out_grad.dims()[1])}});
+    Eigen::array<int, 2> bcast{1, static_cast<int>(pre_out_grad.dims()[1])};

     // softrelu derivative
     pre_out_grad_mat.device(place) =
diff --git a/paddle/fluid/operators/bilinear_interp_op.cc b/paddle/fluid/operators/interpolate_op.cc
similarity index 52%
rename from paddle/fluid/operators/bilinear_interp_op.cc
rename to paddle/fluid/operators/interpolate_op.cc
index 2dc3399da183fbcf7664066f6f7ce12db3dc6d5e..8f979e05d31e5a85bc86784943f4588ab650f668 100644
--- a/paddle/fluid/operators/bilinear_interp_op.cc
+++ b/paddle/fluid/operators/interpolate_op.cc
@@ -1,4 +1,4 @@
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
@@ -9,7 +9,8 @@
 See the License for the specific language governing permissions and
 limitations under the License.
*/ -#include "paddle/fluid/operators/bilinear_interp_op.h" +#include "paddle/fluid/operators/interpolate_op.h" +#include #include #include "paddle/fluid/framework/op_registry.h" @@ -18,27 +19,34 @@ namespace operators { using framework::Tensor; -class BilinearInterpOp : public framework::OperatorWithKernel { +class InterpolateOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; protected: void InferShape(framework::InferShapeContext* ctx) const override { PADDLE_ENFORCE(ctx->HasInput("X"), - "Input(X) of BilinearInterOp should not be null."); + "Input(X) of InterpolateOp should not be null."); PADDLE_ENFORCE(ctx->HasOutput("Out"), - "Output(Out) of BilinearInterOp should not be null."); + "Output(Out) of InterpolationOp should not be null."); + + auto interp_method = ctx->Attrs().Get("interp_method"); + PADDLE_ENFORCE( + "bilinear" == interp_method || "nearest" == interp_method, + "Interpolation method can only be \"bilinear\" or \"nearest\"."); auto dim_x = ctx->GetInputDim("X"); // NCHW format int out_h = ctx->Attrs().Get("out_h"); int out_w = ctx->Attrs().Get("out_w"); PADDLE_ENFORCE_EQ(dim_x.size(), 4, "X's dimension must be 4"); - if (ctx->HasInput("OutSize")) { + if (ctx->HasInput("OutSize") && ctx->IsRuntime()) { auto out_size_dim = ctx->GetInputDim("OutSize"); PADDLE_ENFORCE_EQ(out_size_dim.size(), 1, "OutSize's dimension size must be 1"); PADDLE_ENFORCE_EQ(out_size_dim[0], 2, "OutSize's dim[0] must be 2"); + ctx->ShareLoD("X", "Out"); + return; } std::vector dim_out({dim_x[0], dim_x[1], out_h, out_w}); ctx->SetOutputDim("Out", framework::make_ddim(dim_out)); @@ -52,35 +60,53 @@ class BilinearInterpOp : public framework::OperatorWithKernel { } }; -class BilinearInterpOpMaker : public framework::OpProtoAndCheckerMaker { +class InterpolateOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { AddInput("X", - "The input tensor of bilinear interpolation, " - "This is a 4-D tensor with shape of (N x C x h x w)"); + "The input tensor of interpolate operator, " + "This is a 4-D tensor with shape of [N, C, H, w]."); AddInput("OutSize", - "This is a 1-D tensor with two number. " + "This is a 1-D tensor with two numbers to specify output size. " "The first number is height and the second number is width.") .AsDispensable(); - AddOutput("Out", "The dimension of output is (N x C x out_h x out_w)"); + AddOutput("Out", + "The output tensor of interpolate operator, " + "This is a 4-D tensor with shape of [N, C, H, W]."); - AddAttr("out_h", "output height of bilinear interpolation op."); - AddAttr("out_w", "output width of bilinear interpolation op."); + AddAttr("out_h", "output height of interpolate op."); + AddAttr("out_w", "output width of interpolate op."); + AddAttr( + "interp_method", + "(string), interpolation method, can be \"bilinear\" for " + "bilinear interpolation and \"nearest\" for nearest " + "neighbor interpolation."); AddComment(R"DOC( + This operator samples input X to given output shape by using specified + interpolation method, the interpolation methods can be \"nearest\" + for nearest neighbor interpolation and \"bilinear\" for bilinear + interpolation. + + Nearest neighbor interpolation is to perform nearest neighbor interpolation + in both the 3rd dimention(in height direction) and the 4th dimention(in width + direction) on input tensor. + Bilinear interpolation is an extension of linear interpolation for interpolating functions of two variables (e.g. 
H-direction and - W-direction in this op) on a rectilinear 2D grid. - - The key idea is to perform linear interpolation first in one - direction, and then again in the other direction. - - For details, please refer to Wikipedia: + W-direction in this op) on a rectilinear 2D grid. The key idea is + to perform linear interpolation first in one direction, and then + again in the other direction. + + For details of nearest neighbor interpolation, please refer to Wikipedia: + https://en.wikipedia.org/wiki/Nearest-neighbor_interpolation + + For details of bilinear interpolation, please refer to Wikipedia: https://en.wikipedia.org/wiki/Bilinear_interpolation )DOC"); } }; -class BilinearInterpOpGrad : public framework::OperatorWithKernel { +class InterpolateOpGrad : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; @@ -106,11 +132,11 @@ class BilinearInterpOpGrad : public framework::OperatorWithKernel { } // namespace paddle namespace ops = paddle::operators; -REGISTER_OPERATOR(bilinear_interp, ops::BilinearInterpOp, - ops::BilinearInterpOpMaker, +REGISTER_OPERATOR(interpolate, ops::InterpolateOp, ops::InterpolateOpMaker, paddle::framework::DefaultGradOpDescMaker); -REGISTER_OPERATOR(bilinear_interp_grad, ops::BilinearInterpOpGrad); -REGISTER_OP_CPU_KERNEL(bilinear_interp, ops::BilinearInterpKernel, - ops::BilinearInterpKernel); -REGISTER_OP_CPU_KERNEL(bilinear_interp_grad, - ops::BilinearInterpGradKernel); +REGISTER_OPERATOR(interpolate_grad, ops::InterpolateOpGrad); +REGISTER_OP_CPU_KERNEL(interpolate, ops::InterpolateKernel, + ops::InterpolateKernel, + ops::InterpolateKernel); +REGISTER_OP_CPU_KERNEL(interpolate_grad, ops::InterpolateGradKernel, + ops::InterpolateGradKernel); diff --git a/paddle/fluid/operators/interpolate_op.cu b/paddle/fluid/operators/interpolate_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..190afbdac431f863c32e2a4a4b3ad83848e550fc --- /dev/null +++ b/paddle/fluid/operators/interpolate_op.cu @@ -0,0 +1,292 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
*/ + +#include +#include "paddle/fluid/operators/interpolate_op.h" +#include "paddle/fluid/platform/cuda_primitives.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; + +template +__global__ void KeNearestNeighborInterpFw( + const T* in, const size_t in_img_h, const size_t in_img_w, + const size_t input_h, const size_t input_w, T* out, const size_t out_img_h, + const size_t out_img_w, const size_t output_h, const size_t output_w, + const size_t num_channels, const float ratio_h, const float ratio_w) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + int channel_id = out_id_w / out_img_size; + + int out_img_idy = (out_id_w % out_img_size) / out_img_w; + int in_img_idy = static_cast(ratio_h * out_img_idy + 0.5); + + int out_img_idx = tid % out_img_w; + int in_img_idx = static_cast(ratio_w * out_img_idx + 0.5); + + out[tid] = in[out_id_h * input_w + channel_id * in_img_size + + in_img_idy * in_img_w + in_img_idx]; + } +} + +template +__global__ void KeNearestNeighborInterpBw( + T* in, const size_t in_img_h, const size_t in_img_w, const size_t input_h, + const size_t input_w, const T* out, const size_t out_img_h, + const size_t out_img_w, const size_t output_h, const size_t output_w, + const size_t num_channels, const float ratio_h, const float ratio_w) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + int channel_id = out_id_w / out_img_size; + + int out_img_idy = (out_id_w % out_img_size) / out_img_w; + int in_img_idy = static_cast(ratio_h * out_img_idy + 0.5); + + int out_img_idx = tid % out_img_w; + int in_img_idx = static_cast(ratio_w * out_img_idx + 0.5); + + T* in_pos = &in[out_id_h * input_w + channel_id * in_img_size + + in_img_idy * in_img_w + in_img_idx]; + const T out_pos = out[out_id_h * output_w + out_id_w]; + platform::CudaAtomicAdd(in_pos, out_pos); + } +} + +template +__global__ void KeBilinearInterpFw( + const T* in, const size_t in_img_h, const size_t in_img_w, + const size_t input_h, const size_t input_w, T* out, const size_t out_img_h, + const size_t out_img_w, const size_t output_h, const size_t output_w, + const size_t num_channels, const float ratio_h, const float ratio_w) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + int channel_id = out_id_w / out_img_size; + + int out_img_idy = (out_id_w % out_img_size) / out_img_w; + int in_img_idy = ratio_h * out_img_idy; + int h_id = (in_img_idy < in_img_h - 1) ? 1 : 0; + T h1lambda = ratio_h * out_img_idy - in_img_idy; + T h2lambda = 1.f - h1lambda; + + int out_img_idx = tid % out_img_w; + int in_img_idx = ratio_w * out_img_idx; + int w_id = (in_img_idx < in_img_w - 1) ? 
1 : 0; + T w1lambda = ratio_w * out_img_idx - in_img_idx; + T w2lambda = 1.f - w1lambda; + + const T* in_pos = &in[out_id_h * input_w + channel_id * in_img_size + + in_img_idy * in_img_w + in_img_idx]; + + // bilinear interpolation + out[out_id_h * output_w + out_id_w] = + h2lambda * (w2lambda * in_pos[0] + w1lambda * in_pos[w_id]) + + h1lambda * (w2lambda * in_pos[h_id * in_img_w] + + w1lambda * in_pos[h_id * in_img_w + w_id]); + } +} + +template +__global__ void KeBilinearInterpBw( + T* in, const size_t in_img_h, const size_t in_img_w, const size_t input_h, + const size_t input_w, const T* out, const size_t out_img_h, + const size_t out_img_w, const size_t output_h, const size_t output_w, + const size_t num_channels, const T ratio_h, const T ratio_w) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + int channel_id = out_id_w / out_img_size; + + int out_img_idy = (out_id_w % out_img_size) / out_img_w; + int in_img_idy = ratio_h * out_img_idy; + int h_id = (in_img_idy < in_img_h - 1) ? 1 : 0; + T h1lambda = ratio_h * out_img_idy - in_img_idy; + T h2lambda = 1.f - h1lambda; + + int out_img_idx = tid % out_img_w; + int in_img_idx = ratio_w * out_img_idx; + int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; + T w1lambda = ratio_w * out_img_idx - in_img_idx; + T w2lambda = 1.f - w1lambda; + + T* in_pos = &in[out_id_h * input_w + channel_id * in_img_size + + in_img_idy * in_img_w + in_img_idx]; + const T* out_pos = &out[out_id_h * output_w + out_id_w]; + platform::CudaAtomicAdd(&in_pos[0], h2lambda * w2lambda * out_pos[0]); + platform::CudaAtomicAdd(&in_pos[w_id], h2lambda * w1lambda * out_pos[0]); + platform::CudaAtomicAdd(&in_pos[h_id * in_img_w], + h1lambda * w2lambda * out_pos[0]); + platform::CudaAtomicAdd(&in_pos[h_id * in_img_w + w_id], + h1lambda * w1lambda * out_pos[0]); + } +} + +template +class InterpolateOpCUDAKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "This kernel only runs on GPU device."); + auto* input = ctx.Input("X"); + auto* output = ctx.Output("Out"); + auto* input_data = input->data(); + + auto interp_method = ctx.Attr("interp_method"); + int out_h = ctx.Attr("out_h"); + int out_w = ctx.Attr("out_w"); + auto out_size = ctx.Input("OutSize"); + if (out_size != nullptr) { + Tensor sizes; + framework::TensorCopy(*out_size, platform::CPUPlace(), &sizes); + auto size_data = sizes.data(); + out_h = size_data[0]; + out_w = size_data[1]; + } + + int n = input->dims()[0]; + int c = input->dims()[1]; + int in_h = input->dims()[2]; + int in_w = input->dims()[3]; + + auto* output_data = + output->mutable_data({n, c, out_h, out_w}, ctx.GetPlace()); + + int in_hw = in_h * in_w; + int out_hw = out_h * out_w; + int in_chw = c * in_hw; + int out_chw = c * out_hw; + + float ratio_h = + (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; + float ratio_w = + (out_w > 1) ? static_cast(in_w - 1) / (out_w - 1) : 0.f; + + if (in_h == out_h && in_w == out_w) { + framework::TensorCopy(*input, ctx.GetPlace(), output); + return; + } + + int pixelNum = n * out_chw; + int grid_dim = (pixelNum + 512 - 1) / 512; + grid_dim = grid_dim > 8 ? 
8 : grid_dim; + + if ("nearest" == interp_method) { + KeNearestNeighborInterpFw< + T><<>>( + input_data, in_h, in_w, n, in_chw, output_data, out_h, out_w, n, + out_chw, c, ratio_h, ratio_w); + } else if ("bilinear" == interp_method) { + KeBilinearInterpFw< + T><<>>( + input_data, in_h, in_w, n, in_chw, output_data, out_h, out_w, n, + out_chw, c, ratio_h, ratio_w); + } + } +}; + +template +class InterpolateGradOpCUDAKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input_grad = ctx.Output(framework::GradVarName("X")); + auto* output_grad = ctx.Input(framework::GradVarName("Out")); + auto* output_grad_data = output_grad->data(); + auto* input_grad_data = input_grad->mutable_data(ctx.GetPlace()); + + auto& device_ctx = + ctx.template device_context(); + math::SetConstant zero; + zero(device_ctx, input_grad, static_cast(0.0)); + + auto interp_method = ctx.Attr("interp_method"); + int out_h = ctx.Attr("out_h"); + int out_w = ctx.Attr("out_w"); + auto out_size = ctx.Input("OutSize"); + if (out_size != nullptr) { + Tensor sizes; + framework::TensorCopy(*out_size, platform::CPUPlace(), &sizes); + auto size_data = sizes.data(); + out_h = size_data[0]; + out_w = size_data[1]; + } + + int n = input_grad->dims()[0]; + int c = input_grad->dims()[1]; + int in_h = input_grad->dims()[2]; + int in_w = input_grad->dims()[3]; + + int in_hw = in_h * in_w; + int out_hw = out_h * out_w; + int in_chw = c * in_hw; + int out_chw = c * out_hw; + + float ratio_h = + (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; + float ratio_w = + (out_w > 1) ? static_cast(in_w - 1) / (out_w - 1) : 0.f; + + if (in_h == out_h && in_w == out_w) { + framework::TensorCopy(*output_grad, ctx.GetPlace(), input_grad); + return; + } + + int pixelNum = n * out_chw; + int grid_dim = (pixelNum + 512 - 1) / 512; + grid_dim = grid_dim > 8 ? 8 : grid_dim; + + if ("nearest" == interp_method) { + KeNearestNeighborInterpBw< + T><<>>( + input_grad_data, in_h, in_w, n, in_chw, output_grad_data, out_h, + out_w, n, out_chw, c, ratio_h, ratio_w); + } else if ("bilinear" == interp_method) { + KeBilinearInterpBw< + T><<>>( + input_grad_data, in_h, in_w, n, in_chw, output_grad_data, out_h, + out_w, n, out_chw, c, ratio_h, ratio_w); + } + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(interpolate, ops::InterpolateOpCUDAKernel, + ops::InterpolateOpCUDAKernel, + ops::InterpolateOpCUDAKernel); +REGISTER_OP_CUDA_KERNEL(interpolate_grad, + ops::InterpolateGradOpCUDAKernel, + ops::InterpolateGradOpCUDAKernel); diff --git a/paddle/fluid/operators/interpolate_op.h b/paddle/fluid/operators/interpolate_op.h new file mode 100644 index 0000000000000000000000000000000000000000..7fdb3e1f5a2ff82284d89dd0759e357978e1d873 --- /dev/null +++ b/paddle/fluid/operators/interpolate_op.h @@ -0,0 +1,236 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
*/ + +#pragma once +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/math_function.h" + +namespace paddle { +namespace operators { + +template +using EigenTensor = framework::EigenTensor; +using Tensor = framework::Tensor; + +template +static void NearestNeighborInterpolate(const Tensor& input, Tensor* output, + const float ratio_h, const float ratio_w, + const int n, const int c, + const int out_h, const int out_w) { + auto input_t = EigenTensor::From(input); + auto output_t = EigenTensor::From(*output); + for (int k = 0; k < out_h; k++) { // loop for images + int in_k = static_cast(ratio_h * k + 0.5); + + for (int l = 0; l < out_w; l++) { + int in_l = static_cast(ratio_w * l + 0.5); + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + output_t(i, j, k, l) = input_t(i, j, in_k, in_l); + } + } + } + } +} + +template +static void BilinearInterpolation(const Tensor& input, Tensor* output, + const float ratio_h, const float ratio_w, + const int in_h, const int in_w, const int n, + const int c, const int out_h, + const int out_w) { + auto input_t = EigenTensor::From(input); + auto output_t = EigenTensor::From(*output); + for (int k = 0; k < out_h; k++) { // loop for images + int y_n = static_cast(ratio_h * k); + int y_s = (y_n + 1) < (in_h - 1) ? (y_n + 1) : (in_h - 1); + float d_n = ratio_h * k - y_n; + float d_s = 1.f - d_n; + + for (int l = 0; l < out_w; l++) { + int x_w = static_cast(ratio_w * l); + int x_e = (x_w + 1) < (in_w - 1) ? (x_w + 1) : (in_w - 1); + float d_w = ratio_w * l - x_w; + float d_e = 1.f - d_w; + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + // bilinear interpolation + output_t(i, j, k, l) = input_t(i, j, y_n, x_w) * d_s * d_e + + input_t(i, j, y_s, x_w) * d_n * d_e + + input_t(i, j, y_n, x_e) * d_s * d_w + + input_t(i, j, y_s, x_e) * d_n * d_w; + } + } + } + } +} + +template +static void NearestNeighborInterpolateGrad(const Tensor& output_grad, + Tensor* input_grad, + const float ratio_h, + const float ratio_w, const int n, + const int c, const int out_h, + const int out_w) { + auto input_grad_t = EigenTensor::From(*input_grad); + auto output_grad_t = EigenTensor::From(output_grad); + for (int k = 0; k < out_h; k++) { // loop for images + int in_k = static_cast(ratio_h * k + 0.5); + + for (int l = 0; l < out_w; l++) { + int in_l = static_cast(ratio_w * l + 0.5); + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + input_grad_t(i, j, in_k, in_l) += output_grad_t(i, j, k, l); + } + } + } + } +} + +template +static void BilinearInterpolationGrad(const Tensor& output_grad, + Tensor* input_grad, const float ratio_h, + const float ratio_w, const int in_h, + const int in_w, const int n, const int c, + const int out_h, const int out_w) { + auto input_grad_t = EigenTensor::From(*input_grad); + auto output_grad_t = EigenTensor::From(output_grad); + for (int k = 0; k < out_h; k++) { // loop for images + int y_n = static_cast(ratio_h * k); + int y_s = (y_n + 1) < (in_h - 1) ? (y_n + 1) : (in_h - 1); + float d_n = ratio_h * k - y_n; + float d_s = 1.f - d_n; + + for (int l = 0; l < out_w; l++) { + int x_w = static_cast(ratio_w * l); + int x_e = (x_w + 1) < (in_w - 1) ? 
(x_w + 1) : (in_w - 1); + float d_w = ratio_w * l - x_w; + float d_e = 1.f - d_w; + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + // bilinear interpolation grad + const T grad = output_grad_t(i, j, k, l); + input_grad_t(i, j, y_n, x_w) += static_cast(grad * d_s * d_e); + input_grad_t(i, j, y_s, x_w) += static_cast(grad * d_n * d_e); + input_grad_t(i, j, y_n, x_e) += static_cast(grad * d_s * d_w); + input_grad_t(i, j, y_s, x_e) += static_cast(grad * d_n * d_w); + } + } + } + } +} + +template +class InterpolateKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("X"); + auto* output = ctx.Output("Out"); + + std::string interp_method = ctx.Attr("interp_method"); + int out_h = ctx.Attr("out_h"); + int out_w = ctx.Attr("out_w"); + auto out_size = ctx.Input("OutSize"); + if (out_size != nullptr) { + auto out_size_data = out_size->data(); + out_h = out_size_data[0]; + out_w = out_size_data[1]; + } + + const int n = input->dims()[0]; + const int c = input->dims()[1]; + const int in_h = input->dims()[2]; + const int in_w = input->dims()[3]; + + output->mutable_data({n, c, out_h, out_w}, ctx.GetPlace()); + auto& device_ctx = + ctx.template device_context(); + math::SetConstant zero; + zero(device_ctx, output, static_cast(0.0)); + + if (in_h == out_h && in_w == out_w) { + framework::TensorCopy(*input, ctx.GetPlace(), output); + return; + } + + float ratio_h = + (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; + float ratio_w = + (out_w > 1) ? static_cast(in_w - 1) / (out_w - 1) : 0.f; + + if ("bilinear" == interp_method) { + BilinearInterpolation(*input, output, ratio_h, ratio_w, in_h, in_w, n, + c, out_h, out_w); + } else if ("nearest" == interp_method) { + NearestNeighborInterpolate(*input, output, ratio_h, ratio_w, n, c, + out_h, out_w); + } + } +}; + +template +class InterpolateGradKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("X"); + auto* input_grad = ctx.Output(framework::GradVarName("X")); + auto* output_grad = ctx.Input(framework::GradVarName("Out")); + + std::string interp_method = ctx.Attr("interp_method"); + int out_h = ctx.Attr("out_h"); + int out_w = ctx.Attr("out_w"); + auto out_size = ctx.Input("OutSize"); + if (out_size != nullptr) { + auto out_size_data = out_size->data(); + out_h = out_size_data[0]; + out_w = out_size_data[1]; + } + + const int n = input->dims()[0]; + const int c = input->dims()[1]; + const int in_h = input->dims()[2]; + const int in_w = input->dims()[3]; + + input_grad->mutable_data({n, c, in_h, in_w}, ctx.GetPlace()); + auto& device_ctx = + ctx.template device_context(); + math::SetConstant zero; + zero(device_ctx, input_grad, static_cast(0.0)); + + if (in_h == out_h && in_w == out_w) { + framework::TensorCopy(*output_grad, ctx.GetPlace(), input_grad); + return; + } + + float ratio_h = + (out_h > 1) ? static_cast(in_h - 1) / (out_h - 1) : 0.f; + float ratio_w = + (out_w > 1) ? 
static_cast(in_w - 1) / (out_w - 1) : 0.f; + + if ("bilinear" == interp_method) { + BilinearInterpolationGrad(*output_grad, input_grad, ratio_h, ratio_w, + in_h, in_w, n, c, out_h, out_w); + } else if ("nearest" == interp_method) { + NearestNeighborInterpolateGrad(*output_grad, input_grad, ratio_h, + ratio_w, n, c, out_h, out_w); + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/layer_norm_op.h b/paddle/fluid/operators/layer_norm_op.h index 2e54bb497dec11eaeda03a1aa6acfd4cc261dbfe..78d20ddf5fd63b81fd5e7fba656d825897a67a11 100644 --- a/paddle/fluid/operators/layer_norm_op.h +++ b/paddle/fluid/operators/layer_norm_op.h @@ -15,8 +15,12 @@ limitations under the License. */ #pragma once #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/elementwise_op_function.h" +#include "paddle/fluid/operators/elementwise/elementwise_op_function.h" #include "paddle/fluid/operators/math/blas.h" +#if !defined(PADDLE_WITH_CUDA) && !defined(_WIN32) && !defined(__APPLE__) && \ + !defined(__OSX__) +#include "paddle/fluid/operators/math/jit_kernel.h" +#endif #include "paddle/fluid/operators/math/math_function.h" namespace paddle { @@ -191,6 +195,8 @@ class LayerNormKernel : public framework::OpKernel { out.ShareDataWith(*y); out.Resize(matrix_shape); +#if defined(PADDLE_WITH_CUDA) || defined(_WIN32) || defined(__APPLE__) || \ + defined(__OSX__) auto& dev_ctx = ctx.template device_context(); RowwiseMean2D row_mean(left, right, ctx.device_context()); @@ -217,6 +223,19 @@ class LayerNormKernel : public framework::OpKernel { ElementwiseComputeEx, DeviceContext, T>( ctx, &out, bias, /*axis*/ 1, AddFunctor(), &out); } +#else + PADDLE_ENFORCE_EQ(mean->numel(), left); + PADDLE_ENFORCE_EQ(var->numel(), left); + PADDLE_ENFORCE_EQ(scale->numel(), right); + PADDLE_ENFORCE_EQ(bias->numel(), right); + + const auto& ker = math::jitkernel::KernelPool::Instance() + .template Get>( + static_cast(right)); + ker->Compute(x.data(), out.data(), mean->data(), var->data(), + scale->data(), bias->data(), static_cast(left), + static_cast(epsilon)); +#endif } }; diff --git a/paddle/fluid/operators/load_op.cc b/paddle/fluid/operators/load_op.cc index 51219504ffa2a778b56351f759e8a8dfb951ad91..df1edc5c2e994b3093d6f6e7e4f6e0e5b2abb469 100644 --- a/paddle/fluid/operators/load_op.cc +++ b/paddle/fluid/operators/load_op.cc @@ -40,8 +40,9 @@ class LoadOp : public framework::OperatorBase { auto out_var_name = Output("Out"); auto *out_var = scope.FindVar(out_var_name); - PADDLE_ENFORCE(out_var != nullptr, "Output variable %s cannot be found", - out_var_name); + PADDLE_ENFORCE(out_var != nullptr, + "Output variable %s cannot be found in scope %p", + out_var_name, &scope); if (out_var->IsType()) { LoadLodTensor(fin, place, out_var); diff --git a/paddle/fluid/operators/lod_rank_table_op.cc b/paddle/fluid/operators/lod_rank_table_op.cc index 166952fe23192799443ef9c9d1f7ba5056d19290..59ef9cb626d61f918c8ad1990a0f25030fb44ec6 100644 --- a/paddle/fluid/operators/lod_rank_table_op.cc +++ b/paddle/fluid/operators/lod_rank_table_op.cc @@ -30,9 +30,9 @@ class LoDRankTableOp : public framework::OperatorBase { auto x = scope.FindVar(Input("X"))->Get(); auto *out = scope.FindVar(Output("Out"))->GetMutable(); - VLOG(10) << "Level = " << static_cast(Attr("level")); + VLOG(100) << "Level = " << static_cast(Attr("level")); out->Reset(x.lod(), static_cast(Attr("level"))); - VLOG(10) << Input("X") << "'s lod information is " << *out; + VLOG(100) << 
Input("X") << "'s lod information is " << *out; } }; diff --git a/paddle/fluid/operators/lod_tensor_to_array_op.cc b/paddle/fluid/operators/lod_tensor_to_array_op.cc index 8eab83fcd247fcd099ae1fa5dab1e67c2081bf9c..e72337a3e6f7884c3a05372e8732647e5910f3e4 100644 --- a/paddle/fluid/operators/lod_tensor_to_array_op.cc +++ b/paddle/fluid/operators/lod_tensor_to_array_op.cc @@ -17,7 +17,7 @@ limitations under the License. */ #include "paddle/fluid/framework/lod_tensor_array.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/detail/safe_ref.h" -#include "paddle/fluid/operators/math/concat.h" +#include "paddle/fluid/operators/math/concat_and_split.h" #include "paddle/fluid/platform/device_context.h" #include "paddle/fluid/platform/port.h" @@ -79,7 +79,7 @@ struct LoDTensorToArrayFunctor : public boost::static_visitor { template template void LoDTensorToArrayFunctorImpl::apply() { - math::ConcatGradFunctor func; + math::SplitFunctor func; func(*dev_ctx_, prev_functor_->input_, prev_functor_->ref_inputs_, 0, &prev_functor_->outputs_); } diff --git a/paddle/fluid/operators/lookup_sparse_table_op.cc b/paddle/fluid/operators/lookup_sparse_table_op.cc index de3f0990e109cacd49c4d888bbc1f797fb196e01..1b55527fd33e879c5c6fe702a53d0a23bebc2b14 100644 --- a/paddle/fluid/operators/lookup_sparse_table_op.cc +++ b/paddle/fluid/operators/lookup_sparse_table_op.cc @@ -45,6 +45,7 @@ class LookupSparseTableOp : public framework::OperatorBase { auto out_var = scope.FindVar(Output("Out")); auto w_var = scope.FindVar(Input("W")); auto ids_var = scope.FindVar(Input("Ids")); + auto is_test = Attr("is_test"); PADDLE_ENFORCE(out_var->IsType(), "The type of Out var should be LodTensor."); @@ -65,7 +66,8 @@ class LookupSparseTableOp : public framework::OperatorBase { PADDLE_ENFORCE_EQ(framework::ToDataType(w_t->value().type()), framework::proto::VarType::FP32, "The sparse table only support FP32"); - w_t->Get(ids_t, out_t, true); + w_t->Get(ids_t, out_t, true, is_test); + out_t->set_lod(ids_t.lod()); } }; @@ -91,6 +93,10 @@ class LookupSparseTableOpMaker : public framework::OpProtoAndCheckerMaker { "(bool default false)" "Whether create new value if for nonexistent key.") .SetDefault(true); + AddAttr("is_test", + "In test mode, lookup_sparse_table will " + "return a 0 for unknown id") + .SetDefault(false); AddComment(R"DOC( Lookup Sprase Tablel Operator. diff --git a/paddle/fluid/operators/lookup_table_op.cc b/paddle/fluid/operators/lookup_table_op.cc index b9ac54e446811889b647397ae1fbb11c28f46777..1878dfe8a897db1b8c948d325fa48a38ca224a2b 100644 --- a/paddle/fluid/operators/lookup_table_op.cc +++ b/paddle/fluid/operators/lookup_table_op.cc @@ -81,6 +81,12 @@ class LookupTableOpMaker : public framework::OpProtoAndCheckerMaker { "Otherwise the given value indicates padding the output " "with zeros whenever lookup encounters it in Ids.") .SetDefault(kNoPadding); + // NOTE(minqiyang): grad_inplace is an temporal attribute, + // please do NOT set this attribute in python layer. + AddAttr("grad_inplace", + "(boolean, default false) " + "If the grad op reuse the input's variable.") + .SetDefault(false); AddComment(R"DOC( Lookup Table Operator. 
@@ -115,7 +121,7 @@ class LookupTableOpGrad : public framework::OperatorWithKernel {
  protected:
   framework::OpKernelType GetExpectedKernelType(
       const framework::ExecutionContext& ctx) const override {
-    auto data_type = framework::GetDataTypeOfVar(ctx.InputVar("W"));
+    auto data_type = framework::GetDataTypeOfVar(ctx.InputVar("Out"));
     return framework::OpKernelType(data_type, ctx.device_context());
   }
 };
@@ -128,13 +134,13 @@ class LookupTableOpGradVarTypeInference : public framework::VarTypeInference {
     auto attr = op_desc.GetAttr("is_sparse");
     bool is_sparse = boost::get<bool>(attr);
     if (is_sparse) {
-      VLOG(3) << "lookup_table_grad op " << framework::GradVarName("W")
-              << " is set to SelectedRows";
+      VLOG(30) << "lookup_table_grad op " << framework::GradVarName("W")
+               << " is set to SelectedRows";
       block->Var(out_var_name)
           ->SetType(framework::proto::VarType::SELECTED_ROWS);
     } else {
-      VLOG(3) << "lookup_table_grad op " << framework::GradVarName("W")
-              << " is set to LoDTensor";
+      VLOG(30) << "lookup_table_grad op " << framework::GradVarName("W")
+               << " is set to LoDTensor";
       block->Var(out_var_name)->SetType(framework::proto::VarType::LOD_TENSOR);
     }
     block->Var(out_var_name)->SetDataType(block->Var("W")->GetDataType());
diff --git a/paddle/fluid/operators/lookup_table_op.h b/paddle/fluid/operators/lookup_table_op.h
index 58463dc4d6fd7cc3454de766814a947fee161070..e504c4f0cd5c0feaef4a251fad57b389a10a2ce7 100644
--- a/paddle/fluid/operators/lookup_table_op.h
+++ b/paddle/fluid/operators/lookup_table_op.h
@@ -21,6 +21,7 @@ limitations under the License. */
 #include "paddle/fluid/framework/lod_tensor.h"
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/framework/selected_rows.h"
+#include "paddle/fluid/operators/math/blas.h"
 
 namespace paddle {
 namespace operators {
@@ -68,6 +69,7 @@ class LookupTableKernel : public framework::OpKernel<T> {
       const auto *table = table_t.value().data<T>();
       auto *output = output_t->mutable_data<T>(context.GetPlace());
 
+      auto blas = math::GetBlas<platform::CPUDeviceContext, T>(context);
       for (int64_t i = 0; i < ids_numel; ++i) {
         if (padding_idx != kNoPadding && ids[i] == padding_idx) {
           memset(output + i * row_width, 0, row_width * sizeof(T));
@@ -75,8 +77,8 @@ class LookupTableKernel : public framework::OpKernel<T> {
           PADDLE_ENFORCE_GE(ids[i], 0);
           auto id_index = table_t.Index(ids[i]);
           PADDLE_ENFORCE_GE(id_index, 0, "the input key should exist.");
-          memcpy(output + i * row_width, table + id_index * row_width,
-                 row_width * sizeof(T));
+          blas.VCOPY(row_width, table + id_index * row_width,
+                     output + i * row_width);
         }
       }
     }
@@ -111,27 +113,37 @@ class LookupTableGradKernel : public framework::OpKernel<T> {
       auto *ids_data = ids->data<int64_t>();
       int64_t ids_num = ids->numel();
 
-      framework::Vector<int64_t> new_rows;
-      new_rows.reserve(ids_num);
-      for (int64_t i = 0; i < ids_num; i++) {
-        new_rows.push_back(ids_data[i]);
-      }
+      std::vector<int64_t> new_rows;
+      new_rows.resize(ids_num);
+      std::memcpy(&new_rows[0], ids_data, ids_num * sizeof(int64_t));
       d_table->set_rows(new_rows);
 
       auto *d_table_value = d_table->mutable_value();
       d_table_value->Resize({ids_num, table_dim[1]});
-      d_table_value->mutable_data<T>(context.GetPlace());
-
-      d_table->set_height(table_dim[0]);
-
-      auto *d_output_data = d_output->data<T>();
-      auto *d_table_data = d_table_value->data<T>();
-
-      auto d_output_dims = d_output->dims();
-      PADDLE_ENFORCE_EQ(
-          d_table_value->dims(),
-          framework::flatten_to_2d(d_output_dims, d_output_dims.size() - 1));
-      memcpy(d_table_data, d_output_data, sizeof(T) * d_output->numel());
+      // FIXME(minqiyang):
+      // memory optimization will NOT
reuse Tensor with SelectedRows + // so we could just share the tensor here directly. + // However, the InferVarType method will infer the output SelectedRows + // to Tensor sometimes, which is a bug, so we will add an attribute + // here to indicate the inplace and remove this attribute after + // the InferVarType's bug was fixed + bool grad_inplace = context.Attr("grad_inplace"); + if (grad_inplace) { + d_table_value->ShareDataWith(*d_output); + } else { + d_table_value->mutable_data(context.GetPlace()); + + d_table->set_height(table_dim[0]); + + auto *d_output_data = d_output->data(); + auto *d_table_data = d_table_value->data(); + + auto d_output_dims = d_output->dims(); + PADDLE_ENFORCE_EQ( + d_table_value->dims(), + framework::flatten_to_2d(d_output_dims, d_output_dims.size() - 1)); + memcpy(d_table_data, d_output_data, sizeof(T) * d_output->numel()); + } } else { auto *ids = context.Input("Ids"); auto *d_output = context.Input(framework::GradVarName("Out")); diff --git a/paddle/fluid/operators/lrn_op.cc b/paddle/fluid/operators/lrn_op.cc index 52b459a6a2e56b7c256efdb535b4652c64bae23c..a3bb2be5c7af5b85fa9785c5e64ac314feda8b78 100644 --- a/paddle/fluid/operators/lrn_op.cc +++ b/paddle/fluid/operators/lrn_op.cc @@ -14,6 +14,7 @@ limitations under the License. */ #include "paddle/fluid/operators/lrn_op.h" #include +#include "paddle/fluid/operators/math/blas.h" #ifdef PADDLE_WITH_MKLDNN #include "paddle/fluid/platform/mkldnn_helper.h" #endif @@ -29,34 +30,43 @@ struct LRNFunctor { const framework::Tensor& input, framework::Tensor* out, framework::Tensor* mid, int N, int C, int H, int W, int n, T k, T alpha, T beta) { - auto x_v = framework::EigenVector::Flatten(input); - - const int start = -(n - 1) / 2; - const int end = start + n; - - auto e_mid = framework::EigenTensor::From(*mid); - e_mid = e_mid.constant(k); - - auto e_x = framework::EigenTensor::From(input); - for (int m = 0; m < N; m++) { - for (int i = 0; i < C; i++) { - for (int c = start; c < end; c++) { - int ch = i + c; - if (ch >= 0 && ch < C) { - auto s = e_mid.slice(Eigen::array({{m, i, 0, 0}}), - Eigen::array({{1, 1, H, W}})); - - auto r = e_x.slice(Eigen::array({{m, ch, 0, 0}}), - Eigen::array({{1, 1, H, W}})); - - s += alpha * r.square(); - } - } + const T* idata = input.data(); + auto place = ctx.GetPlace(); + auto blas = math::GetBlas(ctx); + T* odata = out->mutable_data(place); + T* mdata = mid->mutable_data(place); + Tensor squared; + T* sdata = squared.mutable_data({1, C + n - 1, H, W}, place); + std::memset(sdata, 0, sizeof(T) * squared.numel()); + for (int i = 0; i < mid->numel(); ++i) { + mdata[i] = k; + } + int img_size = H * W; + int fea_size = C * img_size; + int pre_pad = (n - 1) / 2; + // compute batches one by one + for (int i = 0; i < N; ++i) { + blas.VSQUARE(fea_size, idata + i * fea_size, sdata + pre_pad * img_size); + // init the first channel of mid + for (int c = 0; c < n; ++c) { + blas.AXPY(img_size, alpha, sdata + c * img_size, mdata + i * fea_size); + } + for (int c = 1; c < C; ++c) { + // copy previous scale + int mid_offset = i * fea_size + c * img_size; + std::memcpy(mdata + mid_offset, mdata + mid_offset - img_size, + img_size * sizeof(T)); + // add last + blas.AXPY(img_size, alpha, sdata + (c + n - 1) * img_size, + mdata + mid_offset); + // sub rest + blas.AXPY(img_size, -alpha, sdata + (c - 1) * img_size, + mdata + mid_offset); } } - - auto out_e = framework::EigenVector::Flatten(*out); - out_e = x_v * e_mid.reshape(Eigen::DSizes(e_mid.size())).pow(-beta); + // compute the final output + 
blas.VPOW(mid->numel(), mdata, -beta, odata);
+    blas.VMUL(mid->numel(), odata, idata, odata);
   }
 };
 template struct LRNFunctor<platform::CPUDeviceContext, float>;
@@ -156,6 +166,9 @@ class LRNOp : public framework::OperatorWithKernel {
     auto x_dim = ctx->GetInputDim("X");
     PADDLE_ENFORCE_EQ(x_dim.size(), 4, "Input(X)'s rank of LRNOp should be 4.");
 
+    int n = ctx->Attrs().Get<int>("n");
+    PADDLE_ENFORCE(n > 0 && n % 2 == 1, "n should be a positive odd value");
+
     ctx->SetOutputDim("Out", x_dim);
     ctx->ShareLoD("X", /*->*/ "Out");
     ctx->SetOutputDim("MidOut", x_dim);
@@ -216,8 +229,8 @@ class LRNOpMaker : public framework::OpProtoAndCheckerMaker {
              "the input will be transformed automatically. ")
         .SetDefault("AnyLayout");
     AddAttr<bool>("is_test",
-                  "Turns on memory optimization that optimizes away "
-                  "unnecessary memory allocations. Used by MKLDNN.")
+                  "(bool, default false) Set to true for inference only, false "
+                  "for training. Some layers may run faster when this is true.")
         .SetDefault(false);
 
     AddComment(R"DOC(
diff --git a/paddle/fluid/operators/lrn_op.h b/paddle/fluid/operators/lrn_op.h
index 0fd3175e8579df9e61368cc151a94fa45e433884..12d39c3815395896343238b536110aecac66a376 100644
--- a/paddle/fluid/operators/lrn_op.h
+++ b/paddle/fluid/operators/lrn_op.h
@@ -60,7 +60,6 @@ class LRNKernel : public framework::OpKernel<T> {
     T beta = ctx.Attr<T>("beta");
     T k = ctx.Attr<T>("k");
 
-    PADDLE_ENFORCE(n > 0, "n should >= 0");
     PADDLE_ENFORCE(alpha >= 0.0, "alpha should >= 0.0");
     PADDLE_ENFORCE(beta >= 0.0, "beta should >= 0.0");
     PADDLE_ENFORCE(k >= 0.0, "k should >= 0.0");
diff --git a/paddle/fluid/operators/math/CMakeLists.txt b/paddle/fluid/operators/math/CMakeLists.txt
index 7365bfeeb8edf09a8ad5e1cb2c61300e86bdf518..63363086adbf12c38ac09949ac20483116ccf4ee 100644
--- a/paddle/fluid/operators/math/CMakeLists.txt
+++ b/paddle/fluid/operators/math/CMakeLists.txt
@@ -1,6 +1,4 @@
-if (NOT WIN32)
 add_subdirectory(detail)
-endif(NOT WIN32)
 
 function(math_library TARGET)
     # math_library is a function to create math library.
@@ -35,17 +33,16 @@ function(math_library TARGET)
 endfunction()
 
 # please add new math_library in alphabetical order
-math_library(concat)
+math_library(concat_and_split)
 math_library(context_project DEPS im2col math_function)
 math_library(cross_entropy)
 math_library(cos_sim_functor)
 math_library(depthwise_conv)
 math_library(im2col)
+math_library(sampler)
 
-if (NOT WIN32) # windows do not support avx functions yet.
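# For orientation (an illustrative reading of the function defined above, not
# a line from the original build script): a call such as
#   math_library(concat_and_split DEPS some_dep)
# creates a library target named concat_and_split built from
# concat_and_split.cc (and concat_and_split.cu when CUDA is enabled),
# linking the libraries listed after DEPS.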
math_library(gru_compute DEPS activation_functions math_function) math_library(lstm_compute DEPS activation_functions) -endif (NOT WIN32) cc_library(blas SRCS blas.cc DEPS cblas framework_proto device_context) math_library(math_function DEPS blas) @@ -57,9 +54,9 @@ math_library(sequence_padding) math_library(sequence_pooling DEPS math_function) math_library(sequence_scale) math_library(softmax DEPS math_function) -if (NOT WIN32) + math_library(matrix_bit_code) -endif (NOT WIN32) + math_library(unpooling) math_library(vol2col) @@ -68,13 +65,19 @@ cc_test(selected_rows_functor_test SRCS selected_rows_functor_test.cc DEPS selec cc_test(im2col_test SRCS im2col_test.cc DEPS im2col) cc_test(vol2col_test SRCS vol2col_test.cc DEPS vol2col) cc_test(sequence_padding_test SRCS sequence_padding_test.cc DEPS sequence_padding) +cc_test(sequence_pooling_test SRCS sequence_pooling_test.cc DEPS sequence_pooling) if(WITH_GPU) nv_test(math_function_gpu_test SRCS math_function_test.cu DEPS math_function) - nv_test(selected_rows_functor_gpu_test SRCS selected_rows_functor_test.cu DEPS selected_rows_functor math_function) + nv_test(selected_rows_functor_gpu_test SRCS selected_rows_functor_test.cu.cc DEPS selected_rows_functor math_function) endif() -cc_test(concat_test SRCS concat_test.cc DEPS concat) +cc_test(concat_test SRCS concat_test.cc DEPS concat_and_split) cc_test(cpu_vec_test SRCS cpu_vec_test.cc DEPS blas cpu_info) -cc_library(jit_kernel - SRCS jit_kernel.cc jit_kernel_blas.cc jit_kernel_exp.cc jit_kernel_lstm.cc - DEPS cpu_info cblas activation_functions) + +set(JIT_KERNEL_SRCS jit_kernel.cc jit_kernel_blas.cc jit_kernel_exp.cc jit_kernel_rnn.cc jit_kernel_crf_decode.cc jit_kernel_layer_norm.cc) +set(JIT_KERNEL_DEPS cpu_info cblas gflags enforce) +if(WITH_XBYAK) + list(APPEND JIT_KERNEL_SRCS jit_gen.cc jit_code.cc) + list(APPEND JIT_KERNEL_DEPS xbyak) +endif() +cc_library(jit_kernel SRCS ${JIT_KERNEL_SRCS} DEPS ${JIT_KERNEL_DEPS}) cc_test(jit_kernel_test SRCS jit_kernel_test.cc DEPS jit_kernel) diff --git a/paddle/fluid/operators/math/algorithm.h b/paddle/fluid/operators/math/algorithm.h index 262469beea7449eb5820b86de1ac4f790a833e79..2e75b6abce5e1f43742ee15bff1dac4801186cd4 100644 --- a/paddle/fluid/operators/math/algorithm.h +++ b/paddle/fluid/operators/math/algorithm.h @@ -39,6 +39,52 @@ HOSTDEVICE inline int64_t BinarySearch(const T *x, int64_t num, const T &val) { return -1; } +template +HOSTDEVICE inline size_t LowerBound(const T *x, size_t num, const T &val) { +#ifdef __CUDA_ARCH__ + // The following code is from + // https://en.cppreference.com/w/cpp/algorithm/lower_bound + auto *first = x; + int64_t count = static_cast(num); + while (count > 0) { + int64_t step = (count >> 1); + auto *it = first + step; + if (*it < val) { + first = ++it; + count -= (step + 1); + } else { + count = step; + } + } + return static_cast(first - x); +#else + return static_cast(std::lower_bound(x, x + num, val) - x); +#endif +} + +template +HOSTDEVICE inline size_t UpperBound(const T *x, size_t num, const T &val) { +#ifdef __CUDA_ARCH__ + // The following code is from + // https://en.cppreference.com/w/cpp/algorithm/upper_bound + auto *first = x; + int64_t count = static_cast(num); + while (count > 0) { + auto step = (count >> 1); + auto *it = first + step; + if (val < *it) { + count = step; + } else { + first = ++it; + count -= (step + 1); + } + } + return static_cast(first - x); +#else + return static_cast(std::upper_bound(x, x + num, val) - x); +#endif +} + } // namespace math } // namespace operators } // 
namespace paddle diff --git a/paddle/fluid/operators/math/blas.h b/paddle/fluid/operators/math/blas.h index da185d93c09f9b06bd5968b9c8e93176f9ef014b..6734df1530893777fca3ccf66b1e8aab40e41cfc 100644 --- a/paddle/fluid/operators/math/blas.h +++ b/paddle/fluid/operators/math/blas.h @@ -152,6 +152,12 @@ class Blas { template void VEXP(int n, const T* x, T* y) const; + template + void VSQUARE(int n, const T* x, T* y) const; + + template + void VPOW(int n, const T* x, T alpha, T* y) const; + template void GEMV(bool trans_a, int M, int N, T alpha, const T* A, const T* B, T beta, T* C) const; @@ -238,6 +244,16 @@ class BlasT : private Blas { Base()->template VEXP(args...); } + template + void VSQUARE(ARGS... args) const { + Base()->template VSQUARE(args...); + } + + template + void VPOW(ARGS... args) const { + Base()->template VPOW(args...); + } + template void GEMV(ARGS... args) const { Base()->template GEMV(args...); diff --git a/paddle/fluid/operators/math/blas_impl.cu.h b/paddle/fluid/operators/math/blas_impl.cu.h index d84c88cb3bc1a13acb83b3444dbd1bfca3cba503..d35073029a3440d8a17e383ce97fcfc582663888 100644 --- a/paddle/fluid/operators/math/blas_impl.cu.h +++ b/paddle/fluid/operators/math/blas_impl.cu.h @@ -16,6 +16,9 @@ #include "paddle/fluid/operators/math/math_function.h" #include "paddle/fluid/platform/dynload/cublas.h" +#include "paddle/fluid/platform/gpu_info.h" + +DECLARE_bool(enable_cublas_tensor_op_math); namespace paddle { namespace operators { @@ -42,11 +45,44 @@ struct CUBlas { } template - static void GEMM_BATCH(ARGS... args) { + static void GEMM_STRIDED_BATCH(ARGS... args) { #if CUDA_VERSION >= 8000 PADDLE_ENFORCE(platform::dynload::cublasSgemmStridedBatched(args...)); #else PADDLE_THROW("SgemmStridedBatched is not supported on cuda <= 7.5"); +#endif + } + + // NOTES: GEMM_EX can use Tensor Core to accelerate matrix multiply. + // https://docs.nvidia.com/cuda/cublas/index.html#cublassetmathmode + template + static void GEMM_EX(platform::CUDADeviceContext *dev_ctx, + cublasOperation_t transa, cublasOperation_t transb, int m, + int n, int k, const float *alpha, const void *A, + cudaDataType_t Atype, int lda, const void *B, + cudaDataType_t Btype, int ldb, const float *beta, void *C, + cudaDataType_t Ctype, int ldc) { + // Because the gcc 4.8 doesn't expand template parameter pack that + // appears in a lambda-expression, I can not use template parameter pack + // here. + auto cublas_call = [&]() { +#if CUDA_VERSION >= 8000 + VLOG(5) << "use_tensor_op_math: " + << (platform::TensorCoreAvailable() ? "True" : "False"); + PADDLE_ENFORCE(platform::dynload::cublasSgemmEx( + dev_ctx->cublas_handle(), transa, transb, m, n, k, alpha, A, Atype, + lda, B, Btype, ldb, beta, C, Ctype, ldc)); +#else + PADDLE_THROW("cublasSgemmEx is supported on cuda >= 8.0"); +#endif + }; + +#if CUDA_VERSION >= 9000 + // NOTES: To use Tensor Core, we should change the cublas config, + // but the cublas may be hold by multi-thread. + dev_ctx->CublasCall(cublas_call, CUBLAS_TENSOR_OP_MATH); +#else + cublas_call(); #endif } }; @@ -69,13 +105,18 @@ struct CUBlas { } template - static void GEMM_BATCH(ARGS... args) { + static void GEMM_STRIDED_BATCH(ARGS... args) { #if CUDA_VERSION >= 8000 PADDLE_ENFORCE(platform::dynload::cublasDgemmStridedBatched(args...)); #else PADDLE_THROW("DgemmStridedBatched is not supported on cuda <= 7.5"); #endif } + + template + static void GEMM_EX(ARGS... 
args) { + PADDLE_THROW("Currently there are not cublasDgemmEx."); + } }; template <> @@ -96,14 +137,16 @@ struct CUBlas { reinterpret_cast<__half *>(C), ldc)); } - static void GEMM_BATCH(cublasHandle_t handle, cublasOperation_t transa, - cublasOperation_t transb, int m, int n, int k, - const float16 *alpha, const float16 *A, int lda, - long long int strideA, const float16 *B, // NOLINT - int ldb, long long int strideB, // NOLINT - const float16 *beta, float16 *C, int ldc, - long long int strideC, // NOLINT - int batchCount) { + static void GEMM_STRIDED_BATCH(cublasHandle_t handle, + cublasOperation_t transa, + cublasOperation_t transb, int m, int n, int k, + const float16 *alpha, const float16 *A, + int lda, long long int strideA, // NOLINT + const float16 *B, // NOLINT + int ldb, long long int strideB, // NOLINT + const float16 *beta, float16 *C, int ldc, + long long int strideC, // NOLINT + int batchCount) { #if CUDA_VERSION >= 8000 PADDLE_ENFORCE(platform::dynload::cublasHgemmStridedBatched( handle, transa, transb, m, n, k, @@ -114,6 +157,45 @@ struct CUBlas { ldc, strideC, batchCount)); #else PADDLE_THROW("HgemmStridedBatched is not supported on cuda <= 7.5"); +#endif + } + + // NOTES: GEMM_EX can use Tensor Core to accelerate matrix multiply. + // https://docs.nvidia.com/cuda/cublas/index.html#cublassetmathmode + template + static void GEMM_EX(platform::CUDADeviceContext *dev_ctx, + cublasOperation_t transa, cublasOperation_t transb, int m, + int n, int k, const void *alpha, const void *A, + cudaDataType_t Atype, int lda, const void *B, + cudaDataType_t Btype, int ldb, const void *beta, void *C, + cudaDataType_t Ctype, int ldc, + cudaDataType_t computeType) { + auto cublas_call = [&]() { +#if CUDA_VERSION >= 8000 + cublasGemmAlgo_t algo = CUBLAS_GEMM_DFALT; +#if CUDA_VERSION >= 9000 + bool use_tensor_op_math = platform::TensorCoreAvailable(); + if (use_tensor_op_math) { + algo = CUBLAS_GEMM_DFALT_TENSOR_OP; + } + VLOG(5) << "use_tensor_op_math: " + << (use_tensor_op_math ? "True" : "False"); +#endif // CUDA_VERSION >= 9000 + + PADDLE_ENFORCE(platform::dynload::cublasGemmEx( + dev_ctx->cublas_handle(), transa, transb, m, n, k, alpha, A, Atype, + lda, B, Btype, ldb, beta, C, Ctype, ldc, computeType, algo)); +#else + PADDLE_THROW("cublasGemmEx is supported on cuda >= 8.0"); +#endif + }; + +#if CUDA_VERSION >= 9000 + // NOTES: To use Tensor Core, we should change the cublas config, + // but the cublas may be hold by multi-thread. + dev_ctx->CublasCall(cublas_call, CUBLAS_TENSOR_OP_MATH); +#else + cublas_call(); #endif } }; @@ -133,8 +215,21 @@ void Blas::GEMM(CBLAS_TRANSPOSE transA, cublasOperation_t cuTransB = (transB == CblasNoTrans) ? 
CUBLAS_OP_N : CUBLAS_OP_T; - CUBlas::GEMM(context_.cublas_handle(), cuTransB, cuTransA, N, M, K, &alpha, - B, ldb, A, lda, &beta, C, N); +#if CUDA_VERSION >= 8000 + if (FLAGS_enable_cublas_tensor_op_math && std::is_same::value) { + auto &cuda_ctx = const_cast(context_); + CUBlas::GEMM_EX(&cuda_ctx, cuTransB, cuTransA, N, M, K, &alpha, B, + CUDA_R_32F, ldb, A, CUDA_R_32F, lda, &beta, C, + CUDA_R_32F, N); + } else { +#endif // CUDA_VERSION >= 8000 + + CUBlas::GEMM(context_.cublas_handle(), cuTransB, cuTransA, N, M, K, + &alpha, B, ldb, A, lda, &beta, C, N); + +#if CUDA_VERSION >= 8000 + } +#endif // CUDA_VERSION >= 8000 } template <> @@ -157,30 +252,18 @@ inline void Blas::GEMM( PADDLE_ENFORCE_GE(context_.GetComputeCapability(), 53, "cublas fp16 gemm requires GPU compute capability >= 53"); -#if CUDA_VERSION >= 8000 float h_alpha = static_cast(alpha); float h_beta = static_cast(beta); - cublasGemmAlgo_t algo = CUBLAS_GEMM_DFALT; -#if CUDA_VERSION >= 9000 - if (context_.GetComputeCapability() >= 70) { - PADDLE_ENFORCE(platform::dynload::cublasSetMathMode( - context_.cublas_handle(), CUBLAS_TENSOR_OP_MATH)); - algo = CUBLAS_GEMM_DFALT_TENSOR_OP; - } else { - PADDLE_ENFORCE(platform::dynload::cublasSetMathMode( - context_.cublas_handle(), CUBLAS_DEFAULT_MATH)); - } -#endif // CUDA_VERSION >= 9000 - +#if CUDA_VERSION >= 8000 // cublasHgemm does true FP16 computation which is slow for non-Volta // GPUs. So use cublasGemmEx instead which does pesudo FP16 computation: // input/output in fp16, computation in fp32, which can also be accelerated // using tensor cores in volta GPUs. - PADDLE_ENFORCE(platform::dynload::cublasGemmEx( - context_.cublas_handle(), cuTransB, cuTransA, N, M, K, &h_alpha, B, - CUDA_R_16F, ldb, A, CUDA_R_16F, lda, &h_beta, C, CUDA_R_16F, N, - CUDA_R_32F, algo)); + auto &cuda_ctx = const_cast(context_); + CUBlas::GEMM_EX( + &cuda_ctx, cuTransB, cuTransA, N, M, K, &h_alpha, B, CUDA_R_16F, ldb, A, + CUDA_R_16F, lda, &h_beta, C, CUDA_R_16F, N, CUDA_R_32F); #else // CUDA 7.5 does not support cublasGemmEx, hence we fall back to use hgemm CUBlas::GEMM(context_.cublas_handle(), cuTransB, cuTransA, @@ -199,8 +282,38 @@ void Blas::GEMM(bool transA, bool transB, int M, // the cblas convention. cublasOperation_t cuTransA = transA ? CUBLAS_OP_T : CUBLAS_OP_N; cublasOperation_t cuTransB = transB ? CUBLAS_OP_T : CUBLAS_OP_N; - CUBlas::GEMM(context_.cublas_handle(), cuTransB, cuTransA, N, M, K, &alpha, - B, ldb, A, lda, &beta, C, ldc); + +#if CUDA_VERSION >= 8000 + if (FLAGS_enable_cublas_tensor_op_math && std::is_same::value) { + auto &cuda_ctx = const_cast(context_); + CUBlas::GEMM_EX(&cuda_ctx, cuTransB, cuTransA, N, M, K, &alpha, B, + CUDA_R_32F, ldb, A, CUDA_R_32F, lda, &beta, C, + CUDA_R_32F, ldc); + } else { +#endif // CUDA_VERSION >= 8000 + + CUBlas::GEMM(context_.cublas_handle(), cuTransB, cuTransA, N, M, K, + &alpha, B, ldb, A, lda, &beta, C, ldc); + +#if CUDA_VERSION >= 8000 + } +#endif // CUDA_VERSION >= 8000 +} + +template <> +template <> +inline void Blas::GEMM( + bool transA, bool transB, int M, int N, int K, platform::float16 alpha, + const platform::float16 *A, int lda, const platform::float16 *B, int ldb, + platform::float16 beta, platform::float16 *C, int ldc) const { + // Note that cublas follows fortran order, so the order is different from + // the cblas convention. + cublasOperation_t cuTransA = transA ? CUBLAS_OP_T : CUBLAS_OP_N; + cublasOperation_t cuTransB = transB ? 
CUBLAS_OP_T : CUBLAS_OP_N; + + CUBlas::GEMM(context_.cublas_handle(), cuTransB, cuTransA, + N, M, K, &alpha, B, ldb, A, lda, &beta, C, + ldc); } template <> @@ -238,9 +351,34 @@ void Blas::BatchedGEMM( (transB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T; const int64_t strideC = M * N; - CUBlas::GEMM_BATCH(context_.cublas_handle(), cuTransB, cuTransA, N, M, K, - &alpha, B, ldb, strideB, A, lda, strideA, &beta, C, ldc, - strideC, batchCount); +#if CUDA_VERSION >= 9010 + if (FLAGS_enable_cublas_tensor_op_math && std::is_same::value) { + auto cublas_call = [&]() { + cublasGemmAlgo_t algo = CUBLAS_GEMM_DFALT; + bool use_tensor_op_math = platform::TensorCoreAvailable(); + if (use_tensor_op_math) { + algo = CUBLAS_GEMM_DFALT_TENSOR_OP; + } + VLOG(5) << "use_tensor_op_math: " + << (use_tensor_op_math ? "True" : "False"); + + PADDLE_ENFORCE(platform::dynload::cublasGemmStridedBatchedEx( + context_.cublas_handle(), cuTransB, cuTransA, N, M, K, &alpha, B, + CUDA_R_32F, ldb, strideB, A, CUDA_R_32F, lda, strideA, &beta, C, + CUDA_R_32F, ldc, strideC, batchCount, CUDA_R_32F, algo)); + }; + auto &dev_ctx = const_cast(context_); + dev_ctx.CublasCall(cublas_call, CUBLAS_TENSOR_OP_MATH); + } else { +#endif // CUDA_VERSION >= 9010 + + CUBlas::GEMM_STRIDED_BATCH(context_.cublas_handle(), cuTransB, cuTransA, + N, M, K, &alpha, B, ldb, strideB, A, lda, + strideA, &beta, C, ldc, strideC, batchCount); + +#if CUDA_VERSION >= 9010 + } +#endif // CUDA_VERSION >= 9010 } } // namespace math diff --git a/paddle/fluid/operators/math/blas_impl.h b/paddle/fluid/operators/math/blas_impl.h index e1df78d11e41c5f74e244643f40c6d0581fa6a4a..93bf7c7c88db36807143b136ea800d6e5e49dd43 100644 --- a/paddle/fluid/operators/math/blas_impl.h +++ b/paddle/fluid/operators/math/blas_impl.h @@ -12,6 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. #pragma once +#include #include #include #include "paddle/fluid/operators/math/math_function.h" @@ -102,6 +103,16 @@ struct CBlas { static void VEXP(ARGS... args) { platform::dynload::vsExp(args...); } + + template + static void VSQUARE(ARGS... args) { + platform::dynload::vsSqr(args...); + } + + template + static void VPOW(ARGS... args) { + platform::dynload::vsPowx(args...); + } }; template <> @@ -182,6 +193,16 @@ struct CBlas { static void VEXP(ARGS... args) { platform::dynload::vdExp(args...); } + + template + static void VSQUARE(ARGS... args) { + platform::dynload::vdSqr(args...); + } + + template + static void VPOW(ARGS... args) { + platform::dynload::vdPowx(args...); + } }; #else @@ -241,6 +262,10 @@ struct CBlas { } static void VMUL(...) { PADDLE_THROW("float16 VMUL not supported on CPU"); } static void VEXP(...) { PADDLE_THROW("float16 VEXP not supported on CPU"); } + static void VSQUARE(...) { + PADDLE_THROW("float16 VSQUARE not supported on CPU"); + } + static void VPOW(...) { PADDLE_THROW("float16 VPOW not supported on CPU"); } static void DOT(...) { PADDLE_THROW("float16 DOT not supported on CPU"); }; static void SCAL(...) 
{ PADDLE_THROW("float16 SCAL not supported on CPU"); }; #ifdef PADDLE_WITH_MKLML @@ -398,6 +423,31 @@ void Blas::VEXP(int n, const T *x, T *y) const { #endif } +template <> +template +void Blas::VSQUARE(int n, const T *x, T *y) const { +#ifdef PADDLE_WITH_MKLML + CBlas::VSQUARE(n, x, y); +#else + for (int i = 0; i < n; ++i) { + y[i] = x[i] * x[i]; + } +#endif +} + +template <> +template +void Blas::VPOW(int n, const T *x, T a, + T *y) const { +#ifdef PADDLE_WITH_MKLML + CBlas::VPOW(n, x, a, y); +#else + for (int i = 0; i < n; ++i) { + y[i] = std::pow(x[i], a); + } +#endif +} + template <> template T Blas::DOT(int n, const T *x, const T *y) const { diff --git a/paddle/fluid/operators/math/concat.cc b/paddle/fluid/operators/math/concat_and_split.cc similarity index 95% rename from paddle/fluid/operators/math/concat.cc rename to paddle/fluid/operators/math/concat_and_split.cc index 7b79f10e33d4474e279c6e46208722d6b52277fc..c6e17fd042f19bbeee3507e4cd64f49cff369682 100644 --- a/paddle/fluid/operators/math/concat.cc +++ b/paddle/fluid/operators/math/concat_and_split.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/math/concat.h" +#include "paddle/fluid/operators/math/concat_and_split.h" #include namespace paddle { @@ -67,7 +67,7 @@ class ConcatFunctor { * each dimension must be the same, except the axis dimension. */ template -class ConcatGradFunctor { +class SplitFunctor { public: void operator()(const platform::CPUDeviceContext& context, const framework::Tensor& input, @@ -111,7 +111,7 @@ class ConcatGradFunctor { }; #define DEFINE_FUNCTOR(type) \ template class ConcatFunctor; \ - template class ConcatGradFunctor; + template class SplitFunctor; FOR_ALL_TYPES(DEFINE_FUNCTOR); diff --git a/paddle/fluid/operators/math/concat.cu b/paddle/fluid/operators/math/concat_and_split.cu similarity index 90% rename from paddle/fluid/operators/math/concat.cu rename to paddle/fluid/operators/math/concat_and_split.cu index b59d86e661aff25eba8e770247e85845365d628b..760a065c1081d1e55901774b258ba524471b856b 100644 --- a/paddle/fluid/operators/math/concat.cu +++ b/paddle/fluid/operators/math/concat_and_split.cu @@ -15,7 +15,7 @@ limitations under the License. 
*/ #include #include #include "paddle/fluid/framework/mixed_vector.h" -#include "paddle/fluid/operators/math/concat.h" +#include "paddle/fluid/operators/math/concat_and_split.h" #include "paddle/fluid/platform/cuda_primitives.h" #include "paddle/fluid/platform/float16.h" @@ -24,7 +24,7 @@ namespace operators { namespace math { template -__global__ void KernelConcat(T** inputs, const int* input_cols, int col_size, +__global__ void ConcatKernel(T** inputs, const int* input_cols, int col_size, const int output_rows, const int output_cols, T* output) { int tid_x = blockIdx.x * blockDim.x + threadIdx.x; @@ -50,7 +50,7 @@ __global__ void KernelConcat(T** inputs, const int* input_cols, int col_size, } template -__global__ void KernelConcat(T** inputs_data, const int fixed_in_col, +__global__ void ConcatKernel(T** inputs_data, const int fixed_in_col, const int out_rows, const int out_cols, T* output_data) { int tid_x = blockIdx.x * blockDim.x + threadIdx.x; @@ -67,9 +67,9 @@ __global__ void KernelConcat(T** inputs_data, const int fixed_in_col, } template -__global__ void KernelConcatGrad(const T* input_data, const int in_row, - const int in_col, const int* out_cols, - int out_cols_size, T** outputs_data) { +__global__ void SplitKernel(const T* input_data, const int in_row, + const int in_col, const int* out_cols, + int out_cols_size, T** outputs_data) { int tid_x = blockIdx.x * blockDim.x + threadIdx.x; int curr_segment = 0; int curr_offset = out_cols[0]; @@ -94,9 +94,9 @@ __global__ void KernelConcatGrad(const T* input_data, const int in_row, } template -__global__ void KernelConcatGrad(const T* input_data, const int in_row, - const int in_col, const int fixed_out_col, - T** outputs_data) { +__global__ void SplitKernel(const T* input_data, const int in_row, + const int in_col, const int fixed_out_col, + T** outputs_data) { int tid_x = blockIdx.x * blockDim.x + threadIdx.x; for (; tid_x < in_col; tid_x += blockDim.x * gridDim.x) { int split = tid_x / fixed_out_col; @@ -170,11 +170,11 @@ class ConcatFunctor { dim3 grid_size = dim3(grid_cols, grid_rows, 1); if (sameShape) { - KernelConcat<<>>( + ConcatKernel<<>>( dev_ins_data, in_col, out_row, out_col, output->data()); } else { const int* dev_ins_col_data = inputs_col.CUDAData(context.GetPlace()); - KernelConcat<<>>( + ConcatKernel<<>>( dev_ins_data, dev_ins_col_data, static_cast(inputs_col.size()), out_row, out_col, output->data()); } @@ -189,7 +189,7 @@ class ConcatFunctor { * each dimension must be the same, except the axis dimension. 
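 *
 * For example (an illustrative case, mirroring the Split documentation in
 * concat_and_split.h): splitting Input = [[1,2],[3,4],[5,6]] along axis 0
 * with reference shapes (2x2) and (1x2) gives
 *   Output[0] = [[1,2],[3,4]]
 *   Output[1] = [[5,6]]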
*/ template -class ConcatGradFunctor { +class SplitFunctor { public: void operator()(const platform::CUDADeviceContext& context, const framework::Tensor& input, @@ -248,11 +248,11 @@ class ConcatGradFunctor { dim3 grid_size = dim3(grid_cols, grid_rows, 1); if (sameShape) { - KernelConcatGrad<<>>( + SplitKernel<<>>( input.data(), in_row, in_col, out0_col, dev_out_gpu_data); } else { const int* dev_outs_col_data = outputs_cols.CUDAData(context.GetPlace()); - KernelConcatGrad<<>>( + SplitKernel<<>>( input.data(), in_row, in_col, dev_outs_col_data, static_cast(outputs_cols.size()), dev_out_gpu_data); } @@ -264,7 +264,7 @@ class ConcatGradFunctor { #define DEFINE_FUNCTOR(type) \ template class ConcatFunctor; \ - template class ConcatGradFunctor + template class SplitFunctor FOR_ALL_TYPES(DEFINE_FUNCTOR); diff --git a/paddle/fluid/operators/math/concat.h b/paddle/fluid/operators/math/concat_and_split.h similarity index 98% rename from paddle/fluid/operators/math/concat.h rename to paddle/fluid/operators/math/concat_and_split.h index 867a84fa873a2e90bdab7a5eecbb1755cb4b02d1..3a5eddcbf4af699a89ae1a21571337155699a1f3 100644 --- a/paddle/fluid/operators/math/concat.h +++ b/paddle/fluid/operators/math/concat_and_split.h @@ -54,7 +54,7 @@ class ConcatFunctor { * Output[1] = [[5,6]] */ template -class ConcatGradFunctor { +class SplitFunctor { public: void operator()(const DeviceContext& context, const framework::Tensor& input, const std::vector& ref_inputs, diff --git a/paddle/fluid/operators/math/concat_test.cc b/paddle/fluid/operators/math/concat_test.cc index a46f2d51ca64501a622b5b48b424dffa16efc5b4..8ba9e8e8ec1344edc3beaf7f4a58f99107cc0e9c 100644 --- a/paddle/fluid/operators/math/concat_test.cc +++ b/paddle/fluid/operators/math/concat_test.cc @@ -12,10 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/math/concat.h" #include #include #include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/operators/math/concat_and_split.h" template void testConcat() { diff --git a/paddle/fluid/operators/math/cos_sim_functor.cu b/paddle/fluid/operators/math/cos_sim_functor.cu index 4e6ff5ee0a449b42762748ba1a103876beee01f2..537c7e47155fe9a12196869ceaed84fca198335b 100644 --- a/paddle/fluid/operators/math/cos_sim_functor.cu +++ b/paddle/fluid/operators/math/cos_sim_functor.cu @@ -51,7 +51,7 @@ struct CosSimDyFunctor { T* dy) const { const int block_size = 512; dim3 threads(block_size, 1); - dim3 grid(1, (rows + block_size - 1) / block_size); + dim3 grid((rows + block_size - 1) / block_size, 1); CosSimDyKernel<<>>( x_norm, y_norm, x, y, z, dz, rows, cols, dy); } diff --git a/paddle/fluid/operators/math/cpu_vec.h b/paddle/fluid/operators/math/cpu_vec.h index 0aed253c80fc28560716cbcfa70f74ef9c84f9b6..7d81aee596934308763002d440f52400f45b5f20 100644 --- a/paddle/fluid/operators/math/cpu_vec.h +++ b/paddle/fluid/operators/math/cpu_vec.h @@ -33,11 +33,11 @@ namespace math { #define SIGMOID_THRESHOLD_MIN -40.0 #define SIGMOID_THRESHOLD_MAX 13.0 -#define AVX_FLOAT_BLOCK 8 +#define YMM_FLOAT_BLOCK 8 #define AVX_DOUBLE_BLOCK 4 -#define AVX2_FLOAT_BLOCK 8 +#define YMM_FLOAT_BLOCK 8 #define AVX2_DOUBLE_BLOCK 4 -#define AVX512_FLOAT_BLOCK 16 +#define ZMM_FLOAT_BLOCK 16 #define AVX512_DOUBLE_BLOCK 8 template @@ -88,7 +88,7 @@ template <> inline void vec_scal(const int n, const float a, const float* x, float* y) { #ifdef __AVX__ - constexpr int block = AVX_FLOAT_BLOCK; + constexpr int block = YMM_FLOAT_BLOCK; if (n < block) { vec_scal(n, a, x, y); return; @@ -142,7 +142,7 @@ template <> inline void vec_bias_sub(const int n, const float a, const float* x, float* y) { #ifdef __AVX__ - constexpr int block = AVX_FLOAT_BLOCK; + constexpr int block = YMM_FLOAT_BLOCK; if (n < block) { vec_bias_sub(n, a, x, y); return; @@ -200,7 +200,7 @@ inline void vec_cross(const int n, const float* x, const float* y, const float* z, float* out) { #ifdef __AVX__ - constexpr int block = AVX_FLOAT_BLOCK; + constexpr int block = YMM_FLOAT_BLOCK; if (n < block) { vec_cross(n, x, y, z, out); return; @@ -257,7 +257,7 @@ template <> inline void vec_add_bias(const int n, const float a, const float* x, float* y) { #ifdef __AVX__ - constexpr int block = AVX_FLOAT_BLOCK; + constexpr int block = YMM_FLOAT_BLOCK; if (n < block) { vec_add_bias(n, a, x, y); return; @@ -326,7 +326,7 @@ template <> inline void vec_sigmoid(const int n, const float* x, float* y) { #ifdef __AVX__ - constexpr int block = AVX_FLOAT_BLOCK; + constexpr int block = YMM_FLOAT_BLOCK; if (n < block) { vec_sigmoid(n, x, y); return; @@ -415,7 +415,7 @@ template <> inline void vec_relu(const int n, const float* x, float* y) { #ifdef __AVX__ - constexpr int block = AVX_FLOAT_BLOCK; + constexpr int block = YMM_FLOAT_BLOCK; if (n < block * 4) { vec_relu(n, x, y); return; diff --git a/paddle/fluid/operators/math/cpu_vec_test.cc b/paddle/fluid/operators/math/cpu_vec_test.cc index cd40f1b2f984126663a5711efac24fdf6d680b32..ad734bae425f3b3edf3ed57474285c1b5a754416 100644 --- a/paddle/fluid/operators/math/cpu_vec_test.cc +++ b/paddle/fluid/operators/math/cpu_vec_test.cc @@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include #include #include #include @@ -22,6 +21,7 @@ limitations under the License. 
*/
 
 #include "gtest/gtest.h"
 #include "paddle/fluid/operators/math/cpu_vec.h"
+#include "paddle/fluid/platform/port.h"
 
 inline double GetCurrentUS() {
   struct timeval time;
@@ -96,8 +96,8 @@ void TestAndBench(const int n, std::function<void(const int, const float*, float*)> tgt,
   }
   auto et = GetCurrentUS();
 
-  VLOG(3) << "Vec size " << n << ": refer takes: " << (et - mt) / repeat
-          << " us, tgt takes: " << (mt - st) / repeat;
+  VLOG(30) << "Vec size " << n << ": refer takes: " << (et - mt) / repeat
+           << " us, tgt takes: " << (mt - st) / repeat;
   for (int i = 0; i < n; ++i) {
     EXPECT_NEAR(ytgt_data[i], yref_data[i], 1e-3);
   }
diff --git a/paddle/fluid/operators/math/cross_entropy.cu b/paddle/fluid/operators/math/cross_entropy.cu
index c92341ea55ea21773acba33665e267b2f1c25fe3..cb200ec8d6ea533d546f3e01a16a48c88b14f677 100644
--- a/paddle/fluid/operators/math/cross_entropy.cu
+++ b/paddle/fluid/operators/math/cross_entropy.cu
@@ -21,6 +21,16 @@ namespace operators {
 namespace math {
 
 namespace {
+
+__device__ __forceinline__ float real_log(float x) { return logf(x); }
+
+__device__ __forceinline__ double real_log(double x) { return log(x); }
+
+__device__ __forceinline__ platform::float16 real_log(
+    const platform::float16& val) {
+  return static_cast<platform::float16>(logf(static_cast<float>(val)));
+}
+
 template <typename T>
 __global__ void CrossEntropyKernel(T* Y, const T* X, const int64_t* label,
                                    const int N, const int D,
@@ -29,8 +39,8 @@ __global__ void CrossEntropyKernel(T* Y, const T* X, const int64_t* label,
        i += blockDim.x * gridDim.x) {
     PADDLE_ASSERT(label[i] >= 0 && label[i] < D || label[i] == ignore_index);
     Y[i] = ignore_index == label[i]
-               ? 0
-               : -math::TolerableValue<T>()(log(X[i * D + label[i]]));
+               ? static_cast<T>(0)
+               : -math::TolerableValue<T>()(real_log(X[i * D + label[i]]));
   }
 }
 
@@ -38,12 +48,12 @@ template <typename T>
 __global__ void SoftCrossEntropyKernel(T* Y, const T* X, const T* label,
                                        const int class_num) {
   int tid = threadIdx.x;
-  T val = 0;
+  T val(0);
 
   int idx = blockIdx.x * class_num + tid;
   int end = blockIdx.x * class_num + class_num;
   for (; idx < end; idx += blockDim.x) {
-    val += math::TolerableValue<T>()(std::log(X[idx])) * label[idx];
+    val += math::TolerableValue<T>()(real_log(X[idx])) * label[idx];
   }
 
   val = paddle::platform::reduceSum(val, tid, blockDim.x);
@@ -53,8 +63,6 @@ __global__ void SoftCrossEntropyKernel(T* Y, const T* X, const T* label,
 }
 }  // namespace
 
-using Tensor = framework::Tensor;
-
 template <typename T>
 class CrossEntropyFunctor<platform::CUDADeviceContext, T> {
  public:
@@ -89,6 +97,8 @@ class CrossEntropyFunctor<platform::CUDADeviceContext, T> {
 
 template class CrossEntropyFunctor<platform::CUDADeviceContext, float>;
 template class CrossEntropyFunctor<platform::CUDADeviceContext, double>;
+template class CrossEntropyFunctor<platform::CUDADeviceContext,
+                                   platform::float16>;
 }  // namespace math
 }  // namespace operators
 }  // namespace paddle
diff --git a/paddle/fluid/operators/math/cross_entropy.h b/paddle/fluid/operators/math/cross_entropy.h
index e8aeb5d0575ac0f6b8761e97896df73578e8a103..99a4935186e1e6f9e3bf36eb029ce3d230510117 100644
--- a/paddle/fluid/operators/math/cross_entropy.h
+++ b/paddle/fluid/operators/math/cross_entropy.h
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 
 #pragma once
+#include <limits>
 #include "paddle/fluid/framework/eigen.h"
 #include "paddle/fluid/framework/tensor.h"
 #include "paddle/fluid/platform/hostdevice.h"
@@ -33,6 +34,26 @@ struct TolerableValue {
   }
 };
 
+// NOTE(dzh): float16 value clipping behaves differently:
+// 1. Our ValueClipping has a hard-coded threshold of 1e20
+//    for float numbers; 1e20 would overflow in float16.
+// 2. float16 should expose the real number overflow to python.
+// because mixed-precision training depends on the inf/nan values to +// determine whether the scale value needs to be adjusted. +// Also, in standard implementations of cross entropy, other +// frameworks do not apply this ValueClipping. +template <> +struct TolerableValue<platform::float16> { + HOSTDEVICE platform::float16 operator()(const platform::float16& x) const { + if (platform::isfinite(x)) + return x; + else if (x > static_cast<platform::float16>(0)) + return std::numeric_limits<platform::float16>::max(); + else + return std::numeric_limits<platform::float16>::min(); + } +}; + template <typename DeviceContext, typename T> class CrossEntropyFunctor { public: diff --git a/paddle/fluid/operators/math/detail/activation_functions.h b/paddle/fluid/operators/math/detail/activation_functions.h index b127fbe8c8515e7fe57b07ea1d4291675ec4efca..2b3d38d95a18fad9b76e616cdf2cb6c3eb07da3a 100644 --- a/paddle/fluid/operators/math/detail/activation_functions.h +++ b/paddle/fluid/operators/math/detail/activation_functions.h @@ -15,6 +15,7 @@ limitations under the License. */ #pragma once #include <math.h> #include <string> + #include "paddle/fluid/platform/enforce.h" #include "paddle/fluid/platform/hostdevice.h" diff --git a/paddle/fluid/operators/math/fc_compute.h b/paddle/fluid/operators/math/fc_compute.h index 1f5a49c0ab5a10b0d7dc1febd258ce76c467cb1c..5b9953a5aa9a29bd917d16a16c678fc32a32c18f 100644 --- a/paddle/fluid/operators/math/fc_compute.h +++ b/paddle/fluid/operators/math/fc_compute.h @@ -15,8 +15,7 @@ limitations under the License. */ #pragma once #include "paddle/fluid/operators/math/blas.h" - -DECLARE_int32(paddle_num_threads); +#include "paddle/fluid/operators/math/jit_kernel.h" namespace paddle { namespace operators { @@ -30,20 +29,25 @@ inline void FCCompute(const BlasT<DeviceContext, T>& blas, const int M, if (B == NULL) { return; } + if (relu) { + const auto& vaddrelu = jitkernel::KernelPool::Instance() + .template Get<jitkernel::VAddReluKernel<T>>(N); + for (int i = 0; i < M; i++) { + T* dst = Y + i * N; + vaddrelu->Compute(B, dst, dst, N); + } + } else { + const auto& vadd = jitkernel::KernelPool::Instance() + .template Get<jitkernel::VAddKernel<T>>(N); #ifdef PADDLE_WITH_MKLML -#pragma omp parallel for if (FLAGS_paddle_num_threads > 1) +#pragma omp parallel for #endif - for (int i = 0; i < M; i++) { - blas.AXPY(N, static_cast<T>(1), B, Y + i * N); + for (int i = 0; i < M; i++) { + T* dst = Y + i * N; + vadd->Compute(B, dst, dst, N); + } } - - if (!relu) { - return; - } - - // TODO(TJ): fuse relu - LOG(FATAL) << "Not implemented!"; } } // namespace math diff --git a/paddle/fluid/operators/math/im2col_test.cc b/paddle/fluid/operators/math/im2col_test.cc index ae2c90b33a4298ada4fd01aa2a44ebdf10d036d4..521cd7801abd6bea0c4f27422141742921dddf53 100644 --- a/paddle/fluid/operators/math/im2col_test.cc +++ b/paddle/fluid/operators/math/im2col_test.cc @@ -14,9 +14,9 @@ limitations under the License. */ #include "paddle/fluid/operators/math/im2col.h" #include <gtest/gtest.h> -#include <sys/time.h> #include <vector> #include "paddle/fluid/operators/math/im2col_cfo_cpu.h" +#include "paddle/fluid/platform/port.h" template <typename DeviceContext, typename Place> void testIm2col() { diff --git a/paddle/fluid/operators/math/jit_code.cc b/paddle/fluid/operators/math/jit_code.cc new file mode 100644 index 0000000000000000000000000000000000000000..52cbdf685dee651cbc1490dc6faacb8680004c89 --- /dev/null +++ b/paddle/fluid/operators/math/jit_code.cc @@ -0,0 +1,333 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License.
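To make the FCCompute change above concrete: with Y already holding X·W, the bias is added row by row through the pooled VAdd/VAddRelu kernels, and the ReLU is fused into that same pass instead of the old LOG(FATAL) placeholder. A scalar reference sketch of what those kernels compute per row:

```cpp
#include <algorithm>

// Scalar equivalent of the row-wise bias pass in FCCompute: each of the M
// rows of Y (width N) gets B added elementwise; with relu set, the max with
// zero happens in the same sweep, which is what VAddRelu fuses.
template <typename T>
void FCBiasReference(const T* B, T* Y, int M, int N, bool relu) {
  for (int i = 0; i < M; ++i) {
    T* dst = Y + i * N;
    for (int j = 0; j < N; ++j) {
      T v = dst[j] + B[j];                    // vadd->Compute(B, dst, dst, N)
      dst[j] = relu ? std::max(v, T(0)) : v;  // vaddrelu fuses the ReLU
    }
  }
}
```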
+You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/math/jit_code.h" +#include // offsetof +#include "paddle/fluid/operators/math/jit_kernel.h" // TODO(TJ): remove me + +namespace paddle { +namespace operators { +namespace math { +namespace jitkernel { +namespace gen { + +using namespace platform::jit; // NOLINT + +bool VXXJitCode::init(int d, int scalar_index) { + // It's not necessary to use avx512 since it would slow down the frequency + // and this kernel is not compute bound. + return MayIUse(avx) && scalar_index >= 0 && scalar_index <= 2; +} + +void VXXJitCode::generate() { + // do not need push stack, and do not need save avx512reg if do not use avx512 + int offset = 0; + if (with_relu_) { + vxorps(ymm_zero, ymm_zero, ymm_zero); + } + if (scalar_index_ == 1) { + vbroadcastss(ymm_src1, ptr[param1]); + } else if (scalar_index_ == 2) { + vbroadcastss(ymm_src2, ptr[param2]); + } + for (int i = 0; i < num_ / YMM_FLOAT_BLOCK; ++i) { + if (scalar_index_ != 1) { + vmovups(ymm_src1, ptr[param1 + offset]); + } + if (scalar_index_ != 2) { + vmovups(ymm_src2, ptr[param2 + offset]); + } + if (type_ == operand_type::mul) { + vmulps(ymm_dst, ymm_src1, ymm_src2); + } else if (type_ == operand_type::add) { + vaddps(ymm_dst, ymm_src1, ymm_src2); + } + if (with_relu_) { + vmaxps(ymm_dst, ymm_zero, ymm_dst); + } + vmovups(ptr[param3 + offset], ymm_dst); + offset += sizeof(float) * YMM_FLOAT_BLOCK; + } + int rest = num_ % YMM_FLOAT_BLOCK; + while (rest > 0) { + int block = XMM_FLOAT_BLOCK; + if (rest >= 4) { + block = 4; + if (scalar_index_ != 1) { + vmovups(xmm_src1, ptr[param1 + offset]); + } + if (scalar_index_ != 2) { + vmovups(xmm_src2, ptr[param2 + offset]); + } + } else if (rest >= 2) { + block = 2; + if (scalar_index_ != 1) { + vmovq(xmm_src1, ptr[param1 + offset]); + } + if (scalar_index_ != 2) { + vmovq(xmm_src2, ptr[param2 + offset]); + } + } else { + block = 1; + if (scalar_index_ != 1) { + vmovss(xmm_src1, ptr[param1 + offset]); + } + if (scalar_index_ != 2) { + vmovss(xmm_src2, ptr[param2 + offset]); + } + } + switch (type_) { + case operand_type::mul: + vmulps(xmm_dst, xmm_src1, xmm_src2); + break; + case operand_type::add: + vaddps(xmm_dst, xmm_src1, xmm_src2); + break; + default: + break; + } + if (with_relu_) { + vmaxps(xmm_dst, xmm_zero, xmm_dst); + } + if (rest >= 4) { + vmovups(ptr[param3 + offset], xmm_dst); + } else if (rest >= 2) { + vmovq(ptr[param3 + offset], xmm_dst); + } else { + vmovss(ptr[param3 + offset], xmm_dst); + } + offset += sizeof(float) * block; + rest -= block; + } + ret(); +} + +const float exp_float_consts[] ALIGN32 = {REPEAT_8TIMES(1.f), + REPEAT_8TIMES(2.f), + REPEAT_8TIMES(0.5f), + REPEAT_8TIMES(EXP_HIG), + REPEAT_8TIMES(EXP_LOW), + REPEAT_8TIMES(CEPHES_LOG2EF), + REPEAT_8TIMES(CEPHES_EXP_C1), + REPEAT_8TIMES(CEPHES_EXP_C2), + REPEAT_8TIMES(CEPHES_EXP_P0), + REPEAT_8TIMES(CEPHES_EXP_P1), + REPEAT_8TIMES(CEPHES_EXP_P2), + REPEAT_8TIMES(CEPHES_EXP_P3), + REPEAT_8TIMES(CEPHES_EXP_P4), + REPEAT_8TIMES(CEPHES_EXP_P5), + REPEAT_8TIMES(EXP_MAX_INPUT), + REPEAT_8TIMES(SIGMOID_THRESHOLD_MAX), + REPEAT_8TIMES(SIGMOID_THRESHOLD_MIN)}; + +const int exp_int_0x7f[] ALIGN32 = 
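The generate() body above follows a fixed pattern: an 8-float YMM main loop, then a remainder drained with the widest remaining move (4 floats via vmovups, 2 via vmovq, 1 via vmovss). A scalar sketch of that block decomposition (hypothetical name, float-only like the JIT path):

```cpp
// Mirrors VXXJitCode's loop structure: an 8-wide main loop, then a tail
// processed in blocks of 4, 2 and 1 -- the same sizes the JIT picks for
// its xmm loads and stores.
void BlockedMul(const float* x, const float* y, float* z, int n) {
  int i = 0;
  for (; i + 8 <= n; i += 8) {  // ymm main loop
    for (int k = 0; k < 8; ++k) z[i + k] = x[i + k] * y[i + k];
  }
  int rest = n - i;
  while (rest > 0) {
    const int block = rest >= 4 ? 4 : (rest >= 2 ? 2 : 1);
    for (int k = 0; k < block; ++k) z[i + k] = x[i + k] * y[i + k];
    i += block;
    rest -= block;
  }
}
```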
{REPEAT_8TIMES(0x7f)}; +int g_tmp_mem[16] ALIGN32 = {0}; + +bool VActJitCode::init(int d, operand_type type) { + // TODO(TJ): implement avx512, avx_exp is slower than mkl when d >= 256 + return MayIUse(avx); +} + +void VActJitCode::generate() { + int offset = 0; + for (int i = 0; i < num_ / YMM_FLOAT_BLOCK; ++i) { + vmovups(ymm_src, ptr[param1 + offset]); + act(ymm_dst, ymm_src, type_); + vmovups(ptr[param2 + offset], ymm_dst); + offset += sizeof(float) * YMM_FLOAT_BLOCK; + } + int rest = num_ % YMM_FLOAT_BLOCK; + while (rest > 0) { + int block = XMM_FLOAT_BLOCK; + if (rest >= 4) { + block = 4; + vmovups(xmm_src, ptr[param1 + offset]); + } else if (rest >= 2) { + block = 2; + vmovq(xmm_src, ptr[param1 + offset]); + } else { + block = 1; + vmovss(xmm_src, ptr[param1 + offset]); + } + act(xmm_dst, xmm_src, type_); + if (rest >= 4) { + vmovups(ptr[param2 + offset], xmm_dst); + } else if (rest >= 2) { + vmovq(ptr[param2 + offset], xmm_dst); + } else { + vmovss(ptr[param2 + offset], xmm_dst); + } + offset += sizeof(float) * block; + rest -= block; + } + ret(); +} + +bool LSTMJitCode::init(int d) { return MayIUse(avx) && d % 8 == 0; } + +void LSTMJitCode::generate() { + if (use_peephole_) { + preCode(); + } + reg64_t reg_ptr_gates = rax; + reg64_t reg_ptr_ct_1 = r9; + reg64_t reg_ptr_ct = r10; + reg64_t reg_ptr_ht = r11; + reg64_t reg_ptr_wp = r12; + mov(reg_ptr_gates, ptr[param1 + offsetof(lstm_t, gates)]); + mov(reg_ptr_ct_1, ptr[param1 + offsetof(lstm_t, ct_1)]); + mov(reg_ptr_ct, ptr[param1 + offsetof(lstm_t, ct)]); + mov(reg_ptr_ht, ptr[param1 + offsetof(lstm_t, ht)]); + if (use_peephole_) { + mov(reg_ptr_wp, ptr[param1 + offsetof(lstm_t, wp)]); + } + + int offset = 0; + int d = num_ * sizeof(float); + for (int i = 0; i < num_ / YMM_FLOAT_BLOCK; ++i) { + /* gates: W_ch, W_ih, W_fh, W_oh */ + ymm_t ymm_c = ymm_t(0); + ymm_t ymm_i = ymm_t(1); + ymm_t ymm_f = ymm_t(2); + ymm_t ymm_o = ymm_t(3); + ymm_t ymm_ct_1 = ymm_t(4); + ymm_t ymm_wp0 = ymm_t(5); + ymm_t ymm_wp1 = ymm_t(6); + ymm_t ymm_wp2 = ymm_t(7); + vmovups(ymm_c, ptr[reg_ptr_gates + offset]); + vmovups(ymm_i, ptr[reg_ptr_gates + offset + d]); + vmovups(ymm_f, ptr[reg_ptr_gates + offset + 2 * d]); + vmovups(ymm_o, ptr[reg_ptr_gates + offset + 3 * d]); + if (!compute_c1h1_) { + vmovups(ymm_ct_1, ptr[reg_ptr_ct_1 + offset]); + } + if (use_peephole_) { + vmovups(ymm_wp0, ptr[reg_ptr_wp + offset]); + vmovups(ymm_wp1, ptr[reg_ptr_wp + offset + d]); + vmovups(ymm_wp2, ptr[reg_ptr_wp + offset + 2 * d]); + } + /* C_t = act_cand(c) * act_gate(i) + C_t-1 * act_gate(f) */ + // act_cand(c) + act(ymm_c, ymm_c, act_cand_); + // act_gate(i) or act_gate(ct_1 * wp0 + i) + if (!compute_c1h1_ && use_peephole_) { + vmulps(ymm_wp0, ymm_ct_1, ymm_wp0); + vaddps(ymm_i, ymm_i, ymm_wp0); + } + act(ymm_i, ymm_i, act_gate_); + vmulps(ymm_c, ymm_c, ymm_i); + if (!compute_c1h1_) { + // act_gate(f) or act_gate(ct_1 * wp1 + f) + if (use_peephole_) { + vmulps(ymm_wp1, ymm_ct_1, ymm_wp1); + vaddps(ymm_f, ymm_f, ymm_wp1); + } + act(ymm_f, ymm_f, act_gate_); + // ct + vmulps(ymm_f, ymm_f, ymm_ct_1); + vaddps(ymm_f, ymm_f, ymm_c); + } + /* H_t = act_cell(C_t) * act_gate(o) */ + // act_cell(C_t) + ymm_t ymm_ct = compute_c1h1_ ? 
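For the LSTM body being emitted above, a scalar reference of the per-element gate math may help (a sketch assuming sigmoid gates and tanh candidate/cell activations; peephole terms and the C1H1 variant omitted, and widths given in elements where the JIT tracks byte offsets):

```cpp
#include <cmath>

// One element of LSTMJitCode::generate(): gates holds [c, i, f, o], each of
// width d; ct_1 is the previous cell state.
float SigmoidRef(float x) { return 1.f / (1.f + std::exp(-x)); }

void LstmStepReference(const float* gates, const float* ct_1, float* ct,
                       float* ht, int d) {
  for (int k = 0; k < d; ++k) {
    const float c = std::tanh(gates[k]);           // act_cand(c)
    const float i = SigmoidRef(gates[k + d]);      // act_gate(i)
    const float f = SigmoidRef(gates[k + 2 * d]);  // act_gate(f)
    const float o = SigmoidRef(gates[k + 3 * d]);  // act_gate(o)
    ct[k] = c * i + ct_1[k] * f;                   // C_t
    ht[k] = std::tanh(ct[k]) * o;                  // H_t = act_cell(C_t) * o
  }
}
```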
ymm_c : ymm_f; + ymm_t ymm_tmp = ymm_i; + act(ymm_tmp, ymm_ct, act_cell_); + // act_gate(o) or act_gate(ct * wp2 + o) + if (use_peephole_) { + vmulps(ymm_wp2, ymm_ct, ymm_wp2); + vaddps(ymm_o, ymm_o, ymm_wp2); + } + act(ymm_o, ymm_o, act_gate_); + // ht + vmulps(ymm_o, ymm_o, ymm_tmp); + // save ct and ht + vmovups(ptr[reg_ptr_ct + offset], ymm_ct); + vmovups(ptr[reg_ptr_ht + offset], ymm_o); + offset += sizeof(float) * YMM_FLOAT_BLOCK; + } + + if (use_peephole_) { + postCode(); + } else { + ret(); + } +} + +bool GRUJitCode::init(int d) { return MayIUse(avx) && d % 8 == 0; } + +void GRUJitCode::generate() { + reg64_t reg_ptr_gates = rax; + reg64_t reg_ptr_ht_1 = r9; + reg64_t reg_ptr_ht = r10; + mov(reg_ptr_gates, ptr[param1 + offsetof(gru_t, gates)]); + mov(reg_ptr_ht_1, ptr[param1 + offsetof(gru_t, ht_1)]); + mov(reg_ptr_ht, ptr[param1 + offsetof(gru_t, ht)]); + ymm_t ymm_one = ymm_t(0); + + if (id_ == 2) { + reg64_t reg_ptr_tmp = r11; + mov(reg_ptr_tmp, reinterpret_cast(exp_float_consts)); + vmovaps(ymm_one, ptr[reg_ptr_tmp + OFFSET_EXP_ONE]); + } + int offset = 0; + int d = num_ * sizeof(float); + for (int i = 0; i < num_ / YMM_FLOAT_BLOCK; ++i) { + ymm_t ymm_u = ymm_t(1); + ymm_t ymm_r = ymm_t(2); + ymm_t ymm_s = ymm_t(3); + ymm_t ymm_ht_1 = ymm_t(4); + // W: {W_update, W_reset; W_state} + if (id_ == 0 || id_ == 2) { + vmovups(ymm_u, ptr[reg_ptr_gates + offset]); + vmovups(ymm_s, ptr[reg_ptr_gates + offset + 2 * d]); + } + if (id_ == 1) { + vmovups(ymm_r, ptr[reg_ptr_gates + offset + d]); + } + if (id_ == 1 || id_ == 2) { + vmovups(ymm_ht_1, ptr[reg_ptr_ht_1 + offset]); + } + + if (id_ == 0) { + // ht = act_gate(u) * act_cand(s) + act(ymm_u, ymm_u, act_gate_); + act(ymm_s, ymm_s, act_cand_); + vmulps(ymm_s, ymm_s, ymm_u); + vmovups(ptr[reg_ptr_ht + offset], ymm_s); + } else if (id_ == 1) { + // ht = act_gate(r) * ht_1 + act(ymm_r, ymm_r, act_gate_); + vmulps(ymm_r, ymm_r, ymm_ht_1); + vmovups(ptr[reg_ptr_ht + offset], ymm_r); + } else if (id_ == 2) { + // ht = act_gate(u) * act_cand(s) + (1-act_gate(u)) * ht_1 + ymm_t ymm_one_inner = ymm_t(ymm_one.getIdx()); + act(ymm_u, ymm_u, act_gate_); + act(ymm_s, ymm_s, act_cand_); + vmulps(ymm_s, ymm_s, ymm_u); + vsubps(ymm_u, ymm_one_inner, ymm_u); + vmulps(ymm_u, ymm_ht_1, ymm_u); + vaddps(ymm_u, ymm_s, ymm_u); + vmovups(ptr[reg_ptr_ht + offset], ymm_u); + } + offset += sizeof(float) * YMM_FLOAT_BLOCK; + } + + ret(); +} +} // namespace gen +} // namespace jitkernel +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/math/jit_code.h b/paddle/fluid/operators/math/jit_code.h new file mode 100644 index 0000000000000000000000000000000000000000..a9214621295a7740b804b26c02d216dd5118d8bb --- /dev/null +++ b/paddle/fluid/operators/math/jit_code.h @@ -0,0 +1,533 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
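The three GRU variants above split one step into H1 (no previous state), HtPart1 (reset-gated hidden state) and HtPart2 (the final blend). A scalar reference under the same gate layout (sigmoid gates and tanh candidate assumed):

```cpp
#include <cmath>

// gates holds [u, r, s] (update, reset, state candidate), each of width d.
float Gate(float x) { return 1.f / (1.f + std::exp(-x)); }

void GruH1(const float* gates, float* ht, int d) {  // id_ == 0
  for (int k = 0; k < d; ++k)
    ht[k] = Gate(gates[k]) * std::tanh(gates[k + 2 * d]);
}
void GruHtPart1(const float* gates, const float* ht_1, float* ht, int d) {
  for (int k = 0; k < d; ++k)  // id_ == 1
    ht[k] = Gate(gates[k + d]) * ht_1[k];
}
void GruHtPart2(const float* gates, const float* ht_1, float* ht, int d) {
  for (int k = 0; k < d; ++k) {  // id_ == 2
    const float u = Gate(gates[k]);
    const float s = std::tanh(gates[k + 2 * d]);
    ht[k] = u * s + (1.f - u) * ht_1[k];  // u*cand + (1-u)*ht_1
  }
}
```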
*/ + +#pragma once + +#include +#include "paddle/fluid/operators/math/jit_gen.h" +#include "paddle/fluid/operators/math/jit_kernel_impl.h" +#include "paddle/fluid/platform/cpu_info.h" + +namespace paddle { +namespace operators { +namespace math { +namespace jitkernel { +namespace gen { + +using reg64_t = const Xbyak::Reg64; +using reg32_t = const Xbyak::Reg32; +using xmm_t = const Xbyak::Xmm; +using ymm_t = const Xbyak::Ymm; +using zmm_t = const Xbyak::Zmm; +using Label = Xbyak::Label; + +typedef enum { + mul = 0, + add, + sub, + relu, + exp, + sigmoid, + tanh, + identity +} operand_type; + +extern const float exp_float_consts[]; +extern const int exp_int_0x7f[]; +extern int g_tmp_mem[]; + +#define ALIGN32 __attribute__((aligned(32))) +#define EXP_HIG 88.3762626647949f +#define EXP_LOW -88.3762626647949f +#define CEPHES_LOG2EF 1.44269504088896341 +#define CEPHES_EXP_C1 0.693359375 +#define CEPHES_EXP_C2 -2.12194440e-4 +#define CEPHES_EXP_P0 1.9875691500E-4 +#define CEPHES_EXP_P1 1.3981999507E-3 +#define CEPHES_EXP_P2 8.3334519073E-3 +#define CEPHES_EXP_P3 4.1665795894E-2 +#define CEPHES_EXP_P4 1.6666665459E-1 +#define CEPHES_EXP_P5 5.0000001201E-1 + +#define REPEAT_8TIMES(val) val, val, val, val, val, val, val, val + +#define OFFSET_EXP_ONE 0 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_TWO 1 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_0P5 2 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_HIG 3 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_LOW 4 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_LOG2EF 5 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_C1 6 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_C2 7 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_P0 8 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_P1 9 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_P2 10 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_P3 11 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_P4 12 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_P5 13 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_EXP_MAX_INPUT 14 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_SIGMOID_MAX 15 * YMM_FLOAT_BLOCK * sizeof(float) +#define OFFSET_SIGMOID_MIN 16 * YMM_FLOAT_BLOCK * sizeof(float) + +// function: vec = Operand(vec(or scalar), vec(or scalar)) (maybe with relu) +class VXXJitCode : public JitCode { + public: + const char* name() const override { + std::string base = "VXXJitCode"; + if (scalar_index_ == 1) { + base += "_Scalar"; + } else { + base += "_Vec"; + } + if (type_ == operand_type::mul) { + base += "_Mul"; + } else if (type_ == operand_type::add) { + base += "_Add"; + } + if (scalar_index_ == 2) { + base += "_Scalar"; + } else { + base += "_Vec"; + } + base += (with_relu_ ? 
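One detail worth spelling out about the constants above: each value is stored eight times (REPEAT_8TIMES) so a single aligned 32-byte vmovaps fills a whole YMM register with the broadcast constant, and the OFFSET_* macros are nothing more than row_index * 8 * sizeof(float). A tiny sketch of the indexing:

```cpp
#include <cstddef>

// Row i of exp_float_consts starts at i * 8 floats; e.g. OFFSET_EXP_LOG2EF
// is row 5, i.e. byte offset 5 * 8 * 4 = 160.
constexpr std::size_t kYmmFloats = 8;  // YMM_FLOAT_BLOCK
constexpr std::size_t RowOffset(std::size_t row) {
  return row * kYmmFloats * sizeof(float);
}
static_assert(RowOffset(5) == 160, "matches OFFSET_EXP_LOG2EF");
```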
"_Relu" : ""); + return base.c_str(); + } + explicit VXXJitCode(int d, operand_type type, int scalar_index, + bool with_relu, size_t code_size = 256 * 1024, + void* code_ptr = nullptr) + : JitCode(code_size, code_ptr), + num_(d), + type_(type), + scalar_index_(scalar_index), + with_relu_(with_relu) {} + static bool init(int d, int scalar_index = 0); + void generate() override; + + private: + int num_; + operand_type type_; + int scalar_index_; + bool with_relu_; + reg64_t param1{abi_param1}; + reg64_t param2{abi_param2}; + reg64_t param3{abi_param3}; + + xmm_t xmm_src1 = xmm_t(0); + xmm_t xmm_src2 = xmm_t(1); + xmm_t xmm_dst = xmm_t(2); + xmm_t xmm_zero = xmm_t(3); + + ymm_t ymm_src1 = ymm_t(0); + ymm_t ymm_src2 = ymm_t(1); + ymm_t ymm_dst = ymm_t(2); + ymm_t ymm_zero = ymm_t(3); +}; + +class VActJitCode : public JitCode { + public: + const char* name() const override { + std::string base = "VActJitCode"; + switch (type_) { + case operand_type::relu: + base += "_Relu"; + break; + case operand_type::exp: + base += "_Exp"; + break; + case operand_type::sigmoid: + base += "_Sigmoid"; + break; + case operand_type::tanh: + base += "_Tanh"; + break; + case operand_type::identity: + base += "_Identity"; + break; + default: + break; + } + return base.c_str(); + } + + explicit VActJitCode(int d, operand_type type, size_t code_size = 256 * 1024, + void* code_ptr = nullptr) + : JitCode(code_size, code_ptr), num_(d), type_(type) {} + static bool init(int d, operand_type type); + void generate() override; + + protected: + // compute relu with ymm, xmm + template + void relu_jmm(JMM& dst, JMM& src, int zero_idx = 15) { // NOLINT + JMM zero = JMM(zero_idx); + vxorps(zero, zero, zero); + vmaxps(dst, src, zero); + } + + // compute exp with ymm, xmm + template + void exp_jmm(JMM& dst, JMM& src, int src_idx = 11, int fx_idx = 12, // NOLINT + int fy_idx = 13, int mask_idx = 14, int tmp_idx = 15) { + using namespace platform::jit; // NOLINT + // check all idx can not equal + JMM jmm_src = JMM(src_idx); + JMM jmm_fx = JMM(fx_idx); + JMM jmm_fy = JMM(fy_idx); + JMM jmm_mask = JMM(mask_idx); + JMM jmm_tmp = JMM(tmp_idx); + reg64_t reg_ptr_global = rax; + push(reg_ptr_global); + vmovaps(jmm_src, src); + mov(reg_ptr_global, reinterpret_cast(exp_float_consts)); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_HIG]); + vminps(jmm_src, jmm_src, jmm_tmp); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_LOW]); + vmaxps(jmm_src, jmm_src, jmm_tmp); + // express exp(x) as exp(g + n*log(2)) + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_LOG2EF]); + vmulps(jmm_fx, jmm_src, jmm_tmp); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_0P5]); + vaddps(jmm_fx, jmm_fx, jmm_tmp); + vroundps(jmm_fy, jmm_fx, 0x01); + // if greater, substract 1 + vcmpgtps(jmm_mask, jmm_fy, jmm_fx); + vmovaps(jmm_tmp, ptr[reg_ptr_global]); + vandps(jmm_mask, jmm_mask, jmm_tmp); + vsubps(jmm_fx, jmm_fy, jmm_mask); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_C1]); + vmulps(jmm_fy, jmm_fx, jmm_tmp); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_C2]); + JMM ymm_z = JMM(jmm_mask.getIdx()); + vmulps(ymm_z, jmm_fx, jmm_tmp); + vsubps(jmm_src, jmm_src, jmm_fy); + vsubps(jmm_src, jmm_src, ymm_z); + vmulps(ymm_z, jmm_src, jmm_src); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_P0]); + vmulps(dst, jmm_src, jmm_tmp); + for (size_t i = OFFSET_EXP_P1; i < OFFSET_EXP_P5; + i += (YMM_FLOAT_BLOCK * sizeof(float))) { + vmovaps(jmm_tmp, ptr[reg_ptr_global + i]); // P1~P4 + vaddps(dst, dst, jmm_tmp); + vmulps(dst, dst, jmm_src); + } + 
vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_P5]); + vaddps(dst, dst, jmm_tmp); + vmulps(dst, dst, ymm_z); + vaddps(dst, dst, jmm_src); + vmovaps(jmm_tmp, ptr[reg_ptr_global]); + vaddps(dst, dst, jmm_tmp); + // build 2^n + JMM ymm_int = jmm_fx; + vcvttps2dq(ymm_int, jmm_fx); + mov(reg_ptr_global, reinterpret_cast(exp_int_0x7f)); + vmovdqa(jmm_tmp, ptr[reg_ptr_global]); + if (MayIUse(avx2) || std::is_same::value) { + vpaddd(ymm_int, ymm_int, jmm_tmp); + vpslld(ymm_int, ymm_int, 23); + } else if (MayIUse(avx)) { + xmm_t xtmp1 = xmm_t(ymm_int.getIdx()); + xmm_t xtmp2 = xmm_t(jmm_tmp.getIdx()); + reg64_t reg_ptr_tmp = reg_ptr_global; + mov(reg_ptr_tmp, reinterpret_cast(g_tmp_mem)); + vmovdqa(ptr[reg_ptr_tmp], ymm_int); + vmovdqa(ptr[reg_ptr_tmp + YMM_FLOAT_BLOCK * sizeof(float)], jmm_tmp); + vpaddd(xtmp1, xtmp1, xtmp2); + vpslld(xtmp1, xtmp1, 23); + vmovdqa(ptr[reg_ptr_tmp], xtmp1); + // next 128bits + vmovdqa(xtmp1, ptr[reg_ptr_tmp + XMM_FLOAT_BLOCK * sizeof(float)]); + vmovdqa(xtmp2, ptr[reg_ptr_tmp + + (YMM_FLOAT_BLOCK + XMM_FLOAT_BLOCK) * sizeof(float)]); + vpaddd(xtmp1, xtmp1, xtmp2); + vpslld(xtmp1, xtmp1, 23); + vmovdqa(ptr[reg_ptr_tmp + XMM_FLOAT_BLOCK * sizeof(float)], xtmp1); + // load out + vmovdqa(ymm_int, ptr[reg_ptr_tmp]); + } + vmulps(dst, dst, ymm_int); + pop(reg_ptr_global); + } + + // compute sigmoid with ymm, xmm + template + void sigmoid_jmm(JMM& dst, JMM& src, int src_idx = 11, // NOLINT + int fx_idx = 12, int fy_idx = 13, int mask_idx = 14, + int tmp_idx = 15) { + // y = 1 / (1 + e^-x) + JMM jmm_tmp = JMM(tmp_idx); + JMM jmm_src = JMM(src_idx); + reg64_t reg_ptr_global = rax; + push(reg_ptr_global); + vmovaps(jmm_src, src); + mov(reg_ptr_global, reinterpret_cast(exp_float_consts)); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_SIGMOID_MAX]); + vminps(jmm_src, jmm_src, jmm_tmp); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_SIGMOID_MIN]); + vmaxps(jmm_src, jmm_src, jmm_tmp); + vxorps(jmm_tmp, jmm_tmp, jmm_tmp); + vsubps(jmm_src, jmm_tmp, jmm_src); + exp_jmm(dst, jmm_src, src_idx, fx_idx, fy_idx, mask_idx, tmp_idx); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_ONE]); + vaddps(dst, dst, jmm_tmp); + vdivps(dst, jmm_tmp, dst); + pop(reg_ptr_global); + } + + // compute tanh with ymm, xmm + template + void tanh_jmm(JMM& dst, JMM& src, int src_idx = 11, // NOLINT + int fx_idx = 12, int fy_idx = 13, int mask_idx = 14, + int tmp_idx = 15) { + // y = 2 / (1 + e^(-2x)) - 1 + JMM jmm_src = JMM(src_idx); + JMM jmm_tmp = JMM(tmp_idx); + JMM jmm_zero = JMM(mask_idx); + reg64_t reg_ptr_global = rax; + push(reg_ptr_global); + vmovaps(jmm_src, src); + mov(reg_ptr_global, reinterpret_cast(exp_float_consts)); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_TWO]); + vxorps(jmm_zero, jmm_zero, jmm_zero); + vsubps(jmm_tmp, jmm_zero, jmm_tmp); + vmulps(jmm_src, jmm_src, jmm_tmp); + exp_jmm(dst, jmm_src, src_idx, fx_idx, fy_idx, mask_idx, tmp_idx); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_ONE]); + vaddps(dst, dst, jmm_tmp); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_TWO]); + vdivps(dst, jmm_tmp, dst); + vmovaps(jmm_tmp, ptr[reg_ptr_global + OFFSET_EXP_ONE]); + vsubps(dst, dst, jmm_tmp); + pop(reg_ptr_global); + } + + template + void act(JMM& dst, JMM& src, operand_type type) { // NOLINT + // use 11~15 + switch (type) { + case operand_type::relu: + relu_jmm(dst, src, 15); + break; + case operand_type::exp: + exp_jmm(dst, src, 11, 12, 13, 14, 15); + break; + case operand_type::sigmoid: + sigmoid_jmm(dst, src, 11, 12, 13, 14, 15); + break; + case operand_type::tanh: + 
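exp_jmm above is a vectorized Cephes-style expf: clamp x to [-88.37, 88.37], split x = n·ln2 + g using the two-part constant C1 + C2, evaluate a degree-5 polynomial in g, then rebuild 2^n by adding 0x7f to n and shifting it into the float exponent field (the vpaddd/vpslld pair, with the g_tmp_mem spill dance as the AVX1 fallback). sigmoid_jmm and tanh_jmm then reuse it through the identities sigmoid(x) = 1/(1+e^-x) and tanh(x) = 2/(1+e^-2x) - 1. A scalar sketch of the same recipe:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <initializer_list>

// Scalar version of the JIT exp: range reduction, polynomial, then 2^n
// built directly in the exponent bits.
float ExpReference(float x) {
  x = std::fmin(std::fmax(x, -88.3762626647949f), 88.3762626647949f);
  const float n = std::floor(x * 1.44269504088896341f + 0.5f);  // ~round(x*log2e)
  // g = x - n*ln2, with ln2 split as C1 + C2 for extra precision.
  float g = x - n * 0.693359375f - n * -2.12194440e-4f;
  const float z = g * g;
  float p = 1.9875691500E-4f;  // CEPHES_EXP_P0
  for (float c : {1.3981999507E-3f, 8.3334519073E-3f, 4.1665795894E-2f,
                  1.6666665459E-1f, 5.0000001201E-1f}) {  // P1..P5
    p = p * g + c;
  }
  const float r = p * z + g + 1.0f;
  const uint32_t bits =
      (static_cast<uint32_t>(static_cast<int32_t>(n)) + 0x7f) << 23;
  float two_n;
  std::memcpy(&two_n, &bits, sizeof(two_n));  // 2^n
  return r * two_n;
}
```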
tanh_jmm(dst, src, 11, 12, 13, 14, 15); + break; + case operand_type::identity: + break; + default: + // throw error + break; + } + } + + protected: + int num_; + operand_type type_; + reg64_t param1{abi_param1}; + reg64_t param2{abi_param2}; + + xmm_t xmm_src = xmm_t(0); + ymm_t ymm_src = ymm_t(0); + + xmm_t xmm_dst = xmm_t(1); + ymm_t ymm_dst = ymm_t(1); +}; + +class LSTMJitCode : public VActJitCode { + public: + const char* name() const override { + std::string base = "LSTMJitCode"; + if (use_peephole_) { + base += "_Peephole"; + } + if (compute_c1h1_) { + base += "_C1H1"; + } + auto AddTypeStr = [&](operand_type type) { + switch (type) { + case operand_type::relu: + base += "_Relu"; + break; + case operand_type::exp: + base += "_Exp"; + break; + case operand_type::sigmoid: + base += "_Sigmoid"; + break; + case operand_type::tanh: + base += "_Tanh"; + break; + case operand_type::identity: + base += "_Identity"; + break; + default: + break; + } + }; + AddTypeStr(act_gate_); + AddTypeStr(act_cand_); + AddTypeStr(act_cell_); + return base.c_str(); + } + + explicit LSTMJitCode(bool compute_c1h1, const lstm_attr_t& attr, + size_t code_size = 256 * 1024, void* code_ptr = nullptr) + : VActJitCode(attr.d, operand_type::sigmoid /* this is bugy*/, code_size, + code_ptr), + compute_c1h1_(compute_c1h1) { + auto typeExchange = [](const std::string& type) -> gen::operand_type { + if (type == "sigmoid") { + return operand_type::sigmoid; + } else if (type == "relu") { + return operand_type::relu; + } else if (type == "tanh") { + return operand_type::tanh; + } else if (type == "identity" || type == "") { + return operand_type::identity; + } // else throw error + return operand_type::identity; + }; + num_ = attr.d; + use_peephole_ = attr.use_peephole; + act_gate_ = typeExchange(attr.act_gate); + act_cand_ = typeExchange(attr.act_cand); + act_cell_ = typeExchange(attr.act_cell); + } + static bool init(int d); + void generate() override; + + protected: + int num_; + bool compute_c1h1_; + bool use_peephole_; + operand_type act_gate_; + operand_type act_cand_; + operand_type act_cell_; + reg64_t param1{abi_param1}; +}; + +class GRUJitCode : public VActJitCode { + public: + const char* name() const override { + std::string base = "GRUJitCode"; + if (id_ == 0) { + base += "_H1"; + } else if (id_ == 1) { + base += "_HtPart1"; + } else if (id_ == 2) { + base += "_HtPart2"; + } + auto AddTypeStr = [&](operand_type type) { + switch (type) { + case operand_type::relu: + base += "_Relu"; + break; + case operand_type::exp: + base += "_Exp"; + break; + case operand_type::sigmoid: + base += "_Sigmoid"; + break; + case operand_type::tanh: + base += "_Tanh"; + break; + case operand_type::identity: + base += "_Identity"; + break; + default: + break; + } + }; + AddTypeStr(act_gate_); + AddTypeStr(act_cand_); + return base.c_str(); + } + + explicit GRUJitCode(int id, const gru_attr_t& attr, + size_t code_size = 256 * 1024, void* code_ptr = nullptr) + : VActJitCode(attr.d, operand_type::sigmoid /* this is bugy*/, code_size, + code_ptr), + id_(id) { + auto typeExchange = [](const std::string& type) -> gen::operand_type { + if (type == "sigmoid") { + return operand_type::sigmoid; + } else if (type == "relu") { + return operand_type::relu; + } else if (type == "tanh") { + return operand_type::tanh; + } else if (type == "identity" || type == "") { + return operand_type::identity; + } // else throw error + return operand_type::identity; + }; + num_ = attr.d; + act_gate_ = typeExchange(attr.act_gate); + act_cand_ = 
typeExchange(attr.act_cand); + } + static bool init(int d); + void generate() override; + + protected: + int id_; + int num_; + operand_type act_gate_; + operand_type act_cand_; + reg64_t param1{abi_param1}; +}; + +#ifdef PADDLE_WITH_MKLDNN +struct EltwiseMulnChw16cNC : public Xbyak::CodeGenerator { + explicit EltwiseMulnChw16cNC(size_t code_size = 256 * 1024) + : Xbyak::CodeGenerator(code_size) { + // RDI is ptr x_input + // RSI is ptr y_input + // RDX is ptr output + // RCX is height + // r8 is width + + push(rbx); + + xor_(rax, rax); + xor_(r10, r10); + vmovups(zmm3, ptr[rsi]); + + L("h_loop"); + xor_(rbx, rbx); + L("w_loop"); + vmovups(zmm2, ptr[rdi + rax]); + vmulps(zmm1, zmm2, zmm3); + vmovups(ptr[rdx + rax], zmm1); + add(rax, 64); + inc(rbx); + cmp(r8, rbx); + jnz("w_loop"); + inc(r10); + cmp(r10, rcx); + jnz("h_loop"); + + pop(rbx); + ret(); + } +}; +#endif + +} // namespace gen +} // namespace jitkernel +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/math/jit_gen.cc b/paddle/fluid/operators/math/jit_gen.cc new file mode 100644 index 0000000000000000000000000000000000000000..6af39518ed926554c8c839bba701d3827923dba0 --- /dev/null +++ b/paddle/fluid/operators/math/jit_gen.cc @@ -0,0 +1,90 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/math/jit_gen.h" +#include +#include +#include +#include "paddle/fluid/platform/cpu_info.h" + +DEFINE_bool(dump_jitcode, false, "Whether to dump the jitcode to file"); + +namespace paddle { +namespace operators { +namespace math { +namespace jitkernel { +namespace gen { + +constexpr Xbyak::Operand::Code g_abi_regs[] = { + Xbyak::Operand::RBX, Xbyak::Operand::RBP, Xbyak::Operand::R12, + Xbyak::Operand::R13, Xbyak::Operand::R14, Xbyak::Operand::R15}; + +constexpr int num_g_abi_regs = sizeof(g_abi_regs) / sizeof(g_abi_regs[0]); + +void JitCode::preCode() { + for (int i = 0; i < num_g_abi_regs; ++i) { + push(Xbyak::Reg64(g_abi_regs[i])); + } + if (platform::jit::MayIUse(platform::jit::avx512f)) { + mov(reg_EVEX_max_8b_offt, 2 * EVEX_max_8b_offt); + } +} + +void JitCode::postCode() { + for (int i = 0; i < num_g_abi_regs; ++i) { + pop(Xbyak::Reg64(g_abi_regs[num_g_abi_regs - 1 - i])); + } + ret(); +} + +void JitCode::dumpCode(const Xbyak::uint8 *code) const { + if (code) { + static int counter = 0; + std::ostringstream filename; + filename << "paddle_jitcode_" << name() << "." 
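The MKL-DNN-only generator above multiplies one nChw16c slab by 16 per-channel scalars: zmm3 is loaded once from y, then every 16-float block of x across height x width is scaled by it. A scalar reference of the layout semantics (a sketch; the hypothetical function handles one (batch, channel-block) slab):

```cpp
// x and out are one (n, channel-block) slab in nChw16c layout: h and w are
// the outer dims and 16 consecutive channel values sit at each spatial
// position. y holds that slab's 16 per-channel multipliers from the NC input.
void EltwiseMulNchw16cReference(const float* x, const float* y, float* out,
                                int h, int w) {
  for (int i = 0; i < h; ++i) {
    for (int j = 0; j < w; ++j) {
      const int base = (i * w + j) * 16;  // rax advances 64 bytes per step
      for (int c = 0; c < 16; ++c) {
        out[base + c] = x[base + c] * y[c];  // zmm2 * zmm3
      }
    }
  }
}
```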
<< counter << ".bin"; + counter++; + std::ofstream fout(filename.str(), std::ios::out); + if (fout.is_open()) { + fout.write(reinterpret_cast(code), getSize()); + fout.close(); + } + } +} + +Xbyak::Address JitCode::EVEX_compress_addr(Xbyak::Reg64 base, int offt, + bool bcast) { + int scale = 0; + if (EVEX_max_8b_offt <= offt && offt < 3 * EVEX_max_8b_offt) { + offt = offt - 2 * EVEX_max_8b_offt; + scale = 1; + } else if (3 * EVEX_max_8b_offt <= offt && offt < 5 * EVEX_max_8b_offt) { + offt = offt - 4 * EVEX_max_8b_offt; + scale = 2; + } + auto re = Xbyak::RegExp() + base + offt; + if (scale) { + re = re + reg_EVEX_max_8b_offt * scale; + } + if (bcast) { + return zword_b[re]; + } else { + return zword[re]; + } +} + +} // namespace gen +} // namespace jitkernel +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/math/jit_gen.h b/paddle/fluid/operators/math/jit_gen.h new file mode 100644 index 0000000000000000000000000000000000000000..6abf3434cc8d8f6ab2838ef822a4f6b948331802 --- /dev/null +++ b/paddle/fluid/operators/math/jit_gen.h @@ -0,0 +1,80 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "paddle/fluid/platform/macros.h" + +#define XBYAK_USE_MMAP_ALLOCATOR +#include "xbyak/xbyak.h" +#include "xbyak/xbyak_util.h" + +DECLARE_bool(dump_jitcode); + +namespace paddle { +namespace operators { +namespace math { +namespace jitkernel { +namespace gen { + +#define DECLARE_JIT_CODE(codename) \ + const char *name() const override { return #codename; } + +// Application Binary Interface +constexpr Xbyak::Operand::Code abi_param1(Xbyak::Operand::RDI), + abi_param2(Xbyak::Operand::RSI), abi_param3(Xbyak::Operand::RDX), + abi_param4(Xbyak::Operand::RCX), abi_not_param1(Xbyak::Operand::RCX); + +class JitCode : public Xbyak::CodeGenerator { + public: + explicit JitCode(size_t code_size = 256 * 1024, void *code_ptr = nullptr) + : Xbyak::CodeGenerator(code_size, code_ptr) {} + + virtual ~JitCode() {} + virtual const char *name() const = 0; + virtual void generate() = 0; + + template + const FUNC getCode() { + this->generate(); + const Xbyak::uint8 *code = CodeGenerator::getCode(); + if (FLAGS_dump_jitcode) { + this->dumpCode(code); + } + return reinterpret_cast(code); + } + DISABLE_COPY_AND_ASSIGN(JitCode); + + protected: + Xbyak::Reg64 param1{abi_param1}; + const int EVEX_max_8b_offt = 0x200; + const Xbyak::Reg64 reg_EVEX_max_8b_offt = rbp; + + void preCode(); + void postCode(); + void dumpCode(const Xbyak::uint8 *code) const; + void L(const char *label) { Xbyak::CodeGenerator::L(label); } + void L(const Xbyak::Label &label) { Xbyak::CodeGenerator::L(label); } + // Enhanced vector extension + Xbyak::Address EVEX_compress_addr(Xbyak::Reg64 base, int offt, + bool bcast = false); +}; + +} // namespace gen +} // namespace jitkernel +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/math/jit_kernel.h 
b/paddle/fluid/operators/math/jit_kernel.h index b4dfda6db76fd4231be0acd1f90c98a2d62134b8..b78b92b4f97b761654a5b9b178f96c1dc99f7789 100644 --- a/paddle/fluid/operators/math/jit_kernel.h +++ b/paddle/fluid/operators/math/jit_kernel.h @@ -17,6 +17,7 @@ limitations under the License. */ #include // for shared_ptr #include #include +#include "paddle/fluid/operators/math/jit_kernel_impl.h" #include "paddle/fluid/platform/cpu_info.h" #include "paddle/fluid/platform/macros.h" @@ -26,19 +27,14 @@ namespace operators { namespace math { namespace jitkernel { -#define SIGMOID_THRESHOLD_MIN -40.0 -#define SIGMOID_THRESHOLD_MAX 13.0 -#define EXP_MAX_INPUT 40.0 -#define AVX_FLOAT_BLOCK 8 -#define AVX2_FLOAT_BLOCK 8 -#define AVX512_FLOAT_BLOCK 16 - +// TODO(TJ): remove me typedef enum { kLT8, kEQ8, kGT8LT16, kEQ16, kGT16 } jit_block; class Kernel { public: Kernel() = default; virtual ~Kernel() = default; + // TODO(TJ): below members should be deprecated. int num_{0}; int end_{0}; int rest_{0}; @@ -64,76 +60,95 @@ class KernelPool { template class VMulKernel : public Kernel { public: - virtual void Compute(const T *x, const T *y, T *z) const = 0; + void (*Compute)(const T *, const T *, T *, int); }; template class VAddKernel : public Kernel { public: - virtual void Compute(const T *x, const T *y, T *z) const = 0; + void (*Compute)(const T *, const T *, T *, int); }; template -class VScalKernel : public Kernel { +class VAddReluKernel : public Kernel { public: - virtual void Compute(const T a, const T *x, T *y) const = 0; - virtual void Compute(const T a, T *x) const = 0; + void (*Compute)(const T *, const T *, T *, int); }; template -class VAddBiasKernel : public Kernel { +class VScalKernel : public Kernel { public: - virtual void Compute(const T a, const T *x, T *y) const = 0; + // y = a.*x + void (*Compute)(const T *, const T *, T *, int); }; template -class VActKernel : public Kernel { +class VAddBiasKernel : public Kernel { public: - virtual void Compute(const T *x, T *y) const = 0; + // y = a.+x + void (*Compute)(const T *, const T *, T *, int); }; +#ifdef PADDLE_WITH_MKLDNN template -class VReluKernel : public VActKernel { +class EltwiseMulnChw16cNCKernel : public Kernel { public: - virtual void Compute(const T *x, T *y) const = 0; + // nChw16c = nChw16c .* NC + void (*Compute)(const float *, const float *, float *, int, int); }; +#endif template -class VIdentityKernel : public VActKernel { +class VActKernel : public Kernel { public: - virtual void Compute(const T *x, T *y) const = 0; + void (*Compute)(const T *, T *, int); }; template -class VExpKernel : public VActKernel { +class VReluKernel : public VActKernel {}; + +template +class VIdentityKernel : public VActKernel {}; + +template +class VExpKernel : public VActKernel {}; + +template +class VSigmoidKernel : public VActKernel {}; + +template +class VTanhKernel : public VActKernel {}; + +template +class LSTMKernel : public Kernel { public: - virtual void Compute(const T *x, T *y) const = 0; + // compute c1 and h1 without c0 or h0 + void (*ComputeC1H1)(lstm_t *, const lstm_attr_t *); + void (*ComputeCtHt)(lstm_t *, const lstm_attr_t *); }; template -class VSigmoidKernel : public VActKernel { +class GRUKernel : public Kernel { public: - virtual void Compute(const T *x, T *y) const = 0; + // compute h1 without h0 + void (*ComputeH1)(gru_t *, const gru_attr_t *); + void (*ComputeHtPart1)(gru_t *, const gru_attr_t *); + void (*ComputeHtPart2)(gru_t *, const gru_attr_t *); }; template -class VTanhKernel : public VActKernel { +class CRFDecodeKernel : 
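With the interface change above, each kernel exposes a plain Compute function pointer bound once at construction, so call sites no longer pay a virtual dispatch. A hedged usage sketch, following the Get<> pattern from the FCCompute hunk earlier in this diff (the wrapper function here is hypothetical):

```cpp
#include "paddle/fluid/operators/math/jit_kernel.h"

void AddVectors(const float* x, const float* y, float* z, int n) {
  namespace jk = paddle::operators::math::jitkernel;
  // The pool caches one kernel per (type, size); backend selection
  // (JIT / MKL / reference) happened in the constructor, so this call is
  // a single indirect jump.
  const auto& vadd = jk::KernelPool::Instance().Get<jk::VAddKernel<float>>(n);
  vadd->Compute(x, y, z, n);
}
```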
public Kernel { public: - virtual void Compute(const T *x, T *y) const = 0; + virtual void Compute(const int seq_len, const T *x, const T *w, T *alpha, + int *track) const = 0; }; template -class LSTMKernel : public Kernel { +class LayerNormKernel : public Kernel { public: - virtual void ComputeCtHt(T *gates, const T *ct_1, T *ct, T *ht, - /* below only used in peephole*/ - const T *wp_data = nullptr, - T *checked = nullptr) const = 0; - - // compute c1 and h1 without c0 or h0 - virtual void ComputeC1H1(T *gates, T *ct, T *ht, - /* below only used in peephole*/ - const T *wp_data = nullptr) const = 0; + virtual void Compute(T *x, T *out, T *mean, T *var, const T *scale, + const T *bias, int height, + const float epsilon) const = 0; }; } // namespace jitkernel diff --git a/paddle/fluid/operators/math/jit_kernel_blas.cc b/paddle/fluid/operators/math/jit_kernel_blas.cc index 0f9ea533fccdd34a5ccf061d89ffe92687d65933..a0f93fd8e7eb7d81211724a6991a681e2a0ed9ce 100644 --- a/paddle/fluid/operators/math/jit_kernel_blas.cc +++ b/paddle/fluid/operators/math/jit_kernel_blas.cc @@ -15,375 +15,381 @@ limitations under the License. */ #include "paddle/fluid/operators/math/jit_kernel.h" #include #include "paddle/fluid/operators/math/jit_kernel_macro.h" -#ifdef PADDLE_WITH_MKLML -#include "paddle/fluid/platform/dynload/mklml.h" +#include "paddle/fluid/operators/math/jit_kernel_refer.h" +#include "paddle/fluid/platform/enforce.h" + +#ifdef PADDLE_WITH_XBYAK +#include "paddle/fluid/operators/math/jit_code.h" #endif -#ifdef __AVX__ -#include +#ifdef PADDLE_WITH_MKLML +#include "paddle/fluid/platform/dynload/mklml.h" #endif namespace paddle { namespace operators { namespace math { namespace jitkernel { - namespace jit = platform::jit; +#ifdef PADDLE_WITH_MKLML +template +void VMulMKL(const T* x, const T* y, T* z, int n); + +template <> +void VMulMKL(const float* x, const float* y, float* z, int n) { + platform::dynload::vsMul(n, x, y, z); +} + +template <> +void VMulMKL(const double* x, const double* y, double* z, int n) { + platform::dynload::vdMul(n, x, y, z); +} + +template +void VAddMKL(const T* x, const T* y, T* z, int n); + +template <> +void VAddMKL(const float* x, const float* y, float* z, int n) { + platform::dynload::vsAdd(n, x, y, z); +} + +template <> +void VAddMKL(const double* x, const double* y, double* z, int n) { + platform::dynload::vdAdd(n, x, y, z); +} + +template +void VScalMKL(const T* a, const T* x, T* y, int n); + +template <> +void VScalMKL(const float* a, const float* x, float* y, int n) { + if (x == y) { + platform::dynload::cblas_sscal(n, *a, y, 1); + } else { + refer::VScal(a, x, y, n); + } +} + +template <> +void VScalMKL(const double* a, const double* x, double* y, int n) { + if (x == y) { + platform::dynload::cblas_dscal(n, *a, y, 1); + } else { + refer::VScal(a, x, y, n); + } +} + +#endif + /* VMUL JitKernel */ -template +template class VMulKernelImpl : public VMulKernel { public: - explicit VMulKernelImpl(int d) : VMulKernel() { this->num_ = d; } - void Compute(const T* x, const T* y, T* z) const override { - for (int i = 0; i < this->num_; ++i) { - z[i] = x[i] * y[i]; + JITKERNEL_DECLARE_STATIC_FUNC; + explicit VMulKernelImpl(int d) : VMulKernel() { +#ifdef PADDLE_WITH_XBYAK + if (useJIT(d)) { + // roughly estimate the size of code + size_t sz = 96 + d / YMM_FLOAT_BLOCK * 4 * 8; + jitcode_.reset(new gen::VXXJitCode(d, gen::operand_type::mul, 0, false, + sz > 4096 ? 
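Every KernelImpl in this file repeats the same constructor-time dispatch: take the Xbyak JIT if useJIT(d) holds, fall back to MKL for sizes where it wins, and otherwise use the portable reference loop. A stripped-down sketch of that shape (names hypothetical, backends stubbed):

```cpp
// Probe backends once, store a plain function pointer; Compute() itself
// carries no per-call branching.
using VMulFn = void (*)(const float*, const float*, float*, int);

void VMulRefer(const float* x, const float* y, float* z, int n) {
  for (int i = 0; i < n; ++i) z[i] = x[i] * y[i];  // refer::VMul equivalent
}

struct VMulKernelSketch {
  VMulFn Compute;
  explicit VMulKernelSketch(int d) {
    // Real impl: if (useJIT(d)) Compute = jitcode->getCode<...>();
    //            else if (useMKL(d)) Compute = VMulMKL<float>;
    Compute = VMulRefer;
  }
};
```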
sz : 4096)); + this->Compute = + jitcode_->getCode(); + return; } - } -}; - +#endif #ifdef PADDLE_WITH_MKLML -#define MKL_FLOAT(isa, block) \ - template <> \ - void VMulKernelImpl::Compute( \ - const float* x, const float* y, float* z) const { \ - platform::dynload::vsMul(this->num_, x, y, z); \ + if (useMKL(d)) { + this->Compute = VMulMKL; + return; + } +#endif + this->Compute = refer::VMul; } -#define MKL_DOUBLE(isa, block) \ - template <> \ - void VMulKernelImpl::Compute( \ - const double* x, const double* y, double* z) const { \ - platform::dynload::vdMul(this->num_, x, y, z); \ - } +#ifdef PADDLE_WITH_XBYAK -FOR_EACH_ISA(MKL_FLOAT, kGT16); -FOR_EACH_ISA_BLOCK(MKL_DOUBLE); + private: + std::unique_ptr jitcode_{nullptr}; #endif +}; -#define INTRI8_FLOAT(isa) \ - template <> \ - void VMulKernelImpl::Compute( \ - const float* x, const float* y, float* z) const { \ - __m256 tmpx, tmpy; \ - tmpx = _mm256_loadu_ps(x); \ - tmpy = _mm256_loadu_ps(y); \ - tmpx = _mm256_mul_ps(tmpx, tmpy); \ - _mm256_storeu_ps(z, tmpx); \ - } - -// avx > for > mkl -#ifdef __AVX__ -INTRI8_FLOAT(jit::avx); -#endif -#ifdef __AVX2__ -INTRI8_FLOAT(jit::avx2); +#ifdef PADDLE_WITH_XBYAK +template <> +bool VMulKernelImpl::useJIT(int d) { + return gen::VXXJitCode::init(d); +} #endif -#ifdef __AVX512F__ -INTRI8_FLOAT(jit::avx512f); + +#ifdef PADDLE_WITH_MKLML +template <> +bool VMulKernelImpl::useMKL(int d) { + return jit::MayIUse(jit::avx512f) && d > 512; +} + +template <> +bool VMulKernelImpl::useMKL(int d) { + return true; +} #endif -// TODO(TJ): eq16 test and complete avx512 -#undef INTRI8_FLOAT -#undef MKL_FLOAT -#undef MKL_DOUBLE -/* VADD JitKernel */ -template +/* VAdd JitKernel */ +template class VAddKernelImpl : public VAddKernel { public: - explicit VAddKernelImpl(int d) : VAddKernel() { this->num_ = d; } - void Compute(const T* x, const T* y, T* z) const override { - for (int i = 0; i < this->num_; ++i) { - z[i] = x[i] + y[i]; + JITKERNEL_DECLARE_STATIC_FUNC; + explicit VAddKernelImpl(int d) : VAddKernel() { +#ifdef PADDLE_WITH_XBYAK + if (useJIT(d)) { + size_t sz = 96 + d / YMM_FLOAT_BLOCK * 4 * 8; + jitcode_.reset(new gen::VXXJitCode(d, gen::operand_type::add, 0, false, + sz > 4096 ? 
sz : 4096)); + this->Compute = + jitcode_->getCode(); + return; } - } -}; - +#endif #ifdef PADDLE_WITH_MKLML -#define MKL_FLOAT(isa, block) \ - template <> \ - void VAddKernelImpl::Compute( \ - const float* x, const float* y, float* z) const { \ - platform::dynload::vsAdd(this->num_, x, y, z); \ - } - -#define MKL_DOUBLE(isa, block) \ - template <> \ - void VAddKernelImpl::Compute( \ - const double* x, const double* y, double* z) const { \ - platform::dynload::vdAdd(this->num_, x, y, z); \ + if (useMKL(d)) { + this->Compute = VAddMKL; + return; + } +#endif + this->Compute = refer::VAdd; } +#ifdef PADDLE_WITH_XBYAK -FOR_EACH_ISA(MKL_FLOAT, kGT16); -FOR_EACH_ISA_BLOCK(MKL_DOUBLE); + private: + std::unique_ptr jitcode_{nullptr}; #endif +}; -#define INTRI8_FLOAT(isa) \ - template <> \ - void VAddKernelImpl::Compute( \ - const float* x, const float* y, float* z) const { \ - __m256 tmpx, tmpy; \ - tmpx = _mm256_loadu_ps(x); \ - tmpy = _mm256_loadu_ps(y); \ - tmpx = _mm256_add_ps(tmpx, tmpy); \ - _mm256_storeu_ps(z, tmpx); \ - } -#ifdef __AVX__ -INTRI8_FLOAT(jit::avx); +#ifdef PADDLE_WITH_XBYAK +template <> +bool VAddKernelImpl::useJIT(int d) { + return gen::VXXJitCode::init(d); +} #endif -#ifdef __AVX2__ -INTRI8_FLOAT(jit::avx2); -#endif -#ifdef __AVX512F__ -INTRI8_FLOAT(jit::avx512f); -#endif -// TODO(TJ): eq16 test and complete avx512 -#undef INTRI8_FLOAT -#undef MKL_FLOAT -#undef MKL_DOUBLE +#ifdef PADDLE_WITH_MKLML +template <> +bool VAddKernelImpl::useMKL(int d) { + return d > 512; +} + +template <> +bool VAddKernelImpl::useMKL(int d) { + return true; +} +#endif -/* VSCAL JitKernel */ -template -class VScalKernelImpl : public VScalKernel { +#ifdef PADDLE_WITH_MKLDNN +/* EltwiseMul for nChw16c & NC inputs JitKernel */ +template +class EltwiseMulnChw16cNCKernelImpl + : public math::jitkernel::EltwiseMulnChw16cNCKernel { public: - explicit VScalKernelImpl(int d) : VScalKernel() { this->num_ = d; } - void Compute(const T a, const T* x, T* y) const override { - for (int i = 0; i < this->num_; ++i) { - y[i] = a * x[i]; - } - } - void Compute(const T a, T* x) const override { - for (int i = 0; i < this->num_; ++i) { - x[i] = a * x[i]; + JITKERNEL_DECLARE_STATIC_FUNC; + explicit EltwiseMulnChw16cNCKernelImpl(int d) + : EltwiseMulnChw16cNCKernel() { + using mul_func_t = void (*)(const float*, const float*, float*, int, int); +#ifdef PADDLE_WITH_XBYAK + if (useJIT(d)) { + // roughly estimate the size of code + size_t sz = 96 + d / YMM_FLOAT_BLOCK * 4 * 8; + sz = sz > 4096 ? 
sz : 4096; + jitcode_.reset(new gen::EltwiseMulnChw16cNC(sz)); + this->Compute = (mul_func_t)jitcode_->getCode(); + return; } +#endif + PADDLE_THROW( + "This kernel shouldn't be used in Non-Xbyak, Non-MKL-DNN " + "environemnt"); } + +#ifdef PADDLE_WITH_XBYAK + + private: + std::unique_ptr jitcode_{nullptr}; }; -#ifdef PADDLE_WITH_MKLML -#define MKL_FLOAT(isa, block) \ - template <> \ - void VScalKernelImpl::Compute(const float a, float* x) \ - const { \ - platform::dynload::cblas_sscal(this->num_, a, x, 1); \ - } +template <> +bool EltwiseMulnChw16cNCKernelImpl::useJIT(int d) { + return true; +} +#endif +#endif -#define MKL_DOUBLE(isa, block) \ - template <> \ - void VScalKernelImpl::Compute(const double a, double* x) \ - const { \ - platform::dynload::cblas_dscal(this->num_, a, x, 1); \ +/* VAddRelu JitKernel */ +template +class VAddReluKernelImpl : public VAddReluKernel { + public: + JITKERNEL_DECLARE_STATIC_FUNC; + explicit VAddReluKernelImpl(int d) : VAddReluKernel() { +#ifdef PADDLE_WITH_XBYAK + if (useJIT(d)) { + size_t sz = 96 + d / YMM_FLOAT_BLOCK * 4 * 8; + jitcode_.reset(new gen::VXXJitCode(d, gen::operand_type::add, 0, true, + sz > 4096 ? sz : 4096)); + this->Compute = + jitcode_->getCode(); + return; + } +#endif + this->Compute = refer::VAddRelu; } +#ifdef PADDLE_WITH_XBYAK -FOR_EACH_ISA(MKL_FLOAT, kGT16); -FOR_EACH_ISA_BLOCK(MKL_DOUBLE); + private: + std::unique_ptr jitcode_{nullptr}; #endif +}; -#define INTRI8_FLOAT(isa) \ - template <> \ - void VScalKernelImpl::Compute( \ - const float a, const float* x, float* y) const { \ - __m256 tmp; \ - __m256 scalar = _mm256_set1_ps(a); \ - tmp = _mm256_loadu_ps(x); \ - tmp = _mm256_mul_ps(tmp, scalar); \ - _mm256_storeu_ps(y, tmp); \ - } -#define INTRI8_INPLACE_FLOAT(isa) \ - template <> \ - void VScalKernelImpl::Compute(const float a, float* x) \ - const { \ - __m256 tmp; \ - __m256 scalar = _mm256_set1_ps(a); \ - tmp = _mm256_loadu_ps(x); \ - tmp = _mm256_mul_ps(tmp, scalar); \ - _mm256_storeu_ps(x, tmp); \ - } +#ifdef PADDLE_WITH_XBYAK +template <> +bool VAddReluKernelImpl::useJIT(int d) { + return gen::VXXJitCode::init(d); +} +#endif -#ifdef __AVX__ -INTRI8_FLOAT(jit::avx); -INTRI8_INPLACE_FLOAT(jit::avx); +/* VScal JitKernel */ +template +class VScalKernelImpl : public VScalKernel { + public: + JITKERNEL_DECLARE_STATIC_FUNC; + explicit VScalKernelImpl(int d) : VScalKernel() { +#ifdef PADDLE_WITH_XBYAK + if (useJIT(d)) { + size_t sz = 96 + d / YMM_FLOAT_BLOCK * 4 * 8; + jitcode_.reset(new gen::VXXJitCode(d, gen::operand_type::mul, 1, false, + sz > 4096 ? 
sz : 4096)); + this->Compute = + jitcode_->getCode(); + return; + } #endif -#ifdef __AVX2__ -INTRI8_FLOAT(jit::avx2); -INTRI8_INPLACE_FLOAT(jit::avx2); +#ifdef PADDLE_WITH_MKLML + if (useMKL(d)) { + this->Compute = VScalMKL; + return; + } #endif -#ifdef __AVX512F__ -INTRI8_FLOAT(jit::avx512f); -INTRI8_INPLACE_FLOAT(jit::avx512f); + this->Compute = refer::VScal; + } +#ifdef PADDLE_WITH_XBYAK + + private: + std::unique_ptr jitcode_{nullptr}; #endif -// TODO(TJ): eq16 test and complete avx512 +}; -#undef INTRI8_FLOAT -#undef INTRI8_INPLACE_FLOAT -#undef MKL_FLOAT -#undef MKL_DOUBLE +#ifdef PADDLE_WITH_XBYAK +template <> +bool VScalKernelImpl::useJIT(int d) { + return gen::VXXJitCode::init(d, 1); +} +#endif + +#ifdef PADDLE_WITH_MKLML +template <> +bool VScalKernelImpl::useMKL(int d) { + return d > 512; +} +template <> +bool VScalKernelImpl::useMKL(int d) { + return true; +} +#endif /* VAddBias JitKernel */ -template +template class VAddBiasKernelImpl : public VAddBiasKernel { public: - explicit VAddBiasKernelImpl(int d) : VAddBiasKernel() { this->num_ = d; } - void Compute(const T a, const T* x, T* y) const override { - for (int i = 0; i < this->num_; ++i) { - y[i] = x[i] + a; + JITKERNEL_DECLARE_STATIC_FUNC; + explicit VAddBiasKernelImpl(int d) : VAddBiasKernel() { +#ifdef PADDLE_WITH_XBYAK + if (useJIT(d)) { + size_t sz = 96 + d / YMM_FLOAT_BLOCK * 4 * 8; + jitcode_.reset(new gen::VXXJitCode(d, gen::operand_type::add, 1, false, + sz > 4096 ? sz : 4096)); + this->Compute = + jitcode_->getCode(); + return; } - } -}; - -#define INTRI8_FLOAT(isa) \ - template <> \ - void VAddBiasKernelImpl::Compute( \ - const float a, const float* x, float* y) const { \ - __m256 tmp = _mm256_loadu_ps(x); \ - tmp = _mm256_add_ps(tmp, _mm256_set1_ps(a)); \ - _mm256_storeu_ps(y, tmp); \ - } +#endif -#define INTRI16_FLOAT(isa) \ - template <> \ - void VAddBiasKernelImpl::Compute( \ - const float a, const float* x, float* y) const { \ - __m256 tmp0 = _mm256_loadu_ps(x); \ - __m256 tmp1 = _mm256_loadu_ps(x + 8); \ - tmp0 = _mm256_add_ps(tmp0, _mm256_set1_ps(a)); \ - tmp1 = _mm256_add_ps(tmp1, _mm256_set1_ps(a)); \ - _mm256_storeu_ps(y, tmp0); \ - _mm256_storeu_ps(y + 8, tmp1); \ + this->Compute = refer::VAddBias; } +#ifdef PADDLE_WITH_XBYAK -#ifdef __AVX__ -INTRI8_FLOAT(jit::avx); -INTRI16_FLOAT(jit::avx); -#endif -#ifdef __AVX2__ -INTRI8_FLOAT(jit::avx2); -INTRI16_FLOAT(jit::avx2); -#endif -#ifdef __AVX512F__ -INTRI8_FLOAT(jit::avx512f); -INTRI16_FLOAT(jit::avx512f); + private: + std::unique_ptr jitcode_{nullptr}; #endif -// TODO(TJ): eq16 test and complete avx512 +}; -#undef INTRI8_FLOAT -#undef INTRI16_FLOAT +#ifdef PADDLE_WITH_XBYAK +template <> +bool VAddBiasKernelImpl::useJIT(int d) { + return gen::VXXJitCode::init(d, 1); +} +#endif /* VRelu JitKernel */ -template +template class VReluKernelImpl : public VReluKernel { public: - explicit VReluKernelImpl(int d) : VReluKernel() { this->num_ = d; } - void Compute(const T* x, T* y) const override { - for (int i = 0; i < this->num_; ++i) { - y[i] = x[i] > 0 ? x[i] : 0; + JITKERNEL_DECLARE_STATIC_FUNC; + explicit VReluKernelImpl(int d) : VReluKernel() { +#ifdef PADDLE_WITH_XBYAK + if (useJIT(d)) { + size_t sz = 96 /* init size */ + + d / YMM_FLOAT_BLOCK * 4 /* instructions */ * + 8 /* average bytes for each instruction */; + jitcode_.reset(new gen::VActJitCode(d, gen::operand_type::relu, + sz > 4096 ? 
sz : 4096)); + this->Compute = jitcode_->getCode(); + return; } - } -}; - -#define INTRI8_FLOAT(isa) \ - template <> \ - void VReluKernelImpl::Compute(const float* x, float* y) \ - const { \ - __m256 tmp = _mm256_loadu_ps(x); \ - tmp = _mm256_max_ps(tmp, _mm256_setzero_ps()); \ - _mm256_storeu_ps(y, tmp); \ - } - -#define INTRI16_FLOAT(isa) \ - template <> \ - void VReluKernelImpl::Compute(const float* x, float* y) \ - const { \ - __m256 zeros = _mm256_setzero_ps(); \ - __m256 tmp0 = _mm256_loadu_ps(x); \ - __m256 tmp1 = _mm256_loadu_ps(x + 8); \ - tmp0 = _mm256_max_ps(tmp0, zeros); \ - tmp1 = _mm256_max_ps(tmp1, zeros); \ - _mm256_storeu_ps(y, tmp0); \ - _mm256_storeu_ps(y + 8, tmp1); \ - } - -#define INTRI_GT8LT16_FLOAT(isa) \ - template <> \ - VReluKernelImpl::VReluKernelImpl(int d) \ - : VReluKernel() { \ - this->num_ = d; \ - this->end_ = AVX_FLOAT_BLOCK; \ - this->rest_ = d - AVX_FLOAT_BLOCK; \ - } \ - template <> \ - void VReluKernelImpl::Compute(const float* x, \ - float* y) const { \ - __m256 zeros = _mm256_setzero_ps(); \ - __m256 tmp0 = _mm256_loadu_ps(x); \ - __m256 tmp1 = _mm256_loadu_ps(x + this->rest_); \ - tmp0 = _mm256_max_ps(tmp0, zeros); \ - tmp1 = _mm256_max_ps(tmp1, zeros); \ - _mm256_storeu_ps(y, tmp0); \ - _mm256_storeu_ps(y + this->rest_, tmp1); \ - } +#endif -#define INTRI_GT16_FLOAT(isa) \ - template <> \ - VReluKernelImpl::VReluKernelImpl(int d) \ - : VReluKernel() { \ - this->num_ = d; \ - this->end_ = d - d % AVX_FLOAT_BLOCK; \ - this->rest_ = d - AVX_FLOAT_BLOCK; \ - } \ - template <> \ - void VReluKernelImpl::Compute(const float* x, float* y) \ - const { \ - __m256 zeros = _mm256_setzero_ps(); \ - for (int i = 0; i < this->end_; i += AVX_FLOAT_BLOCK) { \ - __m256 tmp = _mm256_loadu_ps(x + i); \ - tmp = _mm256_max_ps(tmp, zeros); \ - _mm256_storeu_ps(y + i, tmp); \ - } \ - __m256 tmp = _mm256_loadu_ps(x + this->rest_); \ - tmp = _mm256_max_ps(tmp, zeros); \ - _mm256_storeu_ps(y + this->rest_, tmp); \ + this->Compute = refer::VRelu; } +#ifdef PADDLE_WITH_XBYAK -#ifdef __AVX__ -INTRI8_FLOAT(jit::avx); -INTRI16_FLOAT(jit::avx); -INTRI_GT8LT16_FLOAT(jit::avx); -INTRI_GT16_FLOAT(jit::avx); -#endif -#ifdef __AVX2__ -INTRI8_FLOAT(jit::avx2); -INTRI16_FLOAT(jit::avx2); -INTRI_GT8LT16_FLOAT(jit::avx2); -INTRI_GT16_FLOAT(jit::avx2); -#endif -#ifdef __AVX512F__ -// TODO(TJ): refine avx512 -INTRI8_FLOAT(jit::avx512f); -INTRI16_FLOAT(jit::avx512f); -INTRI_GT8LT16_FLOAT(jit::avx512f); -INTRI_GT16_FLOAT(jit::avx512f); + private: + std::unique_ptr jitcode_{nullptr}; #endif +}; -#undef INTRI8_FLOAT -#undef INTRI16_FLOAT -#undef INTRI_GT8LT16_FLOAT -#undef INTRI_GT16_FLOAT +#ifdef PADDLE_WITH_XBYAK +template <> +bool VReluKernelImpl::useJIT(int d) { + return gen::VActJitCode::init(d, gen::operand_type::relu); +} +#endif /* An empty JitKernel */ -template +template class VIdentityKernelImpl : public VIdentityKernel { public: - explicit VIdentityKernelImpl(int d) : VIdentityKernel() { this->num_ = d; } - void Compute(const T* x, T* y) const override {} + JITKERNEL_DECLARE_STATIC_FUNC; + explicit VIdentityKernelImpl(int d) : VIdentityKernel() { + this->Compute = refer::VIdentity; + } }; REGISTER_JITKERNEL(vmul, VMulKernel); REGISTER_JITKERNEL(vadd, VAddKernel); +REGISTER_JITKERNEL(vaddrelu, VAddReluKernel); REGISTER_JITKERNEL(vscal, VScalKernel); -REGISTER_JITKERNEL(vaddb, VAddBiasKernel); +REGISTER_JITKERNEL(vaddbias, VAddBiasKernel); REGISTER_JITKERNEL(vrelu, VReluKernel); REGISTER_JITKERNEL(videntity, VIdentityKernel); +#ifdef PADDLE_WITH_MKLDNN 
+REGISTER_JITKERNEL(eltwise_mul_nchw16c, EltwiseMulnChw16cNCKernel); +#endif } // namespace jitkernel } // namespace math diff --git a/paddle/fluid/operators/math/jit_kernel_crf_decode.cc b/paddle/fluid/operators/math/jit_kernel_crf_decode.cc new file mode 100644 index 0000000000000000000000000000000000000000..4d26b81948238f18b097f535534fcfe9049b93c3 --- /dev/null +++ b/paddle/fluid/operators/math/jit_kernel_crf_decode.cc @@ -0,0 +1,296 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/math/jit_kernel.h" +#include +#include +#include "paddle/fluid/operators/math/jit_kernel_macro.h" +#ifdef __AVX__ +#include +#endif + +namespace paddle { +namespace operators { +namespace math { +namespace jitkernel { + +namespace jit = platform::jit; + +/* CRF Decode JitKernel */ +template +class CRFDecodeKernelImpl : public CRFDecodeKernel { + public: + explicit CRFDecodeKernelImpl(int tag_num) : CRFDecodeKernel() { + this->num_ = tag_num; + } + void Compute(const int seq_len, const T* x, const T* w, T* alpha, + int* track) const override { + constexpr int state_trans_base_idx = 2; + for (int i = 0; i < this->num_; ++i) { + alpha[i] = w[i] + x[i]; + } + for (int k = 1; k < seq_len; ++k) { + for (int i = 0; i < this->num_; ++i) { + T max_score = -std::numeric_limits::max(); + int max_j = 0; + for (int j = 0; j < this->num_; ++j) { + T score = alpha[(k - 1) * this->num_ + j] + + w[(j + state_trans_base_idx) * this->num_ + i]; + if (score > max_score) { + max_score = score; + max_j = j; + } + } + alpha[k * this->num_ + i] = max_score + x[k * this->num_ + i]; + track[k * this->num_ + i] = max_j; + } + } + } +}; + +#define INIT_ALPHA(step_size) \ + /* Setup the alpha initial value.*/ \ + int i_offset = 0; \ + int last_offset = this->rest_ - step_size; \ + for (int i = 0; i <= this->end_; ++i) { \ + /* weights, input and alpha values. */ \ + __m256 w_content, x_content, alpha_content; \ + /* Load the relevant data into the variables from un-aligned address.*/ \ + w_content = _mm256_loadu_ps(w + i_offset); \ + x_content = _mm256_loadu_ps(x + i_offset); \ + alpha_content = _mm256_add_ps(w_content, x_content); \ + _mm256_storeu_ps(alpha + i_offset, alpha_content); \ + i_offset += step_size; \ + if (i == this->end_ - 1) { \ + if (this->rest_ > 0) { \ + i_offset += last_offset; \ + } else { \ + break; \ + } \ + } \ + } + +#define UPDATE_ALPHA(step_size) \ + /* Update the alpha and track values. 
*/ \ + __m256 x_content = _mm256_loadu_ps(x + seq_offset + this->num_ + j_offset); \ + max_score = _mm256_add_ps(max_score, x_content); \ + _mm256_storeu_ps(alpha + seq_offset + this->num_ + j_offset, max_score); \ + _mm256_storeu_si256( \ + reinterpret_cast<__m256i*>(track + seq_offset + this->num_ + j_offset), \ + max_j); \ + /* Calculate the offset of next step*/ \ + j_offset += step_size; \ + if (j == this->end_ - 1) { \ + if (this->rest_ > 0) { \ + j_offset += last_offset; \ + } else { \ + break; \ + } \ + } + +#define INTRIAVX_FLOAT(block) \ + template <> \ + CRFDecodeKernelImpl::CRFDecodeKernelImpl( \ + int tag_num) \ + : CRFDecodeKernel() { \ + this->num_ = tag_num; \ + this->end_ = this->num_ / YMM_FLOAT_BLOCK; \ + this->rest_ = this->num_ % YMM_FLOAT_BLOCK; \ + } \ + template <> \ + void CRFDecodeKernelImpl::Compute( \ + const int seq_len, const float* x, const float* w, float* alpha, \ + int* track) const { \ + INIT_ALPHA(YMM_FLOAT_BLOCK) \ + /* Use the column-major strategy to get the location of maximum score.*/ \ + int seq_offset = 0; \ + constexpr int state_trans_base_idx = 2; \ + for (int k = 1; k < seq_len; ++k) { \ + int j_offset = 0; \ + for (int j = 0; j <= this->end_; ++j) { \ + /* Initialize the variables of maximum score and location.*/ \ + __m256 max_score = _mm256_set1_ps(-std::numeric_limits::max()); \ + __m256i max_j = _mm256_set1_epi32(0); \ + /* Calculate the offset of transition_weights.*/ \ + int trans_offset = state_trans_base_idx * this->num_ + j_offset; \ + for (int i = 0; i < this->num_; ++i) { \ + /* Initalize the content of alpha variable with related offset.*/ \ + __m256 alpha_content = _mm256_broadcast_ss(alpha + seq_offset + i); \ + /* Obtain the content of weights from un-aligned address.*/ \ + __m256 w_content = _mm256_loadu_ps(w + trans_offset); \ + __m256 score_v = _mm256_add_ps(alpha_content, w_content); \ + __m256 mask = _mm256_cmp_ps(score_v, max_score, _CMP_GT_OS); \ + /* According to the mask value, update the index of the max_score.*/ \ + /* AVX instructions.*/ \ + __m128i lo_max_j = _mm256_extractf128_si256(max_j, 0); \ + __m128i hi_max_j = _mm256_extractf128_si256(max_j, 1); \ + __m128i lo_mask = _mm256_extractf128_si256((__m256i)mask, 0); \ + __m128i hi_mask = _mm256_extractf128_si256((__m256i)mask, 1); \ + lo_max_j = _mm_andnot_si128(lo_mask, lo_max_j); \ + hi_max_j = _mm_andnot_si128(hi_mask, hi_max_j); \ + lo_mask = _mm_and_si128(lo_mask, _mm_set1_epi32(i)); \ + hi_mask = _mm_and_si128(hi_mask, _mm_set1_epi32(i)); \ + lo_max_j = _mm_or_si128(lo_mask, lo_max_j); \ + hi_max_j = _mm_or_si128(hi_mask, hi_max_j); \ + max_j = _mm256_insertf128_si256(max_j, lo_max_j, 0); \ + max_j = _mm256_insertf128_si256(max_j, hi_max_j, 1); \ + /* AVX done*/ \ + /* Update the max_score value.*/ \ + max_score = _mm256_max_ps(max_score, score_v); \ + trans_offset += this->num_; \ + } \ + UPDATE_ALPHA(YMM_FLOAT_BLOCK) \ + } \ + seq_offset += this->num_; \ + } \ + } + +#define INTRIAVX2_FLOAT(isa, block) \ + template <> \ + CRFDecodeKernelImpl::CRFDecodeKernelImpl(int tag_num) \ + : CRFDecodeKernel() { \ + this->num_ = tag_num; \ + this->end_ = this->num_ / YMM_FLOAT_BLOCK; \ + this->rest_ = this->num_ % YMM_FLOAT_BLOCK; \ + } \ + template <> \ + void CRFDecodeKernelImpl::Compute( \ + const int seq_len, const float* x, const float* w, float* alpha, \ + int* track) const { \ + INIT_ALPHA(YMM_FLOAT_BLOCK) \ + /* Use the column-major strategy to get the location of maximum score.*/ \ + int seq_offset = 0; \ + constexpr int state_trans_base_idx = 2; \ + for (int 
k = 1; k < seq_len; ++k) { \ + int j_offset = 0; \ + for (int j = 0; j <= this->end_; ++j) { \ + /* Initialize the variables of maximum score and location.*/ \ + __m256 max_score = _mm256_set1_ps(-std::numeric_limits::max()); \ + __m256i max_j = _mm256_set1_epi32(0); \ + /* Calculate the offset of transition_weights.*/ \ + int trans_offset = state_trans_base_idx * this->num_ + j_offset; \ + for (int i = 0; i < this->num_; ++i) { \ + /* Initalize the content of alpha variable with related offset.*/ \ + __m256 alpha_content = _mm256_broadcast_ss(alpha + seq_offset + i); \ + /* Obtain the content of weights from un-aligned address.*/ \ + __m256 w_content = _mm256_loadu_ps(w + trans_offset); \ + __m256 score_v = _mm256_add_ps(alpha_content, w_content); \ + __m256 mask = _mm256_cmp_ps(score_v, max_score, _CMP_GT_OS); \ + /* According to the mask value, update the index of the max_score.*/ \ + /* AVX2 instructions.*/ \ + max_j = _mm256_or_si256( \ + _mm256_andnot_si256((__m256i)mask, max_j), \ + _mm256_and_si256((__m256i)mask, _mm256_set1_epi32(i))); \ + /* Update the max_score value.*/ \ + max_score = _mm256_max_ps(max_score, score_v); \ + trans_offset += this->num_; \ + } \ + UPDATE_ALPHA(YMM_FLOAT_BLOCK) \ + } \ + seq_offset += this->num_; \ + } \ + } + +#define INTRIAVX512_FLOAT(block) \ + template <> \ + CRFDecodeKernelImpl::CRFDecodeKernelImpl( \ + int tag_num) \ + : CRFDecodeKernel() { \ + this->num_ = tag_num; \ + this->end_ = this->num_ / ZMM_FLOAT_BLOCK; \ + this->rest_ = this->num_ % ZMM_FLOAT_BLOCK; \ + } \ + template <> \ + void CRFDecodeKernelImpl::Compute( \ + const int seq_len, const float* x, const float* w, float* alpha, \ + int* track) const { \ + INIT_ALPHA(ZMM_FLOAT_BLOCK) \ + /* Use the column-major strategy to get the location of maximum score.*/ \ + int seq_offset = 0; \ + constexpr int state_trans_base_idx = 2; \ + for (int k = 1; k < seq_len; ++k) { \ + int j_offset = 0; \ + for (int j = 0; j <= this->end_; ++j) { \ + /* Initialize the variables of maximum score and location.*/ \ + __m512 max_score = _mm512_set1_ps(-std::numeric_limits::max()); \ + __m512i max_j = _mm512_setzero_si512(); \ + /* Calculate the offset of transition_weights.*/ \ + int trans_offset = state_trans_base_idx * this->num_ + j_offset; \ + for (int i = 0; i < this->num_; ++i) { \ + /* Initalize the content of alpha variable with related offset.*/ \ + __m512 alpha_content = _mm512_set1_ps(*(alpha + seq_offset + i)); \ + /* Obtain the content of weights from un-aligned address.*/ \ + __m512 w_content = _mm512_loadu_ps(w + trans_offset); \ + __m512 score_v = _mm512_add_ps(alpha_content, w_content); \ + __mmask16 mask = _mm512_cmp_ps_mask(score_v, max_score, _CMP_GT_OS); \ + /* AVX512 instructions.*/ \ + max_j = _mm512_mask_set1_epi32(max_j, mask, i); \ + /* Update the max_score value.*/ \ + max_score = _mm512_max_ps(max_score, score_v); \ + trans_offset += this->num_; \ + } \ + /* Update the alpha and track values.*/ \ + __m512 x_content = \ + _mm512_loadu_ps(x + seq_offset + this->num_ + j_offset); \ + max_score = _mm512_add_ps(max_score, x_content); \ + _mm512_storeu_ps(alpha + seq_offset + this->num_ + j_offset, \ + max_score); \ + _mm512_storeu_si512(reinterpret_cast<__m512i*>(track + seq_offset + \ + this->num_ + j_offset), \ + max_j); \ + /* Calculate the offset of next step*/ \ + j_offset += ZMM_FLOAT_BLOCK; \ + if (j == this->end_ - 1) { \ + if (this->rest_ > 0) { \ + j_offset += last_offset; \ + } else { \ + break; \ + } \ + } \ + } \ + seq_offset += this->num_; \ + } \ + } + +#ifdef __AVX__ 
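In the AVX variant above there is no 256-bit integer blend, so the per-lane update `max_j = score > max ? i : max_j` is emulated by splitting into 128-bit halves and recombining with andnot/and/or; the AVX2 variant collapses this to a single `_mm256_or_si256(_mm256_andnot_si256(...), _mm256_and_si256(...))` pair, and the AVX512 variant uses a mask register directly via `_mm512_mask_set1_epi32`. A scalar sketch of the same bitwise select (illustrative only, not part of the patch):

// out = (mask & i) | (~mask & old): an all-ones mask picks the new index i,
// an all-zeros mask keeps the running argmax -- exactly what each SIMD lane
// computes after the _CMP_GT_OS comparison.
static inline unsigned select_argmax_lane(unsigned mask, unsigned i, unsigned old_j) {
  return (mask & i) | (~mask & old_j);
}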
+INTRIAVX_FLOAT(kEQ8); +INTRIAVX_FLOAT(kGT8LT16); +INTRIAVX_FLOAT(kEQ16); +INTRIAVX_FLOAT(kGT16); +#endif +#ifdef __AVX2__ +INTRIAVX2_FLOAT(jit::avx2, kEQ8); +INTRIAVX2_FLOAT(jit::avx2, kGT8LT16); +INTRIAVX2_FLOAT(jit::avx2, kEQ16); +INTRIAVX2_FLOAT(jit::avx2, kGT16); +#endif +#ifdef __AVX512F__ +INTRIAVX2_FLOAT(jit::avx512f, kEQ8); +INTRIAVX2_FLOAT(jit::avx512f, kGT8LT16); +INTRIAVX512_FLOAT(kEQ16); +INTRIAVX512_FLOAT(kGT16); +#endif + +#undef INTRIAVX512_FLOAT +#undef INTRIAVX2_FLOAT +#undef INTRIAVX_FLOAT +#undef INIT_ALPHA +#undef UPDATE_ALPHA + +REGISTER_JITKERNEL_DEPRECATED(crf_decode, CRFDecodeKernel); + +} // namespace jitkernel +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/math/jit_kernel_exp.cc b/paddle/fluid/operators/math/jit_kernel_exp.cc index b62e130c43743f542e2074868fc01598047d6b19..686f3dd9836cb9192088771753065c6add639620 100644 --- a/paddle/fluid/operators/math/jit_kernel_exp.cc +++ b/paddle/fluid/operators/math/jit_kernel_exp.cc @@ -13,387 +13,224 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/math/jit_kernel.h" -#include // for exp #include #include "paddle/fluid/operators/math/jit_kernel_macro.h" -#ifdef PADDLE_WITH_MKLML -#include "paddle/fluid/platform/dynload/mklml.h" +#include "paddle/fluid/operators/math/jit_kernel_refer.h" + +#ifdef PADDLE_WITH_XBYAK +#include "paddle/fluid/operators/math/jit_code.h" #endif -#ifdef __AVX__ -#include +#ifdef PADDLE_WITH_MKLML +#include "paddle/fluid/platform/dynload/mklml.h" #endif namespace paddle { namespace operators { namespace math { - -#ifdef __AVX__ -namespace detail { -__m256 Exp(__m256 a); -} // namespace detail -#endif - namespace jitkernel { namespace jit = platform::jit; +#ifdef PADDLE_WITH_MKLML +// try to use MKL to speedup +template +void VExpMKL(const T* x, T* y, int n); + +template <> +void VExpMKL(const float* x, float* y, int n) { + platform::dynload::vsExp(n, x, y); +} + +template <> +void VExpMKL(const double* x, double* y, int n) { + platform::dynload::vdExp(n, x, y); +} + +template +void VSigmoidMKL(const T* x, T* y, int n) { + const T min = SIGMOID_THRESHOLD_MIN; + const T max = SIGMOID_THRESHOLD_MAX; + for (int i = 0; i < n; ++i) { + y[i] = (x[i] < min) ? min : ((x[i] > max) ? max : x[i]); + y[i] = static_cast(0) - y[i]; + } + VExpMKL(y, y, n); + for (int i = 0; i < n; ++i) { + y[i] = static_cast(1) / (static_cast(1) + y[i]); + } +} + +template +void VTanhMKL(const T* x, T* y, int n) { + for (int i = 0; i < n; ++i) { + y[i] = static_cast(2) * x[i]; + } + VSigmoidMKL(y, y, n); + for (int i = 0; i < n; ++i) { + y[i] = static_cast(2) * y[i] - static_cast(1); + } +} +#endif + /* VExp JitKernel */ -template +template class VExpKernelImpl : public VExpKernel { public: - explicit VExpKernelImpl(int d) : VExpKernel() { this->num_ = d; } - void Compute(const T* x, T* y) const override { - for (int i = 0; i < this->num_; ++i) { - y[i] = std::exp(x[i]); + JITKERNEL_DECLARE_STATIC_FUNC; + explicit VExpKernelImpl(int d) : VExpKernel() { +#ifdef PADDLE_WITH_XBYAK + if (useJIT(d)) { + size_t sz = 96 + d / YMM_FLOAT_BLOCK * 70 * 8; + jitcode_.reset(new gen::VActJitCode(d, gen::operand_type::exp, + sz > 4096 ? 
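/* Sizing note (editorial; an interpretation of the expression above):
   sz budgets ~96 bytes of prologue/epilogue plus roughly 70 instructions of
   ~8 bytes each per YMM block of 8 floats, clamped up to xbyak's 4 KB
   default buffer:
     d = 8   -> 96 + 1  * 70 * 8 = 656    -> allocate 4096
     d = 256 -> 96 + 32 * 70 * 8 = 18016  -> allocate 18016 */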
sz : 4096)); + this->Compute = jitcode_->getCode(); + return; } - } -}; - +#endif #ifdef PADDLE_WITH_MKLML -#define MKL_FLOAT(isa, block) \ - template <> \ - void VExpKernelImpl::Compute(const float* x, float* y) \ - const { \ - platform::dynload::vsExp(this->num_, x, y); \ - } - -#define MKL_DOUBLE(isa, block) \ - template <> \ - void VExpKernelImpl::Compute(const double* x, double* y) \ - const { \ - platform::dynload::vdExp(this->num_, x, y); \ - } -FOR_EACH_ISA(MKL_FLOAT, kLT8); -FOR_EACH_ISA(MKL_FLOAT, kGT8LT16); -FOR_EACH_ISA(MKL_FLOAT, kGT16); -FOR_EACH_ISA_BLOCK(MKL_DOUBLE); + if (useMKL(d)) { + this->Compute = VExpMKL; + return; + } #endif - -#define INTRI8_FLOAT(isa) \ - template <> \ - void VExpKernelImpl::Compute(const float* x, float* y) \ - const { \ - __m256 tmp = _mm256_loadu_ps(x); \ - _mm256_storeu_ps(y, detail::Exp(tmp)); \ + this->Compute = refer::VExp; } -#define INTRI16_FLOAT(isa) \ - template <> \ - void VExpKernelImpl::Compute(const float* x, float* y) \ - const { \ - __m256 tmp0 = _mm256_loadu_ps(x); \ - __m256 tmp1 = _mm256_loadu_ps(x + 8); \ - tmp0 = detail::Exp(tmp0); \ - tmp1 = detail::Exp(tmp1); \ - _mm256_storeu_ps(y, tmp0); \ - _mm256_storeu_ps(y + 8, tmp1); \ - } +#ifdef PADDLE_WITH_XBYAK -#ifdef __AVX__ -INTRI8_FLOAT(jit::avx); -INTRI16_FLOAT(jit::avx); -#endif -#ifdef __AVX2__ -INTRI8_FLOAT(jit::avx2); -INTRI16_FLOAT(jit::avx2); + private: + std::unique_ptr jitcode_{nullptr}; #endif -#ifdef __AVX512F__ -INTRI8_FLOAT(jit::avx512f); -INTRI16_FLOAT(jit::avx512f); +}; + +#ifdef PADDLE_WITH_XBYAK +template <> +bool VExpKernelImpl::useJIT(int d) { + return gen::VActJitCode::init(d, gen::operand_type::exp); +} #endif -// TODO(TJ): eq16 test and complete avx512 -#undef INTRI8_FLOAT -#undef INTRI16_FLOAT -#undef MKL_FLOAT -#undef MKL_DOUBLE +#ifdef PADDLE_WITH_MKLML +template <> +bool VExpKernelImpl::useMKL(int d) { + return d > 512; +} + +template <> +bool VExpKernelImpl::useMKL(int d) { + return true; +} -REGISTER_JITKERNEL(vexp, VExpKernel); +#endif /* VSigmoid JitKernel */ -template +template class VSigmoidKernelImpl : public VSigmoidKernel { public: + JITKERNEL_DECLARE_STATIC_FUNC; explicit VSigmoidKernelImpl(int d) : VSigmoidKernel() { - this->num_ = d; - vexp_ = KernelPool::Instance().template Get>(d); - } - void Compute(const T* x, T* y) const override { - const T min = SIGMOID_THRESHOLD_MIN; - const T max = SIGMOID_THRESHOLD_MAX; - for (int i = 0; i < this->num_; ++i) { - y[i] = (x[i] < min) ? min : ((x[i] > max) ? max : x[i]); - y[i] = static_cast(0) - y[i]; +#ifdef PADDLE_WITH_XBYAK + if (useJIT(d)) { + size_t sz = 96 + d / YMM_FLOAT_BLOCK * 82 * 8; + jitcode_.reset(new gen::VActJitCode(d, gen::operand_type::sigmoid, + sz > 4096 ? 
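/* Editorial note: the constructor tries JIT first, then MKL, then refer.
   For float, useMKL(d) is d > 512 (per the specializations below), so MKL
   only takes over for long vectors; double always goes to MKL because the
   xbyak path emits float code only ("jit code do not support double yet"). */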
sz : 4096)); + this->Compute = jitcode_->getCode(); + return; } - vexp_->Compute(y, y); - for (int i = 0; i < this->num_; ++i) { - y[i] = static_cast(1) / (static_cast(1) + y[i]); +#endif + +#ifdef PADDLE_WITH_MKLML + // strictly it's a better impl with MKL, then is refer + if (useMKL(d)) { + this->Compute = VSigmoidMKL; + return; } +#endif + this->Compute = refer::VSigmoid; } +#ifdef PADDLE_WITH_XBYAK + private: - std::shared_ptr> vexp_; + std::unique_ptr jitcode_{nullptr}; +#endif }; -#define INTRI_SIGMOID(tmp, min, max) \ - tmp = _mm256_max_ps(tmp, min); \ - tmp = _mm256_min_ps(tmp, max); \ - tmp = _mm256_sub_ps(_mm256_set1_ps(0.0f), tmp); \ - tmp = detail::Exp(tmp); \ - tmp = _mm256_add_ps(_mm256_set1_ps(1.0f), tmp); \ - tmp = _mm256_div_ps(_mm256_set1_ps(1.0f), tmp) - -#define INTRI8_FLOAT(isa) \ - template <> \ - void VSigmoidKernelImpl::Compute(const float* x, float* y) \ - const { \ - __m256 max = _mm256_set1_ps(SIGMOID_THRESHOLD_MAX); \ - __m256 min = _mm256_set1_ps(SIGMOID_THRESHOLD_MIN); \ - __m256 tmp = _mm256_loadu_ps(x); \ - INTRI_SIGMOID(tmp, min, max); \ - _mm256_storeu_ps(y, tmp); \ - } - -#define INTRI16_FLOAT(isa) \ - template <> \ - void VSigmoidKernelImpl::Compute(const float* x, \ - float* y) const { \ - __m256 max = _mm256_set1_ps(SIGMOID_THRESHOLD_MAX); \ - __m256 min = _mm256_set1_ps(SIGMOID_THRESHOLD_MIN); \ - __m256 tmp0 = _mm256_loadu_ps(x); \ - __m256 tmp1 = _mm256_loadu_ps(x + 8); \ - INTRI_SIGMOID(tmp0, min, max); \ - INTRI_SIGMOID(tmp1, min, max); \ - _mm256_storeu_ps(y, tmp0); \ - _mm256_storeu_ps(y + 8, tmp1); \ - } - -#define INTRI_GT8LT16_FLOAT(isa) \ - template <> \ - VSigmoidKernelImpl::VSigmoidKernelImpl(int d) \ - : VSigmoidKernel() { \ - this->num_ = d; \ - this->end_ = AVX_FLOAT_BLOCK; \ - this->rest_ = d - this->end_; \ - vexp_ = \ - KernelPool::Instance().template Get>(this->rest_); \ - } \ - template <> \ - void VSigmoidKernelImpl::Compute(const float* x, \ - float* y) const { \ - __m256 max = _mm256_set1_ps(SIGMOID_THRESHOLD_MAX); \ - __m256 min = _mm256_set1_ps(SIGMOID_THRESHOLD_MIN); \ - __m256 tmp = _mm256_loadu_ps(x); \ - INTRI_SIGMOID(tmp, min, max); \ - _mm256_storeu_ps(y, tmp); \ - const float min_ = SIGMOID_THRESHOLD_MIN; \ - const float max_ = SIGMOID_THRESHOLD_MAX; \ - for (int i = this->end_; i < this->num_; ++i) { \ - y[i] = (x[i] < min_) ? min_ : ((x[i] > max_) ? max_ : x[i]); \ - y[i] = 0.f - y[i]; \ - } \ - vexp_->Compute(y + this->end_, y + this->end_); \ - for (int i = this->end_; i < this->num_; ++i) { \ - y[i] = 1.f / (1.f + y[i]); \ - } \ - } - -#define INTRI_GT16_FLOAT(isa) \ - template <> \ - VSigmoidKernelImpl::VSigmoidKernelImpl(int d) \ - : VSigmoidKernel() { \ - this->num_ = d; \ - this->rest_ = d % AVX_FLOAT_BLOCK; \ - this->end_ = d - this->rest_; \ - vexp_ = \ - KernelPool::Instance().template Get>(this->rest_); \ - } \ - template <> \ - void VSigmoidKernelImpl::Compute(const float* x, \ - float* y) const { \ - __m256 max = _mm256_set1_ps(SIGMOID_THRESHOLD_MAX); \ - __m256 min = _mm256_set1_ps(SIGMOID_THRESHOLD_MIN); \ - for (int i = 0; i < this->end_; i += AVX_FLOAT_BLOCK) { \ - __m256 tmp = _mm256_loadu_ps(x + i); \ - INTRI_SIGMOID(tmp, min, max); \ - _mm256_storeu_ps(y + i, tmp); \ - } \ - const float min_ = SIGMOID_THRESHOLD_MIN; \ - const float max_ = SIGMOID_THRESHOLD_MAX; \ - for (int i = this->end_; i < this->num_; ++i) { \ - y[i] = (x[i] < min_) ? min_ : ((x[i] > max_) ? 
max_ : x[i]); \ - y[i] = 0.f - y[i]; \ - } \ - vexp_->Compute(y + this->end_, y + this->end_); \ - for (int i = this->end_; i < this->num_; ++i) { \ - y[i] = 1.f / (1.f + y[i]); \ - } \ - } - -#ifdef __AVX__ -INTRI8_FLOAT(jit::avx); -INTRI16_FLOAT(jit::avx); -INTRI_GT8LT16_FLOAT(jit::avx); -INTRI_GT16_FLOAT(jit::avx); -#endif -#ifdef __AVX2__ -INTRI8_FLOAT(jit::avx2); -INTRI16_FLOAT(jit::avx2); -// INTRI_GT8LT16_FLOAT(jit::avx2); -// INTRI_GT16_FLOAT(jit::avx2); +#ifdef PADDLE_WITH_XBYAK +template <> +bool VSigmoidKernelImpl::useJIT(int d) { + return gen::VActJitCode::init(d, gen::operand_type::sigmoid); +} #endif -#ifdef __AVX512F__ -INTRI8_FLOAT(jit::avx512f); -INTRI16_FLOAT(jit::avx512f); -// INTRI_GT8LT16_FLOAT(jit::avx512f); -// INTRI_GT16_FLOAT(jit::avx512f); -#endif - -#undef INTRI8_FLOAT -#undef INTRI16_FLOAT -#undef INTRI_GT8LT16_FLOAT -#undef INTRI_GT16_FLOAT -#undef INTRI_VSIGMOID -REGISTER_JITKERNEL(vsigmoid, VSigmoidKernel); +#ifdef PADDLE_WITH_MKLML +template <> +bool VSigmoidKernelImpl::useMKL(int d) { + return d > 512; +} + +template <> +bool VSigmoidKernelImpl::useMKL(int d) { + return true; +} +#endif /* VTanh JitKernel */ -template +template class VTanhKernelImpl : public VTanhKernel { public: + JITKERNEL_DECLARE_STATIC_FUNC; explicit VTanhKernelImpl(int d) : VTanhKernel() { - this->num_ = d; - vscal_ = KernelPool::Instance().template Get>(d); - vsigmoid_ = KernelPool::Instance().template Get>(d); - vaddbias_ = KernelPool::Instance().template Get>(d); - } - void Compute(const T* x, T* y) const override { - vscal_->Compute(static_cast(2), x, y); - vsigmoid_->Compute(y, y); - vscal_->Compute(static_cast(2), y); - vaddbias_->Compute(static_cast(-1), y, y); - } - - private: - std::shared_ptr> vscal_; - std::shared_ptr> vsigmoid_; - std::shared_ptr> vaddbias_; -}; - -#define INTRI_VTANH(tmp) \ - tmp = _mm256_mul_ps(_mm256_set1_ps(-2.0f), tmp); \ - tmp = _mm256_min_ps(tmp, _mm256_set1_ps(EXP_MAX_INPUT)); \ - tmp = detail::Exp(tmp); \ - tmp = _mm256_add_ps(_mm256_set1_ps(1.0f), tmp); \ - tmp = _mm256_div_ps(_mm256_set1_ps(2.0f), tmp); \ - tmp = _mm256_sub_ps(tmp, _mm256_set1_ps(1.0f)) - -#define INTRI8_FLOAT(isa) \ - template <> \ - void VTanhKernelImpl::Compute(const float* x, float* y) \ - const { \ - __m256 tmp = _mm256_loadu_ps(x); \ - INTRI_VTANH(tmp); \ - _mm256_storeu_ps(y, tmp); \ - } - -#define INTRI16_FLOAT(isa) \ - template <> \ - void VTanhKernelImpl::Compute(const float* x, float* y) \ - const { \ - __m256 tmp0 = _mm256_loadu_ps(x); \ - __m256 tmp1 = _mm256_loadu_ps(x + 8); \ - INTRI_VTANH(tmp0); \ - INTRI_VTANH(tmp1); \ - _mm256_storeu_ps(y, tmp0); \ - _mm256_storeu_ps(y + 8, tmp1); \ - } +#ifdef PADDLE_WITH_XBYAK + if (useJIT(d)) { + size_t sz = 96 + d / YMM_FLOAT_BLOCK * 84 * 8; + jitcode_.reset(new gen::VActJitCode(d, gen::operand_type::tanh, + sz > 4096 ? 
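/* Editorial derivation of the identity VTanhMKL and refer::VTanh rely on:
     tanh(x) = (e^x - e^-x) / (e^x + e^-x)
             = (1 - e^{-2x}) / (1 + e^{-2x})    (multiply through by e^-x)
             = 2 / (1 + e^{-2x}) - 1
             = 2 * sigmoid(2x) - 1
   hence the scale-by-2, sigmoid, then scale-and-shift sequence. */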
sz : 4096)); + this->Compute = jitcode_->getCode(); + return; + } +#endif -#define INTRI_GT8LT16_FLOAT(isa) \ - template <> \ - VTanhKernelImpl::VTanhKernelImpl(int d) \ - : VTanhKernel() { \ - this->num_ = d; \ - this->end_ = AVX_FLOAT_BLOCK; \ - this->rest_ = d - this->end_; \ - vscal_ = \ - KernelPool::Instance().template Get>(this->rest_); \ - vsigmoid_ = KernelPool::Instance().template Get>( \ - this->rest_); \ - vaddbias_ = KernelPool::Instance().template Get>( \ - this->rest_); \ - } \ - template <> \ - void VTanhKernelImpl::Compute(const float* x, \ - float* y) const { \ - __m256 tmp = _mm256_loadu_ps(x); \ - INTRI_VTANH(tmp); \ - _mm256_storeu_ps(y, tmp); \ - x += AVX_FLOAT_BLOCK; \ - y += AVX_FLOAT_BLOCK; \ - vscal_->Compute(2.f, x, y); \ - vsigmoid_->Compute(y, y); \ - vscal_->Compute(2.f, y); \ - vaddbias_->Compute(-1.f, y, y); \ +#ifdef PADDLE_WITH_MKLML + // strictly it's a better impl with MKL, then is refer + if (useMKL(d)) { + this->Compute = VTanhMKL; + return; + } +#endif + this->Compute = refer::VTanh; } -#define INTRI_GT16_FLOAT(isa) \ - template <> \ - VTanhKernelImpl::VTanhKernelImpl(int d) \ - : VTanhKernel() { \ - this->num_ = d; \ - this->rest_ = d % AVX_FLOAT_BLOCK; \ - this->end_ = d - this->rest_; \ - vscal_ = \ - KernelPool::Instance().template Get>(this->rest_); \ - vsigmoid_ = KernelPool::Instance().template Get>( \ - this->rest_); \ - vaddbias_ = KernelPool::Instance().template Get>( \ - this->rest_); \ - } \ - template <> \ - void VTanhKernelImpl::Compute(const float* x, float* y) \ - const { \ - for (int i = 0; i < this->end_; i += AVX_FLOAT_BLOCK) { \ - __m256 tmp = _mm256_loadu_ps(x + i); \ - INTRI_VTANH(tmp); \ - _mm256_storeu_ps(y + i, tmp); \ - } \ - x += this->end_; \ - y += this->end_; \ - vscal_->Compute(2.f, x, y); \ - vsigmoid_->Compute(y, y); \ - vscal_->Compute(2.f, y); \ - vaddbias_->Compute(-1.f, y, y); \ - } +#ifdef PADDLE_WITH_XBYAK -#ifdef __AVX__ -INTRI8_FLOAT(jit::avx); -INTRI16_FLOAT(jit::avx); -INTRI_GT8LT16_FLOAT(jit::avx); -INTRI_GT16_FLOAT(jit::avx); -#endif -#ifdef __AVX2__ -INTRI8_FLOAT(jit::avx2); -INTRI16_FLOAT(jit::avx2); -// maybe use avx at gt8lt16 and gt16 + private: + std::unique_ptr jitcode_{nullptr}; #endif -#ifdef __AVX512F__ -INTRI8_FLOAT(jit::avx512f); -INTRI16_FLOAT(jit::avx512f); -// maybe use avx at gt8lt16 and gt16 +}; + +#ifdef PADDLE_WITH_XBYAK +template <> +bool VTanhKernelImpl::useJIT(int d) { + return gen::VActJitCode::init(d, gen::operand_type::tanh); +} #endif -#undef INTRI8_FLOAT -#undef INTRI16_FLOAT -#undef INTRI_GT8LT16_FLOAT -#undef INTRI_GT16_FLOAT -#undef INTRI_VTANH +#ifdef PADDLE_WITH_MKLML +template <> +bool VTanhKernelImpl::useMKL(int d) { + return d > 512; +} + +template <> +bool VTanhKernelImpl::useMKL(int d) { + return true; +} +#endif +REGISTER_JITKERNEL(vexp, VExpKernel); +REGISTER_JITKERNEL(vsigmoid, VSigmoidKernel); REGISTER_JITKERNEL(vtanh, VTanhKernel); -#undef JITKERNEL_NEW_ACT_IMPL - } // namespace jitkernel } // namespace math } // namespace operators diff --git a/paddle/fluid/operators/math/jit_kernel_impl.h b/paddle/fluid/operators/math/jit_kernel_impl.h new file mode 100644 index 0000000000000000000000000000000000000000..ba5f20e53383d3cafab4239f1a2d911addf1ae23 --- /dev/null +++ b/paddle/fluid/operators/math/jit_kernel_impl.h @@ -0,0 +1,73 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddle { +namespace operators { +namespace math { +namespace jitkernel { + +#define SIGMOID_THRESHOLD_MIN -40.0 +#define SIGMOID_THRESHOLD_MAX 13.0 +#define EXP_MAX_INPUT 40.0 +#define XMM_FLOAT_BLOCK 4 +#define YMM_FLOAT_BLOCK 8 +#define ZMM_FLOAT_BLOCK 16 + +typedef struct { + void* gates; // gates: W_ch, W_ih, W_fh, W_oh + const void* ct_1; + void* ct; + void* ht; + /* weight_peephole and checked data are only used in peephole*/ + const void* wp{nullptr}; + void* checked{nullptr}; +} lstm_t; + +typedef struct { + void* gates; // gates: {W_update, W_reset; W_state} + const void* ht_1; + void* ht; +} gru_t; + +struct rnn_attr_s { + int d; + std::string act_gate, act_cand; + rnn_attr_s() = default; + rnn_attr_s(int _d, const std::string& _act_gate, const std::string& _act_cand) + : d(_d), act_gate(_act_gate), act_cand(_act_cand) {} +}; + +struct lstm_attr_s : public rnn_attr_s { + bool use_peephole; + std::string act_cell; + lstm_attr_s() = default; + lstm_attr_s(int _d, const std::string& _act_gate, + const std::string& _act_cand, const std::string& _act_cell, + bool _use_peephole = false) + : rnn_attr_s(_d, _act_gate, _act_cand), + use_peephole(_use_peephole), + act_cell(_act_cell) {} +}; + +typedef struct rnn_attr_s gru_attr_t; +typedef struct lstm_attr_s lstm_attr_t; + +} // namespace jitkernel +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/math/jit_kernel_layer_norm.cc b/paddle/fluid/operators/math/jit_kernel_layer_norm.cc new file mode 100644 index 0000000000000000000000000000000000000000..49904e6e8c7cd346bcbfb67c3a7574118b36e058 --- /dev/null +++ b/paddle/fluid/operators/math/jit_kernel_layer_norm.cc @@ -0,0 +1,241 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ +#include "paddle/fluid/operators/math/jit_kernel.h" +#include +#include +#include +#include "paddle/fluid/operators/math/jit_kernel_macro.h" +#ifdef __AVX__ +#include +#endif + +namespace paddle { +namespace operators { +namespace math { +namespace jitkernel { + +namespace jit = platform::jit; + +/* Layer Norm JitKernel */ +template +class LayerNormKernelImpl : public LayerNormKernel { + public: + explicit LayerNormKernelImpl(int right) : LayerNormKernel() { + this->num_ = right; + } + + void Compute(T* x, T* out, T* mean, T* var, const T* scale, const T* bias, + int height, const float epsilon) const override { + // get mean + for (int i = 0; i < height; i++) { + T sum = 0.0; + int offset = i * this->num_; + for (int j = 0; j < this->num_; j++) { + sum += x[offset + j]; + } + mean[i] = sum / this->num_; + } + + // get variance + for (int i = 0; i < height; i++) { + T sum = 0.0; + int offset = i * this->num_; + for (int j = 0; j < this->num_; j++) { + sum += (x[offset + j] - mean[i]) * (x[offset + j] - mean[i]); + } + var[i] = sum / this->num_; + } + + for (int i = 0; i < height; i++) { + int offset = i * this->num_; + T sqrt_var = sqrt(var[i] + (T)epsilon); + for (int j = 0; j < this->num_; j++) { + out[offset + j] = (x[offset + j] - mean[i]) / sqrt_var; + } + } + if (scale) { + for (int i = 0; i < height; i++) { + int offset = i * this->num_; + for (int j = 0; j < this->num_; j++) { + out[offset + j] *= scale[j]; + } + } + } + + if (bias) { + for (int i = 0; i < height; i++) { + int offset = i * this->num_; + for (int j = 0; j < this->num_; j++) { + out[offset + j] += bias[j]; + } + } + } + } +}; + +#define INTRIAVX_FLOAT(isa, block) \ + template <> \ + LayerNormKernelImpl::LayerNormKernelImpl(int right) \ + : LayerNormKernel() { \ + this->num_ = right; \ + this->rest_ = this->num_ % YMM_FLOAT_BLOCK; \ + this->end_ = this->num_ - this->rest_; \ + } \ + template <> \ + void LayerNormKernelImpl::Compute( \ + float* x, float* out, float* mean, float* var, const float* scale, \ + const float* bias, int height, const float epsilon) const { \ + __m256 sum; \ + __m256 mean_vec, var_vec; \ + __m128 hi, lo; \ + __m256 tmp; \ + size_t offset; \ + size_t j; \ + __m256 reverse_num_vec = \ + _mm256_div_ps(_mm256_set1_ps(1.0), _mm256_set1_ps(this->num_)); \ + __m256 epsilon_vec = _mm256_set1_ps(epsilon); \ + int rest_mask = \ + ((-1) & (~((~0U) >> (sizeof(int) * 8 - (YMM_FLOAT_BLOCK - rest_))))) & \ + 0x0ff; \ + __m256i mask_vec = _mm256_set_epi32( \ + rest_mask & 0x80 ? 0xffffffff : 0, rest_mask & 0x40 ? 0xffffffff : 0, \ + rest_mask & 0x20 ? 0xffffffff : 0, rest_mask & 0x10 ? 0xffffffff : 0, \ + rest_mask & 0x8 ? 0xffffffff : 0, rest_mask & 0x4 ? 0xffffffff : 0, \ + rest_mask & 0x2 ? 0xffffffff : 0, rest_mask & 0x1 ? 
0xffffffff : 0); \ + \ + for (int i = 0; i < height; ++i) { \ + offset = i * this->num_; \ + \ + /* get mean */ \ + sum = _mm256_setzero_ps(); \ + for (j = offset; j < end_ + offset; j += block) { \ + sum = _mm256_add_ps(sum, _mm256_loadu_ps((const float*)x + j)); \ + } \ + if (rest_ != 0) { \ + j = offset + this->num_ - block; \ + tmp = _mm256_loadu_ps((const float*)x + j); \ + tmp = _mm256_blendv_ps(_mm256_setzero_ps(), tmp, (__m256)mask_vec); \ + sum = _mm256_add_ps(sum, tmp); \ + } \ + hi = _mm256_extractf128_ps(sum, 1); \ + lo = _mm256_extractf128_ps(sum, 0); \ + sum = _mm256_add_ps( \ + sum, _mm256_insertf128_ps( \ + _mm256_insertf128_ps(_mm256_setzero_ps(), hi, 0), lo, 1)); \ + sum = _mm256_hadd_ps(sum, sum); \ + sum = _mm256_hadd_ps(sum, sum); \ + mean_vec = _mm256_mul_ps(sum, reverse_num_vec); \ + mean[i] = *reinterpret_cast(&mean_vec); \ + \ + /* get variance */ \ + sum = _mm256_setzero_ps(); \ + for (j = offset; j < end_ + offset; j += block) { \ + tmp = _mm256_sub_ps(_mm256_loadu_ps((const float*)x + j), mean_vec); \ + tmp = _mm256_mul_ps(tmp, tmp); \ + sum = _mm256_add_ps(sum, tmp); \ + } \ + if (rest_ != 0) { \ + j = offset + this->num_ - block; \ + tmp = _mm256_sub_ps(_mm256_loadu_ps((const float*)x + j), mean_vec); \ + tmp = _mm256_mul_ps(tmp, tmp); \ + tmp = _mm256_blendv_ps(_mm256_setzero_ps(), tmp, (__m256)mask_vec); \ + sum = _mm256_add_ps(sum, tmp); \ + } \ + hi = _mm256_extractf128_ps(sum, 1); \ + lo = _mm256_extractf128_ps(sum, 0); \ + sum = _mm256_add_ps( \ + sum, _mm256_insertf128_ps( \ + _mm256_insertf128_ps(_mm256_setzero_ps(), hi, 0), lo, 1)); \ + sum = _mm256_hadd_ps(sum, sum); \ + sum = _mm256_hadd_ps(sum, sum); \ + var_vec = _mm256_mul_ps(sum, reverse_num_vec); \ + var[i] = *reinterpret_cast(&var_vec); \ + \ + /* get x_norm and calculate output*/ \ + for (j = offset; j < end_ + offset; j += block) { \ + tmp = _mm256_sub_ps(_mm256_loadu_ps((const float*)x + j), mean_vec); \ + tmp = _mm256_div_ps( \ + tmp, _mm256_sqrt_ps(_mm256_add_ps(var_vec, epsilon_vec))); \ + _mm256_storeu_ps(reinterpret_cast(out) + j, tmp); \ + } \ + if (rest_ != 0) { \ + j = offset + num_ - block; \ + tmp = _mm256_sub_ps(_mm256_loadu_ps((const float*)x + j), mean_vec); \ + tmp = _mm256_div_ps( \ + tmp, _mm256_sqrt_ps(_mm256_add_ps(var_vec, epsilon_vec))); \ + _mm256_storeu_ps(reinterpret_cast(out) + j, tmp); \ + } \ + \ + if (scale) { \ + if (rest_ != 0) { \ + j = offset + this->num_ - block; \ + tmp = _mm256_loadu_ps((const float*)out + j); \ + } \ + for (j = offset; j < end_ + offset; j += block) { \ + _mm256_storeu_ps( \ + reinterpret_cast(out) + j, \ + _mm256_mul_ps( \ + _mm256_loadu_ps((const float*)out + j), \ + _mm256_loadu_ps((const float*)scale + j - offset))); \ + } \ + if (rest_ != 0) { \ + j = offset + this->num_ - block; \ + _mm256_storeu_ps( \ + reinterpret_cast(out) + j, \ + _mm256_mul_ps( \ + tmp, _mm256_loadu_ps((const float*)scale + j - offset))); \ + } \ + } \ + \ + if (bias) { \ + if (rest_ != 0) { \ + j = offset + this->num_ - block; \ + tmp = _mm256_loadu_ps((const float*)out + j); \ + } \ + for (j = offset; j < end_ + offset; j += block) { \ + _mm256_storeu_ps( \ + reinterpret_cast(out) + j, \ + _mm256_add_ps( \ + _mm256_loadu_ps((const float*)out + j), \ + _mm256_loadu_ps((const float*)bias + j - offset))); \ + } \ + if (rest_ != 0) { \ + j = offset + this->num_ - block; \ + _mm256_storeu_ps( \ + reinterpret_cast(out) + j, \ + _mm256_add_ps( \ + tmp, _mm256_loadu_ps((const float*)bias + j - offset))); \ + } \ + } \ + } \ + } + +#ifdef __AVX__ 
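The tail of each row is handled by re-reading the last full YMM block at offset `num_ - YMM_FLOAT_BLOCK` and blending with a mask whose high `rest_` lanes are live, so exactly the leftover elements contribute. A worked evaluation of the `rest_mask` expression (editorial, assuming `YMM_FLOAT_BLOCK == 8`):

// rest_ = 3:
//   YMM_FLOAT_BLOCK - rest_  = 5
//   (~0U) >> (32 - 5)        = 0x0000001F
//   ~0x0000001F              = 0xFFFFFFE0
//   ... & (-1) & 0x0ff       = 0xE0        // bits 0x80, 0x40, 0x20 set
// _mm256_set_epi32 then enables lanes 7, 6 and 5 -- precisely the 3
// elements the full-block loop did not cover, since the reload starts
// at num_ - 8.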
+INTRIAVX_FLOAT(jit::avx, kEQ8); +INTRIAVX_FLOAT(jit::avx, kGT8LT16); +INTRIAVX_FLOAT(jit::avx, kEQ16); +INTRIAVX_FLOAT(jit::avx, kGT16); +#endif +#ifdef __AVX2__ +INTRIAVX_FLOAT(jit::avx2, kEQ8); +INTRIAVX_FLOAT(jit::avx2, kGT8LT16); +INTRIAVX_FLOAT(jit::avx2, kEQ16); +INTRIAVX_FLOAT(jit::avx2, kGT16); +#endif + +#undef INTRIAVX_FLOAT + +REGISTER_JITKERNEL_DEPRECATED(layer_norm, LayerNormKernel); + +} // namespace jitkernel +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/math/jit_kernel_lstm.cc b/paddle/fluid/operators/math/jit_kernel_lstm.cc deleted file mode 100644 index 42a2b96fd945c516f8c26ca51ecb452345a9a86f..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/math/jit_kernel_lstm.cc +++ /dev/null @@ -1,308 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/math/jit_kernel.h" -#include -#include "paddle/fluid/operators/math/jit_kernel_macro.h" -#include "paddle/fluid/platform/enforce.h" -#include "paddle/fluid/platform/macros.h" - -#ifdef __AVX__ -#include -#endif - -namespace paddle { -namespace operators { -namespace math { -#ifdef __AVX__ -namespace detail { -__m256 Exp(__m256 a); -} // namespace detail -#endif - -namespace jitkernel { -namespace jit = platform::jit; - -#ifdef __AVX__ -typedef enum { kSigmoid, kRelu, kTanh, kIdentity } act_type; - -class AVXAct { - public: - virtual ~AVXAct() = default; - virtual __m256 Compute(__m256 x) const = 0; -}; - -template -class AVXActImpl : public AVXAct { - public: - __m256 Compute(__m256 x) const override { PADDLE_THROW("Unkown type!"); } -}; - -template <> -__m256 AVXActImpl::Compute(__m256 x) const { - __m256 ones = _mm256_set1_ps(1.0f); - x = _mm256_max_ps(x, _mm256_set1_ps(SIGMOID_THRESHOLD_MIN)); - x = _mm256_min_ps(x, _mm256_set1_ps(SIGMOID_THRESHOLD_MAX)); - x = _mm256_sub_ps(_mm256_set1_ps(0.0f), x); - x = detail::Exp(x); - x = _mm256_add_ps(ones, x); - return _mm256_div_ps(ones, x); -} - -template <> -__m256 AVXActImpl::Compute(__m256 x) const { - __m256 ones = _mm256_set1_ps(1.0f); - x = _mm256_mul_ps(_mm256_set1_ps(-2.0f), x); - x = _mm256_min_ps(x, _mm256_set1_ps(EXP_MAX_INPUT)); - x = detail::Exp(x); - x = _mm256_add_ps(ones, x); - x = _mm256_div_ps(_mm256_set1_ps(2.0f), x); - return _mm256_sub_ps(x, ones); -} - -template <> -__m256 AVXActImpl::Compute(__m256 x) const { - return _mm256_max_ps(x, _mm256_setzero_ps()); -} - -template <> -__m256 AVXActImpl::Compute(__m256 x) const { - return x; -} -#endif - -template -static std::shared_ptr> GetActKernel( - const std::string& type, int n) { - if (type == "sigmoid") { - return std::dynamic_pointer_cast>( - KernelPool::Instance().template Get>(n)); - } else if (type == "relu") { - return std::dynamic_pointer_cast>( - KernelPool::Instance().template Get>(n)); - } else if (type == "tanh") { - return std::dynamic_pointer_cast>( - KernelPool::Instance().template Get>(n)); - } else if (type == "identity" || type == "") { - 
return std::dynamic_pointer_cast>( - KernelPool::Instance().template Get>(n)); - } - PADDLE_THROW("Not support type: %s", type); - return nullptr; -} - -/* LSTM JitKernel */ -template -class LSTMKernelImpl : public LSTMKernel { - public: - explicit LSTMKernelImpl(const std::string& act_gate, - const std::string& act_cand, - const std::string& act_cell, int d) - : LSTMKernel() { - d_ = d; - d2_ = d * 2; - d3_ = d * 3; - act_gate_d3_ = GetActKernel(act_gate, d3_); - act_gate_d_ = GetActKernel(act_gate, d); - act_cand_d_ = GetActKernel(act_cand, d); - act_cell_d_ = GetActKernel(act_cell, d); - vmul_d_ = KernelPool::Instance().template Get>(d); - vadd_d_ = KernelPool::Instance().template Get>(d); -#ifdef __AVX__ - auto GetAVXAct = [&](const std::string& type) -> std::unique_ptr { - if (type == "sigmoid") { - return std::unique_ptr(new AVXActImpl()); - } else if (type == "relu") { - return std::unique_ptr(new AVXActImpl()); - } else if (type == "tanh") { - return std::unique_ptr(new AVXActImpl()); - } else if (type == "identity" || type == "") { - return std::unique_ptr(new AVXActImpl()); - } - PADDLE_THROW("Not support type: %s", type); - }; - avx_act_gate_ = GetAVXAct(act_gate); - avx_act_cand_ = GetAVXAct(act_cand); - avx_act_cell_ = GetAVXAct(act_cell); -#endif - } - - void ComputeCtHt(T* gates, const T* ct_1, T* ct, T* ht, const T* wp_data, - T* checked) const override { - // gates: W_ch, W_ih, W_fh, W_oh - act_gate_d3_->Compute(gates + d_, gates + d_); - - /* C_t = C_t-1 * fgated + cand_gated * igated */ - act_cand_d_->Compute(gates, gates); - vmul_d_->Compute(gates, gates + d_, gates + d_); - vmul_d_->Compute(ct_1, gates + d2_, gates + d2_); - vadd_d_->Compute(gates + d_, gates + d2_, ct); - - /* H_t = act_cell(C_t) * ogated */ - act_cell_d_->Compute(ct, gates + d2_); - vmul_d_->Compute(gates + d2_, gates + d3_, ht); - } - void ComputeC1H1(T* gates, T* ct, T* ht, const T* wp_data) const override { - /* C_t = igated * cgated*/ - act_gate_d_->Compute(gates + d_, gates + d_); - act_cand_d_->Compute(gates, gates); - vmul_d_->Compute(gates, gates + d_, ct); - /* H_t = act_cell(C_t) * ogated */ - act_gate_d_->Compute(gates + d3_, gates + d3_); - act_cell_d_->Compute(ct, gates + d2_); - vmul_d_->Compute(gates + d2_, gates + d3_, ht); - } - - private: - int d_, d2_, d3_; - std::shared_ptr> act_gate_d3_, act_gate_d_, act_cand_d_, - act_cell_d_; - std::shared_ptr> vmul_d_; - std::shared_ptr> vadd_d_; -#ifdef __AVX__ - std::unique_ptr avx_act_gate_, avx_act_cand_, avx_act_cell_; -#endif -}; - -#define INTRI8_FLOAT(isa) \ - template <> \ - void LSTMKernelImpl::ComputeCtHt( \ - float* gates, const float* ct_1, float* ct, float* ht, \ - const float* wp_data, float* checked) const { \ - /* gates: W_ch, W_ih, W_fh, W_oh */ \ - __m256 c, i, f, o; \ - c = _mm256_loadu_ps(gates); \ - i = _mm256_loadu_ps(gates + 8); \ - f = _mm256_loadu_ps(gates + 16); \ - o = _mm256_loadu_ps(gates + 24); \ - /* C_t = C_t-1 * fgated + cand_gated * igated*/ \ - c = _mm256_mul_ps(avx_act_cand_->Compute(c), avx_act_gate_->Compute(i)); \ - i = _mm256_loadu_ps(ct_1); \ - f = _mm256_mul_ps(i, avx_act_gate_->Compute(f)); \ - f = _mm256_add_ps(c, f); \ - _mm256_storeu_ps(ct, f); \ - /* H_t = act_cell(C_t) * ogated */ \ - o = _mm256_mul_ps(avx_act_cell_->Compute(f), avx_act_gate_->Compute(o)); \ - _mm256_storeu_ps(ht, o); \ - } - -// TODO(TJ): optimize keq16 - -#ifdef __AVX__ -INTRI8_FLOAT(jit::avx); -#endif -#ifdef __AVX2__ -INTRI8_FLOAT(jit::avx2); -#endif -#ifdef __AVX512F__ -INTRI8_FLOAT(jit::avx512f); -#endif - -/* Peephole 
JitKernel */ -template -class PeepholeKernelImpl : public LSTMKernel { - public: - explicit PeepholeKernelImpl(const std::string& act_gate, - const std::string& act_cand, - const std::string& act_cell, int d) - : LSTMKernel() { - d_ = d; - d2_ = d * 2; - d3_ = d * 3; - act_gate_d_ = GetActKernel(act_gate, d); - act_cand_d_ = GetActKernel(act_cand, d); - act_cell_d_ = GetActKernel(act_cell, d); - vmul_d_ = KernelPool::Instance().template Get>(d); - vadd_d_ = KernelPool::Instance().template Get>(d); - vadd_d2_ = KernelPool::Instance().template Get>(d2_); - act_gate_d2_ = GetActKernel(act_gate, d2_); - } - - void ComputeCtHt(T* gates, const T* ct_1, T* ct, T* ht, const T* wp_data, - T* checked) const override { - /* get fgated and igated*/ - vmul_d_->Compute(wp_data, ct_1, checked); - vmul_d_->Compute(wp_data + d_, ct_1, checked + d_); - vadd_d2_->Compute(checked, gates + d_, gates + d_); - act_gate_d2_->Compute(gates + d_, gates + d_); - /* C_t = C_t-1 * fgated + cand_gated * igated*/ - act_cand_d_->Compute(gates, gates); - vmul_d_->Compute(gates, gates + d_, gates + d_); - vmul_d_->Compute(ct_1, gates + d2_, gates + d2_); - vadd_d_->Compute(gates + d_, gates + d2_, ct); - /* get ogated*/ - vmul_d_->Compute(wp_data + d2_, ct, gates + d_); - vadd_d_->Compute(gates + d_, gates + d3_, gates + d3_); - act_gate_d_->Compute(gates + d3_, gates + d3_); - /* H_t = act_cell(C_t) * ogated */ - act_cell_d_->Compute(ct, gates + d2_); - vmul_d_->Compute(gates + d2_, gates + d3_, ht); - } - - void ComputeC1H1(T* gates, T* ct, T* ht, const T* wp_data) const override { - /* C_t = igated * cgated*/ - act_gate_d_->Compute(gates + d_, gates + d_); - act_cand_d_->Compute(gates, gates); - vmul_d_->Compute(gates, gates + d_, ct); - /* get outgated, put W_oc * C_t on igated */ - vmul_d_->Compute(wp_data + d2_, ct, gates + d_); - vadd_d_->Compute(gates + d_, gates + d3_, gates + d3_); - /* H_t = act_cell(C_t) * ogated */ - act_gate_d_->Compute(gates + d3_, gates + d3_); - act_cell_d_->Compute(ct, gates + d2_); - vmul_d_->Compute(gates + d2_, gates + d3_, ht); - } - - private: - int d_, d2_, d3_; - std::shared_ptr> act_gate_d2_, act_gate_d_, act_cand_d_, - act_cell_d_; - std::shared_ptr> vmul_d_; - std::shared_ptr> vadd_d_, vadd_d2_; -}; - -#define JITKERNEL_DECLARE_LSTM(ker_class, ker_dtype) \ - template <> \ - std::shared_ptr> \ - KernelPool::Get, const std::string&, \ - const std::string&, const std::string&, int, bool>( \ - const std::string& act_gate, const std::string& act_cand, \ - const std::string& act_cell, int d, bool use_peephole) - -#define JITKERNEL_KEY_LSTM(ker_key, dtype_key) \ - #ker_key #dtype_key + std::to_string(d) + act_gate + act_cand + act_cell + \ - (use_peephole ? 
"p" : "n") - -#define JITKERNEL_NEW_LSTM_IMPL(ker, dtype, isa, k) \ - if (use_peephole) { \ - p = std::dynamic_pointer_cast>( \ - std::make_shared>( \ - act_gate, act_cand, act_cell, d)); \ - } else { \ - p = std::dynamic_pointer_cast>( \ - std::make_shared>(act_gate, act_cand, \ - act_cell, d)); \ - } - -REGISTER_JITKERNEL_ARGS(lstm, LSTMKernel, JITKERNEL_DECLARE_LSTM, - JITKERNEL_KEY_LSTM, JITKERNEL_NEW_LSTM_IMPL); - -#undef INTRI8_FLOAT -#undef JITKERNEL_DECLARE_LSTM -#undef JITKERNEL_KEY_LSTM -#undef JITKERNEL_NEW_LSTM_IMPL -} // namespace jitkernel -} // namespace math -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/math/jit_kernel_macro.h b/paddle/fluid/operators/math/jit_kernel_macro.h index d8e55f2673560ff6afa34376b73275b57a8ceea1..5a3efd979f803d396a5084c199b1d71b88a77126 100644 --- a/paddle/fluid/operators/math/jit_kernel_macro.h +++ b/paddle/fluid/operators/math/jit_kernel_macro.h @@ -15,25 +15,96 @@ limitations under the License. */ #pragma once #include #include "paddle/fluid/platform/cpu_info.h" +#include "paddle/fluid/platform/enforce.h" namespace paddle { namespace operators { namespace math { namespace jitkernel { -namespace jit = platform::jit; +#define JITKERNEL_DECLARE_STATIC_FUNC \ + static inline std::string name(int d) { \ + PADDLE_THROW("DType should be either float or double"); \ + } \ + static inline bool useJIT(int d) { return false; } \ + static inline bool useMKL(int d) { return false; } + +#define JITKERNEL_DEFINE_NAME(ker_key, ker_class) \ + template <> \ + std::string ker_class##Impl::name(int d) { \ + std::string key(#ker_key "f"); \ + if (useJIT(d)) { \ + /* only jit code need record d*/ \ + return key + "jit" + std::to_string(d); \ + } else if (useMKL(d)) { \ + return key + "mkl"; \ + } else { \ + return key + "any"; \ + } \ + } \ + template <> \ + std::string ker_class##Impl::name(int d) { \ + std::string key(#ker_key "d"); \ + /* jit code do not support double yet*/ \ + if (useMKL(d)) { \ + return key + "mkl"; \ + } else { \ + return key + "any"; \ + } \ + } -#define SEARCH_BLOCK(macro_, ker, dtype, isa) \ - if (d < AVX_FLOAT_BLOCK) { \ - macro_(ker, dtype, isa, kLT8); \ - } else if (d == AVX_FLOAT_BLOCK) { \ - macro_(ker, dtype, isa, kEQ8); \ - } else if (d > AVX_FLOAT_BLOCK && d < AVX512_FLOAT_BLOCK) { \ - macro_(ker, dtype, isa, kGT8LT16); \ - } else if (d == AVX512_FLOAT_BLOCK) { \ - macro_(ker, dtype, isa, kEQ16); \ - } else { \ - macro_(ker, dtype, isa, kGT16); \ +#define JITKERNEL_DECLARE(ker_class, ker_dtype) \ + template <> \ + std::shared_ptr> \ + KernelPool::Get, int>(int d) + +#define JITKERNEL_FIND_KEY(ker_class, ker_dtype) \ + std::string key = ker_class##Impl::name(d) + +#define JITKERNEL_IMPL(ker_class, ker_dtype) \ + p = std::dynamic_pointer_cast>( \ + std::make_shared>(d)) + +#define REGISTER_JITKERNEL_WITH_DTYPE(ker_class, ker_dtype, marco_declare, \ + macro_find_key, macro_impl) \ + marco_declare(ker_class, ker_dtype) { \ + macro_find_key(ker_class, ker_dtype); \ + if (kers_.find(key) == kers_.end()) { \ + std::shared_ptr> p; \ + macro_impl(ker_class, ker_dtype); \ + kers_.insert({key, std::dynamic_pointer_cast(p)}); \ + return p; \ + } \ + return std::dynamic_pointer_cast>( \ + kers_.at(key)); \ + } + +#define REGISTER_JITKERNEL_ARGS(ker_key, ker_class, marco_define_name, \ + marco_declare, macro_find_key, macro_impl) \ + marco_define_name(ker_key, ker_class); \ + REGISTER_JITKERNEL_WITH_DTYPE(ker_class, float, marco_declare, \ + macro_find_key, macro_impl); \ + 
REGISTER_JITKERNEL_WITH_DTYPE(ker_class, double, marco_declare, \ + macro_find_key, macro_impl) + +#define REGISTER_JITKERNEL(ker_key, ker_class) \ + REGISTER_JITKERNEL_ARGS(ker_key, ker_class, JITKERNEL_DEFINE_NAME, \ + JITKERNEL_DECLARE, JITKERNEL_FIND_KEY, \ + JITKERNEL_IMPL) + +namespace jit = platform::jit; +// TODO(TJ): below defines are deprecated, would be remove recently +#define SEARCH_BLOCK(macro_, ker, dtype, isa) \ + if (d < YMM_FLOAT_BLOCK) { \ + macro_(ker, dtype, isa, kLT8); \ + } else if (d == YMM_FLOAT_BLOCK) { \ + macro_(ker, dtype, isa, kEQ8); \ + } else if (d > YMM_FLOAT_BLOCK && d < ZMM_FLOAT_BLOCK) { \ + macro_(ker, dtype, isa, kGT8LT16); \ + } else if (d == ZMM_FLOAT_BLOCK) { \ + macro_(ker, dtype, isa, kEQ16); \ + } else { \ + macro_(ker, dtype, isa, kGT16); \ } #define SEARCH_ISA_BLOCK(macro_, ker, dtype) \ @@ -47,44 +118,42 @@ namespace jit = platform::jit; SEARCH_BLOCK(macro_, ker, dtype, jit::isa_any); \ } -#define JITKERNEL_DECLARE(ker_class, ker_dtype) \ - template <> \ - std::shared_ptr> \ - KernelPool::Get, int>(int d) - #define JITKERNEL_KEY(ker_key, dtype_key) \ #ker_key #dtype_key + std::to_string(d) -#define JITKERNEL_NEW_IMPL(ker, dtype, isa, k) \ - p = std::dynamic_pointer_cast>( \ +#define JITKERNEL_NEW_IMPL_DEPRECATED(ker, dtype, isa, k) \ + p = std::dynamic_pointer_cast>( \ std::make_shared>(d)) -#define JITKERNEL_WITH_DTYPE(ker_key, ker_class, ker_dtype, dtype_key, \ - marco_declare, macro_key, macro_impl) \ - marco_declare(ker_class, ker_dtype) { \ - std::string key = macro_key(ker_key, dtype_key); \ - if (kers_.find(key) == kers_.end()) { \ - std::shared_ptr> p; \ - SEARCH_ISA_BLOCK(macro_impl, ker_class, ker_dtype); \ - kers_.insert({key, std::dynamic_pointer_cast(p)}); \ - return p; \ - } \ - return std::dynamic_pointer_cast>( \ - kers_.at(key)); \ +#define JITKERNEL_WITH_DTYPE_DEPRECATED(ker_key, ker_class, ker_dtype, \ + dtype_key, marco_declare, macro_key, \ + macro_impl) \ + marco_declare(ker_class, ker_dtype) { \ + std::string key = macro_key(ker_key, dtype_key); \ + if (kers_.find(key) == kers_.end()) { \ + std::shared_ptr> p; \ + SEARCH_ISA_BLOCK(macro_impl, ker_class, ker_dtype); \ + kers_.insert({key, std::dynamic_pointer_cast(p)}); \ + return p; \ + } \ + return std::dynamic_pointer_cast>( \ + kers_.at(key)); \ } -#define REGISTER_JITKERNEL(ker_key, ker_class) \ - JITKERNEL_WITH_DTYPE(ker_key, ker_class, float, f, JITKERNEL_DECLARE, \ - JITKERNEL_KEY, JITKERNEL_NEW_IMPL); \ - JITKERNEL_WITH_DTYPE(ker_key, ker_class, double, d, JITKERNEL_DECLARE, \ - JITKERNEL_KEY, JITKERNEL_NEW_IMPL) - -#define REGISTER_JITKERNEL_ARGS(ker_key, ker_class, marco_declare, macro_key, \ - macro_impl) \ - JITKERNEL_WITH_DTYPE(ker_key, ker_class, float, f, marco_declare, macro_key, \ - macro_impl); \ - JITKERNEL_WITH_DTYPE(ker_key, ker_class, double, d, marco_declare, \ - macro_key, macro_impl) +#define REGISTER_JITKERNEL_DEPRECATED(ker_key, ker_class) \ + JITKERNEL_WITH_DTYPE_DEPRECATED(ker_key, ker_class, float, f, \ + JITKERNEL_DECLARE, JITKERNEL_KEY, \ + JITKERNEL_NEW_IMPL_DEPRECATED); \ + JITKERNEL_WITH_DTYPE_DEPRECATED(ker_key, ker_class, double, d, \ + JITKERNEL_DECLARE, JITKERNEL_KEY, \ + JITKERNEL_NEW_IMPL_DEPRECATED) + +#define REGISTER_JITKERNEL_ARGS_DEPRECATED(ker_key, ker_class, marco_declare, \ + macro_key, macro_impl) \ + JITKERNEL_WITH_DTYPE_DEPRECATED(ker_key, ker_class, float, f, marco_declare, \ + macro_key, macro_impl); \ + JITKERNEL_WITH_DTYPE_DEPRECATED(ker_key, ker_class, double, d, \ + marco_declare, macro_key, macro_impl) #define 
FOR_EACH_ISA(macro_, block) \ macro_(jit::avx512f, block); \ diff --git a/paddle/fluid/operators/math/jit_kernel_refer.h b/paddle/fluid/operators/math/jit_kernel_refer.h new file mode 100644 index 0000000000000000000000000000000000000000..e0b2e3c7fada6b422318c68a42fd6d103c99af5a --- /dev/null +++ b/paddle/fluid/operators/math/jit_kernel_refer.h @@ -0,0 +1,238 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include +#include "paddle/fluid/operators/math/jit_kernel_impl.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace operators { +namespace math { +namespace jitkernel { +namespace refer { +/* Refer code only focus on correctness */ + +template +void VMul(const T* x, const T* y, T* z, int n) { + for (int i = 0; i < n; ++i) { + z[i] = x[i] * y[i]; + } +} + +template +void VAdd(const T* x, const T* y, T* z, int n) { + for (int i = 0; i < n; ++i) { + z[i] = x[i] + y[i]; + } +} + +template +void VAddRelu(const T* x, const T* y, T* z, int n) { + for (int i = 0; i < n; ++i) { + z[i] = x[i] + y[i]; + z[i] = z[i] > 0 ? z[i] : 0; + } +} + +template +void VScal(const T* a, const T* x, T* y, int n) { + for (int i = 0; i < n; ++i) { + y[i] = a[0] * x[i]; + } +} + +template +void VAddBias(const T* a, const T* x, T* y, int n) { + for (int i = 0; i < n; ++i) { + y[i] = a[0] + x[i]; + } +} + +template +void VRelu(const T* x, T* y, int n) { + for (int i = 0; i < n; ++i) { + y[i] = x[i] > 0 ? x[i] : 0; + } +} + +template +inline void VIdentity(const T* x, T* y, int n) {} + +template +void VExp(const T* x, T* y, int n) { + for (int i = 0; i < n; ++i) { + y[i] = std::exp(x[i]); + } +} + +template +void VSigmoid(const T* x, T* y, int n) { + // y = 1 / (1 + e^-x) + const T min = SIGMOID_THRESHOLD_MIN; + const T max = SIGMOID_THRESHOLD_MAX; + for (int i = 0; i < n; ++i) { + T tmp = (x[i] < min) ? min : ((x[i] > max) ? 
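/* Editorial note on the clip: with SIGMOID_THRESHOLD_MIN = -40.0 and
   SIGMOID_THRESHOLD_MAX = 13.0, the largest magnitude ever handed to
   std::exp is 40 (e^40 ~= 2.4e17, well inside fp32 range), so the
   intermediate never overflows, while sigmoid(-40) ~= 4e-18 and
   sigmoid(13) ~= 0.9999977 keep the clipping error negligible for
   fp32 activations. */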
max : x[i]); + y[i] = static_cast(1) / (static_cast(1) + std::exp(-tmp)); + } +} + +template +void VTanh(const T* x, T* y, int n) { + // y = 2 * sigmoid(2x) - 1 + for (int i = 0; i < n; ++i) { + y[i] = static_cast(2) * x[i]; + } + VSigmoid(y, y, n); + for (int i = 0; i < n; ++i) { + y[i] = static_cast(2) * y[i] - static_cast(1); + } +} + +template +void (*getActFunc(const std::string& type))(const T*, T*, int) { // NOLINT + if (type == "sigmoid") { + return VSigmoid; + } else if (type == "relu") { + return VRelu; + } else if (type == "tanh") { + return VTanh; + } else if (type == "identity" || type == "") { + return VIdentity; + } + PADDLE_THROW("Not support type: %s", type); + return nullptr; +} + +// compute ct and ht +template +void LSTMCtHt(lstm_t* step, const lstm_attr_t* attr) { + T* gates = reinterpret_cast(step->gates); + const T* ct_1 = reinterpret_cast(step->ct_1); + T* ct = reinterpret_cast(step->ct); + T* ht = reinterpret_cast(step->ht); + const T* wp = reinterpret_cast(step->wp); + T* checked = reinterpret_cast(step->checked); + auto act_gate = getActFunc(attr->act_gate); + auto act_cand = getActFunc(attr->act_cand); + auto act_cell = getActFunc(attr->act_cell); + int d = attr->d; + int d2 = d * 2; + int d3 = d * 3; + // gates: W_ch, W_ih, W_fh, W_oh + if (attr->use_peephole) { + VMul(wp, ct_1, checked, d); + VMul(wp + d, ct_1, checked + d, d); + VAdd(checked, gates + d, gates + d, d2); + act_gate(gates + d, gates + d, d2); + } else { + act_gate(gates + d, gates + d, d3); + } + + // C_t = C_t-1 * fgated + cand_gated * igated + act_cand(gates, gates, d); + VMul(gates, gates + d, gates + d, d); + VMul(ct_1, gates + d2, gates + d2, d); + VAdd(gates + d, gates + d2, ct, d); + + if (attr->use_peephole) { + // get ogated + VMul(wp + d2, ct, gates + d, d); + VAdd(gates + d, gates + d3, gates + d3, d); + act_gate(gates + d3, gates + d3, d); + } + // H_t = act_cell(C_t) * ogated + act_cell(ct, gates + d2, d); + VMul(gates + d2, gates + d3, ht, d); +} + +// compute c1 and h1 without c0 or h0 +template +void LSTMC1H1(lstm_t* step, const lstm_attr_t* attr) { + T* gates = reinterpret_cast(step->gates); + T* ct = reinterpret_cast(step->ct); + T* ht = reinterpret_cast(step->ht); + auto act_gate = getActFunc(attr->act_gate); + auto act_cand = getActFunc(attr->act_cand); + auto act_cell = getActFunc(attr->act_cell); + int d = attr->d; + int d2 = d * 2; + int d3 = d * 3; + /* C_t = igated * cgated*/ + act_gate(gates + d, gates + d, d); + act_cand(gates, gates, d); + VMul(gates, gates + d, ct, d); + if (attr->use_peephole) { + // get outgated, put W_oc * C_t on igated + const T* wp = reinterpret_cast(step->wp); + VMul(wp + d2, ct, gates + d, d); + VAdd(gates + d, gates + d3, gates + d3, d); + } + /* H_t = act_cell(C_t) * ogated */ + act_gate(gates + d3, gates + d3, d); + act_cell(ct, gates + d2, d); + VMul(gates + d2, gates + d3, ht, d); +} + +// compute h1 without h0 +template +void GRUH1(gru_t* step, const gru_attr_t* attr) { + T* gates = reinterpret_cast(step->gates); + T* ht = reinterpret_cast(step->ht); + auto act_gate = getActFunc(attr->act_gate); + auto act_cand = getActFunc(attr->act_cand); + int d = attr->d; + int d2 = d * 2; + act_gate(gates, gates, d); + act_cand(gates + d2, gates + d2, d); + VMul(gates, gates + d2, ht, d); +} + +// compute the first part of GRU: ht = act_gate(r) * ht_1 +template +void GRUHtPart1(gru_t* step, const gru_attr_t* attr) { + // W: {W_update, W_reset; W_state} + T* gates = reinterpret_cast(step->gates); + T* ht = reinterpret_cast(step->ht); + const T* ht_1 
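/* Editorial note: GRUHtPart1 only produces act_gate(r) * ht_1; presumably
   the caller runs the GEMM that turns this into the candidate
   pre-activation before invoking GRUHtPart2, which then blends
   u * act_cand(s) + (1 - u) * ht_1. The two-phase split is an assumption
   read off the call shape; the fused-GRU operator wiring lives elsewhere. */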
= reinterpret_cast(step->ht_1); + auto act_gate = getActFunc(attr->act_gate); + act_gate(gates + attr->d, gates + attr->d, attr->d); + VMul(ht_1, gates + attr->d, ht, attr->d); +} + +// compute the second part of GRU: +// ht = act_gate(u) * act_cand(s) + (1-act_gate(u)) * ht_1 +template +void GRUHtPart2(gru_t* step, const gru_attr_t* attr) { + T* gates = reinterpret_cast(step->gates); + T* ht = reinterpret_cast(step->ht); + const T* ht_1 = reinterpret_cast(step->ht_1); + auto act_gate = getActFunc(attr->act_gate); + auto act_cand = getActFunc(attr->act_cand); + int d = attr->d; + T* y = gates + d * 2; + act_gate(gates, gates, d); + act_cand(y, y, d); + // out = zt*ht~ + (1-zt)*ht_1 + for (int i = 0; i < d; ++i) { + ht[i] = gates[i] * y[i] + (static_cast(1) - gates[i]) * ht_1[i]; + } +} + +} // namespace refer +} // namespace jitkernel +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/math/jit_kernel_rnn.cc b/paddle/fluid/operators/math/jit_kernel_rnn.cc new file mode 100644 index 0000000000000000000000000000000000000000..2db3274a45610aedea385baf650b8efb42ac39d0 --- /dev/null +++ b/paddle/fluid/operators/math/jit_kernel_rnn.cc @@ -0,0 +1,263 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/math/jit_kernel.h" +#include +#include "paddle/fluid/operators/math/jit_kernel_macro.h" +#include "paddle/fluid/operators/math/jit_kernel_refer.h" +#include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/platform/macros.h" + +#ifdef PADDLE_WITH_XBYAK +#include "paddle/fluid/operators/math/jit_code.h" +#endif + +namespace paddle { +namespace operators { +namespace math { +namespace jitkernel { + +/* LSTM JitKernel */ +template +class LSTMKernelImpl : public LSTMKernel { + public: + static inline std::string name(const lstm_attr_t& attr) { + PADDLE_THROW("DType should be either float or double"); + } + static inline bool useJIT(int d) { return false; } + static inline bool useMKL(int d) { return false; } + explicit LSTMKernelImpl(const lstm_attr_t& attr) : LSTMKernel() { +#ifdef PADDLE_WITH_XBYAK + if (useJIT(attr.d)) { + size_t sz = 96 + attr.d / YMM_FLOAT_BLOCK * 90 * 4 * 8; + jitcode0_.reset(new gen::LSTMJitCode(false, attr, sz > 4096 ? sz : 4096)); + this->ComputeCtHt = + jitcode0_->getCode(); + + jitcode1_.reset(new gen::LSTMJitCode(true, attr, sz > 4096 ? 
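/* Minimal call sketch (editorial; struct and field names from
   jit_kernel_impl.h in this patch, buffer variables illustrative):
     lstm_attr_t attr(d, "sigmoid", "tanh", "tanh", false);
     const auto& lstm =
         KernelPool::Instance().Get<LSTMKernel<float>>(attr);
     lstm_t step;
     step.gates = gates;  // 4*d gate pre-activations: W_ch, W_ih, W_fh, W_oh
     step.ct_1  = ct_1;   // previous cell state (d)
     step.ct    = ct;     // cell state out (d)
     step.ht    = ht;     // hidden state out (d)
     lstm->ComputeCtHt(&step, &attr);  // bound to JIT code or refer::LSTMCtHt */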
sz : 4096)); + this->ComputeC1H1 = + jitcode1_->getCode(); + return; + } +#endif + + this->ComputeCtHt = refer::LSTMCtHt; + this->ComputeC1H1 = refer::LSTMC1H1; + } + +#ifdef PADDLE_WITH_XBYAK + + private: + std::unique_ptr jitcode0_{nullptr}, jitcode1_{nullptr}; +#endif +}; + +#ifdef PADDLE_WITH_XBYAK +template <> +bool LSTMKernelImpl::useJIT(int d) { + return gen::LSTMJitCode::init(d); +} +#endif + +/* Peephole JitKernel */ +template +class PeepholeKernelImpl : public LSTMKernel { + public: + static inline std::string name(const lstm_attr_t& attr) { + PADDLE_THROW("DType should be either float or double"); + } + static inline bool useJIT(int d) { return false; } + static inline bool useMKL(int d) { return false; } + explicit PeepholeKernelImpl(const lstm_attr_t& attr) : LSTMKernel() { +#ifdef PADDLE_WITH_XBYAK + if (useJIT(attr.d)) { + size_t sz = 96 + attr.d / YMM_FLOAT_BLOCK * 96 * 4 * 8; + jitcode0_.reset(new gen::LSTMJitCode(false, attr, sz > 4096 ? sz : 4096)); + this->ComputeCtHt = + jitcode0_->getCode(); + + jitcode1_.reset(new gen::LSTMJitCode(true, attr, sz > 4096 ? sz : 4096)); + this->ComputeC1H1 = + jitcode1_->getCode(); + return; + } +#endif + + this->ComputeCtHt = refer::LSTMCtHt; + this->ComputeC1H1 = refer::LSTMC1H1; + } + +#ifdef PADDLE_WITH_XBYAK + + private: + std::unique_ptr jitcode0_{nullptr}, jitcode1_{nullptr}; +#endif +}; + +#ifdef PADDLE_WITH_XBYAK +template <> +bool PeepholeKernelImpl::useJIT(int d) { + return gen::LSTMJitCode::init(d); +} +#endif + +#define JITKERNEL_DEFINE_NAME_LSTM(ker_key, ker_class) \ + template <> \ + std::string ker_class##Impl::name(const lstm_attr_t& attr) { \ + std::string key(#ker_key "f"); \ + key += (attr.act_gate + attr.act_cand + attr.act_cell + \ + (attr.use_peephole ? "p" : "n")); \ + if (useJIT(attr.d)) { \ + /* only jit code need record d*/ \ + return key + "jit" + std::to_string(attr.d); \ + } else if (useMKL(attr.d)) { \ + return key + "mkl"; \ + } else { \ + return key + "any"; \ + } \ + } \ + template <> \ + std::string ker_class##Impl::name(const lstm_attr_t& attr) { \ + std::string key(#ker_key "d"); \ + /* jit code do not support double yet*/ \ + if (useMKL(attr.d)) { \ + return key + "mkl"; \ + } else { \ + return key + "any"; \ + } \ + } + +#define JITKERNEL_DECLARE_LSTM(ker_class, ker_dtype) \ + template <> \ + std::shared_ptr> \ + KernelPool::Get, const lstm_attr_t&>( \ + const lstm_attr_t& attr) + +#define JITKERNEL_FIND_KEY_LSTM(ker_class, ker_dtype) \ + std::string key = ker_class##Impl::name(attr) + +#define JITKERNEL_LSTM_IMPL(ker, dtype) \ + if (attr.use_peephole) { \ + p = std::dynamic_pointer_cast>( \ + std::make_shared>(attr)); \ + } else { \ + p = std::dynamic_pointer_cast>( \ + std::make_shared>(attr)); \ + } + +REGISTER_JITKERNEL_ARGS(lstm, LSTMKernel, JITKERNEL_DEFINE_NAME_LSTM, + JITKERNEL_DECLARE_LSTM, JITKERNEL_FIND_KEY_LSTM, + JITKERNEL_LSTM_IMPL); + +#undef JITKERNEL_LSTM_IMPL +#undef JITKERNEL_FIND_KEY_LSTM +#undef JITKERNEL_DECLARE_LSTM +#undef JITKERNEL_DEFINE_NAME_LSTM + +/* GRU JitKernel */ +template +class GRUKernelImpl : public GRUKernel { + public: + static inline std::string name(const gru_attr_t& attr) { + PADDLE_THROW("DType should be either float or double"); + } + static inline bool useJIT(int d) { return false; } + static inline bool useMKL(int d) { return false; } + explicit GRUKernelImpl(const gru_attr_t& attr) : GRUKernel() { +#ifdef PADDLE_WITH_XBYAK + if (useJIT(attr.d)) { + size_t sz = 96 + attr.d / YMM_FLOAT_BLOCK * 96 * 2 * 8; + jitcode0_.reset(new gen::GRUJitCode(0, attr, sz 
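+// Key-naming sketch (reading JITKERNEL_DEFINE_NAME_LSTM above): a float
+// LSTM with act_gate=sigmoid, act_cand=tanh, act_cell=tanh and no
+// peephole produces the cache key
+//   "lstm" + "f" + "sigmoid" + "tanh" + "tanh" + "n" -> "lstmfsigmoidtanhtanhn"
+// suffixed with "jit" + d (e.g. "...jit8") when JIT is usable, "mkl", or
+// "any" otherwise; only JIT keys record d, because generated code is
+// specialized per size.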
> 4096 ? sz : 4096)); + this->ComputeH1 = + jitcode0_->getCode(); + + jitcode1_.reset(new gen::GRUJitCode(1, attr, sz > 4096 ? sz : 4096)); + this->ComputeHtPart1 = + jitcode1_->getCode(); + + jitcode2_.reset(new gen::GRUJitCode(2, attr, sz > 4096 ? sz : 4096)); + this->ComputeHtPart2 = + jitcode2_->getCode(); + return; + } +#endif + this->ComputeH1 = refer::GRUH1; + this->ComputeHtPart1 = refer::GRUHtPart1; + this->ComputeHtPart2 = refer::GRUHtPart2; + } +#ifdef PADDLE_WITH_XBYAK + + private: + std::unique_ptr jitcode0_{nullptr}, jitcode1_{nullptr}, + jitcode2_{nullptr}; +#endif +}; + +#ifdef PADDLE_WITH_XBYAK +template <> +bool GRUKernelImpl::useJIT(int d) { + return gen::GRUJitCode::init(d); +} +#endif + +#define JITKERNEL_DEFINE_NAME_GRU(ker_key, ker_class) \ + template <> \ + std::string ker_class##Impl::name(const gru_attr_t& attr) { \ + std::string key(#ker_key "f"); \ + key += (attr.act_gate + attr.act_cand); \ + if (useJIT(attr.d)) { \ + /* only jit code need record d*/ \ + return key + "jit" + std::to_string(attr.d); \ + } else if (useMKL(attr.d)) { \ + return key + "mkl"; \ + } else { \ + return key + "any"; \ + } \ + } \ + template <> \ + std::string ker_class##Impl::name(const gru_attr_t& attr) { \ + std::string key(#ker_key "d"); \ + /* jit code do not support double yet*/ \ + if (useMKL(attr.d)) { \ + return key + "mkl"; \ + } else { \ + return key + "any"; \ + } \ + } + +#define JITKERNEL_DECLARE_GRU(ker_class, ker_dtype) \ + template <> \ + std::shared_ptr> \ + KernelPool::Get, const gru_attr_t&>( \ + const gru_attr_t& attr) + +#define JITKERNEL_FIND_KEY_GRU(ker_class, ker_dtype) \ + std::string key = ker_class##Impl::name(attr) + +#define JITKERNEL_GRU_IMPL(ker, dtype) \ + p = std::dynamic_pointer_cast>( \ + std::make_shared>(attr)); + +REGISTER_JITKERNEL_ARGS(gru, GRUKernel, JITKERNEL_DEFINE_NAME_GRU, + JITKERNEL_DECLARE_GRU, JITKERNEL_FIND_KEY_GRU, + JITKERNEL_GRU_IMPL); + +#undef JITKERNEL_GRU_IMPL +#undef JITKERNEL_FIND_KEY_GRU +#undef JITKERNEL_DECLARE_GRU +#undef JITKERNEL_DEFINE_NAME_GRU +} // namespace jitkernel +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/math/jit_kernel_test.cc b/paddle/fluid/operators/math/jit_kernel_test.cc index 7fdd1c6b76aebcea757540e7312a679b8c08402a..2e5072ca790a00cf0185adf32d64a3b392ff0019 100644 --- a/paddle/fluid/operators/math/jit_kernel_test.cc +++ b/paddle/fluid/operators/math/jit_kernel_test.cc @@ -13,7 +13,6 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/math/jit_kernel.h" -#include #include // for exp #include // for memcpy #include @@ -22,6 +21,8 @@ limitations under the License. */ #include "gflags/gflags.h" #include "glog/logging.h" #include "gtest/gtest.h" +#include "paddle/fluid/operators/math/jit_kernel_refer.h" +#include "paddle/fluid/platform/port.h" #ifdef PADDLE_WITH_MKLML #include "paddle/fluid/platform/dynload/mklml.h" @@ -33,6 +34,9 @@ limitations under the License. */ constexpr int repeat = 20000; +// TODO(TJ): benchmark and test should be seperated, +// benchmark should verify more sizes + inline double GetCurrentUS() { struct timeval time; gettimeofday(&time, NULL); @@ -50,12 +54,6 @@ void RandomVec(const int n, T* a, const T lower = static_cast(-20.f), } } -void vrelu_ref(const int n, const float* x, float* y) { - for (int i = 0; i < n; ++i) { - y[i] = x[i] > 0.f ? 
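+// Sizing note (an assumption about the heuristic above, not documented in
+// this diff): sz = 96 + d / YMM_FLOAT_BLOCK * 96 * 2 * 8 roughly budgets a
+// fixed prologue/epilogue plus a per-8-float-block instruction cost, and
+// the `sz > 4096 ? sz : 4096` clamp keeps the Xbyak code buffer at no less
+// than one 4KB page.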
x[i] : 0.f; - } -} - #if defined __AVX__ || defined __AVX2__ void vrelu_intri8(const int n, const float* x, float* y) { __m256 tmp = _mm256_loadu_ps(x); @@ -66,7 +64,8 @@ void vrelu_intri8(const int n, const float* x, float* y) { TEST(JitKernel, vrelu) { namespace jit = paddle::operators::math::jitkernel; - for (int d : {7, 8, 15, 16, 30, 256, 512}) { + namespace refer = paddle::operators::math::jitkernel::refer; + for (int d : {3, 7, 8, 15, 16, 30, 256, 512}) { std::vector x(d); std::vector zref(d), ztgt(d); RandomVec(d, x.data(), -10.f, 1.f); @@ -77,7 +76,7 @@ TEST(JitKernel, vrelu) { float* zref_data = zref.data(); auto trefs = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - vrelu_ref(d, x_data, zref_data); + refer::VRelu(x_data, zref_data, d); } auto trefe = GetCurrentUS(); #if defined __AVX__ || defined __AVX2__ @@ -87,30 +86,26 @@ TEST(JitKernel, vrelu) { vrelu_intri8(d, x_data, zref_data); } auto si1 = GetCurrentUS(); - VLOG(3) << "Vec size 8 intr takes: " << (si1 - si0) / repeat; + VLOG(30) << "Vec size 8 intr takes: " << (si1 - si0) / repeat << " us"; } #endif auto ttgts = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - ker->Compute(x_data, ztgt_data); + ker->Compute(x_data, ztgt_data, d); } auto ttgte = GetCurrentUS(); - VLOG(3) << "Vec size " << d << ": refer takes: " << (trefe - trefs) / repeat - << " us, tgt takes: " << (ttgte - ttgts) / repeat; + VLOG(30) << "Vec size " << d + << ": refer takes: " << (trefe - trefs) / repeat + << " us, tgt takes: " << (ttgte - ttgts) / repeat << " us"; for (int i = 0; i < d; ++i) { EXPECT_NEAR(ztgt_data[i], zref_data[i], 1e-3); } } } -void vaddbias_ref(const int n, const float a, const float* x, float* y) { - for (int i = 0; i < n; ++i) { - y[i] = x[i] + a; - } -} - TEST(JitKernel, vaddbias) { namespace jit = paddle::operators::math::jitkernel; + namespace refer = paddle::operators::math::jitkernel::refer; for (int d : {7, 8, 15, 16, 30, 64, 100, 128, 256}) { std::vector x(d); std::vector zref(d), ztgt(d); @@ -123,29 +118,24 @@ TEST(JitKernel, vaddbias) { float* zref_data = zref.data(); auto trefs = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - vaddbias_ref(d, a, x_data, zref_data); + refer::VAddBias(&a, x_data, zref_data, d); } auto trefe = GetCurrentUS(); auto ttgts = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - ker->Compute(a, x_data, ztgt_data); + ker->Compute(&a, x_data, ztgt_data, d); } auto ttgte = GetCurrentUS(); - VLOG(3) << "Vec size " << d << ": refer takes: " << (trefe - trefs) / repeat - << " us, tgt takes: " << (ttgte - ttgts) / repeat; + VLOG(30) << "Vec size " << d + << ": refer takes: " << (trefe - trefs) / repeat + << " us, tgt takes: " << (ttgte - ttgts) / repeat << " us"; for (int i = 0; i < d; ++i) { EXPECT_NEAR(ztgt_data[i], zref_data[i], 1e-3); } } } -void vexp_ref(const int n, const float* x, float* y) { - for (int i = 0; i < n; ++i) { - y[i] = std::exp(x[i]); - } -} - #ifdef PADDLE_WITH_MKLML void vexp_mkl(const int n, const float* x, float* y) { paddle::platform::dynload::vsExp(n, x, y); @@ -154,7 +144,8 @@ void vexp_mkl(const int n, const float* x, float* y) { TEST(JitKernel, vexp) { namespace jit = paddle::operators::math::jitkernel; - for (int d : {7, 8, 15, 16, 30, 128, 256}) { + namespace refer = paddle::operators::math::jitkernel::refer; + for (int d : {1, 3, 4, 6, 7, 8, 12, 15, 16, 20, 30, 128, 256}) { std::vector x(d); std::vector zref(d), ztgt(d); RandomVec(d, x.data(), -2.f, 2.f); @@ -165,7 +156,7 @@ TEST(JitKernel, vexp) { float* zref_data = zref.data(); auto trefs = 
GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - vexp_ref(d, x_data, zref_data); + refer::VExp(x_data, zref_data, d); } auto trefe = GetCurrentUS(); @@ -179,36 +170,25 @@ TEST(JitKernel, vexp) { auto ttgts = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - ker->Compute(x_data, ztgt_data); + // ker->Compute(x_data, ztgt_data); + ker->Compute(x_data, ztgt_data, d); } auto ttgte = GetCurrentUS(); - VLOG(3) << "Vec size " << d << ": refer takes: " << (trefe - trefs) / repeat + VLOG(30) << "Vec size " << d + << ": refer takes: " << (trefe - trefs) / repeat #ifdef PADDLE_WITH_MKLML - << " us, mkl takes: " << (tmkle - tmkls) / repeat << " us, " + << " us, mkl takes: " << (tmkle - tmkls) / repeat << " us, " #else - << " us, " + << " us, " #endif - << "tgt takes: " << (ttgte - ttgts) / repeat; + << "tgt takes: " << (ttgte - ttgts) / repeat << " us"; for (int i = 0; i < d; ++i) { EXPECT_NEAR(ztgt_data[i], zref_data[i], 1e-3); } } } -inline float _sigmoid(float x) { - const float min = SIGMOID_THRESHOLD_MIN; - const float max = SIGMOID_THRESHOLD_MAX; - float tmp = (x < min) ? min : ((x > max) ? max : x); - return 1.f / (1.f + std::exp(-tmp)); -} - -void vsigmoid_ref(const int n, const float* x, float* y) { - for (int i = 0; i < n; ++i) { - y[i] = _sigmoid(x[i]); - } -} - void vsigmoid_better( const std::shared_ptr< const paddle::operators::math::jitkernel::VExpKernel>& vexp, @@ -219,7 +199,7 @@ void vsigmoid_better( y[i] = (x[i] < min) ? min : ((x[i] > max) ? max : x[i]); y[i] = 0.f - y[i]; } - vexp->Compute(y, y); + vexp->Compute(y, y, n); for (int i = 0; i < n; ++i) { y[i] = 1.f / (1.f + y[i]); } @@ -227,7 +207,8 @@ void vsigmoid_better( TEST(JitKernel, vsigmoid) { namespace jit = paddle::operators::math::jitkernel; - for (int d : {7, 8, 15, 16, 30, 32, 64, 100, 128, 256}) { + namespace refer = paddle::operators::math::jitkernel::refer; + for (int d : {1, 3, 4, 6, 7, 8, 15, 16, 30, 32, 64, 100, 128, 256}) { std::vector x(d); std::vector zref(d), ztgt(d); RandomVec(d, x.data(), -2.f, 2.f); @@ -245,32 +226,25 @@ TEST(JitKernel, vsigmoid) { auto tmkle = GetCurrentUS(); auto trefs = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - vsigmoid_ref(d, x_data, zref_data); + refer::VSigmoid(x_data, zref_data, d); } auto trefe = GetCurrentUS(); auto ttgts = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - ker->Compute(x_data, ztgt_data); + ker->Compute(x_data, ztgt_data, d); } auto ttgte = GetCurrentUS(); - VLOG(3) << "Vec size " << d << ": refer takes: " << (trefe - trefs) / repeat - << " us, better(jit exp) takes: " << (tmkle - tmkls) / repeat - << " us, tgt takes: " << (ttgte - ttgts) / repeat; + VLOG(30) << "Vec size " << d + << ": refer takes: " << (trefe - trefs) / repeat + << " us, better(jit exp) takes: " << (tmkle - tmkls) / repeat + << " us, tgt takes: " << (ttgte - ttgts) / repeat << " us"; for (int i = 0; i < d; ++i) { EXPECT_NEAR(ztgt_data[i], zref_data[i], 1e-3); } } } -inline float _tanh(float x) { return 2.f * _sigmoid(2.f * x) - 1.f; } - -void vtanh_ref(const int n, const float* x, float* y) { - for (int i = 0; i < n; ++i) { - y[i] = _tanh(x[i]); - } -} - void vtanh_better( const std::shared_ptr< const paddle::operators::math::jitkernel::VScalKernel>& vscal, @@ -281,15 +255,17 @@ void vtanh_better( const paddle::operators::math::jitkernel::VAddBiasKernel>& vaddbias, const int n, const float* x, float* y) { - vscal->Compute(2.f, x, y); - vsigmoid->Compute(y, y); - vscal->Compute(2.f, y); - vaddbias->Compute(-1.f, y, y); + const float a = 2.f, b = -1.f; + vscal->Compute(&a, x, y, 
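+// (The four Compute calls here realize tanh(x) = 2 * sigmoid(2 * x) - 1
+// from existing kernels: scale by a=2, sigmoid, scale by 2, add b=-1 --
+// the same identity refer::VTanh uses. vtanh_better exists only as a
+// baseline for the benchmark below.)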
n); + vsigmoid->Compute(y, y, n); + vscal->Compute(&a, y, y, n); + vaddbias->Compute(&b, y, y, n); } TEST(JitKernel, vtanh) { namespace jit = paddle::operators::math::jitkernel; - for (int d : {7, 8, 15, 16, 30, 32, 64, 100, 128, 256}) { + namespace refer = paddle::operators::math::jitkernel::refer; + for (int d : {1, 2, 3, 4, 5, 6, 7, 8, 15, 16, 30, 32, 64, 100, 128, 256}) { std::vector x(d); std::vector zref(d), ztgt(d); RandomVec(d, x.data(), -2.f, 2.f); @@ -311,50 +287,25 @@ TEST(JitKernel, vtanh) { auto tmkle = GetCurrentUS(); auto trefs = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - vtanh_ref(d, x_data, zref_data); + refer::VTanh(x_data, zref_data, d); } auto trefe = GetCurrentUS(); auto ttgts = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - ker->Compute(x_data, ztgt_data); + ker->Compute(x_data, ztgt_data, d); } auto ttgte = GetCurrentUS(); - VLOG(3) << "Vec size " << d << ": refer takes: " << (trefe - trefs) / repeat - << " us, better(jit exp) takes: " << (tmkle - tmkls) / repeat - << " us, tgt takes: " << (ttgte - ttgts) / repeat; + VLOG(30) << "Vec size " << d + << ": refer takes: " << (trefe - trefs) / repeat + << " us, better(jit exp) takes: " << (tmkle - tmkls) / repeat + << " us, tgt takes: " << (ttgte - ttgts) / repeat << " us"; for (int i = 0; i < d; ++i) { EXPECT_NEAR(ztgt_data[i], zref_data[i], 1e-3); } } } -void lstm_ctht_ref( - const std::shared_ptr< - const paddle::operators::math::jitkernel::VSigmoidKernel>& - vsigmoid_3d, - const std::shared_ptr< - const paddle::operators::math::jitkernel::VTanhKernel>& vtanh_d, - const std::shared_ptr< - const paddle::operators::math::jitkernel::VExpKernel>& vexp_1, - const int d, float* gates, const float* ct_1, float* ct, float* ht) { - vsigmoid_3d->Compute(gates + d, gates + d); - vtanh_d->Compute(gates, gates); - const float *i = gates + d, *f = gates + d * 2, *o = gates + d * 3; - const float min = SIGMOID_THRESHOLD_MIN; - const float max = SIGMOID_THRESHOLD_MAX; - for (int k = 0; k < d; ++k) { - // C_t = C_t-1 * fgated + cand_gated * igated - ct[k] = ct_1[k] * f[k] + gates[k] * i[k]; - // H_t = act_cell(C_t) * ogated - float tmp = ct[k] * 2; - tmp = 0.f - ((tmp < min) ? min : ((tmp > max) ? 
max : tmp)); - vexp_1->Compute(&tmp, &tmp); - tmp = 2.f / (1.f + tmp) - 1.f; - ht[k] = tmp * o[k]; - } -} - void lstm_ctht_better( const std::shared_ptr< const paddle::operators::math::jitkernel::VSigmoidKernel>& @@ -367,19 +318,20 @@ void lstm_ctht_better( const paddle::operators::math::jitkernel::VAddKernel>& vadd_d, const int d, float* gates, const float* ct_1, float* ct, float* ht) { int d2 = d * 2; - vsigmoid_3d->Compute(gates + d, gates + d); - vtanh_d->Compute(gates, gates); - vmul_d->Compute(gates, gates + d, gates + d); - vmul_d->Compute(ct_1, gates + d2, gates + d2); - vadd_d->Compute(gates + d, gates + d2, ct); + vsigmoid_3d->Compute(gates + d, gates + d, 3 * d); + vtanh_d->Compute(gates, gates, d); + vmul_d->Compute(gates, gates + d, gates + d, d); + vmul_d->Compute(ct_1, gates + d2, gates + d2, d); + vadd_d->Compute(gates + d, gates + d2, ct, d); /* H_t = act_cell(C_t) * ogated */ - vtanh_d->Compute(ct, gates + d2); - vmul_d->Compute(gates + d2, gates + d * 3, ht); + vtanh_d->Compute(ct, gates + d2, d); + vmul_d->Compute(gates + d2, gates + d * 3, ht, d); } TEST(JitKernel, lstm) { namespace jit = paddle::operators::math::jitkernel; - for (int d : {7, 8, 15, 16, 30, 32, 64, 100}) { + namespace refer = paddle::operators::math::jitkernel::refer; + for (int d : {1, 2, 3, 4, 5, 6, 7, 8, 15, 16, 30, 32, 64, 100}) { int d4 = d * 4; int d3 = d * 3; std::vector x(d4), xref(d4); @@ -389,19 +341,17 @@ TEST(JitKernel, lstm) { RandomVec(d, ct_1.data(), -2.f, 2.f); memcpy(xref.data(), x.data(), sizeof(float) * d4); std::string act_gate = "sigmoid", act_cand = "tanh", act_cell = "tanh"; + const jit::lstm_attr_t attr(d, act_gate, act_cand, act_cell, false); const auto& ker = jit::KernelPool::Instance() - .template Get, const std::string&, - const std::string&, const std::string&>( - act_gate, act_cand, act_cell, d, false); + .template Get, const jit::lstm_attr_t&>( + attr); // below kernels are used to compute refer const auto& vsigmoid_3d = jit::KernelPool::Instance().template Get>( d3); const auto& vtanh_d = jit::KernelPool::Instance().template Get>(d); - const auto& vexp_1 = - jit::KernelPool::Instance().template Get>(1); const auto& vmul_d = jit::KernelPool::Instance().template Get>(d); const auto& vadd_d = @@ -415,9 +365,17 @@ TEST(JitKernel, lstm) { float* ct_ref_data = ct_ref.data(); float* ht_ref_data = ht_ref.data(); // compute once to check correctness - lstm_ctht_ref(vsigmoid_3d, vtanh_d, vexp_1, d, xref_data, ct_1_data, - ct_ref_data, ht_ref_data); - ker->ComputeCtHt(x_data, ct_1_data, ct_tgt_data, ht_tgt_data); + jit::lstm_t step; + step.gates = xref_data; + step.ct_1 = ct_1_data; + step.ct = ct_ref_data; + step.ht = ht_ref_data; + refer::LSTMCtHt(&step, &attr); + + step.gates = x_data; + step.ct = ct_tgt_data; + step.ht = ht_tgt_data; + ker->ComputeCtHt(&step, &attr); for (int i = 0; i < d; ++i) { EXPECT_NEAR(ct_tgt_data[i], ct_ref_data[i], 1e-3); EXPECT_NEAR(ht_tgt_data[i], ht_ref_data[i], 1e-3); @@ -431,31 +389,21 @@ TEST(JitKernel, lstm) { auto tmkle = GetCurrentUS(); auto trefs = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - lstm_ctht_ref(vsigmoid_3d, vtanh_d, vexp_1, d, xref_data, ct_1_data, - ct_ref_data, ht_ref_data); + refer::LSTMCtHt(&step, &attr); } auto trefe = GetCurrentUS(); auto ttgts = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - ker->ComputeCtHt(x_data, ct_1_data, ct_tgt_data, ht_tgt_data); + ker->ComputeCtHt(&step, &attr); } auto ttgte = GetCurrentUS(); - VLOG(3) << "Vec size " << d << ": refer takes: " << (trefe - trefs) / repeat - << " us, 
better(jit) takes: " << (tmkle - tmkls) / repeat - << " us, tgt takes: " << (ttgte - ttgts) / repeat; + VLOG(30) << "Vec size " << d + << ": refer takes: " << (trefe - trefs) / repeat + << " us, better(jit) takes: " << (tmkle - tmkls) / repeat + << " us, tgt takes: " << (ttgte - ttgts) / repeat << " us"; } } -void vscal_ref(const int n, const float a, const float* x, float* y) { - for (int i = 0; i < n; ++i) { - y[i] = a * x[i]; - } -} -void vscal_inp_ref(const int n, const float a, float* x) { - for (int i = 0; i < n; ++i) { - x[i] = a * x[i]; - } -} #if defined __AVX__ || defined __AVX2__ void vscal_intri8(const int n, const float a, const float* x, float* y) { __m256 tmp; @@ -481,6 +429,7 @@ void vscal_inp_mkl(const int n, const float a, float* x) { TEST(JitKernel, vscal) { namespace jit = paddle::operators::math::jitkernel; + namespace refer = paddle::operators::math::jitkernel::refer; for (int d : {7, 8, 15, 16, 30, 256, 512}) { std::vector x(d), y(d); std::vector zref(d), ztgt(d); @@ -495,12 +444,12 @@ TEST(JitKernel, vscal) { float* zref_data = zref.data(); auto trefs = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - vscal_ref(d, a, x_data, zref_data); + refer::VScal(&a, x_data, zref_data, d); } auto trefe = GetCurrentUS(); auto trefs1 = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - vscal_inp_ref(d, a, y_data); + refer::VScal(&a, y_data, y_data, d); } auto trefe1 = GetCurrentUS(); @@ -524,42 +473,39 @@ TEST(JitKernel, vscal) { vscal_inp_intri8(d, a, y_data); } auto si3 = GetCurrentUS(); - VLOG(3) << "Vec size 8 intr takes: " << (si1 - si0) / repeat - << " us, inplace: " << (si3 - si2) / repeat; + VLOG(30) << "Vec size 8 intr takes: " << (si1 - si0) / repeat + << " us, inplace: " << (si3 - si2) / repeat << " us"; } #endif auto ttgts = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - ker->Compute(a, x_data, ztgt_data); + ker->Compute(&a, x_data, ztgt_data, d); } auto ttgte = GetCurrentUS(); auto ttgts1 = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - ker->Compute(a, y_data); + ker->Compute(&a, y_data, y_data, d); } auto ttgte1 = GetCurrentUS(); - VLOG(3) << "Vec size " << d << ": refer takes: " << (trefe - trefs) / repeat - << " us, inplace takes: " << (trefe1 - trefs1) / repeat + VLOG(30) << "Vec size " << d + << ": refer takes: " << (trefe - trefs) / repeat + << " us, inplace takes: " << (trefe1 - trefs1) / repeat #ifdef PADDLE_WITH_MKLML - << " us, mkl inplace takes: " << (tmkle - tmkls) / repeat << " us, " + << " us, mkl inplace takes: " << (tmkle - tmkls) / repeat + << " us, " #else - << " us, " + << " us, " #endif - << "tgt takes: " << (ttgte - ttgts) / repeat - << "us, tgt inplace takes: " << (ttgte1 - ttgts1) / repeat; + << "tgt takes: " << (ttgte - ttgts) / repeat + << "us, tgt inplace takes: " << (ttgte1 - ttgts1) / repeat + << " us"; for (int i = 0; i < d; ++i) { EXPECT_NEAR(ztgt_data[i], zref_data[i], 1e-3); } } } -void vmul_ref(const int n, const float* x, const float* y, float* z) { - for (int i = 0; i < n; ++i) { - z[i] = x[i] * y[i]; - } -} - #if defined __AVX__ || defined __AVX2__ void vmul_intri8(const int n, const float* x, const float* y, float* z) { __m256 tmpx, tmpy; @@ -578,7 +524,8 @@ void vmul_mkl(const int n, const float* x, const float* y, float* z) { TEST(JitKernel, vmul) { namespace jit = paddle::operators::math::jitkernel; - for (int d : {7, 8, 15, 16, 30, 256, 512}) { + namespace refer = paddle::operators::math::jitkernel::refer; + for (int d : {7, 8, 15, 16, 20, 30, 256, 512, 1000, 1024}) { std::vector x(d), y(d); std::vector 
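+// (Aside on the vscal test above: ker->Compute(&a, y_data, y_data, d)
+// exercises the in-place path. x and y may alias because each element is
+// read once before it is written, which is why the dedicated in-place
+// reference could be dropped.)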
zref(d), ztgt(d); RandomVec(d, x.data()); @@ -591,7 +538,7 @@ TEST(JitKernel, vmul) { float* zref_data = zref.data(); auto trefs = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - vmul_ref(d, x_data, y_data, zref_data); + refer::VMul(x_data, y_data, zref_data, d); } auto trefe = GetCurrentUS(); @@ -610,35 +557,30 @@ TEST(JitKernel, vmul) { vmul_intri8(d, x_data, y_data, zref_data); } auto si1 = GetCurrentUS(); - VLOG(3) << "Vec size 8 intr takes: " << (si1 - si0) / repeat; + VLOG(30) << "Vec size 8 intr takes: " << (si1 - si0) / repeat; } #endif auto ttgts = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - ker->Compute(x_data, y_data, ztgt_data); + ker->Compute(x_data, y_data, ztgt_data, d); } auto ttgte = GetCurrentUS(); - VLOG(3) << "Vec size " << d << ": refer takes: " << (trefe - trefs) / repeat + VLOG(30) << "Vec size " << d + << ": refer takes: " << (trefe - trefs) / repeat #ifdef PADDLE_WITH_MKLML - << " us, mkl takes: " << (tmkle - tmkls) / repeat << " us, " + << " us, mkl takes: " << (tmkle - tmkls) / repeat << " us, " #else - << " us, " + << " us, " #endif - << "tgt takes: " << (ttgte - ttgts) / repeat; + << "tgt takes: " << (ttgte - ttgts) / repeat << " us"; for (int i = 0; i < d; ++i) { EXPECT_NEAR(ztgt_data[i], zref_data[i], 1e-3); } } } -void vadd_ref(const int n, const float* x, const float* y, float* z) { - for (int i = 0; i < n; ++i) { - z[i] = x[i] + y[i]; - } -} - #if defined __AVX__ || defined __AVX2__ void vadd_intri8(const int n, const float* x, const float* y, float* z) { __m256 tmpx, tmpy; @@ -657,6 +599,7 @@ void vadd_mkl(const int n, const float* x, const float* y, float* z) { TEST(JitKernel, vadd) { namespace jit = paddle::operators::math::jitkernel; + namespace refer = paddle::operators::math::jitkernel::refer; for (int d : {7, 8, 15, 16, 30, 256, 512}) { std::vector x(d), y(d); std::vector zref(d), ztgt(d); @@ -670,7 +613,7 @@ TEST(JitKernel, vadd) { float* zref_data = zref.data(); auto trefs = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - vadd_ref(d, x_data, y_data, zref_data); + refer::VAdd(x_data, y_data, zref_data, d); } auto trefe = GetCurrentUS(); @@ -689,23 +632,77 @@ TEST(JitKernel, vadd) { vadd_intri8(d, x_data, y_data, zref_data); } auto si1 = GetCurrentUS(); - VLOG(3) << "Vec size 8 intr takes: " << (si1 - si0) / repeat; + VLOG(30) << "Vec size 8 intr takes: " << (si1 - si0) / repeat; } #endif auto ttgts = GetCurrentUS(); for (int i = 0; i < repeat; ++i) { - ker->Compute(x_data, y_data, ztgt_data); + ker->Compute(x_data, y_data, ztgt_data, d); } auto ttgte = GetCurrentUS(); - VLOG(3) << "Vec size " << d << ": refer takes: " << (trefe - trefs) / repeat + VLOG(30) << "Vec size " << d + << ": refer takes: " << (trefe - trefs) / repeat #ifdef PADDLE_WITH_MKLML - << " us, mkl takes: " << (tmkle - tmkls) / repeat << " us, " + << " us, mkl takes: " << (tmkle - tmkls) / repeat << " us, " #else - << " us, " + << " us, " #endif - << "tgt takes: " << (ttgte - ttgts) / repeat; + << "tgt takes: " << (ttgte - ttgts) / repeat << " us"; + for (int i = 0; i < d; ++i) { + EXPECT_NEAR(ztgt_data[i], zref_data[i], 1e-3); + } + } +} + +void vaddrelu_better( + const std::shared_ptr< + const paddle::operators::math::jitkernel::VAddKernel>& vadd, + const std::shared_ptr< + const paddle::operators::math::jitkernel::VReluKernel>& vrelu, + const float* x, const float* y, float* z, int d) { + vadd->Compute(x, y, z, d); + vrelu->Compute(z, z, d); +} + +TEST(JitKernel, vaddrelu) { + namespace jit = paddle::operators::math::jitkernel; + namespace refer = 
paddle::operators::math::jitkernel::refer; + for (int d : {7, 8, 15, 16, 30, 256, 512}) { + std::vector x(d), y(d); + std::vector zref(d), ztgt(d); + RandomVec(d, x.data()); + RandomVec(d, y.data()); + const auto& ker = + jit::KernelPool::Instance().template Get>(d); + const auto& vadd = + jit::KernelPool::Instance().template Get>(d); + const auto& vrelu = + jit::KernelPool::Instance().template Get>(d); + const float* x_data = x.data(); + const float* y_data = y.data(); + float* ztgt_data = ztgt.data(); + float* zref_data = zref.data(); + auto trefs = GetCurrentUS(); + for (int i = 0; i < repeat; ++i) { + refer::VAddRelu(x_data, y_data, zref_data, d); + } + auto trefe = GetCurrentUS(); + auto tmkls = GetCurrentUS(); + for (int i = 0; i < repeat; ++i) { + vaddrelu_better(vadd, vrelu, x_data, y_data, zref_data, d); + } + auto tmkle = GetCurrentUS(); + auto ttgts = GetCurrentUS(); + for (int i = 0; i < repeat; ++i) { + ker->Compute(x_data, y_data, ztgt_data, d); + } + auto ttgte = GetCurrentUS(); + VLOG(30) << "Vec size " << d + << ": refer takes: " << (trefe - trefs) / repeat + << " us, better takes: " << (tmkle - tmkls) / repeat << " us, " + << "tgt takes: " << (ttgte - ttgts) / repeat << " us"; for (int i = 0; i < d; ++i) { EXPECT_NEAR(ztgt_data[i], zref_data[i], 1e-3); } @@ -716,21 +713,23 @@ TEST(JitKernel, pool) { namespace jit = paddle::operators::math::jitkernel; const int frame_size = 4; std::string act_gate = "sigmoid", act_cand = "tanh", act_cell = "tanh"; + jit::lstm_attr_t attr(frame_size, act_gate, act_cand, act_cell, false); + + // empty call it to avoid unknown flag 'use_pinned_memory' on Mac + paddle::platform::jit::MayIUse(paddle::platform::jit::avx); const auto& plstm1 = jit::KernelPool::Instance() - .template Get, const std::string&, - const std::string&, const std::string&>( - act_gate, act_cand, act_cell, frame_size, false); + .template Get, const jit::lstm_attr_t&>(attr); + const auto& plstm2 = jit::KernelPool::Instance() - .template Get, const std::string&, - const std::string&, const std::string&>( - act_gate, act_cand, act_cell, frame_size, false); + .template Get, const jit::lstm_attr_t&>(attr); + EXPECT_EQ(plstm1, plstm2); + const auto& peephole = jit::KernelPool::Instance() - .template Get, const std::string&, - const std::string&, const std::string&>( - act_gate, act_cand, act_cell, frame_size, true); + .template Get, const jit::lstm_attr_t&>( + jit::lstm_attr_t(frame_size, act_gate, act_cand, act_cell, true)); EXPECT_TRUE(plstm1 != peephole); const auto& pvmul_f = @@ -743,8 +742,12 @@ TEST(JitKernel, pool) { EXPECT_TRUE(std::dynamic_pointer_cast(pvmul_f) != std::dynamic_pointer_cast(pvmul_d)); - const auto& pvmul_from_key = jit::KernelPool::Instance().Get("vmulf4"); - EXPECT_EQ(pvmul_f, pvmul_from_key); - const auto& pvmul_from_key2 = jit::KernelPool::Instance().Get("vmulf5"); + const auto& pvmul_from_key = jit::KernelPool::Instance().Get("vmulfjit4"); +#if defined(__APPLE__) || defined(__OSX__) || defined(_WIN32) + EXPECT_EQ(pvmul_from_key, nullptr); +#else + EXPECT_EQ(pvmul_from_key, pvmul_f); +#endif + const auto& pvmul_from_key2 = jit::KernelPool::Instance().Get("vmulfjit"); EXPECT_TRUE(pvmul_from_key2 == nullptr); } diff --git a/paddle/fluid/operators/math/matrix_bit_code.h b/paddle/fluid/operators/math/matrix_bit_code.h index 07854c83584f90db02b416b85a4aa61f5cdc0685..c329b8b6113e847ec1c57e63258a18b6f65d9396 100644 --- a/paddle/fluid/operators/math/matrix_bit_code.h +++ b/paddle/fluid/operators/math/matrix_bit_code.h @@ -67,7 +67,7 @@ inline constexpr 
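+// FindLastSet(x) is the 1-based index of the highest set bit (0 when
+// x == 0), computed with the __builtin_clz variant matching the width of
+// size_t. Quick check: FindLastSet(10) == 4, since 10 = 0b1010 and, on a
+// 64-bit size_t, 8 * sizeof(x) - __builtin_clzl(10) = 64 - 60 = 4.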
size_t FindLastSet(size_t x) { : (std::is_same::value // NOLINT ? (x ? 8 * sizeof(x) - __builtin_clzl(x) : 0) : (x ? 8 * sizeof(x) - __builtin_clzll(x) : 0)); - +} #else // windows don't have built-in clz, ctz function template @@ -92,7 +92,6 @@ inline int clz(const T& value) { inline size_t FindLastSet(size_t x) { return sizeof(size_t) * 8 - clz(x); } #endif // !_WIN32 -} struct SimpleCode { SimpleCode(size_t code, size_t num_classes) : c_(code + num_classes) {} diff --git a/paddle/fluid/operators/math/pooling.cc b/paddle/fluid/operators/math/pooling.cc index b871851798e48e6b598cb4ab8e2e42db478a3820..8df43bb616179e2487534e0acabb71b09b87e1af 100644 --- a/paddle/fluid/operators/math/pooling.cc +++ b/paddle/fluid/operators/math/pooling.cc @@ -31,7 +31,7 @@ class Pool2dFunctor { const framework::Tensor& input, const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_process, - framework::Tensor* output) { + bool exclusive, framework::Tensor* output) { const int batch_size = input.dims()[0]; const int input_height = input.dims()[2]; const int input_width = input.dims()[3]; @@ -68,7 +68,8 @@ class Pool2dFunctor { pool_process.compute(input_data[h * input_width + w], &ele); } } - int pool_size = (hend - hstart) * (wend - wstart); + int pool_size = exclusive ? (hend - hstart) * (wend - wstart) + : ksize_height * ksize_width; pool_process.finalize(static_cast(pool_size), &ele); output_data[ph * output_width + pw] = ele; } @@ -93,7 +94,7 @@ class Pool2dGradFunctor { const framework::Tensor& output, const framework::Tensor& output_grad, const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_grad_process, - framework::Tensor* input_grad) { + bool exclusive, framework::Tensor* input_grad) { const int batch_size = input.dims()[0]; const int input_height = input.dims()[2]; const int input_width = input.dims()[3]; @@ -124,7 +125,8 @@ class Pool2dGradFunctor { int wstart = pw * stride_width - padding_width; int wend = std::min(wstart + ksize_width, input_width); wstart = std::max(wstart, 0); - int pool_size = (hend - hstart) * (wend - wstart); + int pool_size = exclusive ? (hend - hstart) * (wend - wstart) + : ksize_height * ksize_width; float scale = 1.0 / pool_size; for (int h = hstart; h < hend; ++h) { for (int w = wstart; w < wend; ++w) { @@ -249,7 +251,7 @@ class Pool3dFunctor { const framework::Tensor& input, const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_process, - framework::Tensor* output) { + bool exclusive, framework::Tensor* output) { const int batch_size = input.dims()[0]; const int input_depth = input.dims()[2]; const int input_height = input.dims()[3]; @@ -300,7 +302,9 @@ class Pool3dFunctor { } } int pool_size = - (dend - dstart) * (hend - hstart) * (wend - wstart); + exclusive + ? 
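+// Worked example of the new `exclusive` flag: with ksize=3, stride=1,
+// padding=1, a corner output overlaps only a 2x2x2 block of valid input,
+// so exclusive average pooling divides by
+// (dend - dstart) * (hend - hstart) * (wend - wstart) = 8, while inclusive
+// pooling divides by ksize_depth * ksize_height * ksize_width = 27,
+// counting the padded zeros (the 2D hunks above follow the same rule).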
(dend - dstart) * (hend - hstart) * (wend - wstart) + : ksize_depth * ksize_height * ksize_width; pool_process.finalize(static_cast(pool_size), &ele); output_data[output_idx] = ele; } @@ -326,7 +330,7 @@ class Pool3dGradFunctor { const framework::Tensor& output, const framework::Tensor& output_grad, const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_grad_process, - framework::Tensor* input_grad) { + bool exclusive, framework::Tensor* input_grad) { const int batch_size = input.dims()[0]; const int input_depth = input.dims()[2]; const int input_height = input.dims()[3]; @@ -369,7 +373,9 @@ class Pool3dGradFunctor { wstart = std::max(wstart, 0); int pool_size = - (dend - dstart) * (hend - hstart) * (wend - wstart); + exclusive + ? (dend - dstart) * (hend - hstart) * (wend - wstart) + : ksize_depth * ksize_height * ksize_width; float scale = 1.0 / pool_size; for (int d = dstart; d < dend; ++d) { for (int h = hstart; h < hend; ++h) { diff --git a/paddle/fluid/operators/math/pooling.cu b/paddle/fluid/operators/math/pooling.cu index b1c76350d1724629bae175abf47e6671a1532242..cdc79e207aa9a2e59e25a07002134c12ad5a1df8 100644 --- a/paddle/fluid/operators/math/pooling.cu +++ b/paddle/fluid/operators/math/pooling.cu @@ -29,7 +29,7 @@ __global__ void KernelPool2D(const int nthreads, const T* input_data, const int ksize_width, const int stride_height, const int stride_width, const int padding_height, const int padding_width, PoolProcess pool_process, - T* output_data) { + bool exclusive, T* output_data) { for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < nthreads; index += blockDim.x * gridDim.x) { int pw = index % output_width; @@ -52,7 +52,8 @@ __global__ void KernelPool2D(const int nthreads, const T* input_data, pool_process.compute(input_data[h * input_width + w], &ele); } } - int pool_size = (hend - hstart) * (wend - wstart); + int pool_size = exclusive ? (hend - hstart) * (wend - wstart) + : ksize_height * ksize_width; pool_process.finalize(static_cast(pool_size), &ele); output_data[index] = ele; } @@ -65,7 +66,7 @@ __global__ void KernelPool2DGrad( const int input_width, const int output_height, const int output_width, const int ksize_height, const int ksize_width, const int stride_height, const int stride_width, const int padding_height, const int padding_width, - PoolProcess pool_process, T* input_grad) { + PoolProcess pool_process, bool exclusive, T* input_grad) { for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < nthreads; index += blockDim.x * gridDim.x) { int offsetW = index % input_width + padding_width; @@ -95,7 +96,8 @@ __global__ void KernelPool2DGrad( int wend = min(wstart + ksize_width, input_width); hstart = max(hstart, 0); wstart = max(wstart, 0); - int pool_size = (hend - hstart) * (wend - wstart); + int pool_size = exclusive ? 
(hend - hstart) * (wend - wstart) + : ksize_height * ksize_width; int output_sub_idx = ph * output_width + pw; pool_process.compute(input, output_data[output_sub_idx], output_grad[output_sub_idx], @@ -151,6 +153,37 @@ __global__ void KernelMaxPool2DGrad( } } +template +void Pool2dDirectCUDAFunctor::operator()( + const T* input, const std::vector& input_shape, + const std::vector& output_shape, const std::vector& ksize, + const std::vector& strides, const std::vector& paddings, + PoolProcess pool_compute, bool exclusive, T* output, cudaStream_t stream) { + const int batch_size = input_shape[0]; + const int input_channels = input_shape[1]; + const int input_height = input_shape[2]; + const int input_width = input_shape[3]; + const int output_channels = output_shape[1]; + const int output_height = output_shape[2]; + const int output_width = output_shape[3]; + const int ksize_height = ksize[0]; + const int ksize_width = ksize[1]; + const int stride_height = strides[0]; + const int stride_width = strides[1]; + const int padding_height = paddings[0]; + const int padding_width = paddings[1]; + + int nthreads = batch_size * output_channels * output_height * output_width; + int blocks = (nthreads + 1024 - 1) / 1024; + dim3 threads(1024, 1); + dim3 grid(blocks, 1); + + KernelPool2D<<>>( + nthreads, input, input_channels, input_height, input_width, output_height, + output_width, ksize_height, ksize_width, stride_height, stride_width, + padding_height, padding_width, pool_compute, exclusive, output); +} + /* * All tensors are in NCHW format. * Ksize, strides, paddings are two elements. These two elements represent @@ -163,7 +196,7 @@ class Pool2dFunctor { const framework::Tensor& input, const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_process, - framework::Tensor* output) { + bool exclusive, framework::Tensor* output) { const int batch_size = input.dims()[0]; const int input_channels = input.dims()[1]; const int input_height = input.dims()[2]; @@ -189,7 +222,8 @@ class Pool2dFunctor { KernelPool2D<<>>( nthreads, input_data, input_channels, input_height, input_width, output_height, output_width, ksize_height, ksize_width, stride_height, - stride_width, padding_height, padding_width, pool_process, output_data); + stride_width, padding_height, padding_width, pool_process, exclusive, + output_data); } }; @@ -208,7 +242,7 @@ class Pool2dGradFunctor { const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_process, - framework::Tensor* input_grad) { + bool exclusive, framework::Tensor* input_grad) { const int batch_size = input.dims()[0]; const int input_channels = input.dims()[1]; const int input_height = input.dims()[2]; @@ -236,7 +270,7 @@ class Pool2dGradFunctor { nthreads, input_data, output_data, output_grad_data, input_channels, input_height, input_width, output_height, output_width, ksize_height, ksize_width, stride_height, stride_width, padding_height, padding_width, - pool_process, input_grad_data); + pool_process, exclusive, input_grad_data); } }; @@ -288,6 +322,11 @@ class MaxPool2dGradFunctor { } }; +template class Pool2dDirectCUDAFunctor, + float>; +template class Pool2dDirectCUDAFunctor, + float>; + template class MaxPool2dGradFunctor; template class MaxPool2dGradFunctor; @@ -313,16 +352,14 @@ template class Pool2dGradFunctor; template -__global__ void KernelPool3D(const int nthreads, const T* input_data, - const int channels, const int input_depth, - const int input_height, const int input_width, - 
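+// Usage sketch for the Pool2dDirectCUDAFunctor added above (hypothetical
+// names and shapes; `in_ptr`/`out_ptr` are device pointers and `stream` an
+// existing cudaStream_t):
+//   paddle::operators::math::Pool2dDirectCUDAFunctor<
+//       paddle::operators::math::AvgPool<float>, float> pool2d;
+//   pool2d(in_ptr, /*NCHW in=*/{1, 3, 8, 8}, /*NCHW out=*/{1, 3, 4, 4},
+//          /*ksize=*/{2, 2}, /*strides=*/{2, 2}, /*paddings=*/{0, 0},
+//          paddle::operators::math::AvgPool<float>(), /*exclusive=*/true,
+//          out_ptr, stream);
+// Unlike Pool2dFunctor it takes raw pointers, so fused kernels can call it
+// without building framework::Tensor objects.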
const int output_depth, const int output_height, - const int output_width, const int ksize_depth, - const int ksize_height, const int ksize_width, - const int stride_depth, const int stride_height, - const int stride_width, const int padding_depth, - const int padding_height, const int padding_width, - PoolProcess pool_process, T* output_data) { +__global__ void KernelPool3D( + const int nthreads, const T* input_data, const int channels, + const int input_depth, const int input_height, const int input_width, + const int output_depth, const int output_height, const int output_width, + const int ksize_depth, const int ksize_height, const int ksize_width, + const int stride_depth, const int stride_height, const int stride_width, + const int padding_depth, const int padding_height, const int padding_width, + PoolProcess pool_process, bool exclusive, T* output_data) { for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < nthreads; index += blockDim.x * gridDim.x) { int pw = index % output_width; @@ -351,7 +388,9 @@ __global__ void KernelPool3D(const int nthreads, const T* input_data, } } } - int pool_size = (dend - dstart) * (hend - hstart) * (wend - wstart); + int pool_size = exclusive + ? (dend - dstart) * (hend - hstart) * (wend - wstart) + : ksize_depth * ksize_height * ksize_width; pool_process.finalize(static_cast(pool_size), &ele); output_data[index] = ele; } @@ -366,7 +405,7 @@ __global__ void KernelPool3DGrad( const int ksize_height, const int ksize_width, const int stride_depth, const int stride_height, const int stride_width, const int padding_depth, const int padding_height, const int padding_width, PoolProcess pool_process, - T* input_grad) { + bool exclusive, T* input_grad) { for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < nthreads; index += blockDim.x * gridDim.x) { int offsetW = index % input_width + padding_width; @@ -409,7 +448,9 @@ __global__ void KernelPool3DGrad( dstart = max(dstart, 0); hstart = max(hstart, 0); wstart = max(wstart, 0); - int pool_size = (dend - dstart) * (hend - hstart) * (wend - wstart); + int pool_size = + exclusive ? 
(dend - dstart) * (hend - hstart) * (wend - wstart) + : ksize_depth * ksize_height * ksize_width; int output_sub_idx = (pd * output_height + ph) * output_width + pw; pool_process.compute(input, output_data[output_sub_idx], output_grad[output_sub_idx], @@ -484,7 +525,7 @@ class Pool3dFunctor { const framework::Tensor& input, const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_process, - framework::Tensor* output) { + bool exclusive, framework::Tensor* output) { const int batch_size = input.dims()[0]; const int input_channels = input.dims()[1]; const int input_depth = input.dims()[2]; @@ -517,7 +558,7 @@ class Pool3dFunctor { nthreads, input_data, input_channels, input_depth, input_height, input_width, output_depth, output_height, output_width, ksize_depth, ksize_height, ksize_width, stride_depth, stride_height, stride_width, - padding_depth, padding_height, padding_width, pool_process, + padding_depth, padding_height, padding_width, pool_process, exclusive, output_data); } }; @@ -537,7 +578,7 @@ class Pool3dGradFunctor { const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_process, - framework::Tensor* input_grad) { + bool exclusive, framework::Tensor* input_grad) { const int batch_size = input.dims()[0]; const int input_channels = input.dims()[1]; const int input_depth = input.dims()[2]; @@ -573,7 +614,7 @@ class Pool3dGradFunctor { input_depth, input_height, input_width, output_depth, output_height, output_width, ksize_depth, ksize_height, ksize_width, stride_depth, stride_height, stride_width, padding_depth, padding_height, - padding_width, pool_process, input_grad_data); + padding_width, pool_process, exclusive, input_grad_data); } }; diff --git a/paddle/fluid/operators/math/pooling.h b/paddle/fluid/operators/math/pooling.h index 120f5919803806e0d3b7dc8eaf530ae89819b84d..923babd4c248364b735bb09def7bf12f2762f305 100644 --- a/paddle/fluid/operators/math/pooling.h +++ b/paddle/fluid/operators/math/pooling.h @@ -82,6 +82,19 @@ class AvgPoolGrad { * This is different from average pooling. So we rewrite the max_pool_grad: * MaxPool2dGradFunctor, MaxPool3dGradFunctor. 
*/ +#ifdef PADDLE_WITH_CUDA +template +class Pool2dDirectCUDAFunctor { + public: + void operator()(const T* input, const std::vector& input_shape, + const std::vector& output_shape, + const std::vector& ksize, + const std::vector& strides, + const std::vector& paddings, PoolProcess pool_compute, + bool exclusive, T* output, cudaStream_t stream); +}; +#endif + template class Pool2dFunctor { public: @@ -89,7 +102,7 @@ class Pool2dFunctor { const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_compute, - framework::Tensor* output); + bool exclusive, framework::Tensor* output); }; template @@ -101,7 +114,7 @@ class Pool2dGradFunctor { const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_compute, - framework::Tensor* input_grad); + bool exclusive, framework::Tensor* input_grad); }; template @@ -123,7 +136,7 @@ class Pool3dFunctor { const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_compute, - framework::Tensor* output); + bool exclusive, framework::Tensor* output); }; template @@ -135,7 +148,7 @@ class Pool3dGradFunctor { const std::vector& ksize, const std::vector& strides, const std::vector& paddings, PoolProcess pool_compute, - framework::Tensor* input_grad); + bool exclusive, framework::Tensor* input_grad); }; template diff --git a/paddle/fluid/operators/math/sampler.cc b/paddle/fluid/operators/math/sampler.cc index 3066dc0ba284611af89c4927f45089a570ab88bc..690d6f6baafb33d50c8f2d3606d903634d622d16 100644 --- a/paddle/fluid/operators/math/sampler.cc +++ b/paddle/fluid/operators/math/sampler.cc @@ -1,4 +1,4 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -13,52 +13,46 @@ See the License for the specific language governing permissions and limitations under the License. 
*/ #include "paddle/fluid/operators/math/sampler.h" +#include +#include +#include +#include namespace paddle { -namespace random { +namespace operators { +namespace math { Sampler::~Sampler() {} -UniformSampler::UniformSampler(int64 range) - : Sampler(range), inv_range_(1.0 / range) { - random_engine_ = std::make_shared(seed_); +UniformSampler::UniformSampler(int64_t range, unsigned int seed) + : Sampler(range, seed), inv_range_(1.0 / (range + 1)) { + random_engine_ = std::make_shared(seed_); dist_ = std::make_shared>(0, range); } -UniformSampler::UniformSampler(int64 range, unsigned int seed) - : Sampler(range, seed), inv_range_(1.0 / range) { - random_engine_ = std::make_shared(seed_); - dist_ = std::make_shared>(0, range); -} - -int64 UniformSampler::Sample() const { return (*dist_)(*random_engine_); } +int64_t UniformSampler::Sample() const { return (*dist_)(*random_engine_); } -float UniformSampler::Probability(int64 value) const { return inv_range_; } +float UniformSampler::Probability(int64_t value) const { return inv_range_; } -LogUniformSampler::LogUniformSampler(int64 range) - : Sampler(range), log_range_(log(range + 1)) { - random_engine_ = std::make_shared(seed_); - dist_ = std::make_shared>(0, 1); -} - -LogUniformSampler::LogUniformSampler(int64 range, unsigned int seed) +LogUniformSampler::LogUniformSampler(int64_t range, unsigned int seed) : Sampler(range, seed), log_range_(log(range + 1)) { - random_engine_ = std::make_shared(seed_); + random_engine_ = std::make_shared(seed_); dist_ = std::make_shared>(0, 1); } -int64 LogUniformSampler::Sample() const { + +int64_t LogUniformSampler::Sample() const { // Got Log Uniform distribution from uniform distribution by // inverse_transform_sampling method // More details: // https://wanghaoshuang.github.io/2017/11/Log-uniform-distribution-sampler/ - const int64 value = - static_cast(exp((*dist_)(*random_engine_) * log_range_)) - 1; + const int64_t value = + static_cast(exp((*dist_)(*random_engine_) * log_range_)) - 1; // Mathematically, value should be <= range_, but might not be due to some // floating point roundoff, so we mod by range_. 
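+  // Sketch of the math: the log-uniform CDF is
+  //   P(X <= x) = log(x + 1) / log(range + 1),
+  // so inverting with u ~ Uniform(0, 1) gives x = exp(u * log_range_) - 1,
+  // exactly the expression above. Probability() below is then the mass on
+  // the integer v: log((v + 2) / (v + 1)) / log_range_.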
return value % range_; } -float LogUniformSampler::Probability(int64 value) const { +float LogUniformSampler::Probability(int64_t value) const { // Given f(x) = 1/[(x+1) * log_range_] // The value's probability is integral of f(x) from value to (value + 1) // More details: @@ -66,5 +60,76 @@ float LogUniformSampler::Probability(int64 value) const { return (log((value + 2.0) / (value + 1.0))) / log_range_; } -} // namespace random +CustomSampler::CustomSampler(int64_t range, const float* probabilities, + unsigned int seed) + : Sampler(range, seed) { + random_engine_ = std::make_shared(seed_); + real_dist_ = std::make_shared>(0, 1); + int_dist_ = std::make_shared>(0, range); + alias_probs_ = std::make_shared>(range + 1); + alias_ = std::make_shared>(range + 1); + probs_ = std::make_shared>(range + 1); + + std::queue> bigs; + std::queue> littles; + for (int64_t i = 0; i <= range; ++i) { + (*probs_)[i] = probabilities[i]; + float normal_prob = probabilities[i] * (range + 1); + if (normal_prob - 1.0 > 1e-4) { + bigs.emplace(i, normal_prob); + } else if (1.0 - normal_prob > 1e-4) { + littles.emplace(i, normal_prob); + } else { + (*alias_probs_)[i] = normal_prob; + (*alias_)[i] = -1; + } + } + + while ((!littles.empty()) && (!bigs.empty())) { + auto big = bigs.front(); + auto little = littles.front(); + bigs.pop(); + littles.pop(); + (*alias_probs_)[little.first] = little.second; + (*alias_)[little.first] = big.first; + auto big_left = big.second - (1 - little.second); + if (big_left - 1.0 > 1e-4) { + bigs.emplace(big.first, big_left); + } else if (1.0 - big_left > 1e-4) { + littles.emplace(big.first, big_left); + } else { + (*alias_probs_)[big.first] = big_left; + (*alias_)[big.first] = -1; + } + } + + if (!littles.empty()) { // littles.second is close to 1.0 + auto little = littles.front(); + (*alias_probs_)[little.first] = 1.0; + (*alias_)[little.first] = -1; + } + + if (!bigs.empty()) { // bigs.second is close to 1.0 + auto big = bigs.front(); + (*alias_probs_)[big.first] = 1.0; + (*alias_)[big.first] = -1; + } +} + +int64_t CustomSampler::Sample() const { + auto index = (*int_dist_)(*random_engine_); + auto p = (*real_dist_)(*random_engine_); + if (p > (*alias_probs_)[index]) { + return (*alias_)[index]; + } else { + return index; + } +} + +float CustomSampler::Probability(int64_t value) const { + return (*probs_)[value]; +} + +} // namespace math +} // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/math/sampler.h b/paddle/fluid/operators/math/sampler.h index b82691f269c5d0f267ca98c78646efe9b26f0b34..836cdad51f17e93f811ba14695bbe1a65156c588 100644 --- a/paddle/fluid/operators/math/sampler.h +++ b/paddle/fluid/operators/math/sampler.h @@ -16,6 +16,8 @@ limitations under the License. 
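+*/
+
+// Recap of the alias-method setup in CustomSampler (sampler.cc above),
+// with illustrative numbers not taken from the code: probabilities are
+// scaled by (range + 1); mass > 1 goes to `bigs`, < 1 to `littles`, and
+// each little is topped up from a big whose index lands in alias_. For
+// probs {0.5, 0.3, 0.2} and range + 1 = 3, the scaled masses {1.5, 0.9,
+// 0.6} settle to alias_probs_ = {1.0, 0.9, 0.6}, alias_ = {-1, 0, 0}.
+// Sample() then draws a column i and u ~ U(0, 1), returning i when
+// u <= alias_probs_[i] and alias_[i] otherwise -- O(1) per draw.
+
+/*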
*/ #include #include #include +#include + namespace paddle { namespace operators { namespace math { @@ -27,14 +29,14 @@ namespace math { */ class Sampler { public: - explicit Sampler(int64_t range) : range_(range) { - PADDLE_ENFORCE_GT(range, 0); - std::random_device r; - seed_ = r(); - } - explicit Sampler(int64_t range, unsigned int seed) - : range_(range), seed_(seed) { - PADDLE_ENFORCE_GT(range, 0); + explicit Sampler(int64_t range, unsigned int seed = 0UL) : range_(range) { + // PADDLE_ENFORCE_GT(range, 0, "Range should be greater than 0."); + if (seed == 0) { + std::random_device r; + seed_ = r(); + } else { + seed_ = seed; + } } virtual ~Sampler(); // Sample a single value @@ -42,7 +44,7 @@ class Sampler { // The probability that a single call to Sample() returns the given value. virtual float Probability(int64_t value) const = 0; - int64 range() { return range_; } + int64_t range() { return range_; } protected: const int64_t range_; @@ -56,13 +58,11 @@ class Sampler { */ class UniformSampler : public Sampler { public: - explicit UniformSampler(int64_t range); - - explicit UniformSampler(int64_t range, unsigned int seed); + explicit UniformSampler(int64_t range, unsigned int seed = 0UL); ~UniformSampler() override {} - int64 Sample() const override; + int64_t Sample() const override; float Probability(int64_t value) const override; @@ -79,13 +79,11 @@ class UniformSampler : public Sampler { */ class LogUniformSampler : public Sampler { public: - explicit LogUniformSampler(int64_t range); - - explicit LogUniformSampler(int64_t range, unsigned int seed); + explicit LogUniformSampler(int64_t range, unsigned int seed = 0UL); ~LogUniformSampler() override {} - int64 Sample() const override; + int64_t Sample() const override; float Probability(int64_t value) const override; @@ -95,6 +93,29 @@ class LogUniformSampler : public Sampler { std::shared_ptr> dist_; }; +/** + * Sample integers from [0, range) from custom distribution. + */ +class CustomSampler : public Sampler { + public: + explicit CustomSampler(int64_t range, const float* probabilities, + unsigned int seed = 0UL); + + ~CustomSampler() override {} + + int64_t Sample() const override; + + float Probability(int64_t value) const override; + + private: + std::shared_ptr> alias_probs_; + std::shared_ptr> alias_; + std::shared_ptr> probs_; + std::shared_ptr random_engine_; + std::shared_ptr> real_dist_; + std::shared_ptr> int_dist_; +}; + } // namespace math } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/math/selected_rows_functor.cc b/paddle/fluid/operators/math/selected_rows_functor.cc index 08f57dd45ad76946cbcafb98a3414003ed9d67a9..5978c1d6056001142854583840b8bfcb54d475d1 100644 --- a/paddle/fluid/operators/math/selected_rows_functor.cc +++ b/paddle/fluid/operators/math/selected_rows_functor.cc @@ -12,9 +12,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include #include -#include +#include #include "paddle/fluid/operators/math/blas.h" #include "paddle/fluid/operators/math/selected_rows_functor.h" @@ -230,8 +229,24 @@ template struct SelectedRowsAddToTensor; // add or mul. 
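+// Sketch of what scatter::MergeAdd below computes (toy data): an input
+// SelectedRows with rows = {4, 0, 4} and 1-wide values {1., 2., 3.} merges
+// to unique rows {0, 4} with values {2., 4.} -- duplicate row ids are
+// summed. The multi-input overload unions all row ids first, zero-fills
+// the output, then accumulates row by row through elementwise_add_to,
+// which dispatches to BLAS AXPY for float/double and a plain loop for
+// other types.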
namespace scatter { -size_t FindPos(const std::vector& rows, int64_t value) { - return std::find(rows.begin(), rows.end(), value) - rows.begin(); +template +typename std::enable_if< + std::is_floating_point::value && + std::is_same::value>::type +elementwise_add_to(const DeviceContext& ctx, BlasT* blas, + size_t data_len, const T* in, T* out) { + blas->AXPY(data_len, 1., in, out); +} + +template +typename std::enable_if< + !std::is_floating_point::value && + std::is_same::value>::type +elementwise_add_to(const DeviceContext& ctx, BlasT* blas, + size_t data_len, const T* in, T* out) { + for (size_t i = 0; i < data_len; i++) { + out[i] += in[i]; + } } template @@ -246,48 +261,84 @@ struct MergeAdd { void operator()(const platform::CPUDeviceContext& context, const framework::SelectedRows& input, framework::SelectedRows* output) { - framework::SelectedRows& out = *output; - std::vector input_rows(input.rows()); + std::vector inputs; + inputs.push_back(&input); + (*this)(context, inputs, output); + } - std::map> merge_row_map; - for (size_t i = 0; i < input_rows.size(); ++i) { - merge_row_map[input_rows[i]].push_back(i); + void operator()(const platform::CPUDeviceContext& context, + const std::vector& inputs, + framework::SelectedRows* output) { + if (inputs.size() == 0) { + VLOG(30) << "no input! return"; + return; } - - std::vector merge_rows(merge_row_map.size()); - size_t idx = 0; - int64_t input_width = input.value().dims()[1]; - out.set_height(input.height()); - - T* out_data = out.mutable_value()->mutable_data( + const framework::SelectedRows* has_value_input = nullptr; + for (auto* in : inputs) { + if (in->rows().size() > 0) { + has_value_input = in; + break; + } + } + if (has_value_input == nullptr) { + VLOG(30) << "no input has value! just return" << std::endl; + return; + } + auto input_width = has_value_input->value().dims()[1]; + auto input_height = has_value_input->height(); + framework::SelectedRows& out = *output; + std::set merged_row_set; + for (auto* input : inputs) { + if (input->rows().size() == 0) { + continue; + } + PADDLE_ENFORCE_EQ(input_width, input->value().dims()[1], + "all input should have same " + "dimension except for the first one"); + PADDLE_ENFORCE_EQ(input_height, input->height(), + "all input should have same height"); + merged_row_set.insert(input->rows().begin(), input->rows().end()); + } + std::vector merge_rows(merged_row_set.begin(), + merged_row_set.end()); + std::unordered_map rows_to_id; + for (size_t i = 0; i < merge_rows.size(); ++i) { + rows_to_id[merge_rows[i]] = i; + } + out.set_rows(merge_rows); + out.set_height(input_height); + out.mutable_value()->mutable_data( framework::make_ddim( {static_cast(merge_rows.size()), input_width}), context.GetPlace()); - const T* in_data = input.value().data(); - - for (auto& row_pair : merge_row_map) { - auto* out_ptr = out_data + idx * input_width; - auto& rows = row_pair.second; - merge_rows[idx] = row_pair.first; - ++idx; - // rows.size() is always larger than 0 - std::memcpy(out_ptr, in_data + rows[0] * input_width, - sizeof(T) * input_width); - - for (size_t i = 1; i < rows.size(); ++i) { - auto* in_ptr = in_data + rows[i] * input_width; - for (int64_t j = 0; j < input_width; ++j) { - out_ptr[j] += in_ptr[j]; - } + + math::SetConstant constant_functor; + constant_functor(context, out.mutable_value(), 0.0); + + auto* out_data = out.mutable_value()->data(); + + auto blas = math::GetBlas(context); + for (auto* input : inputs) { + if (input->rows().size() == 0) { + continue; + } + auto* input_data = 
input->value().data(); + auto& input_rows = input->rows(); + + for (size_t i = 0; i < input_rows.size(); i++) { + size_t out_i = rows_to_id[input_rows[i]]; + elementwise_add_to( + context, &blas, static_cast(input_width), + &input_data[i * input_width], &out_data[out_i * input_width]); } } - - out.set_rows(merge_rows); } }; template struct MergeAdd; template struct MergeAdd; +template struct MergeAdd; +template struct MergeAdd; template struct UpdateToTensor { diff --git a/paddle/fluid/operators/math/selected_rows_functor.cu b/paddle/fluid/operators/math/selected_rows_functor.cu index ba8eccf82042b679f69a32f9d053f05ac8fb9a99..74b9659cfd38076bf1948b5c664817a6753b7090 100644 --- a/paddle/fluid/operators/math/selected_rows_functor.cu +++ b/paddle/fluid/operators/math/selected_rows_functor.cu @@ -18,6 +18,7 @@ limitations under the License. */ #include "paddle/fluid/operators/math/math_function.h" #include "paddle/fluid/operators/math/selected_rows_functor.h" #include "paddle/fluid/platform/cuda_primitives.h" +#include "paddle/fluid/platform/float16.h" namespace paddle { namespace operators { @@ -80,7 +81,7 @@ template __global__ void SelectedRowsAddTensorKernel(const T* selected_rows, const int64_t* rows, T* tensor_out, int64_t row_numel) { - const int ty = blockIdx.y; + const int ty = blockIdx.x; int tid = threadIdx.x; selected_rows += ty * row_numel; @@ -118,11 +119,11 @@ struct SelectedRowsAddTensor { auto* out_data = output->data(); SetConstant functor; - functor(context, output, 0.0); + functor(context, output, static_cast(0)); const int block_size = 256; dim3 threads(block_size, 1); - dim3 grid(1, in1_rows.size()); + dim3 grid(in1_rows.size(), 1); SelectedRowsAddTensorKernel< T, block_size><<>>( in1_data, in1_rows.CUDAData(context.GetPlace()), out_data, @@ -136,6 +137,9 @@ struct SelectedRowsAddTensor { template struct SelectedRowsAddTensor; template struct SelectedRowsAddTensor; +template struct SelectedRowsAdd; +template struct SelectedRowsAddTensor; template struct SelectedRowsAddTo { @@ -175,6 +179,8 @@ template struct SelectedRowsAddTo; template struct SelectedRowsAddTo; template struct SelectedRowsAddTo; template struct SelectedRowsAddTo; +template struct SelectedRowsAddTo; namespace { template @@ -182,7 +188,7 @@ __global__ void SelectedRowsAddToTensorKernel(const T* selected_rows, const int64_t* rows, T* tensor_out, int64_t row_numel) { - const int ty = blockIdx.y; + const int ty = blockIdx.x; int tid = threadIdx.x; selected_rows += ty * row_numel; @@ -215,7 +221,7 @@ struct SelectedRowsAddToTensor { auto* in2_data = input2->data(); const int block_size = 256; dim3 threads(block_size, 1); - dim3 grid(1, in1_rows.size()); + dim3 grid(in1_rows.size(), 1); SelectedRowsAddToTensorKernel< T, block_size><<>>( in1_data, in1_rows.CUDAData(context.GetPlace()), in2_data, @@ -227,6 +233,8 @@ template struct SelectedRowsAddToTensor; template struct SelectedRowsAddToTensor; template struct SelectedRowsAddToTensor; template struct SelectedRowsAddToTensor; +template struct SelectedRowsAddToTensor; namespace scatter { @@ -267,10 +275,15 @@ struct MergeAdd { void operator()(const platform::CUDADeviceContext& context, const framework::SelectedRows& input, framework::SelectedRows* output) { - framework::SelectedRows& out = *output; framework::Vector input_rows(input.rows()); + if (input_rows.size() == 0) { + return; + } + + framework::SelectedRows& out = *output; std::set row_set(input_rows.begin(), input_rows.end()); - std::vector merge_rows(row_set.begin(), row_set.end()); + std::vector 
merge_rows_cpu(row_set.begin(), row_set.end()); + framework::Vector merge_rows(merge_rows_cpu); auto input_width = input.value().dims()[1]; @@ -282,7 +295,7 @@ struct MergeAdd { context.GetPlace()); math::SetConstant constant_functor; - constant_functor(context, out.mutable_value(), 0.0); + constant_functor(context, out.mutable_value(), static_cast(0)); auto* out_data = out.mutable_value()->data(); auto* input_data = input.value().data(); @@ -296,18 +309,86 @@ struct MergeAdd { out.mutable_rows()->CUDAMutableData(context.GetPlace()), out.rows().size(), input_width); } + + void operator()(const platform::CUDADeviceContext& context, + const std::vector& inputs, + framework::SelectedRows* output) { + if (inputs.size() == 0) { + VLOG(30) << "no input! return"; + return; + } + const framework::SelectedRows* has_value_input = nullptr; + for (auto* in : inputs) { + if (in->rows().size() > 0) { + has_value_input = in; + break; + } + } + if (has_value_input == nullptr) { + VLOG(30) << "no input has value! just return" << std::endl; + return; + } + auto input_width = has_value_input->value().dims()[1]; + auto input_height = has_value_input->height(); + framework::SelectedRows& out = *output; + std::set merged_row_set; + for (auto* input : inputs) { + if (input->rows().size() == 0) { + continue; + } + PADDLE_ENFORCE_EQ(input_width, input->value().dims()[1], + "all input should have same " + "dimension except for the first one"); + PADDLE_ENFORCE_EQ(input_height, input->height(), + "all input should have same height"); + merged_row_set.insert(input->rows().begin(), input->rows().end()); + } + std::vector merge_rows_cpu(merged_row_set.begin(), + merged_row_set.end()); + framework::Vector merge_rows(merge_rows_cpu); + + out.set_rows(merge_rows); + out.set_height(input_height); + out.mutable_value()->mutable_data( + framework::make_ddim( + {static_cast(merge_rows.size()), input_width}), + context.GetPlace()); + + math::SetConstant constant_functor; + constant_functor(context, out.mutable_value(), static_cast(0)); + + auto* out_data = out.mutable_value()->data(); + + const int block_size = 256; + dim3 threads(block_size, 1); + + for (auto* input : inputs) { + if (input->rows().size() == 0) { + continue; + } + auto* input_data = input->value().data(); + auto& input_rows = input->rows(); + dim3 grid1(input_rows.size(), 1); + + MergeAddKernel<<>>( + input_data, input_rows.CUDAData(context.GetPlace()), out_data, + out.mutable_rows()->CUDAMutableData(context.GetPlace()), + out.rows().size(), input_width); + } + } }; template struct MergeAdd; template struct MergeAdd; template struct MergeAdd; template struct MergeAdd; +template struct MergeAdd; template __global__ void UpdateToTensorKernel(const T* selected_rows, const int64_t* rows, const ScatterOps& op, T* tensor_out, int64_t row_numel) { - const int ty = blockIdx.y; + const int ty = blockIdx.x; int tid = threadIdx.x; selected_rows += ty * row_numel; @@ -376,7 +457,7 @@ struct UpdateToTensor { auto* in2_data = input2->data(); dim3 threads(platform::PADDLE_CUDA_NUM_THREADS, 1); - dim3 grid(1, in1_rows.size()); + dim3 grid(in1_rows.size(), 1); UpdateToTensorKernel<<< grid, threads, 0, context.stream()>>>(in1_data, in1_rows.cuda_data(), op, in2_data, in1_row_numel); diff --git a/paddle/fluid/operators/math/selected_rows_functor.h b/paddle/fluid/operators/math/selected_rows_functor.h index 900be86f91c6658a5265189a6745316c6471209e..6d146d39d6d07678e859b82b25ba60ed7661546d 100644 --- a/paddle/fluid/operators/math/selected_rows_functor.h +++ 
b/paddle/fluid/operators/math/selected_rows_functor.h @@ -64,6 +64,8 @@ struct SelectedRowsSumTo { framework::SelectedRows* input2); }; +// FIXME: The result of SelectedRowsAddToTensor maybe non deterministic, +// because it uses CudaAtomicAdd. // input2 = input1 + input2 template struct SelectedRowsAddToTensor { @@ -83,155 +85,9 @@ struct MergeAdd { void operator()(const DeviceContext& context, const framework::SelectedRows& input, framework::SelectedRows* output); -}; - -template <> -struct MergeAdd { - framework::SelectedRows operator()(const platform::CPUDeviceContext& context, - const framework::SelectedRows& input) { - framework::SelectedRows out; - (*this)(context, input, &out); - return out; - } - - void operator()(const platform::CPUDeviceContext& context, - const framework::SelectedRows& input, - framework::SelectedRows* output) { - framework::SelectedRows& out = *output; - std::vector input_rows(input.rows()); - - std::map> merge_row_map; - for (size_t i = 0; i < input_rows.size(); ++i) { - merge_row_map[input_rows[i]].push_back(i); - } - - std::vector merge_rows(merge_row_map.size()); - size_t idx = 0; - int64_t input_width = input.value().dims()[1]; - out.set_height(input.height()); - - auto* out_data = out.mutable_value()->mutable_data( - framework::make_ddim( - {static_cast(merge_rows.size()), input_width}), - context.GetPlace()); - auto* in_data = input.value().data(); - - auto blas = GetBlas(context); - for (auto& row_pair : merge_row_map) { - auto* out_ptr = out_data + idx * input_width; - auto& rows = row_pair.second; - merge_rows[idx] = row_pair.first; - ++idx; - // rows.size() is always larger than 0 - blas.VCOPY(input_width, in_data + rows[0] * input_width, out_ptr); - - for (size_t i = 1; i < rows.size(); ++i) { - blas.AXPY(input_width, 1., in_data + rows[i] * input_width, out_ptr); - } - } - - out.set_rows(merge_rows); - } -}; - -template <> -struct MergeAdd { - framework::SelectedRows operator()(const platform::CPUDeviceContext& context, - const framework::SelectedRows& input) { - framework::SelectedRows out; - (*this)(context, input, &out); - return out; - } - - void operator()(const platform::CPUDeviceContext& context, - const framework::SelectedRows& input, - framework::SelectedRows* output) { - framework::SelectedRows& out = *output; - std::vector input_rows(input.rows()); - - std::map> merge_row_map; - for (size_t i = 0; i < input_rows.size(); ++i) { - merge_row_map[input_rows[i]].push_back(i); - } - - std::vector merge_rows(merge_row_map.size()); - size_t idx = 0; - int64_t input_width = input.value().dims()[1]; - out.set_height(input.height()); - - auto* out_data = out.mutable_value()->mutable_data( - framework::make_ddim( - {static_cast(merge_rows.size()), input_width}), - context.GetPlace()); - auto* in_data = input.value().data(); - - auto blas = GetBlas(context); - for (auto& row_pair : merge_row_map) { - auto* out_ptr = out_data + idx * input_width; - auto& rows = row_pair.second; - merge_rows[idx] = row_pair.first; - ++idx; - // rows.size() is always larger than 0 - blas.VCOPY(input_width, in_data + rows[0] * input_width, out_ptr); - - for (size_t i = 1; i < rows.size(); ++i) { - blas.AXPY(input_width, 1., in_data + rows[i] * input_width, out_ptr); - } - } - - out.set_rows(merge_rows); - } -}; - -template -struct Add { - framework::SelectedRows operator()(const DeviceContext& context, - const framework::SelectedRows& input1, - const framework::SelectedRows& input2) { - framework::SelectedRows out; - out.set_rows(input1.rows()); - 
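(Aside on the FIXME above: CudaAtomicAdd lets GPU thread scheduling decide the order in which partial sums are applied, and floating-point addition is not associative, so two runs over the same data can produce bitwise-different results. A small host-side C++ illustration of the underlying effect, not PaddlePaddle code:)

#include <cstdio>

int main() {
  float big = 1e8f, small = 4.0f;
  float a = (big + small) + small;  // each 4.0f is lost to rounding: 1.0e8f
  float b = big + (small + small);  // the combined 8.0f survives: 1.00000008e8f
  std::printf("%.0f vs %.0f\n", a, b);  // two different sums, same terms
  return 0;
}
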
out.set_height(input1.height()); - out.mutable_value()->mutable_data(input1.value().dims(), - context.GetPlace()); - auto e_out = framework::EigenVector::Flatten(*(out.mutable_value())); - auto e_in1 = framework::EigenVector::Flatten(input1.value()); - auto e_in2 = framework::EigenVector::Flatten(input2.value()); - e_out.device(*context.eigen_device()) = e_in1 + e_in2; - return out; - } -}; - -template -struct Mul { - // multiply two SelectedRows - framework::SelectedRows operator()(const DeviceContext& context, - const framework::SelectedRows& input1, - const framework::SelectedRows& input2) { - framework::SelectedRows out; - out.set_rows(input1.rows()); - out.set_height(input1.height()); - out.mutable_value()->mutable_data(input1.value().dims(), - context.GetPlace()); - auto e_out = framework::EigenVector::Flatten(*(out.mutable_value())); - auto e_in1 = framework::EigenVector::Flatten(input1.value()); - auto e_in2 = framework::EigenVector::Flatten(input2.value()); - e_out.device(*context.eigen_device()) = e_in1 * e_in2; - return out; - } - // multiply scalar to SelectedRows - framework::SelectedRows operator()(const DeviceContext& context, - const framework::SelectedRows& input1, - const T input2) { - framework::SelectedRows out; - out.set_rows(input1.rows()); - out.set_height(input1.height()); - out.mutable_value()->mutable_data(input1.value().dims(), - context.GetPlace()); - auto e_out = framework::EigenVector::Flatten(*(out.mutable_value())); - auto e_in1 = framework::EigenVector::Flatten(input1.value()); - e_out.device(*context.eigen_device()) = input2 * e_in1; - return out; - } + void operator()(const DeviceContext& context, + const std::vector& inputs, + framework::SelectedRows* output); }; enum class ScatterOps { ASSIGN, ADD, SUB, SUBBY, MUL, DIV, DIVBY }; diff --git a/paddle/fluid/operators/math/selected_rows_functor_test.cc b/paddle/fluid/operators/math/selected_rows_functor_test.cc index 835589356042b44c9fa5988aed726434fd66910a..f15b37a1e3f0ae9c7612c4f74470472393ff4ad6 100644 --- a/paddle/fluid/operators/math/selected_rows_functor_test.cc +++ b/paddle/fluid/operators/math/selected_rows_functor_test.cc @@ -302,6 +302,64 @@ TEST(selected_rows_functor, cpu_merge_add_int) { EXPECT_EQ(out_data[1 * row_numel], 2); EXPECT_EQ(out_data[2 * row_numel], 1); } + +TEST(selected_rows_functor, cpu_merge_add_multi) { + paddle::platform::CPUPlace cpu_place; + paddle::platform::CPUDeviceContext ctx(cpu_place); + paddle::operators::math::SetConstant + set_const; + + int64_t height = 10; + int64_t row_numel = 8; + + std::vector rows1{5, 2, 5, 3, 5}; + std::unique_ptr selected_rows1{ + new paddle::framework::SelectedRows(rows1, height)}; + auto* in1_value = selected_rows1->mutable_value(); + in1_value->mutable_data( + paddle::framework::make_ddim( + {static_cast(rows1.size()), row_numel}), + cpu_place); + set_const(ctx, in1_value, 1.0); + + std::vector rows2{2, 5, 3, 5, 3}; + std::unique_ptr selected_rows2{ + new paddle::framework::SelectedRows(rows2, height)}; + auto* in2_value = selected_rows2->mutable_value(); + in2_value->mutable_data( + paddle::framework::make_ddim( + {static_cast(rows2.size()), row_numel}), + cpu_place); + set_const(ctx, in2_value, 1.0); + + std::unique_ptr output{ + new paddle::framework::SelectedRows()}; + output->set_height(height); + paddle::operators::math::scatter::MergeAdd + merge_add_functor; + + std::vector inputs; + inputs.push_back(selected_rows1.get()); + inputs.push_back(selected_rows2.get()); + merge_add_functor(ctx, inputs, output.get()); + + 
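+  // [Note: both inputs were filled with 1.0 above. Across the two inputs,
+  // row id 2 occurs 2 times, row id 3 occurs 3 times, and row id 5 occurs
+  // 5 times, so every entry of a merged row equals that row's occurrence
+  // count, which in this test happens to coincide with the row id itself.
+  // Hence the comparison against ret_rows[i] below.]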
EXPECT_EQ(output->height(), height); + EXPECT_EQ(output->value().dims(), + paddle::framework::make_ddim({3, row_numel})); + + std::vector ret_rows{2, 3, 5}; + EXPECT_EQ(output->rows(), ret_rows); + + auto* out_data = output->value().data(); + for (size_t i = 0; i < ret_rows.size(); ++i) { + for (size_t j = 0; j < row_numel; ++j) { + EXPECT_EQ(out_data[i * row_numel + j], ret_rows[i]); + } + } +} + TEST(selected_rows_functor, cpu_sum_to) { paddle::platform::CPUPlace cpu_place; paddle::platform::CPUDeviceContext ctx(cpu_place); @@ -318,6 +376,7 @@ TEST(selected_rows_functor, cpu_sum_to) { paddle::framework::make_ddim( {static_cast(rows1.size()), row_numel}), cpu_place); + functor(ctx, in1_value, 1.0); std::vector rows2{0, 5, 7, 9}; std::unique_ptr selected_rows2{ @@ -327,6 +386,7 @@ TEST(selected_rows_functor, cpu_sum_to) { paddle::framework::make_ddim( {static_cast(rows2.size()), row_numel}), cpu_place); + functor(ctx, in2_value, 2.0); std::unique_ptr output{ new paddle::framework::SelectedRows()}; diff --git a/paddle/fluid/operators/math/selected_rows_functor_test.cu b/paddle/fluid/operators/math/selected_rows_functor_test.cu.cc similarity index 79% rename from paddle/fluid/operators/math/selected_rows_functor_test.cu rename to paddle/fluid/operators/math/selected_rows_functor_test.cu.cc index 5fc50aba25d8e69480a17f0f80877b0d03e17276..73d83fa2e43f14445c969648cd469b0e32d644c7 100644 --- a/paddle/fluid/operators/math/selected_rows_functor_test.cu +++ b/paddle/fluid/operators/math/selected_rows_functor_test.cu.cc @@ -12,10 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ +#include "paddle/fluid/operators/math/selected_rows_functor.h" #include #include "gtest/gtest.h" #include "paddle/fluid/operators/math/math_function.h" -#include "paddle/fluid/operators/math/selected_rows_functor.h" TEST(selected_rows_functor, gpu_add) { paddle::platform::CUDAPlace gpu_place(0); @@ -38,6 +38,7 @@ TEST(selected_rows_functor, gpu_add) { {static_cast(rows1.size()), row_numel}), gpu_place); functor(ctx, in1_value, 1.0); + PADDLE_ENFORCE(cudaDeviceSynchronize()); std::vector rows2{0, 5, 7, 9}; std::unique_ptr selected_rows2{ @@ -241,3 +242,67 @@ TEST(selected_rows_functor, gpu_add_to) { // row9: 2.0 + 3.0 EXPECT_EQ(tensor1_cpu_data[9 * row_numel + 6], 5.0); } + +TEST(selected_rows_functor, gpu_merge_add) { + paddle::platform::CUDAPlace gpu_place(0); + paddle::platform::CPUPlace cpu_place; + paddle::platform::CUDADeviceContext& ctx = + *reinterpret_cast( + paddle::platform::DeviceContextPool::Instance().Get(gpu_place)); + paddle::operators::math::SetConstant + set_const; + + int64_t height = 10; + int64_t row_numel = 8; + + std::vector rows1{5, 2, 5, 3, 5}; + std::unique_ptr selected_rows1{ + new paddle::framework::SelectedRows(rows1, height)}; + auto* in1_value = selected_rows1->mutable_value(); + in1_value->mutable_data( + paddle::framework::make_ddim( + {static_cast(rows1.size()), row_numel}), + gpu_place); + set_const(ctx, in1_value, 1.0); + + std::vector rows2{2, 5, 3, 5, 3}; + std::unique_ptr selected_rows2{ + new paddle::framework::SelectedRows(rows2, height)}; + auto* in2_value = selected_rows2->mutable_value(); + in2_value->mutable_data( + paddle::framework::make_ddim( + {static_cast(rows2.size()), row_numel}), + gpu_place); + set_const(ctx, in2_value, 1.0); + + std::unique_ptr output{ + new paddle::framework::SelectedRows()}; + output->set_height(height); + 
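+  // [Note: the merged rows and values live in GPU memory at this point; the
+  // checks further below therefore copy the value tensor back to the host
+  // with TensorCopy and ctx.Wait() before reading it through EXPECT_EQ.]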
+  paddle::operators::math::scatter::MergeAdd<
+      paddle::platform::CUDADeviceContext, float>
+      merge_add_functor;
+
+  std::vector<const paddle::framework::SelectedRows*> inputs;
+  inputs.push_back(selected_rows1.get());
+  inputs.push_back(selected_rows2.get());
+  merge_add_functor(ctx, inputs, output.get());
+
+  paddle::framework::Tensor output_cpu;
+  paddle::framework::TensorCopy(output->value(), cpu_place, ctx, &output_cpu);
+  ctx.Wait();
+
+  EXPECT_EQ(output->height(), height);
+  EXPECT_EQ(output->value().dims(),
+            paddle::framework::make_ddim({3, row_numel}));
+
+  std::vector<int64_t> ret_rows{2, 3, 5};
+  EXPECT_EQ(output->rows(), ret_rows);
+
+  auto* out_data = output_cpu.data<float>();
+  for (size_t i = 0; i < ret_rows.size(); ++i) {
+    for (size_t j = 0; j < row_numel; ++j) {
+      EXPECT_EQ(out_data[i * row_numel + j], ret_rows[i]);
+    }
+  }
+}
diff --git a/paddle/fluid/operators/math/sequence_pooling.cc b/paddle/fluid/operators/math/sequence_pooling.cc
index 235b5405fb7d016f4bd8c738f75b303522183116..6d491dbf1ed162ef07fda4c07e95cc57108486fd 100644
--- a/paddle/fluid/operators/math/sequence_pooling.cc
+++ b/paddle/fluid/operators/math/sequence_pooling.cc
@@ -31,7 +31,7 @@ template <typename T, int MajorType = Eigen::RowMajor,
 using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;
 
-template <typename T>
+template <typename T, bool is_test>
 class MaxSeqPoolFunctor {
  public:
   void operator()(const platform::CPUDeviceContext& context,
@@ -70,7 +70,41 @@ class MaxSeqPoolFunctor {
     }
   }
 };
+// Instantiation of max sequence pooling for the test phase, e.g. there is no
+// need to fill the index buffer.
+template <typename T>
+class MaxSeqPoolFunctor<T, true> {
+ public:
+  void operator()(const platform::CPUDeviceContext& context,
+                  const framework::LoDTensor& input, framework::Tensor* output,
+                  framework::Tensor* index) {
+    auto in_dims = input.dims();
+    auto out_dims = output->dims();
+    PADDLE_ENFORCE_GT(in_dims.size(), 1);
+    PADDLE_ENFORCE_GT(out_dims.size(), 1);
+    for (int64_t i = 1; i < in_dims.size(); ++i) {
+      PADDLE_ENFORCE_EQ(in_dims[i], out_dims[i]);
+    }
+    auto starts = input.lod()[0];
+    const T* in_data = input.data<T>();
+    T* out_data = output->data<T>();
+
+    int64_t num_seq = out_dims[0];
+    int64_t dim = output->numel() / num_seq;
+    for (int64_t i = 0; i < num_seq; ++i) {
+      std::memcpy(&out_data[i * dim], &in_data[starts[i] * dim],
+                  dim * sizeof(T));
+      for (size_t j = starts[i] + 1; j < starts[i + 1]; ++j) {
+        for (int64_t k = 0; k < dim; ++k) {
+          if (in_data[j * dim + k] > out_data[i * dim + k]) {
+            out_data[i * dim + k] = in_data[j * dim + k];
+          }
+        }
+      }
+    }
+  }
+};
 template <typename T>
 class MaxSeqPoolGradFunctor {
  public:
@@ -157,17 +191,47 @@ class FirstSeqPoolFunctor {
   }
 };
 
+template <typename T>
+class SumSeqPoolGradFunctor {
+ public:
+  void operator()(const platform::CPUDeviceContext& context,
+                  const framework::Tensor& out_grad,
+                  framework::LoDTensor* in_grad) {
+    auto lod = in_grad->lod()[0];
+    int64_t out_w = out_grad.numel() / out_grad.dims()[0];
+    int64_t in_w = in_grad->numel() / in_grad->dims()[0];
+    PADDLE_ENFORCE(in_w == out_w);
+    const T* out_g_data = out_grad.data<T>();
+    T* in_g_data = in_grad->mutable_data<T>(context.GetPlace());
+    auto blas = math::GetBlas<platform::CPUDeviceContext, T>(context);
+    for (int i = 0; i < static_cast<int>(lod.size()) - 1; ++i) {
+      int64_t h = static_cast<int64_t>(lod[i + 1] - lod[i]);
+      int64_t in_offset = lod[i] * in_w;
+      const T* out_pos = out_g_data + i * out_w;
+      T* in_pos = in_g_data + in_offset;
+      for (int r = 0; r != h; ++r) {
+        blas.VCOPY(in_w, out_pos, in_pos + r * in_w);
+      }
+    }
+  }
+};
+
 template <typename T>
 class SequencePoolFunctor {
  public:
   /* max pool has index output */
   void operator()(const platform::CPUDeviceContext& context,
                   const std::string pooltype, const framework::LoDTensor& input,
-
framework::Tensor* output, + framework::Tensor* output, bool is_test, framework::Tensor* index = nullptr) { if (pooltype == "MAX") { - math::MaxSeqPoolFunctor max_pool; - max_pool(context, input, output, index); + if (is_test) { + math::MaxSeqPoolFunctor max_pool; + max_pool(context, input, output, index); + } else { + math::MaxSeqPoolFunctor max_pool; + max_pool(context, input, output, index); + } return; } if (pooltype == "LAST") { @@ -175,6 +239,7 @@ class SequencePoolFunctor { last_pool(context, input, output); return; } + if (pooltype == "FIRST") { math::FirstSeqPoolFunctor first_pool; first_pool(context, input, output); @@ -231,9 +296,15 @@ class SequencePoolGradFunctor { math::SetConstant functor; functor(context, in_grad, 0); } + + if (pooltype == "SUM") { + math::SumSeqPoolGradFunctor sum_pool_grad; + sum_pool_grad(context, out_grad, in_grad); + return; + } + auto lod = in_grad->lod()[0]; auto& place = *context.eigen_device(); - auto blas = math::GetBlas(context); for (int i = 0; i < static_cast(lod.size()) - 1; ++i) { auto in_g_t = in_grad->Slice(static_cast(lod[i]), static_cast(lod[i + 1])); @@ -247,12 +318,6 @@ class SequencePoolGradFunctor { if (pooltype == "AVERAGE") { in_g_e.device(place) = (out_g_e / static_cast(h)).broadcast(bcast); - } else if (pooltype == "SUM") { - const T* out_g_data = out_g_t.data(); - T* in_g_data = in_g_t.mutable_data(context.GetPlace()); - for (int r = 0; r != h; ++r) { - blas.VCOPY(w, out_g_data, in_g_data + r * w); - } } else if (pooltype == "SQRT") { in_g_e.device(place) = (out_g_e / std::sqrt(static_cast(h))).broadcast(bcast); diff --git a/paddle/fluid/operators/math/sequence_pooling.cu b/paddle/fluid/operators/math/sequence_pooling.cu index a92aef805a0434f2ebcbc62d4e5eaef0cfb21bfa..0015fafbc892912424dfa6dbd1778438d384ca19 100644 --- a/paddle/fluid/operators/math/sequence_pooling.cu +++ b/paddle/fluid/operators/math/sequence_pooling.cu @@ -133,7 +133,7 @@ class SequencePoolFunctor { public: void operator()(const platform::CUDADeviceContext& context, const std::string pooltype, const framework::LoDTensor& input, - framework::Tensor* output, + framework::Tensor* output, bool is_test, framework::Tensor* index = nullptr) { auto& lod = input.lod()[0]; const size_t item_dim = output->numel() / output->dims()[0]; diff --git a/paddle/fluid/operators/math/sequence_pooling.h b/paddle/fluid/operators/math/sequence_pooling.h index 8dcbee65d0b63a137e5f422ec8667cc950641b4a..a1046ea2160d0ae9c2251612c97d3f2640b0aad1 100644 --- a/paddle/fluid/operators/math/sequence_pooling.h +++ b/paddle/fluid/operators/math/sequence_pooling.h @@ -28,7 +28,7 @@ class SequencePoolFunctor { /* max pool has index output */ void operator()(const DeviceContext& context, const std::string pooltype, const framework::LoDTensor& input, framework::Tensor* output, - framework::Tensor* index = nullptr); + bool is_test = false, framework::Tensor* index = nullptr); }; template diff --git a/paddle/fluid/operators/math/sequence_pooling_test.cc b/paddle/fluid/operators/math/sequence_pooling_test.cc new file mode 100644 index 0000000000000000000000000000000000000000..5535523e798912ff80eeb5d753914c7d8d70a05f --- /dev/null +++ b/paddle/fluid/operators/math/sequence_pooling_test.cc @@ -0,0 +1,126 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/fluid/operators/math/sequence_pooling.h"
+#include <gtest/gtest.h>
+#include <vector>
+
+template <typename DeviceContext, typename Place, typename T>
+void TestSequencePoolingSum(const paddle::framework::LoD& lod) {
+  paddle::framework::LoDTensor cpu_out_grad;
+  paddle::framework::LoDTensor cpu_in_grad;
+  paddle::framework::LoDTensor out_grad;
+  paddle::framework::LoDTensor in_grad;
+  const size_t second_dim = 128u;
+
+  // construct out_grad's tensor on the CPU
+  const size_t out_first_dim = lod[0].size() - 1;
+  auto out_dims = paddle::framework::make_ddim(
+      {static_cast<int64_t>(out_first_dim), static_cast<int64_t>(second_dim)});
+
+  cpu_out_grad.mutable_data<T>(out_dims, paddle::platform::CPUPlace());
+  for (int64_t i = 0; i < cpu_out_grad.numel(); ++i) {
+    cpu_out_grad.data<T>()[i] = static_cast<T>(i);
+  }
+
+  // copy to dst out_grad
+  auto* place = new Place();
+  DeviceContext* context = new DeviceContext(*place);
+  if (paddle::platform::is_cpu_place(*place)) {
+    out_grad = cpu_out_grad;
+  } else {
+    TensorCopySync(cpu_out_grad, *place, &out_grad);
+  }
+
+  // construct in_grad
+  in_grad.set_lod(lod);
+  auto in_dims = paddle::framework::make_ddim(
+      {static_cast<int64_t>(lod[0].back()), static_cast<int64_t>(second_dim)});
+  in_grad.mutable_data<T>(in_dims, context->GetPlace());
+
+  // check tensor construction result
+  PADDLE_ENFORCE_EQ(in_grad.dims().size(), out_grad.dims().size());
+  for (int64_t i = 1; i < out_grad.dims().size(); ++i) {
+    PADDLE_ENFORCE_EQ(in_grad.dims()[i], out_grad.dims()[i]);
+  }
+
+  // call functor
+  paddle::operators::math::SequencePoolGradFunctor<DeviceContext, T>()(
+      *context, "SUM", out_grad, &in_grad);
+
+  if (paddle::platform::is_cpu_place(*place)) {
+    cpu_in_grad = in_grad;
+  } else {
+    TensorCopySync(in_grad, paddle::platform::CPUPlace(), &cpu_in_grad);
+    cpu_in_grad.set_lod(in_grad.lod());
+  }
+
+  EXPECT_EQ(in_grad.numel(), lod[0].back() * second_dim);
+  EXPECT_EQ(in_grad.lod(), lod);
+
+  if (paddle::platform::is_cpu_place(*place)) {
+    for (size_t i = 0; i < in_grad.lod()[0].size() - 1; ++i) {
+      int64_t begin = in_grad.lod()[0][i];
+      int64_t end = in_grad.lod()[0][i + 1];
+      paddle::framework::Tensor tmp = in_grad.Slice(begin, end);
+      for (size_t j = 0; j != tmp.numel() / second_dim; ++j) {
+        for (int64_t m = 0; m != second_dim; ++m) {
+          EXPECT_EQ(tmp.data<T>()[m + j * second_dim],
+                    out_grad.data<T>()[m + i * second_dim]);
+        }
+      }
+    }
+  } else {
+    for (size_t i = 0; i < cpu_in_grad.lod()[0].size() - 1; ++i) {
+      int64_t begin = cpu_in_grad.lod()[0][i];
+      int64_t end = cpu_in_grad.lod()[0][i + 1];
+      paddle::framework::Tensor tmp = cpu_in_grad.Slice(begin, end);
+      for (size_t j = 0; j != tmp.numel() / second_dim; ++j) {
+        for (int64_t m = 0; m != second_dim; ++m) {
+          EXPECT_EQ(tmp.data<T>()[m + j * second_dim],
+                    cpu_out_grad.data<T>()[m + i * second_dim]);
+        }
+      }
+    }
+  }
+
+  delete place;
+  delete context;
+}
+
+TEST(SequencePoolingGrad, CPU_SUM) {
+  paddle::framework::LoD lod1;
+  lod1.push_back(std::vector<uint64_t>{0, 10});
+  TestSequencePoolingSum<paddle::platform::CPUDeviceContext,
+                         paddle::platform::CPUPlace, float>(lod1);
+
+  paddle::framework::LoD lod2;
+  lod2.push_back(std::vector<uint64_t>{0, 2, 7, 10});
+  TestSequencePoolingSum<paddle::platform::CPUDeviceContext,
+                         paddle::platform::CPUPlace, float>(lod2);
+}
+
+#ifdef PADDLE_WITH_CUDA
+TEST(SequencePoolingGrad, CUDA_SUM) {
+  paddle::framework::LoD lod1;
+  lod1.push_back(std::vector<uint64_t>{0, 10});
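+  // [Note: an LoD row {0, 10} describes one sequence covering timesteps
+  // [0, 10); {0, 2, 7, 10} describes three sequences of lengths 2, 5 and 3.
+  // The SUM-pool gradient broadcasts each sequence's out_grad row to every
+  // timestep of that sequence, which is what TestSequencePoolingSum checks.]
+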
TestSequencePoolingSum(lod1); + + paddle::framework::LoD lod2; + lod2.push_back(std::vector{0, 2, 7, 10}); + TestSequencePoolingSum(lod2); +} +#endif diff --git a/paddle/fluid/operators/math/softmax.cc b/paddle/fluid/operators/math/softmax.cc index 78c65af24a8c5fa57e33415acc3018790bf70790..fa2018178f44ff4e3b14937c1f508fa8a698e20e 100644 --- a/paddle/fluid/operators/math/softmax.cc +++ b/paddle/fluid/operators/math/softmax.cc @@ -19,8 +19,10 @@ namespace paddle { namespace operators { namespace math { -template class SoftmaxFunctor; -template class SoftmaxFunctor; +template class SoftmaxFunctor; +template class SoftmaxFunctor; +template class SoftmaxFunctor; +template class SoftmaxFunctor; template class SoftmaxGradFunctor; template class SoftmaxGradFunctor; diff --git a/paddle/fluid/operators/math/softmax.cu b/paddle/fluid/operators/math/softmax.cu index 3effe776258cb541dbba32f63eda457d917011f4..2e9669049e36478549b793e3fa76220825888e21 100644 --- a/paddle/fluid/operators/math/softmax.cu +++ b/paddle/fluid/operators/math/softmax.cu @@ -96,12 +96,20 @@ template class SoftmaxCUDNNFunctor; template class SoftmaxCUDNNFunctor; template class SoftmaxGradCUDNNFunctor; template class SoftmaxGradCUDNNFunctor; +template class SoftmaxGradCUDNNFunctor; -template class SoftmaxFunctor; -template class SoftmaxFunctor; -template class SoftmaxFunctor; +template class SoftmaxFunctor; +template class SoftmaxFunctor; +template class SoftmaxFunctor; +template class SoftmaxFunctor; +template class SoftmaxFunctor; +template class SoftmaxFunctor; template class SoftmaxGradFunctor; template class SoftmaxGradFunctor; +template class SoftmaxGradFunctor; } // namespace math } // namespace operators diff --git a/paddle/fluid/operators/math/softmax.h b/paddle/fluid/operators/math/softmax.h index da1f0b672d3a5fb5da8f4d72892be21964bdbc0d..089458e957dfaac1cbc3bf1bc2b4be4877e702c9 100644 --- a/paddle/fluid/operators/math/softmax.h +++ b/paddle/fluid/operators/math/softmax.h @@ -19,7 +19,8 @@ namespace paddle { namespace operators { namespace math { -template +template class SoftmaxFunctor { public: void operator()(const DeviceContext& context, const framework::Tensor* X, diff --git a/paddle/fluid/operators/math/softmax_impl.h b/paddle/fluid/operators/math/softmax_impl.h index dd9971ba091cc3ece86654f65c335b98087f45ed..0f3e5b20086378da8ef1138a5f5c005b724f7fa2 100644 --- a/paddle/fluid/operators/math/softmax_impl.h +++ b/paddle/fluid/operators/math/softmax_impl.h @@ -16,6 +16,7 @@ limitations under the License. 
*/ #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/operators/math/blas.h" namespace paddle { namespace operators { namespace math { @@ -32,10 +33,10 @@ struct ValueClip { } }; -template -void SoftmaxFunctor::operator()(const DeviceContext& context, - const framework::Tensor* X, - framework::Tensor* Y) { +template +void SoftmaxFunctor::operator()( + const DeviceContext& context, const framework::Tensor* X, + framework::Tensor* Y) { auto logits = EigenMatrix::From(*X); auto softmax = EigenMatrix::From(*Y); @@ -65,6 +66,49 @@ void SoftmaxFunctor::operator()(const DeviceContext& context, .broadcast(one_by_class)); } +template +using enable_if_CPU = typename std::enable_if< + std::is_same::value>::type; + +template +class SoftmaxFunctor> { + void operator()(const DeviceContext& context, const framework::Tensor* X, + framework::Tensor* Y) { + auto in_dims = X->dims(); + auto out_dims = Y->dims(); + const float* in_data = X->data(); + float* out_data = Y->data(); + const int kBatchDim = 0; + const int kClassDim = 1; + // 2D data. Batch x C + const int batch_size = in_dims[kBatchDim]; + const int num_classes = in_dims[kClassDim]; + std::vector entities(batch_size); + auto blas = math::GetBlas(context); + for (int n = 0; n < batch_size; ++n) { + entities[n] = in_data[n * num_classes]; + for (int c = 1; c < num_classes; ++c) { + entities[n] = in_data[n * num_classes + c] > entities[n] + ? in_data[n * num_classes + c] + : entities[n]; + } + for (int c = 0; c < num_classes; ++c) { + out_data[n * num_classes + c] = + in_data[n * num_classes + c] - entities[n]; + } + } + + blas.VEXP(num_classes * batch_size, out_data, out_data); + for (int n = 0; n < batch_size; ++n) { + entities[n] = out_data[n * num_classes]; + for (int c = 1; c < num_classes; ++c) { + entities[n] += out_data[n * num_classes + c]; + } + blas.SCAL(num_classes, 1.0f / entities[n], &out_data[n * num_classes]); + } + } +}; + template void SoftmaxGradFunctor::operator()( const DeviceContext& context, const framework::Tensor* y, diff --git a/paddle/fluid/operators/mean_op.cc b/paddle/fluid/operators/mean_op.cc index 9e0bebd17c02a3ce010b77142757b8789cfbcdd9..820636defad0be9fb2e6decefc938658ae70ea9b 100644 --- a/paddle/fluid/operators/mean_op.cc +++ b/paddle/fluid/operators/mean_op.cc @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/mean_op.h" - +#include namespace paddle { namespace operators { @@ -34,7 +34,7 @@ class MeanOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { AddInput("X", "(Tensor) The input of mean op"); - AddOutput("Out", "(Tensor) The output of mean op").Reuse("X"); + AddOutput("Out", "(Tensor) The output of mean op"); AddComment(R"DOC( Mean Operator calculates the mean of all elements in X. @@ -42,6 +42,14 @@ Mean Operator calculates the mean of all elements in X. 
} }; +class MeanOpInferVarType : public framework::PassInDtypeAndVarTypeToOutput { + protected: + std::unordered_map GetInputOutputWithSameType() + const override { + return std::unordered_map{{"X", /*->*/ "Out"}}; + } +}; + class MeanGradOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; @@ -50,6 +58,14 @@ class MeanGradOp : public framework::OperatorWithKernel { ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X")); ctx->ShareLoD("X", framework::GradVarName("X")); } + + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + auto input_data_type = + framework::ToDataType(ctx.Input("X")->type()); + + return framework::OpKernelType(input_data_type, ctx.GetPlace()); + } }; class MeanGradMaker : public framework::SingleGradOpDescMaker { @@ -71,7 +87,8 @@ class MeanGradMaker : public framework::SingleGradOpDescMaker { } // namespace paddle namespace ops = paddle::operators; -REGISTER_OPERATOR(mean, ops::MeanOp, ops::MeanOpMaker, ops::MeanGradMaker); +REGISTER_OPERATOR(mean, ops::MeanOp, ops::MeanOpMaker, ops::MeanOpInferVarType, + ops::MeanGradMaker); REGISTER_OPERATOR(mean_grad, ops::MeanGradOp); REGISTER_OP_CPU_KERNEL( mean, ops::MeanKernel, diff --git a/paddle/fluid/operators/mean_op.cu b/paddle/fluid/operators/mean_op.cu index 91e0ab28efc21d4376524c8ecf66b429d51d8847..413b8ace67bd0a36849373812950834523b62216 100644 --- a/paddle/fluid/operators/mean_op.cu +++ b/paddle/fluid/operators/mean_op.cu @@ -15,11 +15,15 @@ limitations under the License. */ #define EIGEN_USE_GPU #include "paddle/fluid/operators/mean_op.h" +#include "paddle/fluid/platform/float16.h" namespace ops = paddle::operators; +namespace plat = paddle::platform; REGISTER_OP_CUDA_KERNEL( mean, ops::MeanKernel, - ops::MeanKernel); + ops::MeanKernel, + ops::MeanKernel); REGISTER_OP_CUDA_KERNEL( mean_grad, ops::MeanGradKernel, - ops::MeanGradKernel); + ops::MeanGradKernel, + ops::MeanGradKernel); diff --git a/paddle/fluid/operators/mean_op.h b/paddle/fluid/operators/mean_op.h index 362e9f9ae8b2f0f77198e3f3939211ae1117b27b..360b2f68a749f630d3c7ed009c16cb51ec150581 100644 --- a/paddle/fluid/operators/mean_op.h +++ b/paddle/fluid/operators/mean_op.h @@ -55,8 +55,7 @@ class MeanGradKernel : public framework::OpKernel { IG->mutable_data(context.GetPlace()); T ig_size = static_cast(IG->numel()); - Eigen::DSizes bcast(ig_size); - + Eigen::DSizes bcast(static_cast(ig_size)); EigenVector::Flatten(*IG).device( *context.template device_context().eigen_device()) = (EigenVector::From(*OG) / ig_size).broadcast(bcast); diff --git a/paddle/fluid/operators/merge_ids_op.h b/paddle/fluid/operators/merge_ids_op.h deleted file mode 100644 index 83712a8519c6817151e1922c606c0fdd4682a2db..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/merge_ids_op.h +++ /dev/null @@ -1,92 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
*/ - -#pragma once - -#include -#include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/framework/tensor_util.h" -#include "paddle/fluid/operators/math/selected_rows_functor.h" - -namespace paddle { -namespace operators { - -template -class MergeIdsOpKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext &ctx) const override { - auto place = ctx.GetPlace(); - if (!platform::is_cpu_place(place)) { - PADDLE_THROW("MergeIds do not support GPU kernel"); - } - VLOG(3) << "run in MergeIdsOpKernel"; - - const auto *ids_var = ctx.InputVar("Ids"); - PADDLE_ENFORCE(ids_var->IsType(), - "only support to merge Ids of LoDTensor"); - - const auto &ids_tensor = ids_var->Get(); - const auto &ids_dims = ids_tensor.dims(); - const int64_t *ids = ids_tensor.data(); - - auto x_tensors = ctx.MultiInput("X"); - - auto *out = ctx.Output("Out"); - - int batch_size = 0; - int embedding_size = 0; - for (auto &input : x_tensors) { - if (framework::product(input->dims()) != 0) { - if (embedding_size == 0) { - embedding_size = input->dims()[1]; - } - PADDLE_ENFORCE_EQ(embedding_size, input->dims()[1], - "embedding size of all input should be the same"); - batch_size += input->dims()[0]; - } - } - PADDLE_ENFORCE_EQ( - batch_size, ids_dims[0], - "the batch size of ids and merged embedding value should be the same"); - - const size_t shard_num = x_tensors.size(); - - if (shard_num == 1) { - VLOG(3) << "only one shard, we can copy the data directly"; - TensorCopy(*x_tensors[0], place, out); - } else { - std::vector in_indexs(shard_num, 0); - auto *out_data = out->mutable_data( - framework::make_ddim({batch_size, embedding_size}), place); - // copy data from ins[shard_num] to out. - for (int i = 0; i < ids_dims[0]; ++i) { - int64_t id = ids[i]; - size_t shard_id = static_cast(id) % shard_num; - int index = in_indexs[shard_id]; - memcpy(out_data + embedding_size * i, - x_tensors[shard_id]->data() + index * embedding_size, - sizeof(T) * embedding_size); - in_indexs[shard_id] += 1; - } - - for (size_t i = 0; i < shard_num; ++i) { - PADDLE_ENFORCE_EQ(in_indexs[i], x_tensors[i]->dims()[0], - "after merge, all data in x_tensor should be used"); - } - } - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/metrics/CMakeLists.txt b/paddle/fluid/operators/metrics/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..5d468316e8eacb73c4a4ce81c784880bb5e46c2d --- /dev/null +++ b/paddle/fluid/operators/metrics/CMakeLists.txt @@ -0,0 +1,2 @@ +include(operators) +register_operators() diff --git a/paddle/fluid/operators/accuracy_op.cc b/paddle/fluid/operators/metrics/accuracy_op.cc similarity index 98% rename from paddle/fluid/operators/accuracy_op.cc rename to paddle/fluid/operators/metrics/accuracy_op.cc index 42fcace17926641b5caf677eb3c8ba5222e37190..95aa76bc6947c9c39e56d39031c5184dc262acd0 100644 --- a/paddle/fluid/operators/accuracy_op.cc +++ b/paddle/fluid/operators/metrics/accuracy_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/accuracy_op.h" +#include "paddle/fluid/operators/metrics/accuracy_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/accuracy_op.cu b/paddle/fluid/operators/metrics/accuracy_op.cu similarity index 98% rename from paddle/fluid/operators/accuracy_op.cu rename to paddle/fluid/operators/metrics/accuracy_op.cu index 23b48c6fdf427348879de07c671c65327d6436d7..b255d2a7c413b4f965f6b874d342dcb93c7b5e66 100644 --- a/paddle/fluid/operators/accuracy_op.cu +++ b/paddle/fluid/operators/metrics/accuracy_op.cu @@ -14,7 +14,7 @@ limitations under the License. */ #include #include -#include "paddle/fluid/operators/accuracy_op.h" +#include "paddle/fluid/operators/metrics/accuracy_op.h" #include "paddle/fluid/platform/cuda_primitives.h" #include "paddle/fluid/platform/gpu_info.h" diff --git a/paddle/fluid/operators/accuracy_op.h b/paddle/fluid/operators/metrics/accuracy_op.h similarity index 100% rename from paddle/fluid/operators/accuracy_op.h rename to paddle/fluid/operators/metrics/accuracy_op.h diff --git a/paddle/fluid/operators/auc_op.cc b/paddle/fluid/operators/metrics/auc_op.cc similarity index 98% rename from paddle/fluid/operators/auc_op.cc rename to paddle/fluid/operators/metrics/auc_op.cc index 0784920064a879963cd9725cd9acf4cec7b874ce..335d4fded4a9543dabf984f7ed9c342b46dd04f0 100644 --- a/paddle/fluid/operators/auc_op.cc +++ b/paddle/fluid/operators/metrics/auc_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/auc_op.h" +#include "paddle/fluid/operators/metrics/auc_op.h" namespace paddle { namespace operators { @@ -53,7 +53,7 @@ class AucOp : public framework::OperatorWithKernel { const framework::ExecutionContext &ctx) const override { return framework::OpKernelType( framework::ToDataType(ctx.Input("Predict")->type()), - ctx.device_context()); + platform::CPUPlace()); } }; diff --git a/paddle/fluid/operators/auc_op.h b/paddle/fluid/operators/metrics/auc_op.h similarity index 100% rename from paddle/fluid/operators/auc_op.h rename to paddle/fluid/operators/metrics/auc_op.h diff --git a/paddle/fluid/operators/precision_recall_op.cc b/paddle/fluid/operators/metrics/precision_recall_op.cc similarity index 99% rename from paddle/fluid/operators/precision_recall_op.cc rename to paddle/fluid/operators/metrics/precision_recall_op.cc index e7ce16f33fb5052ffb41fc05bd1538e2f0dc35be..0d733c47dd2fcaad776d8d4e6467ecd1872bce05 100644 --- a/paddle/fluid/operators/precision_recall_op.cc +++ b/paddle/fluid/operators/metrics/precision_recall_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/precision_recall_op.h" +#include "paddle/fluid/operators/metrics/precision_recall_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/precision_recall_op.h b/paddle/fluid/operators/metrics/precision_recall_op.h similarity index 100% rename from paddle/fluid/operators/precision_recall_op.h rename to paddle/fluid/operators/metrics/precision_recall_op.h diff --git a/paddle/fluid/operators/momentum_op.cc b/paddle/fluid/operators/momentum_op.cc deleted file mode 100644 index c8079a99fb8c8e05144c5390794db7757e74f6ae..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/momentum_op.cc +++ /dev/null @@ -1,120 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/momentum_op.h" - -namespace paddle { -namespace operators { - -using Tensor = framework::Tensor; - -class MomentumOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - protected: - void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE(ctx->HasInput("Param"), - "Input(param) of Momentum should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Grad"), - "Input(grad) of Momentum should not be null."); - PADDLE_ENFORCE(ctx->HasInput("Velocity"), - "Input(velocity) of Momentum should not be null."); - PADDLE_ENFORCE(ctx->HasInput("LearningRate"), - "Input(LearningRate) of Momentum should not be null."); - PADDLE_ENFORCE( - ctx->GetInputsVarType("Param").front() == - framework::proto::VarType::LOD_TENSOR, - "The input var's type should be LoDTensor, but the received is %s", - ctx->Inputs("Param").front(), ctx->GetInputsVarType("Param").front()); - - PADDLE_ENFORCE(ctx->HasOutput("ParamOut"), - "Output(ParamOut) of Momentum should not be null."); - PADDLE_ENFORCE(ctx->HasOutput("VelocityOut"), - "Output(VelocityOut) of Momentum should not be null."); - - auto param_dim = ctx->GetInputDim("Param"); - PADDLE_ENFORCE_EQ( - param_dim, ctx->GetInputDim("Grad"), - "Param and Grad input of MomentumOp should have the same dimension."); - PADDLE_ENFORCE_EQ( - param_dim, ctx->GetInputDim("Velocity"), - "Param and Velocity of MomentumOp should have the same dimension."); - PADDLE_ENFORCE_EQ(framework::product(ctx->GetInputDim("LearningRate")), 1, - "Learning_rate should be a scalar"); - - ctx->SetOutputDim("ParamOut", param_dim); - ctx->SetOutputDim("VelocityOut", param_dim); - } - framework::OpKernelType GetExpectedKernelType( - const framework::ExecutionContext &ctx) const override { - auto input_data_type = - framework::ToDataType(ctx.Input("Param")->type()); - return framework::OpKernelType(input_data_type, ctx.GetPlace()); - } -}; - -class MomentumOpMaker : public framework::OpProtoAndCheckerMaker { - public: - void Make() override { - AddInput("Param", - "(Tensor, default Tensor) " - "Input parameter that has to be updated"); - AddInput("Grad", - "(Tensor, default Tensor) " - "Input gradient of 
the parameter"); - AddInput("Velocity", - "(Tensor, default Tensor) " - "Input velocity (corresponding to the parameter) " - "that has to be updated"); - AddInput("LearningRate", - "(Tensor, default Tensor) " - "Input learning rate"); - - AddOutput("ParamOut", - "(Tensor) This output is updated parameter. " - "It shared memory with Input(Param)."); - AddOutput("VelocityOut", - "(Tensor) This output is updated velocity. " - "It shared memory with Input(Velocity)."); - - AddAttr("mu", "(float) Momentum coefficient"); - AddAttr("use_nesterov", - "(bool, default false) " - "Use Nesterov Momentum") - .SetDefault(false); - AddComment(R"DOC( -Momentum Optimizer. - -This optimizer has a flag for Nestrov Momentum. -The update equations are as follows: - -$$ -velocity = mu * velocity + gradient \\ -if (use\_nesterov): \\ - param = param - (gradient + mu * velocity) * learning\_rate \\ -else: \\ - param = param - learning\_rate * velocity. \\ -$$ - -)DOC"); - } -}; -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_WITHOUT_GRADIENT(momentum, ops::MomentumOp, ops::MomentumOpMaker); -REGISTER_OP_CPU_KERNEL(momentum, ops::MomentumOpKernel, - ops::MomentumOpKernel); diff --git a/paddle/fluid/operators/momentum_op.cu b/paddle/fluid/operators/momentum_op.cu deleted file mode 100644 index 5dc920c70979ad5c3d4fb3acc8d2dacaffe386c0..0000000000000000000000000000000000000000 --- a/paddle/fluid/operators/momentum_op.cu +++ /dev/null @@ -1,90 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
*/ - -#include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/momentum_op.h" - -namespace paddle { -namespace operators { - -template -__global__ void MomentumKernel(const T* p, const T* g, const T* v, - const T* learning_rate, const T mu, - const int64_t num, bool use_nesterov, T* p_out, - T* v_out) { - T lr = learning_rate[0]; - if (use_nesterov) { - for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num; - i += blockDim.x * gridDim.x) { - T g_val = g[i]; - T v_new = v[i] * mu + g_val; - v_out[i] = v_new; - p_out[i] = p[i] - (g_val + v_new * mu) * lr; - } - } else { - for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num; - i += blockDim.x * gridDim.x) { - T v_new = v[i] * mu + g[i]; - v_out[i] = v_new; - p_out[i] = p[i] - lr * v_new; - } - } -} - -template -class MomentumOpCUDAKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - const auto* param_var = ctx.InputVar("Param"); - PADDLE_ENFORCE(param_var->IsType(), - "The Var(%s)'s type should be LoDTensor, " - "but the received is %s", - ctx.Inputs("Param").front(), param_var->Type().name()); - const auto* grad_var = ctx.InputVar("Grad"); - PADDLE_ENFORCE(grad_var->IsType(), - "The Var(%s)'s type should be LoDTensor, " - "but the received is %s", - ctx.Inputs("Grad").front(), grad_var->Type().name()); - - auto param_out = ctx.Output("ParamOut"); - auto velocity_out = ctx.Output("VelocityOut"); - auto param = ctx.Input("Param"); - auto velocity = ctx.Input("Velocity"); - auto grad = ctx.Input("Grad"); - auto learning_rate = ctx.Input("LearningRate"); - - T* p_out = param_out->mutable_data(ctx.GetPlace()); - T* v_out = velocity_out->mutable_data(ctx.GetPlace()); - - T mu = static_cast(ctx.Attr("mu")); - bool use_nesterov = ctx.Attr("use_nesterov"); - - auto* p = param->data(); - auto* v = velocity->data(); - auto* g = grad->data(); - auto* lr = learning_rate->data(); - - int block = 512; - int grid = (param->numel() + block - 1) / block; - MomentumKernel<<>>( - p, g, v, lr, mu, param->numel(), use_nesterov, p_out, v_out); - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -REGISTER_OP_CUDA_KERNEL(momentum, ops::MomentumOpCUDAKernel, - ops::MomentumOpCUDAKernel); diff --git a/paddle/fluid/operators/mul_op.cc b/paddle/fluid/operators/mul_op.cc index 363abfb0e0c96e8a4d82124dff168f28e339a9ae..7e434c293c9631025a5a725d62838fa12e845838 100644 --- a/paddle/fluid/operators/mul_op.cc +++ b/paddle/fluid/operators/mul_op.cc @@ -38,9 +38,9 @@ class MulOp : public framework::OperatorWithKernel { int x_num_col_dims = ctx->Attrs().Get("x_num_col_dims"); int y_num_col_dims = ctx->Attrs().Get("y_num_col_dims"); - VLOG(3) << "mul operator x.shape=" << x_dims << " y.shape=" << y_dims - << " x_num_col_dims=" << x_num_col_dims - << " y_num_col_dims=" << y_num_col_dims; + VLOG(30) << "mul operator x.shape=" << x_dims << " y.shape=" << y_dims + << " x_num_col_dims=" << x_num_col_dims + << " y_num_col_dims=" << y_num_col_dims; PADDLE_ENFORCE_GT( x_dims.size(), x_num_col_dims, @@ -56,7 +56,8 @@ class MulOp : public framework::OperatorWithKernel { PADDLE_ENFORCE_EQ(x_mat_dims[1], y_mat_dims[0], "First matrix's width must be equal with second matrix's " - "height. %s, %s"); + "height. %s, %s", + x_mat_dims[1], y_mat_dims[0]); std::vector output_dims; output_dims.reserve( static_cast(x_num_col_dims + y_dims.size() - y_num_col_dims)); @@ -126,6 +127,14 @@ or not. 
But the output only shares the LoD information with input $X$. } }; +class MulOpInferVarType : public framework::PassInDtypeAndVarTypeToOutput { + protected: + std::unordered_map GetInputOutputWithSameType() + const override { + return std::unordered_map{{"X", /*->*/ "Out"}}; + } +}; + class MulGradOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; @@ -178,7 +187,8 @@ class MulOpGradMaker : public framework::SingleGradOpDescMaker { } // namespace paddle namespace ops = paddle::operators; -REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker, ops::MulOpGradMaker); +REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker, ops::MulOpInferVarType, + ops::MulOpGradMaker); REGISTER_OPERATOR(mul_grad, ops::MulGradOp); REGISTER_OP_CPU_KERNEL( mul, ops::MulKernel, diff --git a/paddle/fluid/operators/mul_op.cu.cc b/paddle/fluid/operators/mul_op.cu.cc index 81f3e42bf412fa4d2cb48405f2f8ee49b6aa0b67..6c5a83c6a50c463502171f09bbf18e17e43917b5 100644 --- a/paddle/fluid/operators/mul_op.cu.cc +++ b/paddle/fluid/operators/mul_op.cu.cc @@ -20,6 +20,7 @@ namespace plat = paddle::platform; REGISTER_OP_CUDA_KERNEL(mul, ops::MulKernel, ops::MulKernel, ops::MulKernel); -REGISTER_OP_CUDA_KERNEL(mul_grad, - ops::MulGradKernel, - ops::MulGradKernel); +REGISTER_OP_CUDA_KERNEL( + mul_grad, ops::MulGradKernel, + ops::MulGradKernel, + ops::MulGradKernel); diff --git a/paddle/fluid/operators/nccl/CMakeLists.txt b/paddle/fluid/operators/nccl/CMakeLists.txt index cdcba8035762d8f442eb8b8ed52a4e3e99ac31b6..9b26e19cc7ed05038e05308f9277b200a885dc10 100644 --- a/paddle/fluid/operators/nccl/CMakeLists.txt +++ b/paddle/fluid/operators/nccl/CMakeLists.txt @@ -1,3 +1,13 @@ if(WITH_GPU AND NOT WIN32) nv_library(nccl_common SRCS nccl_gpu_common.cc DEPS device_context operator ) endif() + +if(WITH_GPU) + op_library(nccl_op DEPS nccl_common) + file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(ncclAllReduce);\n") + set(OPERATOR_DEPS ${OPERATOR_DEPS} nccl_common PARENT_SCOPE) +endif() + +if(NOT WIN32) + nv_test(nccl_op_test SRCS nccl_op_test.cu.cc DEPS nccl_op gpu_info device_context) +endif() diff --git a/paddle/fluid/operators/nccl_op.cc b/paddle/fluid/operators/nccl/nccl_op.cc similarity index 100% rename from paddle/fluid/operators/nccl_op.cc rename to paddle/fluid/operators/nccl/nccl_op.cc diff --git a/paddle/fluid/operators/nccl_op.cu.cc b/paddle/fluid/operators/nccl/nccl_op.cu.cc similarity index 86% rename from paddle/fluid/operators/nccl_op.cu.cc rename to paddle/fluid/operators/nccl/nccl_op.cu.cc index 8de974bc2b333fb6ccc5b5f0bb1af86533139925..9db0031a6934537a7d991b775ecac688ae6b66e9 100644 --- a/paddle/fluid/operators/nccl_op.cu.cc +++ b/paddle/fluid/operators/nccl/nccl_op.cu.cc @@ -63,16 +63,16 @@ class NCCLAllReduceKernel : public framework::OpKernel { // device id int gpu_id = boost::get(ctx.GetPlace()).GetDeviceId(); int idx = comm->GetCommId(gpu_id); - VLOG(3) << "gpu : " - << " invoke allreduce. send " << x->numel() << " recv " - << out->numel(); + VLOG(30) << "gpu : " + << " invoke allreduce. send " << x->numel() << " recv " + << out->numel(); PADDLE_ENFORCE(platform::dynload::ncclAllReduce( x->data(), out->mutable_data(ctx.GetPlace()), out->numel(), NCCLTypeWrapper::type, reduction_op_, comm->comms().at(idx), ctx.cuda_device_context().stream())); - VLOG(3) << "gpu : " - << " finished allreduce. send " << x->numel() << " recv " - << out->numel(); + VLOG(30) << "gpu : " + << " finished allreduce. 
send " << x->numel() << " recv " + << out->numel(); } }; @@ -109,14 +109,14 @@ class NCCLReduceKernel : public framework::OpKernel { } else { out->Resize(framework::make_ddim({0})); } - VLOG(3) << "gpu : " << gpu_id << " invoke reduce. send " << x->numel() - << " recv " << out->numel(); + VLOG(30) << "gpu : " << gpu_id << " invoke reduce. send " << x->numel() + << " recv " << out->numel(); PADDLE_ENFORCE(platform::dynload::ncclReduce( x->data(), recvbuffer, x->numel(), NCCLTypeWrapper::type, reduction_op_, root, comm->comms().at(idx), ctx.cuda_device_context().stream())); - VLOG(3) << "gpu : " << gpu_id << " finished reduce. send " << x->numel() - << " recv " << out->numel(); + VLOG(30) << "gpu : " << gpu_id << " finished reduce. send " << x->numel() + << " recv " << out->numel(); } }; @@ -133,21 +133,22 @@ class NCCLBcastKernel : public framework::OpKernel { int idx = comm->GetCommId(gpu_id); if (idx == root) { auto* x = ctx.Input("X"); - VLOG(3) << "gpu : " << gpu_id << " invoke Bcast. send " << x->numel(); + VLOG(30) << "gpu : " << gpu_id << " invoke Bcast. send " << x->numel(); PADDLE_ENFORCE(platform::dynload::ncclBcast( reinterpret_cast(const_cast(x->data())), x->numel(), NCCLTypeWrapper::type, root, comm->comms().at(idx), ctx.cuda_device_context().stream())); - VLOG(3) << "gpu : " << gpu_id << " finished Bcast."; + VLOG(30) << "gpu : " << gpu_id << " finished Bcast."; } else { auto* out = ctx.Output("Out"); - VLOG(3) << "gpu : " << gpu_id << " invoke Bcast. recv buffer " - << framework::product(out->dims()); + VLOG(30) << "gpu : " << gpu_id << " invoke Bcast. recv buffer " + << framework::product(out->dims()); PADDLE_ENFORCE(platform::dynload::ncclBcast( out->mutable_data(ctx.GetPlace()), out->numel(), NCCLTypeWrapper::type, root, comm->comms().at(idx), ctx.cuda_device_context().stream())); - VLOG(3) << "gpu : " << gpu_id << " finished Bcast. recv " << out->numel(); + VLOG(30) << "gpu : " << gpu_id << " finished Bcast. 
recv " + << out->numel(); } } }; diff --git a/paddle/fluid/operators/nccl_op_test.cu.cc b/paddle/fluid/operators/nccl/nccl_op_test.cu.cc similarity index 95% rename from paddle/fluid/operators/nccl_op_test.cu.cc rename to paddle/fluid/operators/nccl/nccl_op_test.cu.cc index d5fb7a12e5d9757f3e639f6de7f0129bd531e2a1..f48ccdd97fa5adb475013cf26e7544c2729b4457 100644 --- a/paddle/fluid/operators/nccl_op_test.cu.cc +++ b/paddle/fluid/operators/nccl/nccl_op_test.cu.cc @@ -86,9 +86,9 @@ class NCCLTester : public ::testing::Test { (*p_scopes).resize(gpu_list_.size()); auto op = f::OpRegistry::CreateOp(*op1); - VLOG(1) << "invoke NCCLInitOp."; + VLOG(10) << "invoke NCCLInitOp."; op->Run(g_scope_, cpu_place); - VLOG(1) << "NCCLInitOp finished."; + VLOG(10) << "NCCLInitOp finished."; } int GetGPUData(int gpu_id) { return gpu_id + 42; } @@ -109,7 +109,7 @@ class NCCLTester : public ::testing::Test { std::vector send_vector(f::product(kDims), GetGPUData(gpu_id)); paddle::framework::TensorFromVector(send_vector, *ctx, send_tensor); - VLOG(1) << "Send Tensor filled with elements " << send_tensor->numel(); + VLOG(10) << "Send Tensor filled with elements " << send_tensor->numel(); } lk.unlock(); @@ -119,11 +119,11 @@ class NCCLTester : public ::testing::Test { auto op = f::OpRegistry::CreateOp(*op1); - VLOG(1) << "Device : " << gpu_id << " invoke " << op_desc.Type(); - VLOG(1) << " send_tensor : " << send_tensor->numel() - << " recv_tensor : " << recv_tensor->numel(); + VLOG(10) << "Device : " << gpu_id << " invoke " << op_desc.Type(); + VLOG(10) << " send_tensor : " << send_tensor->numel() + << " recv_tensor : " << recv_tensor->numel(); op->Run(*scope, place); - VLOG(1) << "Device : " << gpu_id << " finished " << op_desc.Type(); + VLOG(10) << "Device : " << gpu_id << " finished " << op_desc.Type(); } public: diff --git a/paddle/fluid/operators/nce_op.cc b/paddle/fluid/operators/nce_op.cc index e471f04662a1fa3e8e77a2db37f0da4521682018..9b0d45ae5b9d104c8b7bb1529a9baaaf3d6a736d 100644 --- a/paddle/fluid/operators/nce_op.cc +++ b/paddle/fluid/operators/nce_op.cc @@ -35,6 +35,7 @@ class NCEOp : public framework::OperatorWithKernel { auto x_dims = ctx->GetInputDim("Input"); auto label_dims = ctx->GetInputDim("Label"); + auto w_dims = ctx->GetInputDim("Weight"); PADDLE_ENFORCE_EQ(x_dims[0], label_dims[0]); int num_true_classes = label_dims.size() == 2 ? label_dims[1] : 1; if (ctx->HasInput("Bias")) { @@ -69,7 +70,7 @@ class NCEOp : public framework::OperatorWithKernel { const framework::ExecutionContext& ctx) const override { return framework::OpKernelType( framework::ToDataType(ctx.Input("Input")->type()), - ctx.GetPlace()); + platform::CPUPlace()); } }; @@ -98,6 +99,13 @@ class NCEOpMaker : public framework::OpProtoAndCheckerMaker { "each sample. And it is a dispensable input. The default value of " "sample is 1.") .AsDispensable(); + + AddInput( + "CustomDistribution", + "(Tensor) It is used in 'CostumDist' sampler. " + "It is a tensor with shape [num_total_classes]." + "The i-th element is the probsbility of the i-th class being sampled.") + .AsDispensable(); AddOutput("Cost", "(Tensor) A tensor of shape [batch_size, 1]. Cost of samples."); AddOutput("SampleLogits", @@ -121,6 +129,17 @@ class NCEOpMaker : public framework::OpProtoAndCheckerMaker { AddAttr("num_neg_samples", "The number of negative classes. The default value is 10.") .SetDefault(10); + + AddAttr("sampler", + "(int) Which sampler to be used to sample negative class." 
+ "0: Uniform; 1: LogUniform; 2: CostumDist.") + .SetDefault(0); + + AddAttr("seed", + "(int) The seed used in sampler. If it is 0, " + "the sampler will generate a seed randomly.") + .SetDefault(0); + AddAttr>("custom_neg_classes", "This attribute only be used in unitest. Classes " "in this list wiil be used as negative classes " @@ -174,7 +193,7 @@ class NCEOpGrad : public framework::OperatorWithKernel { const framework::ExecutionContext& ctx) const override { return framework::OpKernelType( framework::ToDataType(ctx.Input("Input")->type()), - ctx.GetPlace()); + platform::CPUPlace()); } }; diff --git a/paddle/fluid/operators/nce_op.h b/paddle/fluid/operators/nce_op.h index 2c4c97f28bc0b511d6eaa8f79a3a4efc9be8a5da..e9af8ad4ce8501f464202039d99c36984d7feba9 100644 --- a/paddle/fluid/operators/nce_op.h +++ b/paddle/fluid/operators/nce_op.h @@ -19,29 +19,28 @@ limitations under the License. */ #include #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/sampler.h" #include "unsupported/Eigen/CXX11/Tensor" namespace paddle { namespace operators { using Tensor = framework::Tensor; +using Sampler = math::Sampler; template using EigenMatrix = framework::EigenMatrix; template -void PrepareSamples(const framework::ExecutionContext& context) { +void PrepareSamples(const framework::ExecutionContext& context, + Sampler* sampler) { auto label = context.Input("Label"); const int64_t* label_data = label->data(); auto label_dims = label->dims(); - int num_total_classes = context.Attr("num_total_classes"); + // int num_total_classes = context.Attr("num_total_classes"); // for unitest std::vector custom_neg_classes = context.Attr>("custom_neg_classes"); - // random machine - std::random_device rd; - std::mt19937 rng(rd()); - std::uniform_int_distribution rand(0, num_total_classes - 1); auto sample_labels = context.Output("SampleLabels"); auto sample_labels_dims = sample_labels->dims(); @@ -62,7 +61,7 @@ void PrepareSamples(const framework::ExecutionContext& context) { } else { for (; j < sample_labels_dims[1]; ++j) { // TODO(wanghaoshuang): support more distribution sampling - sample_labels_data[index++] = rand(rng); + sample_labels_data[index++] = sampler->Sample(); } } } @@ -72,7 +71,33 @@ template class NCEKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& context) const override { - PrepareSamples(context); + int sampler_type = context.Attr("sampler"); + int seed = context.Attr("seed"); + int num_total_classes = context.Attr("num_total_classes"); + int num_neg_samples = context.Attr("num_neg_samples"); + + Sampler* sampler; + switch (sampler_type) { + case 0: { + sampler = new math::UniformSampler(num_total_classes - 1, seed); + break; + } + case 1: { + sampler = new math::LogUniformSampler(num_total_classes - 1, seed); + break; + } + case 2: { + auto custom_dist = context.Input("CustomDistribution"); + const float* custom_dist_data = custom_dist->data(); + PADDLE_ENFORCE_EQ(custom_dist->numel(), num_total_classes); + sampler = new math::CustomSampler(num_total_classes - 1, + custom_dist_data, seed); + break; + } + default: { PADDLE_THROW("Unsupported SamplerType."); } + } + + PrepareSamples(context, sampler); auto sample_labels = context.Output("SampleLabels"); const int64_t* sample_labels_data = sample_labels->data(); auto sample_out = context.Output("SampleLogits"); @@ -85,13 +110,12 @@ class NCEKernel : public framework::OpKernel { } auto out = context.Output("Cost"); T* out_data = 
out->mutable_data(context.GetPlace()); - int num_neg_samples = context.Attr("num_neg_samples"); - int num_total_classes = context.Attr("num_total_classes"); int64_t num_true_class = 1; if (label != nullptr) { num_true_class = label->dims()[1]; } - T b = 1. / num_total_classes * num_neg_samples; + int64_t sampled_labels_num = sample_labels->dims()[1]; + // T b = 1. / num_total_classes * num_neg_samples; // forward bias auto bias = context.Input("Bias"); if (bias != nullptr) { @@ -117,22 +141,17 @@ class NCEKernel : public framework::OpKernel { } // forward cost for (int64_t i = 0; i < sample_labels->dims()[0]; ++i) { - int64_t j = 0; out_data[i] = 0; T w = sample_weight == nullptr ? 1. : sample_weight_data[i]; - // for true classes - for (; j < num_true_class; ++j) { - T o = sample_out_data[i * sample_out->dims()[1] + j]; - T cost = -log(o / (o + b)); - out_data[i] += w * cost; - } - // for sampled neg classes - for (; j < sample_labels->dims()[1]; ++j) { - T o = sample_out_data[i * sample_out->dims()[1] + j]; - T cost = -log(b / (o + b)); + for (int64_t j = 0; j < sampled_labels_num; ++j) { + int64_t target = sample_labels_data[i * sampled_labels_num + j]; + T o = sample_out_data[i * sampled_labels_num + j]; + float b = sampler->Probability(target) * num_neg_samples; + T cost = (j < num_true_class) ? -log(o / (o + b)) : -log(b / (o + b)); out_data[i] += w * cost; } } + delete sampler; } }; @@ -158,20 +177,45 @@ class NCEGradKernel : public framework::OpKernel { if (label != nullptr) { num_true_class = label->dims()[1]; } - T b = 1. / num_total_classes * num_neg_samples; + + int sampler_type = context.Attr("sampler"); + int seed = context.Attr("seed"); + Sampler* sampler; + switch (sampler_type) { + case 0: { + sampler = new math::UniformSampler(num_total_classes - 1, seed); + break; + } + case 1: { + sampler = new math::LogUniformSampler(num_total_classes - 1, seed); + break; + } + case 2: { + auto custom_dist = context.Input("CustomDistribution"); + const float* custom_dist_data = custom_dist->data(); + PADDLE_ENFORCE_EQ(custom_dist->numel(), num_total_classes); + sampler = new math::CustomSampler(num_total_classes - 1, + custom_dist_data, seed); + break; + } + default: { PADDLE_THROW("Unsupported SamplerType."); } + } + + // T b = 1. / num_total_classes * num_neg_samples; Tensor sample_grad; // tmp tensor T* sample_grad_data = sample_grad.mutable_data(sample_labels->dims(), context.GetPlace()); // backward cost for (int64_t i = 0; i < sample_labels->numel(); ++i) { + int64_t label_idx = i % sample_labels->dims()[1]; + int64_t sample_idx = i / sample_labels->dims()[1]; + float b = sampler->Probability(sample_labels_data[i]) * num_neg_samples; T o = sample_out_data[i]; - T w = sample_weight == nullptr - ? 1 - : sample_weight_data[i / sample_labels->dims()[1]]; - sample_grad_data[i] = (i % sample_labels->dims()[1]) < num_true_class + T w = sample_weight == nullptr ? 1 : sample_weight_data[sample_idx]; + sample_grad_data[i] = label_idx < num_true_class ? 
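+ // gradient of the forward cost above w.r.t. the pre-sigmoid sample logit (a derivation, not part of the original patch): + // true class: d[-log(o / (o + b))] = (b / (o + b)) * (o - 1); + // sampled negative: d[-log(b / (o + b))] = o * (1 - o) / (o + b)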
w * (b / (o + b)) * (o - 1) : w * (o * (1 - o) / (o + b)); - sample_grad_data[i] *= d_out_data[i / sample_labels->dims()[1]]; + sample_grad_data[i] *= d_out_data[sample_idx]; } // get d_bias auto d_bias = context.Output(framework::GradVarName("Bias")); @@ -207,6 +251,7 @@ class NCEGradKernel : public framework::OpKernel { w_matrix.chip(sample_labels_data[i], 0) * sample_grad_data[i]; } } + delete sampler; } }; } // namespace operators diff --git a/paddle/fluid/operators/optimizers/CMakeLists.txt b/paddle/fluid/operators/optimizers/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..5d468316e8eacb73c4a4ce81c784880bb5e46c2d --- /dev/null +++ b/paddle/fluid/operators/optimizers/CMakeLists.txt @@ -0,0 +1,2 @@ +include(operators) +register_operators() diff --git a/paddle/fluid/operators/adadelta_op.cc b/paddle/fluid/operators/optimizers/adadelta_op.cc similarity index 98% rename from paddle/fluid/operators/adadelta_op.cc rename to paddle/fluid/operators/optimizers/adadelta_op.cc index 89a7a49e0fa8427826f5d91274912a68f2316b61..9039d02b673b3403c840492c088179b30e23da9c 100644 --- a/paddle/fluid/operators/adadelta_op.cc +++ b/paddle/fluid/operators/optimizers/adadelta_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/adadelta_op.h" +#include "paddle/fluid/operators/optimizers/adadelta_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/adadelta_op.cu b/paddle/fluid/operators/optimizers/adadelta_op.cu similarity index 93% rename from paddle/fluid/operators/adadelta_op.cu rename to paddle/fluid/operators/optimizers/adadelta_op.cu index fc10c6657476e7f87b2f703a1d0cb88eeebc35cf..3fbfee5df05770a1206ab3170d3baffdd20bc77b 100644 --- a/paddle/fluid/operators/adadelta_op.cu +++ b/paddle/fluid/operators/optimizers/adadelta_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/adadelta_op.h" +#include "paddle/fluid/operators/optimizers/adadelta_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/adadelta_op.h b/paddle/fluid/operators/optimizers/adadelta_op.h similarity index 100% rename from paddle/fluid/operators/adadelta_op.h rename to paddle/fluid/operators/optimizers/adadelta_op.h diff --git a/paddle/fluid/operators/adagrad_op.cc b/paddle/fluid/operators/optimizers/adagrad_op.cc similarity index 96% rename from paddle/fluid/operators/adagrad_op.cc rename to paddle/fluid/operators/optimizers/adagrad_op.cc index a3ef9ad9f91f1f626bd33876693ecc17ad76b96b..e8d5a9e2c875570a198629bd745c9d58036746cb 100644 --- a/paddle/fluid/operators/adagrad_op.cc +++ b/paddle/fluid/operators/optimizers/adagrad_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/adagrad_op.h" +#include "paddle/fluid/operators/optimizers/adagrad_op.h" #include #include @@ -119,8 +119,8 @@ struct SparseAdagradFunctor { auto* grad_merge_data = grad_merge.mutable_value()->template data(); // 2. 
m += g_m * g_m - math::scatter::Mul sqare_func; - auto grad_square = sqare_func(context, grad_merge, grad_merge); + auto grad_square = + SquareSelectedRows(context, grad_merge); math::SelectedRowsAddToTensor functor; functor(context, grad_square, moment); diff --git a/paddle/fluid/operators/adagrad_op.cu b/paddle/fluid/operators/optimizers/adagrad_op.cu similarity index 95% rename from paddle/fluid/operators/adagrad_op.cu rename to paddle/fluid/operators/optimizers/adagrad_op.cu index b25268786d622bc7a94117849763833e528bef48..4efe56855a4bdca41d24f02c29a618a8d4232887 100644 --- a/paddle/fluid/operators/adagrad_op.cu +++ b/paddle/fluid/operators/optimizers/adagrad_op.cu @@ -13,9 +13,9 @@ See the License for the specific language governing permissions and limitations under the License. */ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/adagrad_op.h" #include "paddle/fluid/operators/math/math_function.h" #include "paddle/fluid/operators/math/selected_rows_functor.h" +#include "paddle/fluid/operators/optimizers/adagrad_op.h" #include "paddle/fluid/platform/cuda_primitives.h" namespace paddle { @@ -84,8 +84,8 @@ struct SparseAdagradFunctor { auto* grad_merge_data = grad_merge.mutable_value()->template data(); framework::Vector merge_rows(grad_merge.rows()); // 2. m += g_m * g_m - math::scatter::Mul sqare_func; - auto grad_square = sqare_func(context, grad_merge, grad_merge); + auto grad_square = + SquareSelectedRows(context, grad_merge); math::SelectedRowsAddToTensor functor; functor(context, grad_square, moment); diff --git a/paddle/fluid/operators/adagrad_op.h b/paddle/fluid/operators/optimizers/adagrad_op.h similarity index 86% rename from paddle/fluid/operators/adagrad_op.h rename to paddle/fluid/operators/optimizers/adagrad_op.h index 0a16ce00f71586ef55007c3753e024be29d0ed56..9f6ef391696aa8718be71ae945e746b876813d94 100644 --- a/paddle/fluid/operators/adagrad_op.h +++ b/paddle/fluid/operators/optimizers/adagrad_op.h @@ -28,6 +28,20 @@ struct SparseAdagradFunctor { framework::Tensor *moment, framework::Tensor *param); }; +template +framework::SelectedRows SquareSelectedRows( + const DeviceContext &context, const framework::SelectedRows &input) { + framework::SelectedRows out; + out.set_rows(input.rows()); + out.set_height(input.height()); + out.mutable_value()->mutable_data(input.value().dims(), + context.GetPlace()); + auto e_out = framework::EigenVector::Flatten(*(out.mutable_value())); + auto e_in = framework::EigenVector::Flatten(input.value()); + e_out.device(*context.eigen_device()) = e_in.square(); + return out; +} + template class AdagradOpKernel : public framework::OpKernel { public: diff --git a/paddle/fluid/operators/adam_op.cc b/paddle/fluid/operators/optimizers/adam_op.cc similarity index 95% rename from paddle/fluid/operators/adam_op.cc rename to paddle/fluid/operators/optimizers/adam_op.cc index 5d670fe3b9d99a31a628ff707ff860564eca952e..5710cda39acce53e35dfceec675fcd4979a84e31 100644 --- a/paddle/fluid/operators/adam_op.cc +++ b/paddle/fluid/operators/optimizers/adam_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/adam_op.h" +#include "paddle/fluid/operators/optimizers/adam_op.h" namespace paddle { namespace operators { @@ -92,9 +92,9 @@ class AdamOpMaker : public framework::OpProtoAndCheckerMaker { AddInput("Beta1Pow", "(Tensor) Input beta1 power accumulator"); AddInput("Beta2Pow", "(Tensor) Input beta2 power accumulator"); - AddOutput("ParamOut", "(Tensor) Output parameter").Reuse("Param"); - AddOutput("Moment1Out", "(Tensor) Output first moment").Reuse("Moment1"); - AddOutput("Moment2Out", "(Tensor) Output second moment").Reuse("Moment2"); + AddOutput("ParamOut", "(Tensor) Output parameter"); + AddOutput("Moment1Out", "(Tensor) Output first moment"); + AddOutput("Moment2Out", "(Tensor) Output second moment"); AddAttr("beta1", "(float, default 0.9) " diff --git a/paddle/fluid/operators/adam_op.cu b/paddle/fluid/operators/optimizers/adam_op.cu similarity index 93% rename from paddle/fluid/operators/adam_op.cu rename to paddle/fluid/operators/optimizers/adam_op.cu index 77f1991002e6007e8b8dff4746739a90e836145d..e8090ebacfe85153aba9e275c9cd1c55fd7af15e 100644 --- a/paddle/fluid/operators/adam_op.cu +++ b/paddle/fluid/operators/optimizers/adam_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/adam_op.h" +#include "paddle/fluid/operators/optimizers/adam_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/adam_op.h b/paddle/fluid/operators/optimizers/adam_op.h similarity index 99% rename from paddle/fluid/operators/adam_op.h rename to paddle/fluid/operators/optimizers/adam_op.h index 3455d1ee54e8e6e498d0b0e6932ec099af9c0b30..48e0448d09c64e2c2fa655d125064e7a6572e30e 100644 --- a/paddle/fluid/operators/adam_op.h +++ b/paddle/fluid/operators/optimizers/adam_op.h @@ -297,7 +297,7 @@ class AdamOpKernel : public framework::OpKernel { auto& grad = Ref(ctx.Input("Grad"), "Must set Grad"); if (grad.rows().size() == 0) { - VLOG(3) << "grad row size is 0!!"; + VLOG(30) << "grad row size is 0!!"; return; } diff --git a/paddle/fluid/operators/adamax_op.cc b/paddle/fluid/operators/optimizers/adamax_op.cc similarity index 99% rename from paddle/fluid/operators/adamax_op.cc rename to paddle/fluid/operators/optimizers/adamax_op.cc index d4aa4d338a2379adf985ba7f89b528bc402eda06..4b244a76dc0ebee65b7c95db2d2754ebae03bbac 100644 --- a/paddle/fluid/operators/adamax_op.cc +++ b/paddle/fluid/operators/optimizers/adamax_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/adamax_op.h" +#include "paddle/fluid/operators/optimizers/adamax_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/adamax_op.cu b/paddle/fluid/operators/optimizers/adamax_op.cu similarity index 93% rename from paddle/fluid/operators/adamax_op.cu rename to paddle/fluid/operators/optimizers/adamax_op.cu index 05cafd7a8eef79588d1d5724084586cb9b51d3d4..e54adcb142fe0d50dad23fe5df14bd6f28220d8a 100644 --- a/paddle/fluid/operators/adamax_op.cu +++ b/paddle/fluid/operators/optimizers/adamax_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. 
*/ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/adamax_op.h" +#include "paddle/fluid/operators/optimizers/adamax_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/adamax_op.h b/paddle/fluid/operators/optimizers/adamax_op.h similarity index 100% rename from paddle/fluid/operators/adamax_op.h rename to paddle/fluid/operators/optimizers/adamax_op.h diff --git a/paddle/fluid/operators/decayed_adagrad_op.cc b/paddle/fluid/operators/optimizers/decayed_adagrad_op.cc similarity index 98% rename from paddle/fluid/operators/decayed_adagrad_op.cc rename to paddle/fluid/operators/optimizers/decayed_adagrad_op.cc index d73ae9e2721b388212cb6efa354eb4b480df9cad..80278441c07203b03dbcff157193ea5976eefbf1 100644 --- a/paddle/fluid/operators/decayed_adagrad_op.cc +++ b/paddle/fluid/operators/optimizers/decayed_adagrad_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/decayed_adagrad_op.h" +#include "paddle/fluid/operators/optimizers/decayed_adagrad_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/decayed_adagrad_op.cu b/paddle/fluid/operators/optimizers/decayed_adagrad_op.cu similarity index 92% rename from paddle/fluid/operators/decayed_adagrad_op.cu rename to paddle/fluid/operators/optimizers/decayed_adagrad_op.cu index 7da16acf05eefc21cbe3dd0540dcbf69022431de..84d65e39329659f82099011f9ec60468d5db6328 100644 --- a/paddle/fluid/operators/decayed_adagrad_op.cu +++ b/paddle/fluid/operators/optimizers/decayed_adagrad_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/decayed_adagrad_op.h" +#include "paddle/fluid/operators/optimizers/decayed_adagrad_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/decayed_adagrad_op.h b/paddle/fluid/operators/optimizers/decayed_adagrad_op.h similarity index 100% rename from paddle/fluid/operators/decayed_adagrad_op.h rename to paddle/fluid/operators/optimizers/decayed_adagrad_op.h diff --git a/paddle/fluid/operators/ftrl_op.cc b/paddle/fluid/operators/optimizers/ftrl_op.cc similarity index 99% rename from paddle/fluid/operators/ftrl_op.cc rename to paddle/fluid/operators/optimizers/ftrl_op.cc index b77e12d6508eb07ae137b313ca91eac951afbcbe..1c9e91d9b610669def6d6d52e4753714745d1c0f 100644 --- a/paddle/fluid/operators/ftrl_op.cc +++ b/paddle/fluid/operators/optimizers/ftrl_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/ftrl_op.h" +#include "paddle/fluid/operators/optimizers/ftrl_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/ftrl_op.cu b/paddle/fluid/operators/optimizers/ftrl_op.cu similarity index 93% rename from paddle/fluid/operators/ftrl_op.cu rename to paddle/fluid/operators/optimizers/ftrl_op.cu index e7371c80da1d1cbb39247b50d8c6537ee8e948f8..f836b75df93861a0fd670f2a0e786e6a797a4661 100644 --- a/paddle/fluid/operators/ftrl_op.cu +++ b/paddle/fluid/operators/optimizers/ftrl_op.cu @@ -12,7 +12,7 @@ CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. */ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/ftrl_op.h" +#include "paddle/fluid/operators/optimizers/ftrl_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/ftrl_op.h b/paddle/fluid/operators/optimizers/ftrl_op.h similarity index 100% rename from paddle/fluid/operators/ftrl_op.h rename to paddle/fluid/operators/optimizers/ftrl_op.h diff --git a/paddle/fluid/operators/optimizers/lars_momentum_op.cc b/paddle/fluid/operators/optimizers/lars_momentum_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..574a03680b66962ac2d6ba249d0fc491a36794cd --- /dev/null +++ b/paddle/fluid/operators/optimizers/lars_momentum_op.cc @@ -0,0 +1,86 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/optimizers/lars_momentum_op.h" +#include "paddle/fluid/operators/optimizers/momentum_op.h" + +namespace paddle { +namespace operators { + +class LarsMomentumOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("Param", + "(LoDTensor, default LoDTensor) " + "Input parameter that has to be updated"); + AddInput("Grad", + "(LoDTensor, default LoDTensor) " + "Input gradient of the parameter"); + AddInput("Velocity", + "(LoDTensor, default LoDTensor) " + "Input velocity (corresponding to the parameter) " + "that has to be updated"); + AddInput("LearningRate", + "(LoDTensor, default LoDTensor) " + "Input learning rate"); + + AddOutput("ParamOut", + "(LoDTensor) This output is the updated parameter. " + "It shares memory with Input(Param)."); + AddOutput("VelocityOut", + "(LoDTensor) This output is the updated velocity. " + "It shares memory with Input(Velocity)."); + + AddAttr("mu", "(float) Momentum coefficient"); + AddAttr("lars_coeff", "(float, default 0.001) LARS coefficient.") + .SetDefault(0.001); + AddAttr("lars_weight_decay", + "(float, default 0.0005) LARS weight decay") + .SetDefault(0.0005); + + AddComment(R"DOC( +Lars Momentum Optimizer. + +This optimizer uses LARS (https://arxiv.org/abs/1708.03888) to optimize each +weight using a local learning rate: + +$$ +local\_lr = \eta * + \frac{\left \| param \right \|}{\left \| grad \right \| + \beta *\left \| param \right \|} \\ +velocity = mu * velocity + + local\_lr * (grad + \beta * param) \\ +param = param - velocity. \\ +$$ + +Note that lars_weight_decay here decays the weights, so you may not need an +L2 regularizer when using LARS.
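+For example, following the kernels below, with a base lr of 0.1, lars_coeff = +0.001, || param || = 10, || grad || = 1 and lars_weight_decay = 0.0005, the +local learning rate is 0.1 * 0.001 * 10 / (1 + 0.0005 * 10), roughly 1e-3: +each layer's step is scaled by the ratio of its weight norm to its gradient norm.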
+ +)DOC"); + } +}; + +class LarsMomentumOpVarTypeInference : public framework::VarTypeInference { + public: + void operator()(const framework::OpDesc &op_desc, + framework::BlockDesc *block) const override {} +}; +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(lars_momentum, ops::MomentumOp, ops::LarsMomentumOpMaker, + paddle::framework::EmptyGradOpMaker, + ops::LarsMomentumOpVarTypeInference); +REGISTER_OP_CPU_KERNEL(lars_momentum, ops::LarsMomentumOpKernel, + ops::LarsMomentumOpKernel); diff --git a/paddle/fluid/operators/optimizers/lars_momentum_op.cu b/paddle/fluid/operators/optimizers/lars_momentum_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..a277d6ff2bea917addac8c6ea4b24b63dcbc8dba --- /dev/null +++ b/paddle/fluid/operators/optimizers/lars_momentum_op.cu @@ -0,0 +1,94 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/optimizers/lars_momentum_op.h" + +namespace paddle { +namespace operators { + +template +__global__ void MomentumLarsKernel(const T* p, const T* g, const T* v, + const T* learning_rate, const T mu, + const int64_t num, const T lars_coeff, + const T lars_weight_decay, const T* p_norm, + const T* g_norm, T* p_out, T* v_out) { + T lr = learning_rate[0]; + T local_lr = learning_rate[0]; + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num; + i += blockDim.x * gridDim.x) { + if (p_norm[0] > 0 && g_norm[0] > 0) { + local_lr = lr * lars_coeff * p_norm[0] / + (g_norm[0] + lars_weight_decay * p_norm[0]); + } + T v_new = v[i] * mu + local_lr * (g[i] + lars_weight_decay * p[i]); + v_out[i] = v_new; + p_out[i] = p[i] - v_new; + } +} + +template +class LarsMomentumOpCUDAKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto param_out = ctx.Output("ParamOut"); + auto velocity_out = ctx.Output("VelocityOut"); + auto param = ctx.Input("Param"); + auto velocity = ctx.Input("Velocity"); + auto grad = ctx.Input("Grad"); + auto learning_rate = ctx.Input("LearningRate"); + + T* p_out = param_out->mutable_data(ctx.GetPlace()); + T* v_out = velocity_out->mutable_data(ctx.GetPlace()); + + T mu = static_cast(ctx.Attr("mu")); + T lars_coeff = ctx.Attr("lars_coeff"); + T lars_weight_decay = ctx.Attr("lars_weight_decay"); + + auto* p = param->data(); + auto* v = velocity->data(); + auto* g = grad->data(); + auto* lr = learning_rate->data(); + + int block = 512; + int grid = (param->numel() + block - 1) / block; + + auto eigen_p = framework::EigenVector::Flatten(*param); + auto eigen_g = framework::EigenVector::Flatten(*grad); + // calculate norms using eigein and launch the kernel. 
+ framework::Tensor p_norm_t, g_norm_t; + p_norm_t.Resize({1}); + g_norm_t.Resize({1}); + auto* p_norm_data = p_norm_t.mutable_data(ctx.GetPlace()); + auto* g_norm_data = g_norm_t.mutable_data(ctx.GetPlace()); + auto ep_norm = framework::EigenScalar::From(p_norm_t); + auto eg_norm = framework::EigenScalar::From(g_norm_t); + + auto* place = ctx.template device_context().eigen_device(); + ep_norm.device(*place) = eigen_p.square().sum().sqrt(); + eg_norm.device(*place) = eigen_g.square().sum().sqrt(); + MomentumLarsKernel<<>>( + p, g, v, lr, mu, param->numel(), lars_coeff, lars_weight_decay, + p_norm_data, g_norm_data, p_out, v_out); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + lars_momentum, + ops::LarsMomentumOpCUDAKernel, + ops::LarsMomentumOpCUDAKernel); diff --git a/paddle/fluid/operators/momentum_op.h b/paddle/fluid/operators/optimizers/lars_momentum_op.h similarity index 50% rename from paddle/fluid/operators/momentum_op.h rename to paddle/fluid/operators/optimizers/lars_momentum_op.h index 40073d21b7186ad319e4988e667750c267581300..e85be99fc42522e461a7915847d82144d8195a96 100644 --- a/paddle/fluid/operators/momentum_op.h +++ b/paddle/fluid/operators/optimizers/lars_momentum_op.h @@ -20,27 +20,25 @@ namespace paddle { namespace operators { template -class MomentumOpKernel : public framework::OpKernel { +class LarsMomentumOpKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { - const auto* param_var = ctx.InputVar("Param"); - PADDLE_ENFORCE(param_var->IsType(), - "The Var(%s)'s type should be LoDTensor, " - "but the received is %s", - ctx.Inputs("Param").front(), param_var->Type().name()); - - auto param_out = ctx.Output("ParamOut"); - auto velocity_out = ctx.Output("VelocityOut"); - auto param = ctx.Input("Param"); - auto velocity = ctx.Input("Velocity"); - auto grad = ctx.Input("Grad"); - auto learning_rate = ctx.Input("LearningRate"); + auto param_out = ctx.Output("ParamOut"); + auto velocity_out = ctx.Output("VelocityOut"); + auto param = ctx.Input("Param"); + auto velocity = ctx.Input("Velocity"); + auto learning_rate = ctx.Input("LearningRate"); + auto* grad_var = ctx.InputVar("Grad"); + // only support dense for now. 
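+ // A SelectedRows gradient would first need its rows merged (as + // MomentumOpKernel does); this kernel only handles the dense LoDTensor case.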
+ PADDLE_ENFORCE(grad_var->IsType()); + auto grad = ctx.Input("Grad"); param_out->mutable_data(ctx.GetPlace()); velocity_out->mutable_data(ctx.GetPlace()); T mu = static_cast(ctx.Attr("mu")); - bool use_nesterov = ctx.Attr("use_nesterov"); + T lars_coeff = ctx.Attr("lars_coeff"); + T lars_weight_decay = ctx.Attr("lars_weight_decay"); auto p_out = framework::EigenVector::Flatten(*param_out); auto v_out = framework::EigenVector::Flatten(*velocity_out); @@ -50,12 +48,23 @@ class MomentumOpKernel : public framework::OpKernel { auto g = framework::EigenVector::Flatten(*grad); auto* lr = learning_rate->data(); - v_out = v * mu + g; - if (use_nesterov) { - p_out = p - (g + v_out * mu) * lr[0]; - } else { - p_out = p - lr[0] * v_out; + framework::Tensor p_norm_t, g_norm_t; + p_norm_t.Resize({1}); + g_norm_t.Resize({1}); + p_norm_t.mutable_data(ctx.GetPlace()); + g_norm_t.mutable_data(ctx.GetPlace()); + auto ep_norm = framework::EigenScalar::From(p_norm_t); + auto eg_norm = framework::EigenScalar::From(g_norm_t); + + ep_norm = p.square().sum().sqrt(); + eg_norm = g.square().sum().sqrt(); + T local_lr = lr[0]; + if (ep_norm(0) > 0 && eg_norm(0) > 0) { + local_lr = lr[0] * lars_coeff * ep_norm(0) / + (eg_norm(0) + lars_weight_decay * ep_norm(0)); } + v_out = v * mu + local_lr * (g + lars_weight_decay * p); + p_out = p - v_out; } }; diff --git a/paddle/fluid/operators/optimizers/momentum_op.cc b/paddle/fluid/operators/optimizers/momentum_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..cde238c076b6991eb52dac328c3e30a045420c92 --- /dev/null +++ b/paddle/fluid/operators/optimizers/momentum_op.cc @@ -0,0 +1,99 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/fluid/operators/optimizers/momentum_op.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +class MomentumOpInferVarType : public framework::VarTypeInference { + public: + void operator()(const framework::OpDesc& op_desc, + framework::BlockDesc* block) const override { + auto input_var = op_desc.Input("Param")[0]; + for (auto& out_var : op_desc.Output("ParamOut")) { + if (block->FindRecursiveOrCreateVar(input_var).GetType() == + framework::proto::VarType::SELECTED_ROWS) { + block->FindRecursiveOrCreateVar(out_var).SetType( + framework::proto::VarType::SELECTED_ROWS); + } else if (block->FindRecursiveOrCreateVar(input_var).GetType() == + framework::proto::VarType::LOD_TENSOR) { + block->FindRecursiveOrCreateVar(out_var).SetType( + framework::proto::VarType::LOD_TENSOR); + } else { + PADDLE_THROW( + "Only support LoDTensor and SelectedRows, unexpected input type."); + } + } + } +}; + +class MomentumOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("Param", + "(Tensor, default Tensor) " + "Input parameter that has to be updated"); + AddInput("Grad", + "(Tensor, default Tensor) " + "Input gradient of the parameter"); + AddInput("Velocity", + "(Tensor, default Tensor) " + "Input velocity (corresponding to the parameter) " + "that has to be updated"); + AddInput("LearningRate", + "(Tensor, default Tensor) " + "Input learning rate"); + + AddOutput("ParamOut", + "(Tensor) This output is the updated parameter. " + "It shares memory with Input(Param)."); + AddOutput("VelocityOut", + "(Tensor) This output is the updated velocity. " + "It shares memory with Input(Velocity)."); + + AddAttr("mu", "(float) Momentum coefficient"); + AddAttr("use_nesterov", + "(bool, default false) " + "Use Nesterov Momentum") + .SetDefault(false); + AddComment(R"DOC( +Momentum Optimizer. + +This optimizer has a flag for Nesterov Momentum. +The update equations are as follows: + +$$ +velocity = mu * velocity + gradient \\ +if (use\_nesterov): \\ + param = param - (gradient + mu * velocity) * learning\_rate \\ +else: \\ + param = param - learning\_rate * velocity. \\ +$$ + +)DOC"); + } +}; +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(momentum, ops::MomentumOp, ops::MomentumOpMaker, + paddle::framework::EmptyGradOpMaker, + ops::MomentumOpInferVarType); +REGISTER_OP_CPU_KERNEL( + momentum, ops::MomentumOpKernel, + ops::MomentumOpKernel); diff --git a/paddle/fluid/operators/optimizers/momentum_op.cu b/paddle/fluid/operators/optimizers/momentum_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..8ce739de8dfd74cb43f9521bf39e3127a8a21925 --- /dev/null +++ b/paddle/fluid/operators/optimizers/momentum_op.cu @@ -0,0 +1,21 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License.
*/ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/optimizers/momentum_op.h" + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + momentum, ops::MomentumOpKernel, + ops::MomentumOpKernel); diff --git a/paddle/fluid/operators/optimizers/momentum_op.h b/paddle/fluid/operators/optimizers/momentum_op.h new file mode 100644 index 0000000000000000000000000000000000000000..e5b756b4fa637f2d4136f8c8a87bf34c6c04413a --- /dev/null +++ b/paddle/fluid/operators/optimizers/momentum_op.h @@ -0,0 +1,402 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/algorithm.h" +#include "paddle/fluid/operators/math/selected_rows_functor.h" +#include "paddle/fluid/platform/for_range.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; +using framework::SelectedRows; +struct NoNesterov; +struct UseNesterov; + +class MomentumOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Param"), + "Input(param) of Momentum should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Grad"), + "Input(grad) of Momentum should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Velocity"), + "Input(velocity) of Momentum should not be null."); + PADDLE_ENFORCE(ctx->HasInput("LearningRate"), + "Input(LearningRate) of Momentum should not be null."); + PADDLE_ENFORCE( + ctx->GetInputsVarType("Param").front() == + framework::proto::VarType::LOD_TENSOR, + "The input var's type should be LoDTensor, but the received is %s", + ctx->Inputs("Param").front(), ctx->GetInputsVarType("Param").front()); + + PADDLE_ENFORCE(ctx->HasOutput("ParamOut"), + "Output(ParamOut) of Momentum should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("VelocityOut"), + "Output(VelocityOut) of Momentum should not be null."); + + auto param_dim = ctx->GetInputDim("Param"); + if (ctx->GetInputsVarType("Grad")[0] == + framework::proto::VarType::LOD_TENSOR) { + PADDLE_ENFORCE_EQ( + param_dim, ctx->GetInputDim("Grad"), + "Param and Grad input of MomentumOp should have the same dimension."); + PADDLE_ENFORCE_EQ( + param_dim, ctx->GetInputDim("Velocity"), + "Param and Velocity of MomentumOp should have the same dimension."); + } + PADDLE_ENFORCE_EQ(framework::product(ctx->GetInputDim("LearningRate")), 1, + "Learning_rate should be a scalar"); + + ctx->SetOutputDim("ParamOut", param_dim); + ctx->SetOutputDim("VelocityOut", param_dim); + } + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + auto input_data_type = framework::GetDataTypeOfVar(ctx.InputVar("Param")); + return framework::OpKernelType(input_data_type, ctx.GetPlace()); + } +}; + +template +class CPUDenseMomentumFunctor { + 
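+ // CPU path: the update is evaluated as whole-tensor Eigen expressions and + // the Nesterov choice is a plain runtime branch, which is cheap on CPU, + // unlike the templated GPU functors below.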
private: + const Tensor* param; + const Tensor* grad; + const Tensor* velocity; + const Tensor* learning_rate; + const T mu; + const T use_nesterov; + Tensor* param_out; + Tensor* velocity_out; + + public: + CPUDenseMomentumFunctor(const Tensor* param, const Tensor* grad, + const Tensor* velocity, const Tensor* learning_rate, + const T mu, const bool use_nesterov, + Tensor* param_out, Tensor* velocity_out) + : param(param), + grad(grad), + velocity(velocity), + learning_rate(learning_rate), + mu(mu), + use_nesterov(use_nesterov), + param_out(param_out), + velocity_out(velocity_out) {} + + inline void operator()() { + auto p_out = framework::EigenVector::Flatten(*param_out); + auto v_out = framework::EigenVector::Flatten(*velocity_out); + + auto p = framework::EigenVector::Flatten(*param); + auto v = framework::EigenVector::Flatten(*velocity); + auto g = framework::EigenVector::Flatten(*grad); + auto* lr = learning_rate->data(); + + v_out = v * mu + g; + if (use_nesterov) { + p_out = p - (g + v_out * mu) * lr[0]; + } else { + p_out = p - lr[0] * v_out; + } + } +}; + +template +class DenseMomentumFunctor; + +// NOTE(dzh): for performance, avoid an if/else inside the kernel; the GPU +// UseNesterov/NoNesterov paths are implemented as two separate functors. +template +class DenseMomentumFunctor { + private: + const T* p_; + const T* g_; + const T* v_; + const T* lr_; + const T mu_; + const int64_t num_; + T* p_out_; + T* v_out_; + + public: + DenseMomentumFunctor(const T* p, const T* g, const T* v, + const T* learning_rate, const T mu, const int64_t num, + T* p_out, T* v_out) + : p_(p), + g_(g), + v_(v), + lr_(learning_rate), + mu_(mu), + num_(num), + p_out_(p_out), + v_out_(v_out) {} + inline HOSTDEVICE void operator()(size_t i) const { + // put memory access in register + const T p = p_[i]; + const T g = g_[i]; + const T lr = lr_[0]; + const T v = v_[i]; + T v_out = v * mu_ + g; + T p_out = p - (g + v_out * mu_) * lr; + // write registers back to memory + v_out_[i] = v_out; + p_out_[i] = p_out; + } +}; + +template +class DenseMomentumFunctor { + private: + const T* p_; + const T* g_; + const T* v_; + const T* lr_; + const T mu_; + const int64_t num_; + T* p_out_; + T* v_out_; + + public: + DenseMomentumFunctor(const T* p, const T* g, const T* v, + const T* learning_rate, const T mu, const int64_t num, + T* p_out, T* v_out) + : p_(p), + g_(g), + v_(v), + lr_(learning_rate), + mu_(mu), + num_(num), + p_out_(p_out), + v_out_(v_out) {} + inline HOSTDEVICE void operator()(size_t i) const { + // put memory access in register + const T p = p_[i]; + const T g = g_[i]; + const T lr = lr_[0]; + const T v = v_[i]; + T v_out = v * mu_ + g; + T p_out = p - lr * v_out; + // write registers back to memory + v_out_[i] = v_out; + p_out_[i] = p_out; + } +}; + +template +class SparseMomentumFunctor; + +template +class SparseMomentumFunctor { + private: + const T* p_; + const T* g_; + const T* v_; + const T* lr_; + const T mu_; + const int64_t* rows_; + const int64_t row_numel_; + const int64_t row_height_; + T* p_out_; + T* v_out_; + + public: + SparseMomentumFunctor(const T* p, const T* g, const T* v, const T* lr, + const T mu, const int64_t* rows, int64_t row_numel, + int64_t row_height, T* p_out, T* v_out) + : p_(p), + g_(g), + v_(v), + lr_(lr), + mu_(mu), + rows_(rows), + row_numel_(row_numel), + row_height_(row_height), + p_out_(p_out), + v_out_(v_out) {} + + inline HOSTDEVICE void operator()(size_t i) { + auto row_idx = + math::BinarySearch(rows_, row_height_, i / row_numel_); + T g = row_idx >= 0 ?
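+ // (a negative row_idx means this row has no entry in the sparse gradient, + // so its gradient contribution is zero)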
g_[row_idx * row_numel_ + i % row_numel_] : 0; + // put memory access in register + const T p = p_[i]; + const T lr = lr_[0]; + const T v = v_[i]; + T v_out = v * mu_ + g; + T p_out = p - (g + v_out * mu_) * lr; + // write registers back to memory + v_out_[i] = v_out; + p_out_[i] = p_out; + } +}; + +template +class SparseMomentumFunctor { + private: + const T* p_; + const T* g_; + const T* v_; + const T* lr_; + const T mu_; + const int64_t* rows_; + const int64_t row_numel_; + const int64_t row_height_; + T* p_out_; + T* v_out_; + + public: + SparseMomentumFunctor(const T* p, const T* g, const T* v, const T* lr, + const T mu, const int64_t* rows, int64_t row_numel, + int64_t row_height, T* p_out, T* v_out) + : p_(p), + g_(g), + v_(v), + lr_(lr), + mu_(mu), + rows_(rows), + row_numel_(row_numel), + row_height_(row_height), + p_out_(p_out), + v_out_(v_out) {} + + inline HOSTDEVICE void operator()(size_t i) { + auto row_idx = + math::BinarySearch(rows_, row_height_, i / row_numel_); + T g = row_idx >= 0 ? g_[row_idx * row_numel_ + i % row_numel_] : 0; + // put memory access in register + const T p = p_[i]; + const T lr = lr_[0]; + const T v = v_[i]; + T v_out = v * mu_ + g; + T p_out = p - v_out * lr; + // write registers back to memory + v_out_[i] = v_out; + p_out_[i] = p_out; + } +}; + +template +class MomentumOpKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + T mu = static_cast(ctx.Attr("mu")); + bool use_nesterov = ctx.Attr("use_nesterov"); + + auto learning_rate = ctx.Input("LearningRate"); + auto param = ctx.Input("Param"); + auto param_out = ctx.Output("ParamOut"); + auto* velocity = ctx.Input("Velocity"); + auto velocity_out = ctx.Output("VelocityOut"); + param_out->mutable_data(ctx.GetPlace()); + velocity_out->mutable_data(ctx.GetPlace()); + + auto* grad_var = ctx.InputVar("Grad"); + if (grad_var->IsType()) { + auto grad = ctx.Input("Grad"); + if (platform::is_cpu_place(ctx.GetPlace())) { + CPUDenseMomentumFunctor functor(param, grad, velocity, learning_rate, + mu, use_nesterov, param_out, + velocity_out); + functor(); + } else if (platform::is_gpu_place(ctx.GetPlace())) { + platform::ForRange for_range( + static_cast(ctx.device_context()), + param->numel()); + if (use_nesterov) { + DenseMomentumFunctor functor( + param->data(), grad->data(), velocity->data(), + learning_rate->data(), mu, param->numel(), + param_out->mutable_data(ctx.GetPlace()), + velocity_out->mutable_data(ctx.GetPlace())); + for_range(functor); + + } else { + DenseMomentumFunctor functor( + param->data(), grad->data(), velocity->data(), + learning_rate->data(), mu, param->numel(), + param_out->mutable_data(ctx.GetPlace()), + velocity_out->mutable_data(ctx.GetPlace())); + for_range(functor); + } + } + + } else if (grad_var->IsType()) { + // sparse update: the embedding gradient arrives as SelectedRows + auto grad = ctx.Input("Grad"); + + // sparse update may be empty.
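+ // (an empty SelectedRows is legal here, e.g. when no embedding rows were + // touched in this step; there is then nothing to update)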
+ if (grad->rows().size() == 0) { + VLOG(30) << "Grad SelectedRows contains no data!"; + return; + } + auto* merged_grad = const_cast(ctx.scope()) + .Var() + ->GetMutable(); + math::scatter::MergeAdd merge_func; + merge_func(ctx.template device_context(), *grad, + merged_grad); + + const int64_t* rows = nullptr; +#ifdef PADDLE_WITH_CUDA + if (platform::is_gpu_place(ctx.GetPlace())) { + rows = merged_grad->rows().CUDAData(ctx.GetPlace()); + } else { +#endif + rows = merged_grad->rows().data(); +#ifdef PADDLE_WITH_CUDA + } +#endif + int64_t row_numel = + merged_grad->value().numel() / merged_grad->rows().size(); + platform::ForRange for_range( + static_cast(ctx.device_context()), + param->numel()); + if (use_nesterov) { + SparseMomentumFunctor functor( + param->data(), merged_grad->value().data(), + velocity->data(), learning_rate->data(), mu, rows, row_numel, + static_cast(merged_grad->rows().size()), + param_out->mutable_data(ctx.GetPlace()), + velocity_out->mutable_data(ctx.GetPlace())); + for_range(functor); + + } else { + SparseMomentumFunctor functor( + param->data(), merged_grad->value().data(), + velocity->data(), learning_rate->data(), mu, rows, row_numel, + static_cast(merged_grad->rows().size()), + param_out->mutable_data(ctx.GetPlace()), + velocity_out->mutable_data(ctx.GetPlace())); + for_range(functor); + } + } else { + PADDLE_THROW( + string::Sprintf("MomentumOp only supports LoDTensor or SelectedRows " + "gradient, but the received Variable Type is %s", + grad_var->Type().name())); + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/proximal_adagrad_op.cc b/paddle/fluid/operators/optimizers/proximal_adagrad_op.cc similarity index 98% rename from paddle/fluid/operators/proximal_adagrad_op.cc rename to paddle/fluid/operators/optimizers/proximal_adagrad_op.cc index 8d8075d76111928ec9855eb0b70fe6dbd90a979b..7b07b3b7071cb39e4e81cb4612372eec96efe489 100644 --- a/paddle/fluid/operators/proximal_adagrad_op.cc +++ b/paddle/fluid/operators/optimizers/proximal_adagrad_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/proximal_adagrad_op.h" +#include "paddle/fluid/operators/optimizers/proximal_adagrad_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/proximal_adagrad_op.cu b/paddle/fluid/operators/optimizers/proximal_adagrad_op.cu similarity index 92% rename from paddle/fluid/operators/proximal_adagrad_op.cu rename to paddle/fluid/operators/optimizers/proximal_adagrad_op.cu index 7e0226c62bfd5d4804cc70c00391237deec33ebb..d1c1f747b70c3ceb806da06e6786a70b62a32995 100644 --- a/paddle/fluid/operators/proximal_adagrad_op.cu +++ b/paddle/fluid/operators/optimizers/proximal_adagrad_op.cu @@ -12,7 +12,7 @@ CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/proximal_adagrad_op.h" +#include "paddle/fluid/operators/optimizers/proximal_adagrad_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/proximal_adagrad_op.h b/paddle/fluid/operators/optimizers/proximal_adagrad_op.h similarity index 100% rename from paddle/fluid/operators/proximal_adagrad_op.h rename to paddle/fluid/operators/optimizers/proximal_adagrad_op.h diff --git a/paddle/fluid/operators/proximal_gd_op.cc b/paddle/fluid/operators/optimizers/proximal_gd_op.cc similarity index 98% rename from paddle/fluid/operators/proximal_gd_op.cc rename to paddle/fluid/operators/optimizers/proximal_gd_op.cc index baf9cbcba2ed89f62afc9816e0ab9e0f112e6008..dcef4f7be249e04306732213a7c6209d32602048 100644 --- a/paddle/fluid/operators/proximal_gd_op.cc +++ b/paddle/fluid/operators/optimizers/proximal_gd_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/proximal_gd_op.h" +#include "paddle/fluid/operators/optimizers/proximal_gd_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/proximal_gd_op.cu b/paddle/fluid/operators/optimizers/proximal_gd_op.cu similarity index 92% rename from paddle/fluid/operators/proximal_gd_op.cu rename to paddle/fluid/operators/optimizers/proximal_gd_op.cu index 32ee9ab74cd58fd6f48b6c34e108f31315adaf71..7aa0e1015008eba0c1cf63ba1278dc2b8049b20b 100644 --- a/paddle/fluid/operators/proximal_gd_op.cu +++ b/paddle/fluid/operators/optimizers/proximal_gd_op.cu @@ -12,7 +12,7 @@ CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/proximal_gd_op.h" +#include "paddle/fluid/operators/optimizers/proximal_gd_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/proximal_gd_op.h b/paddle/fluid/operators/optimizers/proximal_gd_op.h similarity index 100% rename from paddle/fluid/operators/proximal_gd_op.h rename to paddle/fluid/operators/optimizers/proximal_gd_op.h diff --git a/paddle/fluid/operators/rmsprop_op.cc b/paddle/fluid/operators/optimizers/rmsprop_op.cc similarity index 99% rename from paddle/fluid/operators/rmsprop_op.cc rename to paddle/fluid/operators/optimizers/rmsprop_op.cc index f06f87e61d3a4d1fc8b864b9dd84e697fb12a006..99d1156ee6d5fc88161e25bfa581a265707e6f92 100644 --- a/paddle/fluid/operators/rmsprop_op.cc +++ b/paddle/fluid/operators/optimizers/rmsprop_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/rmsprop_op.h" +#include "paddle/fluid/operators/optimizers/rmsprop_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/rmsprop_op.cu b/paddle/fluid/operators/optimizers/rmsprop_op.cu similarity index 92% rename from paddle/fluid/operators/rmsprop_op.cu rename to paddle/fluid/operators/optimizers/rmsprop_op.cu index cdc473769598be5aac87a14613d9acdd5c1a1204..69e35a309e04f61068d9ff1b6d9f1450d2524253 100644 --- a/paddle/fluid/operators/rmsprop_op.cu +++ b/paddle/fluid/operators/optimizers/rmsprop_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #define EIGEN_USE_GPU -#include "paddle/fluid/operators/rmsprop_op.h" +#include "paddle/fluid/operators/optimizers/rmsprop_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/rmsprop_op.h b/paddle/fluid/operators/optimizers/rmsprop_op.h similarity index 96% rename from paddle/fluid/operators/rmsprop_op.h rename to paddle/fluid/operators/optimizers/rmsprop_op.h index 797cd45fdcdbd5c3567d1676f37e148304ee6e2d..389c84d2464090ff9bd9e8b471cd0103c86a347a 100644 --- a/paddle/fluid/operators/rmsprop_op.h +++ b/paddle/fluid/operators/optimizers/rmsprop_op.h @@ -179,8 +179,8 @@ class RmspropOpKernel : public framework::OpKernel { auto &mg_tensor = *ctx.Input("MeanGrad"); auto mg = EigenVector::Flatten(mg_tensor); auto *mean_grad_out = ctx.Output("MeanGradOut"); - PADDLE_ENFORCE(&mg_tensor, mean_grad_out, - "MeanGrad and MeanGradOut must be the same Tensor"); + PADDLE_ENFORCE_EQ(&mg_tensor, mean_grad_out, + "MeanGrad and MeanGradOut must be the same Tensor"); auto mg_out = EigenVector::Flatten(*mean_grad_out); mg_out.device(place) = rho * mg + (1 - rho) * g; @@ -198,8 +198,8 @@ class RmspropOpKernel : public framework::OpKernel { if (centered) { auto &mg_tensor = *ctx.Input("MeanGrad"); auto *mean_grad_out = ctx.Output("MeanGradOut"); - PADDLE_ENFORCE(&mg_tensor, mean_grad_out, - "MeanGrad and MeanGradOut must be the same Tensor"); + PADDLE_ENFORCE_EQ(&mg_tensor, mean_grad_out, + "MeanGrad and MeanGradOut must be the same Tensor"); for_range(CenteredRmspropFunctor>( param_out->mutable_data(ctx.GetPlace()), mean_square_out->mutable_data(ctx.GetPlace()), @@ -243,8 +243,8 @@ class RmspropOpKernel : public framework::OpKernel { if (centered) { auto &mg_tensor = *ctx.Input("MeanGrad"); auto *mean_grad_out = ctx.Output("MeanGradOut"); - PADDLE_ENFORCE(&mg_tensor, mean_grad_out, - "MeanGrad and MeanGradOut must be the same Tensor"); + PADDLE_ENFORCE_EQ(&mg_tensor, mean_grad_out, + "MeanGrad and MeanGradOut must be the same Tensor"); for_range(CenteredRmspropFunctor>( param_out->mutable_data(ctx.GetPlace()), mean_square_out->mutable_data(ctx.GetPlace()), diff --git a/paddle/fluid/operators/sgd_op.cc b/paddle/fluid/operators/optimizers/sgd_op.cc similarity index 97% rename from paddle/fluid/operators/sgd_op.cc rename to paddle/fluid/operators/optimizers/sgd_op.cc index 411a126bc8e2b3a8d25f436489c13970568ccae4..690381a67f89d18fe81c3b856b7ddce25d496ed0 100644 --- a/paddle/fluid/operators/sgd_op.cc +++ b/paddle/fluid/operators/optimizers/sgd_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/sgd_op.h" +#include "paddle/fluid/operators/optimizers/sgd_op.h" namespace paddle { namespace operators { @@ -77,8 +77,7 @@ class SGDOpMaker : public framework::OpProtoAndCheckerMaker { AddInput("Grad", "(Tensor or SelectedRows) Input gradient"); AddOutput("ParamOut", "(Tensor or SelectedRows, same with Param) " - "Output parameter, should share the same memory with Param") - .Reuse("Param"); + "Output parameter, should share the same memory with Param"); AddComment(R"DOC( SGD operator diff --git a/paddle/fluid/operators/sgd_op.cu b/paddle/fluid/operators/optimizers/sgd_op.cu similarity index 98% rename from paddle/fluid/operators/sgd_op.cu rename to paddle/fluid/operators/optimizers/sgd_op.cu index d3f4eba3b24ec1ac0328ef270256cdf3abe499db..a9d303d55d8f681fe3a014db36ede5ef6b2742bd 100644 --- a/paddle/fluid/operators/sgd_op.cu +++ b/paddle/fluid/operators/optimizers/sgd_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #include -#include "paddle/fluid/operators/sgd_op.h" +#include "paddle/fluid/operators/optimizers/sgd_op.h" #include "paddle/fluid/platform/cuda_primitives.h" namespace paddle { diff --git a/paddle/fluid/operators/sgd_op.h b/paddle/fluid/operators/optimizers/sgd_op.h similarity index 93% rename from paddle/fluid/operators/sgd_op.h rename to paddle/fluid/operators/optimizers/sgd_op.h index d8b0165b2a89b04bd55671a37d96ee4ba275b2eb..b27ef27e298d0f08129e2c0a349c741129acdfe2 100644 --- a/paddle/fluid/operators/sgd_op.h +++ b/paddle/fluid/operators/optimizers/sgd_op.h @@ -98,10 +98,10 @@ class SGDOpKernel : public framework::OpKernel { auto param_row_width = param.value().dims()[1]; auto grad_row_width = grad.value().dims()[1]; - VLOG(4) << " param rows: " << param.rows().size() - << " param memory rows: " << param.value().dims()[0] - << " grad rows: " << grad.rows().size() - << " grad memory rows: " << grad.value().dims()[0]; + VLOG(40) << " param rows: " << param.rows().size() + << " param memory rows: " << param.value().dims()[0] + << " grad rows: " << grad.rows().size() + << " grad memory rows: " << grad.value().dims()[0]; PADDLE_ENFORCE_EQ(param_row_width, grad_row_width, "param_row should have the same size with grad_row"); @@ -109,8 +109,6 @@ class SGDOpKernel : public framework::OpKernel { const auto *grad_data = grad.value().data(); auto *out_data = param_out->mutable_value()->data(); for (size_t i = 0; i < grad.rows().size(); i++) { - PADDLE_ENFORCE(grad.rows()[i] < grad.height(), - "Input rows index should less than height"); int64_t id_index = param_out->AutoGrownIndex(grad.rows()[i], false); PADDLE_ENFORCE_GE(id_index, static_cast(0), "id should be in the table"); diff --git a/paddle/fluid/operators/pad_constant_like_op.cc b/paddle/fluid/operators/pad_constant_like_op.cc index 37646c7b4c50fc7409002aca56e5462bde93cc30..685ebc393794337e03de0b9ce5134d6ce382c8bf 100644 --- a/paddle/fluid/operators/pad_constant_like_op.cc +++ b/paddle/fluid/operators/pad_constant_like_op.cc @@ -74,7 +74,7 @@ PadConstantLikeOp Operator. Pad input(Y) with a pad_value, the number of values padded to the edges of each axis is specified by the difference of the shape of X and Y. -((0, shape_x_0 - shape_y_0), … (0, shape_x_n - shape_y_n)) unique pad widths for +((0, shape_x_0 - shape_y_0), ... (0, shape_x_n - shape_y_n)) unique pad widths for each axis. The input should be a k-D tensor(k > 0 and k < 7). 
As an example: diff --git a/paddle/fluid/operators/pool_cudnn_op.cu.cc b/paddle/fluid/operators/pool_cudnn_op.cu.cc index 31f083565fddee66aea1485ed71f41b6199f4502..4a332ce10b59b21d2518684237ce0bbf1bbfa75a 100644 --- a/paddle/fluid/operators/pool_cudnn_op.cu.cc +++ b/paddle/fluid/operators/pool_cudnn_op.cu.cc @@ -41,6 +41,7 @@ class PoolCUDNNOpKernel : public framework::OpKernel { T *output_data = output->mutable_data(ctx.GetPlace()); std::string pooling_type = ctx.Attr("pooling_type"); + bool exclusive = ctx.Attr("exclusive"); std::vector ksize = ctx.Attr>("ksize"); std::vector strides = ctx.Attr>("strides"); std::vector paddings = ctx.Attr>("paddings"); @@ -72,7 +73,8 @@ class PoolCUDNNOpKernel : public framework::OpKernel { if (pooling_type == "max") { pooling_mode = PoolingMode::kMaximum; } else { - pooling_mode = PoolingMode::kAverage; + pooling_mode = exclusive ? PoolingMode::kAverageExclusive + : PoolingMode::kAverageInclusive; } cudnnPoolingDescriptor_t cudnn_pool_desc = @@ -101,6 +103,7 @@ class PoolCUDNNGradOpKernel : public framework::OpKernel { Tensor *input_grad = ctx.Output(framework::GradVarName("X")); std::string pooling_type = ctx.Attr("pooling_type"); + bool exclusive = ctx.Attr("exclusive"); std::vector ksize = ctx.Attr>("ksize"); std::vector strides = ctx.Attr>("strides"); std::vector paddings = ctx.Attr>("paddings"); @@ -141,7 +144,8 @@ class PoolCUDNNGradOpKernel : public framework::OpKernel { pooling_mode = PoolingMode::kMaximum; } } else { - pooling_mode = PoolingMode::kAverage; + pooling_mode = exclusive ? PoolingMode::kAverageExclusive + : PoolingMode::kAverageInclusive; } cudnnPoolingDescriptor_t cudnn_pool_desc = @@ -174,7 +178,8 @@ REGISTER_OP_KERNEL(pool2d, CUDNN, plat::CUDAPlace, ops::PoolCUDNNOpKernel); REGISTER_OP_KERNEL(pool2d_grad, CUDNN, plat::CUDAPlace, ops::PoolCUDNNGradOpKernel, - ops::PoolCUDNNGradOpKernel); + ops::PoolCUDNNGradOpKernel, + ops::PoolCUDNNGradOpKernel); REGISTER_OP_KERNEL(pool3d, CUDNN, plat::CUDAPlace, ops::PoolCUDNNOpKernel, diff --git a/paddle/fluid/operators/pool_mkldnn_op.cc b/paddle/fluid/operators/pool_mkldnn_op.cc index 56cef91e29cc7da27384c27a7ec63e90cfadfc3b..0a9a29956affedb8605ab9949070943fbbb54145 100644 --- a/paddle/fluid/operators/pool_mkldnn_op.cc +++ b/paddle/fluid/operators/pool_mkldnn_op.cc @@ -87,6 +87,7 @@ class PoolMKLDNNOpKernel : public paddle::framework::OpKernel { std::vector ksize = ctx.Attr>("ksize"); std::vector strides = ctx.Attr>("strides"); std::vector paddings = ctx.Attr>("paddings"); + bool is_test = ctx.Attr("is_test"); if (ctx.Attr("global_pooling")) { for (size_t i = 0; i < ksize.size(); ++i) { paddings[i] = 0; @@ -142,16 +143,10 @@ class PoolMKLDNNOpKernel : public paddle::framework::OpKernel { std::shared_ptr pool_pd = CreatePrimitiveDesc(src_md, dst_md, strides, padding_left_top, padding_right_bottom, ksize, pooling_type, - mkldnn_engine, ceil_mode); + mkldnn_engine, ceil_mode, is_test); // save pool_pd into global device context to be referred in backward path - dev_ctx.SetBlob(key_pool_pd, pool_pd); - - std::shared_ptr workspace_memory = - CreateWorkspaceMemory(pool_pd, pooling_type, mkldnn_engine); - - // save pool_workspace_memory to be referred in backward path - dev_ctx.SetBlob(key_pool_workspace_memory, workspace_memory); + if (!is_test) dev_ctx.SetBlob(key_pool_pd, pool_pd); auto src_memory = std::make_shared(pool_pd->src_primitive_desc(), to_void_cast(input_data)); @@ -161,9 +156,19 @@ class PoolMKLDNNOpKernel : public paddle::framework::OpKernel { dev_ctx.SetBlob(key_pool_src_mem_p, 
src_memory); dev_ctx.SetBlob(key_pool_dst_mem_p, dst_memory); - pool_p = std::make_shared(*pool_pd, *(src_memory.get()), - *(dst_memory.get()), - *workspace_memory); + if (is_test) { + pool_p = std::make_shared(*pool_pd, *src_memory, + *dst_memory); + } else { + std::shared_ptr workspace_memory = + CreateWorkspaceMemory(pool_pd, pooling_type, mkldnn_engine); + + // save pool_workspace_memory to be referred in backward path + dev_ctx.SetBlob(key_pool_workspace_memory, workspace_memory); + + pool_p = std::make_shared( + *pool_pd, *src_memory, *dst_memory, *workspace_memory); + } dev_ctx.SetBlob(key_pool_p, pool_p); @@ -201,9 +206,12 @@ class PoolMKLDNNOpKernel : public paddle::framework::OpKernel { const std::vector& stride, const std::vector& padding_left_top, const std::vector& padding_right_bot, const std::vector& kernel, const std::string& pooling_type, const mkldnn::engine& engine, - bool ceil_mode) const { + bool ceil_mode, bool is_test) const { + auto mkldnn_forward_prop_kind = is_test + ? mkldnn::prop_kind::forward_inference + : mkldnn::prop_kind::forward_training; auto pool_desc = mkldnn::pooling_forward::desc( - mkldnn::prop_kind::forward, + mkldnn_forward_prop_kind, pooling_type == "max" ? mkldnn::algorithm::pooling_max : mkldnn::algorithm::pooling_avg, src, dst, stride, kernel, padding_left_top, padding_right_bot, @@ -248,6 +256,10 @@ class PoolMKLDNNGradOpKernel : public paddle::framework::OpKernel { out_grad->format() != memory::format::format_undef, "Wrong layout/format set for Input output_grad tensor"); + PADDLE_ENFORCE( + !ctx.Attr("is_test"), + "is_test attribute should be set to False in training phase."); + std::string pooling_type = ctx.Attr("pooling_type"); std::vector ksize = ctx.Attr>("ksize"); std::vector strides = ctx.Attr>("strides"); diff --git a/paddle/fluid/operators/pool_op.cc b/paddle/fluid/operators/pool_op.cc index f8ad63690e84339da0390d4ddd2db45f25db385a..52b607df74446866c535751f3faa11765cb6f247 100644 --- a/paddle/fluid/operators/pool_op.cc +++ b/paddle/fluid/operators/pool_op.cc @@ -40,7 +40,7 @@ int PoolOutputSize(int input_size, int filter_size, int padding, int stride, return output_size; } -void PoolOp::InferShape(framework::InferShapeContext *ctx) const { +void PoolOp::InferShape(framework::InferShapeContext* ctx) const { PADDLE_ENFORCE(ctx->HasInput("X"), "X(Input) of Pooling should not be null."); PADDLE_ENFORCE(ctx->HasOutput("Out"), "Out(Output) of Pooling should not be null."); @@ -81,7 +81,7 @@ void PoolOp::InferShape(framework::InferShapeContext *ctx) const { } framework::OpKernelType PoolOp::GetExpectedKernelType( - const framework::ExecutionContext &ctx) const { + const framework::ExecutionContext& ctx) const { framework::LibraryType library_{framework::LibraryType::kPlain}; std::string data_format = ctx.Attr("data_format"); framework::DataLayout layout_ = framework::StringToDataLayout(data_format); @@ -104,7 +104,7 @@ framework::OpKernelType PoolOp::GetExpectedKernelType( layout_, library_); } -void PoolOpGrad::InferShape(framework::InferShapeContext *ctx) const { +void PoolOpGrad::InferShape(framework::InferShapeContext* ctx) const { PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) must not be null."); PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("X")), "Input(X@GRAD) should not be null."); @@ -112,7 +112,7 @@ void PoolOpGrad::InferShape(framework::InferShapeContext *ctx) const { } framework::OpKernelType PoolOpGrad::GetExpectedKernelType( - const framework::ExecutionContext &ctx) const { + const framework::ExecutionContext& ctx) 
const { framework::LibraryType library_{framework::LibraryType::kPlain}; std::string data_format = ctx.Attr<std::string>("data_format"); framework::DataLayout layout_ = framework::StringToDataLayout(data_format); @@ -151,8 +151,7 @@ void Pool2dOpMaker::Make() { "The format of output tensor is also NCHW, " "where N is batch size, C is the number of channels, " "H is the height of the feature, "
- "and W is the width of the feature.")
- .Reuse("X");
+ "and W is the width of the feature.");
AddAttr<std::string>("pooling_type", "(string), pooling type, can be \"max\" for max-pooling " @@ -181,6 +180,12 @@ "operator." "If global_pooling = true, paddings and ksize will be ignored.") .SetDefault({0, 0});
+ AddAttr<bool>(
+ "exclusive",
+ "(bool, default True) When true, will exclude the zero-padding in the "
+ "averaging calculation, otherwise, include the zero-padding. Note, it "
+ "is only used when pooling_type is avg. The default is True.")
+ .SetDefault(true);
AddAttr<bool>( "use_cudnn", "(bool, default false) Only used in cudnn kernel, need install cudnn") @@ -201,6 +206,11 @@ "Defaults to \"NHWC\". Specify the data format of the output data, " "the input will be transformed automatically. ") .SetDefault("AnyLayout");
+ AddAttr<bool>("is_test",
+ "(bool, default false) Set to true for inference only, false "
+ "for training. Some layers may run faster when this is true.")
+ .SetDefault(false);
+
// TODO(dzhwinter): need to register layout transform function AddComment(R"DOC( @@ -237,9 +247,34 @@ Example: W_{out} = \\frac{(W_{in} - ksize[1] + 2 * paddings[1] + strides[1] - 1)}{strides[1]} + 1 $$
+ For exclusive = false:
+ $$
+ hstart = i * strides[0] - paddings[0]
+ hend = hstart + ksize[0]
+ wstart = j * strides[1] - paddings[1]
+ wend = wstart + ksize[1]
+ Output(i, j) = \\frac{sum(Input[hstart:hend, wstart:wend])}{ksize[0] * ksize[1]}
+ $$
+ For exclusive = true:
+ $$
+ hstart = max(0, i * strides[0] - paddings[0])
+ hend = min(H, hstart + ksize[0])
+ wstart = max(0, j * strides[1] - paddings[1])
+ wend = min(W, wstart + ksize[1])
+ Output(i, j) = \\frac{sum(Input[hstart:hend, wstart:wend])}{(hend - hstart) * (wend - wstart)}
+ $$
+ )DOC"); }
+class PoolOpInferVarType : public framework::PassInDtypeAndVarTypeToOutput {
+ protected:
+ std::unordered_map<std::string, std::string> GetInputOutputWithSameType() const override {
+ return std::unordered_map<std::string, std::string>{{"X", /*->*/ "Out"}};
+ }
+};
+
void Pool3dOpMaker::Make() { AddInput("X", "(Tensor) The input tensor of pooling operator. " @@ -252,8 +287,7 @@ "The format of output tensor is also NCDHW, " "where N is batch size, C is " "the number of channels, and D, H and W is the depth, height and "
- "width of the feature, respectively.")
- .Reuse("X");
+ "width of the feature, respectively.");
AddAttr<std::string>("pooling_type", "(string) Pooling type, can be \"max\" for max-pooling " @@ -285,6 +319,12 @@ "If global_pooling = true, ksize and paddings will be ignored.") .SetDefault({0, 0, 0}); // TODO(Chengduo): Add checker. (Currently, // TypedAttrChecker doesn't support vector type.)
+ AddAttr<bool>(
+ "exclusive",
+ "(bool, default True) When true, will exclude the zero-padding in the "
+ "averaging calculation, otherwise, include the zero-padding. Note, it "
+ "is only used when pooling_type is avg. The default is True.")
+ .SetDefault(true);
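Both the 2-D and 3-D makers now expose the same `exclusive` switch, and its whole effect is on the divisor of average pooling, as in the corrected formulas above: the exclusive variant divides by the number of real, in-bounds elements, the inclusive one by the full window size (counting the zero padding). A standalone 1-D sketch of the two behaviors (illustrative only, not Paddle's `math::Pool2dFunctor`):

```cpp
// Exclusive vs. inclusive average pooling on a 1-D input. Both variants sum
// the in-bounds values; they differ only in the divisor: the number of real
// elements in the window (exclusive) vs. the window size including padding
// (inclusive).
#include <algorithm>
#include <cstdio>
#include <vector>

std::vector<float> AvgPool1D(const std::vector<float>& x, int ksize,
                             int stride, int padding, bool exclusive) {
  const int n = static_cast<int>(x.size());
  const int out_n = (n - ksize + 2 * padding) / stride + 1;
  std::vector<float> out(out_n);
  for (int i = 0; i < out_n; ++i) {
    const int start = std::max(0, i * stride - padding);
    const int end = std::min(n, i * stride - padding + ksize);
    float sum = 0.f;
    for (int k = start; k < end; ++k) sum += x[k];
    const int divisor = exclusive ? (end - start) : ksize;
    out[i] = sum / divisor;
  }
  return out;
}

int main() {
  const std::vector<float> x = {1.f, 2.f, 3.f, 4.f};
  for (bool exclusive : {true, false}) {
    for (float v :
         AvgPool1D(x, /*ksize=*/3, /*stride=*/1, /*padding=*/1, exclusive))
      std::printf("%.3f ", v);  // first output: 1.500 if exclusive, 1.000 if not
    std::printf(" (exclusive=%d)\n", static_cast<int>(exclusive));
  }
}
```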
The defalut is True.") + .SetDefault(true); AddAttr( "use_cudnn", @@ -345,6 +385,7 @@ Example: namespace ops = paddle::operators; REGISTER_OPERATOR(pool2d, ops::PoolOp, ops::Pool2dOpMaker, + ops::PoolOpInferVarType, paddle::framework::DefaultGradOpDescMaker); REGISTER_OPERATOR(pool2d_grad, ops::PoolOpGrad); @@ -356,6 +397,7 @@ REGISTER_OP_CPU_KERNEL( ops::PoolGradKernel); REGISTER_OPERATOR(pool3d, ops::PoolOp, ops::Pool3dOpMaker, + ops::PoolOpInferVarType, paddle::framework::DefaultGradOpDescMaker); REGISTER_OPERATOR(pool3d_grad, ops::PoolOpGrad); diff --git a/paddle/fluid/operators/pool_op.h b/paddle/fluid/operators/pool_op.h index a63963ca926bb94ff99e5cfe6dbcb2b15075bcb8..c0594b7e3cc5602a44bb01951a22c2135ba5c7ce 100644 --- a/paddle/fluid/operators/pool_op.h +++ b/paddle/fluid/operators/pool_op.h @@ -69,6 +69,7 @@ class PoolKernel : public framework::OpKernel { std::vector ksize = context.Attr>("ksize"); std::vector strides = context.Attr>("strides"); std::vector paddings = context.Attr>("paddings"); + bool exclusive = context.Attr("exclusive"); if (context.Attr("global_pooling")) { for (size_t i = 0; i < ksize.size(); ++i) { paddings[i] = 0; @@ -84,7 +85,7 @@ class PoolKernel : public framework::OpKernel { pool2d_forward; paddle::operators::math::MaxPool pool_process; pool2d_forward(dev_ctx, *in_x, ksize, strides, paddings, pool_process, - out); + true, out); } else if (pooling_type == "avg") { paddle::operators::math::Pool2dFunctor< @@ -92,7 +93,7 @@ class PoolKernel : public framework::OpKernel { pool2d_forward; paddle::operators::math::AvgPool pool_process; pool2d_forward(dev_ctx, *in_x, ksize, strides, paddings, pool_process, - out); + exclusive, out); } } break; case 3: { @@ -102,14 +103,14 @@ class PoolKernel : public framework::OpKernel { pool3d_forward; paddle::operators::math::MaxPool pool_process; pool3d_forward(dev_ctx, *in_x, ksize, strides, paddings, pool_process, - out); + true, out); } else if (pooling_type == "avg") { paddle::operators::math::Pool3dFunctor< DeviceContext, paddle::operators::math::AvgPool, T> pool3d_forward; paddle::operators::math::AvgPool pool_process; pool3d_forward(dev_ctx, *in_x, ksize, strides, paddings, pool_process, - out); + exclusive, out); } } break; default: { PADDLE_THROW("Pool op only supports 2D and 3D input."); } @@ -131,6 +132,7 @@ class PoolGradKernel : public framework::OpKernel { std::vector ksize = context.Attr>("ksize"); std::vector strides = context.Attr>("strides"); std::vector paddings = context.Attr>("paddings"); + bool exclusive = context.Attr("exclusive"); if (context.Attr("global_pooling")) { for (size_t i = 0; i < ksize.size(); ++i) { @@ -157,7 +159,7 @@ class PoolGradKernel : public framework::OpKernel { pool2d_backward; paddle::operators::math::AvgPoolGrad pool_process; pool2d_backward(dev_ctx, *in_x, *out, *out_grad, ksize, strides, - paddings, pool_process, in_x_grad); + paddings, pool_process, exclusive, in_x_grad); } } break; case 3: { @@ -172,7 +174,7 @@ class PoolGradKernel : public framework::OpKernel { pool3d_backward; paddle::operators::math::AvgPoolGrad pool_process; pool3d_backward(dev_ctx, *in_x, *out, *out_grad, ksize, strides, - paddings, pool_process, in_x_grad); + paddings, pool_process, exclusive, in_x_grad); } } break; default: { PADDLE_THROW("Pool op only supports 2D and 3D input."); } diff --git a/paddle/fluid/operators/prelu_op.h b/paddle/fluid/operators/prelu_op.h index 12f1525594ecf0887618616ffe563bd2bda32496..594f1cb3abe49c61ad7c490ebcd100a5c9ea6fb9 100644 --- a/paddle/fluid/operators/prelu_op.h +++ 
b/paddle/fluid/operators/prelu_op.h @@ -32,7 +32,7 @@ class PReluKernel : public framework::OpKernel<T> { T* o_ptr = out->mutable_data<T>(context.GetPlace()); const T* alpha_ptr = alpha->data<T>();
- std::string mode = context.Attr<std::string>("mode");
+ auto& mode = context.Attr<std::string>("mode");
int numel = x->numel(); auto dim = x->dims(); @@ -99,6 +99,8 @@ class PReluGradKernel : public framework::OpKernel<T> { index = 0; if (dalpha) { T* dalpha_ptr = dalpha->mutable_data<T>(context.GetPlace());
+ memset(dalpha_ptr, 0, sizeof(T) * dalpha->numel());
+
if (mode == "channel") { for (i = 0; i < numel; i++) { temp = numel / (dim[0] * dim[1]); diff --git a/paddle/fluid/operators/random_crop_op.h b/paddle/fluid/operators/random_crop_op.h index d68ba9d661698bb0d33b139f5748daec2ead6595..5f1a48b6de01550978638917e3c66ef2851ee2ed 100644 --- a/paddle/fluid/operators/random_crop_op.h +++ b/paddle/fluid/operators/random_crop_op.h @@ -155,8 +155,8 @@ class RandomCropKernel : public framework::OpKernel<T> { seed = *cpu_seed.data<int64_t>(); } } else {
- VLOG(5) << "WARNING: The input 'Seed' is not initialized, use attribute " "'startup_seed' instead.";
+ VLOG(50) << "WARNING: The input 'Seed' is not initialized, use attribute " "'startup_seed' instead.";
seed = ctx.Attr<int>("startup_seed"); } auto shape = ctx.Attr<std::vector<int>>("shape"); diff --git a/paddle/fluid/operators/reader/CMakeLists.txt b/paddle/fluid/operators/reader/CMakeLists.txt index 1514f6566a8cd6fda64c48754c58169409e263fb..a82f2b13c22fa04bcb43e549589e7f6347ecdc55 100644 --- a/paddle/fluid/operators/reader/CMakeLists.txt +++ b/paddle/fluid/operators/reader/CMakeLists.txt @@ -1,3 +1,5 @@
+include(operators)
+
cc_library(reader_op_registry SRCS reader_op_registry.cc DEPS operator op_registry reader) set(LOCAL_READER_LIBS) @@ -31,4 +33,10 @@ reader_library(create_py_reader_op SRCS create_py_reader_op.cc) cc_test(reader_blocking_queue_test SRCS reader_blocking_queue_test.cc) # Export local libraries to parent
-set(READER_LIBRARY ${LOCAL_READER_LIBS} PARENT_SCOPE)
+# set(READER_LIBRARY ${LOCAL_READER_LIBS} PARENT_SCOPE)
+
+op_library(read_op)
+
+foreach(src ${LOCAL_READER_LIBS})
+ set(OP_LIBRARY ${src} ${OP_LIBRARY} CACHE INTERNAL "op libs")
+endforeach()
diff --git a/paddle/fluid/operators/reader/blocking_queue.h b/paddle/fluid/operators/reader/blocking_queue.h index 51b980acb5a08d431d96a3a92479dec09119c27e..618248f87298d62078aeccfa135b853b9d2b1744 100644 --- a/paddle/fluid/operators/reader/blocking_queue.h +++ b/paddle/fluid/operators/reader/blocking_queue.h @@ -42,7 +42,7 @@ class BlockingQueue { std::unique_lock<std::mutex> lock(mutex_); send_cv_.wait(lock, [&] { return queue_.size() < capacity_ || closed_; }); if (closed_) {
- VLOG(5)
+ VLOG(50)
<< "WARNING: Sending an element to a closed reader::BlockingQueue."; return false; } @@ -56,7 +56,7 @@ std::unique_lock<std::mutex> lock(mutex_); send_cv_.wait(lock, [&] { return queue_.size() < capacity_ || closed_; }); if (closed_) {
- VLOG(5)
+ VLOG(50)
<< "WARNING: Sending an element to a closed reader::BlockingQueue."; return false; } diff --git a/paddle/fluid/operators/reader/create_py_reader_op.cc b/paddle/fluid/operators/reader/create_py_reader_op.cc index 0f31ca1a94326956ae5e6dffd582daedeb55a9e3..901a92ab5b5c74b071be8b57a7653d90e2a4fb29 100644 --- a/paddle/fluid/operators/reader/create_py_reader_op.cc +++ b/paddle/fluid/operators/reader/create_py_reader_op.cc @@ -74,7 +74,7 @@ "Name of the `LoDTensorBlockingQueueHolder` variable"); AddComment(R"DOC( - Create PyReader to support LoDTensor data
feeding in Python side. + Create PyReader to support LoDTensor data feeding in Python side. )DOC"); } }; diff --git a/paddle/fluid/operators/reader/create_recordio_file_reader_op.cc b/paddle/fluid/operators/reader/create_recordio_file_reader_op.cc index a08a9dbd0da46e73082cdd24c019e8d210d8bcc4..d7a048257f92c1c58c34decf1a93ff95f5f736c7 100644 --- a/paddle/fluid/operators/reader/create_recordio_file_reader_op.cc +++ b/paddle/fluid/operators/reader/create_recordio_file_reader_op.cc @@ -13,6 +13,7 @@ // limitations under the License. #include "paddle/fluid/operators/reader/reader_op_registry.h" +#include "paddle/fluid/platform/lock_guard_ptr.h" #include "paddle/fluid/recordio/scanner.h" namespace paddle { @@ -33,11 +34,7 @@ class RecordIOFileReader : public framework::FileReader { protected: void ReadNextImpl(std::vector* out) override { - std::unique_ptr> guard; - if (ThreadSafe) { - guard.reset(new std::lock_guard(*mutex_)); - } - + platform::LockGuardPtr guard(mutex_); bool ok = framework::ReadFromRecordIO(&scanner_, dev_ctx_, out); if (!ok) { out->clear(); diff --git a/paddle/fluid/operators/reader/create_shuffle_reader_op.cc b/paddle/fluid/operators/reader/create_shuffle_reader_op.cc index 3f72890a7cee1453585d50afa04fa62a9b059dc3..3fe4e9e7adee071fd56cf9f3d2560829f096ba9b 100644 --- a/paddle/fluid/operators/reader/create_shuffle_reader_op.cc +++ b/paddle/fluid/operators/reader/create_shuffle_reader_op.cc @@ -26,7 +26,7 @@ class ShuffleReader : public framework::DecoratedReader { ShuffleReader(const std::shared_ptr& reader, size_t buffer_size, size_t seed = 0) : DecoratedReader(reader), buffer_size_(buffer_size), seed_(seed) { - VLOG(10) << "Create shuffle reader of " << reader_; + VLOG(100) << "Create shuffle reader of " << reader_; if (seed_ == 0) { std::random_device device; seed_ = device(); @@ -37,7 +37,7 @@ class ShuffleReader : public framework::DecoratedReader { void ReadNextImpl(std::vector* out) override { out->clear(); if (iteration_pos_ >= buffer_.size()) { - VLOG(10) << "Resetting shuffle buffer"; + VLOG(100) << "Resetting shuffle buffer"; ReloadBuffer(); if (buffer_.empty()) { return; @@ -73,7 +73,7 @@ class ShuffleReader : public framework::DecoratedReader { std::mt19937 g(seed_); std::shuffle(buffer_.begin(), buffer_.end(), g); seed_ = g(); // update seed_; - VLOG(10) << "random buffer size = " << buffer_.size(); + VLOG(100) << "random buffer size = " << buffer_.size(); } size_t buffer_size_; diff --git a/paddle/fluid/operators/read_op.cc b/paddle/fluid/operators/reader/read_op.cc similarity index 88% rename from paddle/fluid/operators/read_op.cc rename to paddle/fluid/operators/reader/read_op.cc index a0d640b2020958af53a4405ae886eadb2a1e117e..a0b70938d354cbb3bf10a9c8c589ba5153624f45 100644 --- a/paddle/fluid/operators/read_op.cc +++ b/paddle/fluid/operators/reader/read_op.cc @@ -33,6 +33,19 @@ class ReadInferShape : public framework::InferShapeBase { reader_dims.size(), out_names.size(), "The reader's dim number doesn't match the output number."); ctx->SetOutputsDim("Out", reader_dims); + if (!ctx->IsRuntime()) { + auto in_desc = + boost::get(ctx->GetInputVarPtrs("Reader")[0]); + auto in_lod_levels = in_desc->GetLoDLevels(); + auto out_var_ptrs = ctx->GetOutputVarPtrs("Out"); + PADDLE_ENFORCE_EQ(in_lod_levels.size(), out_var_ptrs.size(), + "LoDLevels of Input(Reader) must be the same as the " + "number of Outputs(Out)."); + for (size_t i = 0; i < out_var_ptrs.size(); ++i) { + auto* out_desc = boost::get(out_var_ptrs[i]); + out_desc->SetLoDLevel(in_lod_levels[i]); + } + } } 
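The RecordIO reader change in the hunk above replaces the hand-rolled conditional `std::unique_ptr<std::lock_guard<...>>` with `platform::LockGuardPtr`, which locks only when a mutex is actually present. A minimal sketch of that idiom (assumed interface; the real class lives in paddle/fluid/platform/lock_guard_ptr.h):

```cpp
// RAII guard over an *optional* mutex: when the reader is single-threaded the
// unique_ptr is empty and the guard is a no-op; when it holds a mutex, the
// guard locks it for the scope.
#include <memory>
#include <mutex>

template <typename LockType>
class LockGuardPtr {
 public:
  explicit LockGuardPtr(std::unique_ptr<LockType>& lock) : lock_(lock.get()) {
    if (lock_) lock_->lock();
  }
  ~LockGuardPtr() {
    if (lock_) lock_->unlock();
  }
  LockGuardPtr(const LockGuardPtr&) = delete;
  LockGuardPtr& operator=(const LockGuardPtr&) = delete;

 private:
  LockType* lock_;  // may be null; null means "nothing to lock"
};

int main() {
  std::unique_ptr<std::mutex> mu;          // ThreadSafe == false: no mutex
  { LockGuardPtr<std::mutex> guard(mu); }  // no-op

  mu.reset(new std::mutex());              // ThreadSafe == true
  { LockGuardPtr<std::mutex> guard(mu); }  // locks for the scope
}
```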
}; diff --git a/paddle/fluid/operators/reader/reader_blocking_queue_test.cc b/paddle/fluid/operators/reader/reader_blocking_queue_test.cc index bd7ac64b2fce2452744e4756b149ee7f291d38aa..dc0940ac0b78d295b5088cb6ae26300da1dc883d 100644 --- a/paddle/fluid/operators/reader/reader_blocking_queue_test.cc +++ b/paddle/fluid/operators/reader/reader_blocking_queue_test.cc @@ -229,7 +229,7 @@ TEST(BlockingQueue, speed_test_mode) { q1.Receive(&b); EXPECT_EQ(b, i); } - EXPECT_EQ(q1.Size(), 0); + EXPECT_EQ(q1.Size(), 0UL); BlockingQueue q2(queue_size, true); for (size_t i = 0; i < queue_size; ++i) { @@ -237,7 +237,7 @@ TEST(BlockingQueue, speed_test_mode) { } for (size_t i = 0; i < queue_size; ++i) { q2.Receive(&b); - EXPECT_EQ(b, 0); + EXPECT_EQ(b, 0UL); } EXPECT_EQ(q2.Size(), queue_size); } diff --git a/paddle/fluid/operators/recurrent_op.cc b/paddle/fluid/operators/recurrent_op.cc index 162bfcbb0844d29385d0f8ad5d25a3f8de6bd41b..283dce93212ac91fc4a3276598c1f32cfd36d1e7 100644 --- a/paddle/fluid/operators/recurrent_op.cc +++ b/paddle/fluid/operators/recurrent_op.cc @@ -160,7 +160,7 @@ class RecurrentBase : public framework::OperatorBase { Callback callback) { PADDLE_ENFORCE_EQ(src_vars.size(), dst_vars.size()); for (size_t i = 0; i < dst_vars.size(); ++i) { - VLOG(10) << "Link " << src_vars[i] << " to " << dst_vars[i]; + VLOG(100) << "Link " << src_vars[i] << " to " << dst_vars[i]; AccessTensor(src_scope, src_vars[i], dst_scope, dst_vars[i], callback); } } @@ -176,7 +176,7 @@ class RecurrentBase : public framework::OperatorBase { Callback callback) { PADDLE_ENFORCE_EQ(src_vars.size(), dst_vars.size()); for (size_t i = 0; i < dst_vars.size(); ++i) { - VLOG(10) << "Link " << src_vars[i] << " to " << dst_vars[i]; + VLOG(100) << "Link " << src_vars[i] << " to " << dst_vars[i]; AccessTensor(src_scope, src_vars[i], dst_scope, dst_vars[i], callback); } } @@ -230,7 +230,7 @@ class RecurrentOp : public RecurrentBase { void RunImpl(const framework::Scope &scope, const platform::Place &place) const override { auto seq_len = static_cast(this->GetSequenceLength(scope)); - VLOG(3) << "Static RNN input sequence length = " << seq_len; + VLOG(30) << "Static RNN input sequence length = " << seq_len; StepScopes scopes = CreateStepScopes(scope, seq_len); auto reverse = Attr(kReverse); @@ -241,7 +241,7 @@ class RecurrentOp : public RecurrentBase { for (size_t i = 0; i < seq_len; ++i) { size_t seq_offset = reverse ? seq_len - i - 1 : i; - VLOG(3) << "Recurrent operate at the time step " << seq_offset; + VLOG(30) << "Recurrent operate at the time step " << seq_offset; auto &cur_scope = scopes.CurScope(); @@ -334,7 +334,7 @@ class RecurrentGradOp : public RecurrentBase { for (size_t step_id = 0; step_id < seq_len; ++step_id) { size_t seq_offset = reverse ? 
step_id : seq_len - step_id - 1; - VLOG(3) << "Recurrent backward operate at the time step " << seq_offset; + VLOG(30) << "Recurrent backward operate at the time step " << seq_offset; auto &cur_scope = scopes.CurScope(); // Link outside::output_grads --> inside::output_grads // inside::output_grad = outside::output_grad[seq_offset:seq_offset+1] @@ -348,11 +348,11 @@ class RecurrentGradOp : public RecurrentBase { }); auto og_set = List2Set(Inputs(kOutputGrads)); - if (VLOG_IS_ON(10)) { + if (VLOG_IS_ON(100)) { std::ostringstream sout; std::copy(og_set.begin(), og_set.end(), std::ostream_iterator(sout, ",")); - VLOG(10) << " RNN output gradients = [" << sout.str() << "]"; + VLOG(100) << " RNN output gradients = [" << sout.str() << "]"; } // Link states @@ -374,7 +374,7 @@ class RecurrentGradOp : public RecurrentBase { auto &ex_tensor = ex_scope.FindVar(ex_grad)->Get(); - VLOG(10) << " RNN link " << cur_grad << " from " << ex_grad; + VLOG(100) << " RNN link " << cur_grad << " from " << ex_grad; auto *cur_grad_var = cur_scope.Var(cur_grad); auto cur_grad_tensor = cur_grad_var->GetMutable(); @@ -382,12 +382,12 @@ class RecurrentGradOp : public RecurrentBase { } } - VLOG(5) << "Recurrent memory linking finished "; + VLOG(50) << "Recurrent memory linking finished "; // Run step block with cur_scope executor.Run(*program, &cur_scope, block->ID(), false /*create_local_scope*/); - VLOG(5) << "executor.Run finished "; + VLOG(50) << "executor.Run finished "; auto local_var_names = LocalVarNames(cur_scope); @@ -436,7 +436,7 @@ class RecurrentGradOp : public RecurrentBase { cur_scope.Rename(new_inside_name, inside_grad_name); } } - VLOG(5) << "Accumulate Parameter finished "; + VLOG(50) << "Accumulate Parameter finished "; // Copy input gradient from inside to outside // outside::input_grad[seq_offset: seq_offset + 1] = inside::input_grad @@ -455,7 +455,7 @@ class RecurrentGradOp : public RecurrentBase { auto dst = outside->Slice(seq_offset, seq_offset + 1); framework::TensorCopy(inside, place, dev_ctx, &dst); }); - VLOG(5) << "Link outside gradient finished "; + VLOG(50) << "Link outside gradient finished "; if (step_id + 1 == seq_len) { // at_end // copy initialize states gradient from inside to outside @@ -468,7 +468,7 @@ class RecurrentGradOp : public RecurrentBase { outside->mutable_data(place, inside.type()); framework::TensorCopy(inside, place, dev_ctx, outside); }); - VLOG(5) << "Link initialize state gradient finished "; + VLOG(50) << "Link initialize state gradient finished "; } scopes.Next(); } diff --git a/paddle/fluid/operators/reduce_ops/CMakeLists.txt b/paddle/fluid/operators/reduce_ops/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..5fe4d15ae2c6254a50318813c852b6c314880aba --- /dev/null +++ b/paddle/fluid/operators/reduce_ops/CMakeLists.txt @@ -0,0 +1,20 @@ +include(operators) +register_operators() + +if(WITH_GPU) + file(GLOB OPS RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "*.part.cu") + string(REPLACE ".part.cu" "" OPS "${OPS}") + + foreach(src ${OPS}) + if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${src}.part.cu) + set(CUDA_KERNEL_FILE ${CMAKE_CURRENT_SOURCE_DIR}/${src}.part.cu) + file(READ ${CUDA_KERNEL_FILE} TARGET_CONTENT) + string(REGEX MATCH "REGISTER_OP_CUDA_KERNEL\\(\\n?([^,]+),.*" MATCHED ${TARGET_CONTENT}) + if (MATCHED) + string(STRIP ${CMAKE_MATCH_1} MATCHED) + file(APPEND ${pybind_file} "USE_OP_DEVICE_KERNEL(${MATCHED}, CUDA);\n") + endif() + + endif() + endforeach() +endif() diff --git a/paddle/fluid/operators/cub_reduce.h 
b/paddle/fluid/operators/reduce_ops/cub_reduce.h similarity index 100% rename from paddle/fluid/operators/cub_reduce.h rename to paddle/fluid/operators/reduce_ops/cub_reduce.h diff --git a/paddle/fluid/operators/reduce_max_op.cc b/paddle/fluid/operators/reduce_ops/reduce_max_op.cc similarity index 96% rename from paddle/fluid/operators/reduce_max_op.cc rename to paddle/fluid/operators/reduce_ops/reduce_max_op.cc index 95d3768e1fdf6947659c7b3a1c9d57fad741472a..cb438b4a8057267015c8b3c15dd8468fca5a4b44 100644 --- a/paddle/fluid/operators/reduce_max_op.cc +++ b/paddle/fluid/operators/reduce_ops/reduce_max_op.cc @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/operators/reduce_min_max_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_min_max_op.h" REGISTER_REDUCE_OP(reduce_max); REGISTER_OP_CPU_KERNEL( diff --git a/paddle/fluid/operators/reduce_max_op.cu b/paddle/fluid/operators/reduce_ops/reduce_max_op.cu similarity index 67% rename from paddle/fluid/operators/reduce_max_op.cu rename to paddle/fluid/operators/reduce_ops/reduce_max_op.cu index 0d86b3127e42f7ee14ba57b1c762e8128a0f2d54..832112ede833a06e053dcff5139e82f054b127c4 100644 --- a/paddle/fluid/operators/reduce_max_op.cu +++ b/paddle/fluid/operators/reduce_ops/reduce_max_op.cu @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/operators/reduce_min_max_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_min_max_op.h" REGISTER_OP_CUDA_KERNEL(reduce_max, ops::ReduceKernel, ops::ReduceKernel); -REGISTER_OP_CUDA_KERNEL( - reduce_max_grad, ops::ReduceGradKernel, - ops::ReduceGradKernel, - ops::ReduceGradKernel, - ops::ReduceGradKernel); diff --git a/paddle/fluid/operators/reduce_ops/reduce_max_op.part.cu b/paddle/fluid/operators/reduce_ops/reduce_max_op.part.cu new file mode 100644 index 0000000000000000000000000000000000000000..5ee38b8fa46290c86cd44ef1bcc71bd2fcd9bcd4 --- /dev/null +++ b/paddle/fluid/operators/reduce_ops/reduce_max_op.part.cu @@ -0,0 +1,25 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include "paddle/fluid/operators/reduce_ops/reduce_min_max_op.h" + +REGISTER_OP_CUDA_KERNEL( + reduce_max_grad, ops::ReduceGradKernel, + ops::ReduceGradKernel, + ops::ReduceGradKernel, + ops::ReduceGradKernel); diff --git a/paddle/fluid/operators/reduce_mean_op.cc b/paddle/fluid/operators/reduce_ops/reduce_mean_op.cc similarity index 96% rename from paddle/fluid/operators/reduce_mean_op.cc rename to paddle/fluid/operators/reduce_ops/reduce_mean_op.cc index fc258c2496340b47d24dc89f16f7419dbb4b0d95..072bc34d3e23a48c8d856a51b0d5a6facc7ececf 100644 --- a/paddle/fluid/operators/reduce_mean_op.cc +++ b/paddle/fluid/operators/reduce_ops/reduce_mean_op.cc @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/operators/reduce_mean_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_mean_op.h" REGISTER_REDUCE_OP(reduce_mean); REGISTER_OP_CPU_KERNEL(reduce_mean, diff --git a/paddle/fluid/operators/reduce_mean_op.cu b/paddle/fluid/operators/reduce_ops/reduce_mean_op.cu similarity index 78% rename from paddle/fluid/operators/reduce_mean_op.cu rename to paddle/fluid/operators/reduce_ops/reduce_mean_op.cu index 59b30244839849d79e3e531953134633503c4090..4d3bce8fdd05e536baa5fecb4fc5a117e2031224 100644 --- a/paddle/fluid/operators/reduce_mean_op.cu +++ b/paddle/fluid/operators/reduce_ops/reduce_mean_op.cu @@ -13,8 +13,8 @@ // limitations under the License. #include -#include "paddle/fluid/operators/cub_reduce.h" -#include "paddle/fluid/operators/reduce_mean_op.h" +#include "paddle/fluid/operators/reduce_ops/cub_reduce.h" +#include "paddle/fluid/operators/reduce_ops/reduce_mean_op.h" namespace paddle { namespace operators { @@ -69,13 +69,3 @@ REGISTER_OP_CUDA_KERNEL(reduce_mean, ops::ReduceMeanKernel, ops::ReduceMeanKernel, ops::ReduceMeanKernel, ops::ReduceMeanKernel); - -REGISTER_OP_CUDA_KERNEL( - reduce_mean_grad, ops::ReduceGradKernel, - ops::ReduceGradKernel, - ops::ReduceGradKernel, - ops::ReduceGradKernel); diff --git a/paddle/fluid/operators/reduce_mean_op.h b/paddle/fluid/operators/reduce_ops/reduce_mean_op.h similarity index 95% rename from paddle/fluid/operators/reduce_mean_op.h rename to paddle/fluid/operators/reduce_ops/reduce_mean_op.h index 1359679c4767d2032bf3e3a90849ad2a2ef3e829..240c43bc6d0af266e3500c14f894fe30abab728e 100644 --- a/paddle/fluid/operators/reduce_mean_op.h +++ b/paddle/fluid/operators/reduce_ops/reduce_mean_op.h @@ -14,7 +14,7 @@ #pragma once -#include "paddle/fluid/operators/reduce_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/reduce_ops/reduce_mean_op.part.cu b/paddle/fluid/operators/reduce_ops/reduce_mean_op.part.cu new file mode 100644 index 0000000000000000000000000000000000000000..9324ec1e1db6f40e463b415e5d2bdc5cfe664ef4 --- /dev/null +++ b/paddle/fluid/operators/reduce_ops/reduce_mean_op.part.cu @@ -0,0 +1,26 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+// See the License for the specific language governing permissions and +// limitations under the License. + +// .part used to speed up nvcc compile +#include "paddle/fluid/operators/reduce_ops/reduce_mean_op.h" + +REGISTER_OP_CUDA_KERNEL( + reduce_mean_grad, ops::ReduceGradKernel, + ops::ReduceGradKernel, + ops::ReduceGradKernel, + ops::ReduceGradKernel); diff --git a/paddle/fluid/operators/reduce_min_max_op.h b/paddle/fluid/operators/reduce_ops/reduce_min_max_op.h similarity index 96% rename from paddle/fluid/operators/reduce_min_max_op.h rename to paddle/fluid/operators/reduce_ops/reduce_min_max_op.h index ec59f3e71c1c702655a3feed10935b2f5a29d8a8..2557e8dd488618dd4998845b0e6e3ba823b96986 100644 --- a/paddle/fluid/operators/reduce_min_max_op.h +++ b/paddle/fluid/operators/reduce_ops/reduce_min_max_op.h @@ -13,7 +13,7 @@ // limitations under the License. #pragma once -#include "paddle/fluid/operators/reduce_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/reduce_min_op.cc b/paddle/fluid/operators/reduce_ops/reduce_min_op.cc similarity index 96% rename from paddle/fluid/operators/reduce_min_op.cc rename to paddle/fluid/operators/reduce_ops/reduce_min_op.cc index 330a86d2e4237a10d8cf6fd40025540edf08d897..11aa78382e319331dc65ec22927f0d5762adfb43 100644 --- a/paddle/fluid/operators/reduce_min_op.cc +++ b/paddle/fluid/operators/reduce_ops/reduce_min_op.cc @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/operators/reduce_min_max_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_min_max_op.h" REGISTER_REDUCE_OP(reduce_min); REGISTER_OP_CPU_KERNEL( diff --git a/paddle/fluid/operators/reduce_min_op.cu b/paddle/fluid/operators/reduce_ops/reduce_min_op.cu similarity index 67% rename from paddle/fluid/operators/reduce_min_op.cu rename to paddle/fluid/operators/reduce_ops/reduce_min_op.cu index da466f805eff4709dc23471baef03e94052ee6c1..7b2706866f594228cbceb084e99d83aa8f345dfd 100644 --- a/paddle/fluid/operators/reduce_min_op.cu +++ b/paddle/fluid/operators/reduce_ops/reduce_min_op.cu @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/operators/reduce_min_max_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_min_max_op.h" REGISTER_OP_CUDA_KERNEL(reduce_min, ops::ReduceKernel, ops::ReduceKernel); -REGISTER_OP_CUDA_KERNEL( - reduce_min_grad, ops::ReduceGradKernel, - ops::ReduceGradKernel, - ops::ReduceGradKernel, - ops::ReduceGradKernel); diff --git a/paddle/fluid/operators/reduce_ops/reduce_min_op.part.cu b/paddle/fluid/operators/reduce_ops/reduce_min_op.part.cu new file mode 100644 index 0000000000000000000000000000000000000000..bf886063786a8c36884ed20fef41c99468156c01 --- /dev/null +++ b/paddle/fluid/operators/reduce_ops/reduce_min_op.part.cu @@ -0,0 +1,25 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/operators/reduce_ops/reduce_min_max_op.h" + +REGISTER_OP_CUDA_KERNEL( + reduce_min_grad, ops::ReduceGradKernel, + ops::ReduceGradKernel, + ops::ReduceGradKernel, + ops::ReduceGradKernel); diff --git a/paddle/fluid/operators/reduce_op.h b/paddle/fluid/operators/reduce_ops/reduce_op.h similarity index 99% rename from paddle/fluid/operators/reduce_op.h rename to paddle/fluid/operators/reduce_ops/reduce_op.h index 72b6cf1773d5bcc42e40e72111179d454d2bb4a9..540742c4cd8b0efc4c6cf095d7a8b3516f551d4c 100644 --- a/paddle/fluid/operators/reduce_op.h +++ b/paddle/fluid/operators/reduce_ops/reduce_op.h @@ -18,7 +18,7 @@ limitations under the License. */ #include #include -#include "paddle/fluid/operators/reduce_op_function.h" +#include "paddle/fluid/operators/reduce_ops/reduce_op_function.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/reduce_op_function.h b/paddle/fluid/operators/reduce_ops/reduce_op_function.h similarity index 100% rename from paddle/fluid/operators/reduce_op_function.h rename to paddle/fluid/operators/reduce_ops/reduce_op_function.h diff --git a/paddle/fluid/operators/reduce_prod_op.cc b/paddle/fluid/operators/reduce_ops/reduce_prod_op.cc similarity index 96% rename from paddle/fluid/operators/reduce_prod_op.cc rename to paddle/fluid/operators/reduce_ops/reduce_prod_op.cc index 713728b99757a6f3bb128f665d5576ac64eef8ec..88935107df187da731e5b77bb6c24cd692d2994f 100644 --- a/paddle/fluid/operators/reduce_prod_op.cc +++ b/paddle/fluid/operators/reduce_ops/reduce_prod_op.cc @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/operators/reduce_prod_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_prod_op.h" REGISTER_REDUCE_OP(reduce_prod); REGISTER_OP_CPU_KERNEL(reduce_prod, diff --git a/paddle/fluid/operators/reduce_prod_op.cu b/paddle/fluid/operators/reduce_ops/reduce_prod_op.cu similarity index 67% rename from paddle/fluid/operators/reduce_prod_op.cu rename to paddle/fluid/operators/reduce_ops/reduce_prod_op.cu index d62e677d92cffecf629d1684026b0c7bcfec29e3..4434937f75397d8d5340a94abbd41efa7e7a8d4b 100644 --- a/paddle/fluid/operators/reduce_prod_op.cu +++ b/paddle/fluid/operators/reduce_ops/reduce_prod_op.cu @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. 
-#include "paddle/fluid/operators/reduce_prod_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_prod_op.h" REGISTER_OP_CUDA_KERNEL(reduce_prod, ops::ReduceKernel, ops::ReduceKernel); -REGISTER_OP_CUDA_KERNEL( - reduce_prod_grad, ops::ReduceGradKernel, - ops::ReduceGradKernel, - ops::ReduceGradKernel, - ops::ReduceGradKernel); diff --git a/paddle/fluid/operators/reduce_prod_op.h b/paddle/fluid/operators/reduce_ops/reduce_prod_op.h similarity index 95% rename from paddle/fluid/operators/reduce_prod_op.h rename to paddle/fluid/operators/reduce_ops/reduce_prod_op.h index 97748113e092719aceed9d806ca6242077111532..103e108e4bda1c33434ec0c5d6c58f24fa725f57 100644 --- a/paddle/fluid/operators/reduce_prod_op.h +++ b/paddle/fluid/operators/reduce_ops/reduce_prod_op.h @@ -14,7 +14,7 @@ #pragma once -#include "paddle/fluid/operators/reduce_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/reduce_ops/reduce_prod_op.part.cu b/paddle/fluid/operators/reduce_ops/reduce_prod_op.part.cu new file mode 100644 index 0000000000000000000000000000000000000000..0610cdd94f89c0371988fac7955d07fc5498a69f --- /dev/null +++ b/paddle/fluid/operators/reduce_ops/reduce_prod_op.part.cu @@ -0,0 +1,25 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/operators/reduce_ops/reduce_prod_op.h" + +REGISTER_OP_CUDA_KERNEL( + reduce_prod_grad, ops::ReduceGradKernel, + ops::ReduceGradKernel, + ops::ReduceGradKernel, + ops::ReduceGradKernel); diff --git a/paddle/fluid/operators/reduce_sum_op.cc b/paddle/fluid/operators/reduce_ops/reduce_sum_op.cc similarity index 96% rename from paddle/fluid/operators/reduce_sum_op.cc rename to paddle/fluid/operators/reduce_ops/reduce_sum_op.cc index f0e5f6580fbc9e70562cb2fdd7e0c5d8729bc9a7..c7742f45dd147ea87413aa17680d671bede5dd6c 100644 --- a/paddle/fluid/operators/reduce_sum_op.cc +++ b/paddle/fluid/operators/reduce_ops/reduce_sum_op.cc @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/operators/reduce_sum_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_sum_op.h" REGISTER_REDUCE_OP(reduce_sum); REGISTER_OP_CPU_KERNEL( diff --git a/paddle/fluid/operators/reduce_sum_op.cu b/paddle/fluid/operators/reduce_ops/reduce_sum_op.cu similarity index 77% rename from paddle/fluid/operators/reduce_sum_op.cu rename to paddle/fluid/operators/reduce_ops/reduce_sum_op.cu index 53cd9e9419dd9aecee730917ae21d7a4ab332ffc..9051740e83aabd783750e8f415da09921608e470 100644 --- a/paddle/fluid/operators/reduce_sum_op.cu +++ b/paddle/fluid/operators/reduce_ops/reduce_sum_op.cu @@ -12,8 +12,8 @@ // See the License for the specific language governing permissions and // limitations under the License. 
-#include "paddle/fluid/operators/cub_reduce.h" -#include "paddle/fluid/operators/reduce_sum_op.h" +#include "paddle/fluid/operators/reduce_ops/cub_reduce.h" +#include "paddle/fluid/operators/reduce_ops/reduce_sum_op.h" namespace paddle { namespace operators { @@ -64,13 +64,3 @@ class ReduceSumKernel : public framework::OpKernel { REGISTER_OP_CUDA_KERNEL(reduce_sum, ops::ReduceSumKernel, ops::ReduceSumKernel, ops::ReduceSumKernel, ops::ReduceSumKernel); - -REGISTER_OP_CUDA_KERNEL( - reduce_sum_grad, ops::ReduceGradKernel, - ops::ReduceGradKernel, - ops::ReduceGradKernel, - ops::ReduceGradKernel); diff --git a/paddle/fluid/operators/reduce_sum_op.h b/paddle/fluid/operators/reduce_ops/reduce_sum_op.h similarity index 98% rename from paddle/fluid/operators/reduce_sum_op.h rename to paddle/fluid/operators/reduce_ops/reduce_sum_op.h index 3e8d1bbdba504669bc06e0637094e3bee840adf2..26f59c72b4b99ff92a63c2fc2f00a31df0f5df61 100644 --- a/paddle/fluid/operators/reduce_sum_op.h +++ b/paddle/fluid/operators/reduce_ops/reduce_sum_op.h @@ -16,7 +16,7 @@ #include -#include "paddle/fluid/operators/reduce_op.h" +#include "paddle/fluid/operators/reduce_ops/reduce_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/reduce_ops/reduce_sum_op.part.cu b/paddle/fluid/operators/reduce_ops/reduce_sum_op.part.cu new file mode 100644 index 0000000000000000000000000000000000000000..eb3295731b047391a244bfb598c9d802bca1fc0c --- /dev/null +++ b/paddle/fluid/operators/reduce_ops/reduce_sum_op.part.cu @@ -0,0 +1,26 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/operators/reduce_ops/cub_reduce.h" +#include "paddle/fluid/operators/reduce_ops/reduce_sum_op.h" + +REGISTER_OP_CUDA_KERNEL( + reduce_sum_grad, ops::ReduceGradKernel, + ops::ReduceGradKernel, + ops::ReduceGradKernel, + ops::ReduceGradKernel); diff --git a/paddle/fluid/operators/rnn_memory_helper_op.cc b/paddle/fluid/operators/rnn_memory_helper_op.cc index 0fb7776fd9dbf437673820c7cf9411644272626c..b840e690960cf77a37895f5b3d83c4cdbc2fca35 100644 --- a/paddle/fluid/operators/rnn_memory_helper_op.cc +++ b/paddle/fluid/operators/rnn_memory_helper_op.cc @@ -93,7 +93,7 @@ class RNNMemoryHelperGradOp : public framework::OperatorBase { in_grad_var_name); if (out_grad_var == nullptr) { - VLOG(5) << "Using fill constant 0 as starting gradient"; + VLOG(50) << "Using fill constant 0 as starting gradient"; auto in_var_name = Input("X"); auto *in_var = scope.FindVar(in_var_name); auto &in_var_tensor = in_var->Get(); diff --git a/paddle/fluid/operators/roi_align_op.cc b/paddle/fluid/operators/roi_align_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..79f189222ef375a1e3f7b8c3e18619a1c4f2a829 --- /dev/null +++ b/paddle/fluid/operators/roi_align_op.cc @@ -0,0 +1,166 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. 
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/fluid/operators/roi_align_op.h"
+
+namespace paddle {
+namespace operators {
+
+using Tensor = framework::Tensor;
+using LoDTensor = framework::LoDTensor;
+
+class ROIAlignOp : public framework::OperatorWithKernel {
+ public:
+ using framework::OperatorWithKernel::OperatorWithKernel;
+
+ void InferShape(framework::InferShapeContext* ctx) const override {
+ PADDLE_ENFORCE(ctx->HasInput("X"),
+ "Input(X) of ROIAlignOp should not be null.");
+ PADDLE_ENFORCE(ctx->HasInput("ROIs"),
+ "Input(ROIs) of ROIAlignOp should not be null.");
+ PADDLE_ENFORCE(ctx->HasOutput("Out"),
+ "Output(Out) of ROIAlignOp should not be null.");
+ auto input_dims = ctx->GetInputDim("X");
+ auto rois_dims = ctx->GetInputDim("ROIs");
+
+ PADDLE_ENFORCE(input_dims.size() == 4,
+ "The format of input tensor is NCHW.");
+ PADDLE_ENFORCE(rois_dims.size() == 2,
+ "ROIs should be a 2-D LoDTensor of shape (num_rois, 4) "
+ "given as [[x1, y1, x2, y2], ...].");
+ PADDLE_ENFORCE(rois_dims[1] == 4,
+ "ROIs should be a 2-D LoDTensor of shape (num_rois, 4) "
+ "given as [[x1, y1, x2, y2], ...].");
+ int pooled_height = ctx->Attrs().Get<int>("pooled_height");
+ int pooled_width = ctx->Attrs().Get<int>("pooled_width");
+ float spatial_scale = ctx->Attrs().Get<float>("spatial_scale");
+
+ PADDLE_ENFORCE_GT(pooled_height, 0,
+ "The pooled output height must be greater than 0");
+ PADDLE_ENFORCE_GT(pooled_width, 0,
+ "The pooled output width must be greater than 0");
+ PADDLE_ENFORCE_GT(spatial_scale, 0.0f,
+ "The spatial scale must be greater than 0");
+
+ auto out_dims = input_dims;
+ out_dims[0] = rois_dims[0];
+ out_dims[1] = input_dims[1];
+ out_dims[2] = pooled_height;
+ out_dims[3] = pooled_width;
+
+ ctx->SetOutputDim("Out", out_dims);
+ }
+
+ protected:
+ framework::OpKernelType GetExpectedKernelType(
+ const framework::ExecutionContext& ctx) const override {
+ return framework::OpKernelType(
+ framework::ToDataType(ctx.Input<framework::Tensor>("X")->type()),
+ ctx.device_context());
+ }
+};
+
+class ROIAlignGradOp : public framework::OperatorWithKernel {
+ public:
+ using framework::OperatorWithKernel::OperatorWithKernel;
+
+ void InferShape(framework::InferShapeContext* ctx) const override {
+ PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
+ "The GRAD@Out of ROIAlignGradOp should not be null.");
+ PADDLE_ENFORCE(ctx->HasOutputs(framework::GradVarName("X")),
+ "The GRAD@X of ROIAlignGradOp should not be null.");
+ ctx->SetOutputsDim(framework::GradVarName("X"), ctx->GetInputsDim("X"));
+ }
+
+ protected:
+ framework::OpKernelType GetExpectedKernelType(
+ const framework::ExecutionContext& ctx) const override {
+ return framework::OpKernelType(
+ framework::ToDataType(ctx.Input<framework::Tensor>("X")->type()),
+ ctx.device_context());
+ }
+};
+
+class ROIAlignOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+ void Make() override {
+ AddInput("X",
+ "(Tensor), "
+ "The input of ROIAlignOp. "
+ "The format of input tensor is NCHW.
Where N is batch size, "
+ "C is the number of input channels, "
+ "H is the height of the feature, and "
+ "W is the width of the feature.");
+ AddInput("ROIs",
+ "(LoDTensor), "
+ "ROIs (Regions of Interest) to pool over. "
+ "Should be a 2-D LoDTensor of shape (num_rois, 4) "
+ "given as [[x1, y1, x2, y2], ...]. "
+ "(x1, y1) is the top left coordinates, and "
+ "(x2, y2) is the bottom right coordinates.");
+ AddOutput("Out",
+ "(Tensor), "
+ "The output of ROIAlignOp is a 4-D tensor with shape "
+ "(num_rois, channels, pooled_h, pooled_w).");
+ AddAttr<float>("spatial_scale",
+ "(float, default 1.0), "
+ "Multiplicative spatial scale factor "
+ "to translate ROI coords from their input scale "
+ "to the scale used when pooling.")
+ .SetDefault(1.0);
+ AddAttr<int>("pooled_height",
+ "(int, default 1), "
+ "The pooled output height.")
+ .SetDefault(1);
+ AddAttr<int>("pooled_width",
+ "(int, default 1), "
+ "The pooled output width.")
+ .SetDefault(1);
+ AddAttr<int>("sampling_ratio",
+ "(int, default -1), "
+ "number of sampling points in the interpolation grid. "
+ "If <= 0, grid points are adaptive to roi_width "
+ "and pooled_w, likewise for height.")
+ .SetDefault(-1);
+ AddComment(R"DOC(
+**RoIAlign Operator**
+
+Region of interest align (also known as RoI align) performs bilinear
+interpolation on inputs of nonuniform sizes to obtain fixed-size feature
+maps (e.g. 7*7).
+
+The operator divides each region proposal into equal-sized sections
+according to pooled_width and pooled_height, keeping the proposal at its
+original, unquantized location.
+
+In each RoI bin, the values of the four regularly sampled locations are
+computed directly through bilinear interpolation, and the output is the
+mean of the four locations. This avoids the misalignment problem.
+ )DOC");
+ }
+};
+
+} // namespace operators
+} // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OPERATOR(roi_align, ops::ROIAlignOp, ops::ROIAlignOpMaker,
+ paddle::framework::DefaultGradOpDescMaker<true>);
+REGISTER_OPERATOR(roi_align_grad, ops::ROIAlignGradOp);
+REGISTER_OP_CPU_KERNEL(
+ roi_align,
+ ops::CPUROIAlignOpKernel<paddle::platform::CPUDeviceContext, float>,
+ ops::CPUROIAlignOpKernel<paddle::platform::CPUDeviceContext, double>);
+REGISTER_OP_CPU_KERNEL(
+ roi_align_grad,
+ ops::CPUROIAlignGradOpKernel<paddle::platform::CPUDeviceContext, float>,
+ ops::CPUROIAlignGradOpKernel<paddle::platform::CPUDeviceContext, double>);
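The core of RoI Align is the bilinear interpolation implemented by the `BilinearInterpolate` device function in the CUDA file below: each sample point in continuous coordinates is the weighted mean of its four integer neighbours, weighted by the fractional offsets. A host-side C++ rendition of the same math (illustrative, with a simplified signature):

```cpp
// Host-side mirror of the device-side bilinear interpolation used by
// RoI Align: clamp the sample point into the feature map, then blend the
// four surrounding pixels by their fractional distances.
#include <cstdio>

template <class T>
T BilinearInterpolateHost(const T* data, int height, int width, T y, T x) {
  if (y < -1.0 || y > height || x < -1.0 || x > width) return 0;  // outside
  y = y <= 0 ? 0 : y;
  x = x <= 0 ? 0 : x;
  int y_low = static_cast<int>(y);
  int x_low = static_cast<int>(x);
  int y_high, x_high;
  if (y_low >= height - 1) {
    y_high = y_low = height - 1;
    y = static_cast<T>(y_low);
  } else {
    y_high = y_low + 1;
  }
  if (x_low >= width - 1) {
    x_high = x_low = width - 1;
    x = static_cast<T>(x_low);
  } else {
    x_high = x_low + 1;
  }
  T ly = y - y_low, lx = x - x_low;  // fractional parts
  T hy = 1 - ly, hx = 1 - lx;
  return hy * hx * data[y_low * width + x_low] +    // top-left
         hy * lx * data[y_low * width + x_high] +   // top-right
         ly * hx * data[y_high * width + x_low] +   // bottom-left
         ly * lx * data[y_high * width + x_high];   // bottom-right
}

int main() {
  // 2x2 feature map; sampling at the centre averages all four values.
  const float fm[] = {1.f, 2.f, 3.f, 4.f};
  std::printf("%f\n", BilinearInterpolateHost(fm, 2, 2, 0.5f, 0.5f));  // 2.5
}
```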
diff --git a/paddle/fluid/operators/roi_align_op.cu b/paddle/fluid/operators/roi_align_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..bcec6f3563df7f4e1e48554cc891d596f9e56024 --- /dev/null +++ b/paddle/fluid/operators/roi_align_op.cu @@ -0,0 +1,353 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/fluid/operators/roi_align_op.h"
+#include "paddle/fluid/platform/cuda_primitives.h"
+
+namespace paddle {
+namespace operators {
+
+using Tensor = framework::Tensor;
+using LoDTensor = framework::LoDTensor;
+
+static constexpr int kNumCUDAThreads = 512;
+static constexpr int kNumMaximumNumBlocks = 4096;
+
+static inline int NumBlocks(const int N) {
+ return std::min((N + kNumCUDAThreads - 1) / kNumCUDAThreads,
+ kNumMaximumNumBlocks);
+}
+
+#define CUDA_1D_KERNEL_LOOP(i, n) \
+ for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); \
+ i += blockDim.x * gridDim.x)
+
+template <class T>
+__device__ T BilinearInterpolate(const T* input_data, const int height,
+ const int width, T y, T x) {
+ if (y < -1.0 || y > height || x < -1.0 || x > width) {
+ return 0;
+ }
+ y = y <= 0 ? 0 : y;
+ x = x <= 0 ? 0 : x;
+ int y_low = static_cast<int>(y);
+ int x_low = static_cast<int>(x);
+ int y_high;
+ int x_high;
+ if (y_low >= height - 1) {
+ y_high = y_low = height - 1;
+ y = static_cast<T>(y_low);
+ } else {
+ y_high = y_low + 1;
+ }
+ if (x_low >= width - 1) {
+ x_high = x_low = width - 1;
+ x = static_cast<T>(x_low);
+ } else {
+ x_high = x_low + 1;
+ }
+ T ly = y - y_low, lx = x - x_low;
+ T hy = 1. - ly, hx = 1. - lx;
+
+ T v1 = input_data[y_low * width + x_low];
+ T v2 = input_data[y_low * width + x_high];
+ T v3 = input_data[y_high * width + x_low];
+ T v4 = input_data[y_high * width + x_high];
+ T w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;
+
+ T val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);
+ return val;
+}
+
+template <class T>
+__device__ void BilinearInterpolateGradient(const int height, const int width,
+ T y, T x, T* w1, T* w2, T* w3,
+ T* w4, int* x_low, int* x_high,
+ int* y_low, int* y_high) {
+ if (y < -1.0 || y > height || x < -1.0 || x > width) {
+ return;
+ }
+
+ y = y <= 0 ? 0 : y;
+ x = x <= 0 ? 0 : x;
+ *y_low = static_cast<int>(y);
+ *x_low = static_cast<int>(x);
+ if (*y_low >= height - 1) {
+ *y_high = *y_low = height - 1;
+ y = static_cast<T>(*y_low);
+ } else {
+ *y_high = *y_low + 1;
+ }
+ if (*x_low >= width - 1) {
+ *x_high = *x_low = width - 1;
+ x = static_cast<T>(*x_low);
+ } else {
+ *x_high = *x_low + 1;
+ }
+ T ly = y - *y_low, lx = x - *x_low;
+ T hy = 1. - ly, hx = 1.
+
+template <class T>
+__global__ void GPUROIAlignForward(
+    const int nthreads, const T* input_data, const T* input_rois,
+    const float spatial_scale, const int channels, const int height,
+    const int width, const int pooled_height, const int pooled_width,
+    const int sampling_ratio, int* roi_batch_id_data, T* output_data) {
+  CUDA_1D_KERNEL_LOOP(i, nthreads) {
+    int pw = i % pooled_width;
+    int ph = (i / pooled_width) % pooled_height;
+    int c = (i / pooled_width / pooled_height) % channels;
+    int n = i / pooled_width / pooled_height / channels;
+
+    const T* offset_input_rois = input_rois + n * kROISize;
+    int roi_batch_ind = roi_batch_id_data[n];
+
+    T roi_xmin = offset_input_rois[0] * spatial_scale;
+    T roi_ymin = offset_input_rois[1] * spatial_scale;
+    T roi_xmax = offset_input_rois[2] * spatial_scale;
+    T roi_ymax = offset_input_rois[3] * spatial_scale;
+
+    T roi_width = max(roi_xmax - roi_xmin, static_cast<T>(1.));
+    T roi_height = max(roi_ymax - roi_ymin, static_cast<T>(1.));
+    T bin_size_h = static_cast<T>(roi_height) / static_cast<T>(pooled_height);
+    T bin_size_w = static_cast<T>(roi_width) / static_cast<T>(pooled_width);
+
+    const T* offset_input_data =
+        input_data + (roi_batch_ind * channels + c) * height * width;
+
+    int roi_bin_grid_h = (sampling_ratio > 0)
+                             ? sampling_ratio
+                             : ceil(roi_height / pooled_height);
+    int roi_bin_grid_w =
+        (sampling_ratio > 0) ? sampling_ratio : ceil(roi_width / pooled_width);
+    const T count = roi_bin_grid_h * roi_bin_grid_w;
+    T output_val = 0;
+    for (int iy = 0; iy < roi_bin_grid_h; iy++) {
+      const T y = roi_ymin + ph * bin_size_h +
+                  static_cast<T>(iy + .5f) * bin_size_h /
+                      static_cast<T>(roi_bin_grid_h);
+      for (int ix = 0; ix < roi_bin_grid_w; ix++) {
+        const T x = roi_xmin + pw * bin_size_w +
+                    static_cast<T>(ix + .5f) * bin_size_w /
+                        static_cast<T>(roi_bin_grid_w);
+        T val = BilinearInterpolate(offset_input_data, height, width, y, x);
+        output_val += val;
+      }
+    }
+    output_val /= count;
+    output_data[i] = output_val;
+  }
+}
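Each thread decodes its (n, c, ph, pw) coordinates from the flat output index with repeated div/mod. A tiny host-side round-trip check of that decomposition (the sizes are illustrative):

```cpp
#include <cassert>

int main() {
  const int channels = 3, pooled_height = 7, pooled_width = 7;
  for (int i = 0; i < 2 * channels * pooled_height * pooled_width; ++i) {
    int pw = i % pooled_width;
    int ph = (i / pooled_width) % pooled_height;
    int c = (i / pooled_width / pooled_height) % channels;
    int n = i / pooled_width / pooled_height / channels;
    // Recomposing the coordinates must reproduce the flat index.
    assert(((n * channels + c) * pooled_height + ph) * pooled_width + pw == i);
  }
  return 0;
}
```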
+
+template <class T>
+__global__ void GPUROIAlignBackward(const int nthreads, const T* input_rois,
+                                    const T* out_grad, const int num_rois,
+                                    const float spatial_scale,
+                                    const int channels, const int height,
+                                    const int width, const int pooled_height,
+                                    const int pooled_width,
+                                    const int sampling_ratio,
+                                    int* roi_batch_id_data, T* input_grad) {
+  CUDA_1D_KERNEL_LOOP(i, nthreads) {
+    int pw = i % pooled_width;
+    int ph = (i / pooled_width) % pooled_height;
+    int c = (i / pooled_width / pooled_height) % channels;
+    int n = i / pooled_width / pooled_height / channels;
+    const T* offset_input_rois = input_rois + n * kROISize;
+    int roi_batch_ind = roi_batch_id_data[n];
+
+    T roi_xmin = offset_input_rois[0] * spatial_scale;
+    T roi_ymin = offset_input_rois[1] * spatial_scale;
+    T roi_xmax = offset_input_rois[2] * spatial_scale;
+    T roi_ymax = offset_input_rois[3] * spatial_scale;
+
+    T roi_width = max(roi_xmax - roi_xmin, static_cast<T>(1.));
+    T roi_height = max(roi_ymax - roi_ymin, static_cast<T>(1.));
+    T bin_size_h = static_cast<T>(roi_height) / static_cast<T>(pooled_height);
+    T bin_size_w = static_cast<T>(roi_width) / static_cast<T>(pooled_width);
+
+    T* offset_input_grad =
+        input_grad + (roi_batch_ind * channels + c) * height * width;
+
+    const T* offset_out_grad =
+        out_grad + (n * channels + c) * pooled_height * pooled_width;
+    const T out_grad_this_bin = offset_out_grad[ph * pooled_width + pw];
+
+    int roi_bin_grid_h = (sampling_ratio > 0)
+                             ? sampling_ratio
+                             : ceil(roi_height / pooled_height);
+    int roi_bin_grid_w =
+        (sampling_ratio > 0) ? sampling_ratio : ceil(roi_width / pooled_width);
+
+    const T count = roi_bin_grid_h * roi_bin_grid_w;
+    for (int iy = 0; iy < roi_bin_grid_h; iy++) {
+      const T y = roi_ymin + ph * bin_size_h +
+                  static_cast<T>(iy + .5f) * bin_size_h /
+                      static_cast<T>(roi_bin_grid_h);
+      for (int ix = 0; ix < roi_bin_grid_w; ix++) {
+        const T x = roi_xmin + pw * bin_size_w +
+                    static_cast<T>(ix + .5f) * bin_size_w /
+                        static_cast<T>(roi_bin_grid_w);
+        T w1 = 0, w2 = 0, w3 = 0, w4 = 0;
+        int x_low = -1, x_high = -1, y_low = -1, y_high = -1;
+        BilinearInterpolateGradient(height, width, y, x, &w1, &w2, &w3, &w4,
+                                    &x_low, &x_high, &y_low, &y_high);
+        T diff1 = out_grad_this_bin * w1 / count;
+        T diff2 = out_grad_this_bin * w2 / count;
+        T diff3 = out_grad_this_bin * w3 / count;
+        T diff4 = out_grad_this_bin * w4 / count;
+        if (x_low >= 0 && x_high >= 0 && y_low >= 0 && y_high >= 0) {
+          platform::CudaAtomicAdd(offset_input_grad + y_low * width + x_low,
+                                  diff1);
+          platform::CudaAtomicAdd(offset_input_grad + y_low * width + x_high,
+                                  diff2);
+          platform::CudaAtomicAdd(offset_input_grad + y_high * width + x_low,
+                                  diff3);
+          platform::CudaAtomicAdd(offset_input_grad + y_high * width + x_high,
+                                  diff4);
+        }
+      }
+    }
+  }
+}
+
+template <typename Place, typename T>
+class GPUROIAlignOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* in = ctx.Input<Tensor>("X");
+    auto* rois = ctx.Input<LoDTensor>("ROIs");
+    auto* out = ctx.Output<Tensor>("Out");
+
+    auto pooled_height = ctx.Attr<int>("pooled_height");
+    auto pooled_width = ctx.Attr<int>("pooled_width");
+    auto spatial_scale = ctx.Attr<float>("spatial_scale");
+    auto sampling_ratio = ctx.Attr<int>("sampling_ratio");
+
+    auto in_dims = in->dims();
+    int batch_size = in_dims[0];
+    int channels = in_dims[1];
+    int height = in_dims[2];
+    int width = in_dims[3];
+
+    int rois_num = rois->dims()[0];
+
+    if (rois_num == 0) return;
+
+    int output_size = out->numel();
+    int blocks = NumBlocks(output_size);
+    int threads = kNumCUDAThreads;
+
+    Tensor roi_batch_id_list;
+    roi_batch_id_list.Resize({rois_num});
+    int* roi_batch_id_data =
+        roi_batch_id_list.mutable_data<int>(platform::CPUPlace());
+    auto rois_lod = rois->lod().back();
+    int rois_batch_size = rois_lod.size() - 1;
+    PADDLE_ENFORCE_EQ(
+        rois_batch_size, batch_size,
+        "The rois_batch_size and imgs batch_size must be the same.");
+    int rois_num_with_lod = rois_lod[rois_batch_size];
+    PADDLE_ENFORCE_EQ(rois_num, rois_num_with_lod,
+                      "The rois_num from input and lod must be the same.");
+    for (int n = 0; n < rois_batch_size; ++n) {
+      for (size_t i = rois_lod[n]; i < rois_lod[n + 1]; ++i) {
+        roi_batch_id_data[i] = n;
+      }
+    }
+    Tensor roi_batch_id_list_gpu;
+    framework::TensorCopySync(roi_batch_id_list, ctx.GetPlace(),
+                              &roi_batch_id_list_gpu);
+    GPUROIAlignForward<
+        T><<<blocks, threads, 0, ctx.cuda_device_context().stream()>>>(
+        output_size, in->data<T>(), rois->data<T>(), spatial_scale, channels,
+        height, width, pooled_height, pooled_width, sampling_ratio,
+        roi_batch_id_list_gpu.data<int>(),
+        out->mutable_data<T>(ctx.GetPlace()));
+  }
+};
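Both `Compute` methods first derive a per-ROI batch index from the LoD of `ROIs` before launching the kernels. A self-contained sketch of that mapping (values are illustrative):

```cpp
// Illustrative only: a LoD offset vector {0, 2, 5} means image 0 owns
// ROIs [0, 2) and image 1 owns ROIs [2, 5).
#include <cstdio>
#include <vector>

int main() {
  std::vector<size_t> rois_lod = {0, 2, 5};
  int rois_batch_size = static_cast<int>(rois_lod.size()) - 1;
  std::vector<int> roi_batch_id_data(rois_lod.back());
  for (int n = 0; n < rois_batch_size; ++n) {
    for (size_t i = rois_lod[n]; i < rois_lod[n + 1]; ++i) {
      roi_batch_id_data[i] = n;  // ROI i comes from image n
    }
  }
  for (int id : roi_batch_id_data) std::printf("%d ", id);  // prints: 0 0 1 1 1
  return 0;
}
```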
+
+template <typename Place, typename T>
+class GPUROIAlignGradOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* in = ctx.Input<Tensor>("X");
+    auto* rois = ctx.Input<LoDTensor>("ROIs");
+
+    auto* out_grad = ctx.Input<Tensor>(framework::GradVarName("Out"));
+    auto* in_grad = ctx.Output<Tensor>(framework::GradVarName("X"));
+
+    auto pooled_height = ctx.Attr<int>("pooled_height");
+    auto pooled_width = ctx.Attr<int>("pooled_width");
+    auto spatial_scale = ctx.Attr<float>("spatial_scale");
+    auto sampling_ratio = ctx.Attr<int>("sampling_ratio");
+
+    int rois_num = rois->dims()[0];
+    int channels = in->dims()[1];
+    int height = in->dims()[2];
+    int width = in->dims()[3];
+
+    if (!in_grad) {
+      return;
+    }
+    Tensor roi_batch_id_list;
+    roi_batch_id_list.Resize({rois_num});
+    int* roi_batch_id_data =
+        roi_batch_id_list.mutable_data<int>(platform::CPUPlace());
+    auto rois_lod = rois->lod().back();
+    int rois_batch_size = rois_lod.size() - 1;
+    for (int n = 0; n < rois_batch_size; ++n) {
+      for (size_t i = rois_lod[n]; i < rois_lod[n + 1]; ++i) {
+        roi_batch_id_data[i] = n;
+      }
+    }
+    Tensor roi_batch_id_list_gpu;
+    framework::TensorCopySync(roi_batch_id_list, ctx.GetPlace(),
+                              &roi_batch_id_list_gpu);
+
+    in_grad->mutable_data<T>(ctx.GetPlace());
+    math::SetConstant<platform::CUDADeviceContext, T> set_zero;
+    set_zero(ctx.cuda_device_context(), in_grad, static_cast<T>(0));
+
+    int output_grad_size = out_grad->numel();
+    int blocks = NumBlocks(output_grad_size);
+    int threads = kNumCUDAThreads;
+
+    if (output_grad_size > 0) {
+      GPUROIAlignBackward<
+          T><<<blocks, threads, 0, ctx.cuda_device_context().stream()>>>(
+          output_grad_size, rois->data<T>(), out_grad->data<T>(), rois_num,
+          spatial_scale, channels, height, width, pooled_height, pooled_width,
+          sampling_ratio, roi_batch_id_list_gpu.data<int>(),
+          in_grad->mutable_data<T>(ctx.GetPlace()));
+    }
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP_CUDA_KERNEL(
+    roi_align,
+    ops::GPUROIAlignOpKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::GPUROIAlignOpKernel<paddle::platform::CUDADeviceContext, double>);
+REGISTER_OP_CUDA_KERNEL(
+    roi_align_grad,
+    ops::GPUROIAlignGradOpKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::GPUROIAlignGradOpKernel<paddle::platform::CUDADeviceContext, double>);
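In the backward kernel, many sample points (across bins, and across overlapping ROIs) can touch the same input cell, so contributions are accumulated with `platform::CudaAtomicAdd`, Paddle's wrapper over the CUDA atomic intrinsic. A stripped-down sketch of the pattern (kernel and parameter names are illustrative):

```cpp
// Illustrative only: a plain `+=` would race when two threads update the
// same input-gradient cell; atomicAdd serializes the accumulation safely.
__global__ void AccumulateGrad(int n, const int* pos, const float* diff,
                               float* input_grad) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    atomicAdd(&input_grad[pos[i]], diff[i]);
  }
}
```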
diff --git a/paddle/fluid/operators/roi_align_op.h b/paddle/fluid/operators/roi_align_op.h
new file mode 100644
index 0000000000000000000000000000000000000000..a18aee1b86283cbb48f0b804ccfc476d7cd78f3b
--- /dev/null
+++ b/paddle/fluid/operators/roi_align_op.h
@@ -0,0 +1,332 @@
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include <algorithm>
+#include <cmath>
+#include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/operators/math/math_function.h"
+
+namespace paddle {
+namespace operators {
+
+using Tensor = framework::Tensor;
+using LoDTensor = framework::LoDTensor;
+
+static constexpr int kROISize = 4;
+
+template <class T>
+void PreCalcForBilinearInterpolate(
+    const platform::DeviceContext& ctx, const int height, const int width,
+    const int pooled_height, const int pooled_width, const int iy_upper,
+    const int ix_upper, T roi_ymin, T roi_xmin, T bin_size_h, T bin_size_w,
+    int roi_bin_grid_h, int roi_bin_grid_w, Tensor* pre_pos, Tensor* pre_w) {
+  int pre_calc_index = 0;
+  int* pre_pos_data = pre_pos->mutable_data<int>(ctx.GetPlace());
+  T* pre_w_data = pre_w->mutable_data<T>(ctx.GetPlace());
+  for (int ph = 0; ph < pooled_height; ph++) {
+    for (int pw = 0; pw < pooled_width; pw++) {
+      for (int iy = 0; iy < iy_upper; iy++) {
+        // calculate y of sample points
+        T y = roi_ymin + ph * bin_size_h +
+              static_cast<T>(iy + .5f) * bin_size_h /
+                  static_cast<T>(roi_bin_grid_h);
+        // calculate x of sample points
+        for (int ix = 0; ix < ix_upper; ix++) {
+          T x = roi_xmin + pw * bin_size_w +
+                static_cast<T>(ix + .5f) * bin_size_w /
+                    static_cast<T>(roi_bin_grid_w);
+          // deal with elements out of map
+          if (y < -1.0 || y > height || x < -1.0 || x > width) {
+            for (int i = 0; i < kROISize; ++i) {
+              pre_pos_data[i + pre_calc_index * kROISize] = 0;
+              pre_w_data[i + pre_calc_index * kROISize] = 0;
+            }
+            pre_calc_index += 1;
+            continue;
+          }
+          y = y <= 0 ? 0 : y;
+          x = x <= 0 ? 0 : x;
+
+          int y_low = static_cast<int>(y);
+          int x_low = static_cast<int>(x);
+          int y_high;
+          int x_high;
+          if (y_low >= height - 1) {
+            y_high = y_low = height - 1;
+            y = static_cast<T>(y_low);
+          } else {
+            y_high = y_low + 1;
+          }
+          if (x_low >= width - 1) {
+            x_high = x_low = width - 1;
+            x = static_cast<T>(x_low);
+          } else {
+            x_high = x_low + 1;
+          }
+          T ly = y - y_low, lx = x - x_low;
+          T hy = 1. - ly, hx = 1. - lx;
+          pre_pos_data[pre_calc_index * kROISize] = y_low * width + x_low;
+          pre_pos_data[pre_calc_index * kROISize + 1] = y_low * width + x_high;
+          pre_pos_data[pre_calc_index * kROISize + 2] = y_high * width + x_low;
+          pre_pos_data[pre_calc_index * kROISize + 3] =
+              y_high * width + x_high;
+          pre_w_data[pre_calc_index * kROISize] = hy * hx;
+          pre_w_data[pre_calc_index * kROISize + 1] = hy * lx;
+          pre_w_data[pre_calc_index * kROISize + 2] = ly * hx;
+          pre_w_data[pre_calc_index * kROISize + 3] = ly * lx;
+          pre_calc_index += 1;
+        }
+      }
+    }
+  }
+}
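`PreCalcForBilinearInterpolate` trades memory for speed: sample positions and weights depend only on the ROI geometry, so they are computed once per ROI and replayed for every channel. Conceptually, each cached sample point is then consumed like this (illustrative helper, not part of the patch):

```cpp
// Illustrative only: the cache stores, per sample point, the four flattened
// corner offsets (pre_pos) and their bilinear weights (pre_w), so the
// interpolation collapses to a 4-term dot product per channel.
float InterpolateFromCache(const int* pre_pos, const float* pre_w,
                           const float* channel_data, int sample_idx) {
  const int kROISize = 4;  // four corners per sample point
  float val = 0.f;
  for (int i = 0; i < kROISize; ++i) {
    val += pre_w[sample_idx * kROISize + i] *
           channel_data[pre_pos[sample_idx * kROISize + i]];
  }
  return val;
}
```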
+
+template <class T>
+void bilinear_interpolate_gradient(const int height, const int width, T y, T x,
+                                   const T out_grad_this_bin, const T count,
+                                   T* batch_grad_data) {
+  int x_low, y_low, x_high, y_high;
+  T w1, w2, w3, w4;
+  if (y < -1.0 || y > height || x < -1.0 || x > width) {
+    w1 = w2 = w3 = w4 = 0;
+    x_low = x_high = y_low = y_high = -1;
+    return;
+  }
+  y = y <= 0 ? 0 : y;
+  x = x <= 0 ? 0 : x;
+  y_low = static_cast<int>(y);
+  x_low = static_cast<int>(x);
+  if (y_low >= height - 1) {
+    y_high = y_low = height - 1;
+    y = static_cast<T>(y_low);
+  } else {
+    y_high = y_low + 1;
+  }
+
+  if (x_low >= width - 1) {
+    x_high = x_low = width - 1;
+    x = static_cast<T>(x_low);
+  } else {
+    x_high = x_low + 1;
+  }
+
+  T ly = y - y_low, lx = x - x_low;
+  T hy = 1. - ly, hx = 1. - lx;
+  w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;
+  T diff1 = out_grad_this_bin * w1 / count;
+  T diff2 = out_grad_this_bin * w2 / count;
+  T diff3 = out_grad_this_bin * w3 / count;
+  T diff4 = out_grad_this_bin * w4 / count;
+  if (x_low >= 0 && x_high >= 0 && y_low >= 0 && y_high >= 0) {
+    *(batch_grad_data + y_low * width + x_low) += diff1;
+    *(batch_grad_data + y_low * width + x_high) += diff2;
+    *(batch_grad_data + y_high * width + x_low) += diff3;
+    *(batch_grad_data + y_high * width + x_high) += diff4;
+  }
+}
+
+template <typename DeviceContext, typename T>
+class CPUROIAlignOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* in = ctx.Input<Tensor>("X");
+    auto* rois = ctx.Input<LoDTensor>("ROIs");
+    auto* out = ctx.Output<Tensor>("Out");
+    auto pooled_height = ctx.Attr<int>("pooled_height");
+    auto pooled_width = ctx.Attr<int>("pooled_width");
+    auto spatial_scale = ctx.Attr<float>("spatial_scale");
+    auto sampling_ratio = ctx.Attr<int>("sampling_ratio");
+
+    auto& dev_ctx = ctx.template device_context<DeviceContext>();
+
+    auto in_dims = in->dims();
+    int batch_size = in_dims[0];
+    int channels = in_dims[1];
+    int height = in_dims[2];
+    int width = in_dims[3];
+    int rois_num = rois->dims()[0];
+
+    auto in_stride = framework::stride(in_dims);
+    auto roi_stride = framework::stride(rois->dims());
+    auto out_stride = framework::stride(out->dims());
+
+    const T* input_data = in->data<T>();
+    framework::Tensor roi_batch_id_list;
+    roi_batch_id_list.Resize({rois_num});
+    int* roi_batch_id_data =
+        roi_batch_id_list.mutable_data<int>(ctx.GetPlace());
+
+    auto rois_lod = rois->lod().back();
+    int rois_batch_size = rois_lod.size() - 1;
+    PADDLE_ENFORCE_EQ(
+        rois_batch_size, batch_size,
+        "The rois_batch_size and imgs batch_size must be the same.");
+    int rois_num_with_lod = rois_lod[rois_batch_size];
+    PADDLE_ENFORCE_EQ(rois_num, rois_num_with_lod,
+                      "The rois_num from input and lod must be the same.");
+    for (int n = 0; n < rois_batch_size; ++n) {
+      for (size_t i = rois_lod[n]; i < rois_lod[n + 1]; ++i) {
+        roi_batch_id_data[i] = n;
+      }
+    }
+    T* output_data = out->mutable_data<T>(ctx.GetPlace());
+    const T* rois_data = rois->data<T>();
+    for (int n = 0; n < rois_num; ++n) {
+      int roi_batch_id = roi_batch_id_data[n];
+      T roi_xmin = rois_data[0] * spatial_scale;
+      T roi_ymin = rois_data[1] * spatial_scale;
+      T roi_xmax = rois_data[2] * spatial_scale;
+      T roi_ymax = rois_data[3] * spatial_scale;
+
+      T roi_width = std::max(roi_xmax - roi_xmin, static_cast<T>(1.));
+      T roi_height = std::max(roi_ymax - roi_ymin, static_cast<T>(1.));
+      T bin_size_h =
+          static_cast<T>(roi_height) / static_cast<T>(pooled_height);
+      T bin_size_w = static_cast<T>(roi_width) / static_cast<T>(pooled_width);
+      const T* batch_data = input_data + roi_batch_id * in_stride[0];
+
+      int roi_bin_grid_h = (sampling_ratio > 0)
+                               ? sampling_ratio
+                               : ceil(roi_height / pooled_height);
+      int roi_bin_grid_w = (sampling_ratio > 0)
+                               ? sampling_ratio
+                               : ceil(roi_width / pooled_width);
+      const T count = roi_bin_grid_h * roi_bin_grid_w;
+      Tensor pre_pos;
+      Tensor pre_w;
+      int pre_size = count * out_stride[1];
+      pre_pos.Resize({pre_size, kROISize});
+      pre_w.Resize({pre_size, kROISize});
+
+      PreCalcForBilinearInterpolate(
+          dev_ctx, height, width, pooled_height, pooled_width, roi_bin_grid_h,
+          roi_bin_grid_w, roi_ymin, roi_xmin, bin_size_h, bin_size_w,
+          roi_bin_grid_h, roi_bin_grid_w, &pre_pos, &pre_w);
+      const int* pre_pos_data = pre_pos.data<int>();
+      const T* pre_w_data = pre_w.data<T>();
+      for (int c = 0; c < channels; c++) {
+        int pre_calc_index = 0;
+        for (int ph = 0; ph < pooled_height; ph++) {
+          for (int pw = 0; pw < pooled_width; pw++) {
+            const int pool_index = ph * pooled_width + pw;
+            T output_val = 0;
+            for (int iy = 0; iy < roi_bin_grid_h; iy++) {
+              for (int ix = 0; ix < roi_bin_grid_w; ix++) {
+                for (int i = 0; i < kROISize; i++) {
+                  int pos = pre_pos_data[pre_calc_index * kROISize + i];
+                  T w = pre_w_data[pre_calc_index * kROISize + i];
+                  output_val += w * batch_data[pos];
+                }
+                pre_calc_index += 1;
+              }
+            }
+            output_val /= count;
+            output_data[pool_index] = output_val;
+          }
+        }
+        batch_data += in_stride[1];
+        output_data += out_stride[1];
+      }
+      rois_data += roi_stride[0];
+    }
+  }
+};
+
+template <typename DeviceContext, typename T>
+class CPUROIAlignGradOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* in = ctx.Input<Tensor>("X");
+    auto* rois = ctx.Input<LoDTensor>("ROIs");
+    auto* out_grad = ctx.Input<Tensor>(framework::GradVarName("Out"));
+    auto* in_grad = ctx.Output<Tensor>(framework::GradVarName("X"));
+
+    auto pooled_height = ctx.Attr<int>("pooled_height");
+    auto pooled_width = ctx.Attr<int>("pooled_width");
+    auto spatial_scale = ctx.Attr<float>("spatial_scale");
+    auto sampling_ratio = ctx.Attr<int>("sampling_ratio");
+    auto in_dims = in->dims();
+    if (!in_grad) {
+      return;
+    }
+    int channels = in_dims[1];
+    int height = in_dims[2];
+    int width = in_dims[3];
+    int rois_num = rois->dims()[0];
+    Tensor roi_batch_id_list;
+    roi_batch_id_list.Resize({rois_num});
+    int* roi_batch_id_data =
+        roi_batch_id_list.mutable_data<int>(ctx.GetPlace());
+
+    auto rois_lod = rois->lod().back();
+    int rois_batch_size = rois_lod.size() - 1;
+    for (int n = 0; n < rois_batch_size; ++n) {
+      for (size_t i = rois_lod[n]; i < rois_lod[n + 1]; ++i) {
+        roi_batch_id_data[i] = n;
+      }
+    }
+
+    const T* rois_data = rois->data<T>();
+    const T* out_grad_data = out_grad->data<T>();
+    T* in_grad_data = in_grad->mutable_data<T>(ctx.GetPlace());
+
+    auto in_stride = framework::stride(in->dims());
+    auto roi_stride = framework::stride(rois->dims());
+    auto out_stride = framework::stride(out_grad->dims());
+
+    for (int n = 0; n < rois_num; ++n) {
+      int roi_batch_idx = roi_batch_id_data[n];
+      T roi_xmin = rois_data[0] * spatial_scale;
+      T roi_ymin = rois_data[1] * spatial_scale;
+      T roi_xmax = rois_data[2] * spatial_scale;
+      T roi_ymax = rois_data[3] * spatial_scale;
+      T roi_width = std::max(roi_xmax - roi_xmin, static_cast<T>(1.));
+      T roi_height = std::max(roi_ymax - roi_ymin, static_cast<T>(1.));
+      T bin_size_h =
+          static_cast<T>(roi_height) / static_cast<T>(pooled_height);
+      T bin_size_w = static_cast<T>(roi_width) / static_cast<T>(pooled_width);
+      for (int c = 0; c < channels; ++c) {
+        T* batch_grad_data =
+            in_grad_data + roi_batch_idx * in_stride[0] + c * in_stride[1];
+        const T* batch_out_grad_data =
+            out_grad_data + n * out_stride[0] + c * out_stride[1];
+        for (int ph = 0; ph < pooled_height; ++ph) {
+          for (int pw = 0; pw < pooled_width; ++pw) {
+            int pool_index = ph * pooled_width + pw;
+            T out_grad_this_bin = batch_out_grad_data[pool_index];
+            int roi_bin_grid_h = (sampling_ratio > 0)
+                                     ? sampling_ratio
+                                     : ceil(roi_height / pooled_height);
+            int roi_bin_grid_w = (sampling_ratio > 0)
+                                     ? sampling_ratio
+                                     : ceil(roi_width / pooled_width);
+            T count = roi_bin_grid_h * roi_bin_grid_w;
+            for (int iy = 0; iy < roi_bin_grid_h; iy++) {
+              const T y = roi_ymin + ph * bin_size_h +
+                          static_cast<T>(iy + .5f) * bin_size_h /
+                              static_cast<T>(roi_bin_grid_h);
+              for (int ix = 0; ix < roi_bin_grid_w; ix++) {
+                const T x = roi_xmin + pw * bin_size_w +
+                            static_cast<T>(ix + .5f) * bin_size_w /
+                                static_cast<T>(roi_bin_grid_w);
+                bilinear_interpolate_gradient(height, width, y, x,
+                                              out_grad_this_bin, count,
+                                              batch_grad_data);
+              }
+            }
+          }
+        }
+      }
+      rois_data += roi_stride[0];
+    }
+  }
+};
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/fluid/operators/roi_pool_op.cc b/paddle/fluid/operators/roi_pool_op.cc
index d6d209d5de041500a9b4893d70800a58e8ee1e1d..3f6b2e46c7014a8c57701099fcc44c8d9e4f08e0 100644
--- a/paddle/fluid/operators/roi_pool_op.cc
+++ b/paddle/fluid/operators/roi_pool_op.cc
@@ -40,10 +40,10 @@ class ROIPoolOp : public framework::OperatorWithKernel {
                    "The format of input tensor is NCHW.");
     PADDLE_ENFORCE(rois_dims.size() == 2,
                    "ROIs should be a 2-D LoDTensor of shape (num_rois, 4)"
-                   "given as [[x1, y1, x2, y2], …].");
+                   "given as [[x1, y1, x2, y2], ...].");
     PADDLE_ENFORCE(rois_dims[1] == kROISize,
                    "ROIs should be a 2-D LoDTensor of shape (num_rois, 4)"
-                   "given as [[x1, y1, x2, y2], …].");
+                   "given as [[x1, y1, x2, y2], ...].");
 
     int pooled_height = ctx->Attrs().Get<int>("pooled_height");
     int pooled_width = ctx->Attrs().Get<int>("pooled_width");
@@ -110,7 +110,7 @@ class ROIPoolOpMaker : public framework::OpProtoAndCheckerMaker {
              "(LoDTensor), "
              "ROIs (Regions of Interest) to pool over. "
              "should be a 2-D LoDTensor of shape (num_rois, 4)"
-             "given as [[x1, y1, x2, y2], …]. "
+             "given as [[x1, y1, x2, y2], ...]. "
              "Where batch_id is the id of the data, "
              "(x1, y1) is the top left coordinates, and "
              "(x2, y2) is the bottom right coordinates.");
@@ -122,7 +122,7 @@ class ROIPoolOpMaker : public framework::OpProtoAndCheckerMaker {
               "(Tensor), "
               "Argmaxes corresponding to indices in X used "
               "for gradient computation. Only output "
-              "if arg “is_test” is false.")
+              "if arg \"is_test\" is false.")
         .AsIntermediate();
     AddAttr<float>("spatial_scale",
                    "(float, default 1.0), "
@@ -174,4 +174,4 @@ REGISTER_OP_CPU_KERNEL(
 REGISTER_OP_CPU_KERNEL(
     roi_pool_grad,
     ops::CPUROIPoolGradOpKernel<paddle::platform::CPUDeviceContext, float>,
-    ops::CPUROIPoolOpKernel<paddle::platform::CPUDeviceContext, double>);
+    ops::CPUROIPoolGradOpKernel<paddle::platform::CPUDeviceContext, double>);
diff --git a/paddle/fluid/operators/roi_pool_op.cu b/paddle/fluid/operators/roi_pool_op.cu
index 46e20285db6d7acd39dead3994409645adddf494..75c3dd6bc498e35c6249f79a1c24cfe17316670e 100644
--- a/paddle/fluid/operators/roi_pool_op.cu
+++ b/paddle/fluid/operators/roi_pool_op.cu
@@ -249,4 +249,4 @@ REGISTER_OP_CUDA_KERNEL(
 REGISTER_OP_CUDA_KERNEL(
     roi_pool_grad,
     ops::GPUROIPoolGradOpKernel<paddle::platform::CUDADeviceContext, float>,
-    ops::GPUROIPoolOpKernel<paddle::platform::CUDADeviceContext, double>);
+    ops::GPUROIPoolGradOpKernel<paddle::platform::CUDADeviceContext, double>);
diff --git a/paddle/fluid/operators/save_op.cc b/paddle/fluid/operators/save_op.cc
index e79cffcf498c52ed14db235f6221cfdf08399c9d..0dcf3f0e372f07370078553465973edfd7c96e07 100644
--- a/paddle/fluid/operators/save_op.cc
+++ b/paddle/fluid/operators/save_op.cc
@@ -110,7 +110,7 @@ class SaveOp : public framework::OperatorBase {
         lt_var != nullptr,
         "Can not find variable kLookupTablePath for SaveSelectedRows");
     std::string filename = lt_var->data();
-    VLOG(4) << "SaveSelectedRows get File name: " << filename;
+    VLOG(40) << "SaveSelectedRows get File name: " << filename;
 
     MkDirRecursively(DirName(filename).c_str());
diff --git a/paddle/fluid/operators/scale_op.cu b/paddle/fluid/operators/scale_op.cu
index 04c802da12958a53626f533833c2709110531136..349f39360b8e3100a7f844d3e2d3768053c37c58 100644
--- a/paddle/fluid/operators/scale_op.cu
+++ b/paddle/fluid/operators/scale_op.cu
@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 
 #include "paddle/fluid/operators/scale_op.h"
+#include "paddle/fluid/platform/float16.h"
+namespace plat = paddle::platform;
 
 REGISTER_OP_CUDA_KERNEL(
     scale,
@@ -20,4 +22,6 @@ REGISTER_OP_CUDA_KERNEL(
     paddle::operators::ScaleKernel<paddle::platform::CUDADeviceContext, float>,
     paddle::operators::ScaleKernel<paddle::platform::CUDADeviceContext, double>,
     paddle::operators::ScaleKernel<paddle::platform::CUDADeviceContext, int>,
     paddle::operators::ScaleKernel<paddle::platform::CUDADeviceContext,
-                                   int64_t>);
+                                   int64_t>,
+    paddle::operators::ScaleKernel<paddle::platform::CUDADeviceContext,
+                                   plat::float16>);
diff --git a/paddle/fluid/operators/scale_op.h b/paddle/fluid/operators/scale_op.h
index d8a199bc2b860515645b4954b49d8eb59fbd02dc..96b8b00b429df72569ef2a292c8a600c56159f19 100644
--- a/paddle/fluid/operators/scale_op.h
+++ b/paddle/fluid/operators/scale_op.h
@@ -24,19 +24,13 @@ class ScaleKernel : public framework::OpKernel<T> {
  public:
   virtual void Compute(const framework::ExecutionContext& ctx) const {
     auto* in_var = ctx.InputVar("X");
-    auto* in = ctx.Input<framework::Tensor>("X");
-
-    auto* out_var = ctx.OutputVar("Out");
-    auto* out = ctx.Output<framework::Tensor>("Out");
-    out->mutable_data<T>(in->place());
-
-    PADDLE_ENFORCE_EQ(in->dims(), out->dims(),
-                      "in and out should have the same dim");
+    auto* in = framework::GetLoDTensorOrSelectedRowsValueFromVar(*in_var);
 
     auto scale = static_cast<T>(ctx.Attr<float>("scale"));
     auto bias = static_cast<T>(ctx.Attr<float>("bias"));
     auto bias_after_scale = ctx.Attr<bool>("bias_after_scale");
 
+    auto* out_var = ctx.OutputVar("Out");
     if (in_var->IsType<framework::SelectedRows>() && in_var != out_var) {
       auto& in_slr = in_var->Get<framework::SelectedRows>();
       auto* out_slr = out_var->GetMutable<framework::SelectedRows>();
@@ -44,6 +38,13 @@ class ScaleKernel : public framework::OpKernel<T> {
       out_slr->set_height(in_slr.height());
     }
 
+    auto* out =
+        framework::GetMutableLoDTensorOrSelectedRowsValueFromVar(out_var);
+    out->mutable_data<T>(in->place());
+
+    PADDLE_ENFORCE_EQ(in->dims(), out->dims(),
+                      "in and out should have the same dim");
+
     auto eigen_out = framework::EigenVector<T>::Flatten(*out);
     auto eigen_in = framework::EigenVector<T>::Flatten(*in);
     auto& dev =
         *ctx.template device_context<DeviceContext>().eigen_device();
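The scale kernel computes an affine transform; as the attribute names suggest, `bias_after_scale` selects whether the bias is applied after or before scaling. In scalar form (an illustrative sketch, not the kernel itself):

```cpp
// Illustrative only: the two orderings selected by bias_after_scale.
float ScaleOp(float x, float scale, float bias, bool bias_after_scale) {
  return bias_after_scale ? scale * x + bias     // y = scale * x + bias
                          : scale * (x + bias);  // y = scale * (x + bias)
}
```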
diff --git a/paddle/fluid/operators/scatter.cu.h b/paddle/fluid/operators/scatter.cu.h
index ac7d69bfb549fd98c76fcf834e8d3ad9bec2ef23..b2e79f6c82bb748293f4219845e6798347c8c46e 100644
--- a/paddle/fluid/operators/scatter.cu.h
+++ b/paddle/fluid/operators/scatter.cu.h
@@ -51,7 +51,8 @@ void GPUScatterAssign(const platform::DeviceContext& ctx, const Tensor& src,
                       const Tensor& index, Tensor* output) {
   // PADDLE_ENFORCE(platform::is_gpu_place(place));
   // check index of shape 1-D
-  PADDLE_ENFORCE(index.dims().size() == 1);
+  PADDLE_ENFORCE(index.dims().size() == 1 ||
+                 (index.dims().size() == 2 && index.dims()[1] == 1));
   int index_size = index.dims()[0];
 
   auto src_dims = src.dims();
diff --git a/paddle/fluid/operators/scatter.h b/paddle/fluid/operators/scatter.h
index 39af717615c01f5c121e32b176b74d05be738531..8bae6606c94620ab4fa8ae34f69236e7e87e9670 100644
--- a/paddle/fluid/operators/scatter.h
+++ b/paddle/fluid/operators/scatter.h
@@ -37,7 +37,8 @@ void ScatterAssign(const platform::DeviceContext& ctx, const Tensor& src,
                    const Tensor& index, Tensor* output) {
   PADDLE_ENFORCE(platform::is_cpu_place(ctx.GetPlace()));
   // check index of shape 1-D
-  PADDLE_ENFORCE(index.dims().size() == 1);
+  PADDLE_ENFORCE(index.dims().size() == 1 ||
+                 (index.dims().size() == 2 && index.dims()[1] == 1));
   int index_size = index.dims()[0];
 
   auto src_dims = src.dims();
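For context, `ScatterAssign` copies row i of `src` into row `index[i]` of `output`; the relaxed check above additionally accepts an index tensor of shape [N, 1] rather than only [N]. A simplified CPU sketch of the semantics (illustrative, ignoring dtype and place checks):

```cpp
#include <vector>

// Illustrative only: output[index[i]] <- src[i], row by row.
void ScatterRows(const std::vector<float>& src, const std::vector<int>& index,
                 int row_width, std::vector<float>* output) {
  for (size_t i = 0; i < index.size(); ++i) {
    for (int j = 0; j < row_width; ++j) {
      (*output)[index[i] * row_width + j] = src[i * row_width + j];
    }
  }
}
```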
diff --git a/paddle/fluid/operators/scatter_test.cc b/paddle/fluid/operators/scatter_test.cc
index 750245153a7df6c4a7ce088038005dcab1685b5f..eb248e59b6ce6e5c9c04f94b21e4bc14207c39b1 100644
--- a/paddle/fluid/operators/scatter_test.cc
+++ b/paddle/fluid/operators/scatter_test.cc
@@ -21,42 +21,38 @@ limitations under the License. */
 #include "paddle/fluid/platform/place.h"
 
 TEST(scatter, ScatterUpdate) {
-  // using namespace paddle::framework;
-  // using namespace paddle::platform;
-  // using namespace paddle::operators;
-
-  paddle::framework::Tensor* src = new paddle::framework::Tensor();
-  paddle::framework::Tensor* index = new paddle::framework::Tensor();
-  paddle::framework::Tensor* output = new paddle::framework::Tensor();
-
-  float* p_src = nullptr;
-  int* p_index = nullptr;
-  p_src = src->mutable_data<float>(paddle::framework::make_ddim({1, 4}),
-                                   paddle::platform::CPUPlace());
-  p_index = index->mutable_data<int>(paddle::framework::make_ddim({1}),
-                                     paddle::platform::CPUPlace());
-
-  for (size_t i = 0; i < 4; ++i) p_src[i] = static_cast<float>(i);
+  paddle::framework::Tensor src;
+  paddle::framework::Tensor index;
+  paddle::framework::Tensor output;
+
+  auto* p_src = src.mutable_data<float>(paddle::framework::make_ddim({1, 4}),
+                                        paddle::platform::CPUPlace());
+  auto* p_index = index.mutable_data<int>(paddle::framework::make_ddim({1}),
+                                          paddle::platform::CPUPlace());
+
+  for (size_t i = 0; i < 4; ++i) {
+    p_src[i] = static_cast<float>(i);
+  }
 
   p_index[0] = 1;
-  float* p_output = output->mutable_data<float>(
+  auto* p_output = output.mutable_data<float>(
       paddle::framework::make_ddim({4, 4}), paddle::platform::CPUPlace());
 
+  for (int64_t i = 0; i < output.numel(); ++i) {
+    p_output[i] = 0;
+  }
+
   auto* cpu_place = new paddle::platform::CPUPlace();
   paddle::platform::CPUDeviceContext ctx(*cpu_place);
-  paddle::operators::ScatterAssign<float>(ctx, *src, *index, output);
+  paddle::operators::ScatterAssign<float>(ctx, src, index, &output);
 
   for (size_t i = 0; i < 4; ++i) EXPECT_EQ(p_output[i], 0.0f);
-  for (size_t i = 0; i < 4; ++i) EXPECT_EQ(output->data<float>()[i], 0.0f);
+  for (size_t i = 0; i < 4; ++i) EXPECT_EQ(output.data<float>()[i], 0.0f);
   for (size_t i = 4; i < 8; ++i) {
     EXPECT_EQ(p_output[i], static_cast<float>(i - 4));
   }
   for (size_t i = 4; i < 8; ++i)
-    EXPECT_EQ(output->data<float>()[i], static_cast<float>(i - 4));
+    EXPECT_EQ(output.data<float>()[i], static_cast<float>(i - 4));
   for (size_t i = 8; i < 16; ++i) EXPECT_EQ(p_output[i], 0.0f);
-  for (size_t i = 8; i < 16; ++i) EXPECT_EQ(output->data<float>()[i], 0.0f);
-
-  delete src;
-  delete index;
-  delete output;
+  for (size_t i = 8; i < 16; ++i) EXPECT_EQ(output.data<float>()[i], 0.0f);
 }
diff --git a/paddle/fluid/operators/selu_op.cc b/paddle/fluid/operators/selu_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..67fca18000a4fac1e2ca39fc26ebe67649a51bc3
--- /dev/null
+++ b/paddle/fluid/operators/selu_op.cc
@@ -0,0 +1,135 @@
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/fluid/operators/selu_op.h"
+#include <string>
+
+namespace paddle {
+namespace operators {
+
+class SeluOp : public framework::OperatorWithKernel {
+ public:
+  SeluOp(const std::string &type, const framework::VariableNameMap &inputs,
+         const framework::VariableNameMap &outputs,
+         const framework::AttributeMap &attrs)
+      : OperatorWithKernel(type, inputs, outputs, attrs) {}
+
+  void InferShape(framework::InferShapeContext *ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("X"),
+                   "Input(X) of SeluOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("Out"),
+                   "Output(Out) of SeluOp should not be null.");
+
+    ctx->ShareDim("X", /*->*/ "Out");
+    ctx->ShareLoD("X", /*->*/ "Out");
+  }
+
+ protected:
+  framework::OpKernelType GetExpectedKernelType(
+      const framework::ExecutionContext &ctx) const override {
+    return framework::OpKernelType(
+        framework::GetDataTypeOfVar(ctx.InputVar("X")), ctx.GetPlace());
+  }
+};
+
+class SeluOpInferVarType : public framework::PassInDtypeAndVarTypeToOutput {
+ protected:
+  std::unordered_map<std::string, std::string> GetInputOutputWithSameType()
+      const override {
+    return std::unordered_map<std::string, std::string>{{"X", /*->*/ "Out"}};
+  }
+};
+
+class SeluOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  void Make() override {
+    AddInput("X", "The input tensor of selu operator.");
+    AddOutput("Out", "The output tensor of selu operator.");
+    AddAttr<float>("scale",
+                   "(float) the default value is 1.0507~. For more "
+                   "information about this value, please refer to: "
+                   "https://arxiv.org/abs/1706.02515.")
+        .SetDefault(1.0507009873554804934193349852946);
+    AddAttr<float>("alpha",
+                   "(float) the default value is 1.6732~. For more "
+                   "information about this value, please refer to: "
+                   "https://arxiv.org/abs/1706.02515.")
+        .SetDefault(1.6732632423543772848170429916717);
+    AddComment(R"DOC(
+Selu Operator.
+
+The equation is:
+$$
+f(x) =\lambda*
+\begin{cases}
+ \quad \quad   x,  \quad \quad \quad \text{if} \ x > 0 \\
+ \alpha * e^x - \alpha,  \qquad  \text{if} \ x <= 0
+\end{cases}
+$$
+
+The input `X` can carry the LoD (Level of Details) information,
+or not, and the output shares the LoD information with input `X`.
+)DOC");
+  }
+};
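In scalar form, the operator evaluates the formula above with the attribute defaults. A minimal reference implementation (illustrative, not the kernel itself):

```cpp
#include <cmath>

// Illustrative only: scalar SELU with the paper's constants, matching the
// attribute defaults above (https://arxiv.org/abs/1706.02515).
float Selu(float x) {
  const float scale = 1.0507009873554804934193349852946f;
  const float alpha = 1.6732632423543772848170429916717f;
  return x > 0.f ? scale * x : scale * (alpha * std::exp(x) - alpha);
}
```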
+)DOC"); + } +}; + +class SeluGradMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + std::unique_ptr Apply() const override { + auto *grad_op = new framework::OpDesc(); + grad_op->SetType("selu_grad"); + grad_op->SetInput("Out", Output("Out")); + grad_op->SetInput(framework::GradVarName("Out"), OutputGrad("Out")); + grad_op->SetOutput(framework::GradVarName("X"), InputGrad("X")); + grad_op->SetAttrMap(this->Attrs()); + return std::unique_ptr(grad_op); + } +}; + +class SeluGradOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")), + "Input(Out@GRAD) should not be null"); + PADDLE_ENFORCE(ctx->HasInput("Out"), "Input(Out) should not be null"); + auto x_grad_name = framework::GradVarName("X"); + ctx->SetOutputDim(x_grad_name, ctx->GetInputDim("Out")); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override { + return framework::OpKernelType( + framework::GetDataTypeOfVar(ctx.InputVar("Out")), ctx.GetPlace()); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; + +REGISTER_OPERATOR(selu, ops::SeluOp, ops::SeluOpMaker, ops::SeluOpInferVarType, + ops::SeluGradMaker); +REGISTER_OPERATOR(selu_grad, ops::SeluGradOp); +REGISTER_OP_CPU_KERNEL( + selu, ops::SeluKernel, + ops::SeluKernel); +REGISTER_OP_CPU_KERNEL( + selu_grad, ops::SeluGradKernel, + ops::SeluGradKernel); diff --git a/paddle/fluid/operators/selu_op.cu b/paddle/fluid/operators/selu_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..fb3245ab7609ea9067709134a3713e9871dbb4d4 --- /dev/null +++ b/paddle/fluid/operators/selu_op.cu @@ -0,0 +1,22 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ +#include "paddle/fluid/operators/selu_op.h" + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + selu, ops::SeluKernel, + ops::SeluKernel); +REGISTER_OP_CUDA_KERNEL( + selu_grad, ops::SeluGradKernel, + ops::SeluGradKernel); diff --git a/paddle/fluid/operators/selu_op.h b/paddle/fluid/operators/selu_op.h new file mode 100644 index 0000000000000000000000000000000000000000..bdb506885c932708803fe8d84ee705aee0fe02b4 --- /dev/null +++ b/paddle/fluid/operators/selu_op.h @@ -0,0 +1,124 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include <cmath>
+#include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/platform/for_range.h"
+namespace paddle {
+namespace operators {
+
+static HOSTDEVICE float real_exp(float x) { return expf(x); }
+static HOSTDEVICE double real_exp(double x) { return exp(x); }
+
+template <typename T>
+struct SeluFunctor {
+  SeluFunctor(const T* x_data_ptr, float alpha, float scale, T* y_data_ptr)
+      : x_data_ptr_(x_data_ptr),
+        alpha_(alpha),
+        scale_(scale),
+        y_data_ptr_(y_data_ptr) {}
+
+  HOSTDEVICE void operator()(size_t idx) const {
+    T x_ele = x_data_ptr_[idx];
+    if (x_ele <= 0) {
+      x_ele = alpha_ * real_exp(x_ele) - alpha_;
+    }
+    y_data_ptr_[idx] = scale_ * x_ele;
+  }
+  const T* x_data_ptr_;
+  const float alpha_;
+  const float scale_;
+  T* y_data_ptr_;
+};
+
+template <typename T>
+struct SeluGradFunctor {
+  SeluGradFunctor(const T* y_data_ptr, const T* dy_data_ptr, float alpha,
+                  float scale, T* dx_data_ptr)
+      : y_data_ptr_(y_data_ptr),
+        dy_data_ptr_(dy_data_ptr),
+        alpha_(alpha),
+        scale_(scale),
+        la_(alpha * scale),
+        dx_data_ptr_(dx_data_ptr) {}
+
+  HOSTDEVICE void operator()(size_t idx) const {
+    T y_ele = y_data_ptr_[idx];
+    T dy_ele = dy_data_ptr_[idx];
+
+    float tmp = scale_;
+    if (y_ele <= 0) {
+      tmp = y_ele + la_;
+    }
+    dx_data_ptr_[idx] = dy_ele * tmp;
+  }
+  const T* y_data_ptr_;
+  const T* dy_data_ptr_;
+  const float alpha_;
+  const float scale_;
+  const float la_;
+  T* dx_data_ptr_;
+};
+
+template <typename DeviceContext, typename T>
+class SeluKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& context) const override {
+    using Tensor = framework::Tensor;
+
+    auto* x = context.Input<Tensor>("X");
+    auto* out = context.Output<Tensor>("Out");
+
+    float alpha = context.Attr<float>("alpha");
+    float scale = context.Attr<float>("scale");
+
+    auto out_ptr = out->mutable_data<T>(context.GetPlace());
+
+    SeluFunctor<T> functor(x->data<T>(), alpha, scale, out_ptr);
+
+    auto& dev_ctx = context.template device_context<DeviceContext>();
+    size_t limit = static_cast<size_t>(x->numel());
+    platform::ForRange<DeviceContext> for_range(dev_ctx, limit);
+    for_range(functor);
+  }
+};
+
+template <typename DeviceContext, typename T>
+class SeluGradKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& context) const override {
+    using Tensor = framework::Tensor;
+
+    auto* out = context.Input<Tensor>("Out");
+    auto* dout = context.Input<Tensor>(framework::GradVarName("Out"));
+    auto* dx = context.Output<Tensor>(framework::GradVarName("X"));
+
+    float alpha = context.Attr<float>("alpha");
+    float scale = context.Attr<float>("scale");
+
+    auto dx_ptr = dx->mutable_data<T>(context.GetPlace());
+
+    SeluGradFunctor<T> functor(out->data<T>(), dout->data<T>(), alpha, scale,
+                               dx_ptr);
+
+    auto& dev_ctx = context.template device_context<DeviceContext>();
+    size_t limit = static_cast<size_t>(out->numel());
+    platform::ForRange<DeviceContext> for_range(dev_ctx, limit);
+    for_range(functor);
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
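The `y_ele + la_` shortcut in `SeluGradFunctor` follows from expressing the derivative through the forward output rather than the input. The derivative of SELU is

$$
f'(x) = \lambda \begin{cases} 1, & x > 0 \\ \alpha e^{x}, & x \le 0 \end{cases}
$$

and for $x \le 0$ the forward output is $y = \lambda(\alpha e^{x} - \alpha)$, so

$$
\lambda \alpha e^{x} = y + \lambda \alpha,
$$

which is exactly `tmp = y_ele + la_` with `la_ = alpha * scale`; for $y > 0$ the factor is simply `scale`. This lets the backward pass reuse `Out` instead of keeping `X` alive.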
diff --git a/paddle/fluid/operators/sequence_ops/CMakeLists.txt b/paddle/fluid/operators/sequence_ops/CMakeLists.txt
new file mode 100644
index 0000000000000000000000000000000000000000..5d468316e8eacb73c4a4ce81c784880bb5e46c2d
--- /dev/null
+++ b/paddle/fluid/operators/sequence_ops/CMakeLists.txt
@@ -0,0 +1,2 @@
+include(operators)
+register_operators()
diff --git a/paddle/fluid/operators/sequence_concat_op.cc b/paddle/fluid/operators/sequence_ops/sequence_concat_op.cc
similarity index 98%
rename from paddle/fluid/operators/sequence_concat_op.cc
rename to paddle/fluid/operators/sequence_ops/sequence_concat_op.cc
index 3234b60861da3d0c6a8434eb11fd0488a95e171f..37f1b9dda50ba4b62d7cf75765125e0ad663d9d8 100644
--- a/paddle/fluid/operators/sequence_concat_op.cc
+++ b/paddle/fluid/operators/sequence_ops/sequence_concat_op.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 
-#include "paddle/fluid/operators/sequence_concat_op.h"
+#include "paddle/fluid/operators/sequence_ops/sequence_concat_op.h"
 #include <vector>
 
 namespace paddle {
diff --git a/paddle/fluid/operators/sequence_concat_op.cu.cc b/paddle/fluid/operators/sequence_ops/sequence_concat_op.cu.cc
similarity index 94%
rename from paddle/fluid/operators/sequence_concat_op.cu.cc
rename to paddle/fluid/operators/sequence_ops/sequence_concat_op.cu.cc
index eb6535235df80a9267b22403ae1f35c6cefb7fe7..7b8043bc4538b486bb73e005769e1585e5c4817e 100644
--- a/paddle/fluid/operators/sequence_concat_op.cu.cc
+++ b/paddle/fluid/operators/sequence_ops/sequence_concat_op.cu.cc
@@ -12,7 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 
-#include "paddle/fluid/operators/sequence_concat_op.h"
+#include "paddle/fluid/operators/sequence_ops/sequence_concat_op.h"
 
 template <typename T>
 using Kernel =
diff --git a/paddle/fluid/operators/sequence_concat_op.h b/paddle/fluid/operators/sequence_ops/sequence_concat_op.h
similarity index 97%
rename from paddle/fluid/operators/sequence_concat_op.h
rename to paddle/fluid/operators/sequence_ops/sequence_concat_op.h
index 33e9babff274af888b84d33c991cc0a5b70333ae..ff035f421c4907ba940b973b3fd2a9421ed2dbae 100644
--- a/paddle/fluid/operators/sequence_concat_op.h
+++ b/paddle/fluid/operators/sequence_ops/sequence_concat_op.h
@@ -17,7 +17,7 @@
 #include <vector>
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/operators/detail/safe_ref.h"
-#include "paddle/fluid/operators/math/concat.h"
+#include "paddle/fluid/operators/math/concat_and_split.h"
 
 namespace paddle {
 namespace operators {
@@ -106,7 +106,7 @@ class SeqConcatGradKernel : public framework::OpKernel<T> {
       }
     }
 
-    math::ConcatGradFunctor<DeviceContext, T> functor;
+    math::SplitFunctor<DeviceContext, T> functor;
     std::vector<const framework::Tensor*> sliced_x_ptr;
     std::vector<framework::Tensor*> sliced_dx_ptr;
     for (auto &x : sliced_x) {
diff --git a/paddle/fluid/operators/sequence_conv_op.cc b/paddle/fluid/operators/sequence_ops/sequence_conv_op.cc
similarity index 99%
rename from paddle/fluid/operators/sequence_conv_op.cc
rename to paddle/fluid/operators/sequence_ops/sequence_conv_op.cc
index 95a21a5d3ee6d8037431083edc25d1cddf05dedb..65cd9edbc7125f605d6fb437a2e056054eb9a6d7 100644
--- a/paddle/fluid/operators/sequence_conv_op.cc
+++ b/paddle/fluid/operators/sequence_ops/sequence_conv_op.cc
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/ -#include "paddle/fluid/operators/sequence_conv_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_conv_op.h" #include diff --git a/paddle/fluid/operators/sequence_conv_op.cu.cc b/paddle/fluid/operators/sequence_ops/sequence_conv_op.cu.cc similarity index 93% rename from paddle/fluid/operators/sequence_conv_op.cu.cc rename to paddle/fluid/operators/sequence_ops/sequence_conv_op.cu.cc index de482b7f10bafc4ac6f3838670e2da9a86374c26..600981b5e96c279329a67b608a8dd94dee7d88ef 100644 --- a/paddle/fluid/operators/sequence_conv_op.cu.cc +++ b/paddle/fluid/operators/sequence_ops/sequence_conv_op.cu.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/sequence_conv_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_conv_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/sequence_conv_op.h b/paddle/fluid/operators/sequence_ops/sequence_conv_op.h similarity index 100% rename from paddle/fluid/operators/sequence_conv_op.h rename to paddle/fluid/operators/sequence_ops/sequence_conv_op.h diff --git a/paddle/fluid/operators/sequence_enumerate_op.cc b/paddle/fluid/operators/sequence_ops/sequence_enumerate_op.cc similarity index 97% rename from paddle/fluid/operators/sequence_enumerate_op.cc rename to paddle/fluid/operators/sequence_ops/sequence_enumerate_op.cc index 58e48c228bb34814700fd0f7a3d62ef4b1a435dd..1eebadc2c980ddf1cbaaefef1568dd401d0c77ed 100644 --- a/paddle/fluid/operators/sequence_enumerate_op.cc +++ b/paddle/fluid/operators/sequence_ops/sequence_enumerate_op.cc @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. 
-#include "paddle/fluid/operators/sequence_enumerate_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_enumerate_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/sequence_enumerate_op.cu b/paddle/fluid/operators/sequence_ops/sequence_enumerate_op.cu similarity index 97% rename from paddle/fluid/operators/sequence_enumerate_op.cu rename to paddle/fluid/operators/sequence_ops/sequence_enumerate_op.cu index bdc9a615aa9a1ecd99c1f6995361f8c5ff0aa383..28821e7129c1601f1214b0b56696fbf526a2123f 100644 --- a/paddle/fluid/operators/sequence_enumerate_op.cu +++ b/paddle/fluid/operators/sequence_ops/sequence_enumerate_op.cu @@ -14,7 +14,7 @@ #include #include -#include "paddle/fluid/operators/sequence_enumerate_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_enumerate_op.h" #include "paddle/fluid/platform/cuda_primitives.h" namespace paddle { diff --git a/paddle/fluid/operators/sequence_enumerate_op.h b/paddle/fluid/operators/sequence_ops/sequence_enumerate_op.h similarity index 100% rename from paddle/fluid/operators/sequence_enumerate_op.h rename to paddle/fluid/operators/sequence_ops/sequence_enumerate_op.h diff --git a/paddle/fluid/operators/sequence_erase_op.cc b/paddle/fluid/operators/sequence_ops/sequence_erase_op.cc similarity index 97% rename from paddle/fluid/operators/sequence_erase_op.cc rename to paddle/fluid/operators/sequence_ops/sequence_erase_op.cc index 816ba123a6cbf84ec9b321d5d7cfef7fab9749b1..ddda80ee0824e261b0d737f86e03866d5fdfd77a 100644 --- a/paddle/fluid/operators/sequence_erase_op.cc +++ b/paddle/fluid/operators/sequence_ops/sequence_erase_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/sequence_erase_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_erase_op.h" #include namespace paddle { diff --git a/paddle/fluid/operators/sequence_erase_op.cu b/paddle/fluid/operators/sequence_ops/sequence_erase_op.cu similarity index 98% rename from paddle/fluid/operators/sequence_erase_op.cu rename to paddle/fluid/operators/sequence_ops/sequence_erase_op.cu index 3a58e47f1132cd1ac85584b2470e8c6cddcfb28a..619c40dbd10ad6b538f2d4e3567966b222fc5e2d 100644 --- a/paddle/fluid/operators/sequence_erase_op.cu +++ b/paddle/fluid/operators/sequence_ops/sequence_erase_op.cu @@ -14,7 +14,7 @@ limitations under the License. 
*/ #include #include -#include "paddle/fluid/operators/sequence_erase_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_erase_op.h" #include "paddle/fluid/platform/cuda_primitives.h" namespace paddle { diff --git a/paddle/fluid/operators/sequence_erase_op.h b/paddle/fluid/operators/sequence_ops/sequence_erase_op.h similarity index 100% rename from paddle/fluid/operators/sequence_erase_op.h rename to paddle/fluid/operators/sequence_ops/sequence_erase_op.h diff --git a/paddle/fluid/operators/sequence_expand_as_op.cc b/paddle/fluid/operators/sequence_ops/sequence_expand_as_op.cc similarity index 98% rename from paddle/fluid/operators/sequence_expand_as_op.cc rename to paddle/fluid/operators/sequence_ops/sequence_expand_as_op.cc index 33c1e1c973c80ba3943924331380d35b225ac800..3b79d0c71975bb740b4085ce80f7d95b65f600c1 100644 --- a/paddle/fluid/operators/sequence_expand_as_op.cc +++ b/paddle/fluid/operators/sequence_ops/sequence_expand_as_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/sequence_expand_as_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_expand_as_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/sequence_expand_as_op.cu b/paddle/fluid/operators/sequence_ops/sequence_expand_as_op.cu similarity index 98% rename from paddle/fluid/operators/sequence_expand_as_op.cu rename to paddle/fluid/operators/sequence_ops/sequence_expand_as_op.cu index 7357f5ae6e732f28307af65d1f1b6b3cbed1f640..998bf82ab1ddcd815491de95a3f7cf987036ee65 100644 --- a/paddle/fluid/operators/sequence_expand_as_op.cu +++ b/paddle/fluid/operators/sequence_ops/sequence_expand_as_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #include -#include "paddle/fluid/operators/sequence_expand_as_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_expand_as_op.h" #include "paddle/fluid/platform/cuda_primitives.h" namespace paddle { diff --git a/paddle/fluid/operators/sequence_expand_as_op.h b/paddle/fluid/operators/sequence_ops/sequence_expand_as_op.h similarity index 100% rename from paddle/fluid/operators/sequence_expand_as_op.h rename to paddle/fluid/operators/sequence_ops/sequence_expand_as_op.h diff --git a/paddle/fluid/operators/sequence_expand_op.cc b/paddle/fluid/operators/sequence_ops/sequence_expand_op.cc similarity index 99% rename from paddle/fluid/operators/sequence_expand_op.cc rename to paddle/fluid/operators/sequence_ops/sequence_expand_op.cc index 944c7f85e5f43679e1875fcce813382be2ba5526..c07e6962e673ceb274ef31cbf492f378ae696137 100644 --- a/paddle/fluid/operators/sequence_expand_op.cc +++ b/paddle/fluid/operators/sequence_ops/sequence_expand_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/sequence_expand_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_expand_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/sequence_expand_op.cu b/paddle/fluid/operators/sequence_ops/sequence_expand_op.cu similarity index 98% rename from paddle/fluid/operators/sequence_expand_op.cu rename to paddle/fluid/operators/sequence_ops/sequence_expand_op.cu index 550677b22694085059e914678a5361d914b455bc..afc08c7b3f6596efd3b6e0b74c17aa3c9268c47d 100644 --- a/paddle/fluid/operators/sequence_expand_op.cu +++ b/paddle/fluid/operators/sequence_ops/sequence_expand_op.cu @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. */ #include -#include "paddle/fluid/operators/sequence_expand_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_expand_op.h" #include "paddle/fluid/platform/cuda_primitives.h" namespace paddle { diff --git a/paddle/fluid/operators/sequence_expand_op.h b/paddle/fluid/operators/sequence_ops/sequence_expand_op.h similarity index 100% rename from paddle/fluid/operators/sequence_expand_op.h rename to paddle/fluid/operators/sequence_ops/sequence_expand_op.h diff --git a/paddle/fluid/operators/sequence_mask_op.cc b/paddle/fluid/operators/sequence_ops/sequence_mask_op.cc similarity index 95% rename from paddle/fluid/operators/sequence_mask_op.cc rename to paddle/fluid/operators/sequence_ops/sequence_mask_op.cc index 798211f481659eb71248f7a6210e6522273d387f..7fc506aab4d3c6861282b68b09fdcb5fd8055f77 100644 --- a/paddle/fluid/operators/sequence_mask_op.cc +++ b/paddle/fluid/operators/sequence_ops/sequence_mask_op.cc @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/operators/sequence_mask_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_mask_op.h" REGISTER_OPERATOR(sequence_mask, paddle::operators::SequenceMaskOp, paddle::operators::SequenceMaskOpMaker, diff --git a/paddle/fluid/operators/sequence_mask_op.cu b/paddle/fluid/operators/sequence_ops/sequence_mask_op.cu similarity index 94% rename from paddle/fluid/operators/sequence_mask_op.cu rename to paddle/fluid/operators/sequence_ops/sequence_mask_op.cu index 2ad23774579533b62b9189c1564ad7c7db5c298a..e963ce610e2c147d66087a1df59f67a04d899ccc 100644 --- a/paddle/fluid/operators/sequence_mask_op.cu +++ b/paddle/fluid/operators/sequence_ops/sequence_mask_op.cu @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. 
-#include "paddle/fluid/operators/sequence_mask_op.h"
+#include "paddle/fluid/operators/sequence_ops/sequence_mask_op.h"
 
 REGISTER_OP_CUDA_KERNEL(
     sequence_mask,
diff --git a/paddle/fluid/operators/sequence_mask_op.h b/paddle/fluid/operators/sequence_ops/sequence_mask_op.h
similarity index 99%
rename from paddle/fluid/operators/sequence_mask_op.h
rename to paddle/fluid/operators/sequence_ops/sequence_mask_op.h
index 18acb735cecabd1e01f7821c880fd8ed5e52971f..7ff68f9c715e4c7243afe9de84af9474e7e4e260 100644
--- a/paddle/fluid/operators/sequence_mask_op.h
+++ b/paddle/fluid/operators/sequence_ops/sequence_mask_op.h
@@ -127,7 +127,7 @@ class SequenceMaskKernel : public framework::OpKernel<T> {
     auto x_numel = x->numel();
     if (maxlen < 0) {
 #ifdef __NVCC__
-      VLOG(10)
+      VLOG(100)
           << "SequenceMaskOp on GPU may be slow when maxlen is not provided.";
       maxlen = static_cast<int>(
           thrust::reduce(thrust::device_pointer_cast(x_data),
diff --git a/paddle/fluid/operators/sequence_pad_op.cc b/paddle/fluid/operators/sequence_ops/sequence_pad_op.cc
similarity index 99%
rename from paddle/fluid/operators/sequence_pad_op.cc
rename to paddle/fluid/operators/sequence_ops/sequence_pad_op.cc
index 4583b26256ba2e084bf7477c54d468df860d9b43..23c7bf7cea830bb0ccf5e81f99130043c2d5f80b 100644
--- a/paddle/fluid/operators/sequence_pad_op.cc
+++ b/paddle/fluid/operators/sequence_ops/sequence_pad_op.cc
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#include "paddle/fluid/operators/sequence_pad_op.h"
+#include "paddle/fluid/operators/sequence_ops/sequence_pad_op.h"
 
 namespace paddle {
 namespace operators {
diff --git a/paddle/fluid/operators/sequence_pad_op.cu b/paddle/fluid/operators/sequence_ops/sequence_pad_op.cu
similarity index 95%
rename from paddle/fluid/operators/sequence_pad_op.cu
rename to paddle/fluid/operators/sequence_ops/sequence_pad_op.cu
index ff8f81a2f0ec4a72befc3be2a5fc48c3a586c824..7fc64a530ef5442ae927faac96ad92a4126febcd 100644
--- a/paddle/fluid/operators/sequence_pad_op.cu
+++ b/paddle/fluid/operators/sequence_ops/sequence_pad_op.cu
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#include "paddle/fluid/operators/sequence_pad_op.h"
+#include "paddle/fluid/operators/sequence_ops/sequence_pad_op.h"
 
 namespace ops = paddle::operators;
 REGISTER_OP_CUDA_KERNEL(
diff --git a/paddle/fluid/operators/sequence_pad_op.h b/paddle/fluid/operators/sequence_ops/sequence_pad_op.h
similarity index 100%
rename from paddle/fluid/operators/sequence_pad_op.h
rename to paddle/fluid/operators/sequence_ops/sequence_pad_op.h
diff --git a/paddle/fluid/operators/sequence_pool_op.cc b/paddle/fluid/operators/sequence_ops/sequence_pool_op.cc
similarity index 95%
rename from paddle/fluid/operators/sequence_pool_op.cc
rename to paddle/fluid/operators/sequence_ops/sequence_pool_op.cc
index 15d3f064eb7b025dc9a85b2aabad24186061cbd4..44b09bf7c2c776cdc455a8706cb2b2251f3be509 100644
--- a/paddle/fluid/operators/sequence_pool_op.cc
+++ b/paddle/fluid/operators/sequence_ops/sequence_pool_op.cc
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#include "paddle/fluid/operators/sequence_pool_op.h"
+#include "paddle/fluid/operators/sequence_ops/sequence_pool_op.h"
 #include <string>
 
 namespace paddle {
@@ -47,6 +47,10 @@ class SequencePoolOpMaker : public framework::OpProtoAndCheckerMaker {
               "(Tensor) This tensor is used for the sequence max-pooling "
               "to record the max indexes.")
         .AsIntermediate();
+    AddAttr<bool>("is_test",
+                  "(bool, default false) Set to true for inference only, false "
+                  "for training. Some layers may run faster when this is true.")
+        .SetDefault(false);
     AddAttr<std::string>(
         "pooltype",
         "(string, default 'AVERAGE') the pooling pooltype of SequencePoolOp.")
diff --git a/paddle/fluid/operators/sequence_pool_op.cu b/paddle/fluid/operators/sequence_ops/sequence_pool_op.cu
similarity index 93%
rename from paddle/fluid/operators/sequence_pool_op.cu
rename to paddle/fluid/operators/sequence_ops/sequence_pool_op.cu
index 2bf0697af3c74ee922a832fecaa2cd2399a06849..63cd47a38a0ff6413c430c6be6284c5f4bfc2595 100644
--- a/paddle/fluid/operators/sequence_pool_op.cu
+++ b/paddle/fluid/operators/sequence_ops/sequence_pool_op.cu
@@ -14,7 +14,7 @@ limitations under the License. */
 
 #define EIGEN_USE_GPU
 
-#include "paddle/fluid/operators/sequence_pool_op.h"
+#include "paddle/fluid/operators/sequence_ops/sequence_pool_op.h"
 
 namespace ops = paddle::operators;
 REGISTER_OP_CUDA_KERNEL(
diff --git a/paddle/fluid/operators/sequence_pool_op.h b/paddle/fluid/operators/sequence_ops/sequence_pool_op.h
similarity index 89%
rename from paddle/fluid/operators/sequence_pool_op.h
rename to paddle/fluid/operators/sequence_ops/sequence_pool_op.h
index 2aa20792f24305a106c500a3d7a6e3d363bc31d8..f2e4a55dee49664b2fc09813f6dba5f68aaf11d5 100644
--- a/paddle/fluid/operators/sequence_pool_op.h
+++ b/paddle/fluid/operators/sequence_ops/sequence_pool_op.h
@@ -32,10 +32,6 @@ class SequencePoolKernel : public framework::OpKernel<T> {
     auto* in = context.Input<LoDTensor>("X");
     auto* out = context.Output<Tensor>("Out");
     std::string pooltype = context.Attr<std::string>("pooltype");
-    Tensor* index = nullptr;
-    if (pooltype == "MAX") {
-      index = context.Output<Tensor>("MaxIndex");
-    }
 
     auto dims = in->dims();
     auto lod = in->lod();
@@ -48,13 +44,22 @@ class SequencePoolKernel : public framework::OpKernel<T> {
     dims[0] = lod[0].size() - 1;
     out->Resize({dims});
     out->mutable_data<T>(context.GetPlace());
-    if (pooltype == "MAX") {
+    Tensor* index = nullptr;
+
+    const bool is_test = context.Attr<bool>("is_test");
+
+    // Do not create index buffer for inference (is_test) mode
+    // TODO(jczaja): Skip index buffer creation for other devices eg. GPU
+    if (pooltype == "MAX" &&
+        (is_test == false ||
+         platform::is_cpu_place(context.GetPlace()) == false)) {
+      index = context.Output<Tensor>("MaxIndex");
       index->Resize({dims});
       index->mutable_data<int>(context.GetPlace());
     }
     math::SequencePoolFunctor<DeviceContext, T> pool;
     pool(context.template device_context<DeviceContext>(), pooltype, *in, out,
-         index);
+         is_test, index);
   }
 };
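The `is_test` switch lets CPU inference skip allocating the MaxIndex buffer, which only exists so the backward pass can replay recorded argmax positions; forward-only inference never reads it. A minimal sketch of MAX sequence pooling over LoD offsets (1-D features for brevity; illustrative code, not the operator's actual implementation):

```cpp
#include <vector>

// Illustrative only: per-sequence max pooling with optional argmax recording.
// Pass index == nullptr to mimic the is_test fast path above.
void MaxPoolSequences(const std::vector<float>& x,
                      const std::vector<size_t>& lod,  // e.g. {0, 2, 5}
                      std::vector<float>* out, std::vector<int>* index) {
  for (size_t s = 0; s + 1 < lod.size(); ++s) {
    size_t best = lod[s];
    for (size_t i = lod[s]; i < lod[s + 1]; ++i) {
      if (x[i] > x[best]) best = i;
    }
    out->push_back(x[best]);
    if (index) index->push_back(static_cast<int>(best));
  }
}
```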
-#include "paddle/fluid/operators/sequence_reshape_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_reshape_op.h" #include "paddle/fluid/framework/ddim.h" namespace paddle { diff --git a/paddle/fluid/operators/sequence_reshape_op.cu b/paddle/fluid/operators/sequence_ops/sequence_reshape_op.cu similarity index 95% rename from paddle/fluid/operators/sequence_reshape_op.cu rename to paddle/fluid/operators/sequence_ops/sequence_reshape_op.cu index 232e031c0b022497d9e5141750dbf8fccffc7615..38bc599165d5f84f67e2fe08bf96ebef4b03d8a4 100644 --- a/paddle/fluid/operators/sequence_reshape_op.cu +++ b/paddle/fluid/operators/sequence_ops/sequence_reshape_op.cu @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/sequence_reshape_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_reshape_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/sequence_reshape_op.h b/paddle/fluid/operators/sequence_ops/sequence_reshape_op.h similarity index 100% rename from paddle/fluid/operators/sequence_reshape_op.h rename to paddle/fluid/operators/sequence_ops/sequence_reshape_op.h diff --git a/paddle/fluid/operators/sequence_ops/sequence_reverse_op.cc b/paddle/fluid/operators/sequence_ops/sequence_reverse_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..dfbbf5f156983189ac1ab82fbff51d7eb4844f9a --- /dev/null +++ b/paddle/fluid/operators/sequence_ops/sequence_reverse_op.cc @@ -0,0 +1,29 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/operators/sequence_ops/sequence_reverse_op.h" + +namespace ops = paddle::operators; + +REGISTER_OPERATOR(sequence_reverse, ops::SequenceReverseOp, + ops::SequenceReverseOpMaker, + ops::SequenceReverseGradOpDescMaker); + +REGISTER_OP_CPU_KERNEL( + sequence_reverse, + ops::SequenceReverseOpKernel, + ops::SequenceReverseOpKernel, + ops::SequenceReverseOpKernel, + ops::SequenceReverseOpKernel, + ops::SequenceReverseOpKernel); diff --git a/paddle/fluid/operators/sequence_ops/sequence_reverse_op.cu b/paddle/fluid/operators/sequence_ops/sequence_reverse_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..0a59ed7f9fee07bc3b12909973535f31ef049a4a --- /dev/null +++ b/paddle/fluid/operators/sequence_ops/sequence_reverse_op.cu @@ -0,0 +1,25 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/operators/sequence_ops/sequence_reverse_op.h" + +namespace ops = paddle::operators; + +REGISTER_OP_CUDA_KERNEL( + sequence_reverse, + ops::SequenceReverseOpKernel, + ops::SequenceReverseOpKernel, + ops::SequenceReverseOpKernel, + ops::SequenceReverseOpKernel, + ops::SequenceReverseOpKernel); diff --git a/paddle/fluid/operators/sequence_ops/sequence_reverse_op.h b/paddle/fluid/operators/sequence_ops/sequence_reverse_op.h new file mode 100644 index 0000000000000000000000000000000000000000..39dad2311b2bcf29f808723caf7bfaef4c88cef2 --- /dev/null +++ b/paddle/fluid/operators/sequence_ops/sequence_reverse_op.h @@ -0,0 +1,157 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/algorithm.h" +#include "paddle/fluid/platform/for_range.h" + +namespace paddle { +namespace operators { + +class SequenceReverseOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) must exist"); + PADDLE_ENFORCE(ctx->HasOutput("Y"), "Output(Y) must exist"); + + auto x_dim = ctx->GetInputDim("X"); + PADDLE_ENFORCE_GE(x_dim.size(), 2, + "Rank of Input(X) must be not less than 2."); + + ctx->SetOutputDim("Y", x_dim); + ctx->ShareLoD("X", "Y"); + } +}; + +class SequenceReverseOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "The input LoDTensor of sequence_reverse op."); + AddOutput("Y", "The output LoDTensor of sequence_reverse op."); + AddComment(R"DOC( +SequenceReverse Operator. + +Reverse each sequence in input X along dim 0. + +Assuming X is a LoDTensor with dims [5, 4] and lod [[0, 2, 5]], where: + +X.data() = [ + [1, 2, 3, 4], + [5, 6, 7, 8], # the 0-th sequence with length 2 + [9, 10, 11, 12], + [13, 14, 15, 16], + [17, 18, 19, 20] # the 1-st sequence with length 3 +] + +The output Y would be a LoDTensor sharing the same dims and lod with input X, +and: + +Y.data() = [ + [5, 6, 7, 8], + [1, 2, 3, 4], # the reversed 0-th sequence with length 2 + [17, 18, 19, 20], + [13, 14, 15, 16], + [9, 10, 11, 12] # the reversed 1-st sequence with length 3 +] + +This Operator is useful to build a reverse dynamic RNN network. + +This Operator only supports one-level lod currently. 
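A clarifying sketch of the row mapping that realizes the example above. The actual SequenceReverseFunctor, defined just after this comment, locates the enclosing sequence with math::UpperBound (a binary search); this illustrative helper scans linearly:

    #include <cstddef>
    #include <vector>

    // Row r belongs to sequence k when lod[k] <= r < lod[k+1]; its
    // reversed position mirrors r inside that sequence.
    size_t ReversedRow(const std::vector<size_t>& lod, size_t r) {
      size_t k = 0;
      while (lod[k + 1] <= r) ++k;
      return lod[k] + (lod[k + 1] - 1 - r);
    }
    // With lod = {0, 2, 5}: rows 0,1 map to 1,0 and rows 2,3,4 map to
    // 4,3,2, reproducing the Y.data() layout shown above.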
+ )DOC"); + } +}; + +template +struct SequenceReverseFunctor { + SequenceReverseFunctor(const T *x, T *y, const size_t *lod, size_t lod_count, + size_t row_numel) + : x_(x), y_(y), lod_(lod), lod_count_(lod_count), row_numel_(row_numel) {} + + HOSTDEVICE void operator()(size_t idx_x) const { + auto row_idx_x = idx_x / row_numel_; + auto lod_idx = math::UpperBound(lod_, lod_count_, row_idx_x); + auto row_idx_y = lod_[lod_idx - 1] + (lod_[lod_idx] - 1 - row_idx_x); + auto idx_y = row_idx_y * row_numel_ + idx_x % row_numel_; + y_[idx_y] = x_[idx_x]; + } + + const T *x_; + T *y_; + const size_t *lod_; + size_t lod_count_; + size_t row_numel_; +}; + +template +class SequenceReverseOpKernel : public framework::OpKernel { + using LoDTensor = framework::LoDTensor; + + public: + void Compute(const framework::ExecutionContext &ctx) const override { + auto &x = *ctx.Input("X"); + auto *y = ctx.Output("Y"); + + PADDLE_ENFORCE_EQ(x.lod().size(), 1, + "SequenceReverse Op only support one level lod."); + + auto &dev_ctx = ctx.template device_context(); + const size_t *lod; + size_t lod_count = x.lod()[0].size(); + +#ifdef PADDLE_WITH_CUDA + if (platform::is_gpu_place(ctx.GetPlace())) { + lod = x.lod()[0].CUDAData(ctx.GetPlace()); + } else { +#endif + lod = x.lod()[0].data(); +#ifdef PADDLE_WITH_CUDA + } +#endif + + size_t limit = static_cast(x.numel()); + size_t row_numel = static_cast(limit / x.dims()[0]); + auto *x_data = x.data(); + auto *y_data = y->mutable_data(ctx.GetPlace()); + + PADDLE_ENFORCE_NE(x_data, y_data, + "SequenceReverse Op does not support in-place operation"); + + SequenceReverseFunctor functor(x_data, y_data, lod, lod_count, + row_numel); + platform::ForRange for_range(dev_ctx, limit); + for_range(functor); + } +}; + +class SequenceReverseGradOpDescMaker : public framework::SingleGradOpDescMaker { + public: + using framework::SingleGradOpDescMaker::SingleGradOpDescMaker; + + protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new framework::OpDesc()); + op->SetType("sequence_reverse"); + op->SetInput("X", OutputGrad("Y")); + op->SetOutput("Y", InputGrad("X")); + op->SetAttrMap(Attrs()); + return op; + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/sequence_scatter_op.cc b/paddle/fluid/operators/sequence_ops/sequence_scatter_op.cc similarity index 98% rename from paddle/fluid/operators/sequence_scatter_op.cc rename to paddle/fluid/operators/sequence_ops/sequence_scatter_op.cc index adb81bffccb50069b3a2e5f391f3fdfde231b2be..c49d1ccb18427a1ec3c45f326b57bce32c60e1e2 100644 --- a/paddle/fluid/operators/sequence_scatter_op.cc +++ b/paddle/fluid/operators/sequence_ops/sequence_scatter_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/sequence_scatter_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_scatter_op.h" #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/gather.h" diff --git a/paddle/fluid/operators/sequence_scatter_op.h b/paddle/fluid/operators/sequence_ops/sequence_scatter_op.h similarity index 100% rename from paddle/fluid/operators/sequence_scatter_op.h rename to paddle/fluid/operators/sequence_ops/sequence_scatter_op.h diff --git a/paddle/fluid/operators/sequence_slice_op.cc b/paddle/fluid/operators/sequence_ops/sequence_slice_op.cc similarity index 98% rename from paddle/fluid/operators/sequence_slice_op.cc rename to paddle/fluid/operators/sequence_ops/sequence_slice_op.cc index df9243dc04c584d70dfa6ca78d5fac8423796466..6f84023e26dbf1280d9622946ab20184fb835be1 100644 --- a/paddle/fluid/operators/sequence_slice_op.cc +++ b/paddle/fluid/operators/sequence_ops/sequence_slice_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/sequence_slice_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_slice_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/sequence_slice_op.cu b/paddle/fluid/operators/sequence_ops/sequence_slice_op.cu similarity index 92% rename from paddle/fluid/operators/sequence_slice_op.cu rename to paddle/fluid/operators/sequence_ops/sequence_slice_op.cu index 059e802df0ebdba68f758decfb8b54a362996335..1e4a1b8323dbaacdf3f74c33e7aa4484d9be2478 100644 --- a/paddle/fluid/operators/sequence_slice_op.cu +++ b/paddle/fluid/operators/sequence_ops/sequence_slice_op.cu @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/sequence_slice_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_slice_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL( diff --git a/paddle/fluid/operators/sequence_slice_op.h b/paddle/fluid/operators/sequence_ops/sequence_slice_op.h similarity index 100% rename from paddle/fluid/operators/sequence_slice_op.h rename to paddle/fluid/operators/sequence_ops/sequence_slice_op.h diff --git a/paddle/fluid/operators/sequence_softmax_cudnn_op.cu.cc b/paddle/fluid/operators/sequence_ops/sequence_softmax_cudnn_op.cu.cc similarity index 100% rename from paddle/fluid/operators/sequence_softmax_cudnn_op.cu.cc rename to paddle/fluid/operators/sequence_ops/sequence_softmax_cudnn_op.cu.cc diff --git a/paddle/fluid/operators/sequence_softmax_op.cc b/paddle/fluid/operators/sequence_ops/sequence_softmax_op.cc similarity index 98% rename from paddle/fluid/operators/sequence_softmax_op.cc rename to paddle/fluid/operators/sequence_ops/sequence_softmax_op.cc index ada3e0c8dbba38729c2b9c8b02335327835f2ef4..644a5bebc18886a2ac9210576f1c2251ad5ad0be 100644 --- a/paddle/fluid/operators/sequence_softmax_op.cc +++ b/paddle/fluid/operators/sequence_ops/sequence_softmax_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/sequence_softmax_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_softmax_op.h" #include namespace paddle { diff --git a/paddle/fluid/operators/sequence_softmax_op.cu b/paddle/fluid/operators/sequence_ops/sequence_softmax_op.cu similarity index 98% rename from paddle/fluid/operators/sequence_softmax_op.cu rename to paddle/fluid/operators/sequence_ops/sequence_softmax_op.cu index e94ceaa170131e8bce7d1574b27f0baeaa8d1ffc..cc5e9821903fb7a726f52177df1d17757f697411 100644 --- a/paddle/fluid/operators/sequence_softmax_op.cu +++ b/paddle/fluid/operators/sequence_ops/sequence_softmax_op.cu @@ -14,7 +14,7 @@ limitations under the License. */ #include #include // NOLINT -#include "paddle/fluid/operators/sequence_softmax_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_softmax_op.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/sequence_softmax_op.h b/paddle/fluid/operators/sequence_ops/sequence_softmax_op.h similarity index 100% rename from paddle/fluid/operators/sequence_softmax_op.h rename to paddle/fluid/operators/sequence_ops/sequence_softmax_op.h diff --git a/paddle/fluid/operators/sequence_unpad_op.cc b/paddle/fluid/operators/sequence_ops/sequence_unpad_op.cc similarity index 97% rename from paddle/fluid/operators/sequence_unpad_op.cc rename to paddle/fluid/operators/sequence_ops/sequence_unpad_op.cc index f3a0762b9a4d3e080d5d6d10b249e0bd81980b95..2cf508e0b707ecc986886e72e5d42fde3c84894d 100644 --- a/paddle/fluid/operators/sequence_unpad_op.cc +++ b/paddle/fluid/operators/sequence_ops/sequence_unpad_op.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/sequence_unpad_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_unpad_op.h" namespace paddle { namespace operators { @@ -50,7 +50,7 @@ class SequenceUnpadOp : public framework::OperatorWithKernel { if (x_dims.size() == 2) { out_dims_vec.push_back(1); } else { - for (size_t i = 2; i < x_dims.size(); ++i) { + for (int i = 2; i < x_dims.size(); ++i) { out_dims_vec.push_back(x_dims[i]); } } diff --git a/paddle/fluid/operators/sequence_unpad_op.cu b/paddle/fluid/operators/sequence_ops/sequence_unpad_op.cu similarity index 95% rename from paddle/fluid/operators/sequence_unpad_op.cu rename to paddle/fluid/operators/sequence_ops/sequence_unpad_op.cu index 75248372237ec2cb23122f6b16e64f6ce750ebf9..bf54f77f5b55cf7eb19873e352359c028207308a 100644 --- a/paddle/fluid/operators/sequence_unpad_op.cu +++ b/paddle/fluid/operators/sequence_ops/sequence_unpad_op.cu @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
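A note on the size_t to int loop changes in the sequence_unpad hunks here and just below: framework::DDim::size() returns int, so a size_t counter forces a mixed signed/unsigned comparison that warns under -Wsign-compare; the patch does not state this motivation, so treat it as an assumption. A minimal repro of the pattern:

    #include <vector>

    // Illustrative: keep the loop counter and the bound the same signedness.
    int TrailingProduct(const std::vector<int>& dims) {
      int p = 1;
      for (int i = 2; i < static_cast<int>(dims.size()); ++i) p *= dims[i];
      return p;
    }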
*/ -#include "paddle/fluid/operators/sequence_unpad_op.h" +#include "paddle/fluid/operators/sequence_ops/sequence_unpad_op.h" namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL(
diff --git a/paddle/fluid/operators/sequence_unpad_op.h b/paddle/fluid/operators/sequence_ops/sequence_unpad_op.h similarity index 98% rename from paddle/fluid/operators/sequence_unpad_op.h rename to paddle/fluid/operators/sequence_ops/sequence_unpad_op.h index ebe3118b985bdfd41ca55e8c572047aa87502ff4..07df3dca831d7e646050ae57402c1a493c2e50e9 100644 --- a/paddle/fluid/operators/sequence_unpad_op.h +++ b/paddle/fluid/operators/sequence_ops/sequence_unpad_op.h @@ -61,7 +61,7 @@ class SequenceUnpadOpKernel : public framework::OpKernel { if (x_t->dims().size() == 2) { out_dims_vec.push_back(1); } else { - for (size_t i = 2; i < x_t->dims().size(); ++i) { + for (int i = 2; i < x_t->dims().size(); ++i) { out_dims_vec.push_back(x_t->dims()[i]); } }
diff --git a/paddle/fluid/operators/sign_op.cc b/paddle/fluid/operators/sign_op.cc index f3985dcc027f974e0213a73ea9a21e268d77615f..6837856a6da804e27af2cd6c83052c04f17140d8 100644 --- a/paddle/fluid/operators/sign_op.cc +++ b/paddle/fluid/operators/sign_op.cc @@ -67,4 +67,5 @@ namespace ops = paddle::operators; REGISTER_OPERATOR(sign, ops::SignOp, ops::SignOpMaker, ops::SignGradMaker); REGISTER_OP_CPU_KERNEL( - sign, ops::SignKernel<paddle::platform::CPUDeviceContext, float>); + sign, ops::SignKernel<paddle::platform::CPUDeviceContext, float>, + ops::SignKernel<paddle::platform::CPUDeviceContext, double>);
diff --git a/paddle/fluid/operators/sign_op.cu b/paddle/fluid/operators/sign_op.cu index e0d7a87e6485a74dd1cdee1a05abc42eef460990..817e0fbbd511462f161633242d28e63062676eb9 100644 --- a/paddle/fluid/operators/sign_op.cu +++ b/paddle/fluid/operators/sign_op.cu @@ -13,7 +13,11 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/operators/sign_op.h" +#include "paddle/fluid/platform/float16.h" REGISTER_OP_CUDA_KERNEL( sign, - paddle::operators::SignKernel<paddle::platform::CUDADeviceContext, float>); + paddle::operators::SignKernel<paddle::platform::CUDADeviceContext, float>, + paddle::operators::SignKernel<paddle::platform::CUDADeviceContext, double>, + paddle::operators::SignKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16>);
diff --git a/paddle/fluid/operators/similarity_focus_op.cc b/paddle/fluid/operators/similarity_focus_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..9612f82b6d45dc4e08bfe288ddd1c7790875ee4d --- /dev/null +++ b/paddle/fluid/operators/similarity_focus_op.cc @@ -0,0 +1,87 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/similarity_focus_op.h" + +namespace paddle { +namespace operators { +class SimilarityFocusOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", + "(Tensor, default Tensor<float>), a 4-D tensor with shape," + " [BatchSize, X, Y, Z]"); + AddOutput("Out", + "(Tensor, default Tensor<float>), the similarity focus mask" + " with the same shape of input X."); + AddAttr<int>("axis", + "(int32), indicating the dimension to be selected. It can" + " only be 1, 2, or 3."); + AddAttr<std::vector<int>>("indexes", + "(std::vector<int>), indicating the indexes" + " of the selected dimension."); + AddComment(R"DOC( +SimilarityFocus Operator. + +Generate a similarity focus mask with the same shape of input using the following method: +1. Extract the 3-D tensor (here the first dimension is BatchSize) corresponding + to the axis according to the indexes. For example, if axis=1 and indexes=[a], + it will get the matrix T=X[:, a, :, :]. In this case, if the shape of input X + is (BatchSize, A, B, C), the shape of tensor T is (BatchSize, B, C). +2. For each index, find the largest numbers in the tensor T, such that each + row and each column holds at most one of them (once the largest number has + been found in the i-th row and the j-th column, the remaining numbers in that + row and that column are skipped, and the next largest number is selected from + what is left; this yields min(B, C) numbers). Mark the corresponding positions + of the 3-D similarity focus mask as 1 and all other positions as 0, then + combine the masks of all indexes with an element-wise OR. +3. Broadcast the 3-D similarity focus mask to the same shape of input X. + +Refer to `Similarity Focus Layer `_ +)DOC"); + } +}; + +class SimilarityFocusOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) should not be null."); + auto x_dims = ctx->GetInputDim("X"); + PADDLE_ENFORCE_EQ(x_dims.size(), 4, "Input(X)'s rank should be 4."); + ctx->SetOutputDim("Out", x_dims); + ctx->ShareLoD("X", /*->*/ "Out"); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + framework::ToDataType(ctx.Input<framework::Tensor>("X")->type()), + platform::CPUPlace()); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(similarity_focus, ops::SimilarityFocusOp, + ops::SimilarityFocusOpMaker, + paddle::framework::EmptyGradOpMaker); +REGISTER_OP_CPU_KERNEL(similarity_focus, ops::SimilarityFocusKernel<float>, + ops::SimilarityFocusKernel<double>);
diff --git a/paddle/fluid/operators/similarity_focus_op.h b/paddle/fluid/operators/similarity_focus_op.h new file mode 100644 index 0000000000000000000000000000000000000000..bf3fed2aaf2cf92d5619ae5bce6dd70d9dfe9621 --- /dev/null +++ b/paddle/fluid/operators/similarity_focus_op.h @@ -0,0 +1,168 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License.
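Step 2 of the SimilarityFocus DOC above describes a greedy, conflict-free selection over a B x C matrix; the kernel that follows implements it per batch item and per index with tag arrays. A self-contained sketch of the selection (hypothetical helper):

    #include <algorithm>
    #include <utility>
    #include <vector>

    // Pick entries in descending order, skipping any whose row or column
    // is already used; exactly min(B, C) entries end up selected.
    std::vector<std::pair<int, int>> GreedySelect(
        const std::vector<std::vector<float>>& t) {
      int b = static_cast<int>(t.size());
      int c = static_cast<int>(t[0].size());
      std::vector<std::pair<float, std::pair<int, int>>> cells;
      for (int i = 0; i < b; ++i)
        for (int j = 0; j < c; ++j) cells.push_back({t[i][j], {i, j}});
      std::sort(cells.begin(), cells.end(),
                [](const auto& x, const auto& y) { return x.first > y.first; });
      std::vector<bool> row_used(b), col_used(c);
      std::vector<std::pair<int, int>> picked;
      for (const auto& cell : cells) {
        int i = cell.second.first, j = cell.second.second;
        if (row_used[i] || col_used[j]) continue;
        row_used[i] = col_used[j] = true;
        picked.emplace_back(i, j);
        if (static_cast<int>(picked.size()) == std::min(b, c)) break;
      }
      return picked;
    }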
*/ + +#pragma once + +#include +#include +#include +#include +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { +using Tensor = framework::Tensor; + +template +class SimilarityFocusKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + Tensor* out = context.Output("Out"); + const Tensor* x = context.Input("X"); + T* out_data = out->mutable_data(context.GetPlace()); + const T* x_data = x->data(); + + int axis = context.Attr("axis"); + std::vector indexes = context.Attr>("indexes"); + + int64_t batch_size = x->dims()[0]; + int64_t dim[4]; + for (int i = 1; i <= 3; ++i) { + dim[i] = x->dims()[i]; + } + + if (indexes.size() < 1) { + PADDLE_THROW("Indexes' size can not be 0."); + } + for (auto index : indexes) { + if (dim[axis] < index) { + PADDLE_THROW("Index exceeds tensor shape limit."); + } + } + + int64_t array_size = 1; + for (int i = 1; i <= 3; ++i) { + if (i != axis) { + array_size *= dim[i]; + } + } + + std::vector> array(array_size); + + bool (*cmp)(std::pair, std::pair) = []( + std::pair x, std::pair y) { + return x.first > y.first; + }; + + int64_t (*compute_index)(int64_t*, int, int, int, int) = []( + int64_t* dim, int d1, int d2, int d3, int d4) { + return d1 * dim[1] * dim[2] * dim[3] + d2 * dim[2] * dim[3] + + d3 * dim[3] + d4; + }; + + memset(out_data, 0, sizeof(T) * batch_size * dim[1] * dim[2] * dim[3]); + for (int i = 0; i < batch_size; ++i) { + for (auto index : indexes) { + if (axis == 1) { + for (int j = 0; j < dim[2]; ++j) { + for (int k = 0; k < dim[3]; ++k) { + array[j * dim[3] + k] = std::make_pair( + x_data[compute_index(dim, i, index, j, k)], j * dim[3] + k); + } + } + + std::sort(array.begin(), array.end(), cmp); + int tag_num = 0; + std::vector tag2(dim[2]), tag3(dim[3]); + for (auto x : array) { + int idx2 = x.second / dim[3]; + int idx3 = x.second % dim[3]; + if (tag2[idx2] || tag3[idx3]) { + continue; + } + tag_num++; + tag2[idx2] = true; + tag3[idx3] = true; + for (int j = 0; j < dim[1]; ++j) { + out_data[compute_index(dim, i, j, idx2, idx3)] = 1; + } + if (tag_num == std::min(dim[2], dim[3])) { + break; + } + } + } else if (axis == 2) { + for (int j = 0; j < dim[1]; ++j) { + for (int k = 0; k < dim[3]; ++k) { + array[j * dim[3] + k] = std::make_pair( + x_data[compute_index(dim, i, j, index, k)], j * dim[3] + k); + } + } + + std::sort(array.begin(), array.end(), cmp); + int tag_num = 0; + std::vector tag1(dim[1]), tag3(dim[3]); + for (auto x : array) { + int idx1 = x.second / dim[3]; + int idx3 = x.second % dim[3]; + if (tag1[idx1] || tag3[idx3]) { + continue; + } + tag_num++; + tag1[idx1] = true; + tag3[idx3] = true; + for (int j = 0; j < dim[2]; ++j) { + out_data[compute_index(dim, i, idx1, j, idx3)] = 1; + } + if (tag_num == std::min(dim[1], dim[3])) { + break; + } + } + } else if (axis == 3) { + for (int j = 0; j < dim[1]; ++j) { + for (int k = 0; k < dim[2]; ++k) { + array[j * dim[2] + k] = std::make_pair( + x_data[compute_index(dim, i, j, k, index)], j * dim[2] + k); + } + } + + std::sort(array.begin(), array.end(), cmp); + int tag_num = 0; + std::vector tag1(dim[1]), tag2(dim[2]); + for (auto x : array) { + int idx1 = x.second / dim[2]; + int idx2 = x.second % dim[2]; + if (tag1[idx1] || tag2[idx2]) { + continue; + } + tag_num++; + tag1[idx1] = true; + tag2[idx2] = true; + for (int j = 0; j < dim[3]; ++j) { + out_data[compute_index(dim, i, idx1, idx2, j)] = 1; + } + if (tag_num == std::min(dim[1], 
dim[2])) { + break; + } + } + } else { + PADDLE_THROW("Axis must be 1 or 2 or 3"); + } + } + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/softmax_cudnn_op.cu.cc b/paddle/fluid/operators/softmax_cudnn_op.cu.cc index 2bdb23e999621b10799b5163f326bc4b66a437e6..ad3e5543f10ae05865565110ba2231c897c205b8 100644 --- a/paddle/fluid/operators/softmax_cudnn_op.cu.cc +++ b/paddle/fluid/operators/softmax_cudnn_op.cu.cc @@ -76,6 +76,9 @@ namespace ops = paddle::operators; namespace plat = paddle::platform; REGISTER_OP_KERNEL(softmax, CUDNN, plat::CUDAPlace, ops::SoftmaxCUDNNKernel, + ops::SoftmaxCUDNNKernel, ops::SoftmaxCUDNNKernel); REGISTER_OP_KERNEL(softmax_grad, CUDNN, plat::CUDAPlace, - ops::SoftmaxGradCUDNNKernel); + ops::SoftmaxGradCUDNNKernel, + ops::SoftmaxGradCUDNNKernel, + ops::SoftmaxGradCUDNNKernel); diff --git a/paddle/fluid/operators/softmax_op.cc b/paddle/fluid/operators/softmax_op.cc index bb081238820b9ee3ae095442d21cfce11f7b41e5..091ce4e6e8e2c3c6e2f064c1cfcae222af8299e0 100644 --- a/paddle/fluid/operators/softmax_op.cc +++ b/paddle/fluid/operators/softmax_op.cc @@ -80,8 +80,7 @@ class SoftmaxOpMaker : public framework::OpProtoAndCheckerMaker { AddInput("X", "The input tensor of softmax, " "whose last dimension is the input_feature_dimensions."); - AddOutput("Out", "The normalized values with the same shape as X.") - .Reuse("X"); + AddOutput("Out", "The normalized values with the same shape as X."); AddAttr( "use_cudnn", "(bool, default false) Only used in cudnn kernel, need install cudnn") @@ -97,20 +96,21 @@ class SoftmaxOpMaker : public framework::OpProtoAndCheckerMaker { "(bool, default false) Only used in mkldnn kernel") .SetDefault(false); AddAttr("is_test", - "Disable epsilon adding to softmax results. Used by MKLDNN.") + "(bool, default false) Set to true for inference only, false " + "for training. Some layers may run faster when this is true.") .SetDefault(false); AddComment(R"DOC( Softmax Operator. -The input of the softmax operator is a tensor of any rank. The output tensor +The input of the softmax operator is a tensor of any rank. The output tensor has the same shape as the input. -The input tensor will first be logically flattened to a 2-D matrix. The matrix's -second dimension(row length) is as same as the last dimension of the input -tensor, and the first dimension(column length) is the product of all other -dimensions of the input tensor. For each row of the matrix, the softmax operator -squashes the K-dimensional(K is the width of the matrix, which is also the size -of the input tensor's last dimension) vector of arbitrary real values to a +The input tensor will first be logically flattened to a 2-D matrix. The matrix's +second dimension(row length) is as same as the last dimension of the input +tensor, and the first dimension(column length) is the product of all other +dimensions of the input tensor. For each row of the matrix, the softmax operator +squashes the K-dimensional(K is the width of the matrix, which is also the size +of the input tensor's last dimension) vector of arbitrary real values to a K-dimensional vector of real values in the range [0, 1] that add up to 1. It computes the exponential of the given dimension and the sum of exponential values of all the other dimensions in the K-dimensional vector input. 
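For reference, the formula this comment leads into (its display form falls outside the hunk shown next) is the standard row-wise softmax, with $K$ the flattened row width:

    $$ Out[i, j] = \frac{\exp(X[i, j])}{\sum_{k=1}^{K} \exp(X[i, k])} $$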
@@ -125,6 +125,14 @@ For each row $i$ and each column $j$ in the matrix, we have: } }; +class SoftmaxOpInferVarType : public framework::PassInDtypeAndVarTypeToOutput { + protected: + std::unordered_map<std::string, std::string> GetInputOutputWithSameType() + const override { + return std::unordered_map<std::string, std::string>{{"X", /*->*/ "Out"}}; + } +}; + class SoftmaxOpGrad : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; @@ -197,7 +205,7 @@ class SoftmaxOpGradMaker : public framework::SingleGradOpDescMaker { namespace ops = paddle::operators; REGISTER_OPERATOR(softmax, ops::SoftmaxOp, ops::SoftmaxOpMaker, - ops::SoftmaxOpGradMaker); + ops::SoftmaxOpInferVarType, ops::SoftmaxOpGradMaker); REGISTER_OPERATOR(softmax_grad, ops::SoftmaxOpGrad); REGISTER_OP_CPU_KERNEL( softmax, ops::SoftmaxKernel,
diff --git a/paddle/fluid/operators/softmax_op.cu.cc b/paddle/fluid/operators/softmax_op.cu.cc index 5fb4f011d9b47cebc4a23bcce47eada825263343..19359b7eef5126d84f0707d39095a74ae4561186 100644 --- a/paddle/fluid/operators/softmax_op.cu.cc +++ b/paddle/fluid/operators/softmax_op.cu.cc @@ -23,4 +23,5 @@ REGISTER_OP_CUDA_KERNEL( ops::SoftmaxKernel); REGISTER_OP_CUDA_KERNEL( softmax_grad, ops::SoftmaxGradKernel, - ops::SoftmaxGradKernel); + ops::SoftmaxGradKernel, + ops::SoftmaxGradKernel);
diff --git a/paddle/fluid/operators/softmax_op.h b/paddle/fluid/operators/softmax_op.h index cf1eeb017d666f605a431aa54637d8cbc99c7c46..8eb5c7691efe930e9f79ad6a381cb290107d1a14 100644 --- a/paddle/fluid/operators/softmax_op.h +++ b/paddle/fluid/operators/softmax_op.h @@ -35,8 +35,15 @@ class SoftmaxKernel : public framework::OpKernel { Tensor X_2d = framework::ReshapeToMatrix(*X, rank - 1); Tensor Out_2d = framework::ReshapeToMatrix(*Out, rank - 1); - math::SoftmaxFunctor<DeviceContext, T>()( +#ifdef PADDLE_ON_INFERENCE + math::SoftmaxFunctor< + DeviceContext, T, + std::is_same<DeviceContext, platform::CPUDeviceContext>::value>()( context.template device_context<DeviceContext>(), &X_2d, &Out_2d); +#else + math::SoftmaxFunctor<DeviceContext, T>()( + context.template device_context<DeviceContext>(), &X_2d, &Out_2d); +#endif } };
diff --git a/paddle/fluid/operators/softmax_with_cross_entropy_op.cc b/paddle/fluid/operators/softmax_with_cross_entropy_op.cc index 1a9324ec862fc3dd7ce669c5fed94527cac22b8f..2900221485e6ec097796ac38936ce31f8382c86a 100644 --- a/paddle/fluid/operators/softmax_with_cross_entropy_op.cc +++ b/paddle/fluid/operators/softmax_with_cross_entropy_op.cc @@ -44,6 +44,12 @@ class SoftmaxWithCrossEntropyOpMaker "(bool, default: false), A flag to indicate whether to interpret " "the given labels as soft labels.") .SetDefault(false); + AddAttr<bool>( + "numeric_stable_mode", + "(bool, default: false), A flag to indicate whether to use a more " + "numerically stable algorithm. This flag is only valid when " + "soft_label is false and GPU is used.") + .SetDefault(false); AddAttr<int>( "ignore_index", "(int, default -100), Specifies a target value that is ignored and"
diff --git a/paddle/fluid/operators/softmax_with_cross_entropy_op.cu b/paddle/fluid/operators/softmax_with_cross_entropy_op.cu index a07c17348ebb3f768d1c8be65c2d31e3c130bd23..6d48796191dd13a45f0c7267bfaf05489f528a9d 100644 --- a/paddle/fluid/operators/softmax_with_cross_entropy_op.cu +++ b/paddle/fluid/operators/softmax_with_cross_entropy_op.cu @@ -17,6 +17,7 @@ limitations under the License.
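The numeric_stable_mode attribute added above switches the GPU hard-label path to the max-shift (log-sum-exp) formulation that the kernels below fuse into per-row block reductions. A scalar sketch of that arithmetic, with illustrative names:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Stable log-softmax: shifting by the row max keeps every exp() in
    // [0, 1], so large logits cannot overflow; the hard-label loss for
    // class y is then simply -out[y].
    std::vector<float> StableLogSoftmax(const std::vector<float>& logits) {
      float m = *std::max_element(logits.begin(), logits.end());
      float sum = 0.f;
      for (float v : logits) sum += std::exp(v - m);
      float log_z = m + std::log(sum);
      std::vector<float> out(logits.size());
      for (size_t i = 0; i < logits.size(); ++i) out[i] = logits[i] - log_z;
      return out;
    }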
*/ #include #include "paddle/fluid/operators/math/cross_entropy.h" #include "paddle/fluid/operators/softmax_with_cross_entropy_op.h" +#include "paddle/fluid/platform/for_range.h" namespace paddle { namespace operators { @@ -117,8 +118,8 @@ using BlockReduceTempStorage = typename BlockReduce::TempStorage; // Make sure that BlockDim <= feature_size // This kernel is used to calculate the max element of each row template -__global__ void RowReductionForMax(const T* logits_data, T* max_data, - int feature_size) { +static __global__ void RowReductionForMax(const T* logits_data, T* max_data, + int feature_size) { __shared__ BlockReduceTempStorage temp_storage; auto beg_idx = feature_size * blockIdx.x + threadIdx.x; @@ -141,9 +142,10 @@ __global__ void RowReductionForMax(const T* logits_data, T* max_data, } // Make sure that BlockDim <= feature_size -template -__global__ void RowReductionForDiffMaxSum(const T* logits_data, T* max_data, - T* softmax, int feature_size) { +template +static __global__ void RowReductionForDiffMaxSum(const T* logits_data, + T* max_data, T* softmax, + int feature_size) { __shared__ BlockReduceTempStorage temp_storage; auto beg_idx = feature_size * blockIdx.x + threadIdx.x; @@ -153,24 +155,34 @@ __global__ void RowReductionForDiffMaxSum(const T* logits_data, T* max_data, softmax[beg_idx] = logits_data[beg_idx] - block_max; T diff_max_sum = real_exp(softmax[beg_idx]); - beg_idx += BlockDim; - while (beg_idx < end_idx) { - softmax[beg_idx] = logits_data[beg_idx] - block_max; - diff_max_sum += real_exp(softmax[beg_idx]); - beg_idx += BlockDim; + auto idx = beg_idx + BlockDim; + while (idx < end_idx) { + softmax[idx] = logits_data[idx] - block_max; + diff_max_sum += real_exp(softmax[idx]); + idx += BlockDim; } diff_max_sum = BlockReduce(temp_storage).Reduce(diff_max_sum, cub::Sum()); if (threadIdx.x == 0) max_data[blockIdx.x] = real_log(diff_max_sum); + + if (!CalculateLogSoftmax) return; + __syncthreads(); + diff_max_sum = max_data[blockIdx.x]; + softmax[beg_idx] -= diff_max_sum; + beg_idx += BlockDim; + while (beg_idx < end_idx) { + softmax[beg_idx] -= diff_max_sum; + beg_idx += BlockDim; + } + if (threadIdx.x == 0) max_data[blockIdx.x] = 0; } // Make sure that BlockDim <= feature_size template -__global__ void RowReductionForSoftmaxAndCrossEntropy(const T* logits_data, - const T* labels_data, - T* loss_data, T* softmax, - int feature_size) { +static __global__ void RowReductionForSoftmaxAndCrossEntropy( + const T* logits_data, const T* labels_data, T* loss_data, T* softmax, + int feature_size) { __shared__ BlockReduceTempStorage temp_storage; auto beg_idx = feature_size * blockIdx.x + threadIdx.x; @@ -194,11 +206,134 @@ __global__ void RowReductionForSoftmaxAndCrossEntropy(const T* logits_data, } template -__global__ void SetSoftmaxToOneWhenFeatureSizeIsOne(T* out, int batch_size) { +struct HardLabelSoftmaxWithCrossEntropyFunctor { + public: + HardLabelSoftmaxWithCrossEntropyFunctor(const T* logits, + const int64_t* labels, T* loss, + T* log_softmax, int feature_size) + : logits_(logits), + labels_(labels), + loss_(loss), + log_softmax_(log_softmax), + feature_size_(feature_size) {} + + __device__ void operator()(int idx) const { + auto row_idx = idx / feature_size_; + auto col_idx = idx % feature_size_; + if (col_idx != labels_[row_idx]) { + log_softmax_[idx] = real_exp(log_softmax_[idx]); + } else { + auto softmax = log_softmax_[idx]; + log_softmax_[idx] = real_exp(softmax); + loss_[row_idx] = -softmax; + } + } + + private: + const T* logits_; + const int64_t* labels_; 
+ T* loss_; + T* log_softmax_; + int feature_size_; +}; + +template +struct HardLabelSoftmaxWithCrossEntropyFunctorWithIgnoreIdx { + public: + HardLabelSoftmaxWithCrossEntropyFunctorWithIgnoreIdx(const T* logits, + const int64_t* labels, + T* loss, T* log_softmax, + int feature_size, + int ignore_idx) + : logits_(logits), + labels_(labels), + loss_(loss), + log_softmax_(log_softmax), + feature_size_(feature_size), + ignore_idx_(ignore_idx) {} + + __device__ void operator()(int idx) const { + auto row_idx = idx / feature_size_; + auto col_idx = idx % feature_size_; + if (col_idx != labels_[row_idx] || col_idx == ignore_idx_) { + log_softmax_[idx] = real_exp(log_softmax_[idx]); + } else { + auto softmax = log_softmax_[idx]; + log_softmax_[idx] = real_exp(softmax); + loss_[row_idx] = -softmax; + } + } + + private: + const T* logits_; + const int64_t* labels_; + T* loss_; + T* log_softmax_; + int feature_size_; + int ignore_idx_; +}; + +template +static __global__ void SetSoftmaxToOneWhenFeatureSizeIsOne(T* out, + int batch_size) { auto idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < batch_size) out[idx] = static_cast(1); } +template +static void HardLabelSoftmaxWithCrossEntropy( + const platform::CUDADeviceContext& ctx, const T* logits_data, + const int64_t* labels_data, T* loss_data, T* softmax_data, int batch_size, + int feature_size, int ignore_idx) { + constexpr int kMaxBlockDim = 512; + int block_dim = feature_size >= kMaxBlockDim + ? kMaxBlockDim + : (1 << static_cast(std::log2(feature_size))); + auto stream = ctx.stream(); + +#define CALL_HARD_LABEL_SOFTMAX_WITH_CROSS_ENTROPY_FUSED_KERNEL(BlockDim) \ + case BlockDim: { \ + RowReductionForMax<<>>( \ + logits_data, loss_data, feature_size); \ + RowReductionForDiffMaxSum<<>>( \ + logits_data, loss_data, softmax_data, feature_size); \ + platform::ForRange for_range( \ + ctx, batch_size* feature_size); \ + if (ignore_idx >= 0 && ignore_idx < feature_size) { \ + for_range(HardLabelSoftmaxWithCrossEntropyFunctorWithIgnoreIdx( \ + logits_data, labels_data, loss_data, softmax_data, feature_size, \ + ignore_idx)); \ + } else { \ + for_range(HardLabelSoftmaxWithCrossEntropyFunctor( \ + logits_data, labels_data, loss_data, softmax_data, feature_size)); \ + } \ + } break + + switch (block_dim) { + CALL_HARD_LABEL_SOFTMAX_WITH_CROSS_ENTROPY_FUSED_KERNEL(512); + CALL_HARD_LABEL_SOFTMAX_WITH_CROSS_ENTROPY_FUSED_KERNEL(256); + CALL_HARD_LABEL_SOFTMAX_WITH_CROSS_ENTROPY_FUSED_KERNEL(128); + CALL_HARD_LABEL_SOFTMAX_WITH_CROSS_ENTROPY_FUSED_KERNEL(64); + CALL_HARD_LABEL_SOFTMAX_WITH_CROSS_ENTROPY_FUSED_KERNEL(32); + CALL_HARD_LABEL_SOFTMAX_WITH_CROSS_ENTROPY_FUSED_KERNEL(16); + CALL_HARD_LABEL_SOFTMAX_WITH_CROSS_ENTROPY_FUSED_KERNEL(8); + CALL_HARD_LABEL_SOFTMAX_WITH_CROSS_ENTROPY_FUSED_KERNEL(4); + CALL_HARD_LABEL_SOFTMAX_WITH_CROSS_ENTROPY_FUSED_KERNEL(2); + case 1: + SetSoftmaxToOneWhenFeatureSizeIsOne<<<(batch_size + kMaxBlockDim - 1) / + kMaxBlockDim, + kMaxBlockDim, 0, stream>>>( + softmax_data, batch_size); + cudaMemsetAsync(loss_data, 0, batch_size * sizeof(T), stream); + break; + default: + PADDLE_THROW("BlockDim must be 2^n in softmax_with_cross_entropy_op"); + break; + } +#undef CALL_HARD_LABEL_SOFTMAX_WITH_CROSS_ENTROPY_FUSED_KERNEL +} + template static void SoftmaxWithCrossEntropyFusedKernel(const T* logits_data, const T* labels_data, @@ -237,7 +372,7 @@ static void SoftmaxWithCrossEntropyFusedKernel(const T* logits_data, kMaxBlockDim, kMaxBlockDim, 0, stream>>>( softmax_data, batch_size); - cudaMemsetAsync(loss_data, 0, batch_size, 
stream); + cudaMemsetAsync(loss_data, 0, batch_size * sizeof(T), stream); break; default: PADDLE_THROW("BlockDim must be 2^n in softmax_with_cross_entropy_op"); @@ -272,11 +407,21 @@ class SoftmaxWithCrossEntropyCUDAKernel : public framework::OpKernel { logits_data, labels_data, softmax_data, loss_data, batch_size, feature_size, context.cuda_device_context().stream()); } else { - math::SoftmaxCUDNNFunctor()(context.cuda_device_context(), logits, - softmax); - math::CrossEntropyFunctor()( - context.cuda_device_context(), loss, softmax, labels, false, - ignore_index); + if (!context.Attr("numeric_stable_mode")) { + math::SoftmaxCUDNNFunctor()(context.cuda_device_context(), logits, + softmax); + math::CrossEntropyFunctor()( + context.cuda_device_context(), loss, softmax, labels, false, + ignore_index); + } else { + int batch_size = logits->dims()[0]; + int feature_size = logits->dims()[1]; + auto* logits_data = logits->data(); + auto* labels_data = labels->data(); + HardLabelSoftmaxWithCrossEntropy( + context.cuda_device_context(), logits_data, labels_data, loss_data, + softmax_data, batch_size, feature_size, ignore_index); + } } } }; diff --git a/paddle/fluid/operators/softmax_with_cross_entropy_op.h b/paddle/fluid/operators/softmax_with_cross_entropy_op.h index e9aba3b37b8cc01d4fe5de5200579d4e93f67e56..c0530e3d8bc407ddd6d7bf6e10a715185d0beb1f 100644 --- a/paddle/fluid/operators/softmax_with_cross_entropy_op.h +++ b/paddle/fluid/operators/softmax_with_cross_entropy_op.h @@ -42,8 +42,8 @@ class SoftmaxWithCrossEntropyKernel : public framework::OpKernel { auto& dev_ctx = context.template device_context(); - math::SoftmaxFunctor()(dev_ctx, logits, - softmax); + math::SoftmaxFunctor()( + dev_ctx, logits, softmax); math::CrossEntropyFunctor()( dev_ctx, loss, softmax, labels, context.Attr("soft_label"), context.Attr("ignore_index")); diff --git a/paddle/fluid/operators/space_to_depth_op.cc b/paddle/fluid/operators/space_to_depth_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..b579244673fa1618c282c4d4fedf2ba6d1726a82 --- /dev/null +++ b/paddle/fluid/operators/space_to_depth_op.cc @@ -0,0 +1,131 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
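The cudaMemsetAsync change in the hunk above is a genuine bug fix, not a cleanup: the function takes a byte count, so clearing batch_size values of type T was previously zeroing only part of the buffer. With names as in the kernel:

    // cudaMemsetAsync(dst, value, count, stream) counts bytes:
    cudaMemsetAsync(loss_data, 0, batch_size, stream);              // wrong: only batch_size bytes
    cudaMemsetAsync(loss_data, 0, batch_size * sizeof(T), stream);  // right: batch_size elements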
*/ + +#include "paddle/fluid/operators/space_to_depth_op.h" +#include <string> +#include <vector> + +namespace paddle { +namespace operators { + +class SpaceToDepthOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of SpaceToDepthOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Out"), + "Output(Out) of SpaceToDepthOp should not be null."); + + auto x_dims = ctx->GetInputDim("X"); + PADDLE_ENFORCE_EQ(x_dims.size(), 4, "input should be a 4D tensor"); + auto blocksize = ctx->Attrs().Get<int64_t>("blocksize"); + + PADDLE_ENFORCE_GT(blocksize, 1, "The blocksize should be greater than 1"); + PADDLE_ENFORCE_GT(x_dims[1], 0, "input channel should be greater than 0"); + PADDLE_ENFORCE_GT(x_dims[2], 0, "input Height should be greater than 0"); + PADDLE_ENFORCE_GT(x_dims[3], 0, "input Width should be greater than 0"); + + PADDLE_ENFORCE_EQ(x_dims[1] % (blocksize * blocksize), 0, + "input channel should be divisible by the square of " + "the SpaceToDepthOp blocksize"); + PADDLE_ENFORCE_EQ(x_dims[2] % (blocksize), 0, + "input Height should be divisible by " + "the SpaceToDepthOp blocksize"); + PADDLE_ENFORCE_EQ(x_dims[3] % (blocksize), 0, + "input Width should be divisible by " + "the SpaceToDepthOp blocksize"); + + VLOG(3) << "SpaceToDepthOp operator x.shape=" << x_dims + << " Attribute blocksize=" << blocksize << std::endl; + + std::vector<int64_t> output_shape(4, 0); // [B,C,H,W] + output_shape[0] = x_dims[0]; + output_shape[1] = x_dims[1] * blocksize * blocksize; + output_shape[2] = x_dims[2] / blocksize; + output_shape[3] = x_dims[3] / blocksize; + + auto out_dims = framework::make_ddim(output_shape); + + ctx->SetOutputDim("Out", out_dims); + + if (x_dims[0] == out_dims[0]) { + // Only pass LoD when the first dimension of output and Input(X) + // are the same. + ctx->ShareLoD("X", /*->*/ "Out"); + } + } +}; + +class SpaceToDepthOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", + "(Tensor). The input should be a 4D tensor B * C * H * W of " + "SpaceToDepthOp " + "operator."); + AddOutput("Out", + "(Tensor), The output should be a 4D tensor B * C2 * H2 * W2 of " + "SpaceToDepthOp operator."); + AddAttr<int64_t>( + "blocksize", + "(int64_t, default 2) blocksize used to rearrange space to depth.") + .SetDefault(2) + .GreaterThan(1); + AddComment(R"DOC( + reorg operator used in Yolo v2. + The output shape is: C2 = C1 * blocksize * blocksize, H2 = H1 / blocksize, W2 = W1 / blocksize. + + Rearrange Input(X) into the shape given by Attr(blocksize). The + data in Input(X) are unchanged. + + Examples: + + 1. Given a 4-D tensor Input(X) with a shape [128, 2048, 26, 26], and the blocksize is 2, the reorg operator will transform Input(X) + into a 4-D tensor with shape [128, 8192, 13, 13], leaving Input(X)'s data unchanged.
+ + )DOC"); + } +}; + +class SpaceToDepthGradOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")), + "Input(Out@GRAD) shouldn't be null."); + ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X")); + } +}; +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; + +REGISTER_OPERATOR(space_to_depth, ops::SpaceToDepthOp, ops::SpaceToDepthOpMaker, + paddle::framework::DefaultGradOpDescMaker); +REGISTER_OPERATOR(space_to_depth_grad, ops::SpaceToDepthGradOp); +REGISTER_OP_CPU_KERNEL( + space_to_depth, + ops::SpaceToDepthKernel, + ops::SpaceToDepthKernel, + ops::SpaceToDepthKernel); +REGISTER_OP_CPU_KERNEL( + space_to_depth_grad, + ops::SpaceToDepthGradKernel, + ops::SpaceToDepthGradKernel, + ops::SpaceToDepthGradKernel); diff --git a/paddle/fluid/operators/space_to_depth_op.cu b/paddle/fluid/operators/space_to_depth_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..38d0a662733222386b8ecd68d064f3d1abe56c3b --- /dev/null +++ b/paddle/fluid/operators/space_to_depth_op.cu @@ -0,0 +1,30 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/operators/space_to_depth_op.h" + +namespace plat = paddle::platform; +namespace ops = paddle::operators; + +REGISTER_OP_CUDA_KERNEL( + space_to_depth, + ops::SpaceToDepthKernel, + ops::SpaceToDepthKernel, + ops::SpaceToDepthKernel); + +REGISTER_OP_CUDA_KERNEL( + space_to_depth_grad, + ops::SpaceToDepthGradKernel, + ops::SpaceToDepthGradKernel, + ops::SpaceToDepthGradKernel); diff --git a/paddle/fluid/operators/space_to_depth_op.h b/paddle/fluid/operators/space_to_depth_op.h new file mode 100644 index 0000000000000000000000000000000000000000..a71662b4813ab27b65f5c7a918e2bb6fb15a1993 --- /dev/null +++ b/paddle/fluid/operators/space_to_depth_op.h @@ -0,0 +1,127 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
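The shape rule enforced by InferShape above, restated as a small helper (name illustrative), since the reorg-style per-element index mapping in the header below is easy to misread:

    #include <array>
    #include <cstdint>

    // space_to_depth: channels grow by blocksize^2 while both spatial
    // dims shrink by blocksize; the element count is preserved.
    std::array<int64_t, 4> SpaceToDepthOutShape(const std::array<int64_t, 4>& x,
                                                int64_t bs) {
      return {x[0], x[1] * bs * bs, x[2] / bs, x[3] / bs};
    }
    // e.g. {1, 4, 6, 6} with blocksize 2 -> {1, 16, 3, 3}.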
*/ +#ifndef PADDLE_FLUID_OPERATORS_SPACE_TO_DEPTH_OP_H_ +#define PADDLE_FLUID_OPERATORS_SPACE_TO_DEPTH_OP_H_ +#endif // PADDLE_FLUID_OPERATORS_SPACE_TO_DEPTH_OP_H_ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/for_range.h" + +namespace paddle { +namespace operators { + +template +class space_to_depth_compute { + public: + HOSTDEVICE space_to_depth_compute(const T *x, int64_t w, int64_t h, int64_t c, + int64_t batch, int64_t blocksize, + int64_t forward, T *out) + : x_(x), + w_(w), + h_(h), + c_(c), + batch_(batch), + blocksize_(blocksize), + forward_(forward), + out_(out) {} + + HOSTDEVICE void operator()(int64_t in_index) { + int64_t out_c = c_ / (blocksize_ * blocksize_); + // calculate each dim position with index of tensor + int64_t b = in_index / (c_ * h_ * w_); + int64_t k = (in_index % (c_ * h_ * w_)) / (h_ * w_); + int64_t j = ((in_index % (c_ * h_ * w_)) % (h_ * w_)) / w_; + int64_t i = ((in_index % (c_ * h_ * w_)) % (h_ * w_)) % w_; + + int64_t c2 = k % out_c; + int64_t offset = k / out_c; + int64_t w2 = i * blocksize_ + offset % blocksize_; + int64_t h2 = j * blocksize_ + offset / blocksize_; + int64_t out_index = + w2 + w_ * blocksize_ * (h2 + h_ * blocksize_ * (c2 + out_c * b)); + if (forward_) + out_[out_index] = x_[in_index]; + else + out_[in_index] = x_[out_index]; + } + + private: + const T *x_; + int64_t w_, h_, c_, batch_, blocksize_, forward_; + T *out_; +}; + +template +class SpaceToDepthKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &context) const override { + auto *out = context.Output("Out"); + auto *x = context.Input("X"); + auto blocksize = context.Attr("blocksize"); + auto in_dims = x->dims(); + out->mutable_data(context.GetPlace(), x->type()); + + auto out_dims = out->dims(); + auto B = in_dims[0]; + auto C = in_dims[1]; + auto H = in_dims[2]; + auto W = in_dims[3]; + platform::ForRange for_range( + context.template device_context(), + static_cast(x->numel())); + + auto *x_data = x->data(); + auto *out_data = out->data(); + paddle::operators::space_to_depth_compute computer( + x_data, W, H, C, B, blocksize, 1, out_data); + for_range(computer); + + out->Resize(out_dims); + } +}; + +template +class SpaceToDepthGradKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &context) const override { + auto *d_out = + context.Input(framework::GradVarName("Out")); + auto *d_x = + context.Output(framework::GradVarName("X")); + auto blocksize = context.Attr("blocksize"); + auto in_dims = d_x->dims(); + d_x->mutable_data(context.GetPlace(), d_out->type()); + + auto B = in_dims[0]; + auto C = in_dims[1]; + auto H = in_dims[2]; + auto W = in_dims[3]; + + platform::ForRange for_range( + context.template device_context(), + static_cast(d_x->numel())); + + auto *dx_data = d_x->data(); + auto *dout_data = d_out->data(); + + paddle::operators::space_to_depth_compute computer( + dout_data, W, H, C, B, blocksize, 0, dx_data); + for_range(computer); + + d_x->Resize(in_dims); + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/split_op.cc b/paddle/fluid/operators/split_op.cc index d661b276bc31bf0c3ab181d706ffdccec89f0632..a05582ae09e16ee17194d299d713d321f28ccace 100644 --- a/paddle/fluid/operators/split_op.cc +++ b/paddle/fluid/operators/split_op.cc @@ -111,11 +111,10 @@ Example: } // namespace paddle namespace ops = paddle::operators; -USE_CPU_ONLY_OP(concat); REGISTER_OPERATOR(split, ops::SplitOp, 
ops::SplitOpMaker, ops::SplitGradMaker); -REGISTER_OP_CPU_KERNEL(split, - ops::SplitOpKernel, - ops::SplitOpKernel, - ops::SplitOpKernel, - ops::SplitOpKernel); +REGISTER_OP_CPU_KERNEL( + split, ops::SplitOpKernel, + ops::SplitOpKernel, + ops::SplitOpKernel, + ops::SplitOpKernel); diff --git a/paddle/fluid/operators/split_op.h b/paddle/fluid/operators/split_op.h index f0c417c70521b1bb3816f884d6ab7393473999e4..6f4a25ab5ed86937f2f5db532a9eba22b5a2c5be 100644 --- a/paddle/fluid/operators/split_op.h +++ b/paddle/fluid/operators/split_op.h @@ -17,6 +17,7 @@ limitations under the License. */ #include // NOLINT #include #include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/math/concat_and_split.h" #include "paddle/fluid/operators/strided_memcpy.h" namespace paddle { @@ -28,18 +29,22 @@ class SplitOpKernel : public framework::OpKernel { void Compute(const framework::ExecutionContext& ctx) const override { auto* in = ctx.Input("X"); auto outs = ctx.MultiOutput("Out"); - auto in_stride = framework::stride_numel(in->dims()); - int64_t axis = static_cast(ctx.Attr("axis")); + int axis = ctx.Attr("axis"); auto place = ctx.GetPlace(); - size_t input_offset = 0; - for (auto& out : outs) { - out->mutable_data(ctx.GetPlace()); - auto out_stride = framework::stride_numel(out->dims()); - StridedNumelCopyWithAxis(ctx.device_context(), axis, out->data(), - out_stride, in->data() + input_offset, - in_stride, out_stride[axis]); - input_offset += out_stride[axis]; + std::vector shape_refer; + for (size_t j = 0; j < outs.size(); ++j) { + outs[j]->mutable_data(ctx.GetPlace()); + shape_refer.emplace_back(outs[j]); + } + + auto& dev_ctx = ctx.template device_context(); + // Sometimes direct copies will be faster, this maybe need deeply analysis. + if (axis == 0 && outs.size() < 10) { + StridedMemcpyWithAxis0(dev_ctx, *in, shape_refer, &outs); + } else { + math::SplitFunctor functor; + functor(dev_ctx, *in, shape_refer, axis, &outs); } } }; diff --git a/paddle/fluid/operators/split_selected_rows_op.cc b/paddle/fluid/operators/split_selected_rows_op.cc index 76615a9405d7a8e3fa9dba8d01a956209e02ae8f..0e7b1463d1ba81aed53e0e3f3a90d2a1fbf0ffbc 100644 --- a/paddle/fluid/operators/split_selected_rows_op.cc +++ b/paddle/fluid/operators/split_selected_rows_op.cc @@ -22,9 +22,9 @@ class SplitSelectedRowsOpMaker : public framework::OpProtoAndCheckerMaker { void Make() override { AddInput("X", "The input SelectedRows."); AddOutput("Out", "The outputs of the input SelectedRows.").AsDuplicable(); - AddAttr>("height_sections", - "Height for each output SelectedRows.") - .SetDefault(std::vector({})); + AddAttr>("height_sections", + "Height for each output SelectedRows.") + .SetDefault(std::vector({})); AddComment(R"DOC( Split a SelectedRows with a specified rows section. diff --git a/paddle/fluid/operators/split_selected_rows_op.h b/paddle/fluid/operators/split_selected_rows_op.h index 0e9ce165b98845f4745ee70b028513ea31cc6657..af64607fafc6544047714e731846a2440be219b8 100644 --- a/paddle/fluid/operators/split_selected_rows_op.h +++ b/paddle/fluid/operators/split_selected_rows_op.h @@ -21,7 +21,7 @@ limitations under the License. 
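On the split_selected_rows widening above (int to int64_t height sections, presumably to allow very tall sharded tables; the patch does not say): the two helpers it touches are a prefix sum plus a range lookup. A compact sketch:

    #include <cstdint>
    #include <vector>

    // ToAbsoluteSection: per-shard heights -> absolute starting rows.
    std::vector<int64_t> ToAbsolute(const std::vector<int64_t>& heights) {
      std::vector<int64_t> starts(heights.size(), 0);
      for (size_t i = 1; i < heights.size(); ++i)
        starts[i] = starts[i - 1] + heights[i - 1];
      return starts;
    }
    // FindOutIdx then returns the last shard whose start is <= row:
    // heights {3, 4, 5} -> starts {0, 3, 7}; row 5 lands in shard 1.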
*/ namespace paddle { namespace operators { -static int FindOutIdx(int row, const std::vector& abs_sections) { +static int FindOutIdx(int row, const std::vector& abs_sections) { for (size_t i = 1; i < abs_sections.size(); ++i) { if (row < abs_sections[i]) { return i - 1; @@ -30,9 +30,9 @@ static int FindOutIdx(int row, const std::vector& abs_sections) { return abs_sections.size() - 1; } -static std::vector ToAbsoluteSection( - const std::vector& height_sections) { - std::vector abs_sections; +static std::vector ToAbsoluteSection( + const std::vector& height_sections) { + std::vector abs_sections; abs_sections.resize(height_sections.size()); abs_sections[0] = 0; for (size_t i = 1; i < height_sections.size(); ++i) { @@ -47,7 +47,7 @@ class SplitSelectedRowsOpKernel : public framework::OpKernel { void Compute(const framework::ExecutionContext& ctx) const override { auto* x = ctx.Input("X"); auto outs = ctx.MultiOutput("Out"); - auto height_sections = ctx.Attr>("height_sections"); + auto height_sections = ctx.Attr>("height_sections"); auto abs_sections = ToAbsoluteSection(height_sections); diff --git a/paddle/fluid/operators/spp_op.h b/paddle/fluid/operators/spp_op.h index 08cb7849d20443862b66ea6096c095b294c7242c..35d9737ee01fe1505cbe30e8ed735e6b92cb8df2 100644 --- a/paddle/fluid/operators/spp_op.h +++ b/paddle/fluid/operators/spp_op.h @@ -56,12 +56,14 @@ class SppKernel : public framework::OpKernel { math::Pool2dFunctor, T> pool_forward; math::MaxPool max_process; pool_forward(context.template device_context(), *in_x, - kernel_size, strides, paddings, max_process, &out_level); + kernel_size, strides, paddings, max_process, true, + &out_level); } else if (pooling_type == "avg") { math::Pool2dFunctor, T> pool_forward; math::AvgPool avg_process; pool_forward(context.template device_context(), *in_x, - kernel_size, strides, paddings, avg_process, &out_level); + kernel_size, strides, paddings, avg_process, true, + &out_level); } // flatten pooling output shape int output_flatten_w = in_x->dims()[1] * bins * bins; @@ -154,7 +156,7 @@ class SppGradKernel : public framework::OpKernel { math::AvgPoolGrad avg_process; pool_backward(context.template device_context(), *in_x, *&out_level, *&outgrad_level, kernel_size, strides, - paddings, avg_process, in_x_grad); + paddings, avg_process, true, in_x_grad); } } } diff --git a/paddle/fluid/operators/stack_op.cc b/paddle/fluid/operators/stack_op.cc index 3f4b48bc7391def082c82ed451fc5a752009a2f1..9345b495415d203728238c19621a20f446c40bf5 100644 --- a/paddle/fluid/operators/stack_op.cc +++ b/paddle/fluid/operators/stack_op.cc @@ -21,8 +21,12 @@ REGISTER_OPERATOR(stack, ops::StackOp, ops::StackOpMaker, REGISTER_OPERATOR(stack_grad, ops::StackOpGrad); REGISTER_OP_CPU_KERNEL(stack, ops::StackKernel, - ops::StackKernel); + ops::StackKernel, + ops::StackKernel, + ops::StackKernel); REGISTER_OP_CPU_KERNEL(stack_grad, ops::StackGradKernel, - ops::StackGradKernel); + ops::StackGradKernel, + ops::StackGradKernel, + ops::StackGradKernel); diff --git a/paddle/fluid/operators/stack_op.cu b/paddle/fluid/operators/stack_op.cu index 92c1bde2bcf089e5c715e90e564408e6ad37ba17..bf2a9e5b3d22996e688621727cb280dc9aed7859 100644 --- a/paddle/fluid/operators/stack_op.cu +++ b/paddle/fluid/operators/stack_op.cu @@ -18,8 +18,12 @@ namespace plat = paddle::platform; namespace ops = paddle::operators; REGISTER_OP_CUDA_KERNEL(stack, ops::StackKernel, - ops::StackKernel); + ops::StackKernel, + ops::StackKernel, + ops::StackKernel); REGISTER_OP_CUDA_KERNEL(stack_grad, ops::StackGradKernel, - 
ops::StackGradKernel); + ops::StackGradKernel, + ops::StackGradKernel, + ops::StackGradKernel); diff --git a/paddle/fluid/operators/stack_op.h b/paddle/fluid/operators/stack_op.h index d236c5b943704683c27b9b155c11ca9113edf514..3d132e4397e837442d406e1668126da9163129ef 100644 --- a/paddle/fluid/operators/stack_op.h +++ b/paddle/fluid/operators/stack_op.h @@ -147,20 +147,32 @@ class StackKernel : public framework::OpKernel { auto &dim = x[0]->dims(); for (auto i = 0; i < axis; ++i) pre *= dim[i]; for (auto i = axis; i < dim.size(); ++i) post *= dim[i]; - int total_num = pre * n * post; - auto &dev_ctx = ctx.template device_context(); #ifdef __NVCC__ + int total_num = pre * n * post; + auto &dev_ctx = ctx.template device_context(); + thrust::device_vector device_x_vec(x_datas); auto x_data_arr = device_x_vec.data().get(); -#else - auto x_data_arr = x_datas.data(); -#endif + StackFunctorForRange(dev_ctx, x_data_arr, y_data, total_num, n, post); -#ifdef __NVCC__ + // Wait() must be called because device_x_vec may be destructed before // kernel ends dev_ctx.Wait(); +#else + auto x_data_arr = x_datas.data(); + + size_t x_offset = 0; + size_t y_offset = 0; + for (int i = 0; i < pre; i++) { + for (int j = 0; j < n; j++) { + std::memcpy(y_data + y_offset, x_data_arr[j] + x_offset, + post * sizeof(T)); + y_offset += post; + } + x_offset += post; + } #endif } }; diff --git a/paddle/fluid/operators/strided_memcpy.h b/paddle/fluid/operators/strided_memcpy.h index 7a10218e1556698f3e0a1828db5de8851dd1c90b..c3d83a06f23a34ec8cf27d585863135ebfd56a4f 100644 --- a/paddle/fluid/operators/strided_memcpy.h +++ b/paddle/fluid/operators/strided_memcpy.h @@ -13,8 +13,9 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once +#include +#include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/operators/detail/strided_memcpy.h" - namespace paddle { namespace operators { @@ -98,5 +99,26 @@ inline void StridedNumelCopyWithAxis(const platform::DeviceContext& ctx, } } +template +inline void StridedMemcpyWithAxis0( + const platform::DeviceContext& dev_ctx, const framework::Tensor& input, + const std::vector& shape_refer, + std::vector* outputs) { + const framework::DDim in_stride = stride_numel(input.dims()); + const int axis = 0; + size_t input_offset = 0; + + for (size_t i = 0; i < outputs->size(); ++i) { + auto out_stride = stride_numel(shape_refer[i]->dims()); + auto out = outputs->at(i); + if (out != nullptr) { + StridedNumelCopyWithAxis(dev_ctx, axis, out->data(), out_stride, + input.data() + input_offset, in_stride, + out_stride[axis]); + } + input_offset += out_stride[axis]; + } +} + } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/strided_memcpy_test.cc b/paddle/fluid/operators/strided_memcpy_test.cc index a6ca82d16f216c98055fb92b4575a357b8b10348..3a450773a9d749eb3f73baa46e681e588e1fbd0f 100644 --- a/paddle/fluid/operators/strided_memcpy_test.cc +++ b/paddle/fluid/operators/strided_memcpy_test.cc @@ -87,13 +87,16 @@ TEST(StridedMemcpy, GPUCrop) { platform::CUDADeviceContext ctx(gpu0); - int* gpu_src = reinterpret_cast(memory::Alloc(gpu0, sizeof(src))); + auto src_allocation = memory::Alloc(gpu0, sizeof(src)); + + int* gpu_src = reinterpret_cast(src_allocation->ptr()); memory::Copy(gpu0, gpu_src, cpu, src, sizeof(src), ctx.stream()); framework::DDim src_stride({5, 1}); int dst[4]; - int* gpu_dst = reinterpret_cast(memory::Alloc(gpu0, sizeof(dst))); + auto dst_allocation = memory::Alloc(gpu0, sizeof(dst)); + int* 
gpu_dst = reinterpret_cast<int*>(dst_allocation->ptr());
 
   framework::DDim dst_dim({2, 2});
   framework::DDim dst_stride({2, 1});
@@ -108,9 +111,6 @@ TEST(StridedMemcpy, GPUCrop) {
   ASSERT_EQ(2, dst[1]);
   ASSERT_EQ(3, dst[2]);
   ASSERT_EQ(4, dst[3]);
-
-  memory::Free(gpu0, gpu_dst);
-  memory::Free(gpu0, gpu_src);
 }
 
 TEST(StridedMemcpy, GPUConcat) {
@@ -124,12 +124,13 @@ TEST(StridedMemcpy, GPUConcat) {
   platform::CUDAPlace gpu0(0);
   platform::CPUPlace cpu;
   platform::CUDADeviceContext ctx(gpu0);
-
-  int* gpu_src = reinterpret_cast<int*>(memory::Alloc(gpu0, sizeof(src)));
+  auto gpu_src_allocation = memory::Alloc(gpu0, sizeof(src));
+  int* gpu_src = reinterpret_cast<int*>(gpu_src_allocation->ptr());
   memory::Copy(gpu0, gpu_src, cpu, src, sizeof(src), ctx.stream());
 
   int dst[8];
-  int* gpu_dst = reinterpret_cast<int*>(memory::Alloc(gpu0, sizeof(dst)));
+  auto gpu_dst_allocation = memory::Alloc(gpu0, sizeof(dst));
+  int* gpu_dst = reinterpret_cast<int*>(gpu_dst_allocation->ptr());
 
   framework::DDim src_stride({2, 1});
   framework::DDim dst_dim({2, 2});
@@ -151,9 +152,6 @@ TEST(StridedMemcpy, GPUConcat) {
   for (size_t i = 0; i < sizeof(expect_dst) / sizeof(int); ++i) {
     ASSERT_EQ(expect_dst[i], dst[i]);
   }
-
-  memory::Free(gpu0, gpu_dst);
-  memory::Free(gpu0, gpu_src);
 }
 
 #endif
diff --git a/paddle/fluid/operators/sum_mkldnn_op.cc b/paddle/fluid/operators/sum_mkldnn_op.cc
index f9a16ef35ecb9eeb6c8eda9d124ecb17e7f9d5ce..2ae5c17bf6465874572e80da54e40fbe22403660 100644
--- a/paddle/fluid/operators/sum_mkldnn_op.cc
+++ b/paddle/fluid/operators/sum_mkldnn_op.cc
@@ -186,7 +186,7 @@ class SumMKLDNNOpKernel : public paddle::framework::OpKernel<T> {
     }
 
     if (in_dim.empty()) {
-      VLOG(3) << "WARNING: all the inputs are empty";
+      VLOG(30) << "WARNING: all the inputs are empty";
       in_dim = framework::vectorize(get_selected_row(N - 1).value().dims());
     } else {
       in_dim[0] = static_cast<int64_t>(first_dim);
     }
diff --git a/paddle/fluid/operators/sum_op.cc b/paddle/fluid/operators/sum_op.cc
index fe7c7039c7dec714e265ede1b7167fd800ddc2f7..c67b694283cd8f0203021c0329f5ac16ae7854a5 100644
--- a/paddle/fluid/operators/sum_op.cc
+++ b/paddle/fluid/operators/sum_op.cc
@@ -45,7 +45,7 @@ class SumOp : public framework::OperatorWithKernel {
     size_t N = x_dims.size();
     PADDLE_ENFORCE_GT(N, 0, "Input tensors count should be > 0.");
     if (N == 1) {
-      VLOG(3) << "Warning: sum has only one input, may waste memory";
+      VLOG(30) << "Warning: sum has only one input, may waste memory";
     }
 
     framework::DDim in_dim({0});
@@ -67,6 +67,7 @@ class SumOp : public framework::OperatorWithKernel {
   framework::OpKernelType GetExpectedKernelType(
       const framework::ExecutionContext& ctx) const override {
     auto x_vars = ctx.MultiInputVar("X");
+    auto x_vars_name = ctx.Inputs("X");
 
     framework::LibraryType library{framework::LibraryType::kPlain};
     framework::DataLayout layout{framework::DataLayout::kAnyLayout};
@@ -81,15 +82,18 @@
     if (x_vars[0]->IsType<framework::LoDTensor>()) {
       int dtype = -1;
-      for (auto& x_var : x_vars) {
-        auto& lod_tensor = x_var->Get<framework::LoDTensor>();
-        if (lod_tensor.numel() == 0) {
+      for (size_t idx = 0; idx < x_vars.size(); ++idx) {
+        PADDLE_ENFORCE(x_vars[idx] != nullptr,
+                       "Input var[%s] should not be nullptr", x_vars_name[idx]);
+        auto tensor =
+            framework::GetLoDTensorOrSelectedRowsValueFromVar(*x_vars[idx]);
+        if (tensor->numel() == 0) {
           continue;
         }
         if (dtype == -1) {
-          dtype = framework::ToDataType(lod_tensor.type());
+          dtype = framework::ToDataType(tensor->type());
        } else {
-          PADDLE_ENFORCE_EQ(dtype, framework::ToDataType(lod_tensor.type()));
+          PADDLE_ENFORCE_EQ(dtype,
framework::ToDataType(tensor->type())); } } PADDLE_ENFORCE_NE(dtype, -1, @@ -132,7 +136,7 @@ class SumOpMaker : public framework::OpProtoAndCheckerMaker { void Make() override { AddInput("X", "(vector) The input tensors of sum operator.") .AsDuplicable(); - AddOutput("Out", "(Tensor) The output tensor of sum operator.").Reuse("X"); + AddOutput("Out", "(Tensor) The output tensor of sum operator."); AddAttr("use_mkldnn", "(bool, default false) Only used in mkldnn kernel") .SetDefault(false); @@ -153,8 +157,8 @@ class SumOpVarTypeInference : public framework::VarTypeInference { auto& inputs = op_desc.Input("X"); auto var_type = framework::proto::VarType::SELECTED_ROWS; for (auto& name : op_desc.Input("X")) { - VLOG(10) << name << " " - << block->FindRecursiveOrCreateVar(name).GetType(); + VLOG(100) << name << " " + << block->FindRecursiveOrCreateVar(name).GetType(); } bool any_input_is_lod_tensor = std::any_of( diff --git a/paddle/fluid/operators/sum_op.cu b/paddle/fluid/operators/sum_op.cu index 89bcd1bbc86dc29cb7b98cbef3057a8f98c74555..db4c2d6c115f04b436db00854ca4b02fea09866b 100644 --- a/paddle/fluid/operators/sum_op.cu +++ b/paddle/fluid/operators/sum_op.cu @@ -11,10 +11,13 @@ limitations under the License. */ #define EIGEN_USE_GPU #include "paddle/fluid/operators/sum_op.h" +#include "paddle/fluid/platform/float16.h" namespace ops = paddle::operators; +namespace plat = paddle::platform; REGISTER_OP_CUDA_KERNEL( sum, ops::SumKernel, ops::SumKernel, ops::SumKernel, - ops::SumKernel); + ops::SumKernel, + ops::SumKernel); diff --git a/paddle/fluid/operators/sum_op.h b/paddle/fluid/operators/sum_op.h index 34403c7a7aa717cca470be2931009e219e00e3ae..76cc796a9b8e21849b1d86e512cd70752fd027ac 100644 --- a/paddle/fluid/operators/sum_op.h +++ b/paddle/fluid/operators/sum_op.h @@ -43,17 +43,31 @@ class SumKernel : public framework::OpKernel { out->mutable_data(context.GetPlace()); } auto result = EigenVector::Flatten(*out); + auto &place = + *context.template device_context().eigen_device(); + int start = in_place ? 1 : 0; if (!in_place) { - math::SetConstant constant_functor; - constant_functor(context.template device_context(), out, - 0.0); + if ((in_num >= 2) && in_vars[0]->IsType() && + in_vars[1]->IsType()) { + auto &in_0 = in_vars[0]->Get(); + auto &in_1 = in_vars[1]->Get(); + if (in_0.numel() && in_1.numel()) { + auto in_0_e = EigenVector::Flatten(in_0); + auto in_1_e = EigenVector::Flatten(in_1); + result.device(place) = in_0_e + in_1_e; + start = 2; + } + } + if (start != 2) { + math::SetConstant constant_functor; + constant_functor(context.template device_context(), + out, static_cast(0)); + } } math::SelectedRowsAddToTensor functor; - auto &place = - *context.template device_context().eigen_device(); // If in_place, just skip the first tensor - for (size_t i = in_place ? 
1 : 0; i < in_num; i++) { + for (size_t i = start; i < in_num; i++) { if (in_vars[i]->IsType()) { auto &in_t = in_vars[i]->Get(); if (in_t.numel() == 0) { @@ -69,79 +83,57 @@ class SumKernel : public framework::OpKernel { } } } else if (out_var->IsType()) { - std::unique_ptr in0; - if (in_place) { - // If is in_place, we store the input[0] to in0 - auto &in_sel0 = in_vars[0]->Get(); - auto &rows = in_sel0.rows(); -#ifdef PADDLE_WITH_CUDA - std::vector rows_in_cpu; - rows_in_cpu.reserve(rows.size()); - for (auto item : rows) { - rows_in_cpu.push_back(item); - } - in0.reset(new framework::SelectedRows(rows_in_cpu, in_sel0.height())); -#else - in0.reset(new framework::SelectedRows(rows, in_sel0.height())); -#endif - in0->mutable_value()->ShareDataWith(in_sel0.value()); + if (in_place && in_vars.size() < 2) { + return; } - auto get_selected_row = [&](size_t i) -> const SelectedRows & { - if (i == 0 && in0) { - return *in0.get(); - } else { - return in_vars[i]->Get(); + std::vector inputs; + SelectedRows temp_in0; + + if (in_place) { + auto &in0 = in_vars[0]->Get(); + temp_in0.set_height(in0.height()); + temp_in0.set_rows(in0.rows()); + framework::TensorCopy(in0.value(), in0.place(), + context.device_context(), + temp_in0.mutable_value()); + inputs.push_back(&temp_in0); + for (size_t i = 1; i < in_vars.size(); ++i) { + auto &in = in_vars[i]->Get(); + if (in.rows().size() > 0) { + inputs.push_back(&in); + } } - }; + } else { + for (auto &in_var : in_vars) { + auto &in = in_var->Get(); + if (in.rows().size() > 0) { + inputs.push_back(&in_var->Get()); + } + } + } auto *out = context.Output("Out"); out->mutable_rows()->clear(); - auto *out_value = out->mutable_value(); - // Runtime InferShape - size_t first_dim = 0; - for (size_t i = 0; i < in_num; i++) { - auto &sel_row = get_selected_row(i); - first_dim += sel_row.rows().size(); - } - - std::vector in_dim; - for (size_t i = 0; i < in_num; i++) { - auto &sel_row = get_selected_row(i); - if (sel_row.rows().size() > 0) { - in_dim = framework::vectorize(sel_row.value().dims()); + bool has_data = false; + for (auto &in : inputs) { + if (in->rows().size() > 0) { + has_data = true; break; } } - if (in_dim.empty()) { - VLOG(3) << "WARNING: all the inputs are empty"; - in_dim = - framework::vectorize(get_selected_row(in_num - 1).value().dims()); - } else { - in_dim[0] = static_cast(first_dim); - } - - out_value->Resize(framework::make_ddim(in_dim)); - out_value->mutable_data(context.GetPlace()); - // if all the input sparse vars are empty, no need to - // merge these vars. - if (first_dim == 0UL) { - return; - } + if (has_data) { + math::scatter::MergeAdd merge_add; + merge_add(context.template device_context(), inputs, + out); - math::SelectedRowsAddTo functor; + out->SyncIndex(); - int64_t offset = 0; - for (size_t i = 0; i < in_num; i++) { - auto &sel_row = get_selected_row(i); - if (sel_row.rows().size() == 0) { - continue; - } - PADDLE_ENFORCE_EQ(out->height(), sel_row.height()); - functor(context.template device_context(), sel_row, - offset, out); - offset += sel_row.value().numel(); + } else { + // no data, just set a empty out tensor. 
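// The rewritten SelectedRows branch above now funnels every non-empty input
// into a single math::scatter::MergeAdd call instead of accumulating with
// SelectedRowsAddTo at a running offset. A minimal sketch of the merge-add
// semantics this relies on, assuming a plain std::map stands in for
// SelectedRows (illustrative only, not Paddle's implementation):
#include <cstdint>
#include <map>
#include <vector>

using RowId = int64_t;
using RowMap = std::map<RowId, std::vector<float>>;  // row id -> value slice

// Merge sparse inputs: slices that share a row id are summed elementwise,
// rows that appear only once are copied through unchanged.
RowMap MergeAddSketch(const std::vector<RowMap>& inputs) {
  RowMap merged;
  for (const auto& in : inputs) {
    for (const auto& kv : in) {
      auto it = merged.find(kv.first);
      if (it == merged.end()) {
        merged.emplace(kv.first, kv.second);
      } else {
        for (size_t i = 0; i < kv.second.size(); ++i) {
          it->second[i] += kv.second[i];  // elementwise accumulate
        }
      }
    }
  }
  return merged;
}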
+ out->mutable_value()->mutable_data(framework::make_ddim({0}), + context.GetPlace()); } } else if (out_var->IsType()) { auto &out_array = *out_var->GetMutable(); diff --git a/paddle/fluid/operators/tensor_array_to_tensor_op.cc b/paddle/fluid/operators/tensor_array_to_tensor_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..58a74ec2c104f66e9e884cffd00e7fa6622e4714 --- /dev/null +++ b/paddle/fluid/operators/tensor_array_to_tensor_op.cc @@ -0,0 +1,246 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include + +#include "paddle/fluid/framework/lod_tensor_array.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/variable.h" + +namespace paddle { +namespace operators { +using framework::Tensor; + +void LodTensorArray2LodTensorVector(const framework::Scope &scope, + const std::string &base_name, + const std::string &lod_tensor_array_name, + std::vector *res_names) { + auto &inx = + scope.FindVar(lod_tensor_array_name)->Get(); + for (size_t i = 0; i < inx.size(); i++) { + std::string var_name = base_name + std::to_string(i); + framework::Variable *g_feed_value = + const_cast(scope).Var(var_name); + auto &feed_input = + *(g_feed_value->GetMutable()); + feed_input.ShareDataWith(inx[i]); + res_names->push_back(var_name); + } +} + +void LodTensorVectorResizeFromLodTensorArray( + const framework::Scope &scope, const std::string &base_name, + const std::string &lod_tensor_array_name, + std::vector *res_names) { + auto &inx = + scope.FindVar(lod_tensor_array_name)->Get(); + for (size_t i = 0; i < inx.size(); i++) { + std::string var_name = base_name + std::to_string(i); + framework::Variable *g_feed_value = + const_cast(scope).Var(var_name); + auto &feed_input = + *(g_feed_value->GetMutable()); + auto dims = inx[i].dims(); + feed_input.Resize(dims); + res_names->push_back(var_name); + } +} + +void LodTensorArrayCreateFromLodTensorArray( + const framework::Scope &scope, + const std::string &input_lod_tensor_array_name, + const std::string &output_lod_tensor_array_name) { + auto &inx = scope.FindVar(input_lod_tensor_array_name) + ->Get(); + auto &grad_inx = *scope.FindVar(output_lod_tensor_array_name) + ->GetMutable(); + + for (size_t i = 0; i < inx.size(); i++) { + std::string var_name = output_lod_tensor_array_name + std::to_string(i); + framework::Variable *g_feed_value = + const_cast(scope).Var(var_name); + auto &feed_input = + *(g_feed_value->GetMutable()); + grad_inx.push_back(feed_input); + } +} + +class LoDTensorArray2TensorOp : public framework::OperatorBase { + public: + using OperatorBase::OperatorBase; + + private: + void RunImpl(const framework::Scope &scope, + const platform::Place &place) const override { + auto axis = Attr("axis"); + + framework::AttributeMap attrs; + attrs["axis"] = axis; + + auto &inx = scope.FindVar(Input("X"))->Get(); + auto &out = + *scope.FindVar(Output("Out"))->GetMutable(); + auto &out_inx = + *scope.FindVar(Output("OutIndex"))->GetMutable(); + + 
const size_t n = inx.size();
+    PADDLE_ENFORCE_GT(n, 0, "Input tensorarray size should be > 0.");
+
+    std::string base_name = Inputs("X")[0];
+    std::vector<std::string> names;
+
+    // get the input tensorarray items' dim in out_inx
+    auto out_inx_dim = out_inx.dims();
+    out_inx_dim[0] = inx.size();
+    out_inx.Resize(out_inx_dim);
+
+    auto &local_scope = scope.NewScope();
+    std::string var_name = "out_index";
+    framework::Variable *tmp_index_var = local_scope.Var(var_name);
+    auto &tmp_index_tensor =
+        *(tmp_index_var->GetMutable<framework::LoDTensor>());
+    tmp_index_tensor.Resize(out_inx_dim);
+    int *tmp_index_data =
+        tmp_index_tensor.mutable_data<int>(platform::CPUPlace());
+
+    auto out_dims = inx[0].dims();
+    size_t out_dim_sum = 0;
+    for (size_t index = 0; index < inx.size(); index++) {
+      auto inx_dims = inx[index].dims();
+      out_dim_sum += inx_dims[axis];
+      tmp_index_data[index] = inx_dims[axis];
+    }
+    out_inx.ShareDataWith(tmp_index_tensor);
+
+    // get input array items' dims
+    out_dims[axis] = out_dim_sum;
+    out.Resize(out_dims);
+
+    LodTensorArray2LodTensorVector(local_scope, base_name, Input("X"), &names);
+    // Invoke concat Op
+    auto concat_op = framework::OpRegistry::CreateOp(
+        "concat", {{"X", names}}, {{"Out", {Output("Out")}}}, attrs);
+
+    concat_op->Run(local_scope, place);
+  }
+};
+
+class LoDTensorArray2TensorOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  void Make() override {
+    AddInput("X", "Input LoDTensorArray of tensor_array_to_tensor operator.");
+    AddOutput("Out", "Output tensor of tensor_array_to_tensor operator.");
+    AddOutput("OutIndex",
+              "Output input LoDTensorArray items' dims of "
+              "tensor_array_to_tensor operator.");
+    AddAttr<int>("axis",
+                 "The axis along which the input tensors will be concatenated.")
+        .SetDefault(0);
+    AddComment(R"DOC(
+tensor_array_to_tensor Operator.
+
+Concatenate the input LoDTensorArray along dimension axis to the output Tensor.
+OutIndex records each input item's size along `axis` (see OutputIndex in the
+example below), so callers can recover how the concatenated result partitions
+back into the original array.
+Examples: + Input = {[1,2], [3,4], [5,6]} + axis = 0 + Output = [[1,2], + [3,4], + [5,6]] + OutputIndex = [1,1,1] + +)DOC"); + } +}; + +class LoDTensorArray2TensorOpInferShape : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext *ctx) const override {} +}; + +class LoDTensorArray2TensorGradInferShape : public framework::InferShapeBase { + public: + void operator()(framework::InferShapeContext *context) const override {} +}; + +class LoDTensorArray2TensorGradInferVarType + : public framework::VarTypeInference { + public: + void operator()(const framework::OpDesc &op_desc, + framework::BlockDesc *block) const override { + for (auto &out_var : op_desc.Output(framework::GradVarName("X"))) { + block->Var(out_var)->SetType(framework::proto::VarType::LOD_TENSOR_ARRAY); + } + } +}; + +class LoDTensorArray2TensorGradOp : public framework::OperatorBase { + public: + using OperatorBase::OperatorBase; + + private: + void RunImpl(const framework::Scope &scope, + const platform::Place &place) const override { + auto axis = Attr("axis"); + framework::AttributeMap attrs; + attrs["axis"] = axis; + + auto &inx = scope.FindVar(Input("X"))->Get(); + const size_t n = inx.size(); + PADDLE_ENFORCE_GT(n, 0, "Input tensorarray size should > 0."); + + std::string base_name = Inputs("X")[0]; + std::vector names; + + LodTensorArray2LodTensorVector(scope, base_name, Input("X"), &names); + + // grad + auto dx_name = Output(framework::GradVarName("X")); + auto dout_name = Input(framework::GradVarName("Out")); + + std::vector grad_names; + + LodTensorVectorResizeFromLodTensorArray(scope, "grad_name", Input("X"), + &grad_names); + + auto concat_grad_op = framework::OpRegistry::CreateOp( + "concat_grad", {{"X", names}, {"Out@GRAD", {dout_name}}}, + {{"X@GRAD", grad_names}}, attrs); + + concat_grad_op->Run(scope, place); + + LodTensorArrayCreateFromLodTensorArray(scope, Input("X"), dx_name); + auto &grad_inx = + *scope.FindVar(dx_name)->GetMutable(); + + for (size_t i = 0; i < grad_names.size(); i++) { + std::string var_name = grad_names[i]; + auto &feed_input = scope.FindVar(var_name)->Get(); + grad_inx[i].ShareDataWith(feed_input); + } + } +}; + +} // namespace operators +} // namespace paddle +USE_OP(concat); + +namespace ops = paddle::operators; +REGISTER_OPERATOR(tensor_array_to_tensor, ops::LoDTensorArray2TensorOp, + ops::LoDTensorArray2TensorOpMaker, + ops::LoDTensorArray2TensorOpInferShape, + paddle::framework::DefaultGradOpDescMaker); +REGISTER_OPERATOR(tensor_array_to_tensor_grad, ops::LoDTensorArray2TensorGradOp, + ops::LoDTensorArray2TensorGradInferShape, + ops::LoDTensorArray2TensorGradInferVarType); diff --git a/paddle/fluid/operators/tensorrt/CMakeLists.txt b/paddle/fluid/operators/tensorrt/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..eee0b90fbae216e804e62993313796e914fcef5a --- /dev/null +++ b/paddle/fluid/operators/tensorrt/CMakeLists.txt @@ -0,0 +1,5 @@ +op_library(tensorrt_engine_op DEPS tensorrt_engine tensorrt_converter) +file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(tensorrt_engine);\n") +nv_test(test_tensorrt_engine_op SRCS tensorrt_engine_op_test.cc + DEPS tensorrt_engine_op + analysis) diff --git a/paddle/fluid/operators/tensorrt_engine_op.cc b/paddle/fluid/operators/tensorrt/tensorrt_engine_op.cc similarity index 96% rename from paddle/fluid/operators/tensorrt_engine_op.cc rename to paddle/fluid/operators/tensorrt/tensorrt_engine_op.cc index 
41a5786fe8c3295390144732221280e152d0a15a..3cf2ce3c7ef87dcf75548f7d9c3a55d06ed765e8 100644 --- a/paddle/fluid/operators/tensorrt_engine_op.cc +++ b/paddle/fluid/operators/tensorrt/tensorrt_engine_op.cc @@ -17,7 +17,7 @@ #include #include -#include "paddle/fluid/operators/tensorrt_engine_op.h" +#include "paddle/fluid/operators/tensorrt/tensorrt_engine_op.h" namespace paddle { diff --git a/paddle/fluid/operators/tensorrt_engine_op.cu.cc b/paddle/fluid/operators/tensorrt/tensorrt_engine_op.cu.cc similarity index 93% rename from paddle/fluid/operators/tensorrt_engine_op.cu.cc rename to paddle/fluid/operators/tensorrt/tensorrt_engine_op.cu.cc index e1ddfde6d51ef719ca0b89cf286b176195ee682a..cbe1b426f65386e722a7b02ec1fdfdf75bfd770c 100644 --- a/paddle/fluid/operators/tensorrt_engine_op.cu.cc +++ b/paddle/fluid/operators/tensorrt/tensorrt_engine_op.cu.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/tensorrt_engine_op.h" +#include "paddle/fluid/operators/tensorrt/tensorrt_engine_op.h" namespace ops = paddle::operators; diff --git a/paddle/fluid/operators/tensorrt_engine_op.h b/paddle/fluid/operators/tensorrt/tensorrt_engine_op.h similarity index 95% rename from paddle/fluid/operators/tensorrt_engine_op.h rename to paddle/fluid/operators/tensorrt/tensorrt_engine_op.h index d4ba0f9c33c91811647f9d19a332f139c16b0eb2..3af9376da1d3fa096b277e6b5a9d1a8de197d6f1 100644 --- a/paddle/fluid/operators/tensorrt_engine_op.h +++ b/paddle/fluid/operators/tensorrt/tensorrt_engine_op.h @@ -34,7 +34,7 @@ namespace operators { using FluidDT = framework::proto::VarType_Type; using TRT_DT = nvinfer1::DataType; -namespace { +namespace { // NOLINT TRT_DT FluidDataType2TRT(FluidDT type) { switch (type) { @@ -60,7 +60,7 @@ nvinfer1::Dims Vec2TRT_Dims(const std::vector& shape) { return nvinfer1::DimsCHW(shape[1], 1, 1); } -} // namespace +} // namespace // NOLINT using inference::Singleton; using inference::tensorrt::TRT_EngineManager; @@ -127,9 +127,9 @@ class TensorRTEngineKernel : public framework::OpKernel { // Convert output tensor from engine to fluid int output_index = 0; - VLOG(4) << "TensorRT Engine Op Outputs:"; + VLOG(40) << "TensorRT Engine Op Outputs:"; for (const auto& y : context.Outputs("Ys")) { - VLOG(4) << y; + VLOG(40) << y; // convert output and copy to fluid. nvinfer1::ITensor* trt_t = engine->GetITensor(output_maps[output_index]); auto dims = trt_t->getDimensions(); @@ -167,7 +167,7 @@ class TensorRTEngineKernel : public framework::OpKernel { protected: void Prepare(const framework::ExecutionContext& context) const { - VLOG(4) << "Prepare engine"; + VLOG(40) << "Prepare engine"; // Get the ProgramDesc and pass to convert. framework::proto::BlockDesc block_desc; block_desc.ParseFromString(context.Attr("subgraph")); @@ -192,12 +192,12 @@ class TensorRTEngineKernel : public framework::OpKernel { engine->InitNetwork(); framework::BlockDesc block(nullptr /*programdesc*/, &block_desc); - VLOG(4) << "parsed var size " << block.AllVars().size(); + VLOG(40) << "parsed var size " << block.AllVars().size(); // Add inputs - VLOG(4) << "declare inputs"; + VLOG(40) << "declare inputs"; for (auto& input : context.Inputs("Xs")) { if (parameters.count(input)) continue; - VLOG(4) << "declare input " << input; + VLOG(40) << "declare input " << input; auto* var = block.FindVar(input); // TensorRT engine need to create parameters. 
The parameter's description // should be set in @@ -223,7 +223,9 @@ class TensorRTEngineKernel : public framework::OpKernel { // Add outputs for (auto& output : output_maps) { - engine->DeclareOutput(output); + if (!engine->HasDeclared(output)) { + engine->DeclareOutput(output); + } } engine->FreezeNetwork(); diff --git a/paddle/fluid/operators/tensorrt_engine_op_test.cc b/paddle/fluid/operators/tensorrt/tensorrt_engine_op_test.cc similarity index 99% rename from paddle/fluid/operators/tensorrt_engine_op_test.cc rename to paddle/fluid/operators/tensorrt/tensorrt_engine_op_test.cc index e21101e8d12f210af08284dbcebe5c14c1af6dd3..56bdd6c2f2801967829f2baf889b5517a1d9d8d9 100644 --- a/paddle/fluid/operators/tensorrt_engine_op_test.cc +++ b/paddle/fluid/operators/tensorrt/tensorrt_engine_op_test.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/tensorrt_engine_op.h" +#include "paddle/fluid/operators/tensorrt/tensorrt_engine_op.h" #include #include "paddle/fluid/framework/block_desc.h" #include "paddle/fluid/framework/lod_tensor.h" diff --git a/paddle/fluid/operators/top_k_op.cc b/paddle/fluid/operators/top_k_op.cc index 4a8ac441cfaf642fde58ee30865a22e83c065498..c17d1afc309c65035063348d4934ea1783b018ed 100644 --- a/paddle/fluid/operators/top_k_op.cc +++ b/paddle/fluid/operators/top_k_op.cc @@ -50,7 +50,7 @@ class TopkOpMaker : public framework::OpProtoAndCheckerMaker { public: void Make() override { AddInput("X", "(Tensor) The input of Topk op"); - AddOutput("Out", "(Tensor) The output tensor of Topk op").Reuse("X"); + AddOutput("Out", "(Tensor) The output tensor of Topk op"); AddOutput("Indices", "(Tensor) The indices of Topk elements of input"); AddComment(R"DOC( Top K operator diff --git a/paddle/fluid/operators/top_k_op.cu b/paddle/fluid/operators/top_k_op.cu index 8e4a07556fb51dbb15ef948fcee120e2f68e089a..0cad224ca8860b0e4bc2e3f2bc1659235aadfe2d 100644 --- a/paddle/fluid/operators/top_k_op.cu +++ b/paddle/fluid/operators/top_k_op.cu @@ -262,31 +262,31 @@ __global__ void KeMatrixTopK(T* output, int output_stride, int64_t* indices, const T* src, int lds, int dim, int k, int grid_dim, int num) { __shared__ Pair sh_topk[BlockSize]; - __shared__ int maxid[BlockSize / 2]; const int tid = threadIdx.x; const int warp = threadIdx.x / 32; const int bid = blockIdx.x; for (int i = bid; i < num; i += grid_dim) { - output += i * output_stride; - indices += i * k; - + int top_num = k; + __shared__ int maxid[BlockSize / 2]; + T* out = output + i * output_stride; + int64_t* inds = indices + i * k; Pair topk[MaxLength]; int beam = MaxLength; Pair max; bool is_empty = false; bool firststep = true; - for (int k = 0; k < MaxLength; k++) { - topk[k].set(-INFINITY, -1); + for (int j = 0; j < MaxLength; j++) { + topk[j].set(-INFINITY, -1); } - while (k) { + while (top_num) { ThreadGetTopK( topk, &beam, k, src + i * lds, &firststep, &is_empty, &max, dim, tid); sh_topk[tid] = topk[0]; - BlockReduce(sh_topk, maxid, topk, &output, - &indices, &beam, &k, tid, warp); + BlockReduce(sh_topk, maxid, topk, &out, &inds, + &beam, &top_num, tid, warp); } } } @@ -327,13 +327,15 @@ class TopkOpCUDAKernel : public framework::OpKernel { size_t k = static_cast(ctx.Attr("k")); const T* input_data = input->data(); - T* output_data = output->mutable_data(ctx.GetPlace()); // FIXME(typhoonzero): data is always converted to type T? 
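// Both top_k kernels in this hunk stop assuming a 2-D input: every leading
// dimension is folded into "rows" (framework::product of all dims but the
// last) and top-k runs independently over each row of the last dimension.
// A self-contained sketch of that flattening, with illustrative names and a
// plain partial_sort in place of the kernels' heap/warp logic:
#include <algorithm>
#include <functional>
#include <vector>

// Return the k largest values of each row of the (rows x col) view of a
// flat buffer, where col is the last dim and rows is the product of the rest.
std::vector<float> TopKPerRow(const std::vector<float>& data,
                              const std::vector<size_t>& dims, size_t k) {
  size_t col = dims.back();
  size_t rows = data.size() / col;
  if (k > col) k = col;  // mirrors the kernel's "if (k > input_width)" clamp
  std::vector<float> out;
  out.reserve(rows * k);
  for (size_t r = 0; r < rows; ++r) {
    std::vector<float> row(data.begin() + r * col,
                           data.begin() + (r + 1) * col);
    std::partial_sort(row.begin(), row.begin() + k, row.end(),
                      std::greater<float>());
    out.insert(out.end(), row.begin(), row.begin() + k);
  }
  return out;
}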
int64_t* indices_data = indices->mutable_data(ctx.GetPlace()); - size_t input_height = input->dims()[0]; - size_t input_width = input->dims()[1]; + framework::DDim inputdims = input->dims(); + const size_t input_height = framework::product( + framework::slice_ddim(inputdims, 0, inputdims.size() - 1)); + const size_t input_width = inputdims[inputdims.size() - 1]; + if (k > input_width) k = input_width; // NOTE: pass lds and dim same to input width. @@ -342,14 +344,12 @@ class TopkOpCUDAKernel : public framework::OpKernel { const int kMaxHeight = 2048; int gridx = input_height < kMaxHeight ? input_height : kMaxHeight; auto& dev_ctx = ctx.cuda_device_context(); - switch (GetDesiredBlockDim(input_width)) { FIXED_BLOCK_DIM( KeMatrixTopK<<>>( - output_data, output->dims()[1], indices_data, input_data, - input_width, input_width, static_cast(k), gridx, - input_height)); + output_data, k, indices_data, input_data, input_width, + input_width, static_cast(k), gridx, input_height)); default: PADDLE_THROW("Error"); } diff --git a/paddle/fluid/operators/top_k_op.h b/paddle/fluid/operators/top_k_op.h index 054dd481994d03f71b0ed5dc73e103085f6c91aa..76ece57b39919148da04caecaa43ea9d2b9d95df 100644 --- a/paddle/fluid/operators/top_k_op.h +++ b/paddle/fluid/operators/top_k_op.h @@ -34,7 +34,6 @@ class TopkKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { // Get the top k elements of each row of input tensor - // FIXME: only deal with matrix(2d tensor). auto* input = ctx.Input("X"); auto* output = ctx.Output("Out"); auto* indices = ctx.Output("Indices"); @@ -44,8 +43,6 @@ class TopkKernel : public framework::OpKernel { T* output_data = output->mutable_data(ctx.GetPlace()); int64_t* indices_data = indices->mutable_data(ctx.GetPlace()); - auto eg_input = EigenMatrix::From(*input); - // reshape input to a flattern matrix(like flat_inner_dims) framework::DDim inputdims = input->dims(); const size_t row = framework::product( @@ -53,7 +50,7 @@ class TopkKernel : public framework::OpKernel { const size_t col = inputdims[inputdims.size() - 1]; Eigen::DSizes flat2dims(row, col); // NOTE: eigen shape doesn't affect paddle tensor. 
-    eg_input.reshape(flat2dims);
+    auto eg_input = EigenMatrix<T>::Reshape(*input, inputdims.size() - 1);
 
 #ifdef PADDLE_WITH_MKLML
 #pragma omp parallel for
diff --git a/paddle/fluid/operators/transpose_op.cc b/paddle/fluid/operators/transpose_op.cc
index 6a9fc6611a8f8eaa6749aefac0673ccabaebbcfe..bbd71db6062107f6ba40343c84d942b54b3958e6 100644
--- a/paddle/fluid/operators/transpose_op.cc
+++ b/paddle/fluid/operators/transpose_op.cc
@@ -210,18 +210,21 @@ REGISTER_OPERATOR(transpose, ops::TransposeOp, ops::TransposeOpMaker,
 REGISTER_OPERATOR(transpose_grad, ops::TransposeOpGrad);
 
 REGISTER_OP_CPU_KERNEL(
-    transpose, ops::TransposeKernel<paddle::platform::CPUDeviceContext, float>);
+    transpose, ops::TransposeKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::TransposeKernel<paddle::platform::CPUDeviceContext, double>);
 REGISTER_OP_CPU_KERNEL(
     transpose_grad,
-    ops::TransposeGradKernel<paddle::platform::CPUDeviceContext, float>);
+    ops::TransposeGradKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::TransposeGradKernel<paddle::platform::CPUDeviceContext, double>);
 
 REGISTER_OPERATOR(transpose2, ops::Transpose2Op, ops::Transpose2OpMaker,
                   ops::Transpose2GradMaker);
 REGISTER_OPERATOR(transpose2_grad, ops::Transpose2OpGrad);
 
 REGISTER_OP_CPU_KERNEL(
-    transpose2,
-    ops::TransposeKernel<paddle::platform::CPUDeviceContext, float>);
+    transpose2, ops::TransposeKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::TransposeKernel<paddle::platform::CPUDeviceContext, double>);
 REGISTER_OP_CPU_KERNEL(
     transpose2_grad,
-    ops::TransposeGradKernel<paddle::platform::CPUDeviceContext, float>);
+    ops::TransposeGradKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::TransposeGradKernel<paddle::platform::CPUDeviceContext, double>);
diff --git a/paddle/fluid/operators/transpose_op.cu.cc b/paddle/fluid/operators/transpose_op.cu.cc
index c1b5a8b31be243fab3af06a18c8e51986c953700..b4025350fa9f3610bde43eee91cd059f3063813f 100644
--- a/paddle/fluid/operators/transpose_op.cu.cc
+++ b/paddle/fluid/operators/transpose_op.cu.cc
@@ -16,15 +16,18 @@ limitations under the License. */
 namespace ops = paddle::operators;
 
 REGISTER_OP_CUDA_KERNEL(
-    transpose,
-    ops::TransposeKernel<paddle::platform::CUDADeviceContext, float>);
+    transpose, ops::TransposeKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::TransposeKernel<paddle::platform::CUDADeviceContext, double>);
 REGISTER_OP_CUDA_KERNEL(
     transpose_grad,
-    ops::TransposeGradKernel<paddle::platform::CUDADeviceContext, float>);
+    ops::TransposeGradKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::TransposeGradKernel<paddle::platform::CUDADeviceContext, double>);
 REGISTER_OP_CUDA_KERNEL(
     transpose2,
-    ops::TransposeKernel<paddle::platform::CUDADeviceContext, float>);
+    ops::TransposeKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::TransposeKernel<paddle::platform::CUDADeviceContext, double>);
 REGISTER_OP_CUDA_KERNEL(
     transpose2_grad,
-    ops::TransposeGradKernel<paddle::platform::CUDADeviceContext, float>);
+    ops::TransposeGradKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::TransposeGradKernel<paddle::platform::CUDADeviceContext, double>);
diff --git a/paddle/fluid/operators/uniform_random_op.cc b/paddle/fluid/operators/uniform_random_op.cc
index aa907595cb7cf165974caa69fe8eb0370471732d..e3132ae76f624f3338d749e4fcebbd0ecd7ffe79 100644
--- a/paddle/fluid/operators/uniform_random_op.cc
+++ b/paddle/fluid/operators/uniform_random_op.cc
@@ -29,7 +29,7 @@ class CPUUniformRandomKernel : public framework::OpKernel<T> {
     if (out_var->IsType<framework::LoDTensor>()) {
       tensor = out_var->GetMutable<framework::LoDTensor>();
     } else if (out_var->IsType<framework::SelectedRows>()) {
-      auto shape = ctx.Attr<std::vector<int>>("shape");
+      auto shape = ctx.Attr<std::vector<int64_t>>("shape");
       auto *selected_rows = out_var->GetMutable<framework::SelectedRows>();
       tensor = selected_rows->mutable_value();
       tensor->Resize(framework::make_ddim(shape));
@@ -67,7 +67,7 @@ class UniformRandomOp : public framework::OperatorWithKernel {
     PADDLE_ENFORCE(
         ctx->Attrs().Get<float>("min") < ctx->Attrs().Get<float>("max"),
         "uniform_random's min must be less than max");
-    auto &shape = ctx->Attrs().Get<std::vector<int>>("shape");
+    auto &shape = ctx->Attrs().Get<std::vector<int64_t>>("shape");
     std::vector<int64_t> temp;
     temp.reserve(shape.size());
     for (auto dim : shape) {
@@ -94,7 +94,7 @@ This operator initializes a tensor with random values sampled from a uniform
 distribution. The random result is in set [min, max].
 
 )DOC");
-    AddAttr<std::vector<int>>("shape", "The shape of the output tensor");
+    AddAttr<std::vector<int64_t>>("shape", "The shape of the output tensor");
     AddAttr<float>("min", "Minimum value of uniform random. [default -1.0].")
         .SetDefault(-1.0f);
    AddAttr<float>("max", "Maximum value of uniform random. 
[default 1.0].") diff --git a/paddle/fluid/operators/uniform_random_op.cu b/paddle/fluid/operators/uniform_random_op.cu index bbb692b0ddfc18e8a62c0d2a6bac88f9932f6704..2bb0ecc139f7096d1b61150e0a2d4fb095338749 100644 --- a/paddle/fluid/operators/uniform_random_op.cu +++ b/paddle/fluid/operators/uniform_random_op.cu @@ -48,7 +48,7 @@ class GPUUniformRandomKernel : public framework::OpKernel { if (out_var->IsType()) { tensor = out_var->GetMutable(); } else if (out_var->IsType()) { - auto shape = context.Attr>("shape"); + auto shape = context.Attr>("shape"); tensor = out_var->GetMutable()->mutable_value(); tensor->Resize(framework::make_ddim(shape)); } else { diff --git a/paddle/fluid/operators/unpool_op.cc b/paddle/fluid/operators/unpool_op.cc index 1d441b43b14ea194152095874645f8133c423efd..6d2ccb38f677962d52ff97df25321bf195759dcd 100644 --- a/paddle/fluid/operators/unpool_op.cc +++ b/paddle/fluid/operators/unpool_op.cc @@ -57,8 +57,8 @@ class Unpool2dOpMaker : public framework::OpProtoAndCheckerMaker { Input shape is: $(N, C_{in}, H_{in}, W_{in})$, Output shape is: $(N, C_{out}, H_{out}, W_{out})$, where $$ -H_{out} = (H_{in}−1) * strides[0] − 2 * paddings[0] + ksize[0] \\ -W_{out} = (W_{in}−1) * strides[1] − 2 * paddings[1] + ksize[1] +H_{out} = (H_{in}-1) * strides[0] - 2 * paddings[0] + ksize[0] \\ +W_{out} = (W_{in}-1) * strides[1] - 2 * paddings[1] + ksize[1] $$ Paper: http://www.matthewzeiler.com/wp-content/uploads/2017/07/iccv2011.pdf )DOC"); diff --git a/paddle/fluid/operators/warpctc_cudnn_op.cu.cc b/paddle/fluid/operators/warpctc_cudnn_op.cu.cc new file mode 100644 index 0000000000000000000000000000000000000000..a764d59410c90535dbda0b3f11e89ae9bf578c04 --- /dev/null +++ b/paddle/fluid/operators/warpctc_cudnn_op.cu.cc @@ -0,0 +1,195 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/fluid/framework/mixed_vector.h" +#include "paddle/fluid/operators/math/softmax.h" +#include "paddle/fluid/operators/warpctc_op.h" +#include "paddle/fluid/platform/cudnn_helper.h" + +namespace paddle { +namespace operators { + +#if CUDNN_VERSION >= 7001 +using ScopedTensorDescriptor = platform::ScopedTensorDescriptor; +using ScopedCTCLossDescriptor = platform::ScopedCTCLossDescriptor; +using DataLayout = platform::DataLayout; + +template +class CudnnCTCKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + // =====================Copied code from warpctc=========================== + auto* logits = ctx.Input("Logits"); + auto* label = ctx.Input("Label"); + auto* warpctc_grad = ctx.Output("WarpCTCGrad"); + auto* loss = ctx.Output("Loss"); + + const size_t level = 0; + + auto logits_lod = framework::ToAbsOffset(logits->lod()); + auto logits_dims = logits->dims(); + PADDLE_ENFORCE_EQ(logits_dims[0], + static_cast(logits_lod[level].back()), + "The first dimension of Input(Logits) should be equal to " + "the sum of all sequences' lengths."); + + auto label_lod = framework::ToAbsOffset(label->lod()); + auto label_dims = label->dims(); + PADDLE_ENFORCE_EQ( + label_dims[0], label->numel(), + "The width of each timestep in Input(Label) should be 1."); + + const size_t num_sequences = logits_lod[level].size() - 1; + PADDLE_ENFORCE_EQ(num_sequences, label_lod[level].size() - 1, + "The number of sequences of Input(Logits) should be " + "equal to that of Input(Label)."); + PADDLE_ENFORCE_LE(num_sequences, 256, + "The labelLengths must less than 256 for cudnn call."); + + const size_t sequence_width = logits->numel() / logits_dims[0]; + auto loss_dims = + framework::make_ddim({static_cast(num_sequences), 1}); + + // NOTE: cudnn takes softmax input, calculate softmax first, then do padding + auto& dev_ctx = ctx.template device_context(); + LoDTensor softmax_logits; + softmax_logits.mutable_data(logits->dims(), ctx.GetPlace()); + softmax_logits.set_lod(logits_lod); + int rank = logits->dims().size(); + Tensor in_2d = framework::ReshapeToMatrix(*logits, rank - 1); + Tensor out_2d = framework::ReshapeToMatrix(softmax_logits, rank - 1); + math::SoftmaxFunctor()(dev_ctx, &in_2d, &out_2d); + + // ctc needs sequences data stored in transposed padding format + // logits and grad using padding data of layout 'TNC' + // T: max_sequence_length + // N: batch_size (num_sequences) + // C: width + LoDTensor warpctc_logits; + const size_t max_sequence_length = + math::MaximumSequenceLength(logits_lod[level]); + auto warpctc_logits_dims = + framework::make_ddim({static_cast(max_sequence_length), + static_cast(num_sequences), + static_cast(sequence_width)}); + warpctc_logits.mutable_data(warpctc_logits_dims, ctx.GetPlace()); + + LoDTensor cpu_pad_value; + T* pad_value_data = + cpu_pad_value.mutable_data({1}, platform::CPUPlace()); + *pad_value_data = static_cast(0); + LoDTensor pad_value; + if (platform::is_cpu_place(ctx.GetPlace())) { + pad_value = cpu_pad_value; + } else { + TensorCopySync(cpu_pad_value, ctx.GetPlace(), &pad_value); + } + + math::PaddingLoDTensorFunctor()( + ctx.template device_context(), softmax_logits, + &warpctc_logits, pad_value, -1, 0, false /* norm_by_times */, + math::kLengthBatchWidth); + const T* warpctc_logits_data = warpctc_logits.data(); + + std::vector warpctc_label_lengths(num_sequences); + std::vector warpctc_logits_lengths(num_sequences); + + for (size_t i = 0; i < num_sequences; ++i) { + 
warpctc_label_lengths[i] = label_lod[level][i + 1] - label_lod[level][i]; + warpctc_logits_lengths[i] = + logits_lod[level][i + 1] - logits_lod[level][i]; + } + + T* warpctc_grad_data = + warpctc_grad->mutable_data(warpctc_logits.dims(), ctx.GetPlace()); + + math::SetConstant()( + ctx.template device_context(), warpctc_grad, + static_cast(0)); + + Tensor warpctc_label; + TensorCopySync(*label, platform::CPUPlace(), &warpctc_label); + const int* warpctc_label_data = warpctc_label.data(); + // ======================================================================== + + ScopedTensorDescriptor logits_desc; + ScopedTensorDescriptor grad_desc; + ScopedCTCLossDescriptor ctcloss_desc; + // layout here doesn't have effect. + DataLayout layout = DataLayout::kNCHW; + + auto cu_logits_desc = logits_desc.descriptor( + layout, framework::vectorize2int(warpctc_logits.dims())); + auto cu_grad_desc = grad_desc.descriptor( + layout, framework::vectorize2int(warpctc_grad->dims())); + auto cu_ctcloss_desc = ctcloss_desc.descriptor(); + + auto handle = dev_ctx.cudnn_handle(); + size_t workspace_size; + + CUDNN_ENFORCE(platform::dynload::cudnnGetCTCLossWorkspaceSize( + handle, cu_logits_desc, cu_grad_desc, warpctc_label_data, + warpctc_label_lengths.data(), warpctc_logits_lengths.data(), + CUDNN_CTC_LOSS_ALGO_DETERMINISTIC, cu_ctcloss_desc, &workspace_size)); + + T* loss_data = loss->mutable_data(loss_dims, ctx.GetPlace()); + + auto workspace_handle = dev_ctx.cudnn_workspace_handle(); + auto cudnn_func = [&](void* cudnn_workspace) { + CUDNN_ENFORCE(platform::dynload::cudnnCTCLoss( + handle, cu_logits_desc, warpctc_logits_data, warpctc_label_data, + warpctc_label_lengths.data(), warpctc_logits_lengths.data(), + loss_data, cu_grad_desc, warpctc_grad_data, + CUDNN_CTC_LOSS_ALGO_DETERMINISTIC, cu_ctcloss_desc, cudnn_workspace, + workspace_size)); + }; + workspace_handle.RunFunc(cudnn_func, workspace_size); + } +}; + +template +class CudnnCTCGradKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* warpctc_grad = ctx.Input("WarpCTCGrad"); + auto* logits_grad = ctx.Output(framework::GradVarName("Logits")); + const Tensor* loss_grad = ctx.Input(framework::GradVarName("Loss")); + + logits_grad->mutable_data(ctx.GetPlace()); + bool norm_by_times = ctx.Attr("norm_by_times"); + math::UnpaddingLoDTensorFunctor()( + ctx.template device_context(), *warpctc_grad, + logits_grad, -1, 0, norm_by_times, math::kLengthBatchWidth); + + const T* loss_grad_data = loss_grad->data(); + math::ScaleLoDTensorFunctor()( + ctx.template device_context(), loss_grad_data, + logits_grad); + } +}; + +#endif +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +namespace plat = paddle::platform; +#if CUDNN_VERSION >= 7001 +REGISTER_OP_KERNEL( + warpctc, CUDNN, plat::CUDAPlace, + ops::CudnnCTCKernel); +REGISTER_OP_KERNEL( + warpctc_grad, CUDNN, plat::CUDAPlace, + ops::CudnnCTCGradKernel); +#endif diff --git a/paddle/fluid/operators/warpctc_op.cc b/paddle/fluid/operators/warpctc_op.cc index e06c8c962f45a4e91b7efed7431571f0fc6870a3..6a257cebf523bfeb1951b709480140e733126f6a 100644 --- a/paddle/fluid/operators/warpctc_op.cc +++ b/paddle/fluid/operators/warpctc_op.cc @@ -14,6 +14,10 @@ limitations under the License. 
*/ #include "paddle/fluid/operators/warpctc_op.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/cudnn_helper.h" +#endif + namespace paddle { namespace operators { @@ -45,9 +49,16 @@ class WarpCTCOp : public framework::OperatorWithKernel { protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { + framework::LibraryType library_{framework::LibraryType::kPlain}; +#ifdef PADDLE_WITH_CUDA + if (platform::CanCUDNNBeUsed(ctx)) { + library_ = framework::LibraryType::kCUDNN; + } +#endif + framework::DataLayout layout_ = framework::DataLayout::kAnyLayout; return framework::OpKernelType( framework::ToDataType(ctx.Input("Logits")->type()), - ctx.device_context()); + ctx.device_context(), layout_, library_); } }; @@ -86,6 +97,10 @@ class WarpCTCOpMaker : public framework::OpProtoAndCheckerMaker { "normalize the gradients by the number of time-step, " "which is also the sequence's length.") .SetDefault(false); + AddAttr("use_cudnn", + "(bool, default: false), whether to " + "use cudnn kernel.") + .SetDefault(false); AddComment(R"DOC( An operator integrating the open-source [warp-ctc](https://github.com/baidu-research/warp-ctc) library, which is used in diff --git a/paddle/fluid/platform/CMakeLists.txt b/paddle/fluid/platform/CMakeLists.txt index 5af8af640e43a5b2e5ee9856f09f66a9fdf4463c..93cb5eb2dc0b3480ebd05dcc6b36d8915d057bab 100644 --- a/paddle/fluid/platform/CMakeLists.txt +++ b/paddle/fluid/platform/CMakeLists.txt @@ -1,4 +1,3 @@ -if (NOT WIN32) proto_library(profiler_proto SRCS profiler.proto DEPS framework_proto) py_proto_compile(profiler_py_proto SRCS profiler.proto) @@ -6,11 +5,19 @@ add_custom_target(profiler_py_proto_init ALL COMMAND ${CMAKE_COMMAND} -E touch _ add_dependencies(profiler_py_proto profiler_py_proto_init) +if (NOT WIN32) add_custom_command(TARGET profiler_py_proto POST_BUILD COMMAND ${CMAKE_COMMAND} -E make_directory ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto/profiler COMMAND cp *.py ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto/profiler COMMENT "Copy generated python proto into directory paddle/fluid/proto/profiler." WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) +else(NOT WIN32) +string(REPLACE "/" "\\" proto_dstpath "${PADDLE_BINARY_DIR}/python/paddle/fluid/proto/profiler/") +add_custom_command(TARGET profiler_py_proto POST_BUILD + COMMAND ${CMAKE_COMMAND} -E make_directory ${PADDLE_BINARY_DIR}/python/paddle/fluid/proto/profiler + COMMAND copy /Y *.py ${proto_dstpath} + COMMENT "Copy generated python proto into directory paddle/fluid/proto/profiler." 
+ WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) endif(NOT WIN32) if(WITH_GPU) @@ -60,12 +67,9 @@ cc_test(init_test SRCS init_test.cc DEPS device_context) nv_test(cudnn_helper_test SRCS cudnn_helper_test.cc DEPS dynload_cuda) nv_test(transform_test SRCS transform_test.cu DEPS memory place device_context) - -if (NOT WIN32) cc_library(device_tracer SRCS device_tracer.cc DEPS boost profiler_proto framework_proto ${GPU_CTX_DEPS}) cc_library(profiler SRCS profiler.cc DEPS device_context device_tracer) cc_test(profiler_test SRCS profiler_test.cc DEPS profiler) -endif(NOT WIN32) nv_test(float16_gpu_test SRCS float16_test.cu DEPS lod_tensor) cc_test(float16_test SRCS float16_test.cc DEPS lod_tensor) @@ -73,3 +77,4 @@ cc_test(float16_test SRCS float16_test.cc DEPS lod_tensor) IF(WITH_GPU) nv_test(cuda_helper_test SRCS cuda_helper_test.cu) ENDIF() +nv_library(cuda_device_guard SRCS cuda_device_guard.cc DEPS gpu_info) diff --git a/paddle/fluid/platform/cpu_helper.cc b/paddle/fluid/platform/cpu_helper.cc index 234a04b5c2eb5ee643e8a4e723b28331cd8e6ee0..b737a6c38d0441cbfcd55ed2c55969ceca68db5d 100644 --- a/paddle/fluid/platform/cpu_helper.cc +++ b/paddle/fluid/platform/cpu_helper.cc @@ -29,12 +29,19 @@ namespace platform { void SetNumThreads(int num_threads) { #ifdef PADDLE_USE_OPENBLAS +// windows has no support for openblas multi-thread +// please refer to: https://github.com/PaddlePaddle/Paddle/issues/7234 +#ifdef _WIN32 + if (num_threads > 1) { + num_threads = 1; + } +#endif int real_num_threads = num_threads > 1 ? num_threads : 1; openblas_set_num_threads(real_num_threads); #elif defined(PADDLE_WITH_MKLML) int real_num_threads = num_threads > 1 ? num_threads : 1; platform::dynload::MKL_Set_Num_Threads(real_num_threads); - omp_set_num_threads(num_threads); + omp_set_num_threads(real_num_threads); #else PADDLE_ENFORCE(false, "To be implemented."); #endif diff --git a/paddle/fluid/platform/cpu_info.cc b/paddle/fluid/platform/cpu_info.cc index b5f472d20f40fa182a4aa55ff384b0954e4ba9e3..d466f28d1ea0a8327f8d7a45c3e55c5aacd61544 100644 --- a/paddle/fluid/platform/cpu_info.cc +++ b/paddle/fluid/platform/cpu_info.cc @@ -56,10 +56,17 @@ DEFINE_double( "Default use 50% of CPU memory as the pinned_memory for PaddlePaddle," "reserve the rest for page tables, etc"); +// If use_pinned_memory is true, CPUAllocator calls mlock, which +// returns pinned and locked memory as staging areas for data exchange +// between host and device. Allocates too much would reduce the amount +// of memory available to the system for paging. So, by default, we +// should set false to use_pinned_memory. +DEFINE_bool(use_pinned_memory, true, "If set, allocate cpu pinned memory."); + namespace paddle { namespace platform { -inline size_t CpuTotalPhysicalMemory() { +size_t CpuTotalPhysicalMemory() { #ifdef __APPLE__ int mib[2]; mib[0] = CTL_HW; diff --git a/paddle/fluid/platform/cpu_info.h b/paddle/fluid/platform/cpu_info.h index 6810a1651a14cdb2080af846b21cad242b70bf35..fd31ef77b46d5b5b641983a0421da31914c87c18 100644 --- a/paddle/fluid/platform/cpu_info.h +++ b/paddle/fluid/platform/cpu_info.h @@ -19,6 +19,8 @@ limitations under the License. */ namespace paddle { namespace platform { +size_t CpuTotalPhysicalMemory(); + //! Get the maximum allocation size for a machine. 
size_t CpuMaxAllocSize(); diff --git a/paddle/fluid/platform/cuda_device_guard.cc b/paddle/fluid/platform/cuda_device_guard.cc new file mode 100644 index 0000000000000000000000000000000000000000..8582ec9f604f96b244a0f2d650aa8d669d6fc66c --- /dev/null +++ b/paddle/fluid/platform/cuda_device_guard.cc @@ -0,0 +1,22 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/platform/cuda_device_guard.h" + +namespace paddle { +namespace platform { +// Even this source file does not contains any code, it is better to keep this +// source file for cmake dependency. +} // namespace platform +} // namespace paddle diff --git a/paddle/fluid/platform/cuda_device_guard.h b/paddle/fluid/platform/cuda_device_guard.h new file mode 100644 index 0000000000000000000000000000000000000000..a85ebf4b8136630712d39d98e2341ee919cf6e45 --- /dev/null +++ b/paddle/fluid/platform/cuda_device_guard.h @@ -0,0 +1,45 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include "paddle/fluid/platform/gpu_info.h" + +namespace paddle { +namespace platform { + +class CUDADeviceGuard { + public: + explicit inline CUDADeviceGuard(int dev_id) { + int prev_id = platform::GetCurrentDeviceId(); + if (prev_id != dev_id) { + prev_id_ = prev_id; + platform::SetDeviceId(dev_id); + } + } + + inline ~CUDADeviceGuard() { + if (prev_id_ != -1) { + platform::SetDeviceId(prev_id_); + } + } + + CUDADeviceGuard(const CUDADeviceGuard& o) = delete; + CUDADeviceGuard& operator=(const CUDADeviceGuard& o) = delete; + + private: + int prev_id_{-1}; +}; + +} // namespace platform +} // namespace paddle diff --git a/paddle/fluid/platform/cudnn_helper.h b/paddle/fluid/platform/cudnn_helper.h index bb8b14bb9fa41942c3aa653ca224c0842fbf9a00..61a25064d17994e3ce5853017263f24a859c69be 100644 --- a/paddle/fluid/platform/cudnn_helper.h +++ b/paddle/fluid/platform/cudnn_helper.h @@ -14,6 +14,7 @@ limitations under the License. 
*/ #pragma once +#include #include #include "paddle/fluid/framework/operator.h" @@ -61,7 +62,7 @@ inline const char* cudnnGetErrorString(cudnnStatus_t status) { #define CUDNN_ENFORCE(condition) \ do { \ - cudnnStatus_t status = condition; \ + auto status = condition; \ if (UNLIKELY(status != CUDNN_STATUS_SUCCESS)) { \ PADDLE_THROW(::paddle::platform::cudnnGetErrorString(status)); \ } \ @@ -76,8 +77,19 @@ enum class DataLayout { // Not use enum class PoolingMode { kMaximum, - kAverage, kMaximumDeterministic, + kAverageExclusive, + kAverageInclusive, +}; + +enum ActivationMode { + kNone, // activation identity + kSigmoid, + kRelu, + kRelu6, + kReluX, + kTanh, + kBandPass, }; #if CUDNN_VERSION < 6000 @@ -91,8 +103,10 @@ inline cudnnPoolingMode_t GetPoolingMode(const PoolingMode& mode) { switch (mode) { case PoolingMode::kMaximumDeterministic: return CUDNN_POOLING_MAX; - case PoolingMode::kAverage: + case PoolingMode::kAverageExclusive: return CUDNN_POOLING_AVERAGE_COUNT_EXCLUDE_PADDING; + case PoolingMode::kAverageInclusive: + return CUDNN_POOLING_AVERAGE_COUNT_INCLUDE_PADDING; case PoolingMode::kMaximum: return CUDNN_POOLING_MAX; default: @@ -105,8 +119,10 @@ inline cudnnPoolingMode_t GetPoolingMode(const PoolingMode& mode) { switch (mode) { case PoolingMode::kMaximumDeterministic: return CUDNN_POOLING_MAX_DETERMINISTIC; - case PoolingMode::kAverage: + case PoolingMode::kAverageExclusive: return CUDNN_POOLING_AVERAGE_COUNT_EXCLUDE_PADDING; + case PoolingMode::kAverageInclusive: + return CUDNN_POOLING_AVERAGE_COUNT_INCLUDE_PADDING; case PoolingMode::kMaximum: return CUDNN_POOLING_MAX; default: @@ -115,6 +131,26 @@ inline cudnnPoolingMode_t GetPoolingMode(const PoolingMode& mode) { } #endif // CUDNN_VERSION < 6000 +inline ActivationMode StringToActivationMode(const std::string& str) { + if (str == "identity") { + return ActivationMode::kNone; + } else if (str == "sigmoid") { + return ActivationMode::kSigmoid; + } else if (str == "relu") { + return ActivationMode::kRelu; + } else if (str == "relu6") { + return ActivationMode::kRelu6; + } else if (str == "relux") { + return ActivationMode::kReluX; + } else if (str == "tanh") { + return ActivationMode::kTanh; + } else if (str == "bandpass") { + return ActivationMode::kBandPass; + } else { + PADDLE_THROW("Unknown activation string: %s", str); + } +} + template class CudnnDataType; @@ -341,6 +377,80 @@ class ScopedPoolingDescriptor { DISABLE_COPY_AND_ASSIGN(ScopedPoolingDescriptor); }; +class ScopedSpatialTransformerDescriptor { + public: + ScopedSpatialTransformerDescriptor() { + PADDLE_ENFORCE(dynload::cudnnCreateSpatialTransformerDescriptor(&desc_)); + } + ~ScopedSpatialTransformerDescriptor() { + PADDLE_ENFORCE(dynload::cudnnDestroySpatialTransformerDescriptor(desc_)); + } + + template + inline cudnnSpatialTransformerDescriptor_t descriptor(const int nbDims, + const int dimA[]) { + PADDLE_ENFORCE(dynload::cudnnSetSpatialTransformerNdDescriptor( + desc_, CUDNN_SAMPLER_BILINEAR, CudnnDataType::type, nbDims, dimA)); + return desc_; + } + + private: + cudnnSpatialTransformerDescriptor_t desc_; + DISABLE_COPY_AND_ASSIGN(ScopedSpatialTransformerDescriptor); +}; + +class ScopedActivationDescriptor { + public: + ScopedActivationDescriptor() { + PADDLE_ENFORCE(dynload::cudnnCreateActivationDescriptor(&desc_)); + } + ~ScopedActivationDescriptor() { + PADDLE_ENFORCE(dynload::cudnnDestroyActivationDescriptor(desc_)); + } + + template + inline cudnnActivationDescriptor_t descriptor( + const std::string& act, double value_max = static_cast(0.)) { + double 
relu_ceiling = 0.0; + ActivationMode activation_mode = StringToActivationMode(act); + cudnnActivationMode_t mode; + switch (activation_mode) { +#if CUDNN_VERSION >= 7100 + case ActivationMode::kNone: + mode = CUDNN_ACTIVATION_IDENTITY; + break; +#endif + case ActivationMode::kRelu6: + relu_ceiling = 6.0; + mode = CUDNN_ACTIVATION_CLIPPED_RELU; + break; + case ActivationMode::kReluX: + relu_ceiling = value_max; + mode = CUDNN_ACTIVATION_CLIPPED_RELU; + break; + case ActivationMode::kRelu: + mode = CUDNN_ACTIVATION_RELU; + break; + case ActivationMode::kSigmoid: + mode = CUDNN_ACTIVATION_SIGMOID; + break; + case ActivationMode::kTanh: + mode = CUDNN_ACTIVATION_TANH; + break; + default: + PADDLE_THROW("unrecognized activation mode: %d .", + static_cast(activation_mode)); + } + CUDNN_ENFORCE(dynload::cudnnSetActivationDescriptor( + desc_, mode, CUDNN_NOT_PROPAGATE_NAN, relu_ceiling)); + return desc_; + } + + private: + cudnnActivationDescriptor_t desc_; + DISABLE_COPY_AND_ASSIGN(ScopedActivationDescriptor); +}; + inline bool CanCUDNNBeUsed(const framework::ExecutionContext& ctx) { bool use_cudnn = ctx.Attr("use_cudnn"); use_cudnn &= paddle::platform::is_gpu_place(ctx.GetPlace()); @@ -353,5 +463,28 @@ inline bool CanCUDNNBeUsed(const framework::ExecutionContext& ctx) { return use_cudnn; } +#if CUDNN_VERSION >= 7001 +class ScopedCTCLossDescriptor { + public: + ScopedCTCLossDescriptor() { + PADDLE_ENFORCE(dynload::cudnnCreateCTCLossDescriptor(&desc_)); + } + ~ScopedCTCLossDescriptor() { + PADDLE_ENFORCE(dynload::cudnnDestroyCTCLossDescriptor(desc_)); + } + + template + inline cudnnCTCLossDescriptor_t descriptor() { + PADDLE_ENFORCE( + dynload::cudnnSetCTCLossDescriptor(desc_, CudnnDataType::type)); + return desc_; + } + + private: + cudnnCTCLossDescriptor_t desc_; + DISABLE_COPY_AND_ASSIGN(ScopedCTCLossDescriptor); +}; +#endif + } // namespace platform } // namespace paddle diff --git a/paddle/fluid/platform/device_context.cc b/paddle/fluid/platform/device_context.cc index 4286242b2a93d7046e7349a99d1d1a09dca09113..d0a108f905f46135bcd2b68be19ab396ab897272 100644 --- a/paddle/fluid/platform/device_context.cc +++ b/paddle/fluid/platform/device_context.cc @@ -9,7 +9,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/platform/device_context.h" - #include #include #include @@ -18,6 +17,7 @@ limitations under the License. */ #include "paddle/fluid/memory/memory.h" #ifdef PADDLE_WITH_CUDA #include "paddle/fluid/framework/rw_lock.h" +#include "paddle/fluid/platform/cuda_device_guard.h" #endif namespace paddle { @@ -32,13 +32,25 @@ platform::DeviceContext* DeviceContextPool::Get(const platform::Place& place) { "'Place' is not supported, Please re-compile with WITH_GPU " "option"); } - return it->second.get(); + return it->second.get().get(); +} + +template +inline void EmplaceDeviceContext( + std::map>>* + map_ptr, + platform::Place p) { + using PtrType = std::unique_ptr; + map_ptr->emplace(p, std::async(std::launch::deferred, [=] { + // lazy evaluation. 
i.e., only create device context at + // first `Get` + return PtrType(new DevCtx(boost::get<PlaceType>(p))); + })); } DeviceContextPool::DeviceContextPool( const std::vector<platform::Place>& places) { PADDLE_ENFORCE_GT(places.size(), 0); - using PtrType = std::unique_ptr<DeviceContext>; std::set<Place> set; for (auto& p : places) { set.insert(p); } @@ -47,16 +59,13 @@ DeviceContextPool::DeviceContextPool( for (auto& p : set) { if (platform::is_cpu_place(p)) { #ifdef PADDLE_WITH_MKLDNN - device_contexts_.emplace( - p, PtrType(new MKLDNNDeviceContext(boost::get<CPUPlace>(p)))); + EmplaceDeviceContext<MKLDNNDeviceContext, CPUPlace>(&device_contexts_, p); #else - device_contexts_.emplace( - p, PtrType(new CPUDeviceContext(boost::get<CPUPlace>(p)))); + EmplaceDeviceContext<CPUDeviceContext, CPUPlace>(&device_contexts_, p); #endif } else if (platform::is_gpu_place(p)) { #ifdef PADDLE_WITH_CUDA - device_contexts_.emplace( - p, PtrType(new CUDADeviceContext(boost::get<CUDAPlace>(p)))); + EmplaceDeviceContext<CUDADeviceContext, CUDAPlace>(&device_contexts_, p); #else PADDLE_THROW( "'CUDAPlace' is not supported, Please re-compile with WITH_GPU " @@ -64,9 +73,8 @@ DeviceContextPool::DeviceContextPool( #endif } else if (platform::is_cuda_pinned_place(p)) { #ifdef PADDLE_WITH_CUDA - device_contexts_.emplace( - p, - PtrType(new CUDAPinnedDeviceContext(boost::get<CUDAPinnedPlace>(p)))); + EmplaceDeviceContext<CUDAPinnedDeviceContext, CUDAPinnedPlace>( + &device_contexts_, p); #else PADDLE_THROW( "'CUDAPlace' is not supported, Please re-compile with WITH_GPU " @@ -112,11 +120,15 @@ class EigenCudaStreamDevice : public Eigen::StreamInterface { } void* allocate(size_t num_bytes) const override { - return paddle::memory::Alloc(place_, num_bytes); + auto buf = paddle::memory::Alloc(place_, num_bytes, + memory::Allocator::kScratchpad); + void* retv = buf->ptr(); + allocations_[buf->ptr()] = std::move(buf); + return retv; } void deallocate(void* buffer) const override { - paddle::memory::Free(place_, buffer); + allocations_.erase(allocations_.find(buffer)); } void* scratchpad() const override { @@ -143,61 +155,35 @@ class EigenCudaStreamDevice : public Eigen::StreamInterface { const cudaDeviceProp* device_prop_; // not owned; mutable void* scratch_; mutable unsigned int* semaphore_; + mutable std::unordered_map<void*, memory::AllocationPtr> allocations_; }; -class CudnnHolder { - public: - CudnnHolder(const cudaStream_t* stream, const CUDAPlace& place) - : workspace_(nullptr), workspace_len_(0), stream_(stream), place_(place) { - PADDLE_ENFORCE(dynload::cudnnCreate(&cudnn_handle_)); - PADDLE_ENFORCE(dynload::cudnnSetStream(cudnn_handle_, *stream_)); - } - - cudnnHandle_t cudnn_handle() const { return cudnn_handle_; } +CudnnHolder::CudnnHolder(const cudaStream_t* stream, const CUDAPlace& place) + : workspace_(nullptr), stream_(stream), place_(place) { + PADDLE_ENFORCE(dynload::cudnnCreate(&cudnn_handle_)); + PADDLE_ENFORCE(dynload::cudnnSetStream(cudnn_handle_, *stream_)); +} - void RunFunc(const std::function<void(void*)>& cudnn_func, - size_t required_workspace_len) { - std::lock_guard<std::mutex> lock(mtx_); - if (required_workspace_len > workspace_len_) { - ReallocateWorkspace(required_workspace_len); - } - cudnn_func(workspace_); - } +CudnnHolder::~CudnnHolder() { + PADDLE_ENFORCE(dynload::cudnnDestroy(cudnn_handle_)); +} - ~CudnnHolder() { - PADDLE_ENFORCE(dynload::cudnnDestroy(cudnn_handle_)); - if (workspace_ != nullptr) { - paddle::memory::Free(place_, workspace_); - } +void CudnnHolder::ReallocateWorkspace(size_t required_workspace_len) { + if (required_workspace_len <= WorkspaceSize()) { + return; } - - private: - void ReallocateWorkspace(size_t required_workspace_len) { - if (required_workspace_len <= workspace_len_) { - return; - } - if (workspace_ != nullptr) { - // Maybe
someone is using the current workspace - PADDLE_ENFORCE(cudaStreamSynchronize(*stream_)); - paddle::memory::Free(place_, workspace_); - } - workspace_ = paddle::memory::Alloc(place_, required_workspace_len); - workspace_len_ = required_workspace_len; + if (workspace_ != nullptr) { + // Maybe someone is using the current workspace + PADDLE_ENFORCE(cudaStreamSynchronize(*stream_)); + workspace_.reset(); } - - cudnnHandle_t cudnn_handle_; - void* workspace_; - size_t workspace_len_; - - const cudaStream_t* stream_; // not owned; - const CUDAPlace place_; - - std::mutex mtx_; -}; + workspace_ = paddle::memory::Alloc(place_, required_workspace_len, + paddle::memory::Allocator::kScratchpad); +} CUDADeviceContext::CUDADeviceContext(CUDAPlace place) : place_(place), cudnn_holder_(nullptr) { - SetDeviceId(place_.device); + CUDADeviceGuard guard(place_.device); compute_capability_ = GetCUDAComputeCapability(place_.device); multi_process_ = GetCUDAMultiProcessors(place_.device); max_threads_per_mp_ = GetCUDAMaxThreadsPerMultiProcessor(place_.device); @@ -214,13 +200,16 @@ CUDADeviceContext::CUDADeviceContext(CUDAPlace place) driver_version_ = GetCUDADriverVersion(place_.device); runtime_version_ = GetCUDARuntimeVersion(place_.device); - LOG(INFO) << "device: " << place_.device - << ", CUDA Capability: " << compute_capability_ - << ", Driver Version: " << driver_version_ / 1000 << "." - << (driver_version_ % 100) / 10 - << ", Runtime Version: " << runtime_version_ / 1000 << "." - << (runtime_version_ % 100) / 10; - + LOG_FIRST_N(WARNING, 1) << "Please NOTE: device: " << place_.device + << ", CUDA Capability: " << compute_capability_ + << ", Driver Version: " << driver_version_ / 1000 + << "." << (driver_version_ % 100) / 10 + << ", Runtime Version: " << runtime_version_ / 1000 + << "." << (runtime_version_ % 100) / 10; + size_t cudnn_dso_ver = dynload::cudnnGetVersion(); + LOG_FIRST_N(WARNING, 1) << "device: " << place_.device + << ", cuDNN Version: " << cudnn_dso_ver / 1000 << "." + << (cudnn_dso_ver % 100) / 10 << "."; callback_manager_.reset(new StreamCallbackManager(stream_)); } @@ -261,9 +250,8 @@ cudnnHandle_t CUDADeviceContext::cudnn_handle() const { return cudnn_holder_->cudnn_handle(); } -void CUDADeviceContext::RunCudnnFuncWithWorkspace( - const std::function& cudnn_func, size_t workspace_len) const { - cudnn_holder_->RunFunc(cudnn_func, workspace_len); +CudnnWorkspaceHandle CUDADeviceContext::cudnn_workspace_handle() const { + return CudnnWorkspaceHandle(cudnn_holder_.get()); } cudaStream_t CUDADeviceContext::stream() const { return stream_; } @@ -286,38 +274,73 @@ Place CUDAPinnedDeviceContext::GetPlace() const { return place_; } #ifdef PADDLE_WITH_MKLDNN MKLDNNDeviceContext::MKLDNNDeviceContext(CPUPlace place) - : CPUDeviceContext(place), engine_(mkldnn::engine::cpu, 0), p_blobs_() { - p_blobs_.reset(new std::unordered_map>()); + : CPUDeviceContext(place), engine_(mkldnn::engine::cpu, 0), p_blobmap_() { + p_blobmap_.reset(new BlobMap()); + p_mutex_.reset(new std::mutex()); } +namespace { +// Current thread's id. 
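// An illustrative aside (not part of this diff): each worker thread is
// expected to tag itself once, so that the SetBlob/GetBlob calls below
// resolve to that thread's own KeyBlob. The `dev_ctx` object, tid value,
// and blob name here are assumptions:
//   platform::set_cur_thread_id(1);          // once, at worker-thread startup
//   dev_ctx.SetBlob("conv_pd", pd);          // stored under tid == 1
//   auto blob = dev_ctx.GetBlob("conv_pd");  // visible only to this thread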
+thread_local int cur_thread_id = 0; +} + +void set_cur_thread_id(int tid) { cur_thread_id = tid; } +int get_cur_thread_id(void) { return cur_thread_id; } + void MKLDNNDeviceContext::SetBlob(const std::string& name, std::shared_ptr<void> data) const { - std::unordered_map<std::string, std::shared_ptr<void>>* p; - p = p_blobs_.get(); + BlobMap* pMap = p_blobmap_.get(); + std::shared_ptr<KeyBlob> pBlob = nullptr; + + int tid = platform::get_cur_thread_id(); - auto it = p->find(name); + std::lock_guard<std::mutex> lock(*p_mutex_.get()); - if (it == p->end()) { - (*p)[name] = data; // create new blob + // Find KeyBlob for current thread + auto map_it = pMap->find(tid); + + if (map_it == pMap->end()) { + // 1st time to set blob in current thread + pBlob = std::shared_ptr<KeyBlob>(new KeyBlob()); + (*pMap)[tid] = pBlob; + } else { + pBlob = map_it->second; + } + + // Find Key in found (or newly created) KeyBlob + auto key_it = pBlob->find(name); + + if (key_it == pBlob->end()) { + (*pBlob)[name] = data; // create new blob } else { - it->second = data; // set data to existing blob + key_it->second = data; // set data to existing blob } + // lock will be automatically released when out of scope return; } std::shared_ptr<void> MKLDNNDeviceContext::GetBlob( const std::string& name) const { - std::unordered_map<std::string, std::shared_ptr<void>>* p; - p = p_blobs_.get(); + BlobMap* pMap = p_blobmap_.get(); + std::shared_ptr<KeyBlob> pBlob = nullptr; - auto it = p->find(name); + int tid = platform::get_cur_thread_id(); - if (it != p->end()) { - return it->second; - } + std::lock_guard<std::mutex> lock(*p_mutex_.get()); + + // Find KeyBlob for current thread first + auto map_it = pMap->find(tid); + if (map_it == pMap->end()) return nullptr; + pBlob = map_it->second; + + // Find Blob via name + auto key_it = pBlob->find(name); + + if (key_it == pBlob->end()) return nullptr; - return nullptr; + // lock will be automatically released when out of scope + return key_it->second; } #endif diff --git a/paddle/fluid/platform/device_context.h b/paddle/fluid/platform/device_context.h index e1ff1a1746952de5aa4bead361b50af4e99bc9bc..3edd727978010e20203ab994562ce922b6ee0bad 100644 --- a/paddle/fluid/platform/device_context.h +++ b/paddle/fluid/platform/device_context.h @@ -10,12 +10,13 @@ See the License for the specific language governing permissions and limitations under the License.
*/ #pragma once +#include // NOLINT #include #include // NOLINT #include #include #include - +#include "paddle/fluid/memory/malloc.h" #ifdef PADDLE_WITH_CUDA #include "paddle/fluid/platform/dynload/cublas.h" #include "paddle/fluid/platform/dynload/cudnn.h" @@ -72,7 +73,108 @@ struct DefaultDeviceContextType { #ifdef PADDLE_WITH_CUDA class EigenCudaStreamDevice; -class CudnnHolder; +class CudnnHolder { + public: + CudnnHolder(const cudaStream_t* stream, const CUDAPlace& place); + ~CudnnHolder(); + cudnnHandle_t cudnn_handle() const { return cudnn_handle_; } + + private: + friend class CudnnWorkspaceHandle; + void ReallocateWorkspace(size_t required_workspace_len); + + template <typename Callback> + void RunFuncImpl(Callback&& cudnn_func, size_t required_workspace_len) { + if (required_workspace_len > WorkspaceSize()) { + ReallocateWorkspace(required_workspace_len); + } + cudnn_func(WorkspacePtr()); + } + + inline void* WorkspacePtr() { + if (workspace_) { + return workspace_->ptr(); + } else { + return nullptr; + } + } + + inline size_t WorkspaceSize() { + if (workspace_) { + return workspace_->size(); + } else { + return 0; + } + } + + std::mutex& Mutex() { return mtx_; } + + cudnnHandle_t cudnn_handle_; + memory::AllocationPtr workspace_; + + const cudaStream_t* stream_; // not owned; + const CUDAPlace place_; + + std::mutex mtx_; +}; + +class CudnnWorkspaceHandle { + public: + /*! \brief The lock is not acquired when the constructor is called. + * It is acquired the first time RunFunc() is called. */ + inline explicit CudnnWorkspaceHandle(CudnnHolder* holder) : holder_(holder) {} + + /*! \brief A thread that calls RunFunc() acquires the lock first + * before invoking cudnn functions. */ + template <typename Callback> + inline void RunFunc(Callback&& cudnn_func, size_t required_workspace_len) { + if (!guard_) { + guard_.reset(new std::lock_guard<std::mutex>(holder_->Mutex())); + } + holder_->RunFuncImpl(std::forward<Callback>(cudnn_func), + required_workspace_len); + } + + CudnnWorkspaceHandle(CudnnWorkspaceHandle&&) = default; + CudnnWorkspaceHandle& operator=(CudnnWorkspaceHandle&&) = delete; + + private: + CudnnHolder* holder_; // not owned + std::unique_ptr<std::lock_guard<std::mutex>> guard_; +}; + +#if CUDA_VERSION >= 9000 +class ScopedCublasMathMode { + public: + ScopedCublasMathMode(cublasHandle_t handle, cublasMath_t new_math_mode) + : handle_(handle) { + need_reset = false; + PADDLE_ENFORCE( + platform::dynload::cublasGetMathMode(handle_, &old_math_mode_), + "Failed to get old cublas math mode"); + if (old_math_mode_ != new_math_mode) { + PADDLE_ENFORCE( + platform::dynload::cublasSetMathMode(handle_, new_math_mode), + "Failed to set new cublas math mode"); + need_reset = true; + } + } + + ~ScopedCublasMathMode() { + if (need_reset) { + PADDLE_ENFORCE( + platform::dynload::cublasSetMathMode(handle_, old_math_mode_), + "Failed to restore old cublas math mode"); + } + } + + private: + cublasHandle_t handle_; + cublasMath_t old_math_mode_; + bool need_reset; +}; + +#endif class CUDADeviceContext : public DeviceContext { public: @@ -100,10 +202,14 @@ class CUDADeviceContext : public DeviceContext { /*! \brief Return cudnn handle in the device context. */ cudnnHandle_t cudnn_handle() const; - /*! \brief Run a cudnn function with the workspace provided by - * CUDADeviceContext */ - void RunCudnnFuncWithWorkspace(const std::function<void(void*)>& cudnn_func, - size_t workspace_len) const; + /*! \brief Return a cudnn workspace handle to call multiple cudnn + * functions without being interrupted by other threads. + * Once the first cudnn function is called through the handle, a lock + * is acquired to prevent other threads from accessing the + * workspace. Once the handle is destructed, the lock is released. + * CudnnWorkspaceHandle is an RAII object that implements thread-safe + * sequential cudnn function calls. */ + CudnnWorkspaceHandle cudnn_workspace_handle() const; /*! \brief Return cuda stream in the device context. */ cudaStream_t stream() const; @@ -126,6 +232,18 @@ class CUDADeviceContext : public DeviceContext { callback_manager_->Wait(); } +#if CUDA_VERSION >= 9000 + /*! \brief CublasCall may need to change cublas's config, + * but the cublas handle may be held by multiple threads, so we + * add a lock here. */ + template <typename Callback> + void CublasCall(Callback callback, cublasMath_t new_math) { + std::lock_guard<std::mutex> guard(cublas_mtx_); + ScopedCublasMathMode scoped_cublas_math(cublas_handle_, new_math); + callback(); + } +#endif + private: CUDAPlace place_; @@ -147,6 +265,8 @@ class CUDADeviceContext : public DeviceContext { // If we use mtx_ for StreamCallbackManager, deadlock may occur sometimes mutable std::mutex callback_mtx_; std::unique_ptr<StreamCallbackManager> callback_manager_; + + mutable std::mutex cublas_mtx_; }; template <> @@ -176,6 +296,12 @@ struct DefaultDeviceContextType { #endif #ifdef PADDLE_WITH_MKLDNN +using KeyBlob = std::unordered_map<std::string, std::shared_ptr<void>>; +using BlobMap = std::unordered_map<int, std::shared_ptr<KeyBlob>>; + +void set_cur_thread_id(int); +int get_cur_thread_id(void); + class MKLDNNDeviceContext : public CPUDeviceContext { public: explicit MKLDNNDeviceContext(CPUPlace place); @@ -191,8 +317,8 @@ class MKLDNNDeviceContext : public CPUDeviceContext { private: mkldnn::engine engine_; - std::shared_ptr<std::unordered_map<std::string, std::shared_ptr<void>>> - p_blobs_; + std::shared_ptr<BlobMap> p_blobmap_; + std::shared_ptr<std::mutex> p_mutex_; }; #endif @@ -228,7 +354,8 @@ class DeviceContextPool { private: static DeviceContextPool* pool; - std::map<Place, std::unique_ptr<DeviceContext>> device_contexts_; + std::map<Place, std::shared_future<std::unique_ptr<DeviceContext>>> + device_contexts_; DISABLE_COPY_AND_ASSIGN(DeviceContextPool); }; diff --git a/paddle/fluid/platform/device_tracer.cc b/paddle/fluid/platform/device_tracer.cc index dc1d751141187edb7738e42c41514614d4d399b0..ea4564058d602a9abe43bd063f1ed73f88a2de08 100644 --- a/paddle/fluid/platform/device_tracer.cc +++ b/paddle/fluid/platform/device_tracer.cc @@ -203,7 +203,7 @@ class DeviceTracerImpl : public DeviceTracer { void AddCPURecords(const std::string &anno, uint64_t start_ns, uint64_t end_ns, int64_t device_id, int64_t thread_id) { if (anno.empty()) { - VLOG(1) << "Empty timeline annotation."; + VLOG(10) << "Empty timeline annotation."; return; } std::lock_guard<std::mutex> l(trace_mu_); @@ -216,7 +216,7 @@ class DeviceTracerImpl : public DeviceTracer { uint32_t correlation_id, uint64_t bytes) { // 0 means timestamp information could not be collected for the kernel. if (start_ns == 0 || end_ns == 0) { - VLOG(3) << name << " cannot be traced"; + VLOG(30) << name << " cannot be traced"; return; } std::lock_guard<std::mutex> l(trace_mu_); @@ -228,7 +228,7 @@ class DeviceTracerImpl : public DeviceTracer { int64_t stream_id, uint32_t correlation_id) { // 0 means timestamp information could not be collected for the kernel.
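// (Aside: a minimal usage sketch for the CudnnWorkspaceHandle introduced in
// device_context.h above; `dev_ctx`, `ws_size`, and the cudnn call arguments
// are illustrative assumptions, not code from this diff.)
//   auto workspace_handle = dev_ctx.cudnn_workspace_handle();
//   workspace_handle.RunFunc(
//       [&](void* ws) {  // the first RunFunc call acquires the holder's lock
//         CUDNN_ENFORCE(platform::dynload::cudnnConvolutionForward(
//             /* ... */, ws, ws_size, /* ... */));
//       },
//       ws_size);  // the lock is released when workspace_handle is destructed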
if (start == 0 || end == 0) { - VLOG(3) << correlation_id << " cannot be traced"; + VLOG(30) << correlation_id << " cannot be traced"; return; } std::lock_guard l(trace_mu_); @@ -347,7 +347,7 @@ class DeviceTracerImpl : public DeviceTracer { tracer->AddAnnotation(cbInfo->correlationId, anno); } } else { - VLOG(1) << "Unhandled API Callback for " << domain << " " << cbid; + VLOG(10) << "Unhandled API Callback for " << domain << " " << cbid; } } CUpti_SubscriberHandle subscriber_; diff --git a/paddle/fluid/platform/device_tracer.h b/paddle/fluid/platform/device_tracer.h index f59fc40b71699a790978e22fd7e26da8d4d94c5f..eaf047d4744762f69d50bff8d467da8e3b8317cc 100644 --- a/paddle/fluid/platform/device_tracer.h +++ b/paddle/fluid/platform/device_tracer.h @@ -13,17 +13,11 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once -#if !defined(_WIN32) -#include -#else -#include -#endif // !_WIN32 - -#include #include // NOLINT #include #include "paddle/fluid/platform/dynload/cupti.h" +#include "paddle/fluid/platform/port.h" #include "paddle/fluid/platform/profiler.pb.h" namespace paddle { @@ -32,15 +26,11 @@ namespace platform { /////////////////////// // WARN: Under Development. Don't depend on it yet. ////////////////////// -#if !defined(_WIN32) inline uint64_t PosixInNsec() { struct timeval tv; gettimeofday(&tv, nullptr); return 1000 * (static_cast(tv.tv_sec) * 1000000 + tv.tv_usec); } -#else -inline uint64_t PosixInNsec() { return static_cast(0); } -#endif // !_WIN32 // DeviceTracer performs the following tasks: // 1. Register cuda callbacks for various events: kernel, memcpy, etc. diff --git a/paddle/fluid/platform/dynload/cublas.cc b/paddle/fluid/platform/dynload/cublas.cc index 361d3439b844e9f68d3fba0a0e41ec457118a4a9..41648c32fe6f98bb0b78ea7891065e5586f70463 100644 --- a/paddle/fluid/platform/dynload/cublas.cc +++ b/paddle/fluid/platform/dynload/cublas.cc @@ -32,6 +32,9 @@ CUBLAS_BLAS_ROUTINE_EACH_R2(DEFINE_WRAP); CUBLAS_BLAS_ROUTINE_EACH_R3(DEFINE_WRAP); #endif +#ifdef CUBLAS_BLAS_ROUTINE_EACH_R4 +CUBLAS_BLAS_ROUTINE_EACH_R4(DEFINE_WRAP); +#endif } // namespace dynload } // namespace platform } // namespace paddle diff --git a/paddle/fluid/platform/dynload/cublas.h b/paddle/fluid/platform/dynload/cublas.h index 4ea0cd7283b55649dbdbbf97f81f10c69ac6a1d2..ced789b90d067218c3b01d124cfd2c93dc94e528 100644 --- a/paddle/fluid/platform/dynload/cublas.h +++ b/paddle/fluid/platform/dynload/cublas.h @@ -61,9 +61,6 @@ extern void *cublas_dso_handle; extern DynLoad__##__name __name #endif -#define DECLARE_DYNAMIC_LOAD_CUBLAS_V2_WRAP(__name) \ - DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(__name) - #define CUBLAS_BLAS_ROUTINE_EACH(__macro) \ __macro(cublasSaxpy_v2); \ __macro(cublasDaxpy_v2); \ @@ -106,11 +103,22 @@ CUBLAS_BLAS_ROUTINE_EACH_R2(DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP) // APIs available after CUDA 9.0 #if CUDA_VERSION >= 9000 -#define CUBLAS_BLAS_ROUTINE_EACH_R3(__macro) __macro(cublasSetMathMode); +#define CUBLAS_BLAS_ROUTINE_EACH_R3(__macro) \ + __macro(cublasSetMathMode); \ + __macro(cublasGetMathMode); CUBLAS_BLAS_ROUTINE_EACH_R3(DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP) #endif +// APIs available after CUDA 9.1 +#if CUDA_VERSION >= 9010 +#define CUBLAS_BLAS_ROUTINE_EACH_R4(__macro) \ + __macro(cublasGemmBatchedEx); \ + __macro(cublasGemmStridedBatchedEx); + +CUBLAS_BLAS_ROUTINE_EACH_R4(DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP) +#endif + #undef DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP } // namespace dynload } // namespace platform diff --git 
a/paddle/fluid/platform/dynload/cudnn.h b/paddle/fluid/platform/dynload/cudnn.h index e6353f67ef118072a2d8e49111e8ecc486589998..db62377898339def415a13d185f85f34de326d7f 100644 --- a/paddle/fluid/platform/dynload/cudnn.h +++ b/paddle/fluid/platform/dynload/cudnn.h @@ -13,8 +13,6 @@ See the License for the specific language governing permissions and limitations under the License. */ #pragma once -#define GLOG_NO_ABBREVIATED_SEVERITIES -#define GOOGLE_GLOG_DLL_DECL #include #include @@ -50,13 +48,13 @@ extern void EnforceCUDNNLoaded(const char* fn_name); #else -#define DECLARE_DYNAMIC_LOAD_CUDNN_WRAP(__name) \ - struct DynLoad__##__name { \ - template \ - inline cudnnStatus_t operator()(Args... args) { \ - return ::__name(args...); \ - } \ - }; \ +#define DECLARE_DYNAMIC_LOAD_CUDNN_WRAP(__name) \ + struct DynLoad__##__name { \ + template \ + inline auto operator()(Args... args) { \ + return ::__name(args...); \ + } \ + }; \ extern DynLoad__##__name __name #endif @@ -65,44 +63,54 @@ extern void EnforceCUDNNLoaded(const char* fn_name); * include all needed cudnn functions in HPPL * different cudnn version has different interfaces **/ -#define CUDNN_DNN_ROUTINE_EACH(__macro) \ - __macro(cudnnSetTensor4dDescriptor); \ - __macro(cudnnSetTensor4dDescriptorEx); \ - __macro(cudnnSetTensorNdDescriptor); \ - __macro(cudnnGetTensorNdDescriptor); \ - __macro(cudnnGetConvolutionNdForwardOutputDim); \ - __macro(cudnnGetConvolutionForwardAlgorithm); \ - __macro(cudnnCreateTensorDescriptor); \ - __macro(cudnnDestroyTensorDescriptor); \ - __macro(cudnnCreateFilterDescriptor); \ - __macro(cudnnSetFilter4dDescriptor); \ - __macro(cudnnSetFilterNdDescriptor); \ - __macro(cudnnGetFilterNdDescriptor); \ - __macro(cudnnSetPooling2dDescriptor); \ - __macro(cudnnSetPoolingNdDescriptor); \ - __macro(cudnnGetPoolingNdDescriptor); \ - __macro(cudnnDestroyFilterDescriptor); \ - __macro(cudnnCreateConvolutionDescriptor); \ - __macro(cudnnCreatePoolingDescriptor); \ - __macro(cudnnDestroyPoolingDescriptor); \ - __macro(cudnnSetConvolution2dDescriptor); \ - __macro(cudnnDestroyConvolutionDescriptor); \ - __macro(cudnnSetConvolutionNdDescriptor); \ - __macro(cudnnGetConvolutionNdDescriptor); \ - __macro(cudnnDeriveBNTensorDescriptor); \ - __macro(cudnnCreate); \ - __macro(cudnnDestroy); \ - __macro(cudnnSetStream); \ - __macro(cudnnActivationForward); \ - __macro(cudnnConvolutionForward); \ - __macro(cudnnConvolutionBackwardBias); \ - __macro(cudnnGetConvolutionForwardWorkspaceSize); \ - __macro(cudnnTransformTensor); \ - __macro(cudnnPoolingForward); \ - __macro(cudnnPoolingBackward); \ - __macro(cudnnSoftmaxBackward); \ - __macro(cudnnSoftmaxForward); \ - __macro(cudnnGetVersion); \ +#define CUDNN_DNN_ROUTINE_EACH(__macro) \ + __macro(cudnnSetTensor4dDescriptor); \ + __macro(cudnnSetTensor4dDescriptorEx); \ + __macro(cudnnSetTensorNdDescriptor); \ + __macro(cudnnGetTensorNdDescriptor); \ + __macro(cudnnGetConvolutionNdForwardOutputDim); \ + __macro(cudnnGetConvolutionForwardAlgorithm); \ + __macro(cudnnCreateTensorDescriptor); \ + __macro(cudnnDestroyTensorDescriptor); \ + __macro(cudnnCreateFilterDescriptor); \ + __macro(cudnnSetFilter4dDescriptor); \ + __macro(cudnnSetFilterNdDescriptor); \ + __macro(cudnnGetFilterNdDescriptor); \ + __macro(cudnnSetPooling2dDescriptor); \ + __macro(cudnnSetPoolingNdDescriptor); \ + __macro(cudnnGetPoolingNdDescriptor); \ + __macro(cudnnDestroyFilterDescriptor); \ + __macro(cudnnCreateConvolutionDescriptor); \ + __macro(cudnnCreatePoolingDescriptor); \ + 
__macro(cudnnDestroyPoolingDescriptor); \ + __macro(cudnnSetConvolution2dDescriptor); \ + __macro(cudnnDestroyConvolutionDescriptor); \ + __macro(cudnnSetConvolutionNdDescriptor); \ + __macro(cudnnGetConvolutionNdDescriptor); \ + __macro(cudnnDeriveBNTensorDescriptor); \ + __macro(cudnnCreateSpatialTransformerDescriptor); \ + __macro(cudnnSetSpatialTransformerNdDescriptor); \ + __macro(cudnnDestroySpatialTransformerDescriptor); \ + __macro(cudnnSpatialTfGridGeneratorForward); \ + __macro(cudnnSpatialTfGridGeneratorBackward); \ + __macro(cudnnSpatialTfSamplerForward); \ + __macro(cudnnSpatialTfSamplerBackward); \ + __macro(cudnnCreate); \ + __macro(cudnnDestroy); \ + __macro(cudnnSetStream); \ + __macro(cudnnActivationForward); \ + __macro(cudnnConvolutionForward); \ + __macro(cudnnConvolutionBackwardBias); \ + __macro(cudnnGetConvolutionForwardWorkspaceSize); \ + __macro(cudnnTransformTensor); \ + __macro(cudnnPoolingForward); \ + __macro(cudnnPoolingBackward); \ + __macro(cudnnSoftmaxBackward); \ + __macro(cudnnSoftmaxForward); \ + __macro(cudnnGetVersion); \ + __macro(cudnnFindConvolutionForwardAlgorithmEx); \ + __macro(cudnnFindConvolutionBackwardFilterAlgorithmEx); \ + __macro(cudnnFindConvolutionBackwardDataAlgorithmEx); \ __macro(cudnnGetErrorString); CUDNN_DNN_ROUTINE_EACH(DECLARE_DYNAMIC_LOAD_CUDNN_WRAP) @@ -142,9 +150,16 @@ CUDNN_DNN_ROUTINE_EACH_R5(DECLARE_DYNAMIC_LOAD_CUDNN_WRAP) #endif #if CUDNN_VERSION >= 7001 -#define CUDNN_DNN_ROUTINE_EACH_R7(__macro) \ - __macro(cudnnSetConvolutionGroupCount); \ - __macro(cudnnSetConvolutionMathType); +#define CUDNN_DNN_ROUTINE_EACH_R7(__macro) \ + __macro(cudnnSetConvolutionGroupCount); \ + __macro(cudnnSetConvolutionMathType); \ + __macro(cudnnConvolutionBiasActivationForward); \ + __macro(cudnnCreateCTCLossDescriptor); \ + __macro(cudnnDestroyCTCLossDescriptor); \ + __macro(cudnnGetCTCLossDescriptor); \ + __macro(cudnnSetCTCLossDescriptor); \ + __macro(cudnnGetCTCLossWorkspaceSize); \ + __macro(cudnnCTCLoss); CUDNN_DNN_ROUTINE_EACH_R7(DECLARE_DYNAMIC_LOAD_CUDNN_WRAP) #endif diff --git a/paddle/fluid/platform/dynload/dynamic_loader.cc b/paddle/fluid/platform/dynload/dynamic_loader.cc index cc5cda6106c188f3156d33480b5d3641eed32556..d53907b749805d9c16737da3105d6c66cacb12fb 100644 --- a/paddle/fluid/platform/dynload/dynamic_loader.cc +++ b/paddle/fluid/platform/dynload/dynamic_loader.cc @@ -72,8 +72,8 @@ static inline std::string join(const std::string& part1, static inline void* GetDsoHandleFromDefaultPath(const std::string& dso_path, int dynload_flags) { - VLOG(3) << "Try to find library: " << dso_path - << " from default system path."; + VLOG(30) << "Try to find library: " << dso_path + << " from default system path."; // default search from LD_LIBRARY_PATH/DYLD_LIBRARY_PATH // and /usr/local/lib path void* dso_handle = dlopen(dso_path.c_str(), dynload_flags); diff --git a/paddle/fluid/platform/dynload/mklml.h b/paddle/fluid/platform/dynload/mklml.h index aa20553ceffceded09447693c6e92f55fb48702d..9273e9b1e72f0ad7abd6c20d4a34283fbe24378a 100644 --- a/paddle/fluid/platform/dynload/mklml.h +++ b/paddle/fluid/platform/dynload/mklml.h @@ -76,6 +76,10 @@ extern void* mklml_dso_handle; __macro(vdMul); \ __macro(vsExp); \ __macro(vdExp); \ + __macro(vsSqr); \ + __macro(vdSqr); \ + __macro(vsPowx); \ + __macro(vdPowx); \ __macro(MKL_Set_Num_Threads) MKLML_ROUTINE_EACH(DECLARE_DYNAMIC_LOAD_MKLML_WRAP); diff --git a/paddle/fluid/platform/enforce.h b/paddle/fluid/platform/enforce.h index 
a251bfcd9914422cb6300adbbcdef3dfa79f441c..a85972bdb72ca3119cc14f9e2b810c3875443538 100644 --- a/paddle/fluid/platform/enforce.h +++ b/paddle/fluid/platform/enforce.h @@ -18,12 +18,6 @@ limitations under the License. */ #include // for __cxa_demangle #endif // __GNUC__ -#if defined(_WIN32) -#define NOMINMAX // msvc max/min macro conflict with std::min/max -#define GLOG_NO_ABBREVIATED_SEVERITIES // msvc conflict logging with windows.h -#define GOOGLE_GLOG_DLL_DECL -#endif - #ifdef PADDLE_WITH_CUDA #include #include @@ -127,14 +121,14 @@ struct EOFException : public std::exception { #define UNLIKELY(condition) __builtin_expect(static_cast(condition), 0) #else // there is no equivalent intrinsics in msvc. -#define UNLIKELY(condition) (condition == 0) +#define UNLIKELY(condition) (condition) #endif #if !defined(_WIN32) #define LIKELY(condition) __builtin_expect(static_cast(condition), 1) #else // there is no equivalent intrinsics in msvc. -#define LIKELY(condition) (condition != 0) +#define LIKELY(condition) (condition) #endif template @@ -248,7 +242,6 @@ inline void throw_on_error(T e) { throw_on_error(e, ""); } -#if !defined(_WIN32) #define PADDLE_THROW(...) \ do { \ throw ::paddle::platform::EnforceNotMet( \ @@ -272,17 +265,6 @@ inline void throw_on_error(T e) { #define PADDLE_ENFORCE(...) ::paddle::platform::throw_on_error(__VA_ARGS__); #endif // REPLACE_ENFORCE_GLOG -#else // !_WIN32 -// disable enforce, caused by the varardic macro exception error -#define PADDLE_THROW(x) \ - do { \ - throw std::make_exception_ptr( \ - std::runtime_error("Windows disable the enforce.")); \ - } while (false) - -#define PADDLE_ENFORCE(x, ...) x -#endif // !_WIN32 - #define PADDLE_THROW_EOF() \ do { \ throw ::paddle::platform::EOFException("There is no next data.", __FILE__, \ @@ -302,20 +284,6 @@ inline void throw_on_error(T e) { * extra messages is also supported, for example: * PADDLE_ENFORCE(a, b, "some simple enforce failed between %d numbers", 2) */ -#if !defined(_WIN32) -#define PADDLE_ENFORCE_EQ(__VAL0, __VAL1, ...) \ - __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, __VA_ARGS__) -#define PADDLE_ENFORCE_NE(__VAL0, __VAL1, ...) \ - __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, !=, ==, __VA_ARGS__) -#define PADDLE_ENFORCE_GT(__VAL0, __VAL1, ...) \ - __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, >, <=, __VA_ARGS__) -#define PADDLE_ENFORCE_GE(__VAL0, __VAL1, ...) \ - __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, >=, <, __VA_ARGS__) -#define PADDLE_ENFORCE_LT(__VAL0, __VAL1, ...) \ - __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, <, >=, __VA_ARGS__) -#define PADDLE_ENFORCE_LE(__VAL0, __VAL1, ...) \ - __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, <=, >, __VA_ARGS__) - #define PADDLE_ENFORCE_NOT_NULL(__VAL, ...) \ do { \ if (UNLIKELY(nullptr == (__VAL))) { \ @@ -335,27 +303,19 @@ inline void throw_on_error(T e) { paddle::string::Sprintf("" __VA_ARGS__)); \ } \ } while (0) -#else -#define PADDLE_ENFORCE_EQ(__VAL0, __VAL1, ...) ((__VAL0) == (__VAL1)) -#define PADDLE_ENFORCE_NE(__VAL0, __VAL1, ...) ((__VAL0) != (__VAL1)) -#define PADDLE_ENFORCE_GT(__VAL0, __VAL1, ...) ((__VAL0) > (__VAL1)) -#define PADDLE_ENFORCE_GE(__VAL0, __VAL1, ...) ((__VAL0) >= (__VAL1)) -#define PADDLE_ENFORCE_LT(__VAL0, __VAL1, ...) ((__VAL0) < (__VAL1)) -#define PADDLE_ENFORCE_LE(__VAL0, __VAL1, ...) ((__VAL0) <= (__VAL1)) - -#define __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, __CMP, __INV_CMP, ...) \ - do { \ - if (!((__VAL0)__CMP(__VAL1))) { \ - PADDLE_THROW("Windows disable the enforce. 
Enforce failed."); \ - } \ - } while (0) -#define PADDLE_ENFORCE_NOT_NULL(__VAL1, ...) \ - do { \ - if (nullptr == (__VAL1)) { \ - PADDLE_THROW("Windows disable the enforce. Enforce failed"); \ - } \ - } while (0) -#endif // !_WIN32 + +#define PADDLE_ENFORCE_EQ(__VAL0, __VAL1, ...) \ + __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, ==, !=, __VA_ARGS__) +#define PADDLE_ENFORCE_NE(__VAL0, __VAL1, ...) \ + __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, !=, ==, __VA_ARGS__) +#define PADDLE_ENFORCE_GT(__VAL0, __VAL1, ...) \ + __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, >, <=, __VA_ARGS__) +#define PADDLE_ENFORCE_GE(__VAL0, __VAL1, ...) \ + __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, >=, <, __VA_ARGS__) +#define PADDLE_ENFORCE_LT(__VAL0, __VAL1, ...) \ + __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, <, >=, __VA_ARGS__) +#define PADDLE_ENFORCE_LE(__VAL0, __VAL1, ...) \ + __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, <=, >, __VA_ARGS__) } // namespace platform } // namespace paddle diff --git a/paddle/fluid/platform/gpu_info.cc b/paddle/fluid/platform/gpu_info.cc index 8fff9844db738dbd6508569a8aaeed044e445e5f..20bd349b94bb211b62a0dee32bcad79759fb500f 100644 --- a/paddle/fluid/platform/gpu_info.cc +++ b/paddle/fluid/platform/gpu_info.cc @@ -19,13 +19,32 @@ limitations under the License. */ #include "gflags/gflags.h" #include "paddle/fluid/platform/enforce.h" -DEFINE_double(fraction_of_gpu_memory_to_use, 0.92, +#ifndef _WIN32 +const float fraction_of_gpu_memory_to_use = 0.92f; +#else +// fraction_of_gpu_memory_to_use cannot be too high on windows, +// since the win32 graphic sub-system can occupy some GPU memory +// which may lead to insufficient memory left for paddle +const float fraction_of_gpu_memory_to_use = 0.5f; +#endif + +DEFINE_double(fraction_of_gpu_memory_to_use, fraction_of_gpu_memory_to_use, "Allocate a trunk of gpu memory that is this fraction of the " "total gpu memory size. Future memory usage will be allocated " "from the trunk. If the trunk doesn't have enough gpu memory, " "additional trunks of the same size will be requested from gpu " "until the gpu has no memory left for another trunk."); +DEFINE_bool( + enable_cublas_tensor_op_math, false, + "The enable_cublas_tensor_op_math indicate whether to use Tensor Core, " + "but it may loss precision. Currently, There are two CUDA libraries that" + " use Tensor Cores, cuBLAS and cuDNN. cuBLAS uses Tensor Cores to speed up" + " GEMM computations(the matrices must be either half precision or single " + "precision); cuDNN uses Tensor Cores to speed up both convolutions(the " + "input and output must be half precision) and recurrent neural networks " + "(RNNs)."); + namespace paddle { namespace platform { @@ -64,6 +83,16 @@ int GetCUDADriverVersion(int id) { return driver_version; } +bool TensorCoreAvailable() { +#if CUDA_VERSION >= 9000 + int device = GetCurrentDeviceId(); + int driver_version = GetCUDAComputeCapability(device); + return driver_version >= 70; +#else + return false; +#endif +} + int GetCUDAMultiProcessors(int id) { PADDLE_ENFORCE_LT(id, GetCUDADeviceCount(), "id must less than GPU count"); int count; @@ -124,8 +153,8 @@ size_t GpuMaxChunkSize() { size_t available = 0; GpuMemoryUsage(&available, &total); - VLOG(10) << "GPU Usage " << available / 1024 / 1024 << "M/" - << total / 1024 / 1024 << "M"; + VLOG(100) << "GPU Usage " << available / 1024 / 1024 << "M/" + << total / 1024 / 1024 << "M"; size_t reserving = static_cast(0.05 * total); // If available less than minimum chunk size, no usable memory exists. 
available = diff --git a/paddle/fluid/platform/gpu_info.h b/paddle/fluid/platform/gpu_info.h index be44158431ff80a41f7fdf4dfd4d070667f2ac63..6a0b3c8e02d49068c2dbe14c7feea7e139947694 100644 --- a/paddle/fluid/platform/gpu_info.h +++ b/paddle/fluid/platform/gpu_info.h @@ -35,6 +35,9 @@ int GetCUDARuntimeVersion(int id); //! Get the driver version of the ith GPU int GetCUDADriverVersion(int id); +//! Wheter the current device support TensorCore +bool TensorCoreAvailable(); + //! Get the MultiProcessors of the ith GPU. int GetCUDAMultiProcessors(int i); diff --git a/paddle/fluid/platform/init.cc b/paddle/fluid/platform/init.cc index ab91ca5345047f3053eb8771e6a265d2a3011f85..0ccef6c6a8345e31cee3ef2422fe3f56c059c231 100644 --- a/paddle/fluid/platform/init.cc +++ b/paddle/fluid/platform/init.cc @@ -19,6 +19,9 @@ limitations under the License. */ #include "paddle/fluid/framework/operator.h" #include "paddle/fluid/platform/cpu_helper.h" #include "paddle/fluid/platform/cpu_info.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/platform/cuda_device_guard.h" +#endif #include "paddle/fluid/platform/device_context.h" #include "paddle/fluid/platform/init.h" #include "paddle/fluid/platform/place.h" @@ -35,6 +38,7 @@ std::once_flag p2p_init_flag; void InitGflags(std::vector argv) { std::call_once(gflags_init_flag, [&]() { + FLAGS_logtostderr = true; argv.insert(argv.begin(), "dummy"); int argc = argv.size(); char **arr = new char *[argv.size()]; @@ -45,7 +49,7 @@ void InitGflags(std::vector argv) { line += ' '; } google::ParseCommandLineFlags(&argc, &arr, true); - VLOG(1) << "Init commandline: " << line; + VLOG(10) << "Init commandline: " << line; }); } @@ -64,7 +68,7 @@ void InitP2P(std::vector devices) { LOG(WARNING) << "Cannot enable P2P access from " << devices[i] << " to " << devices[j]; } else { - cudaSetDevice(devices[i]); + platform::CUDADeviceGuard guard(devices[i]); cudaDeviceEnablePeerAccess(devices[j], 0); } } @@ -112,32 +116,65 @@ void InitDevices(bool init_p2p, const std::vector devices) { } places.emplace_back(platform::CPUPlace()); platform::DeviceContextPool::Init(places); + #ifndef PADDLE_WITH_MKLDNN platform::SetNumThreads(FLAGS_paddle_num_threads); #endif - if (platform::jit::MayIUse(platform::jit::avx512f)) { -#ifndef __AVX512F__ - LOG(WARNING) << "AVX512F is available, Please re-compile on local machine"; +#if !defined(_WIN32) && !defined(__APPLE__) && !defined(__OSX__) + if (platform::jit::MayIUse(platform::jit::avx)) { +#ifndef __AVX__ + LOG(WARNING) << "AVX is available, Please re-compile on local machine"; #endif } - if (platform::jit::MayIUse(platform::jit::avx2)) { -#ifndef __AVX2__ - LOG(WARNING) << "AVX2 is available, Please re-compile on local machine"; + +// Throw some informations when CPU instructions mismatch. +#define AVX_GUIDE(compiletime, runtime) \ + LOG(FATAL) \ + << "This version is compiled on higher instruction(" #compiletime \ + ") system, you may encounter illegal instruction error running on" \ + " your local CPU machine. Please reinstall the " #runtime \ + " version or compile from source code." 
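// (Aside: an illustrative sketch of how the enable_cublas_tensor_op_math
// flag defined in gpu_info.cc above, TensorCoreAvailable(), and
// CUDADeviceContext::CublasCall() are meant to combine; the GEMM body is an
// assumption, not code from this diff.)
//   if (FLAGS_enable_cublas_tensor_op_math && platform::TensorCoreAvailable()) {
//     dev_ctx.CublasCall([&] { /* half- or single-precision GEMM here */ },
//                        CUBLAS_TENSOR_OP_MATH);
//   }  // ScopedCublasMathMode restores the previous math mode on scope exit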
+ +#ifdef __AVX512F__ + if (!platform::jit::MayIUse(platform::jit::avx512f)) { + if (platform::jit::MayIUse(platform::jit::avx2)) { + AVX_GUIDE(AVX512, AVX2); + } else if (platform::jit::MayIUse(platform::jit::avx)) { + AVX_GUIDE(AVX512, AVX); + } else { + AVX_GUIDE(AVX512, NonAVX); + } + } #endif + +#ifdef __AVX2__ + if (!platform::jit::MayIUse(platform::jit::avx2)) { + if (platform::jit::MayIUse(platform::jit::avx)) { + AVX_GUIDE(AVX2, AVX); + } else { + AVX_GUIDE(AVX2, NonAVX); + } } - if (platform::jit::MayIUse(platform::jit::avx)) { -#ifndef __AVX__ - LOG(WARNING) << "AVX is available, Please re-compile on local machine"; #endif + +#ifdef __AVX__ + if (!platform::jit::MayIUse(platform::jit::avx)) { + AVX_GUIDE(AVX, NonAVX); } +#endif +#undef AVX_GUIDE + +#endif } void InitGLOG(const std::string &prog_name) { // glog will not hold the ARGV[0] inside. // Use strdup to alloc a new string. google::InitGoogleLogging(strdup(prog_name.c_str())); +#ifndef _WIN32 google::InstallFailureSignalHandler(); +#endif } } // namespace framework diff --git a/paddle/fluid/platform/init.h b/paddle/fluid/platform/init.h index 992ca5e6f6a966a331616a698e3bebd2eee129d5..0e30594672927253cc8083dcb88bb867d63ec729 100644 --- a/paddle/fluid/platform/init.h +++ b/paddle/fluid/platform/init.h @@ -16,9 +16,6 @@ limitations under the License. */ #include #include -#define GLOG_NO_ABBREVIATED_SEVERITIES -#define GOOGLE_GLOG_DLL_DECL - #include "gflags/gflags.h" #include "glog/logging.h" diff --git a/paddle/fluid/platform/lock_guard_ptr.h b/paddle/fluid/platform/lock_guard_ptr.h new file mode 100644 index 0000000000000000000000000000000000000000..bff24e74a7070b31d6385b2d5924bdc62d7219c9 --- /dev/null +++ b/paddle/fluid/platform/lock_guard_ptr.h @@ -0,0 +1,55 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include // NOLINT +namespace paddle { +namespace platform { + +/** + * LockGuard for std::unique_ptr<LockType>. It does nothing when the guarded + * ptr is nullptr. + * + * The advantage of using `LockGuardPtr` instead of + * std::unique_ptr<std::lock_guard<LockType>> is that this type is purely a + * stack variable. There is no heap allocation at all.
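+ *
+ * A minimal usage sketch (illustrative only, not part of this diff):
+ *   std::unique_ptr<std::mutex> mtx;          // may legitimately stay null
+ *   { LockGuardPtr<std::mutex> guard(mtx); }  // no-op while mtx is null
+ *   mtx.reset(new std::mutex());
+ *   { LockGuardPtr<std::mutex> guard(mtx); }  // locks, then unlocks, mtx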
+ */ +template +class LockGuardPtr { + public: + explicit LockGuardPtr(std::unique_ptr& lock_ptr) // NOLINT + : lock_(lock_ptr.get()) { + if (lock_) { + lock_->lock(); + } + } + ~LockGuardPtr() { + if (lock_) { + lock_->unlock(); + } + } + + LockGuardPtr(const LockGuardPtr&) = delete; + LockGuardPtr& operator=(const LockGuardPtr&) = delete; + LockGuardPtr(LockGuardPtr&&) = delete; + LockGuardPtr& operator=(LockGuardPtr&&) = delete; + + private: + LockType* lock_; +}; + +} // namespace platform +} // namespace paddle diff --git a/paddle/fluid/platform/mkldnn_helper.h b/paddle/fluid/platform/mkldnn_helper.h index c0a2543ba5d8ff8f34cb6231c51cb5053a6a9481..814012e6c1fad414d10f5a64af283bed57e11fe3 100644 --- a/paddle/fluid/platform/mkldnn_helper.h +++ b/paddle/fluid/platform/mkldnn_helper.h @@ -187,6 +187,29 @@ class MKLDNNHandler { return mem_p; } + std::shared_ptr AcquireMemory( + const std::shared_ptr& user_memory_p, + const std::shared_ptr& target_memory_p, + const std::string& suffix, + std::vector& pipeline) { // NOLINT + auto local_key = key_ + suffix; + auto key_reorder_p = key_ + suffix + "reorder_p"; + + auto stored_reorder_p = std::static_pointer_cast( + dev_ctx_.GetBlob(key_reorder_p)); + + if (stored_reorder_p) { + pipeline.push_back(*stored_reorder_p); + } else { + auto reorder_p = + std::make_shared(*user_memory_p, *target_memory_p); + dev_ctx_.SetBlob(key_reorder_p, reorder_p); + pipeline.push_back(*reorder_p); + } + + return target_memory_p; + } + std::shared_ptr AcquireMemory( mkldnn::memory::primitive_desc& mpd, // NOLINT mkldnn::memory::primitive_desc& user_mpd, // NOLINT diff --git a/paddle/fluid/platform/nccl_helper.h b/paddle/fluid/platform/nccl_helper.h index 115abb98d56e633c938695c8127c832eab602110..a6360a884d74f06603f28efeb36e39fbd0257cf6 100644 --- a/paddle/fluid/platform/nccl_helper.h +++ b/paddle/fluid/platform/nccl_helper.h @@ -12,6 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. +#ifndef _WIN32 #pragma once #include @@ -112,7 +113,7 @@ struct NCCLContextMap { NCCLGroupGuard gurad; for (auto &gpu_id : order_) { int rank = trainer_id * order_.size() + gpu_id; - VLOG(3) << "init nccl rank: " << rank << " nranks: " << nranks; + VLOG(30) << "init nccl rank: " << rank << " nranks: " << nranks; PADDLE_ENFORCE(cudaSetDevice(gpu_id)); PADDLE_ENFORCE(platform::dynload::ncclCommInitRank( comms.get() + gpu_id, nranks, *nccl_id, rank)); @@ -149,3 +150,4 @@ struct NCCLContextMap { } // namespace platform } // namespace paddle +#endif diff --git a/paddle/fluid/platform/place.h b/paddle/fluid/platform/place.h index e3ee504f3d042d6a99036e34507c4c8bee306750..a095d4929ec2130b4af48d32bf016d9fe108b418 100644 --- a/paddle/fluid/platform/place.h +++ b/paddle/fluid/platform/place.h @@ -13,6 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. 
*/ #pragma once +#include #include #include diff --git a/paddle/fluid/platform/port.h b/paddle/fluid/platform/port.h index cf9f4aa95bc1cb79d95b79331fbc09e11af64194..ad070171df32fd436f24613561d9bc384f79195a 100644 --- a/paddle/fluid/platform/port.h +++ b/paddle/fluid/platform/port.h @@ -17,6 +17,7 @@ #include #include +#include #include #include @@ -24,21 +25,21 @@ #include "glog/logging.h" #if !defined(_WIN32) -#define UNUSED __attribute__((unused)) #include // dladdr #include // backtrace #include +#include #include // std::accumulate #else +#define NOMINMAX // msvc max/min macro conflict with std::min/max +// solve static linking error in windows +// https://github.com/google/glog/issues/301 +#define GOOGLE_GLOG_DLL_DECL #include // _popen, _pclose +#include #include -#if defined(_WIN32) #include // std::accumulate in msvc -#endif -// windows version of __attribute__((unused)) -#define UNUSED __pragma(warning(suppress : 4100)) - -#ifndef S_ISDIR // windows port for sys/stat.h +#ifndef S_ISDIR // windows port for sys/stat.h #define S_ISDIR(mode) (((mode)&S_IFMT) == S_IFDIR) #endif // S_ISDIR @@ -62,6 +63,25 @@ static void *dlopen(const char *filename, int flag) { return reinterpret_cast(hModule); } +static int gettimeofday(struct timeval *tp, void *tzp) { + time_t clock; + struct tm tm; + SYSTEMTIME wtm; + + GetLocalTime(&wtm); + tm.tm_year = wtm.wYear - 1900; + tm.tm_mon = wtm.wMonth - 1; + tm.tm_mday = wtm.wDay; + tm.tm_hour = wtm.wHour; + tm.tm_min = wtm.wMinute; + tm.tm_sec = wtm.wSecond; + tm.tm_isdst = -1; + clock = mktime(&tm); + tp->tv_sec = clock; + tp->tv_usec = wtm.wMilliseconds * 1000; + + return (0); +} #endif // !_WIN32 static void ExecShellCommand(const std::string &cmd, std::string *message) { @@ -137,10 +157,12 @@ static void MkDir(const char *path) { } } #else - CreateDirectory(path, NULL); - auto errorno = GetLastError(); - if (errorno != ERROR_ALREADY_EXISTS) { - throw std::runtime_error(path_error); + BOOL return_value = CreateDirectory(path, NULL); + if (!return_value) { + auto errorno = GetLastError(); + if (errorno != ERROR_ALREADY_EXISTS) { + throw std::runtime_error(path_error); + } } #endif // !_WIN32 } diff --git a/paddle/fluid/platform/profiler.cc b/paddle/fluid/platform/profiler.cc index a35147da90e87af85308431fd7dbe965bb1fd1d7..998242fb4a09138db24aa75759f4990ffdc4d4e2 100644 --- a/paddle/fluid/platform/profiler.cc +++ b/paddle/fluid/platform/profiler.cc @@ -13,8 +13,8 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/fluid/platform/profiler.h" +#include "paddle/fluid/platform/port.h" -#include #include #include #include @@ -30,6 +30,8 @@ limitations under the License. 
*/ #include "paddle/fluid/platform/device_tracer.h" #include "paddle/fluid/string/printf.h" +DEFINE_bool(enable_rpc_profiler, false, "Enable rpc profiler or not."); + namespace paddle { namespace platform { @@ -193,6 +195,13 @@ RecordEvent::~RecordEvent() { PopEvent(name_, dev_ctx_); } +RecordRPCEvent::RecordRPCEvent(const std::string& name, + const DeviceContext* dev_ctx) { + if (FLAGS_enable_rpc_profiler) { + event_.reset(new platform::RecordEvent(name, dev_ctx)); + } +} + RecordBlock::RecordBlock(int block_id) : is_enabled_(false), start_ns_(PosixInNsec()) { std::lock_guard l(profiler_mu); @@ -217,7 +226,7 @@ RecordBlock::~RecordBlock() { void EnableProfiler(ProfilerState state) { PADDLE_ENFORCE(state != ProfilerState::kDisabled, - "Can't enbale profling, since the input state is ", + "Can't enable profiling, since the input state is ", "ProfilerState::kDisabled"); std::lock_guard l(profiler_mu); diff --git a/paddle/fluid/platform/profiler.h b/paddle/fluid/platform/profiler.h index 38630686f7cf3c669373f941d989adf11ba6cfe6..f5d3490634f3199a23986ec3ae13d9fe3577ac35 100644 --- a/paddle/fluid/platform/profiler.h +++ b/paddle/fluid/platform/profiler.h @@ -69,8 +69,8 @@ void PushEvent(const std::string& name, const DeviceContext* dev_ctx); void PopEvent(const std::string& name, const DeviceContext* dev_ctx); -#if !defined(_WIN32) struct RecordEvent { + // dev_ctx can be set to nullptr if device is cpu. RecordEvent(const std::string& name, const DeviceContext* dev_ctx); ~RecordEvent(); @@ -86,6 +86,16 @@ struct RecordEvent { std::string full_name_; }; +class RecordRPCEvent { + public: + // dev_ctx can be set to nullptr if device is cpu. + RecordRPCEvent(const std::string& name, const DeviceContext* dev_ctx); + ~RecordRPCEvent() {} + + private: + std::unique_ptr event_; +}; + struct RecordBlock { explicit RecordBlock(int block_id); ~RecordBlock(); @@ -95,15 +105,6 @@ struct RecordBlock { std::string name_; uint64_t start_ns_; }; -#else -// windows do not support profiler temporarily. -struct RecordEvent { - RecordEvent(const std::string& name, const DeviceContext* dev_ctx) {} -}; -struct RecordBlock { - explicit RecordBlock(int block_id) {} -}; -#endif // Return the event list of all threads. Assumed the returned value calls // event_lists, event_lists[i][j] represents the j-th Event of i-th thread. 
diff --git a/paddle/fluid/platform/stream_callback_manager.h b/paddle/fluid/platform/stream_callback_manager.h index 6c984065aa5fa1a8875aebe84051ab396bc417ec..ed8734c98cb996721a3002523c9276e6d7f492ae 100644 --- a/paddle/fluid/platform/stream_callback_manager.h +++ b/paddle/fluid/platform/stream_callback_manager.h @@ -14,18 +14,16 @@ #pragma once +#include #include #include #include #include -#include "ThreadPool.h" #include "paddle/fluid/platform/enforce.h" namespace paddle { namespace platform { -using StreamCallback = std::function; - class StreamCallbackManager; struct StreamCallbackContext { @@ -35,7 +33,7 @@ struct StreamCallbackContext { : manager_(manager), callback_(callback) {} const StreamCallbackManager *manager_; // do not own - StreamCallback callback_; + std::function callback_; }; class StreamCallbackManager { @@ -45,16 +43,17 @@ class StreamCallbackManager { template inline void AddCallback(Callback &&callback) const { - AddCallbackWithStreamAndErrorInfo( - [=](cudaStream_t, cudaError_t) { callback(); }); - } - - template - inline void AddCallbackWithStreamAndErrorInfo(Callback &&callback) const { - auto *stream_callback_context = new StreamCallbackContext(this, callback); + auto *stream_callback_context = + new StreamCallbackContext(this, std::forward(callback)); +#if CUDA_VERSION >= 10000 + PADDLE_ENFORCE(cudaLaunchHostFunc(stream_, + StreamCallbackManager::StreamCallbackFunc, + stream_callback_context)); // NOLINT +#else PADDLE_ENFORCE(cudaStreamAddCallback( stream_, StreamCallbackManager::StreamCallbackFunc, - stream_callback_context, 0)); + stream_callback_context, 0)); // NOLINT +#endif } void Wait() const { thread_pool_.reset(new ThreadPool(1)); } @@ -63,17 +62,21 @@ class StreamCallbackManager { const cudaStream_t stream_; mutable std::unique_ptr thread_pool_; - // cudaStreamCallback cannot call CUDA API inside, so we have to use - // thread_pool here +// cudaStreamCallback cannot call CUDA API inside, so we have to use +// thread_pool here +#if CUDA_VERSION >= 10000 + static void CUDART_CB StreamCallbackFunc(void *user_data) +#else static void CUDART_CB StreamCallbackFunc(cudaStream_t stream, - cudaError_t status, - void *user_data) { + cudaError_t status, void *user_data) +#endif + { auto *callback_context_ptr = reinterpret_cast(user_data); callback_context_ptr->manager_->thread_pool_->enqueue([=]() { std::unique_ptr callback_context( callback_context_ptr); - callback_context->callback_(stream, status); + callback_context->callback_(); }); } }; diff --git a/paddle/fluid/platform/transform_test.cu b/paddle/fluid/platform/transform_test.cu index f65d1f60100edc85ba9745ed36f26a0ed160d80f..23f5865971246b2862f859885f5bfccd926b9697 100644 --- a/paddle/fluid/platform/transform_test.cu +++ b/paddle/fluid/platform/transform_test.cu @@ -18,8 +18,6 @@ limitations under the License. 
*/ #include "paddle/fluid/platform/hostdevice.h" #include "paddle/fluid/platform/transform.h" -namespace { - template class Scale { public: @@ -36,10 +34,7 @@ class Multiply { HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; } }; -} // namespace - using paddle::memory::Alloc; -using paddle::memory::Free; using paddle::memory::Copy; using paddle::platform::CPUPlace; @@ -63,13 +58,13 @@ TEST(Transform, GPUUnary) { CUDAPlace gpu0(0); CUDADeviceContext ctx(gpu0); float cpu_buf[4] = {0.1, 0.2, 0.3, 0.4}; - float* gpu_buf = static_cast(Alloc(gpu0, sizeof(float) * 4)); + auto gpu_allocation = Alloc(gpu0, sizeof(float) * 4); + float* gpu_buf = static_cast(gpu_allocation->ptr()); Copy(gpu0, gpu_buf, CPUPlace(), cpu_buf, sizeof(cpu_buf), ctx.stream()); Transform trans; trans(ctx, gpu_buf, gpu_buf + 4, gpu_buf, Scale(10)); ctx.Wait(); Copy(CPUPlace(), cpu_buf, gpu0, gpu_buf, sizeof(cpu_buf), ctx.stream()); - Free(gpu0, gpu_buf); for (int i = 0; i < 4; ++i) { ASSERT_NEAR(cpu_buf[i], static_cast(i + 1), 1e-5); } @@ -89,13 +84,13 @@ TEST(Transform, GPUBinary) { int buf[4] = {1, 2, 3, 4}; CUDAPlace gpu0(0); CUDADeviceContext ctx(gpu0); - int* gpu_buf = static_cast(Alloc(gpu0, sizeof(buf))); + auto gpu_allocation = Alloc(gpu0, sizeof(buf)); + int* gpu_buf = static_cast(gpu_allocation->ptr()); Copy(gpu0, gpu_buf, CPUPlace(), buf, sizeof(buf), ctx.stream()); Transform trans; trans(ctx, gpu_buf, gpu_buf + 4, gpu_buf, gpu_buf, Multiply()); ctx.Wait(); Copy(CPUPlace(), buf, gpu0, gpu_buf, sizeof(buf), ctx.stream()); - Free(gpu0, gpu_buf); for (int i = 0; i < 4; ++i) { ASSERT_EQ((i + 1) * (i + 1), buf[i]); } diff --git a/paddle/fluid/platform/variant.h b/paddle/fluid/platform/variant.h index dc9fad29f281a1c6ac300b48f9e600ff802a5752..e9aef621acea44b0dab7a687c13223617d5603c0 100644 --- a/paddle/fluid/platform/variant.h +++ b/paddle/fluid/platform/variant.h @@ -41,4 +41,13 @@ limitations under the License. 
*/ #include #include #include +#include #include + +// some platform-independent definitions +#if defined(_WIN32) +#define UNUSED +#define __builtin_expect(EXP, C) (EXP) +#else +#define UNUSED __attribute__((unused)) +#endif diff --git a/paddle/fluid/pybind/CMakeLists.txt b/paddle/fluid/pybind/CMakeLists.txt index 5ef519367427d9eb4398093e933193c6ecff64d6..25d241d9768c16e1da304a78f259d5a626f702fc 100644 --- a/paddle/fluid/pybind/CMakeLists.txt +++ b/paddle/fluid/pybind/CMakeLists.txt @@ -1,25 +1,29 @@ -set(PYBIND_DEPS pybind python proto_desc memory executor prune feed_fetch_method pass_builder boost) -set(PYBIND_SRCS pybind.cc exception.cc protobuf.cc const_value.cc) -if(NOT WIN32) -list(APPEND PYBIND_DEPS parallel_executor profiler) -list(APPEND PYBIND_SRCS recordio.cc) -endif() +set(PYBIND_DEPS pybind python proto_desc memory executor prune feed_fetch_method pass_builder parallel_executor profiler) +set(PYBIND_SRCS pybind.cc exception.cc protobuf.cc const_value.cc recordio.cc) if(WITH_PYTHON) if(WITH_AMD_GPU) hip_library(paddle_pybind SHARED SRCS ${PYBIND_SRCS} - DEPS ${PYBIND_DEPS} - ${GLOB_OP_LIB}) + DEPS ARCHIVE_START ${PYBIND_DEPS} + ${GLOB_OP_LIB} ${GLOB_OPERATOR_DEPS} ARCHIVE_END) else() cc_library(paddle_pybind SHARED SRCS ${PYBIND_SRCS} DEPS ${PYBIND_DEPS} - ${GLOB_OP_LIB}) + ${GLOB_OP_LIB} ${GLOB_OPERATOR_DEPS}) if(NOT APPLE AND NOT ANDROID AND NOT WIN32) target_link_libraries(paddle_pybind rt) endif(NOT APPLE AND NOT ANDROID AND NOT WIN32) endif(WITH_AMD_GPU) + if(WIN32) + if(WITH_GPU AND NOT WITH_DSO) + get_property(cuda_modules GLOBAL PROPERTY CUDA_MODULES) + target_link_libraries(paddle_pybind ${cuda_modules}) + endif(WITH_GPU AND NOT WITH_DSO) + target_link_libraries(paddle_pybind shlwapi) + endif(WIN32) + cc_test(tensor_py_test SRCS tensor_py_test.cc DEPS python) endif(WITH_PYTHON) diff --git a/paddle/fluid/pybind/const_value.cc b/paddle/fluid/pybind/const_value.cc index 1f61a0e289f32196ead04d71d07b513cbe4655b1..06d8b65fb1480d9f621ca937c1d66ab7e910f010 100644 --- a/paddle/fluid/pybind/const_value.cc +++ b/paddle/fluid/pybind/const_value.cc @@ -27,6 +27,7 @@ void BindConstValue(pybind11::module* m) { m->def("kZeroVarSuffix", [] { return framework::kZeroVarSuffix; }); m->def("kControlDepVarName", [] { return framework::ir::Node::kControlDepVarName; }); + m->def("kNewGradSuffix", [] { return framework::kNewGradSuffix; }); auto op_proto_and_checker_maker = m->def_submodule("op_proto_and_checker_maker"); diff --git a/paddle/fluid/pybind/protobuf.cc b/paddle/fluid/pybind/protobuf.cc index 3b22718a8c6f994dbc2dc3e7aaa19a7163f716ba..0443ff3fc3d055c54527240cdf57c195e1c0acf8 100644 --- a/paddle/fluid/pybind/protobuf.cc +++ b/paddle/fluid/pybind/protobuf.cc @@ -30,11 +30,12 @@ namespace pybind11 { namespace detail { // Can be replaced by a generic lambda in C++14 -struct variant_caster_visitor : public boost::static_visitor<handle> { +struct __attribute__((visibility("hidden"))) paddle_variant_caster_visitor + : public boost::static_visitor<handle> { return_value_policy policy; handle parent; - variant_caster_visitor(return_value_policy policy, handle parent) + paddle_variant_caster_visitor(return_value_policy policy, handle parent) : policy(policy), parent(parent) {} template @@ -44,10 +45,10 @@ struct variant_caster_visitor : public boost::static_visitor<handle> { }; template -struct variant_caster; template