提交 3d0f6df3 编写于 作者: M minqiyang

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into fix_nms_op_ut_in_py36

test=develop
---
name: 建议(Feature request)
about: 您可以提出您的建议。 You could use this template for reporting a suggestion  issue.
---
欢迎您对PaddlePaddle提出建议,非常感谢您对PaddlePaddle的贡献!
在留下您的建议时,辛苦您同步提供如下信息:
- 版本、环境信息
1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1
2)CPU/GPU:您是否使用GPU进行训练,如是,请提供您的CUDA和cuDNN版本号
3)系统环境:请您描述系统类型、版本,例如Mac OS 10.14
- 复现信息:如为报错,请给出复现环境、复现步骤
- 建议描述:请您详细描述,您认为需优化的功能
Thank you for contributing to PaddlePaddle.
Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before.
Please make sure that this is a feature request.
**System information**
-PaddlePaddle version (eg.1.1)or CommitID
-CPU: including CPUMKL/OpenBlas/MKLDNN version
-GPU: including CUDA/CUDNN version
-OS Platform (eg.Mac OS 10.14)
**To Reproduce**
Steps to reproduce the behavior
**Describe the feature and the current behavior/state.**
**Any Other info.**
---
name: 预测(Inference Issue)
about: 您可以提问预测中报错、应用等问题。 You could use this template for reporting an inference issue.
---
为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息:
- 标题:简洁、精准描述您的问题,例如“最新预测库的API文档在哪儿 ”
- 版本、环境信息:
   1)PaddlePaddle版本:请提供您的PaddlePaddle版本号(如1.1)或CommitID
   2)CPU:预测若用CPU,请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库使用情况
   3)GPU:预测若用GPU,请提供GPU型号、CUDA和CUDNN版本号
   4)系统环境:请您描述系统类型、版本(如Mac OS 10.14),Python版本
-预测信息
   1)C++预测:请您提供预测库安装包的版本信息,及其中的version.txt文件
   2)CMake包含路径的完整命令
   3)API信息(如调用请提供)
   4)预测库来源:官网下载/特殊环境(如BCLOUD编译)
- 复现信息:如为报错,请给出复现环境、复现步骤
- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段
Thank you for contributing to PaddlePaddle.
Before submitting the issue, you could search issue in the github in case that th
If there is no solution,please make sure that this is an inference issue including the following details :
**System information**
-PaddlePaddle version (eg.1.1)or CommitID
-CPU: including CPUMKL/OpenBlas/MKLDNN version
-GPU: including CUDA/CUDNN version
-OS Platform (eg.Mac OS 10.14)
-Python version
-Cmake orders
-C++version.txt
-API information
**To Reproduce**
Steps to reproduce the behavior
**Describe your current behavior**
**Code to reproduce the issue**
**Other info / logs**
---
name: 安装(Installation Issue)
about: 您可以提问安装、编译出现报错等问题。 You could use this template for reporting an installation
 issue.
---
为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
建立issue时,为快速解决问题,请您根据使用情况给出如下信息:
- 标题:请包含关键词“安装错误”/“编译错误”,例如“Mac编译错误”
- 版本、环境信息:
   1)PaddlePaddle版本:请提供您的PaddlePaddle版本号(如1.1)或CommitID
   2)CPU:请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库的使用情况
   3)GPU:请提供GPU型号,CUDA和CUDNN版本号
   4)系统环境:请说明系统类型、版本(如Mac OS 10.14)、Python版本
- 安装方式信息:
1)pip安装/docker安装
2)本地编译:请提供cmake命令,编译命令
3)docker编译:请提供docker镜像,编译命令           
 特殊环境请注明:如离线安装等
- 复现信息:如为报错,请给出复现环境、复现步骤
- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段
Thank you for contributing to PaddlePaddle.
Before submitting the issue, you could search issue in Github in case that there was a similar issue submitted or resolved before.
If there is no solution,please make sure that this is an installation issue including the following details:
**System information**
-PaddlePaddle version (eg.1.1)or CommitID
-CPU: including CPUMKL/OpenBlas/MKLDNN version
-GPU: including CUDA/CUDNN version
-OS Platform (eg. Mac OS 10.14)
-Python version
- Install method: pip install/install with docker/build from source(without docker)/build within docker
- Other special cases that you think may be related to this problem, eg. offline install, special internet condition  
**To Reproduce**
Steps to reproduce the behavior
**Describe your current behavior**
**Code to reproduce the issue**
**Other info / logs**
---
name: 模型(Model Issue)
about: 您可以提问模型、算法、数据集方向的使用报错等问题。You could use this template for reporting a model/
algorithm/dataset  issue.
---
为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
建立issue时,为快速解决问题,请您根据使用情况给出如下信息:
- 标题:简洁、精准描述您的问题,例如“ssd 模型前置lstm报错  ”
- 版本、环境信息:
   1)PaddlePaddle版本:请提供PaddlePaddle版本号,例如1.1或CommitID
   2)CPU:请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库的使用情况
   3)GPU:请提供GPU型号,CUDA和CUDNN版本号
   4)系统环境:请说明系统类型、版本(例如Mac OS 10.14),Python版本
- 模型信息
   1)模型名称 2)使用数据集名称 3)使用算法名称 4)模型链接
- 复现信息:如为报错,请给出复现环境、复现步骤
- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段
Thank you for contributing to PaddlePaddle.
Before submitting the issue, you could search issue in the github.Probably there was a similar issue submitted or resolved before.
If there is no solution,please make sure that this is a issue of models including the following details:
**System information**
-PaddlePaddle version (eg.1.1)or CommitID
-CPU: including CPUMKL/OpenBlas/MKLDNN version
-GPU: including CUDA/CUDNN version
-OS Platform (eg.Mac OS 10.14)
-Python version
-Name of Models&Dataset/details of operator
**To Reproduce**
Steps to reproduce the behavior
**Describe your current behavior**
**Code to reproduce the issue**
**Other info / logs**
---
name: 其他(Others)
about: 如上述分类未包含您的问题,可在此提出。 You could use this template for reporting other issues
---
为使您的问题得到快速解决,在建立Issues前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息:
- 标题:简洁、精准概括您的问题
- 版本、环境信息:
   1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1或CommitID
   2)CPU/GPU:如果您使用GPU训练,请提供GPU驱动版本、CUDA和cuDNN版本号
   3)系统环境:请您描述系统类型、版本,例如Mac OS 10.14
   4)Python版本号
   5)显存信息
- 复现信息:如为报错,请给出复现环境、复现步骤
- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段
Thank you for contributing to PaddlePaddle.
Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before.
If there is no solution,please provide us with the following details :
**System information**
-PaddlePaddle version (eg.1.1)or CommitID
-CPU: including CPUMKL/OpenBlas/MKLDNN version
-GPU: including CUDA/cuDNN version
-OS Platform and Distribution(eg.Mac OS 10.14)
-Python version
**To Reproduce**
Steps to reproduce the behavior
**Describe your current behavior**
**Code to reproduce the issue**
**Other info / logs**
---
name: 训练(Training issue)
about: 您可以提问训练中报错、应用、出core等问题。 You could use this template for reporting an training
 issue.
---
为使您的问题得到快速解决,在建立Issues前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息:
- 标题:简洁、精准概括您的问题,例如“Insufficient Memory xxx" ”
- 版本、环境信息:
   1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1或CommitID
   2)CPU:预测若用CPU,请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库使用情况
   3)GPU:预测若用GPU,请提供GPU型号、CUDA和CUDNN版本号
   4)系统环境:请您描述系统类型、版本,例如Mac OS 10.14,Python版本
- 训练信息
   1)单机/多机,单卡/多卡
   2)显存信息
   3)Operator信息
- 复现信息:如为报错,请给出复现环境、复现步骤
- 问题描述:请详细描述您的问题,同步贴出报错信息、日志、可复现的代码片段
Thank you for contributing to PaddlePaddle.
Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before.
If there is no solution,please make sure that this is a training issue including the following details:
**System information**
-PaddlePaddle version (eg.1.1)or CommitID
-CPU: including CPUMKL/OpenBlas/MKLDNN version
-GPU: including CUDA/CUDNN version
-OS Platform (eg.Mac OS 10.14)
-Other imformation: Distriuted training/informantion of operator/
Graphics card storage
**To Reproduce**
Steps to reproduce the behavior
**Describe your current behavior**
**Code to reproduce the issue**
**Other info / logs**
...@@ -204,7 +204,9 @@ include(external/eigen) # download eigen3 ...@@ -204,7 +204,9 @@ include(external/eigen) # download eigen3
include(external/pybind11) # download pybind11 include(external/pybind11) # download pybind11
include(external/cares) include(external/cares)
include(external/cub) include(external/cub)
include(external/rocprim)
include(external/xxhash) # download xxhash include(external/xxhash) # download xxhash
include(external/dlpack)
include(external/snappy) # download snappy include(external/snappy) # download snappy
include(external/snappystream) # download snappystream include(external/snappystream) # download snappystream
......
include(ExternalProject)
set(DLPACK_SOURCE_DIR ${THIRD_PARTY_PATH}/dlpack)
set(DLPACK_INCLUDE_DIR ${DLPACK_SOURCE_DIR}/src/extern_dlpack/include)
include_directories(${DLPACK_INCLUDE_DIR})
ExternalProject_Add(
extern_dlpack
${EXTERNAL_PROJECT_LOG_ARGS}
GIT_REPOSITORY "https://github.com/dmlc/dlpack.git"
GIT_TAG "v0.2"
PREFIX ${DLPACK_SOURCE_DIR}
UPDATE_COMMAND ""
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
TEST_COMMAND ""
)
if(${CMAKE_VERSION} VERSION_LESS "3.3.0")
set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/dlpack_dummy.c)
file(WRITE ${dummyfile} "const char *dummy = \"${dummyfile}\";")
add_library(dlpack STATIC ${dummyfile})
else()
add_library(dlpack INTERFACE)
endif()
add_dependencies(dlpack extern_dlpack)
LIST(APPEND externl_project_dependencies dlpack)
...@@ -17,7 +17,7 @@ if(WITH_AMD_GPU) ...@@ -17,7 +17,7 @@ if(WITH_AMD_GPU)
extern_eigen3 extern_eigen3
${EXTERNAL_PROJECT_LOG_ARGS} ${EXTERNAL_PROJECT_LOG_ARGS}
GIT_REPOSITORY "https://github.com/sabreshao/hipeigen.git" GIT_REPOSITORY "https://github.com/sabreshao/hipeigen.git"
GIT_TAG 0cba03ff9f8f9f70bbd92ac5857b031aa8fed6f9 GIT_TAG 7cb2b6e5a4b4a1efe658abb215cd866c6fb2275e
PREFIX ${EIGEN_SOURCE_DIR} PREFIX ${EIGEN_SOURCE_DIR}
UPDATE_COMMAND "" UPDATE_COMMAND ""
CONFIGURE_COMMAND "" CONFIGURE_COMMAND ""
......
...@@ -53,7 +53,7 @@ ExternalProject_Add( ...@@ -53,7 +53,7 @@ ExternalProject_Add(
${EXTERNAL_PROJECT_LOG_ARGS} ${EXTERNAL_PROJECT_LOG_ARGS}
DEPENDS ${MKLDNN_DEPENDS} DEPENDS ${MKLDNN_DEPENDS}
GIT_REPOSITORY "https://github.com/01org/mkl-dnn.git" GIT_REPOSITORY "https://github.com/01org/mkl-dnn.git"
GIT_TAG "21fb5f2af1dd14e132af4f1b79160977ee487818" GIT_TAG "830a10059a018cd2634d94195140cf2d8790a75a"
PREFIX ${MKLDNN_SOURCES_DIR} PREFIX ${MKLDNN_SOURCES_DIR}
UPDATE_COMMAND "" UPDATE_COMMAND ""
CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}
......
if (NOT WITH_AMD_GPU)
return()
endif()
# rocprim is "ROCm Parallel Primitives" for short.
# It is a header-only library providing HIP and HC parallel primitives
# for developing performant GPU-accelerated code on AMD ROCm platform.
if("x${HCC_HOME}" STREQUAL "x")
set(HCC_HOME "/opt/rocm/hcc")
endif()
INCLUDE(ExternalProject)
SET(ROCPRIM_SOURCE_DIR ${THIRD_PARTY_PATH}/rocprim)
SET(ROCPRIM_INSTALL_DIR ${THIRD_PARTY_PATH}/install/rocprim)
SET(ROCPRIM_INCLUDE_DIR ${ROCPRIM_INSTALL_DIR}/include)
ExternalProject_Add(
extern_rocprim
GIT_REPOSITORY "https://github.com/ROCmSoftwarePlatform/rocPRIM.git"
GIT_TAG 5bd41b96ab8d8343330fb2c3e1b96775bde3b3fc
PREFIX ${ROCPRIM_SOURCE_DIR}
UPDATE_COMMAND ""
CMAKE_ARGS -DCMAKE_CXX_COMPILER=${HCC_HOME}/bin/hcc
CMAKE_ARGS -DONLY_INSTALL=ON
CMAKE_ARGS -DBUILD_TEST=OFF
CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${ROCPRIM_INSTALL_DIR}
INSTALL_DIR ${ROCPRIM_INSTALL_DIR}
${EXTERNAL_PROJECT_LOG_ARGS}
)
INCLUDE_DIRECTORIES(${ROCPRIM_INCLUDE_DIR})
if (${CMAKE_VERSION} VERSION_LESS "3.3.0")
set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/rocprim_dummy.c)
file(WRITE ${dummyfile} "const char *dummy_rocprim = \"${dummyfile}\";")
add_library(rocprim STATIC ${dummyfile})
else()
add_library(rocprim INTERFACE)
endif()
add_dependencies(rocprim extern_rocprim)
...@@ -129,6 +129,9 @@ set(COMMON_FLAGS ...@@ -129,6 +129,9 @@ set(COMMON_FLAGS
-Wno-error=parentheses-equality # Warnings in pybind11 -Wno-error=parentheses-equality # Warnings in pybind11
-Wno-error=ignored-attributes # Warnings in Eigen, gcc 6.3 -Wno-error=ignored-attributes # Warnings in Eigen, gcc 6.3
-Wno-error=terminate # Warning in PADDLE_ENFORCE -Wno-error=terminate # Warning in PADDLE_ENFORCE
-Wno-error=int-in-bool-context # Warning in Eigen gcc 7.2
-Wimplicit-fallthrough=0 # Warning in tinyformat.h
-Wno-error=maybe-uninitialized # Warning in boost gcc 7.2
) )
set(GPU_COMMON_FLAGS set(GPU_COMMON_FLAGS
......
...@@ -454,25 +454,29 @@ function(hip_library TARGET_NAME) ...@@ -454,25 +454,29 @@ function(hip_library TARGET_NAME)
else() else()
add_library(${TARGET_NAME} STATIC ${_cmake_options} ${_generated_files} ${_sources}) add_library(${TARGET_NAME} STATIC ${_cmake_options} ${_generated_files} ${_sources})
set_target_properties(${TARGET_NAME} PROPERTIES LINKER_LANGUAGE CXX) set_target_properties(${TARGET_NAME} PROPERTIES LINKER_LANGUAGE CXX)
target_link_libraries(${TARGET_NAME} /opt/rocm/hip/lib/libhip_hcc.so /opt/rocm/hip/lib/libhip_device.a) target_link_libraries(${TARGET_NAME} /opt/rocm/hip/lib/libhip_hcc.so /opt/rocm/hip/lib/libhip_device.a /opt/rocm/rccl/lib/librccl.so /opt/rocm/hiprand/lib/libhiprand.so)
find_fluid_modules(${TARGET_NAME}) find_fluid_modules(${TARGET_NAME})
endif() endif()
if (hip_library_DEPS) if("${hip_library_DEPS}" MATCHES "ARCHIVE_START")
add_dependencies(${TARGET_NAME} ${hip_library_DEPS}) # Support linking flags: --whole-archive (Linux) / -force_load (MacOS).
target_link_libraries(${TARGET_NAME} ${hip_library_DEPS}) # WARNING: Please don't use ARCHIVE_START&ARCHIVE_END if TARGET_NAME will be linked by other libraries.
target_circle_link_libraries(${TARGET_NAME} ${hip_library_DEPS})
list(REMOVE_ITEM hip_library_DEPS ARCHIVE_START ARCHIVE_END)
else()
target_link_libraries(${TARGET_NAME} ${hip_library_DEPS})
endif() endif()
# cpplint code style # cpplint code style
foreach(source_file ${hip_library_SRCS}) foreach(source_file ${hip_library_SRCS})
string(REGEX REPLACE "\\.[^.]*$" "" source ${source_file}) string(REGEX REPLACE "\\.[^.]*$" "" source ${source_file})
if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h)
list(APPEND hip_library_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) list(APPEND hip_library_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h)
endif() endif()
endforeach() endforeach()
else(hip_library_SRCS) else(hip_library_SRCS)
if (hip_library_DEPS) if (hip_library_DEPS)
merge_static_libs(${TARGET_NAME} ${hip_library_DEPS}) merge_static_libs(${TARGET_NAME} ${hip_library_DEPS})
else() else()
message(FATAL "Please specify source file or library in nv_library.") message(FATAL "Please specify source file or library in nv_library.")
endif() endif()
endif(hip_library_SRCS) endif(hip_library_SRCS)
endif() endif()
......
...@@ -3,6 +3,8 @@ if(NOT WITH_AMD_GPU) ...@@ -3,6 +3,8 @@ if(NOT WITH_AMD_GPU)
endif() endif()
include_directories("/opt/rocm/include") include_directories("/opt/rocm/include")
include_directories("/opt/rocm/hip/include")
include_directories("/opt/rocm/miopen/include")
include_directories("/opt/rocm/hipblas/include") include_directories("/opt/rocm/hipblas/include")
include_directories("/opt/rocm/hiprand/include") include_directories("/opt/rocm/hiprand/include")
include_directories("/opt/rocm/rocrand/include") include_directories("/opt/rocm/rocrand/include")
...@@ -11,20 +13,40 @@ include_directories("/opt/rocm/thrust") ...@@ -11,20 +13,40 @@ include_directories("/opt/rocm/thrust")
list(APPEND EXTERNAL_LIBS "-L/opt/rocm/lib/ -lhip_hcc") list(APPEND EXTERNAL_LIBS "-L/opt/rocm/lib/ -lhip_hcc")
set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -fPIC -DPADDLE_WITH_HIP -std=c++14" ) set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -fPIC -DPADDLE_WITH_HIP -std=c++11" )
if(WITH_DSO) if(WITH_DSO)
set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_USE_DSO") set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_USE_DSO")
endif(WITH_DSO) endif(WITH_DSO)
if(WITH_DOUBLE)
set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_TYPE_DOUBLE")
endif(WITH_DOUBLE)
if(WITH_TESTING) if(WITH_TESTING)
set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_TESTING") set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_TESTING")
endif(WITH_TESTING) endif(WITH_TESTING)
if(WITH_DISTRIBUTE)
set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_DISTRIBUTE")
endif(WITH_DISTRIBUTE)
if(WITH_GRPC)
set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_GRPC")
endif(WITH_GRPC)
if(NOT WITH_GOLANG)
set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITHOUT_GOLANG")
endif(NOT WITH_GOLANG)
if(WITH_MKLDNN)
set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_WITH_MKLDNN")
endif(WITH_MKLDNN)
set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DANY_IMPL_ANY_CAST_MOVEABLE")
if(NOT WITH_RDMA)
set(HIP_HCC_FLAGS "${HIP_HCC_FLAGS} -DPADDLE_DISABLE_RDMA")
endif(NOT WITH_RDMA)
if(CMAKE_BUILD_TYPE STREQUAL "Debug") if(CMAKE_BUILD_TYPE STREQUAL "Debug")
list(APPEND HIP_HCC_FLAGS ${CMAKE_CXX_FLAGS_DEBUG}) list(APPEND HIP_HCC_FLAGS ${CMAKE_CXX_FLAGS_DEBUG})
elseif(CMAKE_BUILD_TYPE STREQUAL "RelWithDebInfo") elseif(CMAKE_BUILD_TYPE STREQUAL "RelWithDebInfo")
......
...@@ -276,7 +276,7 @@ paddle.fluid.layers.hard_shrink ArgSpec(args=['x', 'threshold'], varargs=None, k ...@@ -276,7 +276,7 @@ paddle.fluid.layers.hard_shrink ArgSpec(args=['x', 'threshold'], varargs=None, k
paddle.fluid.layers.cumsum ArgSpec(args=['x', 'axis', 'exclusive', 'reverse'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.layers.cumsum ArgSpec(args=['x', 'axis', 'exclusive', 'reverse'], varargs=None, keywords=None, defaults=(None, None, None))
paddle.fluid.layers.thresholded_relu ArgSpec(args=['x', 'threshold'], varargs=None, keywords=None, defaults=(None,)) paddle.fluid.layers.thresholded_relu ArgSpec(args=['x', 'threshold'], varargs=None, keywords=None, defaults=(None,))
paddle.fluid.layers.prior_box ArgSpec(args=['input', 'image', 'min_sizes', 'max_sizes', 'aspect_ratios', 'variance', 'flip', 'clip', 'steps', 'offset', 'name', 'min_max_aspect_ratios_order'], varargs=None, keywords=None, defaults=(None, [1.0], [0.1, 0.1, 0.2, 0.2], False, False, [0.0, 0.0], 0.5, None, False)) paddle.fluid.layers.prior_box ArgSpec(args=['input', 'image', 'min_sizes', 'max_sizes', 'aspect_ratios', 'variance', 'flip', 'clip', 'steps', 'offset', 'name', 'min_max_aspect_ratios_order'], varargs=None, keywords=None, defaults=(None, [1.0], [0.1, 0.1, 0.2, 0.2], False, False, [0.0, 0.0], 0.5, None, False))
paddle.fluid.layers.density_prior_box ArgSpec(args=['input', 'image', 'densities', 'fixed_sizes', 'fixed_ratios', 'variance', 'clip', 'steps', 'offset', 'name'], varargs=None, keywords=None, defaults=(None, None, None, [0.1, 0.1, 0.2, 0.2], False, [0.0, 0.0], 0.5, None)) paddle.fluid.layers.density_prior_box ArgSpec(args=['input', 'image', 'densities', 'fixed_sizes', 'fixed_ratios', 'variance', 'clip', 'steps', 'offset', 'flatten_to_2d', 'name'], varargs=None, keywords=None, defaults=(None, None, None, [0.1, 0.1, 0.2, 0.2], False, [0.0, 0.0], 0.5, False, None))
paddle.fluid.layers.multi_box_head ArgSpec(args=['inputs', 'image', 'base_size', 'num_classes', 'aspect_ratios', 'min_ratio', 'max_ratio', 'min_sizes', 'max_sizes', 'steps', 'step_w', 'step_h', 'offset', 'variance', 'flip', 'clip', 'kernel_size', 'pad', 'stride', 'name', 'min_max_aspect_ratios_order'], varargs=None, keywords=None, defaults=(None, None, None, None, None, None, None, 0.5, [0.1, 0.1, 0.2, 0.2], True, False, 1, 0, 1, None, False)) paddle.fluid.layers.multi_box_head ArgSpec(args=['inputs', 'image', 'base_size', 'num_classes', 'aspect_ratios', 'min_ratio', 'max_ratio', 'min_sizes', 'max_sizes', 'steps', 'step_w', 'step_h', 'offset', 'variance', 'flip', 'clip', 'kernel_size', 'pad', 'stride', 'name', 'min_max_aspect_ratios_order'], varargs=None, keywords=None, defaults=(None, None, None, None, None, None, None, 0.5, [0.1, 0.1, 0.2, 0.2], True, False, 1, 0, 1, None, False))
paddle.fluid.layers.bipartite_match ArgSpec(args=['dist_matrix', 'match_type', 'dist_threshold', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.layers.bipartite_match ArgSpec(args=['dist_matrix', 'match_type', 'dist_threshold', 'name'], varargs=None, keywords=None, defaults=(None, None, None))
paddle.fluid.layers.target_assign ArgSpec(args=['input', 'matched_indices', 'negative_indices', 'mismatch_value', 'name'], varargs=None, keywords=None, defaults=(None, None, None)) paddle.fluid.layers.target_assign ArgSpec(args=['input', 'matched_indices', 'negative_indices', 'mismatch_value', 'name'], varargs=None, keywords=None, defaults=(None, None, None))
......
...@@ -192,3 +192,6 @@ cc_test(tuple_test SRCS tuple_test.cc ) ...@@ -192,3 +192,6 @@ cc_test(tuple_test SRCS tuple_test.cc )
if (NOT WIN32) if (NOT WIN32)
cc_test(rw_lock_test SRCS rw_lock_test.cc) cc_test(rw_lock_test SRCS rw_lock_test.cc)
endif (NOT WIN32) endif (NOT WIN32)
cc_library(dlpack_tensor SRCS dlpack_tensor.cc DEPS tensor dlpack)
cc_test(dlpack_tensor_test SRCS dlpack_tensor_test.cc DEPS dlpack_tensor glog)
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "paddle/fluid/framework/dlpack_tensor.h"
namespace paddle {
namespace framework {
namespace internal {
template <typename T>
static ::DLDataType GetDLDataTypeCode() {
::DLDataType dtype;
if (std::is_same<T, platform::float16>::value ||
std::is_floating_point<T>::value) {
dtype.code = kDLFloat;
} else if (std::is_unsigned<T>::value) {
dtype.code = kDLUInt;
} else if (std::is_integral<T>::value) {
dtype.code = kDLInt;
} else {
PADDLE_THROW("Unsupported data type %s", typeid(T).name());
}
dtype.bits = 8 * sizeof(T);
dtype.lanes = 1;
return dtype;
}
static DLDataType GetDLDataTypeFromTypeIndex(const std::type_index &type) {
#define REG_DL_DATA_TYPE(type) \
{ std::type_index(typeid(type)), GetDLDataTypeCode<type>() }
static const std::unordered_map<std::type_index, ::DLDataType>
type_to_dtype_map({
REG_DL_DATA_TYPE(platform::float16), // NOLINT
REG_DL_DATA_TYPE(float), // NOLINT
REG_DL_DATA_TYPE(double), // NOLINT
REG_DL_DATA_TYPE(int), // NOLINT
REG_DL_DATA_TYPE(int64_t), // NOLINT
REG_DL_DATA_TYPE(bool), // NOLINT
REG_DL_DATA_TYPE(size_t), // NOLINT
REG_DL_DATA_TYPE(int16_t), // NOLINT
REG_DL_DATA_TYPE(uint8_t), // NOLINT
REG_DL_DATA_TYPE(int8_t) // NOLINT
});
static auto type_to_dtype_map_end_it = type_to_dtype_map.end();
auto it = type_to_dtype_map.find(type);
PADDLE_ENFORCE(it != type_to_dtype_map_end_it, "Unsupported data type %s",
type.name());
return it->second;
#undef REG_DL_DATA_TYPE
}
struct DLContextVisitor : public boost::static_visitor<::DLContext> {
inline ::DLContext operator()(const platform::CPUPlace &place) const {
DLContext ctx;
ctx.device_type = kDLCPU;
ctx.device_id = 0;
return ctx;
}
inline ::DLContext operator()(const platform::CUDAPlace &place) const {
#ifdef PADDLE_WITH_CUDA
DLContext ctx;
ctx.device_type = kDLGPU;
ctx.device_id = place.device;
return ctx;
#else
PADDLE_THROW("platform::CUDAPlace is not supported in CPU only version");
#endif
}
inline ::DLContext operator()(const platform::CUDAPinnedPlace &place) const {
#ifdef PADDLE_WITH_CUDA
DLContext ctx;
ctx.device_type = kDLCPUPinned;
ctx.device_id = 0;
return ctx;
#else
PADDLE_THROW(
"platform::CUDAPinnedPlace is not supported in CPU only version");
#endif
}
};
} // namespace internal
DLPackTensor::DLPackTensor(const Tensor &tensor, LaneType lanes) {
// init data, data buffer
t_.data = const_cast<void *>(tensor.data<void>());
// init ctx, DLContext type with device_type and device_id
auto place = tensor.place();
t_.ctx = boost::apply_visitor(internal::DLContextVisitor(), place);
// init dtype
t_.dtype = internal::GetDLDataTypeFromTypeIndex(tensor.type());
t_.dtype.lanes = lanes;
// init ndim, tensor rank
auto &dims = tensor.dims();
using DimType = decltype(t_.ndim); // int
t_.ndim = static_cast<DimType>(dims.size());
// init shape, tensor dims
t_.shape = shape_;
for (DimType i = 0; i < t_.ndim; ++i) {
t_.shape[i] = dims[i];
}
// init strides, nullptr means the tensor is compact
t_.strides = nullptr;
// init byte_offset
t_.byte_offset = 0;
}
} // namespace framework
} // namespace paddle
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#pragma once
#include <dlpack/dlpack.h>
#include "paddle/fluid/framework/tensor.h"
namespace paddle {
namespace framework {
class DLPackTensor {
public:
using LaneType = decltype(::DLTensor::dtype.lanes); // uint16_t
using ShapeType =
std::remove_reference<decltype(::DLTensor::shape[0])>::type; // int64_t
// lanes is only used in CPU to enable vectorization
explicit DLPackTensor(const Tensor& tensor, LaneType lanes = 1);
inline operator const ::DLTensor&() const { return t_; }
inline operator ::DLTensor&() { return t_; }
private:
::DLTensor t_;
// The shape in DLTensor is defined as int64_t*
// Add this member to make TVMTensor init without heap allocation
ShapeType shape_[9];
};
} // namespace framework
} // namespace paddle
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "paddle/fluid/framework/dlpack_tensor.h"
#include <glog/logging.h>
#include <gtest/gtest.h>
#include <vector>
namespace paddle {
namespace framework {
namespace { // NOLINT
template <typename T>
constexpr uint8_t GetDLDataTypeCode() {
return std::is_same<platform::float16, T>::value ||
std::is_floating_point<T>::value
? static_cast<uint8_t>(kDLFloat)
: (std::is_unsigned<T>::value
? static_cast<uint8_t>(kDLUInt)
: (std::is_integral<T>::value ? static_cast<uint8_t>(kDLInt)
: static_cast<uint8_t>(-1)));
}
} // NOLINT
template <typename T>
void TestMain(const platform::Place &place, uint16_t lanes) {
DDim dims{4, 5, 6, 7};
Tensor tensor;
tensor.Resize(dims);
void *p = tensor.mutable_data<T>(place);
DLPackTensor dlpack_tensor(tensor, lanes);
::DLTensor &dl_tensor = dlpack_tensor;
CHECK_EQ(p, dl_tensor.data);
if (platform::is_cpu_place(place)) {
CHECK_EQ(kDLCPU, dl_tensor.ctx.device_type);
CHECK_EQ(0, dl_tensor.ctx.device_id);
} else if (platform::is_gpu_place(place)) {
CHECK_EQ(kDLGPU, dl_tensor.ctx.device_type);
CHECK_EQ(boost::get<platform::CUDAPlace>(place).device,
dl_tensor.ctx.device_id);
} else if (platform::is_cuda_pinned_place(place)) {
CHECK_EQ(kDLCPUPinned, dl_tensor.ctx.device_type);
CHECK_EQ(0, dl_tensor.ctx.device_id);
} else {
CHECK_EQ(false, true);
}
CHECK_EQ(dims.size(), dl_tensor.ndim);
for (auto i = 0; i < dims.size(); ++i) {
CHECK_EQ(dims[i], dl_tensor.shape[i]);
}
CHECK_EQ(dl_tensor.strides == nullptr, true);
CHECK_EQ(static_cast<uint64_t>(0), dl_tensor.byte_offset);
CHECK_EQ(lanes, dl_tensor.dtype.lanes);
CHECK_EQ(sizeof(T) * 8, dl_tensor.dtype.bits);
CHECK_EQ(GetDLDataTypeCode<T>(), dl_tensor.dtype.code);
}
template <typename T>
void TestMainLoop() {
#ifdef PADDLE_WITH_CUDA
std::vector<platform::Place> places{platform::CPUPlace(),
platform::CUDAPlace(0),
platform::CUDAPinnedPlace()};
if (platform::GetCUDADeviceCount() > 1) {
places.emplace_back(platform::CUDAPlace(1));
}
#else
std::vector<platform::Place> places{platform::CPUPlace()};
#endif
std::vector<uint16_t> lanes{1, 2};
for (auto &p : places) {
for (auto &l : lanes) {
TestMain<T>(p, l);
}
}
}
#define PADDLE_DLPACK_TEST(type) \
TEST(dlpack, test_##type) { TestMainLoop<type>(); }
using float16 = platform::float16;
PADDLE_DLPACK_TEST(float16);
PADDLE_DLPACK_TEST(float);
PADDLE_DLPACK_TEST(double);
PADDLE_DLPACK_TEST(int);
PADDLE_DLPACK_TEST(int64_t);
PADDLE_DLPACK_TEST(bool);
PADDLE_DLPACK_TEST(size_t);
PADDLE_DLPACK_TEST(int16_t);
PADDLE_DLPACK_TEST(uint8_t);
PADDLE_DLPACK_TEST(int8_t);
#undef PADDLE_DLPACK_TEST
} // namespace framework
} // namespace paddle
...@@ -252,6 +252,12 @@ void OpDesc::SetAttr(const std::string &name, const Attribute &v) { ...@@ -252,6 +252,12 @@ void OpDesc::SetAttr(const std::string &name, const Attribute &v) {
this->attrs_[name] = std::vector<int>(); this->attrs_[name] = std::vector<int>();
break; break;
} }
case proto::AttrType::LONGS: {
VLOG(110) << "SetAttr: " << Type() << ", " << name
<< " from LONGS to LONGS";
this->attrs_[name] = std::vector<int64_t>();
break;
}
case proto::AttrType::FLOATS: { case proto::AttrType::FLOATS: {
VLOG(110) << "SetAttr: " << Type() << ", " << name VLOG(110) << "SetAttr: " << Type() << ", " << name
<< " from INTS to FLOATS"; << " from INTS to FLOATS";
......
...@@ -19,9 +19,6 @@ namespace paddle { ...@@ -19,9 +19,6 @@ namespace paddle {
namespace inference { namespace inference {
namespace tensorrt { namespace tensorrt {
/*
* SplitOp.
*/
class SplitOpConverter : public OpConverter { class SplitOpConverter : public OpConverter {
public: public:
void operator()(const framework::proto::OpDesc& op, void operator()(const framework::proto::OpDesc& op,
...@@ -40,16 +37,11 @@ class SplitOpConverter : public OpConverter { ...@@ -40,16 +37,11 @@ class SplitOpConverter : public OpConverter {
int axis = boost::get<int>(op_desc.GetAttr("axis")); int axis = boost::get<int>(op_desc.GetAttr("axis"));
std::vector<int> output_lengths = std::vector<int> output_lengths =
boost::get<std::vector<int>>(op_desc.GetAttr("sections")); boost::get<std::vector<int>>(op_desc.GetAttr("sections"));
// split on batch is not supported in TensorRT
PADDLE_ENFORCE(axis != 0); PADDLE_ENFORCE(axis != 0);
if (axis < 0) { axis += (axis < 0) ? input_dims.nbDims : -1;
axis += input_dims.nbDims;
} else {
axis -= 1;
}
PADDLE_ENFORCE(output_lengths.size() == output_num); PADDLE_ENFORCE(output_lengths.size() == output_num);
//
plugin::SplitPlugin* plugin = new plugin::SplitPlugin(axis, output_lengths); plugin::SplitPlugin* plugin = new plugin::SplitPlugin(axis, output_lengths);
nvinfer1::IPluginLayer* layer = nvinfer1::IPluginLayer* layer =
engine_->AddPlugin(&input, input_num, plugin); engine_->AddPlugin(&input, input_num, plugin);
......
...@@ -20,30 +20,92 @@ namespace paddle { ...@@ -20,30 +20,92 @@ namespace paddle {
namespace inference { namespace inference {
namespace tensorrt { namespace tensorrt {
TEST(split_op, test) { template <int BatchSize, int Axis>
void TensorRTSplitTest(const std::vector<int> &in_shape,
const std::vector<int> &sections) {
std::unordered_set<std::string> parameters({""}); std::unordered_set<std::string> parameters({""});
framework::Scope scope; framework::Scope scope;
TRTConvertValidation validator(10, parameters, scope, 1000); TRTConvertValidation validator(BatchSize + 1, parameters, scope, 10000);
validator.DeclInputVar("split_input", nvinfer1::DimsCHW(3, 2, 2));
validator.DeclOutputVar("split_out1", nvinfer1::DimsCHW(2, 2, 2)); auto make_dim = [](const std::vector<int> &shape) {
validator.DeclOutputVar("split_out2", nvinfer1::DimsCHW(1, 2, 2)); nvinfer1::DimsCHW dim;
dim.c() = shape[0];
dim.h() = shape[1];
dim.w() = shape[2];
return dim;
};
validator.DeclInputVar("split_input", make_dim(in_shape));
std::vector<std::string> output_vars;
for (size_t i = 0; i < sections.size(); ++i) {
auto out_shape = in_shape;
out_shape[Axis - 1] = sections[i];
std::string output_name = "split_out" + std::to_string(i);
validator.DeclOutputVar(output_name, make_dim(out_shape));
output_vars.push_back(output_name);
}
// Prepare Op description // Prepare Op description
framework::OpDesc desc; framework::OpDesc desc;
desc.SetType("split"); desc.SetType("split");
desc.SetInput("X", {"split_input"}); desc.SetInput("X", {"split_input"});
desc.SetOutput("Out", {"split_out1", "split_out2"}); desc.SetOutput("Out", output_vars);
int num = 0; desc.SetAttr("axis", Axis);
int axis = 1; desc.SetAttr("num", 0);
std::vector<int> output_lengths = {2, 1}; desc.SetAttr("sections", sections);
desc.SetAttr("axis", axis);
desc.SetAttr("num", num);
desc.SetAttr("sections", output_lengths);
validator.SetOp(*desc.Proto()); validator.SetOp(*desc.Proto());
validator.Execute(1); validator.Execute(BatchSize);
}
// batch = 0, axis = 1, same shape
TEST(split_op, test_same_shape_axis1_batch1) {
TensorRTSplitTest<1, 1>({4, 2, 2}, {2, 2});
}
// batch = 0, axis = 1, different shape
TEST(split_op, test_different_shape_axis1_batch1) {
TensorRTSplitTest<1, 1>({3, 2, 2}, {2, 1});
}
// batch = 10, axis = 1, same shape
TEST(split_op, test_same_shape_axis1_batch10) {
TensorRTSplitTest<10, 1>({4, 2, 2}, {2, 2});
}
// batch = 10, axis = 1, different shape
TEST(split_op, test_different_shape_axis1_batch10) {
TensorRTSplitTest<10, 1>({3, 2, 2}, {2, 1});
}
// batch = 0, axis = 2, same shape
TEST(split_op, test_same_shape_axis2_batch1) {
TensorRTSplitTest<1, 2>({3, 4, 2}, {2, 2});
}
// batch = 0, axis = 2, different shape
TEST(split_op, test_different_shape_axis2_batch1) {
TensorRTSplitTest<1, 2>({3, 3, 2}, {2, 1});
}
// batch = 10, axis = 2, same shape
TEST(split_op, test_same_shape_axis2_batch10) {
TensorRTSplitTest<10, 2>({3, 4, 2}, {2, 2});
}
// batch = 10, axis = 2, different shape
TEST(split_op, test_different_shape_axis2_batch10) {
TensorRTSplitTest<10, 2>({3, 3, 2}, {2, 1});
}
// batch = 0, axis = 3, same shape
TEST(split_op, test_same_shape_axis3_batch1) {
TensorRTSplitTest<1, 3>({3, 2, 4}, {2, 2});
}
// batch = 0, axis = 3, different shape
TEST(split_op, test_different_shape_axis3_batch1) {
TensorRTSplitTest<1, 3>({3, 2, 3}, {2, 1});
}
// batch = 10, axis = 3, same shape
TEST(split_op, test_same_shape_axis3_batch10) {
TensorRTSplitTest<10, 3>({3, 2, 4}, {2, 2});
}
// batch = 10, axis = 3, different shape
TEST(split_op, test_different_shape_axis3_batch10) {
TensorRTSplitTest<10, 3>({3, 2, 3}, {2, 1});
} }
} // namespace tensorrt } // namespace tensorrt
......
...@@ -12,6 +12,8 @@ ...@@ -12,6 +12,8 @@
// See the License for the specific language governing permissions and // See the License for the specific language governing permissions and
// limitations under the License. // limitations under the License.
#include <cuda_fp16.h>
#include <algorithm>
#include "paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h" #include "paddle/fluid/inference/tensorrt/plugin/split_op_plugin.h"
namespace paddle { namespace paddle {
...@@ -19,6 +21,52 @@ namespace inference { ...@@ -19,6 +21,52 @@ namespace inference {
namespace tensorrt { namespace tensorrt {
namespace plugin { namespace plugin {
// copied from operators::math::SplitFunctor
template <typename T>
__global__ void SplitKernel(const T* input_data, const int in_row,
const int in_col, const int* out_cols,
int out_cols_size, T** outputs_data) {
int tid_x = blockIdx.x * blockDim.x + threadIdx.x;
int curr_segment = 0;
int curr_offset = out_cols[0];
for (; tid_x < in_col; tid_x += blockDim.x * gridDim.x) {
int curr_col_offset = out_cols[curr_segment + 1];
while (curr_col_offset <= tid_x) {
curr_offset = curr_col_offset;
++curr_segment;
curr_col_offset = out_cols[curr_segment + 1];
}
int local_col = tid_x - curr_offset;
int segment_width = curr_col_offset - curr_offset;
T* output_ptr = outputs_data[curr_segment];
if (output_ptr != nullptr) {
int tid_y = blockIdx.y * blockDim.y + threadIdx.y;
for (; tid_y < in_row; tid_y += blockDim.y * gridDim.y)
output_ptr[tid_y * segment_width + local_col] =
input_data[tid_y * in_col + tid_x];
}
}
}
template <typename T>
__global__ void SplitKernel(const T* input_data, const int in_row,
const int in_col, const int fixed_out_col,
T** outputs_data) {
int tid_x = blockIdx.x * blockDim.x + threadIdx.x;
for (; tid_x < in_col; tid_x += blockDim.x * gridDim.x) {
int split = tid_x / fixed_out_col;
int in_offset = tid_x - split * fixed_out_col;
T* output_ptr = outputs_data[split];
if (output_ptr != nullptr) {
int tid_y = blockIdx.y * blockDim.y + threadIdx.y;
for (; tid_y < in_row; tid_y += blockDim.y * gridDim.y)
output_ptr[tid_y * fixed_out_col + in_offset] =
input_data[tid_y * in_col + tid_x];
}
}
}
nvinfer1::Dims SplitPlugin::getOutputDimensions( nvinfer1::Dims SplitPlugin::getOutputDimensions(
int index, const nvinfer1::Dims* input_dims, int num_inputs) { int index, const nvinfer1::Dims* input_dims, int num_inputs) {
PADDLE_ENFORCE_EQ(num_inputs, 1); PADDLE_ENFORCE_EQ(num_inputs, 1);
...@@ -31,48 +79,96 @@ nvinfer1::Dims SplitPlugin::getOutputDimensions( ...@@ -31,48 +79,96 @@ nvinfer1::Dims SplitPlugin::getOutputDimensions(
int SplitPlugin::initialize() { int SplitPlugin::initialize() {
PADDLE_ENFORCE_LE(axis_, nvinfer1::Dims::MAX_DIMS); PADDLE_ENFORCE_LE(axis_, nvinfer1::Dims::MAX_DIMS);
// notice input dims is [C, H, W]
nvinfer1::Dims dims = this->getInputDims(0);
outer_rows_ = 1;
inner_cols_ = 1;
for (int i = 0; i < axis_; ++i) {
outer_rows_ *= dims.d[i];
}
for (int i = axis_ + 1; i < dims.nbDims; ++i) {
inner_cols_ *= dims.d[i];
}
same_shape_ = true;
std::vector<int> segment_offsets(1, 0); std::vector<int> segment_offsets(1, 0);
for (int i = 0; i < this->getNbOutputs(); ++i) { for (int i = 0; i < this->getNbOutputs(); ++i) {
segment_offsets.push_back(segment_offsets.back() + output_length_[i]); if (output_length_[i] != output_length_[0]) {
same_shape_ = false;
}
segment_offsets.push_back(segment_offsets.back() +
output_length_[i] * inner_cols_);
} }
segment_offsets_ = segment_offsets; inner_cols_ *= dims.d[axis_];
nvinfer1::Dims dims = this->getInputDims(0); d_segment_offsets_ = segment_offsets;
nx_ = 1; segment_offsets_ = std::move(segment_offsets);
for (int i = dims.nbDims - 1; i > axis_; --i) { d_output_ptrs_.resize(this->getNbOutputs(), nullptr);
nx_ *= dims.d[i]; return 0;
}
template <typename T>
inline void Split(cudaStream_t stream, const bool same_shape,
const int outer_rows, const int inner_cols,
const std::vector<int>& segment_offsets,
const int* d_segment_offsets, const T* input, T** outputs) {
const int kThreadsPerBlock = 1024;
const int kMaxBlocks = 65535;
int block_cols = kThreadsPerBlock;
if (inner_cols < kThreadsPerBlock) { // block_cols is aligned by 32.
block_cols = ((inner_cols + 31) >> 5) << 5;
} }
ny_ = dims.d[axis_]; int block_rows = kThreadsPerBlock / block_cols;
nz_ = 1; dim3 block_size = dim3(block_cols, block_rows, 1);
for (int i = axis_ - 1; i >= 0; --i) {
nz_ *= dims.d[i]; int grid_cols =
std::min((inner_cols + block_cols - 1) / block_cols, kMaxBlocks);
int grid_rows =
std::min(kMaxBlocks / grid_cols, std::max(outer_rows / block_rows, 1));
dim3 grid_size = dim3(grid_cols, grid_rows, 1);
if (same_shape) {
SplitKernel<<<grid_size, block_size, 0, stream>>>(
input, outer_rows, inner_cols, segment_offsets[1], outputs);
} else {
SplitKernel<<<grid_size, block_size, 0, stream>>>(
input, outer_rows, inner_cols, d_segment_offsets,
static_cast<int>(segment_offsets.size()), outputs);
} }
return 0;
} }
int SplitPlugin::enqueue(int batchSize, const void* const* inputs, int SplitPlugin::enqueue(int batchSize, const void* const* inputs,
void** outputs, void* workspace, cudaStream_t stream) { void** outputs, void* workspace, cudaStream_t stream) {
auto const& input_dims = this->getInputDims(0); float const* input_ptr = reinterpret_cast<float const*>(inputs[0]);
int input_size = 0; if (((batchSize == 1 && axis_ == 0) || axis_ == -1) &&
float const* idata = reinterpret_cast<float const*>(inputs[0]); this->getNbOutputs() < 10) {
float** odatas = reinterpret_cast<float**>(outputs); float** output_ptrs = reinterpret_cast<float**>(outputs);
int data_type_size = (this->getDataType() == nvinfer1::DataType::kFLOAT)
// kernel impl here. ? sizeof(float)
int inputBatchOffset = nx_ * ny_ * nz_; : sizeof(__half);
for (size_t i = 0; i < this->getNbOutputs(); i++) { for (int i = 0; i < this->getNbOutputs(); ++i) {
for (size_t j = 0; j < batchSize; j++) { PADDLE_ENFORCE(
cudaMemcpyAsync( cudaMemcpyAsync(
odatas[i] + output_ptrs[i], input_ptr + segment_offsets_[i],
j * (segment_offsets_[i + 1] - segment_offsets_[i]) * nx_ * (segment_offsets_[i + 1] - segment_offsets_[i]) * data_type_size,
sizeof(float), cudaMemcpyDeviceToDevice, stream) == cudaSuccess);
inputs[0] + }
(inputBatchOffset * j + segment_offsets_[i] * nx_) * } else {
sizeof(float), outer_rows_ *= batchSize;
(segment_offsets_[i + 1] - segment_offsets_[i]) * nx_ * sizeof(float), const int* d_segment_offsets_ptr =
cudaMemcpyDeviceToDevice, stream); thrust::raw_pointer_cast(&d_segment_offsets_[0]);
float** output_ptrs = thrust::raw_pointer_cast(&d_output_ptrs_[0]);
PADDLE_ENFORCE(cudaMemcpyAsync(output_ptrs, outputs,
this->getNbOutputs() * sizeof(float*),
cudaMemcpyHostToDevice,
stream) == cudaSuccess);
if (this->getDataType() == nvinfer1::DataType::kFLOAT) {
Split(stream, same_shape_, outer_rows_, inner_cols_, segment_offsets_,
d_segment_offsets_ptr, input_ptr, output_ptrs);
} else {
Split(stream, same_shape_, outer_rows_, inner_cols_, segment_offsets_,
d_segment_offsets_ptr, (__half*)input_ptr, // NOLINT
(__half**)output_ptrs); // NOLINT
} }
} }
return cudaGetLastError() != cudaSuccess; return cudaGetLastError() != cudaSuccess;
} }
......
...@@ -14,6 +14,7 @@ ...@@ -14,6 +14,7 @@
#pragma once #pragma once
#include <thrust/device_vector.h>
#include <vector> #include <vector>
#include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h" #include "paddle/fluid/inference/tensorrt/plugin/trt_plugin.h"
...@@ -25,7 +26,7 @@ namespace plugin { ...@@ -25,7 +26,7 @@ namespace plugin {
class SplitPlugin : public PluginTensorRT { class SplitPlugin : public PluginTensorRT {
public: public:
SplitPlugin(int axis, std::vector<int> const &output_lengths) SplitPlugin(int axis, std::vector<int> const &output_lengths)
: axis_(axis), output_length_(output_lengths) {} : axis_(axis), same_shape_(true), output_length_(output_lengths) {}
SplitPlugin(void const *serial_data, size_t serial_length) { SplitPlugin(void const *serial_data, size_t serial_length) {
deserializeBase(serial_data, serial_length); deserializeBase(serial_data, serial_length);
...@@ -60,9 +61,13 @@ class SplitPlugin : public PluginTensorRT { ...@@ -60,9 +61,13 @@ class SplitPlugin : public PluginTensorRT {
} }
int axis_; int axis_;
int outer_rows_;
int inner_cols_;
bool same_shape_;
std::vector<int> output_length_; std::vector<int> output_length_;
int nx_, ny_, nz_;
std::vector<int> segment_offsets_; std::vector<int> segment_offsets_;
thrust::device_vector<int> d_segment_offsets_;
thrust::device_vector<float *> d_output_ptrs_;
}; };
} // namespace plugin } // namespace plugin
......
...@@ -22,7 +22,7 @@ iou_similarity_op.cu) ...@@ -22,7 +22,7 @@ iou_similarity_op.cu)
detection_library(mine_hard_examples_op SRCS mine_hard_examples_op.cc) detection_library(mine_hard_examples_op SRCS mine_hard_examples_op.cc)
detection_library(multiclass_nms_op SRCS multiclass_nms_op.cc poly_util.cc gpc.cc) detection_library(multiclass_nms_op SRCS multiclass_nms_op.cc poly_util.cc gpc.cc)
detection_library(prior_box_op SRCS prior_box_op.cc prior_box_op.cu) detection_library(prior_box_op SRCS prior_box_op.cc prior_box_op.cu)
detection_library(density_prior_box_op SRCS density_prior_box_op.cc) detection_library(density_prior_box_op SRCS density_prior_box_op.cc density_prior_box_op.cu)
detection_library(anchor_generator_op SRCS anchor_generator_op.cc detection_library(anchor_generator_op SRCS anchor_generator_op.cc
anchor_generator_op.cu) anchor_generator_op.cu)
detection_library(target_assign_op SRCS target_assign_op.cc detection_library(target_assign_op SRCS target_assign_op.cc
......
...@@ -39,24 +39,27 @@ class DensityPriorBoxOp : public framework::OperatorWithKernel { ...@@ -39,24 +39,27 @@ class DensityPriorBoxOp : public framework::OperatorWithKernel {
auto fixed_sizes = ctx->Attrs().Get<std::vector<float>>("fixed_sizes"); auto fixed_sizes = ctx->Attrs().Get<std::vector<float>>("fixed_sizes");
auto fixed_ratios = ctx->Attrs().Get<std::vector<float>>("fixed_ratios"); auto fixed_ratios = ctx->Attrs().Get<std::vector<float>>("fixed_ratios");
auto densities = ctx->Attrs().Get<std::vector<int>>("densities"); auto densities = ctx->Attrs().Get<std::vector<int>>("densities");
bool flatten = ctx->Attrs().Get<bool>("flatten_to_2d");
PADDLE_ENFORCE_EQ(fixed_sizes.size(), densities.size(), PADDLE_ENFORCE_EQ(fixed_sizes.size(), densities.size(),
"The number of fixed_sizes and densities must be equal."); "The number of fixed_sizes and densities must be equal.");
size_t num_priors = 0; size_t num_priors = 0;
if ((fixed_sizes.size() > 0) && (densities.size() > 0)) { for (size_t i = 0; i < densities.size(); ++i) {
for (size_t i = 0; i < densities.size(); ++i) { num_priors += (fixed_ratios.size()) * (pow(densities[i], 2));
if (fixed_ratios.size() > 0) { }
num_priors += (fixed_ratios.size()) * (pow(densities[i], 2)); if (!flatten) {
} std::vector<int64_t> dim_vec(4);
} dim_vec[0] = input_dims[2];
dim_vec[1] = input_dims[3];
dim_vec[2] = num_priors;
dim_vec[3] = 4;
ctx->SetOutputDim("Boxes", framework::make_ddim(dim_vec));
ctx->SetOutputDim("Variances", framework::make_ddim(dim_vec));
} else {
int64_t dim0 = input_dims[2] * input_dims[3] * num_priors;
ctx->SetOutputDim("Boxes", {dim0, 4});
ctx->SetOutputDim("Variances", {dim0, 4});
} }
std::vector<int64_t> dim_vec(4);
dim_vec[0] = input_dims[2];
dim_vec[1] = input_dims[3];
dim_vec[2] = num_priors;
dim_vec[3] = 4;
ctx->SetOutputDim("Boxes", framework::make_ddim(dim_vec));
ctx->SetOutputDim("Variances", framework::make_ddim(dim_vec));
} }
protected: protected:
...@@ -64,7 +67,7 @@ class DensityPriorBoxOp : public framework::OperatorWithKernel { ...@@ -64,7 +67,7 @@ class DensityPriorBoxOp : public framework::OperatorWithKernel {
const framework::ExecutionContext& ctx) const override { const framework::ExecutionContext& ctx) const override {
return framework::OpKernelType( return framework::OpKernelType(
framework::ToDataType(ctx.Input<framework::Tensor>("Input")->type()), framework::ToDataType(ctx.Input<framework::Tensor>("Input")->type()),
platform::CPUPlace()); ctx.GetPlace());
} }
}; };
...@@ -101,7 +104,10 @@ class DensityPriorBoxOpMaker : public framework::OpProtoAndCheckerMaker { ...@@ -101,7 +104,10 @@ class DensityPriorBoxOpMaker : public framework::OpProtoAndCheckerMaker {
}); });
AddAttr<bool>("clip", "(bool) Whether to clip out-of-boundary boxes.") AddAttr<bool>("clip", "(bool) Whether to clip out-of-boundary boxes.")
.SetDefault(true); .SetDefault(true);
AddAttr<bool>("flatten_to_2d",
"(bool) Whether to flatten to 2D and "
"the second dim is 4.")
.SetDefault(false);
AddAttr<float>( AddAttr<float>(
"step_w", "step_w",
"Density prior boxes step across width, 0.0 for auto calculation.") "Density prior boxes step across width, 0.0 for auto calculation.")
......
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/detection/density_prior_box_op.h"
namespace paddle {
namespace operators {
template <typename T>
static __device__ inline T Clip(T in) {
return min(max(in, 0.), 1.);
}
template <typename T>
static __global__ void GenDensityPriorBox(
const int height, const int width, const int im_height, const int im_width,
const T offset, const T step_width, const T step_height,
const int num_priors, const T* ratios_shift, bool is_clip, const T var_xmin,
const T var_ymin, const T var_xmax, const T var_ymax, T* out, T* var) {
int gidx = blockIdx.x * blockDim.x + threadIdx.x;
int gidy = blockIdx.y * blockDim.y + threadIdx.y;
int step_x = blockDim.x * gridDim.x;
int step_y = blockDim.y * gridDim.y;
const T* width_ratio = ratios_shift;
const T* height_ratio = ratios_shift + num_priors;
const T* width_shift = ratios_shift + 2 * num_priors;
const T* height_shift = ratios_shift + 3 * num_priors;
for (int j = gidy; j < height; j += step_y) {
for (int i = gidx; i < width * num_priors; i += step_x) {
int h = j;
int w = i / num_priors;
int k = i % num_priors;
T center_x = (w + offset) * step_width;
T center_y = (h + offset) * step_height;
T center_x_temp = center_x + width_shift[k];
T center_y_temp = center_y + height_shift[k];
T box_width_ratio = width_ratio[k] / 2.;
T box_height_ratio = height_ratio[k] / 2.;
T xmin = max((center_x_temp - box_width_ratio) / im_width, 0.);
T ymin = max((center_y_temp - box_height_ratio) / im_height, 0.);
T xmax = min((center_x_temp + box_width_ratio) / im_width, 1.);
T ymax = min((center_y_temp + box_height_ratio) / im_height, 1.);
int out_offset = (j * width * num_priors + i) * 4;
out[out_offset] = is_clip ? Clip<T>(xmin) : xmin;
out[out_offset + 1] = is_clip ? Clip<T>(ymin) : ymin;
out[out_offset + 2] = is_clip ? Clip<T>(xmax) : xmax;
out[out_offset + 3] = is_clip ? Clip<T>(ymax) : ymax;
var[out_offset] = var_xmin;
var[out_offset + 1] = var_ymin;
var[out_offset + 2] = var_xmax;
var[out_offset + 3] = var_ymax;
}
}
}
template <typename T>
class DensityPriorBoxOpCUDAKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* input = ctx.Input<paddle::framework::Tensor>("Input");
auto* image = ctx.Input<paddle::framework::Tensor>("Image");
auto* boxes = ctx.Output<paddle::framework::Tensor>("Boxes");
auto* vars = ctx.Output<paddle::framework::Tensor>("Variances");
auto variances = ctx.Attr<std::vector<float>>("variances");
auto is_clip = ctx.Attr<bool>("clip");
auto fixed_sizes = ctx.Attr<std::vector<float>>("fixed_sizes");
auto fixed_ratios = ctx.Attr<std::vector<float>>("fixed_ratios");
auto densities = ctx.Attr<std::vector<int>>("densities");
T step_w = static_cast<T>(ctx.Attr<float>("step_w"));
T step_h = static_cast<T>(ctx.Attr<float>("step_h"));
T offset = static_cast<T>(ctx.Attr<float>("offset"));
auto img_width = image->dims()[3];
auto img_height = image->dims()[2];
auto feature_width = input->dims()[3];
auto feature_height = input->dims()[2];
T step_width, step_height;
if (step_w == 0 || step_h == 0) {
step_width = static_cast<T>(img_width) / feature_width;
step_height = static_cast<T>(img_height) / feature_height;
} else {
step_width = step_w;
step_height = step_h;
}
int num_priors = 0;
for (size_t i = 0; i < densities.size(); ++i) {
num_priors += (fixed_ratios.size()) * (pow(densities[i], 2));
}
int step_average = static_cast<int>((step_width + step_height) * 0.5);
framework::Tensor h_temp;
T* tdata = h_temp.mutable_data<T>({num_priors * 4}, platform::CPUPlace());
int idx = 0;
for (size_t s = 0; s < fixed_sizes.size(); ++s) {
auto fixed_size = fixed_sizes[s];
int density = densities[s];
for (size_t r = 0; r < fixed_ratios.size(); ++r) {
float ar = fixed_ratios[r];
int shift = step_average / density;
float box_width_ratio = fixed_size * sqrt(ar);
float box_height_ratio = fixed_size / sqrt(ar);
for (int di = 0; di < density; ++di) {
for (int dj = 0; dj < density; ++dj) {
float center_x_temp = shift / 2. + dj * shift - step_average / 2.;
float center_y_temp = shift / 2. + di * shift - step_average / 2.;
tdata[idx] = box_width_ratio;
tdata[num_priors + idx] = box_height_ratio;
tdata[2 * num_priors + idx] = center_x_temp;
tdata[3 * num_priors + idx] = center_y_temp;
idx++;
}
}
}
}
boxes->mutable_data<T>(ctx.GetPlace());
vars->mutable_data<T>(ctx.GetPlace());
framework::Tensor d_temp;
framework::TensorCopySync(h_temp, ctx.GetPlace(), &d_temp);
// At least use 32 threads, at most 512 threads.
// blockx is multiple of 32.
int blockx = std::min(((feature_width * num_priors + 31) >> 5) << 5, 512L);
int gridx = (feature_width * num_priors + blockx - 1) / blockx;
dim3 threads(blockx, 1);
dim3 grids(gridx, feature_height);
auto stream =
ctx.template device_context<platform::CUDADeviceContext>().stream();
GenDensityPriorBox<T><<<grids, threads, 0, stream>>>(
feature_height, feature_width, img_height, img_width, offset,
step_width, step_height, num_priors, d_temp.data<T>(), is_clip,
variances[0], variances[1], variances[2], variances[3],
boxes->data<T>(), vars->data<T>());
}
}; // namespace operators
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(density_prior_box,
ops::DensityPriorBoxOpCUDAKernel<float>,
ops::DensityPriorBoxOpCUDAKernel<double>);
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. /* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
licensed under the Apache License, Version 2.0 (the "License"); licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. you may not use this file except in compliance with the License.
You may obtain a copy of the License at You may obtain a copy of the License at
...@@ -52,18 +52,16 @@ class DensityPriorBoxOpKernel : public framework::OpKernel<T> { ...@@ -52,18 +52,16 @@ class DensityPriorBoxOpKernel : public framework::OpKernel<T> {
step_height = step_h; step_height = step_h;
} }
int num_priors = 0; int num_priors = 0;
if (fixed_sizes.size() > 0 && densities.size() > 0) { for (size_t i = 0; i < densities.size(); ++i) {
for (size_t i = 0; i < densities.size(); ++i) { num_priors += (fixed_ratios.size()) * (pow(densities[i], 2));
if (fixed_ratios.size() > 0) {
num_priors += (fixed_ratios.size()) * (pow(densities[i], 2));
}
}
} }
boxes->mutable_data<T>(ctx.GetPlace()); boxes->mutable_data<T>(ctx.GetPlace());
vars->mutable_data<T>(ctx.GetPlace()); vars->mutable_data<T>(ctx.GetPlace());
auto e_boxes = framework::EigenTensor<T, 4>::From(*boxes).setConstant(0.0);
auto box_dim = vars->dims();
boxes->Resize({feature_height, feature_width, num_priors, 4});
auto e_boxes = framework::EigenTensor<T, 4>::From(*boxes).setConstant(0.0);
int step_average = static_cast<int>((step_width + step_height) * 0.5); int step_average = static_cast<int>((step_width + step_height) * 0.5);
for (int h = 0; h < feature_height; ++h) { for (int h = 0; h < feature_height; ++h) {
...@@ -76,36 +74,34 @@ class DensityPriorBoxOpKernel : public framework::OpKernel<T> { ...@@ -76,36 +74,34 @@ class DensityPriorBoxOpKernel : public framework::OpKernel<T> {
auto fixed_size = fixed_sizes[s]; auto fixed_size = fixed_sizes[s];
int density = densities[s]; int density = densities[s];
// Generate density prior boxes with fixed ratios. // Generate density prior boxes with fixed ratios.
if (fixed_ratios.size() > 0) { for (size_t r = 0; r < fixed_ratios.size(); ++r) {
for (size_t r = 0; r < fixed_ratios.size(); ++r) { float ar = fixed_ratios[r];
float ar = fixed_ratios[r]; int shift = step_average / density;
int shift = step_average / density; float box_width_ratio = fixed_size * sqrt(ar);
float box_width_ratio = fixed_size * sqrt(ar); float box_height_ratio = fixed_size / sqrt(ar);
float box_height_ratio = fixed_size / sqrt(ar); for (int di = 0; di < density; ++di) {
for (int di = 0; di < density; ++di) { for (int dj = 0; dj < density; ++dj) {
for (int dj = 0; dj < density; ++dj) { float center_x_temp =
float center_x_temp = center_x - step_average / 2. + shift / 2. + dj * shift;
center_x - step_average / 2. + shift / 2. + dj * shift; float center_y_temp =
float center_y_temp = center_y - step_average / 2. + shift / 2. + di * shift;
center_y - step_average / 2. + shift / 2. + di * shift; e_boxes(h, w, idx, 0) =
e_boxes(h, w, idx, 0) = (center_x_temp - box_width_ratio / 2.) / img_width >= 0
(center_x_temp - box_width_ratio / 2.) / img_width >= 0 ? (center_x_temp - box_width_ratio / 2.) / img_width
? (center_x_temp - box_width_ratio / 2.) / img_width : 0;
: 0; e_boxes(h, w, idx, 1) =
e_boxes(h, w, idx, 1) = (center_y_temp - box_height_ratio / 2.) / img_height >= 0
(center_y_temp - box_height_ratio / 2.) / img_height >= 0 ? (center_y_temp - box_height_ratio / 2.) / img_height
? (center_y_temp - box_height_ratio / 2.) / img_height : 0;
: 0; e_boxes(h, w, idx, 2) =
e_boxes(h, w, idx, 2) = (center_x_temp + box_width_ratio / 2.) / img_width <= 1
(center_x_temp + box_width_ratio / 2.) / img_width <= 1 ? (center_x_temp + box_width_ratio / 2.) / img_width
? (center_x_temp + box_width_ratio / 2.) / img_width : 1;
: 1; e_boxes(h, w, idx, 3) =
e_boxes(h, w, idx, 3) = (center_y_temp + box_height_ratio / 2.) / img_height <= 1
(center_y_temp + box_height_ratio / 2.) / img_height <= 1 ? (center_y_temp + box_height_ratio / 2.) / img_height
? (center_y_temp + box_height_ratio / 2.) / img_height : 1;
: 1; idx++;
idx++;
}
} }
} }
} }
...@@ -139,6 +135,7 @@ class DensityPriorBoxOpKernel : public framework::OpKernel<T> { ...@@ -139,6 +135,7 @@ class DensityPriorBoxOpKernel : public framework::OpKernel<T> {
e_vars = var_et.broadcast(Eigen::DSizes<int, 2>(box_num, 1)); e_vars = var_et.broadcast(Eigen::DSizes<int, 2>(box_num, 1));
vars->Resize(var_dim); vars->Resize(var_dim);
boxes->Resize(box_dim);
} }
}; // namespace operators }; // namespace operators
......
...@@ -16,6 +16,9 @@ ...@@ -16,6 +16,9 @@
#include "paddle/fluid/operators/math/math_function.h" #include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/platform/dynload/cublas.h" #include "paddle/fluid/platform/dynload/cublas.h"
#include "paddle/fluid/platform/gpu_info.h"
DECLARE_bool(enable_cublas_tensor_op_math);
namespace paddle { namespace paddle {
namespace operators { namespace operators {
...@@ -42,11 +45,44 @@ struct CUBlas<float> { ...@@ -42,11 +45,44 @@ struct CUBlas<float> {
} }
template <typename... ARGS> template <typename... ARGS>
static void GEMM_BATCH(ARGS... args) { static void GEMM_STRIDED_BATCH(ARGS... args) {
#if CUDA_VERSION >= 8000 #if CUDA_VERSION >= 8000
PADDLE_ENFORCE(platform::dynload::cublasSgemmStridedBatched(args...)); PADDLE_ENFORCE(platform::dynload::cublasSgemmStridedBatched(args...));
#else #else
PADDLE_THROW("SgemmStridedBatched is not supported on cuda <= 7.5"); PADDLE_THROW("SgemmStridedBatched is not supported on cuda <= 7.5");
#endif
}
// NOTES: GEMM_EX can use Tensor Core to accelerate matrix multiply.
// https://docs.nvidia.com/cuda/cublas/index.html#cublassetmathmode
template <typename... ARGS>
static void GEMM_EX(platform::CUDADeviceContext *dev_ctx,
cublasOperation_t transa, cublasOperation_t transb, int m,
int n, int k, const float *alpha, const void *A,
cudaDataType_t Atype, int lda, const void *B,
cudaDataType_t Btype, int ldb, const float *beta, void *C,
cudaDataType_t Ctype, int ldc) {
// Because the gcc 4.8 doesn't expand template parameter pack that
// appears in a lambda-expression, I can not use template parameter pack
// here.
auto cublas_call = [&]() {
#if CUDA_VERSION >= 8000
VLOG(5) << "use_tensor_op_math: "
<< (platform::TensorCoreAvailable() ? "True" : "False");
PADDLE_ENFORCE(platform::dynload::cublasSgemmEx(
dev_ctx->cublas_handle(), transa, transb, m, n, k, alpha, A, Atype,
lda, B, Btype, ldb, beta, C, Ctype, ldc));
#else
PADDLE_THROW("cublasSgemmEx is supported on cuda >= 8.0");
#endif
};
#if CUDA_VERSION >= 9000
// NOTES: To use Tensor Core, we should change the cublas config,
// but the cublas may be hold by multi-thread.
dev_ctx->CublasCall(cublas_call, CUBLAS_TENSOR_OP_MATH);
#else
cublas_call();
#endif #endif
} }
}; };
...@@ -69,13 +105,18 @@ struct CUBlas<double> { ...@@ -69,13 +105,18 @@ struct CUBlas<double> {
} }
template <typename... ARGS> template <typename... ARGS>
static void GEMM_BATCH(ARGS... args) { static void GEMM_STRIDED_BATCH(ARGS... args) {
#if CUDA_VERSION >= 8000 #if CUDA_VERSION >= 8000
PADDLE_ENFORCE(platform::dynload::cublasDgemmStridedBatched(args...)); PADDLE_ENFORCE(platform::dynload::cublasDgemmStridedBatched(args...));
#else #else
PADDLE_THROW("DgemmStridedBatched is not supported on cuda <= 7.5"); PADDLE_THROW("DgemmStridedBatched is not supported on cuda <= 7.5");
#endif #endif
} }
template <typename... ARGS>
static void GEMM_EX(ARGS... args) {
PADDLE_THROW("Currently there are not cublasDgemmEx.");
}
}; };
template <> template <>
...@@ -96,14 +137,16 @@ struct CUBlas<platform::float16> { ...@@ -96,14 +137,16 @@ struct CUBlas<platform::float16> {
reinterpret_cast<__half *>(C), ldc)); reinterpret_cast<__half *>(C), ldc));
} }
static void GEMM_BATCH(cublasHandle_t handle, cublasOperation_t transa, static void GEMM_STRIDED_BATCH(cublasHandle_t handle,
cublasOperation_t transb, int m, int n, int k, cublasOperation_t transa,
const float16 *alpha, const float16 *A, int lda, cublasOperation_t transb, int m, int n, int k,
long long int strideA, const float16 *B, // NOLINT const float16 *alpha, const float16 *A,
int ldb, long long int strideB, // NOLINT int lda, long long int strideA, // NOLINT
const float16 *beta, float16 *C, int ldc, const float16 *B, // NOLINT
long long int strideC, // NOLINT int ldb, long long int strideB, // NOLINT
int batchCount) { const float16 *beta, float16 *C, int ldc,
long long int strideC, // NOLINT
int batchCount) {
#if CUDA_VERSION >= 8000 #if CUDA_VERSION >= 8000
PADDLE_ENFORCE(platform::dynload::cublasHgemmStridedBatched( PADDLE_ENFORCE(platform::dynload::cublasHgemmStridedBatched(
handle, transa, transb, m, n, k, handle, transa, transb, m, n, k,
...@@ -114,6 +157,45 @@ struct CUBlas<platform::float16> { ...@@ -114,6 +157,45 @@ struct CUBlas<platform::float16> {
ldc, strideC, batchCount)); ldc, strideC, batchCount));
#else #else
PADDLE_THROW("HgemmStridedBatched is not supported on cuda <= 7.5"); PADDLE_THROW("HgemmStridedBatched is not supported on cuda <= 7.5");
#endif
}
// NOTES: GEMM_EX can use Tensor Core to accelerate matrix multiply.
// https://docs.nvidia.com/cuda/cublas/index.html#cublassetmathmode
template <typename... ARGS>
static void GEMM_EX(platform::CUDADeviceContext *dev_ctx,
cublasOperation_t transa, cublasOperation_t transb, int m,
int n, int k, const void *alpha, const void *A,
cudaDataType_t Atype, int lda, const void *B,
cudaDataType_t Btype, int ldb, const void *beta, void *C,
cudaDataType_t Ctype, int ldc,
cudaDataType_t computeType) {
auto cublas_call = [&]() {
#if CUDA_VERSION >= 8000
cublasGemmAlgo_t algo = CUBLAS_GEMM_DFALT;
#if CUDA_VERSION >= 9000
bool use_tensor_op_math = platform::TensorCoreAvailable();
if (use_tensor_op_math) {
algo = CUBLAS_GEMM_DFALT_TENSOR_OP;
}
VLOG(5) << "use_tensor_op_math: "
<< (use_tensor_op_math ? "True" : "False");
#endif // CUDA_VERSION >= 9000
PADDLE_ENFORCE(platform::dynload::cublasGemmEx(
dev_ctx->cublas_handle(), transa, transb, m, n, k, alpha, A, Atype,
lda, B, Btype, ldb, beta, C, Ctype, ldc, computeType, algo));
#else
PADDLE_THROW("cublasGemmEx is supported on cuda >= 8.0");
#endif
};
#if CUDA_VERSION >= 9000
// NOTES: To use Tensor Core, we should change the cublas config,
// but the cublas may be hold by multi-thread.
dev_ctx->CublasCall(cublas_call, CUBLAS_TENSOR_OP_MATH);
#else
cublas_call();
#endif #endif
} }
}; };
...@@ -133,8 +215,21 @@ void Blas<platform::CUDADeviceContext>::GEMM(CBLAS_TRANSPOSE transA, ...@@ -133,8 +215,21 @@ void Blas<platform::CUDADeviceContext>::GEMM(CBLAS_TRANSPOSE transA,
cublasOperation_t cuTransB = cublasOperation_t cuTransB =
(transB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T; (transB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
CUBlas<T>::GEMM(context_.cublas_handle(), cuTransB, cuTransA, N, M, K, &alpha, #if CUDA_VERSION >= 8000
B, ldb, A, lda, &beta, C, N); if (FLAGS_enable_cublas_tensor_op_math && std::is_same<T, float>::value) {
auto &cuda_ctx = const_cast<platform::CUDADeviceContext &>(context_);
CUBlas<T>::GEMM_EX(&cuda_ctx, cuTransB, cuTransA, N, M, K, &alpha, B,
CUDA_R_32F, ldb, A, CUDA_R_32F, lda, &beta, C,
CUDA_R_32F, N);
} else {
#endif // CUDA_VERSION >= 8000
CUBlas<T>::GEMM(context_.cublas_handle(), cuTransB, cuTransA, N, M, K,
&alpha, B, ldb, A, lda, &beta, C, N);
#if CUDA_VERSION >= 8000
}
#endif // CUDA_VERSION >= 8000
} }
template <> template <>
...@@ -157,30 +252,18 @@ inline void Blas<platform::CUDADeviceContext>::GEMM( ...@@ -157,30 +252,18 @@ inline void Blas<platform::CUDADeviceContext>::GEMM(
PADDLE_ENFORCE_GE(context_.GetComputeCapability(), 53, PADDLE_ENFORCE_GE(context_.GetComputeCapability(), 53,
"cublas fp16 gemm requires GPU compute capability >= 53"); "cublas fp16 gemm requires GPU compute capability >= 53");
#if CUDA_VERSION >= 8000
float h_alpha = static_cast<float>(alpha); float h_alpha = static_cast<float>(alpha);
float h_beta = static_cast<float>(beta); float h_beta = static_cast<float>(beta);
cublasGemmAlgo_t algo = CUBLAS_GEMM_DFALT; #if CUDA_VERSION >= 8000
#if CUDA_VERSION >= 9000
if (context_.GetComputeCapability() >= 70) {
PADDLE_ENFORCE(platform::dynload::cublasSetMathMode(
context_.cublas_handle(), CUBLAS_TENSOR_OP_MATH));
algo = CUBLAS_GEMM_DFALT_TENSOR_OP;
} else {
PADDLE_ENFORCE(platform::dynload::cublasSetMathMode(
context_.cublas_handle(), CUBLAS_DEFAULT_MATH));
}
#endif // CUDA_VERSION >= 9000
// cublasHgemm does true FP16 computation which is slow for non-Volta // cublasHgemm does true FP16 computation which is slow for non-Volta
// GPUs. So use cublasGemmEx instead which does pesudo FP16 computation: // GPUs. So use cublasGemmEx instead which does pesudo FP16 computation:
// input/output in fp16, computation in fp32, which can also be accelerated // input/output in fp16, computation in fp32, which can also be accelerated
// using tensor cores in volta GPUs. // using tensor cores in volta GPUs.
PADDLE_ENFORCE(platform::dynload::cublasGemmEx( auto &cuda_ctx = const_cast<platform::CUDADeviceContext &>(context_);
context_.cublas_handle(), cuTransB, cuTransA, N, M, K, &h_alpha, B, CUBlas<platform::float16>::GEMM_EX(
CUDA_R_16F, ldb, A, CUDA_R_16F, lda, &h_beta, C, CUDA_R_16F, N, &cuda_ctx, cuTransB, cuTransA, N, M, K, &h_alpha, B, CUDA_R_16F, ldb, A,
CUDA_R_32F, algo)); CUDA_R_16F, lda, &h_beta, C, CUDA_R_16F, N, CUDA_R_32F);
#else #else
// CUDA 7.5 does not support cublasGemmEx, hence we fall back to use hgemm // CUDA 7.5 does not support cublasGemmEx, hence we fall back to use hgemm
CUBlas<platform::float16>::GEMM(context_.cublas_handle(), cuTransB, cuTransA, CUBlas<platform::float16>::GEMM(context_.cublas_handle(), cuTransB, cuTransA,
...@@ -199,8 +282,38 @@ void Blas<platform::CUDADeviceContext>::GEMM(bool transA, bool transB, int M, ...@@ -199,8 +282,38 @@ void Blas<platform::CUDADeviceContext>::GEMM(bool transA, bool transB, int M,
// the cblas convention. // the cblas convention.
cublasOperation_t cuTransA = transA ? CUBLAS_OP_T : CUBLAS_OP_N; cublasOperation_t cuTransA = transA ? CUBLAS_OP_T : CUBLAS_OP_N;
cublasOperation_t cuTransB = transB ? CUBLAS_OP_T : CUBLAS_OP_N; cublasOperation_t cuTransB = transB ? CUBLAS_OP_T : CUBLAS_OP_N;
CUBlas<T>::GEMM(context_.cublas_handle(), cuTransB, cuTransA, N, M, K, &alpha,
B, ldb, A, lda, &beta, C, ldc); #if CUDA_VERSION >= 8000
if (FLAGS_enable_cublas_tensor_op_math && std::is_same<T, float>::value) {
auto &cuda_ctx = const_cast<platform::CUDADeviceContext &>(context_);
CUBlas<T>::GEMM_EX(&cuda_ctx, cuTransB, cuTransA, N, M, K, &alpha, B,
CUDA_R_32F, ldb, A, CUDA_R_32F, lda, &beta, C,
CUDA_R_32F, ldc);
} else {
#endif // CUDA_VERSION >= 8000
CUBlas<T>::GEMM(context_.cublas_handle(), cuTransB, cuTransA, N, M, K,
&alpha, B, ldb, A, lda, &beta, C, ldc);
#if CUDA_VERSION >= 8000
}
#endif // CUDA_VERSION >= 8000
}
template <>
template <>
inline void Blas<platform::CUDADeviceContext>::GEMM(
bool transA, bool transB, int M, int N, int K, platform::float16 alpha,
const platform::float16 *A, int lda, const platform::float16 *B, int ldb,
platform::float16 beta, platform::float16 *C, int ldc) const {
// Note that cublas follows fortran order, so the order is different from
// the cblas convention.
cublasOperation_t cuTransA = transA ? CUBLAS_OP_T : CUBLAS_OP_N;
cublasOperation_t cuTransB = transB ? CUBLAS_OP_T : CUBLAS_OP_N;
CUBlas<platform::float16>::GEMM(context_.cublas_handle(), cuTransB, cuTransA,
N, M, K, &alpha, B, ldb, A, lda, &beta, C,
ldc);
} }
template <> template <>
...@@ -238,9 +351,34 @@ void Blas<platform::CUDADeviceContext>::BatchedGEMM( ...@@ -238,9 +351,34 @@ void Blas<platform::CUDADeviceContext>::BatchedGEMM(
(transB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T; (transB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
const int64_t strideC = M * N; const int64_t strideC = M * N;
CUBlas<T>::GEMM_BATCH(context_.cublas_handle(), cuTransB, cuTransA, N, M, K, #if CUDA_VERSION >= 9010
&alpha, B, ldb, strideB, A, lda, strideA, &beta, C, ldc, if (FLAGS_enable_cublas_tensor_op_math && std::is_same<T, float>::value) {
strideC, batchCount); auto cublas_call = [&]() {
cublasGemmAlgo_t algo = CUBLAS_GEMM_DFALT;
bool use_tensor_op_math = platform::TensorCoreAvailable();
if (use_tensor_op_math) {
algo = CUBLAS_GEMM_DFALT_TENSOR_OP;
}
VLOG(5) << "use_tensor_op_math: "
<< (use_tensor_op_math ? "True" : "False");
PADDLE_ENFORCE(platform::dynload::cublasGemmStridedBatchedEx(
context_.cublas_handle(), cuTransB, cuTransA, N, M, K, &alpha, B,
CUDA_R_32F, ldb, strideB, A, CUDA_R_32F, lda, strideA, &beta, C,
CUDA_R_32F, ldc, strideC, batchCount, CUDA_R_32F, algo));
};
auto &dev_ctx = const_cast<platform::CUDADeviceContext &>(context_);
dev_ctx.CublasCall(cublas_call, CUBLAS_TENSOR_OP_MATH);
} else {
#endif // CUDA_VERSION >= 9010
CUBlas<T>::GEMM_STRIDED_BATCH(context_.cublas_handle(), cuTransB, cuTransA,
N, M, K, &alpha, B, ldb, strideB, A, lda,
strideA, &beta, C, ldc, strideC, batchCount);
#if CUDA_VERSION >= 9010
}
#endif // CUDA_VERSION >= 9010
} }
} // namespace math } // namespace math
......
...@@ -143,6 +143,39 @@ class CudnnWorkspaceHandle { ...@@ -143,6 +143,39 @@ class CudnnWorkspaceHandle {
std::unique_ptr<std::lock_guard<std::mutex>> guard_; std::unique_ptr<std::lock_guard<std::mutex>> guard_;
}; };
#if CUDA_VERSION >= 9000
class ScopedCublasMathMode {
public:
ScopedCublasMathMode(cublasHandle_t handle, cublasMath_t new_math_mode)
: handle_(handle) {
need_reset = false;
PADDLE_ENFORCE(
platform::dynload::cublasGetMathMode(handle_, &old_math_mode_),
"Failed to get old cublas math mode");
if (old_math_mode_ != new_math_mode) {
PADDLE_ENFORCE(
platform::dynload::cublasSetMathMode(handle_, new_math_mode),
"Failed to set old cublas math mode");
need_reset = true;
}
}
~ScopedCublasMathMode() {
if (need_reset) {
PADDLE_ENFORCE(
platform::dynload::cublasSetMathMode(handle_, old_math_mode_),
"Failed to set old cublas math mode");
}
}
private:
cublasHandle_t handle_;
cublasMath_t old_math_mode_;
bool need_reset;
};
#endif
class CUDADeviceContext : public DeviceContext { class CUDADeviceContext : public DeviceContext {
public: public:
explicit CUDADeviceContext(CUDAPlace place); explicit CUDADeviceContext(CUDAPlace place);
...@@ -199,6 +232,18 @@ class CUDADeviceContext : public DeviceContext { ...@@ -199,6 +232,18 @@ class CUDADeviceContext : public DeviceContext {
callback_manager_->Wait(); callback_manager_->Wait();
} }
#if CUDA_VERSION >= 9000
/*! \brief CublasCall may need to change cublas's config,
* but the cublas may be hold by multi-thread, so we should
* add lock here. */
template <typename Callback>
void CublasCall(Callback callback, cublasMath_t new_math) {
std::lock_guard<std::mutex> guard(cublas_mtx_);
ScopedCublasMathMode scoped_cublas_math(cublas_handle_, new_math);
callback();
}
#endif
private: private:
CUDAPlace place_; CUDAPlace place_;
...@@ -220,6 +265,8 @@ class CUDADeviceContext : public DeviceContext { ...@@ -220,6 +265,8 @@ class CUDADeviceContext : public DeviceContext {
// If we use mtx_ for StreamCallbackManager, deadlock may occur sometimes // If we use mtx_ for StreamCallbackManager, deadlock may occur sometimes
mutable std::mutex callback_mtx_; mutable std::mutex callback_mtx_;
std::unique_ptr<StreamCallbackManager> callback_manager_; std::unique_ptr<StreamCallbackManager> callback_manager_;
mutable std::mutex cublas_mtx_;
}; };
template <> template <>
......
...@@ -61,9 +61,6 @@ extern void *cublas_dso_handle; ...@@ -61,9 +61,6 @@ extern void *cublas_dso_handle;
extern DynLoad__##__name __name extern DynLoad__##__name __name
#endif #endif
#define DECLARE_DYNAMIC_LOAD_CUBLAS_V2_WRAP(__name) \
DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(__name)
#define CUBLAS_BLAS_ROUTINE_EACH(__macro) \ #define CUBLAS_BLAS_ROUTINE_EACH(__macro) \
__macro(cublasSaxpy_v2); \ __macro(cublasSaxpy_v2); \
__macro(cublasDaxpy_v2); \ __macro(cublasDaxpy_v2); \
...@@ -93,22 +90,23 @@ CUBLAS_BLAS_ROUTINE_EACH(DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP) ...@@ -93,22 +90,23 @@ CUBLAS_BLAS_ROUTINE_EACH(DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP)
// APIs available after CUDA 8.0 // APIs available after CUDA 8.0
#if CUDA_VERSION >= 8000 #if CUDA_VERSION >= 8000
#define CUBLAS_BLAS_ROUTINE_EACH_R2(__macro) \ DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasGemmEx);
__macro(cublasGemmEx); \ DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasSgemmStridedBatched);
__macro(cublasSgemmStridedBatched); \ DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasDgemmStridedBatched);
__macro(cublasDgemmStridedBatched); \ DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasCgemmStridedBatched);
__macro(cublasCgemmStridedBatched); \ DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasZgemmStridedBatched);
__macro(cublasZgemmStridedBatched); \ DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasHgemmStridedBatched);
__macro(cublasHgemmStridedBatched);
CUBLAS_BLAS_ROUTINE_EACH_R2(DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP)
#endif #endif
// APIs available after CUDA 9.0 // APIs available after CUDA 9.0
#if CUDA_VERSION >= 9000 #if CUDA_VERSION >= 9000
#define CUBLAS_BLAS_ROUTINE_EACH_R3(__macro) __macro(cublasSetMathMode); DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasSetMathMode);
DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasGetMathMode);
#endif
CUBLAS_BLAS_ROUTINE_EACH_R3(DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP) #if CUDA_VERSION >= 9010
DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasGemmBatchedEx);
DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasGemmStridedBatchedEx);
#endif #endif
#undef DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP #undef DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP
......
...@@ -26,6 +26,16 @@ DEFINE_double(fraction_of_gpu_memory_to_use, 0.92, ...@@ -26,6 +26,16 @@ DEFINE_double(fraction_of_gpu_memory_to_use, 0.92,
"additional trunks of the same size will be requested from gpu " "additional trunks of the same size will be requested from gpu "
"until the gpu has no memory left for another trunk."); "until the gpu has no memory left for another trunk.");
DEFINE_bool(
enable_cublas_tensor_op_math, false,
"The enable_cublas_tensor_op_math indicate whether to use Tensor Core, "
"but it may loss precision. Currently, There are two CUDA libraries that"
" use Tensor Cores, cuBLAS and cuDNN. cuBLAS uses Tensor Cores to speed up"
" GEMM computations(the matrices must be either half precision or single "
"precision); cuDNN uses Tensor Cores to speed up both convolutions(the "
"input and output must be half precision) and recurrent neural networks "
"(RNNs).");
namespace paddle { namespace paddle {
namespace platform { namespace platform {
...@@ -64,6 +74,16 @@ int GetCUDADriverVersion(int id) { ...@@ -64,6 +74,16 @@ int GetCUDADriverVersion(int id) {
return driver_version; return driver_version;
} }
bool TensorCoreAvailable() {
#if CUDA_VERSION >= 9000
int device = GetCurrentDeviceId();
int driver_version = GetCUDAComputeCapability(device);
return driver_version >= 70;
#else
return false;
#endif
}
int GetCUDAMultiProcessors(int id) { int GetCUDAMultiProcessors(int id) {
PADDLE_ENFORCE_LT(id, GetCUDADeviceCount(), "id must less than GPU count"); PADDLE_ENFORCE_LT(id, GetCUDADeviceCount(), "id must less than GPU count");
int count; int count;
......
...@@ -35,6 +35,9 @@ int GetCUDARuntimeVersion(int id); ...@@ -35,6 +35,9 @@ int GetCUDARuntimeVersion(int id);
//! Get the driver version of the ith GPU //! Get the driver version of the ith GPU
int GetCUDADriverVersion(int id); int GetCUDADriverVersion(int id);
//! Wheter the current device support TensorCore
bool TensorCoreAvailable();
//! Get the MultiProcessors of the ith GPU. //! Get the MultiProcessors of the ith GPU.
int GetCUDAMultiProcessors(int i); int GetCUDAMultiProcessors(int i);
......
...@@ -5,8 +5,8 @@ if(WITH_PYTHON) ...@@ -5,8 +5,8 @@ if(WITH_PYTHON)
if(WITH_AMD_GPU) if(WITH_AMD_GPU)
hip_library(paddle_pybind SHARED hip_library(paddle_pybind SHARED
SRCS ${PYBIND_SRCS} SRCS ${PYBIND_SRCS}
DEPS ${PYBIND_DEPS} DEPS ARCHIVE_START ${PYBIND_DEPS}
${GLOB_OP_LIB} ${GLOB_OPERATOR_DEPS}) ${GLOB_OP_LIB} ${GLOB_OPERATOR_DEPS} ARCHIVE_END)
else() else()
cc_library(paddle_pybind SHARED cc_library(paddle_pybind SHARED
SRCS ${PYBIND_SRCS} SRCS ${PYBIND_SRCS}
......
...@@ -94,6 +94,30 @@ function cmake_gen() { ...@@ -94,6 +94,30 @@ function cmake_gen() {
else else
exit 1 exit 1
fi fi
elif [ "$1" == "cp36-cp36m" ]; then
if [ -d "/Library/Frameworks/Python.framework/Versions/3.6" ]; then
export LD_LIBRARY_PATH=/Library/Frameworks/Python.framework/Versions/3.6/lib/
export DYLD_LIBRARY_PATH=/Library/Frameworks/Python.framework/Versions/3.6/lib/
export PATH=/Library/Frameworks/Python.framework/Versions/3.6/bin/:${PATH}
PYTHON_FLAGS="-DPYTHON_EXECUTABLE:FILEPATH=/Library/Frameworks/Python.framework/Versions/3.6/bin/python3
-DPYTHON_INCLUDE_DIR:PATH=/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m/
-DPYTHON_LIBRARY:FILEPATH=/Library/Frameworks/Python.framework/Versions/3.6/lib/libpython3.6m.dylib"
WITH_FLUID_ONLY=${WITH_FLUID_ONLY:-ON}
else
exit 1
fi
elif [ "$1" == "cp37-cp37m" ]; then
if [ -d "/Library/Frameworks/Python.framework/Versions/3.7" ]; then
export LD_LIBRARY_PATH=/Library/Frameworks/Python.framework/Versions/3.7/lib/
export DYLD_LIBRARY_PATH=/Library/Frameworks/Python.framework/Versions/3.7/lib/
export PATH=/Library/Frameworks/Python.framework/Versions/3.7/bin/:${PATH}
PYTHON_FLAGS="-DPYTHON_EXECUTABLE:FILEPATH=/Library/Frameworks/Python.framework/Versions/3.7/bin/python3
-DPYTHON_INCLUDE_DIR:PATH=/Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/
-DPYTHON_LIBRARY:FILEPATH=/Library/Frameworks/Python.framework/Versions/3.7/lib/libpython3.7m.dylib"
WITH_FLUID_ONLY=${WITH_FLUID_ONLY:-ON}
else
exit 1
fi
fi fi
else else
if [ "$1" != "" ]; then if [ "$1" != "" ]; then
...@@ -116,6 +140,18 @@ function cmake_gen() { ...@@ -116,6 +140,18 @@ function cmake_gen() {
export PYTHON_FLAGS="-DPYTHON_EXECUTABLE:FILEPATH=/opt/_internal/cpython-3.5.1/bin/python3 export PYTHON_FLAGS="-DPYTHON_EXECUTABLE:FILEPATH=/opt/_internal/cpython-3.5.1/bin/python3
-DPYTHON_INCLUDE_DIR:PATH=/opt/_internal/cpython-3.5.1/include/python3.5m -DPYTHON_INCLUDE_DIR:PATH=/opt/_internal/cpython-3.5.1/include/python3.5m
-DPYTHON_LIBRARIES:FILEPATH=/opt/_internal/cpython-3.5.1/lib/libpython3.so" -DPYTHON_LIBRARIES:FILEPATH=/opt/_internal/cpython-3.5.1/lib/libpython3.so"
elif [ "$1" == "cp36-cp36m" ]; then
export LD_LIBRARY_PATH=/opt/_internal/cpython-3.6.0/lib/:${LD_LIBRARY_PATH}
export PATH=/opt/_internal/cpython-3.6.0/bin/:${PATH}
export PYTHON_FLAGS="-DPYTHON_EXECUTABLE:FILEPATH=/opt/_internal/cpython-3.6.0/bin/python3
-DPYTHON_INCLUDE_DIR:PATH=/opt/_internal/cpython-3.6.0/include/python3.6m
-DPYTHON_LIBRARIES:FILEPATH=/opt/_internal/cpython-3.6.0/lib/libpython3.so"
elif [ "$1" == "cp37-cp37m" ]; then
export LD_LIBRARY_PATH=/opt/_internal/cpython-3.7.0/lib/:${LD_LIBRARY_PATH}
export PATH=/opt/_internal/cpython-3.7.0/bin/:${PATH}
export PYTHON_FLAGS="-DPYTHON_EXECUTABLE:FILEPATH=/opt/_internal/cpython-3.7.0/bin/python3
-DPYTHON_INCLUDE_DIR:PATH=/opt/_internal/cpython-3.7.0/include/python3.7m
-DPYTHON_LIBRARIES:FILEPATH=/opt/_internal/cpython-3.7.0/lib/libpython3.so"
fi fi
fi fi
fi fi
...@@ -419,7 +455,7 @@ function assert_api_not_changed() { ...@@ -419,7 +455,7 @@ function assert_api_not_changed() {
source .env/bin/activate source .env/bin/activate
pip install ${PADDLE_ROOT}/build/python/dist/*whl pip install ${PADDLE_ROOT}/build/python/dist/*whl
python ${PADDLE_ROOT}/tools/print_signatures.py paddle.fluid > new.spec python ${PADDLE_ROOT}/tools/print_signatures.py paddle.fluid > new.spec
if [ "$1" == "cp35-cp35m" ]; then if [ "$1" == "cp35-cp35m" ] || [ "$1" == "cp36-cp36m" ] || [ "$1" == "cp37-cp37m" ]; then
# Use sed to make python2 and python3 sepc keeps the same # Use sed to make python2 and python3 sepc keeps the same
sed -i 's/arg0: str/arg0: unicode/g' new.spec sed -i 's/arg0: str/arg0: unicode/g' new.spec
sed -i "s/\(.*Transpiler.*\).__init__ ArgSpec(args=\['self'].*/\1.__init__ /g" new.spec sed -i "s/\(.*Transpiler.*\).__init__ ArgSpec(args=\['self'].*/\1.__init__ /g" new.spec
......
...@@ -28,7 +28,7 @@ int main(int argc, char** argv) { ...@@ -28,7 +28,7 @@ int main(int argc, char** argv) {
for (int i = 0; i < argc; ++i) { for (int i = 0; i < argc; ++i) {
new_argv.push_back(argv[i]); new_argv.push_back(argv[i]);
} }
#ifdef PADDLE_WITH_CUDA #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
new_argv.push_back( new_argv.push_back(
strdup("--tryfromenv=fraction_of_gpu_memory_to_use,allocator_strategy")); strdup("--tryfromenv=fraction_of_gpu_memory_to_use,allocator_strategy"));
#else #else
......
...@@ -133,7 +133,8 @@ def __bootstrap__(): ...@@ -133,7 +133,8 @@ def __bootstrap__():
if core.is_compiled_with_cuda(): if core.is_compiled_with_cuda():
read_env_flags += [ read_env_flags += [
'fraction_of_gpu_memory_to_use', 'cudnn_deterministic', 'fraction_of_gpu_memory_to_use', 'cudnn_deterministic',
'conv_workspace_size_limit', 'cudnn_exhaustive_search' 'enable_cublas_tensor_op_math', 'conv_workspace_size_limit',
'cudnn_exhaustive_search'
] ]
core.init_gflags([sys.argv[0]] + core.init_gflags([sys.argv[0]] +
["--tryfromenv=" + ",".join(read_env_flags)]) ["--tryfromenv=" + ",".join(read_env_flags)])
......
...@@ -1029,6 +1029,7 @@ def density_prior_box(input, ...@@ -1029,6 +1029,7 @@ def density_prior_box(input,
clip=False, clip=False,
steps=[0.0, 0.0], steps=[0.0, 0.0],
offset=0.5, offset=0.5,
flatten_to_2d=False,
name=None): name=None):
""" """
**Density Prior Box Operator** **Density Prior Box Operator**
...@@ -1065,22 +1066,24 @@ def density_prior_box(input, ...@@ -1065,22 +1066,24 @@ def density_prior_box(input,
height/weight of the input will be automatically calculated. height/weight of the input will be automatically calculated.
Default: [0., 0.] Default: [0., 0.]
offset(float): Prior boxes center offset. Default: 0.5 offset(float): Prior boxes center offset. Default: 0.5
flatten_to_2d(bool): Whether to flatten output prior boxes and variance
to 2D shape, the second dim is 4. Default: False.
name(str): Name of the density prior box op. Default: None. name(str): Name of the density prior box op. Default: None.
Returns: Returns:
tuple: A tuple with two Variable (boxes, variances) tuple: A tuple with two Variable (boxes, variances)
boxes: the output density prior boxes of PriorBox. boxes: the output density prior boxes of PriorBox.
The layout is [H, W, num_priors, 4]. The layout is [H, W, num_priors, 4] when flatten_to_2d is False.
H is the height of input, W is the width of input, The layout is [H * W * num_priors, 4] when flatten_to_2d is True.
num_priors is the total H is the height of input, W is the width of input,
box count of each position of input. num_priors is the total box count of each position of input.
variances: the expanded variances of PriorBox. variances: the expanded variances of PriorBox.
The layout is [H, W, num_priors, 4]. The layout is [H, W, num_priors, 4] when flatten_to_2d is False.
H is the height of input, W is the width of input The layout is [H * W * num_priors, 4] when flatten_to_2d is True.
num_priors is the total H is the height of input, W is the width of input
box count of each position of input num_priors is the total box count of each position of input.
Examples: Examples:
...@@ -1089,14 +1092,11 @@ def density_prior_box(input, ...@@ -1089,14 +1092,11 @@ def density_prior_box(input,
box, var = fluid.layers.density_prior_box( box, var = fluid.layers.density_prior_box(
input=conv1, input=conv1,
image=images, image=images,
min_sizes=[100.], densities=[4, 2, 1],
max_sizes=[200.], fixed_sizes=[32.0, 64.0, 128.0],
aspect_ratios=[1.0, 1.0 / 2.0, 2.0], fixed_ratios=[1.],
densities=[3, 4], clip=True,
fixed_sizes=[50., 60.], flatten_to_2d=True)
fixed_ratios=[1.0, 3.0, 1.0 / 3.0],
flip=True,
clip=True)
""" """
helper = LayerHelper("density_prior_box", **locals()) helper = LayerHelper("density_prior_box", **locals())
dtype = helper.input_dtype() dtype = helper.input_dtype()
...@@ -1127,14 +1127,11 @@ def density_prior_box(input, ...@@ -1127,14 +1127,11 @@ def density_prior_box(input,
'step_w': steps[0], 'step_w': steps[0],
'step_h': steps[1], 'step_h': steps[1],
'offset': offset, 'offset': offset,
'densities': densities,
'fixed_sizes': fixed_sizes,
'fixed_ratios': fixed_ratios,
'flatten_to_2d': flatten_to_2d,
} }
if densities is not None and len(densities) > 0:
attrs['densities'] = densities
if fixed_sizes is not None and len(fixed_sizes) > 0:
attrs['fixed_sizes'] = fixed_sizes
if fixed_ratios is not None and len(fixed_ratios) > 0:
attrs['fixed_ratios'] = fixed_ratios
box = helper.create_variable_for_type_inference(dtype) box = helper.create_variable_for_type_inference(dtype)
var = helper.create_variable_for_type_inference(dtype) var = helper.create_variable_for_type_inference(dtype)
helper.append_op( helper.append_op(
......
...@@ -2134,11 +2134,16 @@ def pool2d(input, ...@@ -2134,11 +2134,16 @@ def pool2d(input,
input tensor is NCHW, where N is batch size, C is input tensor is NCHW, where N is batch size, C is
the number of channels, H is the height of the the number of channels, H is the height of the
feature, and W is the width of the feature. feature, and W is the width of the feature.
pool_size (int): The side length of pooling windows. All pooling pool_size (int|list|tuple): The pool kernel size. If pool kernel size is a tuple or list,
windows are squares with pool_size on a side. it must contain two integers, (pool_size_Height, pool_size_Width).
Otherwise, the pool kernel size will be a square of an int.
pool_type: ${pooling_type_comment} pool_type: ${pooling_type_comment}
pool_stride (int): stride of the pooling layer. pool_stride (int|list|tuple): The pool stride size. If pool stride size is a tuple or list,
pool_padding (int): padding size. it must contain two integers, (pool_stride_Height, pool_stride_Width).
Otherwise, the pool stride size will be a square of an int.
pool_padding (int|list|tuple): The pool padding size. If pool padding size is a tuple,
it must contain two integers, (pool_padding_on_Height, pool_padding_on_Width).
Otherwise, the pool padding size will be a square of an int.
global_pooling (bool): ${global_pooling_comment} global_pooling (bool): ${global_pooling_comment}
use_cudnn (bool): ${use_cudnn_comment} use_cudnn (bool): ${use_cudnn_comment}
ceil_mode (bool): ${ceil_mode_comment} ceil_mode (bool): ${ceil_mode_comment}
......
...@@ -112,38 +112,42 @@ class TestDetection(unittest.TestCase): ...@@ -112,38 +112,42 @@ class TestDetection(unittest.TestCase):
class TestPriorBox(unittest.TestCase): class TestPriorBox(unittest.TestCase):
def test_prior_box(self): def test_prior_box(self):
data_shape = [3, 224, 224] program = Program()
images = fluid.layers.data( with program_guard(program):
name='pixel', shape=data_shape, dtype='float32') data_shape = [3, 224, 224]
conv1 = fluid.layers.conv2d(images, 3, 3, 2) images = fluid.layers.data(
box, var = layers.prior_box( name='pixel', shape=data_shape, dtype='float32')
input=conv1, conv1 = fluid.layers.conv2d(images, 3, 3, 2)
image=images, box, var = layers.prior_box(
min_sizes=[100.0], input=conv1,
aspect_ratios=[1.], image=images,
flip=True, min_sizes=[100.0],
clip=True) aspect_ratios=[1.],
assert len(box.shape) == 4 flip=True,
assert box.shape == var.shape clip=True)
assert box.shape[3] == 4 assert len(box.shape) == 4
assert box.shape == var.shape
assert box.shape[3] == 4
class TestDensityPriorBox(unittest.TestCase): class TestDensityPriorBox(unittest.TestCase):
def test_density_prior_box(self): def test_density_prior_box(self):
data_shape = [3, 224, 224] program = Program()
images = fluid.layers.data( with program_guard(program):
name='pixel', shape=data_shape, dtype='float32') data_shape = [3, 224, 224]
conv1 = fluid.layers.conv2d(images, 3, 3, 2) images = fluid.layers.data(
box, var = layers.density_prior_box( name='pixel', shape=data_shape, dtype='float32')
input=conv1, conv1 = fluid.layers.conv2d(images, 3, 3, 2)
image=images, box, var = layers.density_prior_box(
densities=[3, 4], input=conv1,
fixed_sizes=[50., 60.], image=images,
fixed_ratios=[1.0], densities=[3, 4],
clip=True) fixed_sizes=[50., 60.],
assert len(box.shape) == 4 fixed_ratios=[1.0],
assert box.shape == var.shape clip=True)
assert box.shape[3] == 4 assert len(box.shape) == 4
assert box.shape == var.shape
assert box.shape[-1] == 4
class TestAnchorGenerator(unittest.TestCase): class TestAnchorGenerator(unittest.TestCase):
......
...@@ -36,7 +36,8 @@ class TestDensityPriorBoxOp(OpTest): ...@@ -36,7 +36,8 @@ class TestDensityPriorBoxOp(OpTest):
'offset': self.offset, 'offset': self.offset,
'densities': self.densities, 'densities': self.densities,
'fixed_sizes': self.fixed_sizes, 'fixed_sizes': self.fixed_sizes,
'fixed_ratios': self.fixed_ratios 'fixed_ratios': self.fixed_ratios,
'flatten_to_2d': self.flatten_to_2d
} }
self.outputs = {'Boxes': self.out_boxes, 'Variances': self.out_var} self.outputs = {'Boxes': self.out_boxes, 'Variances': self.out_var}
...@@ -48,16 +49,17 @@ class TestDensityPriorBoxOp(OpTest): ...@@ -48,16 +49,17 @@ class TestDensityPriorBoxOp(OpTest):
self.set_data() self.set_data()
def set_density(self): def set_density(self):
self.densities = [] self.densities = [4, 2, 1]
self.fixed_sizes = [] self.fixed_sizes = [32.0, 64.0, 128.0]
self.fixed_ratios = [] self.fixed_ratios = [1.0]
self.layer_w = 17
self.layer_h = 17
self.image_w = 533
self.image_h = 533
self.flatten_to_2d = False
def init_test_params(self): def init_test_params(self):
self.layer_w = 32 self.set_density()
self.layer_h = 32
self.image_w = 40
self.image_h = 40
self.step_w = float(self.image_w) / float(self.layer_w) self.step_w = float(self.image_w) / float(self.layer_w)
self.step_h = float(self.image_h) / float(self.layer_h) self.step_h = float(self.image_h) / float(self.layer_h)
...@@ -69,8 +71,6 @@ class TestDensityPriorBoxOp(OpTest): ...@@ -69,8 +71,6 @@ class TestDensityPriorBoxOp(OpTest):
self.variances = [0.1, 0.1, 0.2, 0.2] self.variances = [0.1, 0.1, 0.2, 0.2]
self.variances = np.array(self.variances, dtype=np.float).flatten() self.variances = np.array(self.variances, dtype=np.float).flatten()
self.set_density()
self.clip = True self.clip = True
self.num_priors = 0 self.num_priors = 0
if len(self.fixed_sizes) > 0 and len(self.densities) > 0: if len(self.fixed_sizes) > 0 and len(self.densities) > 0:
...@@ -129,6 +129,9 @@ class TestDensityPriorBoxOp(OpTest): ...@@ -129,6 +129,9 @@ class TestDensityPriorBoxOp(OpTest):
(self.layer_h, self.layer_w, self.num_priors, 1)) (self.layer_h, self.layer_w, self.num_priors, 1))
self.out_boxes = out_boxes.astype('float32') self.out_boxes = out_boxes.astype('float32')
self.out_var = out_var.astype('float32') self.out_var = out_var.astype('float32')
if self.flatten_to_2d:
self.out_boxes = self.out_boxes.reshape((-1, 4))
self.out_var = self.out_var.reshape((-1, 4))
class TestDensityPriorBox(TestDensityPriorBoxOp): class TestDensityPriorBox(TestDensityPriorBoxOp):
...@@ -136,6 +139,11 @@ class TestDensityPriorBox(TestDensityPriorBoxOp): ...@@ -136,6 +139,11 @@ class TestDensityPriorBox(TestDensityPriorBoxOp):
self.densities = [3, 4] self.densities = [3, 4]
self.fixed_sizes = [1.0, 2.0] self.fixed_sizes = [1.0, 2.0]
self.fixed_ratios = [1.0] self.fixed_ratios = [1.0]
self.layer_w = 32
self.layer_h = 32
self.image_w = 40
self.image_h = 40
self.flatten_to_2d = True
if __name__ == '__main__': if __name__ == '__main__':
......
...@@ -202,6 +202,17 @@ class TestBook(unittest.TestCase): ...@@ -202,6 +202,17 @@ class TestBook(unittest.TestCase):
self.assertIsNotNone(layers.sequence_unpad(x=x, length=length)) self.assertIsNotNone(layers.sequence_unpad(x=x, length=length))
print(str(program)) print(str(program))
def test_pool2d(self):
program = Program()
with program_guard(program):
x = layers.data(name='x', shape=[3, 224, 224], dtype='float32')
self.assertIsNotNone(
layers.pool2d(
x,
pool_size=[5, 3],
pool_stride=[1, 2],
pool_padding=(2, 1)))
def test_lstm_unit(self): def test_lstm_unit(self):
program = Program() program = Program()
with program_guard(program): with program_guard(program):
......
...@@ -16,7 +16,7 @@ ENV PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ...@@ -16,7 +16,7 @@ ENV PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
RUN yum install -y sqlite-devel zlib-devel openssl-devel pcre-devel vim tk-devel tkinter libtool xz graphviz RUN yum install -y sqlite-devel zlib-devel openssl-devel pcre-devel vim tk-devel tkinter libtool xz graphviz
COPY build_scripts /build_scripts COPY build_scripts /build_scripts
RUN bash build_scripts/build.sh && \ RUN bash build_scripts/build.sh && \
bash build_scripts/install_nccl2.sh && rm -r build_scripts bash build_scripts/install_nccl2.sh && rm -rf build_scripts
ENV SSL_CERT_FILE=/opt/_internal/certs.pem ENV SSL_CERT_FILE=/opt/_internal/certs.pem
......
...@@ -50,6 +50,15 @@ function do_cpython_build { ...@@ -50,6 +50,15 @@ function do_cpython_build {
mkdir -p ${prefix}/lib mkdir -p ${prefix}/lib
# -Wformat added for https://bugs.python.org/issue17547 on Python 2.6 # -Wformat added for https://bugs.python.org/issue17547 on Python 2.6
if [ $(lex_pyver $py_ver) -eq $(lex_pyver 3.6) ]; then
wget https://www.sqlite.org/2018/sqlite-autoconf-3250300.tar.gz
tar -zxf sqlite-autoconf-3250300.tar.gz
cd sqlite-autoconf-3250300
./configure --prefix=/usr/local
make -j8 && make install
cd ../ && rm sqlite-autoconf-3250300.tar.gz
fi
# NOTE --enable-shared for generating libpython shared library needed for # NOTE --enable-shared for generating libpython shared library needed for
# linking of some of the nupic.core test executables. # linking of some of the nupic.core test executables.
if [ $(lex_pyver $py_ver) -ge $(lex_pyver 3.7) ]; then if [ $(lex_pyver $py_ver) -ge $(lex_pyver 3.7) ]; then
...@@ -59,9 +68,9 @@ function do_cpython_build { ...@@ -59,9 +68,9 @@ function do_cpython_build {
make -j8 > /dev/null make -j8 > /dev/null
make altinstall > /dev/null make altinstall > /dev/null
else else
CFLAGS="-Wformat" ./configure --prefix=${prefix} --enable-shared $unicode_flags > /dev/null LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH} CFLAGS="-Wformat" ./configure --prefix=${prefix} --enable-shared $unicode_flags > /dev/null
make -j8 > /dev/null LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH} make -j8 > /dev/null
make install > /dev/null LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH} make install > /dev/null
fi fi
popd popd
echo "ZZZ looking for libpython" echo "ZZZ looking for libpython"
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册