Body connector in CUDA, GPU speed up

5019d421 · gineshidalgo99 · 04362b9d · 5019d421 · 5019d421 · 5019d421
12 changed file
--- a/.github/issue_template.md
+++ b/.github/issue_template.md
@@ -26,6 +26,10 @@ Note: add `--logging_level 0 --disable_multi_thread` to get higher debug informa



+### Errors (if any)
+
+
+
 ### Type of Issue
 You might select multiple topics, delete the rest:
 - Compilation/installation error
@@ -38,38 +42,40 @@ You might select multiple topics, delete the rest:


 ### Your System Configuration
-1. **OpenPose version**: Latest GitHub code? Or specific commit (e.g., d52878f)? Or specific version from `Release` section (e.g., 1.2.0)?
+1. **Whole console output** (if errors appeared), paste the error to [PasteBin](https://pastebin.com/) and then paste the link here: LINK
+
+2. **OpenPose version**: Latest GitHub code? Or specific commit (e.g., d52878f)? Or specific version from `Release` section (e.g., 1.2.0)?

-2. **General configuration**:
+3. **General configuration**:
    - **Installation mode**: CMake, sh script, manual Makefile installation, ... (Ubuntu); CMake, ... (Windows); ...?
    - **Operating system** (`lsb_release -a` in Ubuntu):
    - **Release or Debug mode**? (by default: release):
    - Compiler (`gcc --version` in Ubuntu or VS version in Windows): 5.4.0, ... (Ubuntu); VS2015 Enterprise Update 3, VS2017 community, ... (Windows); ...?

-3. **Non-default settings**:
+4. **Non-default settings**:
    - **3-D Reconstruction module added**? (by default: no):
    - Any other custom CMake configuration with respect to the default version? (by default: no):

-4. **3rd-party software**:
+5. **3rd-party software**:
    - **Caffe version**: Default from OpenPose, custom version, ...?
    - **CMake version** (`cmake --version` in Ubuntu):
    - **OpenCV version**: pre-compiled `apt-get install libopencv-dev` (only Ubuntu); OpenPose default (only Windows); compiled from source? If so, 2.4.9, 2.4.12, 3.1, 3.2?; ...?

-5. If **GPU mode** issue:
+6. If **GPU mode** issue:
    - **CUDA version** (`cat /usr/local/cuda/version.txt` in most cases):
    - **cuDNN version**:
    - **GPU model** (`nvidia-smi` in Ubuntu):

-6. If **CPU-only mode** issue:
+7. If **CPU-only mode** issue:
    - **CPU brand & model**:
    - Total **RAM memory** available:

-7. If **Python** API:
+8. If **Python** API:
    - **Python version**: 2.7, 3.7, ...?
    - **Numpy version** (`python -c "import numpy; print numpy.version.version"` in Ubuntu):

-8. If **Windows** system:
+9. If **Windows** system:
    - Portable demo or compiled library?

-9. If **speed performance** issue:
+10. If **speed performance** issue:
    - Report OpenPose timing speed based on [this link](https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/doc/installation.md#profiling-speed).
--- a/doc/installation.md
+++ b/doc/installation.md
@@ -106,30 +106,31 @@ Any problem installing OpenPose? Check [doc/faq.md](./faq.md) and/or post a GitH


 ### Prerequisites
-1. Download and install CMake GUI:
+1. Ubuntu - **Anaconda should not be installed** on your system. Anaconda includes a Protobuf version that is incompatible with Caffe. Either you uninstall anaconda and install protobuf via apt-get, or you compile your own Caffe and link it to OpenPose.
+2. Download and install CMake GUI:
    - Ubuntu: run the command `sudo apt-get install cmake-qt-gui`. Note: If you prefer to use CMake through the command line, see [Cmake Command Line Build](#cmake-command-line-build-ubuntu-only).
    - Windows: download and install the latest CMake win64-x64 msi installer from the [CMake website](https://cmake.org/download/), called `cmake-X.X.X-win64-x64.msi`.
    - Mac: `brew cask install cmake`.
-2. Windows - **Microsoft Visual Studio (VS) 2015 Enterprise Update 3**:
+3. Windows - **Microsoft Visual Studio (VS) 2015 Enterprise Update 3**:
    - If **Visual Studio 2017 Community** is desired, we do not officially support it, but it might be compiled by firstly [enabling CUDA 8.0 in VS2017](https://stackoverflow.com/questions/43745099/using-cuda-with-visual-studio-2017?answertab=active#tab-top) or use **VS2017 with CUDA 9** by checking the `.vcxproj` file and changing the necessary paths from CUDA 8 to 9.
    - VS 2015 Enterprise Update 1 will give some compiler errors and VS 2015 Community has not been tested.
-3. Nvidia GPU version prerequisites:
+4. Nvidia GPU version prerequisites:
    1. [**CUDA 8**](https://developer.nvidia.com/cuda-80-ga2-download-archive):
        - Ubuntu: Run `sudo ubuntu/install_cuda.sh` or alternatively download and install it from their website.
-        - Windows: Install CUDA 8.0 after Visual Studio 2015 is installed to assure that the CUDA installation will generate all necessary files for VS. If CUDA was already installed, re-install CUDA after installing VS!
-        - **IMPORTANT**: As of a recent Windows update, you have to download the Nvidia [drivers](http://www.nvidia.com/Download/index.aspx) drivers first, and then install CUDA without the Graphics Driver flag or else your system might hang.
+        - Windows: Install CUDA 8.0 after Visual Studio 2015 is installed to assure that the CUDA installation will generate all necessary files for VS. If CUDA was already installed, re-install - **IMPORTANT 1/2**: Nvidia V, any Nvidia with Volta architecture, and newer Nvidia model GPUs require at least CUDA 9.
+        - **IMPORTANT 2/2**: As of a recent Windows update, you might want to download the Nvidia [drivers](http://www.nvidia.com/Download/index.aspx) first, and then install CUDA without the Graphics Driver flag or else your system might hang.
    2. [**cuDNN 5.1**](https://developer.nvidia.com/cudnn):
        - Ubuntu: Run `sudo ubuntu/install_cudnn.sh` or alternatively download and install it from their website.
        - Windows (and Ubuntu if manual installation): In order to manually install it, just unzip it and copy (merge) the contents on the CUDA folder, usually `/usr/local/cuda/` in Ubuntu and `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0` in Windows.
-4. AMD GPU version prerequisites:
+5. AMD GPU version prerequisites:
    1. Download official AMD drivers for Windows from [**AMD - Windows**](https://support.amd.com/en-us/download).
    2. Download 3rd party ROCM driver for Ubuntu from [**AMD - OpenCL**](https://rocm.github.io/ROCmInstall.html).
    3. Ubuntu only: Install `sudo apt-get install libviennacl-dev`. This comes packaged inside OpenPose for Windows.
    4. AMD Drivers have not been tested on OSX. Please email us if you wish to test it. This has only been tested on Vega series cards.
-5. Ubuntu - Other prerequisites:
+6. Ubuntu - Other prerequisites:
    - Caffe prerequisites: By default, OpenPose uses Caffe under the hood. If you have not used Caffe previously, install its dependencies by running `sudo bash ./ubuntu/install_cmake.sh`.
    - OpenCV must be already installed on your machine. It can be installed with `apt-get install libopencv-dev`. You can also use your own compiled OpenCV version.
-6. Windows - **Caffe, OpenCV, and Caffe prerequisites**:
+7. Windows - **Caffe, OpenCV, and Caffe prerequisites**:
    - CMake automatically downloads all the Windows DLLs. Alternatively, you might prefer to download them manually:
        - Models:
            - [COCO model](http://posefs1.perception.cs.cmu.edu/OpenPose/models/pose/coco/pose_iter_440000.caffemodel): download in `models/pose/coco/`.
@@ -141,9 +142,9 @@ Any problem installing OpenPose? Check [doc/faq.md](./faq.md) and/or post a GitH
            - [Caffe](http://posefs1.perception.cs.cmu.edu/OpenPose/3rdparty/windows/caffe_2018_01_18.zip): Unzip as `3rdparty/windows/caffe/`.
            - [Caffe dependencies](http://posefs1.perception.cs.cmu.edu/OpenPose/3rdparty/windows/caffe3rdparty_2017_07_14.zip): Unzip as `3rdparty/windows/caffe3rdparty/`.
            - [OpenCV 3.1](http://posefs1.perception.cs.cmu.edu/OpenPose/3rdparty/windows/opencv_310.zip): Unzip as `3rdparty/windows/opencv/`.
-7. Mac - **Caffe, OpenCV, and Caffe prerequisites**:
+8. Mac - **Caffe, OpenCV, and Caffe prerequisites**:
    - Install deps by running `bash 3rdparty/osx/install_deps.sh` on your terminal.
-8. **Eigen prerequisite**:
+9. **Eigen prerequisite**:
    - Note: This step is optional, only required for some specific extra functionality, such as extrinsic camera calibration.
    - If you enable the `WITH_EIGEN` flag when running CMake. You can either:
        1. Do not do anything if you set the `WITH_EIGEN` flag to `BUILD`, CMake will automatically download Eigen. Alternatively, you might prefer to download it manually:

--- a/doc/release_notes.md
+++ b/doc/release_notes.md
@@ -262,7 +262,9 @@ OpenPose Library - Release Notes
 ## Current version - future OpenPose 1.4.1
 1. Main improvements:
    1. Added initial single-person tracker for further speed up or visual smoothing (`--tracking` flag).
+    2. Greedy body part connector implemented in CUDA: +~33% speed up in Nvidia (CUDA) version with default flags and +~10% in maximum accuracy configuration.
 2. Functions or parameters renamed:
+    1. By default, python example `2_pose_from_heatmaps.py` was using 2 scales starting at -1x736, changed to 1 scale at -1x368.
 3. Main bugs fixed:



--- a/examples/tutorial_python/2_pose_from_heatmaps.py
+++ b/examples/tutorial_python/2_pose_from_heatmaps.py
@@ -23,8 +23,12 @@ except:
    raise Exception('Error: OpenPose library could not be found. Did you enable `BUILD_PYTHON` in CMake and have this Python script in the right folder?')

 # Params for change
-defRes = 736
-scales = [1,0.5]
+# Single-scale
+defRes = 368
+scales = [1]
+# # Multi-scale
+# defRes = 736
+# scales = [1, 0.75, 0.5, 0.25]
 class Param:
    caffemodel = dir_path + "/../../../models/pose/body_25/pose_iter_584000.caffemodel"
    prototxt = dir_path + "/../../../models/pose/body_25/pose_deploy.prototxt"
@@ -36,7 +40,7 @@ params["output_resolution"] = "-1x-1"
 params["net_resolution"] = "-1x"+str(defRes)
 params["model_pose"] = "BODY_25"
 params["alpha_pose"] = 0.6
-params["scale_gap"] = 0.5
+params["scale_gap"] = 0.25
 params["scale_number"] = len(scales)
 params["render_threshold"] = 0.05
 params["num_gpu_start"] = 0

--- a/include/openpose/pose/bodyPartConnectorBase.hpp
+++ b/include/openpose/pose/bodyPartConnectorBase.hpp
@@ -17,6 +17,9 @@ namespace op
                                    const T* const peaksPtr, const PoseModel poseModel, const Point<int>& heatMapSize,
                                    const int maxPeaks, const T interMinAboveThreshold, const T interThreshold,
                                    const int minSubsetCnt, const T minSubsetScore, const T scaleFactor = 1.f,
+                                    Array<T> finalOutputCpu = Array<T>{}, T* finalOutputGpuPtr = nullptr,
+                                    const unsigned int* const bodyPartPairsGpuPtr = nullptr,
+                                    const unsigned int* const mapIdxGpuPtr = nullptr,
                                    const T* const peaksGpuPtr = nullptr);

    // Private functions used by the 2 above functions

--- a/include/openpose/pose/bodyPartConnectorCaffe.hpp
+++ b/include/openpose/pose/bodyPartConnectorCaffe.hpp
@@ -22,6 +22,8 @@ namespace op
    public:
        explicit BodyPartConnectorCaffe();

+        ~BodyPartConnectorCaffe();
+
        virtual void Reshape(const std::vector<caffe::Blob<T>*>& bottom);

        virtual inline const char* type() const { return "BodyPartConnector"; }
@@ -60,6 +62,11 @@ namespace op
        std::array<int, 4> mHeatMapsSize;
        std::array<int, 4> mPeaksSize;
        std::array<int, 4> mTopSize;
+        // GPU auxiliary
+        unsigned int* pBodyPartPairsGpuPtr;
+        unsigned int* pMapIdxGpuPtr;
+        Array<T> mFinalOutputCpu;
+        T* pFinalOutputGpuPtr;

        DELETE_COPY(BodyPartConnectorCaffe);
    };

--- a/src/openpose/net/nmsBase.cu
+++ b/src/openpose/net/nmsBase.cu
@@ -117,6 +117,44 @@ namespace op
        }
    }

+    // template <typename T>
+    // __global__ void sortKernel(T* targetPtr, const int channels, const int offsetTarget)
+    // {
+    //     const auto globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
+
+    //     if (globalIdx < channels)
+    //     {
+    //         const auto totalOffset = globalIdx * offsetTarget;
+    //         const int nonZeroElementsPlus1 = targetPtr[totalOffset]+1;
+    //         for (auto i = 1 ; i < nonZeroElementsPlus1 ; i++)
+    //         {
+    //             // Find new maximum
+    //             const auto iIndex = totalOffset+3*i;
+    //             int maxIndex = i;
+    //             T maxIndexValue = targetPtr[iIndex+2];
+    //             for (auto j = i+1 ; j < nonZeroElementsPlus1 ; j++)
+    //             {
+    //                 if (maxIndexValue < targetPtr[totalOffset+3*j+2])
+    //                 {
+    //                     maxIndex = j;
+    //                     maxIndexValue = targetPtr[totalOffset+3*j+2];
+    //                 }
+    //             }
+    //             // Swap
+    //             const auto jIndex = totalOffset+3*maxIndex;
+    //             const T temp [3] = {targetPtr[iIndex],
+    //                                 targetPtr[iIndex+1],
+    //                                 targetPtr[iIndex+2]};
+    //             targetPtr[iIndex] = targetPtr[jIndex];
+    //             targetPtr[iIndex+1] = targetPtr[jIndex+1];
+    //             targetPtr[iIndex+2] = targetPtr[jIndex+2];
+    //             targetPtr[jIndex] = temp[0];
+    //             targetPtr[jIndex+1] = temp[1];
+    //             targetPtr[jIndex+2] = temp[2];
+    //         }
+    //     }
+    // }
+
    template <typename T>
    void nmsGpu(T* targetPtr, int* kernelPtr, const T* const sourcePtr, const T threshold,
                const std::array<int, 4>& targetSize, const std::array<int, 4>& sourceSize, const Point<T>& offset)
@@ -137,14 +175,16 @@ namespace op
                                   getNumberCudaBlocks(height, threadsPerBlock2D.y)};
            const dim3 threadsPerBlock1D{THREADS_PER_BLOCK};
            const dim3 numBlocks1D{getNumberCudaBlocks(imageOffset, threadsPerBlock1D.x)};
-            // log("num_b: " + std::to_string(bottom->shape(0)));       // = 1
-            // log("channel_b: " + std::to_string(bottom->shape(1)));   // = 57 = 18 body parts + bkg + 19x2 PAFs
-            // log("height_b: " + std::to_string(bottom->shape(2)));    // = 368 = height
-            // log("width_b: " + std::to_string(bottom->shape(3)));     // = 656 = width
-            // log("num_t: " + std::to_string(top->shape(0)));       // = 1
-            // log("channel_t: " + std::to_string(top->shape(1)));   // = 18 = numberParts
-            // log("height_t: " + std::to_string(top->shape(2)));    // = 97 = maxPeople + 1
-            // log("width_t: " + std::to_string(top->shape(3)));     // = 3 = [x, y, score]
+            // const dim3 threadsPerBlockSort{128};
+            // const dim3 numBlocksSort{getNumberCudaBlocks(channels, threadsPerBlockSort.x)};
+            // log("num_b: " + std::to_string(sourceSize[0]));       // = 1
+            // log("channel_b: " + std::to_string(sourceSize[1]));   // = 57 = 18 body parts + bkg + 19x2 PAFs
+            // log("height_b: " + std::to_string(sourceSize[2]));    // = 368 = height
+            // log("width_b: " + std::to_string(sourceSize[3]));     // = 656 = width
+            // log("num_t: " + std::to_string(targetSize[0]));       // = 1
+            // log("channel_t: " + std::to_string(targetSize[1]));   // = 18 = numberParts
+            // log("height_t: " + std::to_string(targetSize[2]));    // = 128 = maxPeople + 1
+            // log("width_t: " + std::to_string(targetSize[3]));     // = 3 = [x, y, score]
            // log("");

            for (auto n = 0; n < num; n++)
@@ -157,15 +197,19 @@ namespace op
                    const auto* const sourcePtrOffsetted = sourcePtr + offsetChannel * imageOffset;
                    auto* targetPtrOffsetted = targetPtr + offsetChannel * offsetTarget;

-                    // This returns kernelPtrOffsetted, a binary array with 0s & 1s. 1s in the local maximum positions (size = size(sourcePtrOffsetted))
-                    nmsRegisterKernel<<<numBlocks2D, threadsPerBlock2D>>>(kernelPtrOffsetted, sourcePtrOffsetted, width, height, threshold); //[0,0,0,0,1,0,0,0,0,1,0,0,0,0]
+                    // This returns kernelPtrOffsetted, a binary array with 0s & 1s. 1s in the local maximum
+                    // positions (size = size(sourcePtrOffsetted))
+                    // Example result: [0,0,0,0,1,0,0,0,0,1,0,0,0,0]
+                    nmsRegisterKernel<<<numBlocks2D, threadsPerBlock2D>>>(
+                        kernelPtrOffsetted, sourcePtrOffsetted, width, height, threshold);
                    // // Debug
                    // if (c==3)
                    // {
                    //     char filename[50];
                    //     sprintf(filename, "work%02d.txt", c);
                    //     std::ofstream fout(filename);
-                    //     int* kernelPtrOffsetted_local = mKernelBlob.mutable_cpu_data() + n * parts_num * imageOffset + c * imageOffset;
+                    //     int* kernelPtrOffsetted_local = mKernelBlob.mutable_cpu_data()
+                    //                                   + n * parts_num * imageOffset + c * imageOffset;
                    //     for (int y = 0; y < height; y++){
                    //         for (int x = 0; x < width; x++)
                    //             fout << kernelPtrOffsetted_local[y*width + x] << "\t";
@@ -175,14 +219,21 @@ namespace op
                    // }
                    auto kernelThrustPtr = thrust::device_pointer_cast(kernelPtrOffsetted);

-                    // This modifies kernelPtrOffsetted, now it indicates the local maximum indexes. Format: 0,0,0,1,1,1,1,2,2,2,... First maximum at index 2, second at 6, etc...
-                    thrust::exclusive_scan(kernelThrustPtr, kernelThrustPtr + imageOffset, kernelThrustPtr); //[0,0,0,0,0,1,1,1,1,1,2,2,2,2]
+                    // This modifies kernelPtrOffsetted, now it indicates the local maximum indexes
+                    // Format: 0,0,0,1,1,1,1,2,2,2,... First maximum at index 2, second at 6, etc...
+                    // Example result: [0,0,0,0,0,1,1,1,1,1,2,2,2,2]
+                    thrust::exclusive_scan(kernelThrustPtr, kernelThrustPtr + imageOffset, kernelThrustPtr);

                    // This returns targetPtrOffsetted, with the NMS applied over it
                    writeResultKernel<<<numBlocks1D, threadsPerBlock1D>>>(targetPtrOffsetted, imageOffset,
                                                                          kernelPtrOffsetted, sourcePtrOffsetted,
                                                                          width, height, maxPeaks, offset.x, offset.y);
+
                }
+                // // Sort based on score
+                // // Commented because it doesn't change accuracy
+                // // TODO: If finally used, implement for CPU/CL versions
+                // sortKernel<<<numBlocksSort, threadsPerBlockSort>>>(targetPtr, channels, offsetTarget);
            }
            cudaCheck(__LINE__, __FUNCTION__, __FILE__);
        }

--- a/src/openpose/pose/bodyPartConnectorBase.cpp
+++ b/src/openpose/pose/bodyPartConnectorBase.cpp
--- a/src/openpose/pose/bodyPartConnectorBase.cu
+++ b/src/openpose/pose/bodyPartConnectorBase.cu
@@ -6,20 +6,19 @@

 namespace op
 {
+    const dim3 THREADS_PER_BLOCK{4, 16, 16};
+
    template<typename T>
    inline __device__ int intRoundGPU(const T a)
    {
-        return int(a+0.5f);
+        return int(a+T(0.5));
    }

    template <typename T>
    inline __device__  T process(const T* bodyPartA, const T* bodyPartB, const T* mapX, const T* mapY,
-                                 const int heatmapWidth, const int heatmapHeight, const T interThreshold = T(0.05),
-                                 const T interMinAboveThreshold = T(0.95), const T renderThreshold = T(0.05))
+                                 const int heatmapWidth, const int heatmapHeight, const T interThreshold,
+                                 const T interMinAboveThreshold)
    {
-        T finalOutput = -1;
-        if (bodyPartA[2] < renderThreshold || bodyPartB[2] < renderThreshold) return finalOutput;
-
        const auto vectorAToBX = bodyPartB[0] - bodyPartA[0];
        const auto vectorAToBY = bodyPartB[1] - bodyPartA[1];
        const auto vectorAToBMax = max(abs(vectorAToBX), abs(vectorAToBY));
@@ -57,36 +56,47 @@ namespace op

            // parts score + connection score
            if (count/(float)numberPointsInLine > interMinAboveThreshold)
-                finalOutput = sum/count;
+                return sum/count;
        }
-
-        return finalOutput;
+        return -1;
    }

    template <typename T>
    __global__ void pafScoreKernel(T* finalOutputPtr, const T* const heatMapPtr, const T* const peaksPtr,
                                   const unsigned int* const bodyPartPairsPtr, const unsigned int* const mapIdxPtr,
-                                   const unsigned int poseMaxPeople, const int numberBodyPartPairs,
-                                   const int heatmapWidth, const int heatmapHeight)
+                                   const unsigned int maxPeaks, const int numberBodyPartPairs,
+                                   const int heatmapWidth, const int heatmapHeight, const T interThreshold,
+                                   const T interMinAboveThreshold)
    {
-        const auto i = (blockIdx.x * blockDim.x) + threadIdx.x;
-        const auto j = (blockIdx.y * blockDim.y) + threadIdx.y;
-        const auto k = (blockIdx.z * blockDim.z) + threadIdx.z;
+        const auto pairIndex = (blockIdx.x * blockDim.x) + threadIdx.x;
+        const auto peakA = (blockIdx.y * blockDim.y) + threadIdx.y;
+        const auto peakB = (blockIdx.z * blockDim.z) + threadIdx.z;

-        if (i < numberBodyPartPairs)
+        if (pairIndex < numberBodyPartPairs && peakA < maxPeaks && peakB < maxPeaks)
        {
-            const int partA = bodyPartPairsPtr[i*2];
-            const int partB = bodyPartPairsPtr[i*2 + 1];
-            const int mapIdxX = mapIdxPtr[i*2];
-            const int mapIdxY = mapIdxPtr[i*2 + 1];
-
-            const T* const bodyPartA = peaksPtr + (partA*poseMaxPeople*3 + j*3);
-            const T* const bodyPartB = peaksPtr + (partB*poseMaxPeople*3 + k*3);
-            const T* const mapX = heatMapPtr + mapIdxX*heatmapWidth*heatmapHeight;
-            const T* const mapY = heatMapPtr + mapIdxY*heatmapWidth*heatmapHeight;
-
-            const T finalOutput = process(bodyPartA, bodyPartB, mapX, mapY, heatmapWidth, heatmapHeight);
-            finalOutputPtr[(i*poseMaxPeople+j)*poseMaxPeople + k] = finalOutput;
+            const auto baseIndex = 2*pairIndex;
+            const auto partA = bodyPartPairsPtr[baseIndex];
+            const auto partB = bodyPartPairsPtr[baseIndex + 1];
+
+            const T numberPeaksA = peaksPtr[3*partA*(maxPeaks+1)];
+            const T numberPeaksB = peaksPtr[3*partB*(maxPeaks+1)];
+
+            const auto outputIndex = (pairIndex*maxPeaks+peakA)*maxPeaks + peakB;
+            if (peakA < numberPeaksA && peakB < numberPeaksB)
+            {
+                const auto mapIdxX = mapIdxPtr[baseIndex];
+                const auto mapIdxY = mapIdxPtr[baseIndex + 1];
+
+                const T* const bodyPartA = peaksPtr + (3*(partA*(maxPeaks+1) + peakA+1));
+                const T* const bodyPartB = peaksPtr + (3*(partB*(maxPeaks+1) + peakB+1));
+                const T* const mapX = heatMapPtr + mapIdxX*heatmapWidth*heatmapHeight;
+                const T* const mapY = heatMapPtr + mapIdxY*heatmapWidth*heatmapHeight;
+                finalOutputPtr[outputIndex] = process(
+                    bodyPartA, bodyPartB, mapX, mapY, heatmapWidth, heatmapHeight, interThreshold,
+                    interMinAboveThreshold);
+            }
+            else
+                finalOutputPtr[outputIndex] = -1;
        }
    }

@@ -95,50 +105,37 @@ namespace op
                             const T* const peaksPtr, const PoseModel poseModel, const Point<int>& heatMapSize,
                             const int maxPeaks, const T interMinAboveThreshold, const T interThreshold,
                             const int minSubsetCnt, const T minSubsetScore, const T scaleFactor,
+                             Array<T> finalOutputCpu, T* finalOutputGpuPtr,
+                             const unsigned int* const bodyPartPairsGpuPtr, const unsigned int* const mapIdxGpuPtr,
                             const T* const peaksGpuPtr)
    {
        try
        {
            // Parts Connection
            const auto& bodyPartPairs = getPosePartPairs(poseModel);
-            const auto& mapIdxOffset = getPoseMapIndex(poseModel);
            const auto numberBodyParts = getPoseNumberBodyParts(poseModel);
            const auto numberBodyPartPairs = bodyPartPairs.size() / 2;
            const auto subsetCounterIndex = numberBodyParts;
-            // Update mapIdx
-            auto mapIdx = mapIdxOffset;
-            for (auto& i : mapIdx)
-                i += (numberBodyParts+1);
+            const auto totalComputations = finalOutputCpu.getVolume();

            if (numberBodyParts == 0)
                error("Invalid value of numberBodyParts, it must be positive, not " + std::to_string(numberBodyParts),
                      __LINE__, __FUNCTION__, __FILE__);
+            if (bodyPartPairsGpuPtr == nullptr || mapIdxGpuPtr == nullptr)
+                error("The pointers bodyPartPairsGpuPtr and mapIdxGpuPtr cannot be nullptr.",
+                      __LINE__, __FUNCTION__, __FILE__);
+

-            // Upload required data to GPU
-            unsigned int* bodyPartPairsGpuPtr;
-            cudaMalloc((void **)&bodyPartPairsGpuPtr, bodyPartPairs.size() * sizeof(unsigned int));
-            cudaMemcpy(bodyPartPairsGpuPtr, &bodyPartPairs[0], bodyPartPairs.size() * sizeof(unsigned int),
-                       cudaMemcpyHostToDevice);
-            unsigned int* mapIdxGpuPtr;
-            cudaMalloc((void **)&mapIdxGpuPtr, mapIdx.size() * sizeof(unsigned int));
-            cudaMemcpy(mapIdxGpuPtr, &mapIdx[0], mapIdx.size() * sizeof(unsigned int), cudaMemcpyHostToDevice);
-            T* finalOutputGpuPtr;
-            Array<T> finalOutputCpu;
-            finalOutputCpu.reset({(int)numberBodyPartPairs, (int)POSE_MAX_PEOPLE, (int)POSE_MAX_PEOPLE},-1);
-            int totalComputations = numberBodyPartPairs * POSE_MAX_PEOPLE * POSE_MAX_PEOPLE;
-            cudaMalloc((void **)&finalOutputGpuPtr, totalComputations * sizeof(float));
-
-            // Run Kernel
-            const dim3 threadsPerBlock{4, 8, 8}; //4 is good for BODY_25, 8 for COCO?
-            if ((POSE_MAX_PEOPLE+1) % threadsPerBlock.y || (POSE_MAX_PEOPLE+1) % threadsPerBlock.z)
-                error("Invalid value of POSE_MAX_PEOPLE, it must be multiple of 16, rather than "
-                      + std::to_string(POSE_MAX_PEOPLE), __LINE__, __FUNCTION__, __FILE__);
-            int pairBlocks = intRound((numberBodyPartPairs/threadsPerBlock.x) + 0.5);
-            const dim3 numBlocks{(unsigned int)pairBlocks, (POSE_MAX_PEOPLE+1) / threadsPerBlock.y,
-                                 (POSE_MAX_PEOPLE+1) / threadsPerBlock.z};
-            pafScoreKernel<<<numBlocks, threadsPerBlock>>>(
+            // Run Kernel - finalOutputGpu
+            const dim3 numBlocks{
+                getNumberCudaBlocks(numberBodyPartPairs, THREADS_PER_BLOCK.x),
+                getNumberCudaBlocks(maxPeaks, THREADS_PER_BLOCK.y),
+                getNumberCudaBlocks(maxPeaks, THREADS_PER_BLOCK.z)};
+            pafScoreKernel<<<numBlocks, THREADS_PER_BLOCK>>>(
                finalOutputGpuPtr, heatMapGpuPtr, peaksGpuPtr, bodyPartPairsGpuPtr, mapIdxGpuPtr,
-                POSE_MAX_PEOPLE, numberBodyPartPairs, heatMapSize.x, heatMapSize.y);
+                maxPeaks, numberBodyPartPairs, heatMapSize.x, heatMapSize.y, interThreshold,
+                interMinAboveThreshold);
+            // finalOutputCpu <-- finalOutputGpu
            cudaMemcpy(finalOutputCpu.getPtr(), finalOutputGpuPtr, totalComputations * sizeof(float),
                       cudaMemcpyDeviceToHost);

@@ -153,10 +150,10 @@ namespace op
            // Delete people below the following thresholds:
                // a) minSubsetCnt: removed if less than minSubsetCnt body parts
                // b) minSubsetScore: removed if global score smaller than this
-                // c) POSE_MAX_PEOPLE: keep first POSE_MAX_PEOPLE people above thresholds
+                // c) maxPeaks (POSE_MAX_PEOPLE): keep first maxPeaks people above thresholds
            int numberPeople;
            std::vector<int> validSubsetIndexes;
-            validSubsetIndexes.reserve(fastMin((size_t)POSE_MAX_PEOPLE, subsets.size()));
+            validSubsetIndexes.reserve(fastMin((size_t)maxPeaks, subsets.size()));
            removeSubsetsBelowThresholds(validSubsetIndexes, numberPeople, subsets, subsetCounterIndex,
                                         numberBodyParts, minSubsetCnt, minSubsetScore, maxPeaks);

@@ -164,10 +161,7 @@ namespace op
            subsetsToPoseKeypointsAndScores(poseKeypoints, poseScores, scaleFactor, subsets, validSubsetIndexes,
                                            peaksPtr, numberPeople, numberBodyParts, numberBodyPartPairs);

-            // Differences w.r.t. CPU version for now
-            cudaFree(bodyPartPairsGpuPtr);
-            cudaFree(mapIdxGpuPtr);
-            cudaFree(finalOutputGpuPtr);
+            // Sanity check
            cudaCheck(__LINE__, __FUNCTION__, __FILE__);
        }
        catch (const std::exception& e)
@@ -181,11 +175,17 @@ namespace op
                                      const PoseModel poseModel, const Point<int>& heatMapSize, const int maxPeaks,
                                      const float interMinAboveThreshold, const float interThreshold,
                                      const int minSubsetCnt, const float minSubsetScore, const float scaleFactor,
+                                      Array<float> finalOutputCpu, float* finalOutputGpuPtr,
+                                      const unsigned int* const bodyPartPairsGpuPtr,
+                                      const unsigned int* const mapIdxGpuPtr,
                                      const float* const peaksGpuPtr);
    template void connectBodyPartsGpu(Array<double>& poseKeypoints, Array<double>& poseScores,
                                      const double* const heatMapGpuPtr, const double* const peaksPtr,
                                      const PoseModel poseModel, const Point<int>& heatMapSize, const int maxPeaks,
                                      const double interMinAboveThreshold, const double interThreshold,
                                      const int minSubsetCnt, const double minSubsetScore, const double scaleFactor,
+                                      Array<double> finalOutputCpu, double* finalOutputGpuPtr,
+                                      const unsigned int* const bodyPartPairsGpuPtr,
+                                      const unsigned int* const mapIdxGpuPtr,
                                      const double* const peaksGpuPtr);
 }
--- a/src/openpose/pose/bodyPartConnectorCaffe.cpp
+++ b/src/openpose/pose/bodyPartConnectorCaffe.cpp
 #ifdef USE_CAFFE
    #include <caffe/blob.hpp>
 #endif
+#ifdef USE_CUDA
+    #include <openpose/gpu/cuda.hpp>
+#endif
 #include <openpose/pose/bodyPartConnectorBase.hpp>
+#include <openpose/pose/poseParameters.hpp>
 #include <openpose/pose/bodyPartConnectorCaffe.hpp>

 namespace op
 {
    template <typename T>
-    BodyPartConnectorCaffe<T>::BodyPartConnectorCaffe()
+    BodyPartConnectorCaffe<T>::BodyPartConnectorCaffe() :
+        pBodyPartPairsGpuPtr{nullptr},
+        pMapIdxGpuPtr{nullptr},
+        pFinalOutputGpuPtr{nullptr}
    {
        try
        {
@@ -22,6 +29,23 @@ namespace op
        }
    }

+    template <typename T>
+    BodyPartConnectorCaffe<T>::~BodyPartConnectorCaffe()
+    {
+        try
+        {
+            #if defined USE_CAFFE && defined USE_CUDA
+                cudaFree(pBodyPartPairsGpuPtr);
+                cudaFree(pMapIdxGpuPtr);
+                cudaFree(pFinalOutputGpuPtr);
+            #endif
+        }
+        catch (const std::exception& e)
+        {
+            error(e.what(), __LINE__, __FUNCTION__, __FILE__);
+        }
+    }
+
    template <typename T>
    void BodyPartConnectorCaffe<T>::Reshape(const std::vector<caffe::Blob<T>*>& bottom)
    {
@@ -162,15 +186,58 @@ namespace op
        try
        {
            #if defined USE_CAFFE && defined USE_CUDA
+                // Global data
                const auto heatMapsBlob = bottom.at(0);
-                const auto* const peaksPtr = bottom.at(1)->cpu_data();
                const auto* const heatMapsGpuPtr = heatMapsBlob->gpu_data();
-                const auto* const peaksGpuPtr = bottom.at(1)->gpu_data();
+                const auto* const peaksPtr = bottom.at(1)->cpu_data();
                const auto maxPeaks = mTopSize[1];
+                const auto* const peaksGpuPtr = bottom.at(1)->gpu_data();
+
+                // Initialize fixed pointers (1-time task) - It must be done in the same thread than Forward_gpu
+                if (pBodyPartPairsGpuPtr == nullptr || pMapIdxGpuPtr == nullptr)
+                {
+                    // Free previous memory
+                    cudaFree(pBodyPartPairsGpuPtr);
+                    cudaFree(pMapIdxGpuPtr);
+                    // Data
+                    const auto& bodyPartPairs = getPosePartPairs(mPoseModel);
+                    const auto numberBodyParts = getPoseNumberBodyParts(mPoseModel);
+                    const auto& mapIdxOffset = getPoseMapIndex(mPoseModel);
+                    // Update mapIdx
+                    auto mapIdx = mapIdxOffset;
+                    for (auto& i : mapIdx)
+                        i += (numberBodyParts+1);
+                    // Re-allocate memory
+                    cudaMalloc((void **)&pBodyPartPairsGpuPtr, bodyPartPairs.size() * sizeof(unsigned int));
+                    cudaMemcpy(pBodyPartPairsGpuPtr, &bodyPartPairs[0], bodyPartPairs.size() * sizeof(unsigned int),
+                               cudaMemcpyHostToDevice);
+                    cudaMalloc((void **)&pMapIdxGpuPtr, mapIdx.size() * sizeof(unsigned int));
+                    cudaMemcpy(pMapIdxGpuPtr, &mapIdx[0], mapIdx.size() * sizeof(unsigned int),
+                               cudaMemcpyHostToDevice);
+                    // Sanity check
+                    cudaCheck(__LINE__, __FUNCTION__, __FILE__);
+                }
+                // Initialize auxiliary pointers (1-time task)
+                if (mFinalOutputCpu.empty()) // if (pFinalOutputGpuPtr == nullptr)
+                {
+                    // Data
+                    const auto& bodyPartPairs = getPosePartPairs(mPoseModel);
+                    const auto numberBodyPartPairs = bodyPartPairs.size() / 2;
+                    // Allocate memory
+                    mFinalOutputCpu.reset({(int)numberBodyPartPairs, maxPeaks, maxPeaks});
+                    const auto totalComputations = mFinalOutputCpu.getVolume();
+                    if (pFinalOutputGpuPtr == nullptr)
+                        cudaMalloc((void **)&pFinalOutputGpuPtr, totalComputations * sizeof(float));
+                    // Sanity check
+                    cudaCheck(__LINE__, __FUNCTION__, __FILE__);
+                }
+
+                // Run body part connector
                connectBodyPartsGpu(poseKeypoints, poseScores, heatMapsGpuPtr, peaksPtr, mPoseModel,
                                    Point<int>{heatMapsBlob->shape(3), heatMapsBlob->shape(2)},
                                    maxPeaks, mInterMinAboveThreshold, mInterThreshold,
-                                    mMinSubsetCnt, mMinSubsetScore, mScaleNetToOutput, peaksGpuPtr);
+                                    mMinSubsetCnt, mMinSubsetScore, mScaleNetToOutput, mFinalOutputCpu,
+                                    pFinalOutputGpuPtr, pBodyPartPairsGpuPtr, pMapIdxGpuPtr, peaksGpuPtr);
            #else
                UNUSED(bottom);
                UNUSED(poseKeypoints);

--- a/src/openpose/pose/poseExtractorCaffe.cpp
+++ b/src/openpose/pose/poseExtractorCaffe.cpp
@@ -294,15 +294,23 @@ namespace op
                upImpl->spBodyPartConnectorCaffe->setMinSubsetCnt((int)get(PoseProperty::ConnectMinSubsetCnt));
                upImpl->spBodyPartConnectorCaffe->setMinSubsetScore((float)get(PoseProperty::ConnectMinSubsetScore));

-                // #ifdef USE_CUDA
-                //     upImpl->spBodyPartConnectorCaffe->Forward_gpu({upImpl->spHeatMapsBlob.get(),
-                //                                                    upImpl->spPeaksBlob.get()},
-                //                                                   mPoseKeypoints, mPoseScores);
-                // #else
+                #ifdef USE_CUDA
+                    // BODY_25D only implemented for CPU version
+                    if (mPoseModel == PoseModel::BODY_25D)
+                        upImpl->spBodyPartConnectorCaffe->Forward_cpu({upImpl->spHeatMapsBlob.get(),
+                                                                       upImpl->spPeaksBlob.get()},
+                                                                      mPoseKeypoints, mPoseScores);
+                    else
+                        upImpl->spBodyPartConnectorCaffe->Forward_gpu({upImpl->spHeatMapsBlob.get(),
+                                                                       upImpl->spPeaksBlob.get()},
+                                                                      mPoseKeypoints, mPoseScores);
+                #else
                    upImpl->spBodyPartConnectorCaffe->Forward_cpu({upImpl->spHeatMapsBlob.get(),
                                                                   upImpl->spPeaksBlob.get()},
                                                                  mPoseKeypoints, mPoseScores);
-                // #endif
+                #endif
+
+                // 5. CUDA sanity check
                #ifdef USE_CUDA
                    cudaCheck(__LINE__, __FUNCTION__, __FILE__);
                #endif

--- a/src/openpose/pose/renderPose.cu
+++ b/src/openpose/pose/renderPose.cu
@@ -533,13 +533,17 @@ namespace op
                if (googlyEyes && (poseModel == PoseModel::MPI_15 || poseModel == PoseModel::MPI_15_4))
                    error("Bool googlyEyes not compatible with MPI models.",
                          __LINE__, __FUNCTION__, __FILE__);
+                if (numberPeople > POSE_MAX_PEOPLE)
+                    error("Rendering assumes that numberPeople <= POSE_MAX_PEOPLE = " + std::to_string(POSE_MAX_PEOPLE)
+                          + ".", __LINE__, __FUNCTION__, __FILE__);

                dim3 threadsPerBlock;
                dim3 numBlocks;
                getNumberCudaThreadsAndBlocks(threadsPerBlock, numBlocks, frameSize);

                // Body pose
-                if (poseModel == PoseModel::BODY_25 || poseModel == PoseModel::BODY_25_19 || poseModel == PoseModel::BODY_25D)
+                if (poseModel == PoseModel::BODY_25 || poseModel == PoseModel::BODY_25_19
+                    || poseModel == PoseModel::BODY_25D)
                    renderPoseBody25<<<threadsPerBlock, numBlocks>>>(
                        framePtr, frameSize.x, frameSize.y, posePtr, numberPeople, renderThreshold, googlyEyes,
                        blendOriginalFrame, alphaBlending