提交 5019d421 编写于 作者: G gineshidalgo99

Body connector in CUDA, GPU speed up

上级 04362b9d
......@@ -26,6 +26,10 @@ Note: add `--logging_level 0 --disable_multi_thread` to get higher debug informa
### Errors (if any)
### Type of Issue
You might select multiple topics, delete the rest:
- Compilation/installation error
......@@ -38,38 +42,40 @@ You might select multiple topics, delete the rest:
### Your System Configuration
1. **OpenPose version**: Latest GitHub code? Or specific commit (e.g., d52878f)? Or specific version from `Release` section (e.g., 1.2.0)?
1. **Whole console output** (if errors appeared), paste the error to [PasteBin](https://pastebin.com/) and then paste the link here: LINK
2. **OpenPose version**: Latest GitHub code? Or specific commit (e.g., d52878f)? Or specific version from `Release` section (e.g., 1.2.0)?
2. **General configuration**:
3. **General configuration**:
- **Installation mode**: CMake, sh script, manual Makefile installation, ... (Ubuntu); CMake, ... (Windows); ...?
- **Operating system** (`lsb_release -a` in Ubuntu):
- **Release or Debug mode**? (by default: release):
- Compiler (`gcc --version` in Ubuntu or VS version in Windows): 5.4.0, ... (Ubuntu); VS2015 Enterprise Update 3, VS2017 community, ... (Windows); ...?
3. **Non-default settings**:
4. **Non-default settings**:
- **3-D Reconstruction module added**? (by default: no):
- Any other custom CMake configuration with respect to the default version? (by default: no):
4. **3rd-party software**:
5. **3rd-party software**:
- **Caffe version**: Default from OpenPose, custom version, ...?
- **CMake version** (`cmake --version` in Ubuntu):
- **OpenCV version**: pre-compiled `apt-get install libopencv-dev` (only Ubuntu); OpenPose default (only Windows); compiled from source? If so, 2.4.9, 2.4.12, 3.1, 3.2?; ...?
5. If **GPU mode** issue:
6. If **GPU mode** issue:
- **CUDA version** (`cat /usr/local/cuda/version.txt` in most cases):
- **cuDNN version**:
- **GPU model** (`nvidia-smi` in Ubuntu):
6. If **CPU-only mode** issue:
7. If **CPU-only mode** issue:
- **CPU brand & model**:
- Total **RAM memory** available:
7. If **Python** API:
8. If **Python** API:
- **Python version**: 2.7, 3.7, ...?
- **Numpy version** (`python -c "import numpy; print numpy.version.version"` in Ubuntu):
8. If **Windows** system:
9. If **Windows** system:
- Portable demo or compiled library?
9. If **speed performance** issue:
10. If **speed performance** issue:
- Report OpenPose timing speed based on [this link](https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/doc/installation.md#profiling-speed).
......@@ -106,30 +106,31 @@ Any problem installing OpenPose? Check [doc/faq.md](./faq.md) and/or post a GitH
### Prerequisites
1. Download and install CMake GUI:
1. Ubuntu - **Anaconda should not be installed** on your system. Anaconda includes a Protobuf version that is incompatible with Caffe. Either you uninstall anaconda and install protobuf via apt-get, or you compile your own Caffe and link it to OpenPose.
2. Download and install CMake GUI:
- Ubuntu: run the command `sudo apt-get install cmake-qt-gui`. Note: If you prefer to use CMake through the command line, see [Cmake Command Line Build](#cmake-command-line-build-ubuntu-only).
- Windows: download and install the latest CMake win64-x64 msi installer from the [CMake website](https://cmake.org/download/), called `cmake-X.X.X-win64-x64.msi`.
- Mac: `brew cask install cmake`.
2. Windows - **Microsoft Visual Studio (VS) 2015 Enterprise Update 3**:
3. Windows - **Microsoft Visual Studio (VS) 2015 Enterprise Update 3**:
- If **Visual Studio 2017 Community** is desired, we do not officially support it, but it might be compiled by firstly [enabling CUDA 8.0 in VS2017](https://stackoverflow.com/questions/43745099/using-cuda-with-visual-studio-2017?answertab=active#tab-top) or use **VS2017 with CUDA 9** by checking the `.vcxproj` file and changing the necessary paths from CUDA 8 to 9.
- VS 2015 Enterprise Update 1 will give some compiler errors and VS 2015 Community has not been tested.
3. Nvidia GPU version prerequisites:
4. Nvidia GPU version prerequisites:
1. [**CUDA 8**](https://developer.nvidia.com/cuda-80-ga2-download-archive):
- Ubuntu: Run `sudo ubuntu/install_cuda.sh` or alternatively download and install it from their website.
- Windows: Install CUDA 8.0 after Visual Studio 2015 is installed to assure that the CUDA installation will generate all necessary files for VS. If CUDA was already installed, re-install CUDA after installing VS!
- **IMPORTANT**: As of a recent Windows update, you have to download the Nvidia [drivers](http://www.nvidia.com/Download/index.aspx) drivers first, and then install CUDA without the Graphics Driver flag or else your system might hang.
- Windows: Install CUDA 8.0 after Visual Studio 2015 is installed to assure that the CUDA installation will generate all necessary files for VS. If CUDA was already installed, re-install - **IMPORTANT 1/2**: Nvidia V, any Nvidia with Volta architecture, and newer Nvidia model GPUs require at least CUDA 9.
- **IMPORTANT 2/2**: As of a recent Windows update, you might want to download the Nvidia [drivers](http://www.nvidia.com/Download/index.aspx) first, and then install CUDA without the Graphics Driver flag or else your system might hang.
2. [**cuDNN 5.1**](https://developer.nvidia.com/cudnn):
- Ubuntu: Run `sudo ubuntu/install_cudnn.sh` or alternatively download and install it from their website.
- Windows (and Ubuntu if manual installation): In order to manually install it, just unzip it and copy (merge) the contents on the CUDA folder, usually `/usr/local/cuda/` in Ubuntu and `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0` in Windows.
4. AMD GPU version prerequisites:
5. AMD GPU version prerequisites:
1. Download official AMD drivers for Windows from [**AMD - Windows**](https://support.amd.com/en-us/download).
2. Download 3rd party ROCM driver for Ubuntu from [**AMD - OpenCL**](https://rocm.github.io/ROCmInstall.html).
3. Ubuntu only: Install `sudo apt-get install libviennacl-dev`. This comes packaged inside OpenPose for Windows.
4. AMD Drivers have not been tested on OSX. Please email us if you wish to test it. This has only been tested on Vega series cards.
5. Ubuntu - Other prerequisites:
6. Ubuntu - Other prerequisites:
- Caffe prerequisites: By default, OpenPose uses Caffe under the hood. If you have not used Caffe previously, install its dependencies by running `sudo bash ./ubuntu/install_cmake.sh`.
- OpenCV must be already installed on your machine. It can be installed with `apt-get install libopencv-dev`. You can also use your own compiled OpenCV version.
6. Windows - **Caffe, OpenCV, and Caffe prerequisites**:
7. Windows - **Caffe, OpenCV, and Caffe prerequisites**:
- CMake automatically downloads all the Windows DLLs. Alternatively, you might prefer to download them manually:
- Models:
- [COCO model](http://posefs1.perception.cs.cmu.edu/OpenPose/models/pose/coco/pose_iter_440000.caffemodel): download in `models/pose/coco/`.
......@@ -141,9 +142,9 @@ Any problem installing OpenPose? Check [doc/faq.md](./faq.md) and/or post a GitH
- [Caffe](http://posefs1.perception.cs.cmu.edu/OpenPose/3rdparty/windows/caffe_2018_01_18.zip): Unzip as `3rdparty/windows/caffe/`.
- [Caffe dependencies](http://posefs1.perception.cs.cmu.edu/OpenPose/3rdparty/windows/caffe3rdparty_2017_07_14.zip): Unzip as `3rdparty/windows/caffe3rdparty/`.
- [OpenCV 3.1](http://posefs1.perception.cs.cmu.edu/OpenPose/3rdparty/windows/opencv_310.zip): Unzip as `3rdparty/windows/opencv/`.
7. Mac - **Caffe, OpenCV, and Caffe prerequisites**:
8. Mac - **Caffe, OpenCV, and Caffe prerequisites**:
- Install deps by running `bash 3rdparty/osx/install_deps.sh` on your terminal.
8. **Eigen prerequisite**:
9. **Eigen prerequisite**:
- Note: This step is optional, only required for some specific extra functionality, such as extrinsic camera calibration.
- If you enable the `WITH_EIGEN` flag when running CMake. You can either:
1. Do not do anything if you set the `WITH_EIGEN` flag to `BUILD`, CMake will automatically download Eigen. Alternatively, you might prefer to download it manually:
......
......@@ -262,7 +262,9 @@ OpenPose Library - Release Notes
## Current version - future OpenPose 1.4.1
1. Main improvements:
1. Added initial single-person tracker for further speed up or visual smoothing (`--tracking` flag).
2. Greedy body part connector implemented in CUDA: +~33% speed up in Nvidia (CUDA) version with default flags and +~10% in maximum accuracy configuration.
2. Functions or parameters renamed:
1. By default, python example `2_pose_from_heatmaps.py` was using 2 scales starting at -1x736, changed to 1 scale at -1x368.
3. Main bugs fixed:
......
......@@ -23,8 +23,12 @@ except:
raise Exception('Error: OpenPose library could not be found. Did you enable `BUILD_PYTHON` in CMake and have this Python script in the right folder?')
# Params for change
defRes = 736
scales = [1,0.5]
# Single-scale
defRes = 368
scales = [1]
# # Multi-scale
# defRes = 736
# scales = [1, 0.75, 0.5, 0.25]
class Param:
caffemodel = dir_path + "/../../../models/pose/body_25/pose_iter_584000.caffemodel"
prototxt = dir_path + "/../../../models/pose/body_25/pose_deploy.prototxt"
......@@ -36,7 +40,7 @@ params["output_resolution"] = "-1x-1"
params["net_resolution"] = "-1x"+str(defRes)
params["model_pose"] = "BODY_25"
params["alpha_pose"] = 0.6
params["scale_gap"] = 0.5
params["scale_gap"] = 0.25
params["scale_number"] = len(scales)
params["render_threshold"] = 0.05
params["num_gpu_start"] = 0
......
......@@ -17,6 +17,9 @@ namespace op
const T* const peaksPtr, const PoseModel poseModel, const Point<int>& heatMapSize,
const int maxPeaks, const T interMinAboveThreshold, const T interThreshold,
const int minSubsetCnt, const T minSubsetScore, const T scaleFactor = 1.f,
Array<T> finalOutputCpu = Array<T>{}, T* finalOutputGpuPtr = nullptr,
const unsigned int* const bodyPartPairsGpuPtr = nullptr,
const unsigned int* const mapIdxGpuPtr = nullptr,
const T* const peaksGpuPtr = nullptr);
// Private functions used by the 2 above functions
......
......@@ -22,6 +22,8 @@ namespace op
public:
explicit BodyPartConnectorCaffe();
~BodyPartConnectorCaffe();
virtual void Reshape(const std::vector<caffe::Blob<T>*>& bottom);
virtual inline const char* type() const { return "BodyPartConnector"; }
......@@ -60,6 +62,11 @@ namespace op
std::array<int, 4> mHeatMapsSize;
std::array<int, 4> mPeaksSize;
std::array<int, 4> mTopSize;
// GPU auxiliary
unsigned int* pBodyPartPairsGpuPtr;
unsigned int* pMapIdxGpuPtr;
Array<T> mFinalOutputCpu;
T* pFinalOutputGpuPtr;
DELETE_COPY(BodyPartConnectorCaffe);
};
......
......@@ -117,6 +117,44 @@ namespace op
}
}
// template <typename T>
// __global__ void sortKernel(T* targetPtr, const int channels, const int offsetTarget)
// {
// const auto globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
// if (globalIdx < channels)
// {
// const auto totalOffset = globalIdx * offsetTarget;
// const int nonZeroElementsPlus1 = targetPtr[totalOffset]+1;
// for (auto i = 1 ; i < nonZeroElementsPlus1 ; i++)
// {
// // Find new maximum
// const auto iIndex = totalOffset+3*i;
// int maxIndex = i;
// T maxIndexValue = targetPtr[iIndex+2];
// for (auto j = i+1 ; j < nonZeroElementsPlus1 ; j++)
// {
// if (maxIndexValue < targetPtr[totalOffset+3*j+2])
// {
// maxIndex = j;
// maxIndexValue = targetPtr[totalOffset+3*j+2];
// }
// }
// // Swap
// const auto jIndex = totalOffset+3*maxIndex;
// const T temp [3] = {targetPtr[iIndex],
// targetPtr[iIndex+1],
// targetPtr[iIndex+2]};
// targetPtr[iIndex] = targetPtr[jIndex];
// targetPtr[iIndex+1] = targetPtr[jIndex+1];
// targetPtr[iIndex+2] = targetPtr[jIndex+2];
// targetPtr[jIndex] = temp[0];
// targetPtr[jIndex+1] = temp[1];
// targetPtr[jIndex+2] = temp[2];
// }
// }
// }
template <typename T>
void nmsGpu(T* targetPtr, int* kernelPtr, const T* const sourcePtr, const T threshold,
const std::array<int, 4>& targetSize, const std::array<int, 4>& sourceSize, const Point<T>& offset)
......@@ -137,14 +175,16 @@ namespace op
getNumberCudaBlocks(height, threadsPerBlock2D.y)};
const dim3 threadsPerBlock1D{THREADS_PER_BLOCK};
const dim3 numBlocks1D{getNumberCudaBlocks(imageOffset, threadsPerBlock1D.x)};
// log("num_b: " + std::to_string(bottom->shape(0))); // = 1
// log("channel_b: " + std::to_string(bottom->shape(1))); // = 57 = 18 body parts + bkg + 19x2 PAFs
// log("height_b: " + std::to_string(bottom->shape(2))); // = 368 = height
// log("width_b: " + std::to_string(bottom->shape(3))); // = 656 = width
// log("num_t: " + std::to_string(top->shape(0))); // = 1
// log("channel_t: " + std::to_string(top->shape(1))); // = 18 = numberParts
// log("height_t: " + std::to_string(top->shape(2))); // = 97 = maxPeople + 1
// log("width_t: " + std::to_string(top->shape(3))); // = 3 = [x, y, score]
// const dim3 threadsPerBlockSort{128};
// const dim3 numBlocksSort{getNumberCudaBlocks(channels, threadsPerBlockSort.x)};
// log("num_b: " + std::to_string(sourceSize[0])); // = 1
// log("channel_b: " + std::to_string(sourceSize[1])); // = 57 = 18 body parts + bkg + 19x2 PAFs
// log("height_b: " + std::to_string(sourceSize[2])); // = 368 = height
// log("width_b: " + std::to_string(sourceSize[3])); // = 656 = width
// log("num_t: " + std::to_string(targetSize[0])); // = 1
// log("channel_t: " + std::to_string(targetSize[1])); // = 18 = numberParts
// log("height_t: " + std::to_string(targetSize[2])); // = 128 = maxPeople + 1
// log("width_t: " + std::to_string(targetSize[3])); // = 3 = [x, y, score]
// log("");
for (auto n = 0; n < num; n++)
......@@ -157,15 +197,19 @@ namespace op
const auto* const sourcePtrOffsetted = sourcePtr + offsetChannel * imageOffset;
auto* targetPtrOffsetted = targetPtr + offsetChannel * offsetTarget;
// This returns kernelPtrOffsetted, a binary array with 0s & 1s. 1s in the local maximum positions (size = size(sourcePtrOffsetted))
nmsRegisterKernel<<<numBlocks2D, threadsPerBlock2D>>>(kernelPtrOffsetted, sourcePtrOffsetted, width, height, threshold); //[0,0,0,0,1,0,0,0,0,1,0,0,0,0]
// This returns kernelPtrOffsetted, a binary array with 0s & 1s. 1s in the local maximum
// positions (size = size(sourcePtrOffsetted))
// Example result: [0,0,0,0,1,0,0,0,0,1,0,0,0,0]
nmsRegisterKernel<<<numBlocks2D, threadsPerBlock2D>>>(
kernelPtrOffsetted, sourcePtrOffsetted, width, height, threshold);
// // Debug
// if (c==3)
// {
// char filename[50];
// sprintf(filename, "work%02d.txt", c);
// std::ofstream fout(filename);
// int* kernelPtrOffsetted_local = mKernelBlob.mutable_cpu_data() + n * parts_num * imageOffset + c * imageOffset;
// int* kernelPtrOffsetted_local = mKernelBlob.mutable_cpu_data()
// + n * parts_num * imageOffset + c * imageOffset;
// for (int y = 0; y < height; y++){
// for (int x = 0; x < width; x++)
// fout << kernelPtrOffsetted_local[y*width + x] << "\t";
......@@ -175,14 +219,21 @@ namespace op
// }
auto kernelThrustPtr = thrust::device_pointer_cast(kernelPtrOffsetted);
// This modifies kernelPtrOffsetted, now it indicates the local maximum indexes. Format: 0,0,0,1,1,1,1,2,2,2,... First maximum at index 2, second at 6, etc...
thrust::exclusive_scan(kernelThrustPtr, kernelThrustPtr + imageOffset, kernelThrustPtr); //[0,0,0,0,0,1,1,1,1,1,2,2,2,2]
// This modifies kernelPtrOffsetted, now it indicates the local maximum indexes
// Format: 0,0,0,1,1,1,1,2,2,2,... First maximum at index 2, second at 6, etc...
// Example result: [0,0,0,0,0,1,1,1,1,1,2,2,2,2]
thrust::exclusive_scan(kernelThrustPtr, kernelThrustPtr + imageOffset, kernelThrustPtr);
// This returns targetPtrOffsetted, with the NMS applied over it
writeResultKernel<<<numBlocks1D, threadsPerBlock1D>>>(targetPtrOffsetted, imageOffset,
kernelPtrOffsetted, sourcePtrOffsetted,
width, height, maxPeaks, offset.x, offset.y);
}
// // Sort based on score
// // Commented because it doesn't change accuracy
// // TODO: If finally used, implement for CPU/CL versions
// sortKernel<<<numBlocksSort, threadsPerBlockSort>>>(targetPtr, channels, offsetTarget);
}
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
}
......
......@@ -6,20 +6,19 @@
namespace op
{
const dim3 THREADS_PER_BLOCK{4, 16, 16};
template<typename T>
inline __device__ int intRoundGPU(const T a)
{
return int(a+0.5f);
return int(a+T(0.5));
}
template <typename T>
inline __device__ T process(const T* bodyPartA, const T* bodyPartB, const T* mapX, const T* mapY,
const int heatmapWidth, const int heatmapHeight, const T interThreshold = T(0.05),
const T interMinAboveThreshold = T(0.95), const T renderThreshold = T(0.05))
const int heatmapWidth, const int heatmapHeight, const T interThreshold,
const T interMinAboveThreshold)
{
T finalOutput = -1;
if (bodyPartA[2] < renderThreshold || bodyPartB[2] < renderThreshold) return finalOutput;
const auto vectorAToBX = bodyPartB[0] - bodyPartA[0];
const auto vectorAToBY = bodyPartB[1] - bodyPartA[1];
const auto vectorAToBMax = max(abs(vectorAToBX), abs(vectorAToBY));
......@@ -57,36 +56,47 @@ namespace op
// parts score + connection score
if (count/(float)numberPointsInLine > interMinAboveThreshold)
finalOutput = sum/count;
return sum/count;
}
return finalOutput;
return -1;
}
template <typename T>
__global__ void pafScoreKernel(T* finalOutputPtr, const T* const heatMapPtr, const T* const peaksPtr,
const unsigned int* const bodyPartPairsPtr, const unsigned int* const mapIdxPtr,
const unsigned int poseMaxPeople, const int numberBodyPartPairs,
const int heatmapWidth, const int heatmapHeight)
const unsigned int maxPeaks, const int numberBodyPartPairs,
const int heatmapWidth, const int heatmapHeight, const T interThreshold,
const T interMinAboveThreshold)
{
const auto i = (blockIdx.x * blockDim.x) + threadIdx.x;
const auto j = (blockIdx.y * blockDim.y) + threadIdx.y;
const auto k = (blockIdx.z * blockDim.z) + threadIdx.z;
const auto pairIndex = (blockIdx.x * blockDim.x) + threadIdx.x;
const auto peakA = (blockIdx.y * blockDim.y) + threadIdx.y;
const auto peakB = (blockIdx.z * blockDim.z) + threadIdx.z;
if (i < numberBodyPartPairs)
if (pairIndex < numberBodyPartPairs && peakA < maxPeaks && peakB < maxPeaks)
{
const int partA = bodyPartPairsPtr[i*2];
const int partB = bodyPartPairsPtr[i*2 + 1];
const int mapIdxX = mapIdxPtr[i*2];
const int mapIdxY = mapIdxPtr[i*2 + 1];
const T* const bodyPartA = peaksPtr + (partA*poseMaxPeople*3 + j*3);
const T* const bodyPartB = peaksPtr + (partB*poseMaxPeople*3 + k*3);
const T* const mapX = heatMapPtr + mapIdxX*heatmapWidth*heatmapHeight;
const T* const mapY = heatMapPtr + mapIdxY*heatmapWidth*heatmapHeight;
const T finalOutput = process(bodyPartA, bodyPartB, mapX, mapY, heatmapWidth, heatmapHeight);
finalOutputPtr[(i*poseMaxPeople+j)*poseMaxPeople + k] = finalOutput;
const auto baseIndex = 2*pairIndex;
const auto partA = bodyPartPairsPtr[baseIndex];
const auto partB = bodyPartPairsPtr[baseIndex + 1];
const T numberPeaksA = peaksPtr[3*partA*(maxPeaks+1)];
const T numberPeaksB = peaksPtr[3*partB*(maxPeaks+1)];
const auto outputIndex = (pairIndex*maxPeaks+peakA)*maxPeaks + peakB;
if (peakA < numberPeaksA && peakB < numberPeaksB)
{
const auto mapIdxX = mapIdxPtr[baseIndex];
const auto mapIdxY = mapIdxPtr[baseIndex + 1];
const T* const bodyPartA = peaksPtr + (3*(partA*(maxPeaks+1) + peakA+1));
const T* const bodyPartB = peaksPtr + (3*(partB*(maxPeaks+1) + peakB+1));
const T* const mapX = heatMapPtr + mapIdxX*heatmapWidth*heatmapHeight;
const T* const mapY = heatMapPtr + mapIdxY*heatmapWidth*heatmapHeight;
finalOutputPtr[outputIndex] = process(
bodyPartA, bodyPartB, mapX, mapY, heatmapWidth, heatmapHeight, interThreshold,
interMinAboveThreshold);
}
else
finalOutputPtr[outputIndex] = -1;
}
}
......@@ -95,50 +105,37 @@ namespace op
const T* const peaksPtr, const PoseModel poseModel, const Point<int>& heatMapSize,
const int maxPeaks, const T interMinAboveThreshold, const T interThreshold,
const int minSubsetCnt, const T minSubsetScore, const T scaleFactor,
Array<T> finalOutputCpu, T* finalOutputGpuPtr,
const unsigned int* const bodyPartPairsGpuPtr, const unsigned int* const mapIdxGpuPtr,
const T* const peaksGpuPtr)
{
try
{
// Parts Connection
const auto& bodyPartPairs = getPosePartPairs(poseModel);
const auto& mapIdxOffset = getPoseMapIndex(poseModel);
const auto numberBodyParts = getPoseNumberBodyParts(poseModel);
const auto numberBodyPartPairs = bodyPartPairs.size() / 2;
const auto subsetCounterIndex = numberBodyParts;
// Update mapIdx
auto mapIdx = mapIdxOffset;
for (auto& i : mapIdx)
i += (numberBodyParts+1);
const auto totalComputations = finalOutputCpu.getVolume();
if (numberBodyParts == 0)
error("Invalid value of numberBodyParts, it must be positive, not " + std::to_string(numberBodyParts),
__LINE__, __FUNCTION__, __FILE__);
if (bodyPartPairsGpuPtr == nullptr || mapIdxGpuPtr == nullptr)
error("The pointers bodyPartPairsGpuPtr and mapIdxGpuPtr cannot be nullptr.",
__LINE__, __FUNCTION__, __FILE__);
// Upload required data to GPU
unsigned int* bodyPartPairsGpuPtr;
cudaMalloc((void **)&bodyPartPairsGpuPtr, bodyPartPairs.size() * sizeof(unsigned int));
cudaMemcpy(bodyPartPairsGpuPtr, &bodyPartPairs[0], bodyPartPairs.size() * sizeof(unsigned int),
cudaMemcpyHostToDevice);
unsigned int* mapIdxGpuPtr;
cudaMalloc((void **)&mapIdxGpuPtr, mapIdx.size() * sizeof(unsigned int));
cudaMemcpy(mapIdxGpuPtr, &mapIdx[0], mapIdx.size() * sizeof(unsigned int), cudaMemcpyHostToDevice);
T* finalOutputGpuPtr;
Array<T> finalOutputCpu;
finalOutputCpu.reset({(int)numberBodyPartPairs, (int)POSE_MAX_PEOPLE, (int)POSE_MAX_PEOPLE},-1);
int totalComputations = numberBodyPartPairs * POSE_MAX_PEOPLE * POSE_MAX_PEOPLE;
cudaMalloc((void **)&finalOutputGpuPtr, totalComputations * sizeof(float));
// Run Kernel
const dim3 threadsPerBlock{4, 8, 8}; //4 is good for BODY_25, 8 for COCO?
if ((POSE_MAX_PEOPLE+1) % threadsPerBlock.y || (POSE_MAX_PEOPLE+1) % threadsPerBlock.z)
error("Invalid value of POSE_MAX_PEOPLE, it must be multiple of 16, rather than "
+ std::to_string(POSE_MAX_PEOPLE), __LINE__, __FUNCTION__, __FILE__);
int pairBlocks = intRound((numberBodyPartPairs/threadsPerBlock.x) + 0.5);
const dim3 numBlocks{(unsigned int)pairBlocks, (POSE_MAX_PEOPLE+1) / threadsPerBlock.y,
(POSE_MAX_PEOPLE+1) / threadsPerBlock.z};
pafScoreKernel<<<numBlocks, threadsPerBlock>>>(
// Run Kernel - finalOutputGpu
const dim3 numBlocks{
getNumberCudaBlocks(numberBodyPartPairs, THREADS_PER_BLOCK.x),
getNumberCudaBlocks(maxPeaks, THREADS_PER_BLOCK.y),
getNumberCudaBlocks(maxPeaks, THREADS_PER_BLOCK.z)};
pafScoreKernel<<<numBlocks, THREADS_PER_BLOCK>>>(
finalOutputGpuPtr, heatMapGpuPtr, peaksGpuPtr, bodyPartPairsGpuPtr, mapIdxGpuPtr,
POSE_MAX_PEOPLE, numberBodyPartPairs, heatMapSize.x, heatMapSize.y);
maxPeaks, numberBodyPartPairs, heatMapSize.x, heatMapSize.y, interThreshold,
interMinAboveThreshold);
// finalOutputCpu <-- finalOutputGpu
cudaMemcpy(finalOutputCpu.getPtr(), finalOutputGpuPtr, totalComputations * sizeof(float),
cudaMemcpyDeviceToHost);
......@@ -153,10 +150,10 @@ namespace op
// Delete people below the following thresholds:
// a) minSubsetCnt: removed if less than minSubsetCnt body parts
// b) minSubsetScore: removed if global score smaller than this
// c) POSE_MAX_PEOPLE: keep first POSE_MAX_PEOPLE people above thresholds
// c) maxPeaks (POSE_MAX_PEOPLE): keep first maxPeaks people above thresholds
int numberPeople;
std::vector<int> validSubsetIndexes;
validSubsetIndexes.reserve(fastMin((size_t)POSE_MAX_PEOPLE, subsets.size()));
validSubsetIndexes.reserve(fastMin((size_t)maxPeaks, subsets.size()));
removeSubsetsBelowThresholds(validSubsetIndexes, numberPeople, subsets, subsetCounterIndex,
numberBodyParts, minSubsetCnt, minSubsetScore, maxPeaks);
......@@ -164,10 +161,7 @@ namespace op
subsetsToPoseKeypointsAndScores(poseKeypoints, poseScores, scaleFactor, subsets, validSubsetIndexes,
peaksPtr, numberPeople, numberBodyParts, numberBodyPartPairs);
// Differences w.r.t. CPU version for now
cudaFree(bodyPartPairsGpuPtr);
cudaFree(mapIdxGpuPtr);
cudaFree(finalOutputGpuPtr);
// Sanity check
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
}
catch (const std::exception& e)
......@@ -181,11 +175,17 @@ namespace op
const PoseModel poseModel, const Point<int>& heatMapSize, const int maxPeaks,
const float interMinAboveThreshold, const float interThreshold,
const int minSubsetCnt, const float minSubsetScore, const float scaleFactor,
Array<float> finalOutputCpu, float* finalOutputGpuPtr,
const unsigned int* const bodyPartPairsGpuPtr,
const unsigned int* const mapIdxGpuPtr,
const float* const peaksGpuPtr);
template void connectBodyPartsGpu(Array<double>& poseKeypoints, Array<double>& poseScores,
const double* const heatMapGpuPtr, const double* const peaksPtr,
const PoseModel poseModel, const Point<int>& heatMapSize, const int maxPeaks,
const double interMinAboveThreshold, const double interThreshold,
const int minSubsetCnt, const double minSubsetScore, const double scaleFactor,
Array<double> finalOutputCpu, double* finalOutputGpuPtr,
const unsigned int* const bodyPartPairsGpuPtr,
const unsigned int* const mapIdxGpuPtr,
const double* const peaksGpuPtr);
}
#ifdef USE_CAFFE
#include <caffe/blob.hpp>
#endif
#ifdef USE_CUDA
#include <openpose/gpu/cuda.hpp>
#endif
#include <openpose/pose/bodyPartConnectorBase.hpp>
#include <openpose/pose/poseParameters.hpp>
#include <openpose/pose/bodyPartConnectorCaffe.hpp>
namespace op
{
template <typename T>
BodyPartConnectorCaffe<T>::BodyPartConnectorCaffe()
BodyPartConnectorCaffe<T>::BodyPartConnectorCaffe() :
pBodyPartPairsGpuPtr{nullptr},
pMapIdxGpuPtr{nullptr},
pFinalOutputGpuPtr{nullptr}
{
try
{
......@@ -22,6 +29,23 @@ namespace op
}
}
template <typename T>
BodyPartConnectorCaffe<T>::~BodyPartConnectorCaffe()
{
try
{
#if defined USE_CAFFE && defined USE_CUDA
cudaFree(pBodyPartPairsGpuPtr);
cudaFree(pMapIdxGpuPtr);
cudaFree(pFinalOutputGpuPtr);
#endif
}
catch (const std::exception& e)
{
error(e.what(), __LINE__, __FUNCTION__, __FILE__);
}
}
template <typename T>
void BodyPartConnectorCaffe<T>::Reshape(const std::vector<caffe::Blob<T>*>& bottom)
{
......@@ -162,15 +186,58 @@ namespace op
try
{
#if defined USE_CAFFE && defined USE_CUDA
// Global data
const auto heatMapsBlob = bottom.at(0);
const auto* const peaksPtr = bottom.at(1)->cpu_data();
const auto* const heatMapsGpuPtr = heatMapsBlob->gpu_data();
const auto* const peaksGpuPtr = bottom.at(1)->gpu_data();
const auto* const peaksPtr = bottom.at(1)->cpu_data();
const auto maxPeaks = mTopSize[1];
const auto* const peaksGpuPtr = bottom.at(1)->gpu_data();
// Initialize fixed pointers (1-time task) - It must be done in the same thread than Forward_gpu
if (pBodyPartPairsGpuPtr == nullptr || pMapIdxGpuPtr == nullptr)
{
// Free previous memory
cudaFree(pBodyPartPairsGpuPtr);
cudaFree(pMapIdxGpuPtr);
// Data
const auto& bodyPartPairs = getPosePartPairs(mPoseModel);
const auto numberBodyParts = getPoseNumberBodyParts(mPoseModel);
const auto& mapIdxOffset = getPoseMapIndex(mPoseModel);
// Update mapIdx
auto mapIdx = mapIdxOffset;
for (auto& i : mapIdx)
i += (numberBodyParts+1);
// Re-allocate memory
cudaMalloc((void **)&pBodyPartPairsGpuPtr, bodyPartPairs.size() * sizeof(unsigned int));
cudaMemcpy(pBodyPartPairsGpuPtr, &bodyPartPairs[0], bodyPartPairs.size() * sizeof(unsigned int),
cudaMemcpyHostToDevice);
cudaMalloc((void **)&pMapIdxGpuPtr, mapIdx.size() * sizeof(unsigned int));
cudaMemcpy(pMapIdxGpuPtr, &mapIdx[0], mapIdx.size() * sizeof(unsigned int),
cudaMemcpyHostToDevice);
// Sanity check
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
}
// Initialize auxiliary pointers (1-time task)
if (mFinalOutputCpu.empty()) // if (pFinalOutputGpuPtr == nullptr)
{
// Data
const auto& bodyPartPairs = getPosePartPairs(mPoseModel);
const auto numberBodyPartPairs = bodyPartPairs.size() / 2;
// Allocate memory
mFinalOutputCpu.reset({(int)numberBodyPartPairs, maxPeaks, maxPeaks});
const auto totalComputations = mFinalOutputCpu.getVolume();
if (pFinalOutputGpuPtr == nullptr)
cudaMalloc((void **)&pFinalOutputGpuPtr, totalComputations * sizeof(float));
// Sanity check
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
}
// Run body part connector
connectBodyPartsGpu(poseKeypoints, poseScores, heatMapsGpuPtr, peaksPtr, mPoseModel,
Point<int>{heatMapsBlob->shape(3), heatMapsBlob->shape(2)},
maxPeaks, mInterMinAboveThreshold, mInterThreshold,
mMinSubsetCnt, mMinSubsetScore, mScaleNetToOutput, peaksGpuPtr);
mMinSubsetCnt, mMinSubsetScore, mScaleNetToOutput, mFinalOutputCpu,
pFinalOutputGpuPtr, pBodyPartPairsGpuPtr, pMapIdxGpuPtr, peaksGpuPtr);
#else
UNUSED(bottom);
UNUSED(poseKeypoints);
......
......@@ -294,15 +294,23 @@ namespace op
upImpl->spBodyPartConnectorCaffe->setMinSubsetCnt((int)get(PoseProperty::ConnectMinSubsetCnt));
upImpl->spBodyPartConnectorCaffe->setMinSubsetScore((float)get(PoseProperty::ConnectMinSubsetScore));
// #ifdef USE_CUDA
// upImpl->spBodyPartConnectorCaffe->Forward_gpu({upImpl->spHeatMapsBlob.get(),
// upImpl->spPeaksBlob.get()},
// mPoseKeypoints, mPoseScores);
// #else
#ifdef USE_CUDA
// BODY_25D only implemented for CPU version
if (mPoseModel == PoseModel::BODY_25D)
upImpl->spBodyPartConnectorCaffe->Forward_cpu({upImpl->spHeatMapsBlob.get(),
upImpl->spPeaksBlob.get()},
mPoseKeypoints, mPoseScores);
else
upImpl->spBodyPartConnectorCaffe->Forward_gpu({upImpl->spHeatMapsBlob.get(),
upImpl->spPeaksBlob.get()},
mPoseKeypoints, mPoseScores);
#else
upImpl->spBodyPartConnectorCaffe->Forward_cpu({upImpl->spHeatMapsBlob.get(),
upImpl->spPeaksBlob.get()},
mPoseKeypoints, mPoseScores);
// #endif
#endif
// 5. CUDA sanity check
#ifdef USE_CUDA
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
#endif
......
......@@ -533,13 +533,17 @@ namespace op
if (googlyEyes && (poseModel == PoseModel::MPI_15 || poseModel == PoseModel::MPI_15_4))
error("Bool googlyEyes not compatible with MPI models.",
__LINE__, __FUNCTION__, __FILE__);
if (numberPeople > POSE_MAX_PEOPLE)
error("Rendering assumes that numberPeople <= POSE_MAX_PEOPLE = " + std::to_string(POSE_MAX_PEOPLE)
+ ".", __LINE__, __FUNCTION__, __FILE__);
dim3 threadsPerBlock;
dim3 numBlocks;
getNumberCudaThreadsAndBlocks(threadsPerBlock, numBlocks, frameSize);
// Body pose
if (poseModel == PoseModel::BODY_25 || poseModel == PoseModel::BODY_25_19 || poseModel == PoseModel::BODY_25D)
if (poseModel == PoseModel::BODY_25 || poseModel == PoseModel::BODY_25_19
|| poseModel == PoseModel::BODY_25D)
renderPoseBody25<<<threadsPerBlock, numBlocks>>>(
framePtr, frameSize.x, frameSize.y, posePtr, numberPeople, renderThreshold, googlyEyes,
blendOriginalFrame, alphaBlending
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册