Cuda¶

Dynamic Link Libs¶

hl_dso_loader.h¶

Functions

void GetCublasDsoHandle(void **dso_handle)¶

load the DSO of CUBLAS

Parameters

dso_handle -
output pointer that receives the DSO handle.

void GetCudnnDsoHandle(void **dso_handle)¶

load the DSO of CUDNN

Parameters

dso_handle -
output pointer that receives the DSO handle.

void GetCudartDsoHandle(void **dso_handle)¶

load the DSO of CUDA Run Time

Parameters

dso_handle -
output pointer that receives the DSO handle.

void GetCurandDsoHandle(void **dso_handle)¶

load the DSO of CURAND

Parameters

dso_handle -
output pointer that receives the DSO handle.

GPU Resources¶

hl_cuda.ph¶

hl_cuda.h¶

Typedefs

typedef struct _hl_event_st *hl_event_t¶: HPPL event.

Functions

int hl_get_cuda_lib_version()¶: return cuda runtime api version.

void hl_start()¶: HPPL start (initialize all GPUs).

void hl_specify_devices_start(int *device, int number)¶

HPPL start (initialize the specified GPUs).

Parameters

device -
array of device ids (0, 1, ...). If device is NULL, all GPUs are started.
number -
number of devices.

bool hl_device_can_access_peer(int device, int peerDevice)¶

Queries if a device may directly access a peer device’s memory.

Return

Returns true if device is capable of directly accessing memory from peerDevice and false otherwise.

Parameters

device -
Device from which allocations on peerDevice are to be directly accessed.
peerDevice -
Device on which the allocations to be directly accessed by device reside.

void hl_device_enable_peer_access(int peerDevice)¶

Enables direct access to memory allocations on a peer device.

Parameters

peerDevice -
Peer device to enable direct access to from the current device

void hl_init(int device)¶

Init a work thread.

Parameters

device -
device id.

void hl_fini()¶: Finish a work thread.

void hl_set_sync_flag(bool flag)¶

Set synchronous/asynchronous flag.

Note

This setting is only valid for the current worker thread.

Parameters

flag -
true(default), set synchronous flag. false, set asynchronous flag.

bool hl_get_sync_flag()¶

Get synchronous/asynchronous flag.

Return: true for synchronous calls, false for asynchronous calls.
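The note that the flag is only valid for the current worker thread can be pictured with thread-local storage. The following is a minimal sketch, not the HPPL implementation; the `ref_` names are hypothetical stand-ins for `hl_set_sync_flag` and `hl_get_sync_flag`.

```cpp
// Hypothetical stand-in for a per-thread sync flag; not the library code.
// Each thread gets its own copy; true (synchronous) is the documented default.
thread_local bool ref_sync_flag = true;

// Changes the flag for the calling thread only.
void ref_set_sync_flag(bool flag) { ref_sync_flag = flag; }

// Reads the flag of the calling thread.
bool ref_get_sync_flag() { return ref_sync_flag; }
```

Under this model, a thread that never calls the setter keeps the default value regardless of what other threads do.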

int hl_get_device_count()¶: Returns the number of compute-capable devices.

void hl_set_device(int device)¶

Set device to be used.

Parameters

device -
device id.

int hl_get_device()¶

Returns which device is currently being used.

Return: device id.

void *hl_malloc_device(size_t size)¶

Allocate device memory.

Return

pointer to the allocated device memory.

Parameters

size -
size in bytes to allocate.

void hl_free_mem_device(void *dest_d)¶

Free device memory.

Parameters

dest_d -
pointer to device memory.

void *hl_malloc_host(size_t size)¶

Allocate host page-locked memory.

Return

pointer to the allocated host memory.

Parameters

size -
size in bytes to allocate.

void hl_free_mem_host(void *dest_h)¶

Free host page-locked memory.

Parameters

dest_h -
pointer to host memory.

void hl_memcpy(void *dst, void *src, size_t size)¶

Copy data.

Parameters

dst -
dst memory address(host or device).
src -
src memory address(host or device).
size -
size in bytes to copy.

void hl_memset_device(void *dest_d, int value, size_t size)¶

Set device memory to a value.

Parameters

dest_d -
pointer to device memory.
value -
value to set for each byte of specified memory.
size -
size in bytes to set.
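Because value is applied to each byte, filling a buffer of wider elements with a non-zero value does not store that value in each element. The host-side sketch below uses std::memset as an analogue of the per-byte fill; `fill_bytes` is a hypothetical helper, not part of HPPL.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side analogue of a per-byte fill (std::memset), for illustration only.
// Fills `count` 32-bit words byte by byte and returns the buffer.
std::vector<std::uint32_t> fill_bytes(int value, std::size_t count) {
  std::vector<std::uint32_t> buf(count);
  if (count != 0) {
    // Every byte receives the low 8 bits of `value`.
    std::memset(buf.data(), value, count * sizeof(std::uint32_t));
  }
  return buf;
}
```

With value 1, each 32-bit word becomes 0x01010101 rather than 1, which is why zero is the usual fill value for numeric buffers.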

void hl_memcpy_host2device(void *dest_d, void *src_h, size_t size)¶

Copy host memory to device memory.

Parameters

dest_d -
dst memory address.
src_h -
src memory address.
size -
size in bytes to copy.

void hl_memcpy_device2host(void *dest_h, void *src_d, size_t size)¶

Copy device memory to host memory.

Parameters

dest_h -
dst memory address.
src_d -
src memory address.
size -
size in bytes to copy.

void hl_memcpy_device2device(void *dest_d, void *src_d, size_t size)¶

Copy device memory to device memory.

Parameters

dest_d -
dst memory address.
src_d -
src memory address.
size -
size in bytes to copy.

void hl_rand(real *dest_d, size_t num)¶

Generate uniformly distributed floats (0, 1.0].

Parameters

dest_d -
pointer to device memory to store results.
num -
number of floats to generate.

void hl_srand(unsigned int seed)¶

Set the seed value of the random number generator.

Parameters

seed -
seed value.
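The interval (0, 1.0] excludes zero and includes one, which is the reverse of what many host generators produce. A minimal host-side sketch of such a generator follows; it is not the HPPL code, and the function name, the use of std::mt19937, and the way the seed is passed are all illustrative assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Host-side sketch: `num` floats in the half-open interval (0, 1].
// The generator and the seed handling are illustrative, not HPPL's.
std::vector<float> uniform_open_closed(std::size_t num, unsigned int seed) {
  std::mt19937 gen(seed);
  std::vector<float> out(num);
  for (std::size_t i = 0; i < num; ++i) {
    // Keep the top 24 bits so the result is exactly representable as a float.
    std::uint32_t bits = static_cast<std::uint32_t>(gen()) >> 8;  // 0 .. 2^24 - 1
    // Shift the range by one step: (bits + 1) / 2^24 lies in (0, 1].
    out[i] = static_cast<float>(bits + 1) * (1.0f / 16777216.0f);
  }
  return out;
}
```

Reusing the same seed reproduces the same sequence, which is the usual reason for exposing a seed setter.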

void hl_memcpy_async(void *dst, void *src, size_t size, hl_stream_t stream)¶

Copy data.

Parameters

dst -
dst memory address(host or device).
src -
src memory address(host or device).
size -
size in bytes to copy.
stream -
stream id.

void hl_stream_synchronize(hl_stream_t stream)¶

Waits for stream tasks to complete.

Parameters

stream -
stream id.

void hl_create_event(hl_event_t *event)¶

Creates an event object.

Parameters

event -
New event.

void hl_destroy_event(hl_event_t event)¶

Destroys an event object.

Parameters

event -
Event to destroy.

float hl_event_elapsed_time(hl_event_t start, hl_event_t end)¶

Computes the elapsed time between events.

Return

Time between start and end in ms.

Parameters

start -
Starting event.
end -
Ending event.

void hl_stream_record_event(hl_stream_t stream, hl_event_t event)¶

Records an event.

Parameters

stream -
Stream in which to insert event.
event -
Event to record.

void hl_stream_wait_event(hl_stream_t stream, hl_event_t event)¶

Make a compute stream wait on an event.

Parameters

stream -
Stream that waits on the event.
event -
Event to wait on.

void hl_event_synchronize(hl_event_t event)¶

Wait for an event to complete.

Parameters

event -
event to wait for.

void hl_set_device_flags_block()¶

Sets block flags to be used for device executions.

Note: This interface needs to be called before hl_start.

const char *hl_get_device_error_string()¶: Returns the last error string from a cuda runtime call.

const char *hl_get_device_error_string(size_t err)¶

Returns the error string for the given cuda runtime error number.

See

hl_get_device_last_error()

Parameters

err -
error number.

int hl_get_device_last_error()¶

Returns the last error number.

Return: error number.
See: hl_get_device_error_string()

void hl_cuda_event_query(hl_event_t event, bool &isNotReady)¶

hppl query event.

Parameters

event -
cuda event to query.
isNotReady -
set to true if the work captured by the event has not yet completed, false otherwise.

void hl_device_synchronize()¶: hppl device synchronization.

CUDA Wrapper¶

hl_cuda_cublas.h¶

Functions

void hl_matrix_transpose(real *A_d, real *C_d, int dimM, int dimN, int lda, int ldc)¶

Matrix transpose: C_d = T(A_d)

Parameters

A_d -
input matrix (M x N).
C_d -
output matrix (N x M).
dimM -
matrix height.
dimN -
matrix width.
lda -
the first dimension of A_d.
ldc -
the first dimension of C_d.
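The role of lda and ldc is easiest to see in a CPU reference. The sketch below assumes row-major storage, so that "the first dimension" is the number of elements between the starts of consecutive rows, and it assumes `real` is float; both are assumptions, and `ref_matrix_transpose` is a hypothetical name rather than HPPL code.

```cpp
#include <cassert>

typedef float real;  // assumption: single-precision build

// CPU reference for C = T(A) with explicit leading dimensions.
// Assumes row-major storage: A(i, j) = A[i * lda + j], C(j, i) = C[j * ldc + i].
void ref_matrix_transpose(const real* A, real* C, int dimM, int dimN,
                          int lda, int ldc) {
  // Each row must be at least as long as the logical matrix width.
  assert(lda >= dimN && ldc >= dimM);
  for (int i = 0; i < dimM; ++i) {
    for (int j = 0; j < dimN; ++j) {
      C[j * ldc + i] = A[i * lda + j];
    }
  }
}
```

For densely packed matrices the natural choice under this layout is lda = dimN and ldc = dimM, which is presumably what the overload without leading dimensions uses.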

void hl_matrix_transpose(real *A_d, real *C_d, int dimM, int dimN)¶

void hl_matrix_mul(real *A_d, hl_trans_op_t transa, real *B_d, hl_trans_op_t transb, real *C_d, int dimM, int dimN, int dimK, real alpha, real beta, int lda, int ldb, int ldc)¶

C_d = alpha*(op(A_d) * op(B_d)) + beta*C_d.

Parameters

A_d -
input.
transa -
selects op(A): no transpose or transpose.
B_d -
input.
transb -
selects op(B): no transpose or transpose.
C_d -
output.
dimM -
matrix height of op(A) & C
dimN -
matrix width of op(B) & C
dimK -
width of op(A) & height of op(B)
alpha -
scalar used for multiplication.
beta -
scalar used for multiplication.
lda -
the first dimension of A_d.
ldb -
the first dimension of B_d.
ldc -
the first dimension of C_d.
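The formula C_d = alpha*(op(A_d) * op(B_d)) + beta*C_d can be checked against a plain CPU loop. The sketch below is a reference under stated assumptions, not the HPPL implementation: storage is taken to be row-major, `real` is taken to be float, and `ref_trans_op_t` with `REF_OP_N` / `REF_OP_T` stands in for `hl_trans_op_t` with `HPPL_OP_N` / `HPPL_OP_T`.

```cpp
#include <cassert>

typedef float real;  // assumption: single-precision build

// Stand-ins for hl_trans_op_t and its HPPL_OP_N / HPPL_OP_T values.
enum ref_trans_op_t { REF_OP_N = 0, REF_OP_T = 1 };

// CPU reference for C = alpha * op(A) * op(B) + beta * C.
// Assumes row-major storage; op(A) is dimM x dimK, op(B) is dimK x dimN.
void ref_matrix_mul(const real* A, ref_trans_op_t transa,
                    const real* B, ref_trans_op_t transb,
                    real* C, int dimM, int dimN, int dimK,
                    real alpha, real beta, int lda, int ldb, int ldc) {
  assert(dimM >= 0 && dimN >= 0 && dimK >= 0);
  for (int i = 0; i < dimM; ++i) {
    for (int j = 0; j < dimN; ++j) {
      real acc = 0;
      for (int k = 0; k < dimK; ++k) {
        // A transposed operand is read with its indices swapped.
        real a = (transa == REF_OP_N) ? A[i * lda + k] : A[k * lda + i];
        real b = (transb == REF_OP_N) ? B[k * ldb + j] : B[j * ldb + k];
        acc += a * b;
      }
      C[i * ldc + j] = alpha * acc + beta * C[i * ldc + j];
    }
  }
}
```

Setting beta to 0 overwrites C, while beta = 1 accumulates the product into the existing contents.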

void hl_matrix_mul(real *A_d, hl_trans_op_t transa, real *B_d, hl_trans_op_t transb, real *C_d, int dimM, int dimN, int dimK, real alpha, real beta)¶

C_d = alpha*(op(A_d) * op(B_d)) + beta*C_d.

Parameters

A_d -
input.
transa -
selects op(A): no transpose or transpose.
B_d -
input.
transb -
selects op(B): no transpose or transpose.
C_d -
output.
dimM -
matrix height of op(A) & C
dimN -
matrix width of op(B) & C
dimK -
width of op(A) & height of op(B)
alpha -
scalar used for multiplication.
beta -
scalar used for multiplication.

void hl_matrix_mul_vector(real *A_d, hl_trans_op_t trans, real *B_d, real *C_d, int dimM, int dimN, real alpha, real beta, int lda, int incb, int incc)¶

This function performs the matrix-vector multiplication. C_d = alpha*op(A_d)*B_d + beta*C_d.

Parameters

A_d -
matrix.
trans -
selects op(A): no transpose or transpose.
B_d -
vector with dimN(dimM) elements if trans==HPPL_OP_N(HPPL_OP_T).
C_d -
vector with dimM(dimN) elements if trans==HPPL_OP_N(HPPL_OP_T).
dimM -
number of rows of matrix A_d.
dimN -
number of columns of matrix A_d.
alpha -
scalar used for multiplication.
beta -
scalar used for multiplication.
lda -
the first dimension of A_d.
incb -
stride between consecutive elements of B_d.
incc -
stride between consecutive elements of C_d.
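The vector lengths swap with the transpose flag: B_d has dimN elements and C_d has dimM elements without transposition, and the reverse with it. The CPU sketch below illustrates this; it assumes row-major storage for A_d, float for `real`, and that incb and incc are element strides in the BLAS sense. The `ref_` names are hypothetical.

```cpp
#include <cassert>

typedef float real;  // assumption: single-precision build

// Stand-ins for hl_trans_op_t and its HPPL_OP_N / HPPL_OP_T values.
enum ref_trans_op_t { REF_OP_N = 0, REF_OP_T = 1 };

// CPU reference for C = alpha * op(A) * B + beta * C with vectors B and C.
// Assumes A is dimM x dimN in row-major order and incb / incc are element strides.
void ref_matrix_mul_vector(const real* A, ref_trans_op_t trans,
                           const real* B, real* C, int dimM, int dimN,
                           real alpha, real beta, int lda, int incb, int incc) {
  assert(incb > 0 && incc > 0);
  const int rows = (trans == REF_OP_N) ? dimM : dimN;  // number of elements in C
  const int cols = (trans == REF_OP_N) ? dimN : dimM;  // number of elements in B
  for (int i = 0; i < rows; ++i) {
    real acc = 0;
    for (int k = 0; k < cols; ++k) {
      real a = (trans == REF_OP_N) ? A[i * lda + k] : A[k * lda + i];
      acc += a * B[k * incb];
    }
    C[i * incc] = alpha * acc + beta * C[i * incc];
  }
}
```

A stride of 1 corresponds to densely packed vectors; a larger stride lets the routine read a vector embedded in a wider buffer.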

void hl_matrix_mul_vector(real *A_d, hl_trans_op_t trans, real *B_d, real *C_d, int dimM, int dimN, real alpha, real beta)¶

This function performs the matrix-vector multiplication. C_d = alpha*op(A_d)*B_d + beta*C_d.

Parameters

A_d -
matrix.
trans -
selects op(A): no transpose or transpose.
B_d -
vector with dimN(dimM) elements if trans==HPPL_OP_N(HPPL_OP_T).
C_d -
vector with dimM(dimN) elements if trans==HPPL_OP_N(HPPL_OP_T).
dimM -
number of rows of matrix A_d.
dimN -
number of columns of matrix A_d.
alpha -
scalar used for multiplication.
beta -
scalar used for multiplication.

hl_cuda_cudnn.h¶

Typedefs

typedef struct _hl_tensor_descriptor *hl_tensor_descriptor¶: hppl image descriptor.

typedef struct _hl_pooling_descriptor *hl_pooling_descriptor¶: hppl pooling descriptor.

typedef struct _hl_filter_descriptor *hl_filter_descriptor¶: hppl filter descriptor.

typedef struct _hl_convolution_descriptor *hl_convolution_descriptor¶: hppl convolution descriptor.

Enums

enum hl_pooling_mode_t¶

Values:

HL_POOLING_MAX = 0¶

HL_POOLING_AVERAGE = 1¶

HL_POOLING_AVERAGE_EXCLUDE_PADDING = 2¶

HL_POOLING_END¶
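The difference between the two averaging modes only shows up when a window overlaps the padding. The one-dimensional sketch below assumes, based on the enum names, that HL_POOLING_AVERAGE divides by the full window size and HL_POOLING_AVERAGE_EXCLUDE_PADDING divides by the number of real elements. The `REF_` names are stand-ins, and HL_POOLING_END is omitted because it appears to be an end marker rather than a mode.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Mirrors the enum values listed above; the REF_ names are stand-ins.
enum ref_pooling_mode_t {
  REF_POOLING_MAX = 0,
  REF_POOLING_AVERAGE = 1,
  REF_POOLING_AVERAGE_EXCLUDE_PADDING = 2
};

// Pools one 1-D window starting at `start` (negative when it reaches into padding).
// Assumes padded positions add zero to the sum and are ignored by the max.
float ref_pool_window(const std::vector<float>& in, int start, int window,
                      ref_pooling_mode_t mode) {
  assert(window > 0);
  const int n = static_cast<int>(in.size());
  const int lo = std::max(start, 0);
  const int hi = std::min(start + window, n);
  assert(lo < hi);  // the window must cover at least one real element
  float sum = 0.0f;
  float best = in[lo];
  for (int i = lo; i < hi; ++i) {
    sum += in[i];
    best = std::max(best, in[i]);
  }
  if (mode == REF_POOLING_MAX) return best;
  // AVERAGE divides by the full window, EXCLUDE_PADDING by the valid count only.
  const int denom = (mode == REF_POOLING_AVERAGE) ? window : (hi - lo);
  return sum / static_cast<float>(denom);
}
```

For windows that lie entirely inside the input, the two averaging modes give the same result.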

Functions

int hl_get_cudnn_lib_version()¶: return cudnn lib version

void hl_create_tensor_descriptor(hl_tensor_descriptor *image_desc)¶

create image descriptor.

Parameters

image_desc -
image descriptor.

void hl_tensor_reshape(hl_tensor_descriptor image_desc, int batch_size, int feature_maps, int height, int width)¶

reshape image descriptor.

Parameters

image_desc -
image descriptor.
batch_size -
input batch size.
feature_maps -
image feature maps.
height -
image height.
width -
image width.

void hl_tensor_reshape(hl_tensor_descriptor image_desc, int batch_size, int feature_maps, int height, int width, int nStride, int cStride, int hStride, int wStride)¶

reshape image descriptor.

Parameters

image_desc -
image descriptor.
batch_size -
input batch size.
feature_maps -
image feature maps.
height -
image height.
width -
image width.
nStride -
stride between two consecutive images.
cStride -
stride between two consecutive feature maps.
hStride -
stride between two consecutive rows.
wStride -
stride between two consecutive columns.
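The four strides determine where element (n, c, h, w) lives in memory. The sketch below computes that offset and also the strides of a fully packed NCHW tensor; that the overload without stride arguments corresponds to this packed layout is an assumption, and the `ref_` names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Strides in elements, one per tensor axis.
struct ref_strides {
  int n, c, h, w;
};

// Strides of a fully packed NCHW tensor (assumed default layout).
ref_strides ref_packed_nchw_strides(int feature_maps, int height, int width) {
  ref_strides s;
  s.w = 1;                              // neighbouring columns are adjacent
  s.h = width;                          // one row further
  s.c = height * width;                 // one feature map further
  s.n = feature_maps * height * width;  // one image further
  return s;
}

// Linear offset (in elements) of element (n, c, h, w) for the given strides.
std::size_t ref_tensor_offset(int n, int c, int h, int w, const ref_strides& s) {
  assert(n >= 0 && c >= 0 && h >= 0 && w >= 0);
  assert(s.n >= 0 && s.c >= 0 && s.h >= 0 && s.w >= 0);
  return static_cast<std::size_t>(n) * static_cast<std::size_t>(s.n) +
         static_cast<std::size_t>(c) * static_cast<std::size_t>(s.c) +
         static_cast<std::size_t>(h) * static_cast<std::size_t>(s.h) +
         static_cast<std::size_t>(w) * static_cast<std::size_t>(s.w);
}
```

Passing larger strides than the packed ones describes a view into a bigger buffer, for example one channel group of a wider tensor.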

void hl_destroy_tensor_descriptor(hl_tensor_descriptor image_desc)¶

destroy image descriptor.

Parameters

image_desc -
hppl image descriptor.

void hl_create_pooling_descriptor(hl_pooling_descriptor *pooling_desc, hl_pooling_mode_t mode, int height, int width, int height_padding, int width_padding, int stride_height, int stride_width)¶

create pooling descriptor.

Parameters

pooling_desc -
pooling descriptor.
mode -
pooling mode.
height -
height of the pooling window.
width -
width of the pooling window.
height_padding -
padding height.
width_padding -
padding width.
stride_height -
pooling vertical stride.
stride_width -
pooling horizontal stride.

void hl_destroy_pooling_descriptor(hl_pooling_descriptor pooling_desc)¶

destroy pooling descriptor.

Parameters

pooling_desc -
hppl pooling descriptor.

void hl_pooling_forward(hl_tensor_descriptor input, real *input_image, hl_tensor_descriptor output, real *output_image, hl_pooling_descriptor pooling)¶

pooling forward(calculate output image).

Parameters

input -
input image descriptor.
input_image -
input image data.
output -
output image descriptor.
output_image -
output image data.
pooling -
pooling descriptor.

void hl_pooling_backward(hl_tensor_descriptor input, real *input_image, real *input_image_grad, hl_tensor_descriptor output, real *output_image, real *output_image_grad, hl_pooling_descriptor pooling)¶

pooling backward(calculate input image gradient).

Parameters

input -
input image descriptor.
input_image -
input image data.
input_image_grad -
input image gradient data.
output -
output image descriptor.
output_image -
output image data.
output_image_grad -
output image gradient data.
pooling -
pooling descriptor.

void hl_create_filter_descriptor(hl_filter_descriptor *filter, int input_feature_maps, int output_feature_maps, int height, int width)¶

create filter descriptor.

Parameters

filter -
filter descriptor.
input_feature_maps -
input image feature maps.
output_feature_maps -
output image feature maps.
height -
filter height.
width -
filter width.

void hl_conv_workspace(hl_tensor_descriptor input, hl_tensor_descriptor output, hl_filter_descriptor filter, hl_convolution_descriptor conv, int *convFwdAlgo, size_t *fwdLimitBytes, int *convBwdDataAlgo, size_t *bwdDataLimitBytes, int *convBwdFilterAlgo, size_t *bwdFilterLimitBytes)¶

convolution workspace configuration

Parameters

input -
image descriptor
output -
image descriptor
filter -
filter descriptor
conv -
convolution descriptor
convFwdAlgo -
forward algorithm
fwdLimitBytes -
forward workspace size
convBwdDataAlgo -
backward data algorithm
bwdDataLimitBytes -
backward data workspace size
convBwdFilterAlgo -
backward filter algorithm
bwdFilterLimitBytes -
backward filter workspace size

void hl_destroy_filter_descriptor(hl_filter_descriptor filter)¶

destroy filter descriptor.

Parameters

filter -
hppl filter descriptor.

void hl_create_convolution_descriptor(hl_convolution_descriptor *conv, hl_tensor_descriptor image, hl_filter_descriptor filter, int padding_height, int padding_width, int stride_height, int stride_width)¶

create convolution descriptor.

Parameters

conv -
conv descriptor.
image -
input image descriptor.
filter -
filter descriptor.
padding_height -
padding height.
padding_width -
padding width.
stride_height -
stride height.
stride_width -
stride width.
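Padding and stride, together with the filter size, fix the size of the output image. The sketch below uses the standard formula for a convolution without dilation; that HPPL follows the same rule is an assumption, and `ref_conv_output_size` is a hypothetical helper.

```cpp
#include <cassert>

// Output extent along one axis for a convolution without dilation:
// out = (in + 2 * padding - filter) / stride + 1, with integer (floor) division.
int ref_conv_output_size(int in, int filter, int padding, int stride) {
  assert(stride > 0 && padding >= 0 && filter > 0);
  assert(in + 2 * padding >= filter);  // the filter must fit in the padded input
  return (in + 2 * padding - filter) / stride + 1;
}
```

The same function is applied once with the height parameters and once with the width parameters, for example a 224-pixel axis with a 7-wide filter, padding 3 and stride 2 yields 112 outputs.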

void hl_reset_convolution_descriptor(hl_convolution_descriptor conv, hl_tensor_descriptor image, hl_filter_descriptor filter, int padding_height, int padding_width, int stride_height, int stride_width)¶

reset convolution descriptor.

Parameters

conv -
conv descriptor.
image -
input image descriptor.
filter -
filter descriptor.
padding_height -
padding height.
padding_width -
padding width.
stride_height -
stride height.
stride_width -
stride width.

void hl_destroy_convolution_descriptor(hl_convolution_descriptor conv)¶

destroy convolution descriptor.

Parameters

conv -
hppl convolution descriptor.

void hl_convolution_forward(hl_tensor_descriptor input, real *input_data, hl_tensor_descriptor output, real *output_data, hl_filter_descriptor filter, real *filter_data, hl_convolution_descriptor conv, void *gpuWorkSpace, size_t sizeInBytes, int convFwdAlgo)¶

convolution forward(calculate output image).

Parameters

input -
input image descriptor.
input_data -
input image data.
output -
output image descriptor.
output_data -
output image data.
filter -
filter descriptor.
filter_data -
filter data.
conv -
convolution descriptor.
gpuWorkSpace -
limited gpu workspace.
sizeInBytes -
gpu workspace size (bytes).
convFwdAlgo -
forward algorithm.

void hl_convolution_forward_add_bias(hl_tensor_descriptor bias, real *bias_data, hl_tensor_descriptor output, real *output_data)¶

convolution forward add bias(calculate output add bias).

Parameters

bias -
bias descriptor.
bias_data -
bias data.
output -
output image descriptor.
output_data -
output image data.

void hl_convolution_backward_filter(hl_tensor_descriptor input, real *input_data, hl_tensor_descriptor output, real *output_grad_data, hl_filter_descriptor filter, real *filter_grad_data, hl_convolution_descriptor conv, void *gpuWorkSpace, size_t sizeInBytes, int convBwdFilterAlgo)¶

convolution backward filter(calculate filter grad data).

Parameters

input -
input image descriptor.
input_data -
input image data.
output -
output image descriptor.
output_grad_data -
output image grad data.
filter -
filter descriptor.
filter_grad_data -
filter grad data.
conv -
convolution descriptor.
gpuWorkSpace -
limited gpu workspace.
sizeInBytes -
gpu workspace size (bytes).
convBwdFilterAlgo -
backward filter algorithm.

void hl_convolution_backward_data(hl_tensor_descriptor input, real *input_data_grad, hl_tensor_descriptor output, real *output_grad_data, hl_filter_descriptor filter, real *filter_data, hl_convolution_descriptor conv, void *gpuWorkSpace, size_t sizeInBytes, int convBwdDataAlgo)¶

convolution backward data(calculate input image grad data).

Parameters

input -
input image descriptor.
input_data_grad -
input image grad data.
output -
output image descriptor.
output_grad_data -
output image grad data.
filter -
filter descriptor.
filter_data -
filter data.
conv -
convolution descriptor.
gpuWorkSpace -
limited gpu workspace.
sizeInBytes -
gpu workspace size (bytes).
convBwdDataAlgo -
backward data algorithm.

void hl_convolution_backward_bias(hl_tensor_descriptor bias, real *bias_grad_data, hl_tensor_descriptor output, real *output_grad_data)¶

convolution backward bias(calculate bias grad data).

Parameters

bias -
bias descriptor.
bias_grad_data -
bias grad data.
output -
output image descriptor.
output_grad_data -
output image grad data.

void hl_softmax_forward(real *input, real *output, int height, int width)¶

softmax forward.

Parameters

input -
input value.
output -
output value.
height -
matrix height.
width -
matrix width.
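A CPU reference makes the shape convention concrete. The sketch below assumes the matrix is stored row-major and that softmax is applied to each row of width elements independently; it is not the HPPL kernel, `real` is assumed to be float, and `ref_softmax_forward` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

typedef float real;  // assumption: single-precision build

// CPU reference: softmax over each row of a height x width row-major matrix.
// Assumption: every row is normalized independently.
void ref_softmax_forward(const real* input, real* output, int height, int width) {
  assert(height >= 0 && width > 0);
  for (int r = 0; r < height; ++r) {
    const real* in = input + r * width;
    real* out = output + r * width;
    // Subtract the row maximum so that exp() cannot overflow.
    const real row_max = *std::max_element(in, in + width);
    real sum = 0;
    for (int c = 0; c < width; ++c) {
      out[c] = std::exp(in[c] - row_max);
      sum += out[c];
    }
    for (int c = 0; c < width; ++c) {
      out[c] /= sum;
    }
  }
}
```

Subtracting the row maximum does not change the result mathematically, but it keeps large inputs such as 1000 from producing infinities.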

void hl_softmax_backward(real *output_value, real *output_grad, int height, int width)¶

softmax backward.

Parameters

output_value -
output value data.
output_grad -
output grad data.
height -
matrix height.
width -
matrix width.

void hl_batch_norm_forward_training(hl_tensor_descriptor inputDesc, real *input, hl_tensor_descriptor outputDesc, real *output, hl_tensor_descriptor bnParamDesc, real *scale, real *bias, double factor, real *runningMean, real *runningInvVar, double epsilon, real *savedMean, real *savedVar)¶

cudnn batch norm forward (training).

Parameters

inputDesc -
input tensor descriptor.
input -
input data.
outputDesc -
output tensor descriptor.
output -
output data.
bnParamDesc -
tensor descriptor shared by bnScale, bnBias, the running mean/var and the saved mean/var.
scale -
batch normalization scale parameter (in the original paper scale is referred to as gamma).
bias -
batch normalization bias parameter (in the original paper bias is referred to as beta).
factor -
Factor used in the moving average computation: runningMean = newMean * factor + runningMean * (1 - factor).
runningMean -
running mean.
runningInvVar -
running variance.
epsilon -
Epsilon value used in the batch normalization formula.
savedMean -
optional cache to save intermediate results.
savedVar -
optional cache to save intermediate results.
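The factor parameter blends the statistics of the current batch into the running statistics. The sketch below assumes the update has the form newMean * factor + runningMean * (1 - factor), which is how the cuDNN documentation states it for cudnnBatchNormalizationForwardTraining; `ref_update_running` is a hypothetical helper.

```cpp
#include <cassert>

// Moving-average update controlled by `factor`:
// running = batch * factor + running * (1 - factor).
double ref_update_running(double running, double batch, double factor) {
  assert(factor >= 0.0 && factor <= 1.0);
  return batch * factor + running * (1.0 - factor);
}
```

A factor of 1 replaces the running value with the batch value, and a factor of 0 leaves the running value unchanged.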

void hl_batch_norm_forward_inference(hl_tensor_descriptor inputDesc, real *input, hl_tensor_descriptor outputDesc, real *output, hl_tensor_descriptor bnParamDesc, real *scale, real *bias, real *estimatedMean, real *estimatedVar, double epsilon)¶

cudnn batch norm forward (inference).

Parameters

inputDesc -
input tensor descriptor.
input -
input data.
outputDesc -
output tensor descriptor.
output -
output data.
bnParamDesc -
tensor descriptor shared by bnScale, bnBias and the estimated mean/var.
scale -
batch normalization scale parameter (in the original paper scale is referred to as gamma).
bias -
batch normalization bias parameter (in the original paper bias is referred to as beta).
estimatedMean, estimatedVar -
It is suggested that resultRunningMean and resultRunningVariance, accumulated by the cudnnBatchNormalizationForwardTraining calls during the training phase, are passed as inputs here.
epsilon -
Epsilon value used in the batch normalization formula.
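At inference time the parameters above combine into a per-element affine transform. The sketch below applies the standard batch normalization formula to a single value; in a real tensor the scale, bias, mean and variance would be chosen per feature map. `real` is assumed to be float, and `ref_batch_norm_inference` is a hypothetical name, not HPPL code.

```cpp
#include <cassert>
#include <cmath>

typedef float real;  // assumption: single-precision build

// Batch normalization of one element at inference time:
// y = scale * (x - mean) / sqrt(var + epsilon) + bias.
real ref_batch_norm_inference(real x, real scale, real bias,
                              real mean, real var, double epsilon) {
  assert(var >= 0 && epsilon > 0.0);
  // Compute in double so that a small epsilon is not lost next to var.
  const double normalized = (static_cast<double>(x) - mean) / std::sqrt(var + epsilon);
  return static_cast<real>(scale * normalized + bias);
}
```

Epsilon keeps the division well defined when the estimated variance is zero or very small.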

void hl_batch_norm_backward(hl_tensor_descriptor inputDesc, real *input, hl_tensor_descriptor outGradDesc, real *outGrad, hl_tensor_descriptor inGradDesc, real *inGrad, hl_tensor_descriptor dBnParamDesc, real *scale, real *scaleGrad, real *biasGrad, double epsilon, real *savedMean, real *savedInvVar)¶

cudnn batch norm backward.

Parameters

inputDesc -
input tensor descriptor.
input -
input data.
outGradDesc -
output gradient tensor descriptor.
outGrad -
output gradient data.
inGradDesc -
input gradient tensor descriptor.
inGrad -
input gradient data.
dBnParamDesc -
tensor descriptor shared by bnScale, the scale/bias gradients and the saved mean/var.
scale -
batch normalization scale parameter (in the original paper scale is referred to as gamma).
scaleGrad -
gradient of the batch normalization scale parameter (in the original paper scale is referred to as gamma).
biasGrad -
gradient of the batch normalization bias parameter (in the original paper bias is referred to as beta).
epsilon -
Epsilon value used in the batch normalization formula.
savedMean -
optional cache to save intermediate results.
savedInvVar -
optional cache to save intermediate results.
