a faster transpose implementation
Created by: dolphin8
the original implementation
====================[ profile ]======================
conv2d 711525 45.1095
depthwise_conv2d 571597 36.2383
batch_norm 152207 9.6497
relu 73814 4.6797
conv_add 30833 1.9548
softmax 18402 1.1667
transpose 10722 0.6798
reshape 3752 0.2379
prior_box 1972 0.1250
concat 1080 0.0685
multiclass_nms 584 0.0370
box_coder 557 0.0353
feed 246 0.0156
fetch 37 0.0023
total 1577328 100.0000
====================[---------]======================
the new implementation
====================[ profile ]======================
conv2d 710851 45.2465
depthwise_conv2d 574878 36.5916
batch_norm 151054 9.6148
relu 73551 4.6816
conv_add 31083 1.9785
softmax 17724 1.1282
transpose 3885 0.2473
reshape 3649 0.2323
prior_box 1934 0.1231
concat 1052 0.0670
multiclass_nms 583 0.0371
box_coder 561 0.0357
feed 223 0.0142
fetch 36 0.0023
total 1571064 100.0000
====================[---------]======================
transpose drops from 10722 to 3885 (0.6798% to 0.2473% of the total), roughly 2.8x faster. the other operators are essentially unchanged.
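A multi-dimensional index can be treated as a number whose digits have different bases, where the base of each digit is the size of that dimension. The listing below enumerates every index of a 2x2x3 tensor in order. On most steps only the last digit changes; on a carry, the overflowing digits reset to 0 and the next digit to the left increases by 1. The offset into the original tensor can therefore be updated from the previous one with a few additions and subtractions, rather than recomputed from all the digits for every element.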
0 0 0
0 0 1
0 0 2
0 1 0
0 1 1
0 1 2
1 0 0
1 0 1
1 0 2
1 1 0
1 1 1
1 1 2
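
The counting scheme above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code from this change: the function name `transpose`, the flat row-major list layout, and the `perm` argument (output axis `i` takes input axis `perm[i]`) are all assumptions made for the example.

```python
def transpose(src, shape, perm):
    """Transpose a flat row-major buffer `src` with dimensions `shape`.

    Hypothetical helper for illustration: output axis i is input axis perm[i].
    """
    ndim = len(shape)

    # Row-major strides of the source tensor.
    src_strides = [1] * ndim
    for i in range(ndim - 2, -1, -1):
        src_strides[i] = src_strides[i + 1] * shape[i + 1]

    # Size of each output axis, and how far one step along that
    # output axis moves us inside the source buffer.
    out_shape = [shape[p] for p in perm]
    strides = [src_strides[p] for p in perm]

    out = [None] * len(src)
    digits = [0] * ndim  # mixed-radix counter over out_shape
    src_off = 0          # source offset, updated incrementally

    for dst_off in range(len(src)):
        out[dst_off] = src[src_off]

        # Advance the counter by one, starting at the last digit.
        axis = ndim - 1
        while axis >= 0:
            digits[axis] += 1
            src_off += strides[axis]
            if digits[axis] < out_shape[axis]:
                break  # no carry: only this digit changed
            # Carry: reset this digit, undo its contribution,
            # then move on to the next digit to the left.
            src_off -= strides[axis] * out_shape[axis]
            digits[axis] = 0
            axis -= 1
    return out
```

For the 2x2x3 case listed above, `transpose(list(range(12)), [2, 2, 3], [2, 0, 1])` returns `[0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]`. The inner `while` loop usually exits after one addition, so the per-element cost is close to a single add, with extra work only on carries.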