a faster transpose implementation (#418) · Issue · PaddlePaddle / Paddle-Lite

a faster transpose implementation

Created by: dolphin8

Profile with the original implementation:

====================[ profile ]======================
conv2d          	711525    	45.1095
depthwise_conv2d	571597    	36.2383
batch_norm      	152207    	9.6497
relu            	73814     	4.6797
conv_add        	30833     	1.9548
softmax         	18402     	1.1667
transpose       	10722     	0.6798
reshape         	3752      	0.2379
prior_box       	1972      	0.1250
concat          	1080      	0.0685
multiclass_nms  	584       	0.0370
box_coder       	557       	0.0353
feed            	246       	0.0156
fetch           	37        	0.0023
total           	1577328   	100.0000
====================[---------]======================

Profile with the new implementation:

====================[ profile ]======================
conv2d          	710851    	45.2465
depthwise_conv2d	574878    	36.5916
batch_norm      	151054    	9.6148
relu            	73551     	4.6816
conv_add        	31083     	1.9785
softmax         	17724     	1.1282
transpose       	3885      	0.2473
reshape         	3649      	0.2323
prior_box       	1934      	0.1231
concat          	1052      	0.0670
multiclass_nms  	583       	0.0371
box_coder       	561       	0.0357
feed            	223       	0.0142
fetch           	36        	0.0023
total           	1571064   	100.0000
====================[---------]======================

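We can get the index along each dimension by treating the flat index as a mixed-radix number, in which each digit has its own base (the size of that dimension). Below is a description for a 2x2x3 shape. As we can see, each step of the loop normally changes only one digit (more digits change only when a carry occurs), so the index into the original tensor can be updated with simple additions and subtractions instead of being recomputed with divisions and modulo operations.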
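The idea can be sketched in C++ as follows. This is only a minimal illustration of the mixed-radix counter described above, not the actual Paddle-Lite kernel: the function name `transpose_odometer`, the `float` element type, the dense row-major layout, and the convention that output axis `i` is input axis `perm[i]` are all assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not the Paddle-Lite kernel): permutes the axes of a
// dense row-major tensor. Assumes output axis i is input axis perm[i].
std::vector<float> transpose_odometer(const std::vector<float>& in,
                                      const std::vector<std::size_t>& in_dims,
                                      const std::vector<std::size_t>& perm) {
  const std::size_t rank = in_dims.size();
  if (rank == 0 || in.empty()) return in;

  // Row-major strides of the input tensor.
  std::vector<std::size_t> in_strides(rank, 1);
  for (std::size_t i = rank - 1; i > 0; --i)
    in_strides[i - 1] = in_strides[i] * in_dims[i];

  // For each output axis: its size (the base of that digit) and the
  // distance in the input buffer covered by one step along it.
  std::vector<std::size_t> out_dims(rank), step(rank);
  for (std::size_t i = 0; i < rank; ++i) {
    out_dims[i] = in_dims[perm[i]];
    step[i] = in_strides[perm[i]];
  }

  std::vector<float> out(in.size());
  std::vector<std::size_t> digit(rank, 0);  // current output coordinates
  std::size_t src = 0;                      // matching flat input index
  for (std::size_t dst = 0; dst < out.size(); ++dst) {
    out[dst] = in[src];
    // Increment the counter, starting from the least significant digit.
    for (std::size_t d = rank; d-- > 0;) {
      src += step[d];                      // one step along axis d
      if (++digit[d] < out_dims[d]) break; // no carry: done
      src -= step[d] * out_dims[d];        // carry: rewind axis d to zero
      digit[d] = 0;
    }
  }
  return out;
}
```

For a 2x2x3 input holding the values 0 to 11 and `perm = {2, 0, 1}`, this yields the 3x2x2 tensor `0 3 6 9 1 4 7 10 2 5 8 11`. The inner loop does no division or modulo; most iterations perform one addition and one comparison, and the extra subtraction is needed only when a digit wraps around.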