Created by: jianhang-liu
The hash OP has two main computations:
1. Call XXH64() to compute the hash.
2. Compute "hash % mod_by", where mod_by is the size of the lookup table.
We tried the following optimizations:
- Replaced "hash % mod_by" with "hash - (hash / mod_by) * mod_by", and sped up "hash / mod_by" with the external library "libdivide".
- Compiled XXH64() inline, with one additional compile flag, XXH_FORCE_MEMORY_ACCESS.
Of these two optimizations, the first contributes much more than the second.
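The remainder rewrite can be sketched roughly as follows. This is a minimal illustration, not the code from this change: `Divider`, `FastMod`, and `BucketAll` are hypothetical names, and `Divider` uses plain division as a stand-in for a precomputed divider such as `libdivide::divider<uint64_t>`, so that the sketch builds without extra dependencies. The point is that the divider is constructed once per `mod_by` and reused for every hash value.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a precomputed divider such as
// libdivide::divider<uint64_t>. It uses plain division here; a real fast
// divider would precompute a multiplier and shift in the constructor.
struct Divider {
  explicit Divider(uint64_t d) : d_(d) { assert(d != 0); }
  uint64_t divide(uint64_t x) const { return x / d_; }
  uint64_t d_;
};

// Remainder via the identity x % d == x - (x / d) * d.
inline uint64_t FastMod(uint64_t hash, const Divider& div, uint64_t mod_by) {
  return hash - div.divide(hash) * mod_by;
}

// Map a batch of hash values to table indices, building the divider once.
std::vector<uint64_t> BucketAll(const std::vector<uint64_t>& hashes,
                                uint64_t mod_by) {
  Divider div(mod_by);
  std::vector<uint64_t> out;
  out.reserve(hashes.size());
  for (uint64_t h : hashes) {
    out.push_back(FastMod(h, div, mod_by));
  }
  return out;
}
```

The rewrite only pays off when the division itself is cheaper than the hardware divide, which is why the divider has to be precomputed outside the per-element loop.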
With the above changes, we got a 20%~35% performance gain at the OP level. However, the performance gain at the model level is only <5%. This is due to a "strange" performance drop in another OP (i.e. fused_embedding_seq_pool). We need to check further why that performance drop happens. @tensor-tang Any idea?
See the benchmark results below:
- On 2699v3

  **Before**

  batch_size: 1, repeat: 10, threads: 1, thread id: 0, latency: 10021.3ms, fps: 0.0997878

  sample number: 48685, average latency of each sample: 0.205839ms

  | Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
  |---|---|---|---|---|---|---|
  | thread0::fused_embedding_seq_pool | 7789600 | 25657 | 0.001896 | 21.0386 | 0.00329375 | 0.279999 |
  | thread0::hash | 5842200 | 24603.8 | 0.002024 | 21.2326 | 0.00421139 | 0.268505 |
  | thread0::fc | 973700 | 14539.6 | 0.013026 | 21.344 | 0.0149324 | 0.158673 |
  | thread0::sequence_enumerate | 5842200 | 12935.1 | 0.001709 | 20.8716 | 0.00221408 | 0.141163 |
  | thread0::sum | 973700 | 6370.22 | 0.004313 | 20.5508 | 0.00654229 | 0.0695192 |
  | thread0::softsign | 973700 | 2327.97 | 0.001672 | 0.024116 | 0.00239085 | 0.0254055 |
  | thread0::cos_sim | 486850 | 2241.46 | 0.003985 | 8.81306 | 0.00460401 | 0.0244614 |
  | thread0::feed | 1947400 | 1748.23 | 0.000584 | 0.030094 | 0.000897723 | 0.0190787 |
  | thread0::fetch | 486850 | 1209.13 | 0.00206 | 0.023377 | 0.00248357 | 0.0131954 |

  **After**

  batch_size: 1, repeat: 10, threads: 1, thread id: 0, latency: 9750.24ms, fps: 0.102562

  sample number: 48685, average latency of each sample: 0.200272ms

  | Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
  |---|---|---|---|---|---|---|
  | thread0::fused_embedding_seq_pool | 7789600 | 27892.1 | 0.001933 | 21.3465 | 0.00358068 | 0.314239 |
  | thread0::hash | 5842200 | 18945 | 0.001756 | 22.4899 | 0.00324279 | 0.213439 |
  | thread0::fc | 973700 | 14914.7 | 0.013256 | 22.2217 | 0.0153175 | 0.168032 |
  | thread0::sequence_enumerate | 5842200 | 12822.3 | 0.001682 | 20.3096 | 0.00219477 | 0.144459 |
  | thread0::sum | 973700 | 6496.36 | 0.004225 | 20.9529 | 0.00667183 | 0.0731895 |
  | thread0::cos_sim | 486850 | 2360.41 | 0.004238 | 21.2927 | 0.00484834 | 0.026593 |
  | thread0::softsign | 973700 | 2291.19 | 0.001608 | 19.7146 | 0.00235308 | 0.0258132 |
  | thread0::feed | 1947400 | 1796.12 | 0.000586 | 21.3117 | 0.000922318 | 0.0202355 |
  | thread0::fetch | 486850 | 1242.6 | 0.002142 | 0.023137 | 0.00255233 | 0.0139994 |
- On SKX6140

  **Before**

  batch_size: 1, repeat: 10, threads: 1, thread id: 0, latency: 10666.7ms, fps: 0.0937501

  sample number: 48685, average latency of each sample: 0.219095ms

  | Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
  |---|---|---|---|---|---|---|
  | thread0::hash | 5842200 | 24818.2 | 0.002372 | 1.32739 | 0.0042481 | 0.282479 |
  | thread0::fused_embedding_seq_pool | 7789600 | 24470.3 | 0.002441 | 1.67723 | 0.00314141 | 0.27852 |
  | thread0::sequence_enumerate | 5842200 | 15027.8 | 0.002197 | 0.908093 | 0.00257229 | 0.171045 |
  | thread0::fc | 973700 | 9299.96 | 0.007387 | 1.60737 | 0.00955115 | 0.105851 |
  | thread0::sum | 973700 | 5202.84 | 0.004106 | 0.425563 | 0.00534337 | 0.0592183 |
  | thread0::softsign | 973700 | 2895.76 | 0.002396 | 0.410873 | 0.00297397 | 0.0329593 |
  | thread0::feed | 1947400 | 2652.1 | 0.001111 | 1.30264 | 0.00136187 | 0.030186 |
  | thread0::cos_sim | 486850 | 2392.37 | 0.004267 | 1.35109 | 0.00491398 | 0.0272298 |
  | thread0::fetch | 486850 | 1099.23 | 0.001871 | 0.404445 | 0.00225783 | 0.0125113 |

  **After**

  batch_size: 1, repeat: 10, threads: 1, thread id: 0, latency: 10493.8ms, fps: 0.095294

  sample number: 48685, average latency of each sample: 0.215546ms

  | Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
  |---|---|---|---|---|---|---|
  | thread0::fused_embedding_seq_pool | 7789600 | 27182.2 | 0.002518 | 2.44027 | 0.00348955 | 0.314938 |
  | thread0::hash | 5842200 | 20382.6 | 0.002411 | 2.05364 | 0.00348885 | 0.236156 |
  | thread0::sequence_enumerate | 5842200 | 15313.4 | 0.002242 | 2.27128 | 0.00262117 | 0.177424 |
  | thread0::fc | 973700 | 9241.42 | 0.00752 | 2.52018 | 0.00949103 | 0.107073 |
  | thread0::sum | 973700 | 5307.15 | 0.004207 | 2.44368 | 0.0054505 | 0.0614896 |
  | thread0::softsign | 973700 | 2947.92 | 0.002332 | 0.077244 | 0.00302754 | 0.0341551 |
  | thread0::feed | 1947400 | 2537.44 | 0.00107 | 0.027547 | 0.00130299 | 0.0293993 |
  | thread0::cos_sim | 486850 | 2297.62 | 0.003981 | 0.444106 | 0.00471936 | 0.0266206 |
  | thread0::fetch | 486850 | 1100.08 | 0.001824 | 0.02604 | 0.00225959 | 0.0127457 |
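
To compare the OP-level and model-level effect from these profiles, the relative changes can be computed directly from the "Total" column and the reported latency. The sketch below is only an illustration: `PercentChange`, `Row`, and `PrintChanges` are hypothetical helper names, and the numbers are copied from the benchmark output above.

```cpp
#include <cassert>
#include <cstdio>

// Relative change from `before` to `after`, in percent (negative = less time).
inline double PercentChange(double before, double after) {
  assert(before != 0.0);
  return (after - before) / before * 100.0;
}

struct Row {
  const char* what;
  double before;
  double after;
};

// "Total" values per OP and end-to-end latency, copied from the profiles.
static const Row kRows[] = {
    {"2699v3  hash", 24603.8, 18945.0},
    {"2699v3  fused_embedding_seq_pool", 25657.0, 27892.1},
    {"2699v3  latency", 10021.3, 9750.24},
    {"SKX6140 hash", 24818.2, 20382.6},
    {"SKX6140 fused_embedding_seq_pool", 24470.3, 27182.2},
    {"SKX6140 latency", 10666.7, 10493.8},
};

// Print one line per row with its relative change.
void PrintChanges() {
  for (const Row& r : kRows) {
    std::printf("%-36s %+.1f%%\n", r.what, PercentChange(r.before, r.after));
  }
}
```

By this calculation, total time in hash drops by roughly 23% on 2699v3 and about 18% on SKX6140, while fused_embedding_seq_pool grows by roughly 9% and 11% respectively, which absorbs a large part of the gain and leaves the latency improvement under 3% on both machines.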