resolve #15618 !16114
Created by: guomingz
Resolve #15618 (closed). Background: PR #15398 caused a performance regression in the box_coder op. This PR optimizes the code by using OpenMP more efficiently.
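
For readers unfamiliar with the pattern, here is a minimal sketch of how a box encoding loop can be parallelized with OpenMP. It is not the code changed in this PR: the function name `EncodeCenterSizeSketch`, the flat `[xmin, ymin, xmax, ymax]` layout, and the simplified formula (normalized boxes, no variance term) are all assumptions made for illustration. The idea is to collapse the two nested loops into one iteration space so that all threads stay busy even when one dimension is small.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration only; not Paddle's actual kernel.
// target: rows * 4 floats, prior: cols * 4 floats,
// each box stored as [xmin, ymin, xmax, ymax].
// Returns rows * cols * 4 floats: the center-size encoding of every
// target box against every prior box.
std::vector<float> EncodeCenterSizeSketch(const std::vector<float>& target,
                                          const std::vector<float>& prior) {
  const int64_t rows = static_cast<int64_t>(target.size() / 4);
  const int64_t cols = static_cast<int64_t>(prior.size() / 4);
  std::vector<float> out(static_cast<size_t>(rows * cols * 4));

  // collapse(2) merges both loops into a single rows * cols iteration
  // space. Every (i, j) pair writes to its own 4-float slot, so no
  // synchronization is needed. Without -fopenmp the pragma is ignored
  // and the loops simply run serially.
#pragma omp parallel for collapse(2)
  for (int64_t i = 0; i < rows; ++i) {
    for (int64_t j = 0; j < cols; ++j) {
      const float* t = &target[i * 4];
      const float* p = &prior[j * 4];

      const float pw = p[2] - p[0];
      const float ph = p[3] - p[1];
      const float pcx = p[0] + pw / 2;
      const float pcy = p[1] + ph / 2;

      const float tw = t[2] - t[0];
      const float th = t[3] - t[1];
      const float tcx = (t[0] + t[2]) / 2;
      const float tcy = (t[1] + t[3]) / 2;

      float* o = &out[(i * cols + j) * 4];
      o[0] = (tcx - pcx) / pw;
      o[1] = (tcy - pcy) / ph;
      o[2] = std::log(std::fabs(tw / pw));
      o[3] = std::log(std::fabs(th / ph));
    }
  }
  return out;
}
```

Build with `-fopenmp` (GCC/Clang) to enable the parallel region; the thread count can then be set through the standard `OMP_NUM_THREADS` environment variable.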
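- Test env: SKX 8180, fake data, 28 threads (bs=1/32/256).
- The table below shows that box_coder op performance improves substantially compared with the current HEAD (4f3c8a41): the average time per call (Ave.) drops by about 5.7x at bs=1 and by about 19.5x to 20x at bs=32 and bs=256.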
Type | Batch size | Event | Calls | Total | Min. | Max. | Ave. | Ratio |
---|---|---|---|---|---|---|---|---|
with optimization | 1 | thread0::box_coder | 500 | 15.3449 | 0.028966 | 0.037312 | 0.0306897 | 0.00434202 |
without optimization | 1 | thread0::box_coder | 500 | 87.3473 | 0.17186 | 0.182247 | 0.174695 | 0.0242312 |
with optimization | 32 | thread0::box_coder | 500 | 136.237 | 0.264745 | 0.564132 | 0.272473 | 0.00489434 |
without optimization | 32 | thread0::box_coder | 500 | 2652.95 | 5.11728 | 5.61542 | 5.30591 | 0.0870159 |
with optimization | 256 | thread0::box_coder | 500 | 967.843 | 1.91136 | 2.96198 | 1.93569 | 0.00478883 |
without optimization | 256 | thread0::box_coder | 500 | 19420 | 38.727 | 39.0971 | 38.84 | 0.0877227 |
test=develop