Optimize Softmax Kernel (#3112)
* Simplify kernel
* Coalesce memory attempt 1. Logits divergence.
* Logits fix?
* sync after every global mem access
* template on iterations. Down to 8.3% cuda time for 8k tokens
* Up to 64 iterations
* Add alibi/mask check
* fp32
* Revert builder.py
* naming. precommit
* Revert "naming. precommit"
  This reverts commit 150eb7d9.
* naming. spacing
* Spacing. simplify checks
* remove bsyncs
* missed bsyncs
* precommit