Unverified commit e73de8ce, authored by Molly Smith, committed by GitHub

Optimize Softmax Kernel (#3112)

* Simplify kernel

* Coalesce memory attempt 1. Logits divergence.

* Logits fix?

* sync after every global mem access

* template on iterations. Down to 8.3% cuda time for 8k tokens

* Up to 64 iterations

* Add alibi/mask check

* fp32

* Revert builder.py

* naming. precommit

* Revert "naming. precommit"

This reverts commit 150eb7d9.

* naming. spacing

* Spacing. simplify checks

* remove bsyncs

* missed bsyncs

* precommit
Parent: f2c9a827