    Optimize Softmax Kernel (#3112) · e73de8ce
    Committed by Molly Smith
    * Simplify kernel
    
    * Coalesce memory attempt 1. Logits divergence.
    
    * Logits fix?
    
    * sync after every global mem access
    
    * template on iterations. Down to 8.3% cuda time for 8k tokens
    
    * Up to 64 iterations
    
    * Add alibi/mask check
    
    * fp32
    
    * Revert builder.py
    
    * naming. precommit
    
    * Revert "naming. precommit"
    
    This reverts commit 150eb7d9.
    
    * naming. spacing
    
    * Spacing. simplify checks
    
    * remove bsyncs
    
    * missed bsyncs
    
    * precommit
softmax.cu 23.2 KB