In this tutorial, we want to highlight a new `torch.nn.functional` function
that can be helpful for implementing transformer architectures. The function is
named `torch.nn.functional.scaled_dot_product_attention`. For a detailed description
of the function, see the [PyTorch documentation](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention).
This function has already been incorporated into `torch.nn.MultiheadAttention`.
At a high level, this PyTorch function calculates the scaled dot product attention
(SDPA) between query, key, and value according to the definition found in the
paper [Attention is all you need](https://arxiv.org/abs/1706.03762). While this
function can be written in PyTorch using existing functions, a fused implementation
can provide large performance benefits over a naive implementation.
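To make this concrete, here is a minimal sketch comparing a naive SDPA built from basic tensor ops against the fused `torch.nn.functional.scaled_dot_product_attention` call; the tensor shapes below are arbitrary example values chosen for illustration.

```python
import math
import torch
import torch.nn.functional as F

# Example tensors with shape (batch, num_heads, seq_len, head_dim)
query = torch.randn(2, 8, 64, 32)
key = torch.randn(2, 8, 64, 32)
value = torch.randn(2, 8, 64, 32)

# Naive SDPA written with existing PyTorch ops:
# softmax(Q K^T / sqrt(d_k)) V
scale = 1.0 / math.sqrt(query.size(-1))
naive_out = torch.softmax(query @ key.transpose(-2, -1) * scale, dim=-1) @ value

# Fused implementation provided by PyTorch
fused_out = F.scaled_dot_product_attention(query, key, value)

# The two paths agree up to floating-point tolerance
print(torch.allclose(naive_out, fused_out, atol=1e-5))
```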