```text
pass num : 0, batch_id: 10, dy_graph avg loss: [9.033163]
pass num : 0, batch_id: 20, dy_graph avg loss: [8.869838]
pass num : 0, batch_id: 30, dy_graph avg loss: [8.635877]
pass num : 0, batch_id: 40, dy_graph avg loss: [8.460026]
pass num : 0, batch_id: 50, dy_graph avg loss: [8.293438]
pass num : 0, batch_id: 60, dy_graph avg loss: [8.138791]
pass num : 0, batch_id: 70, dy_graph avg loss: [7.9594088]
pass num : 0, batch_id: 80, dy_graph avg loss: [7.7303553]
pass num : 0, batch_id: 90, dy_graph avg loss: [7.6716228]
pass num : 0, batch_id: 100, dy_graph avg loss: [7.611051]
pass num : 0, batch_id: 110, dy_graph avg loss: [7.4179897]
pass num : 0, batch_id: 120, dy_graph avg loss: [7.318419]
```
## Advanced Usage
### Model Architecture
Transformer is a novel network architecture proposed in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) for sequence-to-sequence (Seq2Seq) learning tasks such as machine translation (MT). It retains the Encoder-Decoder framework typical of Seq2Seq models, but unlike the Recurrent Neural Networks (RNNs) that were widely used before it, it relies entirely on the attention mechanism to model the mapping from input sequence to output sequence. The overall network structure is shown in Figure 1.
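At the core of this architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The following framework-free NumPy sketch illustrates just this computation; the names (q, k, v, d_k) follow the paper's notation rather than this repository's code, and the paper's masking and multi-head projections are omitted for brevity, so treat it as a minimal illustration, not the implementation used here.

```python
# A minimal sketch of scaled dot-product attention from the paper.
# Names follow the paper's notation, not this repository's code.
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v) -> (n_q, d_v)."""
    d_k = q.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k)
    # to keep the softmax in a well-behaved range.
    scores = q @ k.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax turns scores into weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ v

# Usage: 4 queries attending over 6 key/value pairs, d_k = d_v = 8.
q = np.random.randn(4, 8)
k = np.random.randn(6, 8)
v = np.random.randn(6, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # (4, 8)
```

In the full model this operation is applied in parallel across several heads, in the encoder's self-attention, the decoder's masked self-attention, and the encoder-decoder attention.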