Translation of Chinese in the figure: represent a sentence as an $n\times k$ matrix; apply convolutions of different kernel sizes; max-pooling across the temporal dimension; fully-connected layer.
</p>
Assume the sentence has length $n$ and the $i$-th word has embedding $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimensionality.
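The pipeline in the figure can be sketched directly in NumPy. This is only an illustration of the idea, not the chapter's PaddlePaddle implementation; the single filter per window size and the toy dimensions below are simplifying assumptions.

```python
import numpy as np

n, k = 7, 8                                  # sentence length, embedding dimensionality
rng = np.random.default_rng(0)
X = rng.normal(size=(n, k))                  # the sentence as an n x k matrix

pooled = []
for h in (3, 4, 5):                          # convolution windows of different sizes
    W = rng.normal(scale=0.1, size=(h, k))   # a single filter per window size, for brevity
    # Slide the filter over every window of h consecutive words.
    feats = np.array([np.sum(W * X[i:i + h]) for i in range(n - h + 1)])
    pooled.append(np.tanh(feats).max())      # max-pooling across the temporal dimension

sentence_vec = np.array(pooled)              # concatenate the pooled features
W_fc = rng.normal(scale=0.1, size=(sentence_vec.size, 2))
logits = sentence_vec @ W_fc                 # fully-connected layer, e.g. two classes
print(logits.shape)                          # (2,)
```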
...
...
@@ -88,9 +89,9 @@ h_t & = o_t\odot tanh(c_t)\\\\
In the equations, $i_t$, $f_t$, $c_t$, and $o_t$ stand for the input gate, forget gate, memory cell, and output gate, respectively; $W$ and $b$ are model parameters. $\tanh$ is the hyperbolic tangent function, and $\odot$ denotes the element-wise product. The input gate controls the magnitude of new input entering the memory cell $c$; the forget gate controls how much memory is propagated from the previous time step; the output gate controls the magnitude of the output. The three gates are computed in the same way but with different parameters, and each acts on the memory cell $c$ separately, as shown in Figure 3.
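To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM time step. It is not the chapter's PaddlePaddle implementation; the per-gate parameter dictionaries, names, and sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following the gate equations above.

    W, U, b hold one weight matrix / bias per gate ('i', 'f', 'o')
    and for the candidate memory ('c'); this layout is illustrative.
    """
    i_t = sigmoid(x_t @ W['i'] + h_prev @ U['i'] + b['i'])      # input gate
    f_t = sigmoid(x_t @ W['f'] + h_prev @ U['f'] + b['f'])      # forget gate
    o_t = sigmoid(x_t @ W['o'] + h_prev @ U['o'] + b['o'])      # output gate
    c_hat = np.tanh(x_t @ W['c'] + h_prev @ U['c'] + b['c'])    # candidate memory
    c_t = f_t * c_prev + i_t * c_hat                            # update the memory cell
    h_t = o_t * np.tanh(c_t)                                    # h_t = o_t (.) tanh(c_t)
    return h_t, c_t

# Toy sizes: embedding dimensionality k, hidden size d.
k, d = 8, 16
rng = np.random.default_rng(0)
W = {g: rng.normal(scale=0.1, size=(k, d)) for g in 'ifoc'}
U = {g: rng.normal(scale=0.1, size=(d, d)) for g in 'ifoc'}
b = {g: np.zeros(d) for g in 'ifoc'}
h_t, c_t = lstm_step(rng.normal(size=k), np.zeros(d), np.zeros(d), W, U, b)
print(h_t.shape, c_t.shape)   # (16,) (16,)
```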
LSTM enhances the ability to model long-term dependencies with the help of its memory cell and gates. A similar structure with a simpler design has also been proposed as the Gated Recurrent Unit (GRU)\[[8](Reference)\]. **These structures are still similar to a standard RNN, with some modifications (as shown in Figure 2): the hidden state depends on the current input as well as the hidden state of the previous time step, and the process repeats recurrently until all inputs are consumed:**
$$ h_t=Recurrent(x_t, h_{t-1}) $$
where $Recurrent$ can be a simple RNN, a GRU, or an LSTM.
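The recurrence is easy to see when unrolled. The sketch below steps a vanilla RNN over a toy sentence; replacing the step function with a GRU or LSTM step gives the other variants. All names and sizes here are illustrative, not taken from the chapter's code.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # Vanilla RNN: the new hidden state depends on the current input
    # and on the hidden state of the previous time step.
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

n, k, d = 5, 8, 16                       # sentence length, embedding size, hidden size
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(k, d))
W_h = rng.normal(scale=0.1, size=(d, d))
b = np.zeros(d)

xs = rng.normal(size=(n, k))             # one sentence of word embeddings
h = np.zeros(d)                          # initial hidden state
for x_t in xs:                           # h_t = Recurrent(x_t, h_{t-1})
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h.shape)                           # (16,) -- hidden state after the last word
```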
...
...
<p align="center">
Figure 3. LSTM at time step $t$ [7]. Translation of Chinese in the figure: input gate, memory cell, forget gate, and output gate.
</p>
...
...
@@ -143,7 +144,7 @@ As shown in Figure 4 (3-layer RNN), odd/even layers are forward/reverse LSTM. Hi
<palign="center">
<imgsrc="image/stacked_lstm.jpg"width=450><br/>
Figure 4. Stacked Bidirectional LSTM for NLP modeling. Translation of Chinese in the figure: word embedding mapping; fully connected layer; reverse-directional LSTM; max-pooling.
</p>
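As a rough illustration of the architecture in Figure 4 (not the chapter's PaddlePaddle network), the sketch below stacks minimal LSTM layers with alternating directions and max-pools the top layer's hidden states over time; the fused-gate parameterization, layer count, and sizes are assumptions made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_layer(xs, Wx, Wh, b, reverse=False):
    """Run a minimal LSTM over a sequence of vectors, optionally right-to-left."""
    n, d = xs.shape[0], Wh.shape[0]
    h, c = np.zeros(d), np.zeros(d)
    out = np.zeros((n, d))
    steps = range(n - 1, -1, -1) if reverse else range(n)
    for t in steps:
        z = xs[t] @ Wx + h @ Wh + b          # three gates and the candidate, fused
        i, f, o = sigmoid(z[:3 * d]).reshape(3, d)
        g = np.tanh(z[3 * d:])               # candidate memory
        c = f * c + i * g
        h = o * np.tanh(c)
        out[t] = h
    return out

# Stack three layers with alternating directions (forward / reverse / forward),
# then max-pool the top layer's hidden states over time to get a sentence vector.
rng = np.random.default_rng(0)
n, k, d = 6, 8, 16
xs = rng.normal(size=(n, k))                 # one sentence of word embeddings
for layer in range(3):
    in_dim = xs.shape[1]
    Wx = rng.normal(scale=0.1, size=(in_dim, 4 * d))
    Wh = rng.normal(scale=0.1, size=(d, 4 * d))
    b = np.zeros(4 * d)
    xs = lstm_layer(xs, Wx, Wh, b, reverse=(layer % 2 == 1))
sentence_vec = xs.max(axis=0)                # temporal max-pooling
print(sentence_vec.shape)                    # (16,)
```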
## Data Preparation
...
...
@@ -528,8 +529,6 @@ In this chapter, we use sentiment analysis as an example to introduce applying d