Commit 3a30f715 authored by wangkuiyi, committed by GitHub

Merge pull request #111 from wangkuiyi/fixes

Use English figures in all README.en.md files
......@@ -13,8 +13,8 @@ where $\omega_{d}$ and $b$ are the model parameters we want to estimate. Once th
## Results Demonstration
We first show the training result of our model. We use the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) to train a linear model and predict house prices in Boston. The figure below shows the predictions the model makes for some house prices. The $X$ coordinate of each point represents the median value of the prices of a certain type of houses, while the $Y$ coordinate represents the value predicted by our linear model. When $X=Y$, the point lies exactly on the dotted line. In other words, the more precisely the model predicts, the closer the point is to the dotted line.
<p align="center">
<img src = "image/predictions.png" width=400><br/>
Figure 1. Predicted Value vs. Actual Value (波士顿房价预测->Prediction of Boston house prices; 预测价格->Predicted prices; 单位->Units; 实际价格->Actual prices)
<img src = "image/predictions_en.png" width=400><br/>
Figure 1. Predicted Value vs. Actual Value
</p>
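The points in Figure 1 can be produced from any trained linear model. The toy sketch below (not the chapter's code; the parameters and data are random placeholders rather than the trained UCI Housing model) shows how the predicted value $\hat{y} = \sum_d \omega_d x_d + b$ is computed for each held-out sample and paired with its actual value:

```python
import numpy as np

# Placeholder parameters; in the chapter these come from training on UCI Housing.
w = np.array([3.2, -1.5, 0.8])                      # one weight per feature
b = 12.0                                            # bias
X_test = np.random.rand(5, 3)                       # 5 held-out samples, 3 features
y_test = X_test @ w + b + 0.5 * np.random.randn(5)  # synthetic "actual" prices

y_pred = X_test @ w + b                             # \hat{y} = X w + b
for actual, predicted in zip(y_test, y_pred):
    print("actual=%.2f  predicted=%.2f" % (actual, predicted))
```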
## Model Overview
......@@ -85,8 +85,8 @@ There are at least three reasons for [Feature Normalization](https://en.wikipedi
- Many Machine Learning techniques or models (e.g., L1/L2 regularization and Vector Space Model) are based on the assumption that all the features have roughly zero means and their value ranges are similar.
<p align="center">
<img src = "image/ranges.png" width=550><br/>
Figure 2. The value ranges of the features (特征尺度->Feature value range)
<img src = "image/ranges_en.png" width=550><br/>
Figure 2. The value ranges of the features
</p>
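As a concrete illustration of the normalization discussed above, the following minimal numpy sketch (not the chapter's preprocessing code; the sample values are made up) rescales each feature column to zero mean and unit variance:

```python
import numpy as np

def normalize(features):
    """Column-wise z-score normalization: zero mean, unit variance per feature."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0               # guard against constant features
    return (features - mean) / std

# Three samples whose features have very different value ranges.
X = np.array([[0.02, 18.0, 396.9],
              [0.27,  0.0, 392.8],
              [8.98,  0.0, 394.5]])
print(normalize(X))
```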
#### Prepare Training and Test Sets
......
File mode changed from 100755 to 100644
......@@ -36,15 +36,8 @@ Figure 2. Fine-grained image classification
A good model should be able to recognize objects of different categories correctly, and at the same time correctly classify images taken from different viewpoints, under different illumination, with object distortion, or with partial occlusion (we call these image disturbances). Figure 3 shows some images with various disturbances. A good model should be able to classify these images correctly, as humans do.
<p align="center">
<img src="image/variations.png" width="550" ><br/>
<img src="image/variations_en.png" width="550" ><br/>
Figure 3. Disturbed images [22]
不同视角 ==> various perspective
不同大小 ==> various sizes
形变 ==> shape deformation
遮挡 ==> occlusion
不同光照 ==> various illumination
背景干扰 ==> cluttered background
同类异形 ==> homogeneous
</p>
## Model Overview
......@@ -72,13 +65,8 @@ Figure 4. Top-5 error rates on ILSVRC image classification
Traditional CNNs consist of convolutional and fully-connected layers, and employ a softmax multi-category classifier with cross-entropy as the loss function. Figure 5 shows a typical CNN. We first introduce the common parts of a CNN.
<p align="center">
<img src="image/lenet.png"><br/>
<img src="image/lenet_en.png"><br/>
Figure 5. A CNN example [20]
输入层 ==> input layer
卷积层 ==> convolutional layer
特征图 ==> feature maps
降采样(池化)层 ==> pooling layer
全连接层 ==> fully-connected layer
</p>
- convolutional layer: it uses the convolution operation to extract low-level and high-level features and to discover local correlation and spatial invariance.
......@@ -113,13 +101,8 @@ NIN model has two main characteristics: 1) it replaces the single-layer convolut
Figure 7 depicts two Inception blocks. Figure 7(a) is the simplest design; its output is a concatenation of features from three convolutional layers and one pooling layer. The disadvantage of this design is that the pooling layer does not change the number of channels, so the number of outputs grows. After passing through several such blocks, the number of outputs and parameters becomes larger and larger, leading to higher computational complexity. To overcome this drawback, the Inception block in Figure 7(b) employs three 1x1 convolutional layers to reduce the dimensionality, i.e., the number of channels, while also improving the non-linearity of the network (a concrete parameter-count comparison follows the figure).
<p align="center">
<img src="image/inception.png" width="800" ><br/>
<img src="image/inception_en.png" width="800" ><br/>
Figure 7. Inception block
输入层 ==> input layer
卷积层 ==> convolutional layer
最大池化层 ==> max-pooling layer
Inception简单模块 ==> Inception module, naive version
Inception含降维模块 ==> Inception module with dimensionality reduction
</p>
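To make the benefit of the 1x1 reduction concrete, the small calculation below compares parameter counts with and without it (the channel counts are illustrative, not the exact GoogLeNet configuration; convolution biases are ignored):

```python
# Parameters of a convolution = kernel_h * kernel_w * in_channels * out_channels
in_ch, out_ch, reduce_ch = 192, 32, 16            # illustrative channel counts

direct = 5 * 5 * in_ch * out_ch                   # a 5x5 convolution applied directly
reduced = 1 * 1 * in_ch * reduce_ch \
        + 5 * 5 * reduce_ch * out_ch              # 1x1 reduction followed by the 5x5

print("direct 5x5 convolution: %d parameters" % direct)   # 153600
print("with 1x1 reduction:     %d parameters" % reduced)  # 15872
```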
GoogLeNet consists of multiple stacked Inception blocks followed by an avg-pooling layer, as in NIN, in place of traditional fully-connected layers. The difference between GoogLeNet and NIN is that GoogLeNet adds a fully-connected layer after the avg-pooling layer to output a vector of the category size. Besides these two characteristics, the features from the middle layers of GoogLeNet are also very discriminative. Therefore, GoogLeNet inserts two auxiliary classifiers into the model to enhance the gradient and regularization during backpropagation. The loss function of the whole network is the weighted sum of the losses of these three classifiers.
......@@ -476,12 +459,8 @@ Tester.cpp:115] Test samples=10000 cost=1.99246 Eval: classification_error_eval
Figure 12 shows the curve of the training error rate, which indicates that the model converges at Pass 200 with an error rate of 8.54%.
<p align="center">
<img src="image/plot.png" width="400" ><br/>
<img src="image/plot_en.png" width="400" ><br/>
Figure 12. The error rate of VGG model on CIFAR10
训练轮数 ==> epoch
误差 ==> error
训练误差 ==> training error
测试误差 ==> test error
</p>
## Model Application
......
# Semantic Role Labeling
The source code for this tutorial is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/label_semantic_roles). For first-time use, please refer to the PaddlePaddle [installation tutorial](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
Source code of this chapter is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/label_semantic_roles).
## Background
......
......@@ -14,8 +14,8 @@ To address the problems mentioned above, statistical machine translation techniq
The recent development of deep learning provides new solutions to these challenges. There are mainly two categories of deep learning based machine translation techniques: 1) techniques based on the statistical machine translation system but with some key components, e.g., the language model and the reordering model, improved with neural networks (see the left part of Figure 1); 2) techniques that map from the source language to the target language directly using a neural network, i.e., end-to-end neural machine translation (NMT).
<p align="center">
<img src="image/nmt.png" width=400><br/>
Figure 1. Neural Network based Machine Translation (Words in Figure: 源语言: Source Language; 目标语言: Target Language; 统计机器翻译: Statistical Machine Translation; 神经网络: Neural Network)
<img src="image/nmt_en.png" width=400><br/>
Figure 1. Neural Network based Machine Translation.
</p>
......@@ -50,8 +50,8 @@ GRU\[[2](#References)\] proposed by Cho et al is a simplified LSTM and an extens
- update gate: it combines the input gate and the forget gate, and is used to control the impact of historical information on the hidden output. The historical information is passed through when the update gate is close to 1.
<p align="center">
<img src="image/gru.png" width=700><br/>
Figure 2. GRU (更新门: Update Gate; 重置门: Reset Gate; 节点状态: Node State; 输出: Output)
<img src="image/gru_en.png" width=700><br/>
Figure 2. A GRU Gate.
</p>
Generally speaking, sequences with short-distance dependencies will have an active reset gate, while sequences with long-distance dependencies will have an active update gate.
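For concreteness, here is a minimal numpy sketch of a single GRU step (not PaddlePaddle's implementation; parameter names are illustrative, biases are omitted, and the gating convention follows the description above, where an update gate close to 1 passes the previous state through):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step (biases omitted for brevity)."""
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate hidden state
    return z * h_prev + (1.0 - z) * h_tilde         # z close to 1 keeps the history

# Tiny example: input size 4, hidden size 3, random (untrained) parameters.
rng = np.random.RandomState(0)
Wz, Wr, Wh = (rng.randn(3, 4) for _ in range(3))
Uz, Ur, Uh = (rng.randn(3, 3) for _ in range(3))
print(gru_step(rng.randn(4), np.zeros(3), Wz, Uz, Wr, Ur, Wh, Uh))
```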
......@@ -64,8 +64,8 @@ We have already introduce one instance of bi-directional RNN in the Chapter of [
Specifically, this bi-directional RNN processes the input sequence in the original order and in reverse order, and then concatenates the output feature vectors at each time step as the final output. Thus, the output node at each time step contains information from both the past and the future as context. The figure below shows an unrolled bi-directional RNN. This network contains a forward RNN and a backward RNN with six weight matrices (see the sketch after the figure): the weight matrices from the input to the forward and backward hidden layers ($W_1, W_3$), the recurrent weight matrices from each hidden layer to itself ($W_2, W_5$), and the weight matrices from the forward and backward hidden layers to the output layer ($W_4, W_6$). Note that there are no connections between the forward and backward hidden layers.
<p align="center">
<img src="image/bi_rnn.png" width=450><br/>
Figure 3. Temporally unrolled bi-directional RNN (输出层: Output Layer; 后向隐层: Backward Hidden Layer; 前向隐层: Forward Hidden Layer; 输入层: Input Layer)
<img src="image/bi_rnn_en.png" width=450><br/>
Figure 3. Temporally unrolled bi-directional RNN.
</p>
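A minimal numpy sketch of the unrolled network (using simple tanh units rather than GRU/LSTM cells, with random illustrative parameters) makes the roles of $W_1,\ldots,W_6$ explicit; note that the forward and backward hidden states never feed into each other:

```python
import numpy as np

def bi_rnn(xs, W1, W2, W3, W4, W5, W6):
    """xs: list of input vectors; returns one output vector per time step."""
    T = len(xs)
    h_f = [np.zeros(W2.shape[0])]                 # forward hidden states
    for t in range(T):                            # left to right
        h_f.append(np.tanh(W1 @ xs[t] + W2 @ h_f[-1]))
    h_b = [np.zeros(W5.shape[0])]                 # backward hidden states
    for t in reversed(range(T)):                  # right to left
        h_b.append(np.tanh(W3 @ xs[t] + W5 @ h_b[-1]))
    h_b = list(reversed(h_b[1:]))                 # align backward states with time
    # The output at each step combines forward (past) and backward (future) context.
    return [W4 @ h_f[t + 1] + W6 @ h_b[t] for t in range(T)]

rng = np.random.RandomState(0)
xs = [rng.randn(4) for _ in range(5)]             # 5 time steps, input size 4
W1, W3 = rng.randn(3, 4), rng.randn(3, 4)         # input -> forward / backward hidden
W2, W5 = rng.randn(3, 3), rng.randn(3, 3)         # recurrent hidden -> hidden
W4, W6 = rng.randn(2, 3), rng.randn(2, 3)         # forward / backward hidden -> output
print(bi_rnn(xs, W1, W2, W3, W4, W5, W6))
```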
### Encoder-Decoder Framework
......@@ -73,9 +73,8 @@ Figure 3. Temporally unrolled bi-directional RNN (输出层: Output Layer; 后
The Encoder-Decoder\[[2](#References)\] framework aims to solve the mapping from one sequence to another, where both sequences can have arbitrary lengths. The source sequence is encoded into a vector by the encoder, which is then decoded into a target sequence by the decoder by maximizing the predictive probability. Both the encoder and the decoder are typically implemented as RNNs.
<p align="center">
<img src="image/encoder_decoder.png" width=700><br/>
Figure 4. Encoder-Decoder Framework (源语言词序列: Word Sequence for the Source Language; 源语编码状态: Word Embedding Sequence for the Source Language; 独热编码: One-hot Encoding; 词向量: Word Embedding; 隐层状态: Hidden State; 词概率: Word Probability; 词样本: Word Sample; 编码器: Encoder; 解码器: Decoder.)
**Note: there is an error in the original figure. The locations for 源语言词序列 and 源语编码状态 should be switched.**
<img src="image/encoder_decoder_en.png" width=700><br/>
Figure 4. Encoder-Decoder Framework.
</p>
#### Encoder
......@@ -91,8 +90,8 @@ There are three steps for encoding a sentence:
A bi-directional RNN can also be used in step 3 for more complicated sentence encoding. This can be implemented using a bi-directional GRU. The forward GRU encodes the source sequence in its original order, i.e., $(x_1,x_2,...,x_T)$, generating a sequence of hidden states $(\overrightarrow{h_1},\overrightarrow{h_2},...,\overrightarrow{h_T})$. Similarly, the backward GRU encodes the source sequence in the reverse order, i.e., $(x_T,x_{T-1},...,x_1)$, generating $(\overleftarrow{h_1},\overleftarrow{h_2},...,\overleftarrow{h_T})$. Then for each word $x_i$, its complete hidden state is the concatenation of the corresponding hidden states from the two GRUs, i.e., $h_i=\left [ \overrightarrow{h_i^T},\overleftarrow{h_i^T} \right ]^{T}$.
<p align="center">
<img src="image/encoder_attention.png" width=500><br/>
Figure 5. Encoder using bi-directional GRU (源语编码状态: Word Embedding Sequence for the Source Language; 词向量: Word Embedding; 独热编码: One-hot Encoding; 编码器: Encoder)
<img src="image/encoder_attention_en.png" width=500><br/>
Figure 5. Encoder using bi-directional GRU.
</p>
#### Decoder
......@@ -137,8 +136,8 @@ e_{ij}&=align(z_i,h_j)\\\\
where $align$ is an alignment model that measures the fitness between the $i$-th word in the target language sentence and the $j$-th word in the source sentence. More concretely, the fitness is computed from the $i$-th hidden state $z_i$ of the decoder RNN and the $j$-th context vector $h_j$ of the source sentence. Hard alignment is used in the conventional alignment model, meaning that each word in the target language explicitly corresponds to one or more words in the source language sentence. The attention model uses soft alignment, where any word in the source sentence may be related to any word in the target language sentence and the strength of the relation is a real number computed by the model; it can therefore be incorporated into the NMT framework and trained via back-propagation.
<p align="center">
<img src="image/decoder_attention.png" width=500><br/>
Figure 6. Decoder with Attention Mechanism ( 源语编码状态: Word Embedding Sequence for the Source Language; 权重: Attention Weight; 隐层状态: Hidden State; 词概率: Word Probability; 词样本: Word Sample; 解码器: Decoder.)
<img src="image/decoder_attention_en.png" width=500><br/>
Figure 6. Decoder with Attention Mechanism.
</p>
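The attention computation above can be sketched in a few lines of numpy (this is an illustration only; a dot product stands in for the learned $align$ network, and all vectors are random placeholders):

```python
import numpy as np

def attention_context(z_i, H):
    """z_i: decoder hidden state, shape (d,); H: source annotations, shape (T, d).
    Returns the attention weights a_ij and the context vector c_i = sum_j a_ij * h_j."""
    e = H @ z_i                     # e_ij = align(z_i, h_j); here a simple dot product
    a = np.exp(e - e.max())
    a = a / a.sum()                 # softmax over the source positions
    c = a @ H                       # weighted sum of the source annotations
    return a, c

rng = np.random.RandomState(0)
H = rng.randn(6, 8)                 # 6 source words, annotation size 8
z = rng.randn(8)                    # current decoder hidden state z_i
weights, context = attention_context(z, H)
print(weights, context)
```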
### Beam Search Algorithm
......
......@@ -45,14 +45,8 @@ $$ crossentropy(label, y) = -\sum_i label_ilog(y_i) $$
Fig. 2 shows a softmax regression network, with weights in black and biases in red; +1 indicates that the bias is 1.
<p align="center">
<img src="image/softmax_regression.png" width=400><br/>
<img src="image/softmax_regression_en.png" width=400><br/>
Fig. 2. Softmax regression network architecture<br/>
输入层 -> input layer<br/>
权重W -> weights W<br/>
激活前 -> before activation<br/>
激活函数 -> activation function<br/>
输出层 -> output layer<br/>
偏置b -> bias b<br/>
</p>
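A minimal numpy sketch of the forward pass of this softmax regression and of the cross-entropy loss $-\sum_i label_i \log(y_i)$ follows (the weights here are random placeholders, not a trained MNIST model):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(label_onehot, y):
    return -np.sum(label_onehot * np.log(y + 1e-12))

rng = np.random.RandomState(0)
x = rng.rand(784)                    # a flattened 28x28 image
W = 0.01 * rng.randn(10, 784)        # weights W (untrained placeholder)
b = np.zeros(10)                     # bias b

y = softmax(W @ x + b)               # predicted probabilities for the 10 digits
label = np.zeros(10); label[3] = 1.0 # one-hot label for the digit 3
print("cross-entropy loss =", cross_entropy(label, y))
```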
### Multilayer Perceptron
......@@ -66,12 +60,9 @@ The Softmax regression model described above uses the simplest two-layer neural
Fig. 3 shows a Multilayer Perceptron network, with weights in black and biases in red; +1 indicates that the bias is 1.
<p align="center">
<img src="image/mlp.png" width=500><br/>
<img src="image/mlp_en.png" width=500><br/>
Fig. 3. Multilayer Perceptron network architecture<br/>
输入层X -> input layer X<br/>
隐藏层$H_1$(含激活函数) -> hidden layer $H_1$ (including activation function)<br/>
隐藏层$H_2$(含激活函数) -> hidden layer $H_2$ (including activation function)<br/>
输出层Y -> output layer Y<br/>
</p>
### Convolutional Neural Network
......@@ -79,10 +70,8 @@ Fig. 3. Multilayer Perceptron network architecture<br/>
#### Convolutional Layer
<p align="center">
<img src="image/conv_layer.png" width=500><br/>
<img src="image/conv_layer_en.png" width=500><br/>
Fig. 4. Convolutional layer<br/>
输入数据 -> input data<br/>
卷积输出 -> convolution output<br/>
</p>
The convolutional layer is the core of a Convolutional Neural Network. The parameters of this layer are composed of a set of filters, or kernels. In the forward step, each kernel slides horizontally and vertically over the input; at each position we compute the dot product of the kernel and the input, add a bias, and apply an activation function. The result is a two-dimensional activation map. For example, one kernel may recognize corners while another recognizes circles; each convolution kernel responds strongly to its corresponding feature.
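The sliding dot product described above can be spelled out directly. The sketch below is a single-channel, stride-1, no-padding illustration (real convolutional layers add channels, padding, bias, and an activation function):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as commonly used in CNN layers)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])      # a tiny edge-like filter
print(conv2d(image, kernel))          # a 4x4 activation map
```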
......@@ -92,9 +81,8 @@ Fig. 4 is a dynamic graph of a convolutional layer, where depths are not shown f
#### Pooling Layer
<p align="center">
<img src="image/max_pooling.png" width="400px"><br/>
<img src="image/max_pooling_en.png" width="400px"><br/>
Fig. 5 Pooling layer<br/>
输入数据 -> input data<br/>
</p>
A pooling layer performs downsampling. Its main purpose is to reduce computation by reducing the number of network parameters, and it also prevents overfitting to some extent. Usually, a pooling layer is added after a convolutional layer. Pooling layers come in various types, such as max pooling and average pooling. Max pooling partitions the input into rectangular regions and outputs the maximum value of each region (Fig. 5).
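Max pooling can be sketched in the same spirit (non-overlapping windows; the input is cropped if its size is not a multiple of the window size):

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    H, W = feature_map.shape
    cropped = feature_map[:H - H % size, :W - W % size]
    blocks = cropped.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))    # maximum within each window

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))                  # [[ 5.  7.] [13. 15.]]
```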
......@@ -102,13 +90,8 @@ A Pooling layer performs downsampling. The main functionality of this layer is t
#### LeNet-5 Network
<p align="center">
<img src="image/cnn.png"><br/>
<img src="image/cnn_en.png"><br/>
Fig. 6. LeNet-5 Convolutional Neural Network architecture<br/>
特征图 -> feature map<br/>
卷积层 -> convolutional layer<br/>
降采样层 -> downsampling layer<br/>
全连接层 -> fully connected layer<br/>
输出层(全连接+Softmax激活) -> output layer (fully connected + softmax activation)<br/>
</p>
[LeNet-5](http://yann.lecun.com/exdb/lenet/) is one of the simplest Convolutional Neural Networks. Fig. 6 shows its architecture: a 2-dimensional input image is fed into two sets of convolutional and pooling layers, and the output is then fed into a fully-connected layer and a softmax classifier. The following three properties of convolution enable LeNet-5 to recognize images better than a multilayer fully-connected perceptron:
......@@ -355,12 +338,8 @@ python evaluate.py softmax_train.log
### Training Results for Softmax Regression
<p align="center">
<img src="image/softmax_train_log.png" width="400px"><br/>
<img src="image/softmax_train_log_en.png" width="400px"><br/>
Fig. 7 Softmax regression error curve<br/>
训练集 -> training set<br/>
测试集 -> test set<br/>
平均代价 -> average cost<br/>
训练轮数 -> epoch<br/>
</p>
Evaluation results of the models:
......@@ -375,12 +354,8 @@ From the evaluation results, the best pass for softmax regression model is pass-
### Results of Multilayer Perceptron
<p align="center">
<img src="image/mlp_train_log.png" width="400px"><br/>
<img src="image/mlp_train_log_en.png" width="400px"><br/>
Fig. 8. Multilayer Perceptron error curve<br/>
训练集 -> training set<br/>
测试集 -> test set<br/>
平均代价 -> average cost<br/>
训练轮数 -> epoch<br/>
</p>
Evaluation results of the models:
......@@ -395,12 +370,8 @@ From the evaluation results, the final training accuracy is 94.95%. It is signif
### Training results for Convolutional Neural Network
<p align="center">
<img src="image/cnn_train_log.png" width="400px"><br/>
<img src="image/cnn_train_log_en.png" width="400px"><br/>
Fig. 9. Convolutional Neural Network error curve<br/>
训练集 -> training set<br/>
测试集 -> test set<br/>
平均代价 -> average cost<br/>
训练轮数 -> epoch<br/>
</p>
Results of model evaluation:
......
......@@ -32,10 +32,10 @@ CNN mainly contains convolution and pooling operation, with various extensions.
<p align="center">
<img src="image/text_cnn.png" width = "80%" align="center"/><br/>
<img src="image/text_cnn_en.png" width = "80%" align="center"/><br/>
Figure 1. CNN for text modeling.
将一句话表示为矩阵 -> represent a sentence as a $n\times k$ matrix; 使用不同大小的卷积层 -> apply convolution of different kernel sizes; 时间维最大池化 -> max-pooling across temporal channel; 全连接层 -> fully-connected layer.
</p>
Assume the sentence has length $n$, and the $i$-th word has embedding $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimensionality.
First, we concatenate consecutive words: every window of $h$ words forms a vector $x_{i:i+h-1}$, the concatenation of $x_{i},x_{i+1},\ldots,x_{i+h-1}$, where $i$ is the index of the first word in the window and ranges from $1$ to $n-h+1$, so $x_{i:i+h-1}\in\mathbb{R}^{hk}$.
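The windowing step can be made concrete with a short numpy sketch (the sentence matrix here is random; in the model it is produced by the word embedding layer):

```python
import numpy as np

n, k, h = 7, 5, 3                      # sentence length, embedding size, window size
sentence = np.random.rand(n, k)        # one embedding vector x_i per word

# Each window x_{i:i+h-1} concatenates h consecutive embeddings into a vector in R^{hk}.
windows = np.stack([sentence[i:i + h].reshape(h * k) for i in range(n - h + 1)])
print(windows.shape)                   # (n - h + 1, h * k) = (5, 15)
```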
......@@ -87,10 +87,12 @@ h_t & = o_t\odot tanh(c_t)\\\\
\end{align}
In the equations, $i_t, f_t, c_t, o_t$ stand for the input gate, forget gate, memory cell, and output gate, respectively; $W$ and $b$ are model parameters, $tanh$ is the hyperbolic tangent, and $\odot$ denotes the element-wise product. The input gate controls the magnitude of new input into the memory cell $c$; the forget gate controls how much memory is propagated from the last time step; the output gate controls the output magnitude. The three gates are computed in the same way with different parameters, and they influence the memory cell $c$ separately, as shown in Figure 3:
<p align="center">
<img src="image/lstm.png" width = "65%" align="center"/><br/>
Figure 3. LSTM at time step $t$ [7]. 输入门 -> input gate, 记忆单元 -> memory cell, 遗忘门 -> forget gate, 输出门 -> output gate.
<img src="image/lstm_en.png" width = "65%" align="center"/><br/>
Figure 3. LSTM at time step $t$ [7].
</p>
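A minimal numpy sketch of one LSTM step following the gate equations above (parameter names and shapes are illustrative; a real implementation also handles batching and per-gate biases):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """W, U, b stack the parameters of the input, forget, output gates and the
    candidate cell; for hidden size d, W has shape (4d, input size)."""
    d = h_prev.shape[0]
    gates = W @ x + U @ h_prev + b
    i = sigmoid(gates[0:d])            # input gate
    f = sigmoid(gates[d:2 * d])        # forget gate
    o = sigmoid(gates[2 * d:3 * d])    # output gate
    c_tilde = np.tanh(gates[3 * d:])   # candidate memory cell
    c = f * c_prev + i * c_tilde       # new memory cell
    h = o * np.tanh(c)                 # h_t = o_t * tanh(c_t)
    return h, c

rng = np.random.RandomState(0)
d, n_in = 3, 4                         # hidden size, input size
h, c = lstm_step(rng.randn(n_in), np.zeros(d), np.zeros(d),
                 rng.randn(4 * d, n_in), rng.randn(4 * d, d), np.zeros(4 * d))
print(h, c)
```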
LSTM enhances the ability to model long-term dependencies with the help of the memory cell and the gates. A similar but simpler structure, the Gated Recurrent Unit (GRU)\[[8](Reference)\], has also been proposed. **These structures are still similar to a plain RNN, with some modifications (as shown in Figure 2): the hidden state depends on the input as well as the hidden state of the previous time step, and the process repeats recurrently until all inputs are consumed:**
$$ h_t=Recurrent(x_t,h_{t-1})$$
......@@ -102,8 +104,8 @@ For vanilla LSTM, $h_t$ contains input information from previous time-step $1..t
As shown in Figure 4 (a 3-layer RNN), odd-numbered layers are forward LSTMs and even-numbered layers are reverse LSTMs. Higher LSTM layers take the output of lower layers as input, and the top LSTM layer is followed by max-pooling to produce a fixed-length vector (this representation incorporates context from both preceding and succeeding words for higher-level abstraction). Finally, the vector is fed into a softmax layer for classification.
<p align="center">
<img src="image/stacked_lstm.jpg" width=450><br/>
Figure 4. Stacked Bidirectional LSTM for NLP modeling. 词向量映射 -> word embedding mapping; 全连接层 -> fully-connected layer; 反向LSTM -> reverse-directional LSTM; 池化层 -> pooling layer.
<img src="image/stacked_lstm_en.png" width=450><br/>
Figure 4. Stacked Bidirectional LSTM for NLP modeling.
</p>
## Data Preparation
......@@ -336,6 +338,7 @@ outputs(classification_cost(input=output, label=data_layer('label', 1)))
```
Our model defined in `trainer_config.py` uses the `stacked_lstm_net` structure by default. If you want to use `convolution_net` instead, comment out the related lines.
```python
stacked_lstm_net(
dict_dim, class_dim=class_dim, stacked_num=3, is_predict=is_predict)
......
......@@ -96,18 +96,10 @@ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ represents the conditional probability of the current word $w_t$ given its previous $n-1$ words, and $R(\theta)$ represents the parameter regularization term.
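To see what the model consumes, the toy sketch below (an illustrative sentence, not the chapter's data pipeline) builds the (previous $n-1$ words, current word) training pairs for a 5-gram model:

```python
sentence = "we are about to study the idea of a computational process".split()
word_id = {w: i for i, w in enumerate(sorted(set(sentence)))}
ids = [word_id[w] for w in sentence]

n = 5   # 5-gram: predict the current word from the previous n-1 = 4 words
pairs = [(ids[t - n + 1:t], ids[t]) for t in range(n - 1, len(ids))]
for context, target in pairs[:3]:
    print(context, "->", target)      # e.g. [id_1, id_2, id_3, id_4] -> id_5
```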
<p align="center">
<img src="image/nnlm.png" width=500><br/>
<img src="image/nnlm_en.png" width=500><br/>
Figure 2. N-gram neural network model
</p>
(Translation of words in Figure 2:
- 输入:Input;
- 全连接:Fully-Connected Layer
- 词向量:Word Embedding
- 词向量连接:Word Embedding Concatenation
- 分类:Classification
- 词ID: Word ID)
Figure 2 shows the N-gram neural network model. From the bottom up, the model has the following components:
......@@ -136,16 +128,10 @@ Figure 2 shows the N-gram neural network model. From the bottom up, the model ha
The CBOW model predicts the current word based on the $N$ words both before and after it. When $N=2$, the model is as shown in the figure below:
<p align="center">
<img src="image/cbow.png" width=250><br/>
<img src="image/cbow_en.png" width=250><br/>
Figure 3. CBOW model
</p>
(Translation of words in Figure 3:
- 输入词:Input Word
- 词向量:Word Embedding
- 输出词:Output Word)
Specifically, ignoring the order of words in the sequence, CBOW uses the average of the word embeddings of the context to predict the current word:
$$\text{context} = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
......@@ -157,16 +143,10 @@ where $x_t$ is the word embedding of the t-th word, classification score vector
The advantage of CBOW is that it smooths over the word embeddings of the context and reduces noise, so it is very effective on small datasets. Skip-gram uses a word to predict its context, obtaining multiple context words for the given word, so it can be used on larger datasets.
<p align="center">
<img src="image/skipgram.png" width=250><br/>
<img src="image/skipgram_en.png" width=250><br/>
Figure 4. Skip-gram model
</p>
(Translation of words in Figure 4:
- 输入词:Input Word
- 词向量:Word Embedding
- 输出词:Output Word)
As illustrated in the figure above, the skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (covering the $n$ words before and the $n$ words after the given word), and then combines the classification losses of all those $2n$ words by softmax.
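A short sketch of how skip-gram training pairs can be formed (toy sentence and window size chosen for illustration; see the Data Preparation section below for the actual pipeline):

```python
sentence = "the quick brown fox jumps over the lazy dog".split()
n = 2    # use n words before and n words after the given word as its context

pairs = []
for t, word in enumerate(sentence):
    for j in range(max(0, t - n), min(len(sentence), t + n + 1)):
        if j != t:
            pairs.append((word, sentence[j]))   # (input word, context word)

print(pairs[:6])
```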
## Data Preparation
......@@ -177,13 +157,6 @@ As illustrated in the figure above, skip-gram model maps the word embedding of t
Figure 5. N-gram neural network model in model configuration
</p>
(Translation of words in Figure 5:
- 词向量映射: Word Embedding Mapping
- 词向量连接: Word Embedding Concatenation
- 全连接层: Fully-Connected Layer
- 隐层: Hidden Layer)
## Model Training
......