ENH: Sudnya's edits to chapter 2 and the corresponding html generated

b24237ab · sudnya · 424003fc · b24237ab · b24237ab

Showing 2 changed files with 97 additions and 90 deletions

recognize_digits/README.en.md +46 -45

recognize_digits/index.en.html +51 -45

--- a/recognize_digits/README.en.md
+++ b/recognize_digits/README.en.md
 # Recognize Digits
-Source code of this tutorial is under [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/recognize_digits). For the first-time use, please refer to PaddlePaddle [installation instructions](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
+The source code for this tutorial is under [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/recognize_digits). First-time readers, please refer to PaddlePaddle [installation instructions](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
 ## Introduction
-When we learn programming, the first program is usually printing “Hello World.” In Machine Learning, or Deep Learning, this is handwritten digit recognition with [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. Handwriting recognition is a typical image classification problem. The problem is relatively easy, and MNIST is a complete dataset. As a simple Computer Vision dataset, MNIST contains images of handwritten digits and corresponding labels (Fig. 1). An image is a 28x28 matrix, and a label is to one of the 10 digits from 0 to 9. Each image is normalized in size and centered.
+When we learn a new programming language, the first task is usually to write a program that prints "Hello World." In Machine Learning or Deep Learning, the equivalent task is to train a model to perform handwritten digit recognition with the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. Handwriting recognition is a typical image classification problem. The problem is relatively easy, and MNIST is a complete dataset. As a simple Computer Vision dataset, MNIST contains images of handwritten digits and their corresponding labels (Fig. 1). The input image is a 28x28 matrix, and the label is one of the digits from 0 to 9. Each image is normalized in size and centered.
 <p align="center">
 <img src="image/mnist_example_image.png" width="400"><br/>
 Fig. 1. Examples of MNIST images
 </p>
-MNIST dataset is made from [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and Special Database 1 (SD-1). Since SD-3 is labeled by staffs in U.S. Census Bureau, while SD-1 is labeled by high school students in U.S., SD-3 is cleaner and easier to recognize than SD-1 is. Yann LeCun et al. used half of samples from each of SD-1 and SD-3 to make MNIST training set (60,000 samples) and test set (10,000 samples), where training set was labeled by 250 different annotators, and it was guaranteed that annotators of training set and test set are not completely overlapped.
+The MNIST dataset is created from the [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and Special Database 1 (SD-1). SD-3 was written by the staff of the U.S. Census Bureau, while SD-1 was written by high school students in the U.S. Therefore, SD-3 is cleaner and easier to recognize than SD-1. Yann LeCun et al. used half of the samples from each of SD-1 and SD-3 to create the MNIST training set (60,000 samples) and test set (10,000 samples). The training set contains samples from approximately 250 different writers, and it was guaranteed that the writers of the training set and the test set do not overlap.
-Yann LeCun, one of the founders of Deep Learning, had huge contribution on handwritten character recognition in early dates, and proposed CNN (Convolutional Neural Network), which drastically improved recognition capability for handwritten characters. CNN is now a critical key for Deep Learning. From Yann LeCun’s first proposal of LeNet, to those winning models in ImageNet, such as VGGNet, GoogLeNet, ResNet, etc. (Please refer to [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification) tutorial), CNN achieved a series of impressive results in Image Classification tasks.
+Yann LeCun, one of the founders of Deep Learning, made major contributions to handwritten character recognition in the early days and proposed the CNN (Convolutional Neural Network), which drastically improved recognition capability for handwritten characters. CNNs are now a critical concept in Deep Learning. From Yann LeCun's first proposal of LeNet to the winning models in ImageNet, such as VGGNet, GoogLeNet and ResNet (please refer to the [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification) tutorial), CNNs have achieved a series of impressive results in image classification tasks.
-Many algorithms are tested on MNIST. In 1998, LeCun experimented single layer linear classifier, MLP (Multilayer Perceptron) and Multilayer CNN LeNet. These algorithms constantly reduced test error from 12% to 0.7% \[[1](#References)\]. Since then, researchers worked on many algorithms such as k-NN (K-Nearest Neighbors) \[[2](#References)\], Support Vector Machine (SVM) \[[3](#References)\], Neural Networks \[[4-7](#References)\] and Boosting \[[8](#References)\], and applied various preprocessing methods, such as distortion removal, noise removal and blurring, to increase recognition accuracy.
+Many algorithms have been tested on MNIST. In 1998, LeCun experimented with a single-layer linear classifier, an MLP (Multilayer Perceptron) and the multilayer CNN LeNet. These algorithms successively reduced the test error from 12% to 0.7% \[[1](#References)\]. Since then, researchers have worked on many algorithms such as k-NN (K-Nearest Neighbors) \[[2](#References)\], Support Vector Machine (SVM) \[[3](#References)\], Neural Networks \[[4-7](#References)\] and Boosting \[[8](#References)\]. Various preprocessing methods, such as distortion removal, noise removal and blurring, have also been applied to increase recognition accuracy.
+In this tutorial, we tackle the task of handwritten character recognition. We start with a simple softmax regression model and guide our readers step-by-step to improve this model's performance on the task of recognition.
-In this tutorial, we start from simple softmax regression model, and guide readers to introduction of handwritten character recognition, and step-by-step improvement of models.
 ## Model Overview
 Before introducing classification algorithms and training procedure, we provide some definitions:
- $X$ is input: Input is $28\times28$ MNIST image. It is flattened to $784$ dimensional vector. $X=\left ( x_0, x_1, \dots, x_{783} \right )$.
+- $X$ is the input: Input is a $28\times28$ MNIST image. It is flattened to a $784$ dimensional vector. $X=\left ( x_0, x_1, \dots, x_{783} \right )$.
- $Y$ is output: Output of classifier is 10 class digits from 0 to 9. $Y=\left ( y_0, y_1, \dots, y_9 \right )$. Each dimension $y_i$ represents a probability that the image belongs to $i$.
+- $Y$ is the output: The output of the classifier is a probability distribution over the 10 classes (digits from 0 to 9). $Y=\left ( y_0, y_1, \dots, y_9 \right )$. Each dimension $y_i$ represents the probability that the input image belongs to class $i$.
- $L$ is ground truth label: $L=\left ( l_0, l_1, \dots, l_9 \right )$ It is also 10 dimensional, but only one dimension is 1 and others are all 0.
+- $L$ is the ground truth label: $L=\left ( l_0, l_1, \dots, l_9 \right )$. It is also 10 dimensional, but only one dimension is 1 and all others are 0.
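The definitions of $X$ and $L$ can be illustrated with a short sketch. It uses plain Python lists and a dummy all-zero image instead of a real MNIST sample; the helper names `flatten` and `one_hot` are hypothetical and are not part of the tutorial's code.

```python
def flatten(image):
    """Flatten a 2-D image (a list of rows) into a 1-D list, row by row."""
    return [pixel for row in image for pixel in row]


def one_hot(digit, num_classes=10):
    """Build the ground truth vector L: 1 at index `digit`, 0 elsewhere."""
    return [1 if i == digit else 0 for i in range(num_classes)]


# A dummy 28x28 image stands in for a real MNIST sample (assumption).
image = [[0.0] * 28 for _ in range(28)]
x = flatten(image)
label = one_hot(3)

print(len(x))   # 784
print(label)    # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```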
 ### Softmax Regression
-The simplest softmax regression model is to feed input into fully connected layers, and directly use softmax for multiclass classification \[[9](#References)\].
+In a simple softmax regression model, the input is fed to a fully connected layer and a softmax function is applied to get the probabilities of the output classes \[[9](#References)\].
-Input $X$ is multiplied with weights $W$, added by bias $b$, and activated.
+Input $X$ is multiplied with weights $W$, and bias $b$ is added to generate activations.
 $$ y_i = softmax(\sum_j W_{i,j}x_j + b_i) $$
 where $ softmax(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
-For a $N$ class classification problem with $N$ output nodes, a $N$ dimensional vector is normalized to $N$ real values in [0, 1], each representing the probability of the sample to belong to the class. Here $y_i$ is the prediction probability that an image is digit $i$.
+For an $N$ class classification problem with $N$ output nodes, an $N$ dimensional vector is normalized to $N$ real values in the range [0, 1] that sum to 1, each representing the probability that the sample belongs to the corresponding class. Here $y_i$ is the predicted probability that an image is digit $i$.
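As a minimal sketch of the softmax normalization described above, the following standard-library Python function turns a list of scores into probabilities. Subtracting the maximum score is a common numerical-stability trick and is an addition of this sketch, not something the tutorial's code is known to do.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged
    # but keeps math.exp from overflowing on large scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest score receives the largest probability
print(sum(probs))  # approximately 1.0
```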
-In classification problem, we usually use cross entropy loss function:
+In such a classification problem, we usually use the cross entropy loss function:
 $$  crossentropy(label, y) = -\sum_i label_ilog(y_i) $$
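The cross entropy formula can be checked on a small example. This sketch assumes a one-hot `label` and a probability vector `y`; the three-class values are made up for illustration.

```python
import math


def cross_entropy(label, y):
    """Cross entropy between a one-hot label and predicted probabilities y."""
    # Terms with label_i == 0 contribute nothing, so they are skipped;
    # this also avoids evaluating log(0) for non-target classes.
    return -sum(l * math.log(p) for l, p in zip(label, y) if l > 0)


label = [0, 0, 1]
confident = [0.05, 0.05, 0.90]   # most probability mass on the correct class
unsure = [0.40, 0.40, 0.20]      # little probability mass on the correct class

loss_confident = cross_entropy(label, confident)
loss_unsure = cross_entropy(label, unsure)
print(loss_confident, loss_unsure)
```

The loss is smaller when the predicted probability of the correct class is closer to 1, which is what training tries to achieve.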
-Fig. 2 is softmax regression network, with weights in black, and bias in red. +1 indicates bias is 1.
+Fig. 2 shows a softmax regression network, with weights in black, and bias in red. +1 indicates bias is 1.
 <p align="center">
 <img src="image/softmax_regression.png" width=400><br/>
@@ -56,11 +57,11 @@ Fig. 2. Softmax regression network architecture<br/>
 ### Multilayer Perceptron
-Softmax regression model uses the simplest two layer neural network, i.e. it only contains input layer and output layer, so that it's regression ability is limited. To achieve better recognition effect, we consider adding several hidden layers \[[10](#References)\] between the input layer and the output layer.
+The Softmax regression model described above uses the simplest two-layer neural network, i.e. it only contains an input layer and an output layer. So its regression ability is limited. To achieve better recognition results, we consider adding several hidden layers \[[10](#References)\] between the input layer and the output layer.
-1.  After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is activation function. Some common ones are sigmoid, tanh and ReLU.
+1.  After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function. Some common ones are sigmoid, tanh and ReLU.
 2.  After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
-3.  Finally, after output layer, we get $Y=softmax(W_3H_2 + b_3)$, the last classification result vector.
+3.  Finally, after the output layer, we get $Y=softmax(W_3H_2 + b_3)$, the final classification result vector.
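The three steps above can be sketched as a forward pass in plain Python. The hidden layer sizes (128 and 64), the random weight initialization and the helper names are assumptions made for this illustration; they are not taken from the tutorial's PaddlePaddle configuration.

```python
import math
import random


def relu(v):
    """ReLU activation applied element-wise."""
    return [max(0.0, x) for x in v]


def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, x, b):
    """Compute W x + b for a weight matrix W (list of rows) and vectors x, b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def random_layer(n_out, n_in, rng):
    """Create small random weights and zero biases (illustrative only)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b


def mlp_forward(x, layers):
    (W1, b1), (W2, b2), (W3, b3) = layers
    h1 = relu(affine(W1, x, b1))         # H_1 = phi(W_1 X + b_1)
    h2 = relu(affine(W2, h1, b2))        # H_2 = phi(W_2 H_1 + b_2)
    return softmax(affine(W3, h2, b3))   # Y = softmax(W_3 H_2 + b_3)


rng = random.Random(0)
layers = [
    random_layer(128, 784, rng),  # hidden sizes are assumptions
    random_layer(64, 128, rng),
    random_layer(10, 64, rng),
]
x = [rng.random() for _ in range(784)]   # stand-in for a flattened image
y = mlp_forward(x, layers)
print(len(y), sum(y))
```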
 Fig. 3 shows a Multilayer Perceptron network, with weights in black, and bias in red. +1 indicates bias is 1.
@@ -84,9 +85,9 @@ Fig. 4. Convolutional layer<br/>
 卷积输出 -> convolution output<br/>
 </p>
-Convolutional layer is the core of Convolutional Neural Network. The parameters in this layer are composed of a set of filters, or kernels. In forward step, each kernel moves horizontally and vertically, and compute dot product of the kernel and the input on corresponding positions, then add bias and apply activation function. The result is two dimensional activation map. For example, some kernel may recognize corners, and some may recognize circles. These convolution kernels may respond strongly to the corresponding features.
+The convolutional layer is the core of a Convolutional Neural Network. The parameters in this layer are composed of a set of filters, or kernels. In the forward step, each kernel moves horizontally and vertically across the input. At each position, we compute the dot product of the kernel and the corresponding input region, add a bias, and apply an activation function. The result is a two-dimensional activation map. For example, some kernels may recognize corners, and some may recognize circles. Each kernel responds strongly to its corresponding feature.
-Fig. 4 is a dynamic graph of convolutional layer, where depths are not shown for simplicity. Input is $W_1=5,H_1=5,D_1=3$. In fact, this is a common representation for colored images. The width and the height of a colored image correspond to $W_1$ and $H_1$, respectively, and the 3 color channels for RGB correspond to $D_1$. The parameters of the convolutional layer are $K=2,F=3,S=2,P=1$. $K$ is the number of kernels. Here, $Filter W_0$ and $Filter   W_1$ are two kernels. $F$ is kernel size. $W0$ and $W1$ are both $3\times3$ matrix in all depths. $S$ is stride. Kernels moves leftwards or downwards by 2 units each time. $P$ is padding, an extension of the input. The gray area in the figure shows zero padding with size 1.
+Fig. 4 is a dynamic graph of a convolutional layer, where depths are not shown for simplicity. The input is $W_1=5, H_1=5, D_1=3$. In fact, this is a common representation for colored images: $W_1$ and $H_1$ correspond to the width and height of the image respectively, and $D_1$ corresponds to the 3 color channels for RGB. The parameters of the convolutional layer are $K=2, F=3, S=2, P=1$. $K$ is the number of kernels. Here, $Filter W_0$ and $Filter W_1$ are the two kernels. $F$ is the kernel size: $W_0$ and $W_1$ are both $3\times3$ matrices at every depth. $S$ is the stride: kernels move rightwards or downwards by 2 units each time. $P$ is the padding, an extension of the input. The gray area in the figure shows zero padding with size 1.
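With these parameters, the size of the output volume can be worked out using the usual output-size formula $(W_1 - F + 2P)/S + 1$, as given in the Stanford notes linked in this tutorial. The function name below is hypothetical; the sketch only does the arithmetic for the values in Fig. 4.

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Number of kernel positions along one dimension of the padded input."""
    span = input_size - kernel_size + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("kernel does not tile the padded input evenly")
    return span // stride + 1


W1, H1, D1 = 5, 5, 3        # input width, height, depth
K, F, S, P = 2, 3, 2, 1     # kernels, kernel size, stride, padding

W2 = conv_output_size(W1, F, S, P)
H2 = conv_output_size(H1, F, S, P)
D2 = K                      # one activation map per kernel
num_params = F * F * D1 * K + K   # weights plus one bias per kernel

print(W2, H2, D2)    # 3 3 2
print(num_params)    # 56
```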
 #### Pooling Layer
@@ -96,7 +97,7 @@ Fig. 5 Pooling layer<br/>
 输入数据 -> input data<br/>
 </p>
-Pooling layer performs downsampling. The main functionality is to reduce computation by reducing network parameters. It also prevents overfitting to some extent. Usually, a pooling layer is added after a convolutional layer. Pooling layer includes max pooling, average pooling, etc. Max pooling uses rectangles to segment input layer into several parts, and compute maximum value in each part as output (Fig. 5.)
+A pooling layer performs downsampling. The main functionality of this layer is to reduce computation by reducing the network parameters. It also prevents overfitting to some extent. Usually, a pooling layer is added after a convolutional layer. There are various types of pooling layers, such as max pooling and average pooling. Max pooling uses rectangles to segment the input layer into several parts and computes the maximum value in each part as the output (Fig. 5).
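A minimal sketch of max pooling on a single-channel input follows. The 2x2 window with stride 2 and the sample values are assumptions chosen for illustration and do not reproduce the numbers in Fig. 5.

```python
def max_pool(image, size=2, stride=2):
    """Max pooling over a 2-D list using square windows."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect the values in the current window and keep the largest.
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            out_row.append(max(window))
        out.append(out_row)
    return out


image = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool(image)
print(pooled)  # [[6, 8], [3, 4]]
```

The 4x4 input shrinks to 2x2, so the layers that follow process a quarter as many values.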
 #### LeNet-5 Network 
@@ -110,30 +111,30 @@ Fig. 6. LeNet-5 Convolutional Neural Network architecture<br/>
 输出层(全连接+Softmax激活) -> output layer (fully connected + softmax activation)<br/>
 </p>
-[LeNet-5](http://yann.lecun.com/exdb/lenet/) is one of the simplest Convolutional Neural Networks. Fig. 6. shows its architecture: 2 dimensional image input is fed into two sets of convolutional layer and pooling layer, then it is fed into fully connected layer and softmax classifier. The following three properties of convolution enable LeNet-5 to better recognize images than Multilayer fully-connected perceptrons:
+[LeNet-5](http://yann.lecun.com/exdb/lenet/) is one of the simplest Convolutional Neural Networks. Fig. 6 shows its architecture: a 2-dimensional input image is fed into two sets of convolutional and pooling layers; the output is then fed to a fully connected layer and a softmax classifier. The following three properties of convolution enable LeNet-5 to recognize images better than multilayer fully connected perceptrons:
- 3D properties of neurons: a convolutional layer is organized by width, height and depth. Neurons in each layer are connected to only a small region in previous layer. This region is called receptive field.
+- 3D properties of neurons: a convolutional layer is organized by width, height and depth. Neurons in each layer are connected to only a small region in the previous layer. This region is called the receptive field.
- Local connection: CNN utilizes local space correlation by connecting local neurons. This design guarantees learned filter has strong response to local input features. Stacking many such layers leads non-linear filter becomes more and more global. This allows the network to first obtain good representation for a small parts of input, then combine them to represent larger region.
+- Local connection: A CNN utilizes local spatial correlation by connecting local neurons. This design guarantees that the learned filter has a strong response to local input features. Stacking many such layers generates a non-linear filter that is more global. This enables the network to first obtain good representations for small parts of the input and then combine them to represent a larger region.
- Sharing weights: In CNN, computation is iterated with shared parameters (weights and bias) to form a feature map. This means all neurons in the same depth of output respond to the same feature. This allows detecting a feature regardless of its position in the input, and enables a property of translation equivariance.
+- Sharing weights: In a CNN, computation is iterated on shared parameters (weights and bias) to form a feature map. This means all neurons in the same depth of the output respond to the same feature. This allows detecting a feature regardless of its position in the input and enables translation equivariance.
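The weight sharing property can be illustrated in one dimension. In this sketch, which uses a made-up signal and a hypothetical two-element kernel, the same kernel is applied at every position, so shifting the input pattern shifts the response by the same amount.

```python
def correlate1d(signal, kernel):
    """Slide one shared kernel over the signal (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [-1, 1]                      # responds to a rising edge
signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 0, 0]    # same pattern moved right by 2

response = correlate1d(signal, kernel)
response_shifted = correlate1d(shifted, kernel)
print(response)          # [0, 1, 0, -1, 0, 0, 0]
print(response_shifted)  # [0, 0, 0, 1, 0, -1, 0]
```

The rising edge is detected at position 1 in the first signal and at position 3 in the shifted one, without learning a separate detector for each location.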
-For more details of Convolutional Neural Network, please refer to [Stanford open course]( http://cs231n.github.io/convolutional-networks/ ) and [Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) tutorial.
+For more details on Convolutional Neural Networks, please refer to [this Stanford open course]( http://cs231n.github.io/convolutional-networks/ ) and [this Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) tutorial.
 ### List of Common Activation Functions  
- Sigmoid activation function： $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $
+- Sigmoid activation function: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $
- Tanh activation function： $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $
+- Tanh activation function: $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $
-  In fact, tanh function is just a rescaled version of sigmoid function. It is obtained by magnifying the value of sigmoid function and moving it downwards by 1.
+  In fact, the tanh function is just a rescaled version of the sigmoid function. It is obtained by scaling both the input and the output of the sigmoid function by 2 and then moving the result downwards by 1: $ tanh(x) = 2 \cdot sigmoid(2x) - 1 $.
- ReLU activation function： $ f(x) = max(0, x) $
+- ReLU activation function: $ f(x) = max(0, x) $
-For more information, please refer to [Activation functions in Wikipedia](https://en.wikipedia.org/wiki/Activation_function).
+For more information, please refer to [Activation functions on Wikipedia](https://en.wikipedia.org/wiki/Activation_function).
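The three activation functions listed above can be written in a few lines of standard-library Python. The sketch also checks numerically that tanh is a rescaled sigmoid, using the identity $tanh(x) = 2 \cdot sigmoid(2x) - 1$; the function names are chosen for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def tanh_from_sigmoid(x):
    """tanh expressed as a scaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # The last two tanh columns should agree up to rounding error.
    print(x, sigmoid(x), relu(x), math.tanh(x), tanh_from_sigmoid(x))
```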
 ## Data Preparation
-### Data and Download
+### Data Download
-Execute the following command to download [MNIST](http://yann.lecun.com/exdb/mnist/) dataset and unzip, then put paths of training set and test set to train.list and test.list respectively for PaddlePaddle to read.
+Execute the following command to download and unzip the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. Add the paths of the training set and the test set to train.list and test.list respectively for PaddlePaddle to read.
 ```bash
 ./data/get_mnist_data.sh
@@ -154,9 +155,9 @@ Users can randomly generate 10 images with the following script (Refer to Fig. 1
 ./load_data.py
 ```
-### Provide Data for PaddlePaddle
+### Provide Data to PaddlePaddle
-We use python interface to provide data to system. `mnist_provider.py` shows a complete example for MNIST data.
+We use the Python interface to provide data to the system. `mnist_provider.py` shows a complete example for training on MNIST data.
 ```python
 # Define a py data provider
@@ -192,7 +193,7 @@ def process(settings, filename):  # settings is not used currently.
 ### Data Definition
-In model configuration, use `define_py_data_sources2` to define reading of data from `dataprovider`. If this configuration is used for prediction, data definition is not necessary.
+In the model configuration, use `define_py_data_sources2` to define reading of data from `dataprovider`. If this configuration is used for prediction, data definition is not necessary.
 ```python
 if not is_predict:
@@ -209,7 +210,7 @@ In model configuration, use `define_py_data_sources2` to define reading of data
 Set training related parameters.
 - batch_size: use 128 samples in each training step.
- learning_rate: rating of iteration, related to the rate of convergence.
+- learning_rate: the step size taken in each iteration; it determines how fast the model converges.
 - learning_method: use optimizer `MomentumOptimizer` for training. The parameter 0.9 indicates momentum keeps 0.9 of previous speed.
 - regularization: A method to prevent overfitting. Here L2 regularization is used.
@@ -225,7 +226,7 @@ settings(
 #### Overview
-First get data by `data_layer`, and get classification result by classifier. Here we provided three different classifiers. In training, we compute loss function, which is usually cross entropy for classification problem. In prediction, we can directly output results.
+First, get the reference labels from `data_layer` and the classification results (predictions) from the classifier. Here we provide three different classifiers. In training, we compute the loss function, which is usually cross entropy for classification problems. In prediction, we can directly output the results (predictions).
 ``` python
 data_size = 1 * 28 * 28
@@ -256,7 +257,7 @@ def softmax_regression(img):
 #### MultiLayer Perceptron
-The following code implements a Multilayer Perceptron with two fully connected hidden layers and ReLU activation function. Output layer has Softmax activation function.
+The following code implements a Multilayer Perceptron with two fully connected hidden layers and a ReLU activation function. The output layer has a Softmax activation function.
 ```python
 def multilayer_perceptron(img):
@@ -271,7 +272,7 @@ def multilayer_perceptron(img):
 #### Convolutional Neural Network LeNet-5
-The following is the LeNet-5 network architecture. 2D input image is first fed into two sets of convolutional layer and pooling layer, and it is fed into fully connected layer, and another fully connected layer with softmax activation.
+The following is the LeNet-5 network architecture. A 2D input image is first fed into two sets of convolutional and pooling layers. The result is then fed to a fully connected layer, followed by another fully connected layer with softmax activation.
 ```python
 def convolutional_neural_network(img):
@@ -369,7 +370,7 @@ Best pass is 00013, testing Avgcost is 0.484447
 The classification accuracy is 90.01%
 ```
-From the evaluation results, the best pass for softmax regression model is pass-00013, where classification accuracy is 90.01%, and the last pass-00099 has accuracy of 89.3%. From Fig. 7, we also see that the best accuracy may not appear in the last pass. An explanation is that during training, the model may already arrive at local optimum, and it just swings around nearby in the following passes, or it gets lower local optimum.
+From the evaluation results, the best pass for the softmax regression model is pass-00013, where the classification accuracy is 90.01%, and the last pass-00099 has an accuracy of 89.3%. From Fig. 7, we also see that the best accuracy may not appear in the last pass. This is because during training, the model may have already reached a local optimum and then just oscillates around it in the following passes, or it settles into a worse local optimum.
 ### Results of Multilayer Perceptron
@@ -389,7 +390,7 @@ Best pass is 00085, testing Avgcost is 0.164746
 The classification accuracy is 94.95%
 ```
-From the evaluation results, the final training accuracy is 94.95%. It has significant improvement comparing with softmax regression model. The reason is that softmax regression is simple, and it cannot fit complex data, but Multilayer Perceptron with hidden layers has stronger fitting capacity.
+From the evaluation results, the best classification accuracy is 94.95%. It is significantly better than that of the softmax regression model. This is because softmax regression is simple and cannot fit complex data, while the Multilayer Perceptron with hidden layers has a better capacity to fit complex data.
 ### Training results for Convolutional Neural Network
@@ -409,7 +410,7 @@ Best pass is 00076, testing Avgcost is 0.0244684
 The classification accuracy is 99.20%
 ```
-From the evaluation result, the best accuracy of Convolutional Neural Network is 99.20%. This means, for image problem, Convolutional Neural Network has better recognition effect than fully connected network. This should be related to the local connection and parameter sharing of convolutional layers. Also, in Fig. 9, Convolutional Neural Network achieves good effect in early steps, which indicates that it is fast to converge.
+From the evaluation result, the best accuracy of Convolutional Neural Network is 99.20%. So for image classification, a Convolutional Neural Network has better recognition results than a fully connected network. This is related to the local connection and parameter sharing of convolutional layers. In Fig. 9, the Convolutional Neural Network achieves good results in early steps, which indicates that it converges faster.
 ## Application Model
@@ -424,7 +425,7 @@ python predict.py -c mnist_model.py -d data/raw_data/ -m softmax_mnist_model/pas
 - -d sets data for prediction
 - -m sets model parameters, here the best trained model is used for prediction
-Follow to instruction to input image ID for prediction. The classifier can output probabilities for each digit, predicted results with the highest probability, and ground truth label.
+Follow the instructions to input an image ID for prediction. The classifier outputs the probability of each digit, the predicted digit with the highest probability, and the ground truth label.
 ```
 Input image_id [0~9999]: 3
@@ -436,10 +437,10 @@ Predict Number: 0
 Actual Number: 0
 ```
-From the result, this classifier recognizes the digit on the third image as digit 0 with near to 100% probability, and the ground truth is actually consistent.
+From the result, this classifier recognizes the digit in the image with ID 3 as digit 0 with a probability close to 100%. This prediction is consistent with the ground truth label.
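The step from per-digit probabilities to the predicted number is an argmax. The sketch below uses hypothetical probabilities, not actual output of `predict.py`, and the function name `predict_digit` is made up for illustration.

```python
def predict_digit(probabilities):
    """Return the index (digit) with the highest predicted probability."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


# Hypothetical probabilities for digits 0..9 (assumption, not real model output).
probs = [0.97, 0.0, 0.01, 0.0, 0.0, 0.01, 0.01, 0.0, 0.0, 0.0]
actual = 0

predicted = predict_digit(probs)
print("Predict Number:", predicted)  # Predict Number: 0
print("Actual Number:", actual)      # Actual Number: 0
```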
 ## Conclusion
-Softmax regression, Multilayer Perceptron and Convolutional Neural Network in this tutorial are the most basic Deep Learning models. More sophisticated models in the following tutorials are derived from them. Therefore, these models are very helpful for the future learning. At the same time, we observed that when evolving from the simplest softmax regression to slightly complex Convolutional Neural Network, recognition accuracy on MNIST data set has large improvement, due to Convolutional layers' local connections and parameter sharing. When learning new models in the future, we hope readers to understand the key ideas for a new model to improve over an old one. Moreover, this tutorial introduced basic flow of PaddlePaddle model design, starting from dataprovider, model layer construction, to final training and prediction. By becoming familiar with this flow, readers can use specific data, define specific network models, and complete training and prediction for their tasks.
+This tutorial describes a few basic Deep Learning models, namely Softmax regression, the Multilayer Perceptron and the Convolutional Neural Network. The subsequent tutorials derive more sophisticated models from these, so it is important to understand them well. When our model evolved from a simple softmax regression to a slightly more complex Convolutional Neural Network, the recognition accuracy on the MNIST dataset improved substantially, from 90.01% to 99.20%. This is due to the convolutional layers' local connections and parameter sharing. When learning new models in the future, we encourage readers to understand the key ideas that allow a new model to improve on the results of an old one. Moreover, this tutorial introduced the basic flow of PaddlePaddle model design, from the dataprovider and model layer construction to final training and prediction. Readers can reuse the flow from this MNIST handwritten digit classification example and experiment with different data and network architectures to train models for classification tasks of their choice.
 ## References

--- a/recognize_digits/index.en.html
+++ b/recognize_digits/index.en.html
@@ -30,55 +30,60 @@
    padding: 45px;
 }
 </style>
 <body>
 <div id="context" class="container markdown-body">
 </div>
 <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
 <div id="markdown" style='display:none'>
 # Recognize Digits
-Source code of this tutorial is under [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/recognize_digits). For the first-time use, please refer to PaddlePaddle [installation instructions](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
+The source code for this tutorial is under [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/recognize_digits). First-time readers, please refer to PaddlePaddle [installation instructions](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
 ## Introduction
-When we learn programming, the first program is usually printing “Hello World.” In Machine Learning, or Deep Learning, this is handwritten digit recognition with [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. Handwriting recognition is a typical image classification problem. The problem is relatively easy, and MNIST is a complete dataset. As a simple Computer Vision dataset, MNIST contains images of handwritten digits and corresponding labels (Fig. 1). An image is a 28x28 matrix, and a label is to one of the 10 digits from 0 to 9. Each image is normalized in size and centered.
+When we learn a new programming language, the first task is usually to write a program that prints "Hello World." In Machine Learning or Deep Learning, the equivalent task is to train a model to perform handwritten digit recognition with the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. Handwriting recognition is a typical image classification problem. The problem is relatively easy, and MNIST is a complete dataset. As a simple Computer Vision dataset, MNIST contains images of handwritten digits and their corresponding labels (Fig. 1). The input image is a 28x28 matrix, and the label is one of the digits from 0 to 9. Each image is normalized in size and centered.
 <p align="center">
 <img src="image/mnist_example_image.png" width="400"><br/>
 Fig. 1. Examples of MNIST images
 </p>
-MNIST dataset is made from [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and Special Database 1 (SD-1). Since SD-3 is labeled by staffs in U.S. Census Bureau, while SD-1 is labeled by high school students in U.S., SD-3 is cleaner and easier to recognize than SD-1 is. Yann LeCun et al. used half of samples from each of SD-1 and SD-3 to make MNIST training set (60,000 samples) and test set (10,000 samples), where training set was labeled by 250 different annotators, and it was guaranteed that annotators of training set and test set are not completely overlapped.
+The MNIST dataset is created from the [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and Special Database 1 (SD-1). SD-3 was written by the staff of the U.S. Census Bureau, while SD-1 was written by high school students in the U.S. Therefore, SD-3 is cleaner and easier to recognize than SD-1. Yann LeCun et al. used half of the samples from each of SD-1 and SD-3 to create the MNIST training set (60,000 samples) and test set (10,000 samples). The training set contains samples from approximately 250 different writers, and it was guaranteed that the writers of the training set and the test set do not overlap.
+Yann LeCun, one of the founders of Deep Learning, contributed highly towards handwritten character recognition in early days and proposed CNN (Convolutional Neural Network), which drastically improved recognition capability for handwritten characters. CNNs are now a critical concept in Deep Learning. From Yann LeCun's first proposal of LeNet to those winning models in ImageNet, such as VGGNet, GoogLeNet, ResNet, etc. (Please refer to [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification) tutorial), CNN achieved a series of impressive results in Image Classification tasks.
-Yann LeCun, one of the founders of Deep Learning, had huge contribution on handwritten character recognition in early dates, and proposed CNN (Convolutional Neural Network), which drastically improved recognition capability for handwritten characters. CNN is now a critical key for Deep Learning. From Yann LeCun’s first proposal of LeNet, to those winning models in ImageNet, such as VGGNet, GoogLeNet, ResNet, etc. (Please refer to [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification) tutorial), CNN achieved a series of impressive results in Image Classification tasks.
+Many algorithms have been tested on MNIST. In 1998, LeCun experimented with a single-layer linear classifier, an MLP (Multilayer Perceptron) and the multilayer CNN LeNet. These algorithms progressively reduced the test error from 12% to 0.7% \[[1](#References)\]. Since then, researchers have worked on many algorithms such as k-NN (K-Nearest Neighbors) \[[2](#References)\], Support Vector Machine (SVM) \[[3](#References)\], Neural Networks \[[4-7](#References)\] and Boosting \[[8](#References)\]. Various preprocessing methods, such as distortion removal, noise removal and blurring, have also been applied to increase recognition accuracy.
-Many algorithms are tested on MNIST. In 1998, LeCun experimented single layer linear classifier, MLP (Multilayer Perceptron) and Multilayer CNN LeNet. These algorithms constantly reduced test error from 12% to 0.7% \[[1](#References)\]. Since then, researchers worked on many algorithms such as k-NN (K-Nearest Neighbors) \[[2](#References)\], Support Vector Machine (SVM) \[[3](#References)\], Neural Networks \[[4-7](#References)\] and Boosting \[[8](#References)\], and applied various preprocessing methods, such as distortion removal, noise removal and blurring, to increase recognition accuracy.
+In this tutorial, we tackle the task of handwritten character recognition. We start with a simple softmax regression model and guide our readers step-by-step to improve this model's performance on the task of recognition.
-In this tutorial, we start from simple softmax regression model, and guide readers to introduction of handwritten character recognition, and step-by-step improvement of models.
 ## Model Overview
 Before introducing classification algorithms and training procedure, we provide some definitions:
- $X$ is input: Input is $28\times28$ MNIST image. It is flattened to $784$ dimensional vector. $X=\left ( x_0, x_1, \dots, x_{783} \right )$.
+- $X$ is the input: Input is a $28\times28$ MNIST image. It is flattened to a $784$ dimensional vector. $X=\left ( x_0, x_1, \dots, x_{783} \right )$.
- $Y$ is output: Output of classifier is 10 class digits from 0 to 9. $Y=\left ( y_0, y_1, \dots, y_9 \right )$. Each dimension $y_i$ represents a probability that the image belongs to $i$.
+- $Y$ is the output: Output of the classifier is 1 of the 10 classes (digits from 0 to 9). $Y=\left ( y_0, y_1, \dots, y_9 \right )$. Each dimension $y_i$ represents the probability that the input image belongs to class $i$.
- $L$ is ground truth label: $L=\left ( l_0, l_1, \dots, l_9 \right )$ It is also 10 dimensional, but only one dimension is 1 and others are all 0.
+- $L$ is the ground truth label: $L=\left ( l_0, l_1, \dots, l_9 \right )$. It is also 10 dimensional, but only one dimension is 1 and all others are 0.
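The definitions above can be made concrete with a minimal sketch. The helper names `flatten` and `one_hot` are hypothetical and are not part of the tutorial's code; the all-zero image merely stands in for a real MNIST sample.

```python
def flatten(image):
    """Turn a 28x28 nested list into a single 784-element list (row by row)."""
    return [pixel for row in image for pixel in row]


def one_hot(digit, num_classes=10):
    """Build the ground truth vector L: 1 at the digit's index, 0 elsewhere."""
    return [1 if i == digit else 0 for i in range(num_classes)]


# A placeholder 28x28 image (all zeros) standing in for an MNIST sample.
image = [[0.0] * 28 for _ in range(28)]
X = flatten(image)  # input vector with 784 dimensions
L = one_hot(3)      # label vector for the digit 3
```

Here `X` plays the role of $X$ and `L` the role of $L$; the classifier output $Y$ has the same length as `L`.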
 ### Softmax Regression
-The simplest softmax regression model is to feed input into fully connected layers, and directly use softmax for multiclass classification \[[9](#References)\].
+In a simple softmax regression model, the input is fed to a fully connected layer and a softmax function is applied to get the probabilities of the output classes \[[9](#References)\].
-Input $X$ is multiplied with weights $W$, added by bias $b$, and activated.
+Input $X$ is multiplied with weights $W$, and bias $b$ is added to generate activations.
 $$ y_i = softmax(\sum_j W_{i,j}x_j + b_i) $$
 where $ softmax(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
-For a $N$ class classification problem with $N$ output nodes, a $N$ dimensional vector is normalized to $N$ real values in [0, 1], each representing the probability of the sample to belong to the class. Here $y_i$ is the prediction probability that an image is digit $i$.
+For an $N$ class classification problem with $N$ output nodes, an $N$ dimensional vector is normalized to $N$ real values in the range [0, 1] that sum to 1, each representing the probability that the sample belongs to the corresponding class. Here $y_i$ is the predicted probability that an image is digit $i$.
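As an illustration of the softmax formula, here is a minimal pure-Python sketch. Subtracting the maximum score before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result and is not taken from the tutorial's code.

```python
import math


def softmax(scores):
    """Normalize a list of real-valued scores into probabilities."""
    # Shifting by the maximum leaves the result unchanged but avoids overflow in exp.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three made-up class scores; the largest score gets the largest probability.
probs = softmax([2.0, 1.0, 0.1])
```

Each entry of `probs` lies in [0, 1] and the entries sum to 1, which is what lets them be read as class probabilities.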
-In classification problem, we usually use cross entropy loss function:
+In such a classification problem, we usually use the cross entropy loss function:
 $$  crossentropy(label, y) = -\sum_i label_ilog(y_i) $$
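The cross entropy formula can be sketched in a few lines of Python. The `eps` clamp is an assumption added to avoid `log(0)`; the example probabilities are made up for illustration.

```python
import math


def cross_entropy(label, y, eps=1e-12):
    """Cross entropy between a one-hot label and predicted probabilities y."""
    # Clamp each probability to at least eps so that log never receives 0.
    return -sum(l * math.log(max(p, eps)) for l, p in zip(label, y))


label = [0, 0, 1]  # the true class is index 2
confident = cross_entropy(label, [0.05, 0.05, 0.9])
unsure = cross_entropy(label, [0.4, 0.3, 0.3])
```

Because the label is one-hot, only the probability assigned to the true class contributes, so `confident` is smaller than `unsure`.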
-Fig. 2 is softmax regression network, with weights in black, and bias in red. +1 indicates bias is 1.
+Fig. 2 shows a softmax regression network, with weights in black, and bias in red. +1 indicates bias is 1.
 <p align="center">
 <img src="image/softmax_regression.png" width=400><br/>
@@ -93,11 +98,11 @@ Fig. 2. Softmax regression network architecture<br/>
 ### Multilayer Perceptron
-Softmax regression model uses the simplest two layer neural network, i.e. it only contains input layer and output layer, so that it's regression ability is limited. To achieve better recognition effect, we consider adding several hidden layers \[[10](#References)\] between the input layer and the output layer.
+The softmax regression model described above uses the simplest two-layer neural network, i.e. it only contains an input layer and an output layer, so its fitting capacity is limited. To achieve better recognition results, we consider adding several hidden layers \[[10](#References)\] between the input layer and the output layer.
-1.  After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is activation function. Some common ones are sigmoid, tanh and ReLU.
+1.  After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function. Some common ones are sigmoid, tanh and ReLU.
 2.  After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
-3.  Finally, after output layer, we get $Y=softmax(W_3H_2 + b_3)$, the last classification result vector.
+3.  Finally, after the output layer, we get $Y=softmax(W_3H_2 + b_3)$, the final classification result vector.
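The three steps above can be sketched as a forward pass in plain Python. This is only an illustration: the hidden layer sizes (128 and 64), the random weights, the random input and all helper names are assumptions, and ReLU is chosen as $\phi$.

```python
import math
import random


def dense(W, x, b):
    """Compute Wx + b for a weight matrix W (list of rows), input x and bias b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(v):
    shift = max(v)
    exps = [math.exp(x - shift) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_out, n_in, rng):
    """Small random weights and zero biases, for illustration only."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


rng = random.Random(0)
W1, b1 = random_layer(128, 784, rng)
W2, b2 = random_layer(64, 128, rng)
W3, b3 = random_layer(10, 64, rng)

X = [rng.random() for _ in range(784)]  # stand-in for a flattened image
H1 = relu(dense(W1, X, b1))             # first hidden layer
H2 = relu(dense(W2, H1, b2))            # second hidden layer
Y = softmax(dense(W3, H2, b3))          # 10 class probabilities
```

With untrained weights the probabilities in `Y` are meaningless; the sketch only shows how the shapes flow from 784 inputs to 10 outputs.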
 Fig. 3. is Multilayer Perceptron network, with weights in black, and bias in red. +1 indicates bias is 1.
@@ -121,9 +126,9 @@ Fig. 4. Convolutional layer<br/>
 卷积输出 -> convolution output<br/>
 </p>
-Convolutional layer is the core of Convolutional Neural Network. The parameters in this layer are composed of a set of filters, or kernels. In forward step, each kernel moves horizontally and vertically, and compute dot product of the kernel and the input on corresponding positions, then add bias and apply activation function. The result is two dimensional activation map. For example, some kernel may recognize corners, and some may recognize circles. These convolution kernels may respond strongly to the corresponding features.
+The Convolutional layer is the core of a Convolutional Neural Network. The parameters in this layer are composed of a set of filters or kernels. In the forward step, each kernel moves horizontally and vertically across the input. At each position, we compute the dot product of the kernel and the input at the corresponding positions, then add the bias and apply an activation function. The result is a two-dimensional activation map. For example, some kernels may recognize corners, and some may recognize circles. These convolution kernels respond strongly to the corresponding features.
-Fig. 4 is a dynamic graph of convolutional layer, where depths are not shown for simplicity. Input is $W_1=5,H_1=5,D_1=3$. In fact, this is a common representation for colored images. The width and the height of a colored image correspond to $W_1$ and $H_1$, respectively, and the 3 color channels for RGB correspond to $D_1$. The parameters of the convolutional layer are $K=2,F=3,S=2,P=1$. $K$ is the number of kernels. Here, $Filter W_0$ and $Filter   W_1$ are two kernels. $F$ is kernel size. $W0$ and $W1$ are both $3\times3$ matrix in all depths. $S$ is stride. Kernels moves leftwards or downwards by 2 units each time. $P$ is padding, an extension of the input. The gray area in the figure shows zero padding with size 1.
+Fig. 4 is a dynamic graph of a convolutional layer, where depths are not shown for simplicity. Input is $W_1=5, H_1=5, D_1=3$. In fact, this is a common representation for colored images. $W_1$ and  $H_1$ of a colored image correspond to the width and height respectively. $D_1$ corresponds to the 3 color channels for RGB. The parameters of the convolutional layer are $K=2, F=3, S=2, P=1$. $K$ is the number of kernels. Here, $Filter W_0$ and $Filter   W_1$ are two kernels. $F$ is the kernel size. $W0$ and $W1$ are both $3\times3$ matrices at every depth. $S$ is the stride. Kernels move rightwards or downwards by 2 units each time. $P$ is padding, an extension of the input. The gray area in the figure shows zero padding with size 1.
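To see how these parameters determine the output size, the sketch below applies the standard formula $(W - F + 2P)/S + 1$ to the numbers in Fig. 4. The function name `conv_output_size` is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Number of kernel positions along one spatial dimension."""
    span = input_size - kernel_size + 2 * padding
    if span % stride != 0:
        raise ValueError("kernel does not tile the padded input evenly")
    return span // stride + 1


# Parameters from Fig. 4: 5x5 input, K=2 kernels, F=3, S=2, P=1.
out_w = conv_output_size(5, 3, 2, 1)
out_h = conv_output_size(5, 3, 2, 1)
output_shape = (out_w, out_h, 2)  # one activation map per kernel
```

For these values the layer produces a 3x3 activation map per kernel, so the output volume is 3x3x2.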
 #### Pooling Layer
@@ -133,7 +138,7 @@ Fig. 5 Pooling layer<br/>
 输入数据 -> input data<br/>
 </p>
-Pooling layer performs downsampling. The main functionality is to reduce computation by reducing network parameters. It also prevents overfitting to some extent. Usually, a pooling layer is added after a convolutional layer. Pooling layer includes max pooling, average pooling, etc. Max pooling uses rectangles to segment input layer into several parts, and compute maximum value in each part as output (Fig. 5.)
+A Pooling layer performs downsampling. The main functionality of this layer is to reduce computation by reducing the network parameters. It also prevents overfitting to some extent. Usually, a pooling layer is added after a convolutional layer. Pooling layers come in various types, such as max pooling and average pooling. Max pooling uses rectangles to segment the input layer into several parts and computes the maximum value in each part as the output (Fig. 5).
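Max pooling can be sketched in plain Python as follows. The 2x2 window, the stride of 2 and the sample values are illustrative assumptions, not values taken from Fig. 5.

```python
def max_pool_2d(image, size=2, stride=2):
    """Slide a size x size window over a 2D list and keep the maximum of each window."""
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    return [
        [
            max(
                image[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


image = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool_2d(image)  # [[6, 8], [3, 4]]
```

The 4x4 input shrinks to 2x2, which is the downsampling effect described above.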
 #### LeNet-5 Network 
@@ -147,30 +152,30 @@ Fig. 6. LeNet-5 Convolutional Neural Network architecture<br/>
 输出层(全连接+Softmax激活) -> output layer (fully connected + softmax activation)<br/>
 </p>
-[LeNet-5](http://yann.lecun.com/exdb/lenet/) is one of the simplest Convolutional Neural Networks. Fig. 6. shows its architecture: 2 dimensional image input is fed into two sets of convolutional layer and pooling layer, then it is fed into fully connected layer and softmax classifier. The following three properties of convolution enable LeNet-5 to better recognize images than Multilayer fully-connected perceptrons:
+[LeNet-5](http://yann.lecun.com/exdb/lenet/) is one of the simplest Convolutional Neural Networks. Fig. 6. shows its architecture: a 2-dimensional input image is fed into two sets of convolutional layers and pooling layers, and the output is then fed to a fully connected layer and a softmax classifier. The following three properties of convolution enable LeNet-5 to recognize images better than multilayer fully connected perceptrons:
- 3D properties of neurons: a convolutional layer is organized by width, height and depth. Neurons in each layer are connected to only a small region in previous layer. This region is called receptive field.
+- 3D properties of neurons: a convolutional layer is organized by width, height and depth. Neurons in each layer are connected to only a small region in the previous layer. This region is called the receptive field.
- Local connection: CNN utilizes local space correlation by connecting local neurons. This design guarantees learned filter has strong response to local input features. Stacking many such layers leads non-linear filter becomes more and more global. This allows the network to first obtain good representation for a small parts of input, then combine them to represent larger region.
+- Local connection: A CNN utilizes local spatial correlation by connecting local neurons. This design ensures that the learned filters respond strongly to local input features. Stacking many such layers generates non-linear filters that are increasingly global. This enables the network to first obtain good representations of small parts of the input and then combine them to represent a larger region.
- Sharing weights: In CNN, computation is iterated with shared parameters (weights and bias) to form a feature map. This means all neurons in the same depth of output respond to the same feature. This allows detecting a feature regardless of its position in the input, and enables a property of translation equivariance.
+- Sharing weights: In a CNN, computation is iterated on shared parameters (weights and bias) to form a feature map. This means all neurons in the same depth of the output respond to the same feature. This allows detecting a feature regardless of its position in the input and enables translation equivariance.
-For more details of Convolutional Neural Network, please refer to [Stanford open course]( http://cs231n.github.io/convolutional-networks/ ) and [Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) tutorial.
+For more details on Convolutional Neural Networks, please refer to [this Stanford open course]( http://cs231n.github.io/convolutional-networks/ ) and [this Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) tutorial.
 ### List of Common Activation Functions  
- Sigmoid activation function： $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $
+- Sigmoid activation function: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $
- Tanh activation function： $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $
+- Tanh activation function: $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $
-  In fact, tanh function is just a rescaled version of sigmoid function. It is obtained by magnifying the value of sigmoid function and moving it downwards by 1.
+  In fact, the tanh function is just a rescaled version of the sigmoid function: $ tanh(x) = 2 \cdot sigmoid(2x) - 1 $. It is obtained by doubling the value of the sigmoid function (evaluated at $2x$) and moving it downwards by 1.
- ReLU activation function： $ f(x) = max(0, x) $
+- ReLU activation function: $ f(x) = max(0, x) $
-For more information, please refer to [Activation functions in Wikipedia](https://en.wikipedia.org/wiki/Activation_function).
+For more information, please refer to [Activation functions on Wikipedia](https://en.wikipedia.org/wiki/Activation_function).
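The three activation functions listed above can be written directly in Python. The helper `tanh_from_sigmoid` is a hypothetical name used only to check the identity $tanh(x) = 2 \cdot sigmoid(2x) - 1$ numerically.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def tanh_from_sigmoid(x):
    """tanh expressed as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


# Largest gap between the rescaled sigmoid and math.tanh on a few sample points.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
max_gap = max(abs(tanh_from_sigmoid(x) - math.tanh(x)) for x in samples)
```

`max_gap` stays at floating-point rounding level, which confirms that tanh and sigmoid differ only by scaling and shifting.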
 ## Data Preparation
-### Data and Download
+### Data Download
-Execute the following command to download [MNIST](http://yann.lecun.com/exdb/mnist/) dataset and unzip, then put paths of training set and test set to train.list and test.list respectively for PaddlePaddle to read.
+Execute the following command to download the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset and unzip it. Add the paths of the training set and the test set to train.list and test.list respectively for PaddlePaddle to read.
 ```bash
 ./data/get_mnist_data.sh
@@ -191,9 +196,9 @@ Users can randomly generate 10 images with the following script (Refer to Fig. 1
 ./load_data.py
 ```
-### Provide Data for PaddlePaddle
+### Provide Data to PaddlePaddle
-We use python interface to provide data to system. `mnist_provider.py` shows a complete example for MNIST data.
+We use the Python interface to provide data to the system. `mnist_provider.py` shows a complete example for training on MNIST data.
 ```python
 # Define a py data provider
@@ -229,7 +234,7 @@ def process(settings, filename):  # settings is not used currently.
 ### Data Definition
-In model configuration, use `define_py_data_sources2` to define reading of data from `dataprovider`. If this configuration is used for prediction, data definition is not necessary.
+In the model configuration, use `define_py_data_sources2` to define reading of data from `dataprovider`. If this configuration is used for prediction, data definition is not necessary.
 ```python
 if not is_predict:
@@ -246,7 +251,7 @@ In model configuration, use `define_py_data_sources2` to define reading of data
 Set training related parameters.
 - batch_size: use 128 samples in each training step.
- learning_rate: rating of iteration, related to the rate of convergence.
+- learning_rate: the size of the step taken in each iteration. It determines how fast the model converges.
 - learning_method: use optimizer `MomentumOptimizer` for training. The parameter 0.9 indicates momentum keeps 0.9 of previous speed.
 - regularization: A method to prevent overfitting. Here L2 regularization is used.
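To make the roles of these settings concrete, here is a sketch of one classical momentum update with L2 regularization, applied to the toy objective $f(w) = w^2$. The learning rate, the L2 coefficient, the toy objective and the function name `momentum_step` are all assumptions, and PaddlePaddle's exact update rule may differ from this textbook form.

```python
def momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9, l2=0.0005):
    """One textbook momentum update for a single scalar parameter."""
    # L2 regularization adds a term proportional to the weight to the gradient.
    grad = grad + l2 * w
    # Keep 0.9 of the previous velocity, then step against the gradient.
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity


# Minimize f(w) = w**2, whose gradient is 2 * w, starting from w = 5.
w, velocity = 5.0, 0.0
for _ in range(200):
    w, velocity = momentum_step(w, velocity, grad=2.0 * w)
```

After the loop `w` is close to the minimum at 0. A larger learning rate takes bigger steps, while the momentum term carries part of the previous step forward.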
@@ -262,7 +267,7 @@ settings(
 #### Overview
-First get data by `data_layer`, and get classification result by classifier. Here we provided three different classifiers. In training, we compute loss function, which is usually cross entropy for classification problem. In prediction, we can directly output results.
+First get the input data from `data_layer`, and get classification results (predictions) from the classifier. Here we provide three different classifiers. In training, we compute the loss function, which is usually cross entropy for a classification problem. In prediction, we can directly output the results (predictions).
 ``` python
 data_size = 1 * 28 * 28
@@ -293,7 +298,7 @@ def softmax_regression(img):
 #### MultiLayer Perceptron
-The following code implements a Multilayer Perceptron with two fully connected hidden layers and ReLU activation function. Output layer has Softmax activation function.
+The following code implements a Multilayer Perceptron with two fully connected hidden layers and a ReLU activation function. The output layer has a Softmax activation function.
 ```python
 def multilayer_perceptron(img):
@@ -308,7 +313,7 @@ def multilayer_perceptron(img):
 #### Convolutional Neural Network LeNet-5
-The following is the LeNet-5 network architecture. 2D input image is first fed into two sets of convolutional layer and pooling layer, and it is fed into fully connected layer, and another fully connected layer with softmax activation.
+The following is the LeNet-5 network architecture. A 2D input image is first fed into two sets of convolutional layers and pooling layers. The result is then fed to a fully connected layer, followed by another fully connected layer with a softmax activation.
 ```python
 def convolutional_neural_network(img):
@@ -406,7 +411,7 @@ Best pass is 00013, testing Avgcost is 0.484447
 The classification accuracy is 90.01%
 ```
-From the evaluation results, the best pass for softmax regression model is pass-00013, where classification accuracy is 90.01%, and the last pass-00099 has accuracy of 89.3%. From Fig. 7, we also see that the best accuracy may not appear in the last pass. An explanation is that during training, the model may already arrive at local optimum, and it just swings around nearby in the following passes, or it gets lower local optimum.
+From the evaluation results, the best pass for the softmax regression model is pass-00013, where the classification accuracy is 90.01%, and the last pass-00099 has an accuracy of 89.3%. From Fig. 7, we also see that the best accuracy may not appear in the last pass. A possible explanation is that during training the model may already have reached a local optimum, after which it either oscillates around it in the following passes or settles into a worse local optimum.
 ### Results of Multilayer Perceptron
@@ -426,7 +431,7 @@ Best pass is 00085, testing Avgcost is 0.164746
 The classification accuracy is 94.95%
 ```
-From the evaluation results, the final training accuracy is 94.95%. It has significant improvement comparing with softmax regression model. The reason is that softmax regression is simple, and it cannot fit complex data, but Multilayer Perceptron with hidden layers has stronger fitting capacity.
+From the evaluation results, the best classification accuracy is 94.95%. It is significantly better than that of the softmax regression model. This is because softmax regression is simple and cannot fit complex data, while the Multilayer Perceptron with hidden layers has a greater capacity to fit complex data.
 ### Training results for Convolutional Neural Network
@@ -446,7 +451,7 @@ Best pass is 00076, testing Avgcost is 0.0244684
 The classification accuracy is 99.20%
 ```
-From the evaluation result, the best accuracy of Convolutional Neural Network is 99.20%. This means, for image problem, Convolutional Neural Network has better recognition effect than fully connected network. This should be related to the local connection and parameter sharing of convolutional layers. Also, in Fig. 9, Convolutional Neural Network achieves good effect in early steps, which indicates that it is fast to converge.
+From the evaluation result, the best accuracy of the Convolutional Neural Network is 99.20%. So for image classification, a Convolutional Neural Network achieves better recognition results than a fully connected network. This is related to the local connection and parameter sharing of convolutional layers. In Fig. 9, the Convolutional Neural Network achieves good results in early steps, which indicates that it converges quickly.
 ## Application Model
@@ -461,7 +466,7 @@ python predict.py -c mnist_model.py -d data/raw_data/ -m softmax_mnist_model/pas
 - -d sets data for prediction
 - -m sets model parameters, here the best trained model is used for prediction
-Follow to instruction to input image ID for prediction. The classifier can output probabilities for each digit, predicted results with the highest probability, and ground truth label.
+Follow the instructions to enter an image ID for prediction. The classifier outputs the probability for each digit, the prediction with the highest probability, and the ground truth label.
 ```
 Input image_id [0~9999]: 3
@@ -473,10 +478,10 @@ Predict Number: 0
 Actual Number: 0
 ```
-From the result, this classifier recognizes the digit on the third image as digit 0 with near to 100% probability, and the ground truth is actually consistent.
+From the result, this classifier recognizes the digit in the image with ID 3 as digit 0 with close to 100% probability. This prediction is consistent with the ground truth label.
 ## Conclusion
-Softmax regression, Multilayer Perceptron and Convolutional Neural Network in this tutorial are the most basic Deep Learning models. More sophisticated models in the following tutorials are derived from them. Therefore, these models are very helpful for the future learning. At the same time, we observed that when evolving from the simplest softmax regression to slightly complex Convolutional Neural Network, recognition accuracy on MNIST data set has large improvement, due to Convolutional layers' local connections and parameter sharing. When learning new models in the future, we hope readers to understand the key ideas for a new model to improve over an old one. Moreover, this tutorial introduced basic flow of PaddlePaddle model design, starting from dataprovider, model layer construction, to final training and prediction. By becoming familiar with this flow, readers can use specific data, define specific network models, and complete training and prediction for their tasks.
+This tutorial describes a few basic Deep Learning models, namely Softmax regression, the Multilayer Perceptron and the Convolutional Neural Network. The subsequent tutorials will derive more sophisticated models from these, so it is crucial to understand them for future learning. When our model evolved from a simple softmax regression to a slightly more complex Convolutional Neural Network, the recognition accuracy on the MNIST dataset improved considerably. This is due to the convolutional layers' local connections and parameter sharing. While learning new models in the future, we encourage readers to understand the key ideas that lead a new model to improve on the results of an old one. Moreover, this tutorial introduced the basic flow of PaddlePaddle model design, starting with a dataprovider, then model layer construction, and finally training and prediction. Readers can leverage the flow used in this MNIST handwritten digit classification example and experiment with different data and network architectures to train models for classification tasks of their choice.
 ## References
@@ -495,6 +500,7 @@ Softmax regression, Multilayer Perceptron and Convolutional Neural Network in th
 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This book</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and uses <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Shared knowledge signature - non commercial use-Sharing 4.0 International Licensing Protocal</a>.
 </div>
 <!-- You can change the lines below now. -->
 <script type="text/javascript">
 marked.setOptions({
  renderer: new marked.Renderer(),