diff --git a/.gitignore b/.gitignore index fab52d7877497804092cf72d52bd64dc2f4f1747..e7f8501f2c04d0ddb9a27202b3e91d33c47d9de8 100644 --- a/.gitignore +++ b/.gitignore @@ -6,4 +6,3 @@ pandoc.template py_env* *.ipynb build -Dockerfile diff --git a/.tools/build_docker.sh b/.tools/build_docker.sh index 60474b1c6c56d7f84c8e5ee19132d93dbf19df9c..242f8de16639f6c333d99f2ff4c7e24f86788c98 100755 --- a/.tools/build_docker.sh +++ b/.tools/build_docker.sh @@ -64,7 +64,7 @@ RUN pip install -U nltk \ RUN ${update_mirror_cmd} apt-get update && \ apt-get install -y locales patch && \ - apt-get -y install gcc curl git && \ + apt-get -y install gcc curl git vim && \ apt-get -y clean && \ localedef -f UTF-8 -i en_US en_US.UTF-8 && \ pip install --upgrade pip && \ diff --git a/01.fit_a_line/README.cn.md b/01.fit_a_line/README.cn.md index 30e78a42ddb5c36072485c5dbecd63433dc91065..5aa1b1d5bc524a6d570130471d17733c894f0a0e 100644 --- a/01.fit_a_line/README.cn.md +++ b/01.fit_a_line/README.cn.md @@ -126,8 +126,18 @@ y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear()) y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1)) -cost = paddle.layer.mse_cost(input=y_predict, label=y) +cost = paddle.layer.square_error_cost(input=y_predict, label=y) ``` + +### 保存网络拓扑 + +```python +# Save the inference topology to protobuf. +inference_topology = paddle.topology.Topology(layers=y_predict) +with open("inference_topology.pkl", 'wb') as f: + inference_topology.serialize_for_inference(f) +``` + ### 创建参数 ```python diff --git a/01.fit_a_line/README.md b/01.fit_a_line/README.md index ce4b5334402c52548791fab636593b613cb74096..363f9d06bd37d14d9865332e540396ce9640600d 100644 --- a/01.fit_a_line/README.md +++ b/01.fit_a_line/README.md @@ -4,9 +4,9 @@ Let us begin the tutorial with a classical problem called Linear Regression \[[1 The source code for this tutorial lives on [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). 
For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). ## Problem Setup -Suppose we have a dataset of $n$ real estate properties. These real estate properties will be referred to as *homes* in this chapter for clarity. +Suppose we have a dataset of $n$ real estate properties. Each real estate property will be referred to as a **home** in this chapter for clarity. -Each home is associated with $d$ attributes. The attributes describe characteristics such the number of rooms in the home, the number of schools or hospitals in the neighborhood, and the traffic condition nearby. +Each home is associated with $d$ attributes. The attributes describe characteristics such as the number of rooms in the home, the number of schools or hospitals in the neighborhood, and the traffic condition nearby. In our problem setup, the attribute $x_{i,j}$ denotes the $j$th characteristic of the $i$th home. In addition, $y_i$ denotes the price of the $i$th home. Our task is to predict $y_i$ given a set of attributes $\{x_{i,1}, ..., x_{i,d}\}$. We assume that the price of a home is a linear combination of all of its attributes, namely, @@ -15,7 +15,7 @@ $$y_i = \omega_1x_{i,1} + \omega_2x_{i,2} + \ldots + \omega_dx_{i,d} + b, i=1,\ where $\vec{\omega}$ and $b$ are the model parameters we want to estimate. Once they are learned, we will be able to predict the price of a home, given the attributes associated with it. We call this model **Linear Regression**. In other words, we want to regress a value against several values linearly. In practice, a linear model is often too simplistic to capture the real relationships between the variables. Yet, because Linear Regression is easy to train and analyze, it has been applied to a large number of real problems.
As a result, it is an important topic in many classic Statistical Learning and Machine Learning textbooks \[[2,3,4](#References)\]. ## Results Demonstration -We first show the result of our model. The dataset [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) is used to train a linear model to predict the home prices in Boston. The figure below shows the predictions the model makes for some home prices. The $X$-axis represents the median value of the prices of simlilar homes within a bin, while the $Y$-axis represents the home value our linear model predicts. The dotted line represents points where $X=Y$. When reading the diagram, the more precise the model predicts, the closer the point is to the dotted line. +We first show the result of our model. The dataset [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) is used to train a linear model to predict the home prices in Boston. The figure below shows the predictions the model makes for some home prices. The $X$-axis represents the median value of the prices of similar homes within a bin, while the $Y$-axis represents the home value our linear model predicts. The dotted line represents points where $X=Y$. When reading the diagram, the closer the point is to the dotted line, the better the model's prediction.


Figure 1. Predicted Value V.S. Actual Value @@ -45,7 +45,7 @@ After setting up our model, there are several major steps to go through to train 1. Initialize the parameters including the weights $\vec{\omega}$ and the bias $b$. For example, we can set their mean values as $0$s, and their standard deviations as $1$s. 2. Feedforward. Evaluate the network output and compute the corresponding loss. 3. [Backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors will be propagated from the output layer back to the input layer, during which the model parameters will be updated with the corresponding errors. -4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of repeats is reached. +4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of epochs is reached. ## Dataset @@ -60,8 +60,8 @@ import paddle.v2.dataset.uci_housing as uci_housing We encapsulated the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) in our Python module `uci_housing`. This module can -1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if not yet, and -2. [preprocesses](#preprocessing) the dataset. +1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if you haven't yet, and +2. [preprocess](#preprocessing) the dataset. ### An Introduction of the Dataset @@ -93,7 +93,7 @@ We define a feature vector of length 13 for each home, where each entry correspo Note that although a discrete value is also written as numeric values such as 0, 1, or 2, its meaning differs from a continuous value drastically. The linear difference between two discrete values has no meaning. For example, suppose $0$, $1$, and $2$ are used to represent colors *Red*, *Green*, and *Blue* respectively. Judging from the numeric representation of these colors, *Red* differs more from *Blue* than it does from *Green*. 
Yet in actuality, it is not true that extent to which the color *Blue* is different from *Red* is greater than the extent to which *Green* is different from *Red*. Therefore, when handling a discrete feature that has $d$ possible values, we usually convert it to $d$ new features where each feature takes a binary value, $0$ or $1$, indicating whether the original value is absent or present. Alternatively, the discrete features can be mapped onto a continuous multi-dimensional vector through an embedding table. For our problem here, because CHAS itself is a binary discrete value, we do not need to do any preprocessing. #### Feature Normalization -We also observe a huge difference among the value ranges of the 13 features (Figure 2). For instance, the values of feature *B* fall in $[0.32, 396.90]$, whereas those of feature *NOX* has a range of $[0.3850, 0.8170]$. An effective optimization would require data normalization. The goal of data normalization is to scale te values of each feature into roughly the same range, perhaps $[-0.5, 0.5]$. Here, we adopt a popular normalization technique where we substract the mean value from the feature value and divide the result by the width of the original range. +We also observe a huge difference among the value ranges of the 13 features (Figure 2). For instance, the values of feature *B* fall in $[0.32, 396.90]$, whereas those of feature *NOX* have a range of $[0.3850, 0.8170]$. An effective optimization would require data normalization. The goal of data normalization is to scale the values of each feature into roughly the same range, perhaps $[-0.5, 0.5]$. Here, we adopt a popular normalization technique where we subtract the mean value from the feature value and divide the result by the width of the original range.
There are at least three reasons for [Feature Normalization](https://en.wikipedia.org/wiki/Feature_scaling) (Feature Scaling): - A value range that is too large or too small might cause floating number overflow or underflow during computation. @@ -106,7 +106,7 @@ There are at least three reasons for [Feature Normalization](https://en.wikipedi

#### Prepare Training and Test Sets -We split the dataset in two, one for adjusting the model parameters, namely, for model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal in training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$. +We split the dataset in two, one for adjusting the model parameters, namely, for training the model, and the other for testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal in training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$. When training complex models, we usually have one more split: the validation set. Complex models usually have [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process, such as the number of layers in the network. 
Because hyperparameters are not part of the model parameters, they cannot be trained using the same loss function. Thus we will try several sets of hyperparameters to train several models and cross-validate them on the validation set to pick the best one; finally, the selected trained model is tested on the test set. Because our model is relatively simple, we will omit this validation process. @@ -124,7 +124,7 @@ paddle.init(use_gpu=False, trainer_count=1) ### Model Configuration -Logistic regression is essentially a fully-connected layer with linear activation: +Linear regression is essentially a fully-connected layer with linear activation: ```python x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13)) @@ -132,8 +132,19 @@ y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear()) y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1)) -cost = paddle.layer.mse_cost(input=y_predict, label=y) +cost = paddle.layer.square_error_cost(input=y_predict, label=y) ``` + +### Save Topology + +```python +# Save the inference topology to protobuf. +inference_topology = paddle.topology.Topology(layers=y_predict) +with open("inference_topology.pkl", 'wb') as f: + inference_topology.serialize_for_inference(f) +``` + + ### Create Parameters ```python @@ -154,7 +165,7 @@ trainer = paddle.trainer.SGD(cost=cost, PaddlePaddle provides the [reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader) -for loadinng training data. A reader may return multiple columns, and we need a Python dictionary to specify the mapping from column index to data layers. +for loading the training data. A reader may return multiple columns, and we need a Python dictionary to specify the mapping from column index to data layers. 
```python feeding={'x': 0, 'y': 1} @@ -179,7 +190,7 @@ def event_handler(event): ``` ```python -# event_handler to print training and testing info +# event_handler to plot training and testing info from paddle.v2.plot import Ploter train_title = "Train cost" diff --git a/01.fit_a_line/index.cn.html b/01.fit_a_line/index.cn.html index 933e5a4d8bfb53cb50a9675ea907d939203742da..c69a8dc1006cad4e2b051cc25fd2fed8a4e25706 100644 --- a/01.fit_a_line/index.cn.html +++ b/01.fit_a_line/index.cn.html @@ -168,8 +168,18 @@ y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear()) y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1)) -cost = paddle.layer.mse_cost(input=y_predict, label=y) +cost = paddle.layer.square_error_cost(input=y_predict, label=y) ``` + +### 保存网络拓扑 + +```python +# Save the inference topology to protobuf. +inference_topology = paddle.topology.Topology(layers=y_predict) +with open("inference_topology.pkl", 'wb') as f: + inference_topology.serialize_for_inference(f) +``` + ### 创建参数 ```python diff --git a/01.fit_a_line/index.html b/01.fit_a_line/index.html index 22afb004ad3701c4f7d6b00bb8d638ff029c7edc..28f72cace59bbfa80bebf965527ed44e3853f47d 100644 --- a/01.fit_a_line/index.html +++ b/01.fit_a_line/index.html @@ -46,9 +46,9 @@ Let us begin the tutorial with a classical problem called Linear Regression \[[1 The source code for this tutorial lives on [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). ## Problem Setup -Suppose we have a dataset of $n$ real estate properties. These real estate properties will be referred to as *homes* in this chapter for clarity. +Suppose we have a dataset of $n$ real estate properties. Each real estate property will be referred to as a **home** in this chapter for clarity.
-Each home is associated with $d$ attributes. The attributes describe characteristics such the number of rooms in the home, the number of schools or hospitals in the neighborhood, and the traffic condition nearby. +Each home is associated with $d$ attributes. The attributes describe characteristics such as the number of rooms in the home, the number of schools or hospitals in the neighborhood, and the traffic condition nearby. In our problem setup, the attribute $x_{i,j}$ denotes the $j$th characteristic of the $i$th home. In addition, $y_i$ denotes the price of the $i$th home. Our task is to predict $y_i$ given a set of attributes $\{x_{i,1}, ..., x_{i,d}\}$. We assume that the price of a home is a linear combination of all of its attributes, namely, @@ -57,7 +57,7 @@ $$y_i = \omega_1x_{i,1} + \omega_2x_{i,2} + \ldots + \omega_dx_{i,d} + b, i=1,\ where $\vec{\omega}$ and $b$ are the model parameters we want to estimate. Once they are learned, we will be able to predict the price of a home, given the attributes associated with it. We call this model **Linear Regression**. In other words, we want to regress a value against several values linearly. In practice, a linear model is often too simplistic to capture the real relationships between the variables. Yet, because Linear Regression is easy to train and analyze, it has been applied to a large number of real problems. As a result, it is an important topic in many classic Statistical Learning and Machine Learning textbooks \[[2,3,4](#References)\]. ## Results Demonstration -We first show the result of our model. The dataset [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) is used to train a linear model to predict the home prices in Boston. The figure below shows the predictions the model makes for some home prices. The $X$-axis represents the median value of the prices of simlilar homes within a bin, while the $Y$-axis represents the home value our linear model predicts. 
The dotted line represents points where $X=Y$. When reading the diagram, the more precise the model predicts, the closer the point is to the dotted line. +We first show the result of our model. The dataset [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) is used to train a linear model to predict the home prices in Boston. The figure below shows the predictions the model makes for some home prices. The $X$-axis represents the median value of the prices of similar homes within a bin, while the $Y$-axis represents the home value our linear model predicts. The dotted line represents points where $X=Y$. When reading the diagram, the closer the point is to the dotted line, the better the model's prediction.


Figure 1. Predicted Value V.S. Actual Value @@ -87,7 +87,7 @@ After setting up our model, there are several major steps to go through to train 1. Initialize the parameters including the weights $\vec{\omega}$ and the bias $b$. For example, we can set their mean values as $0$s, and their standard deviations as $1$s. 2. Feedforward. Evaluate the network output and compute the corresponding loss. 3. [Backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors will be propagated from the output layer back to the input layer, during which the model parameters will be updated with the corresponding errors. -4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of repeats is reached. +4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of epochs is reached. ## Dataset @@ -102,8 +102,8 @@ import paddle.v2.dataset.uci_housing as uci_housing We encapsulated the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) in our Python module `uci_housing`. This module can -1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if not yet, and -2. [preprocesses](#preprocessing) the dataset. +1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if you haven't yet, and +2. [preprocess](#preprocessing) the dataset. ### An Introduction of the Dataset @@ -135,7 +135,7 @@ We define a feature vector of length 13 for each home, where each entry correspo Note that although a discrete value is also written as numeric values such as 0, 1, or 2, its meaning differs from a continuous value drastically. The linear difference between two discrete values has no meaning. For example, suppose $0$, $1$, and $2$ are used to represent colors *Red*, *Green*, and *Blue* respectively. Judging from the numeric representation of these colors, *Red* differs more from *Blue* than it does from *Green*. 
Yet in actuality, it is not true that extent to which the color *Blue* is different from *Red* is greater than the extent to which *Green* is different from *Red*. Therefore, when handling a discrete feature that has $d$ possible values, we usually convert it to $d$ new features where each feature takes a binary value, $0$ or $1$, indicating whether the original value is absent or present. Alternatively, the discrete features can be mapped onto a continuous multi-dimensional vector through an embedding table. For our problem here, because CHAS itself is a binary discrete value, we do not need to do any preprocessing. #### Feature Normalization -We also observe a huge difference among the value ranges of the 13 features (Figure 2). For instance, the values of feature *B* fall in $[0.32, 396.90]$, whereas those of feature *NOX* has a range of $[0.3850, 0.8170]$. An effective optimization would require data normalization. The goal of data normalization is to scale te values of each feature into roughly the same range, perhaps $[-0.5, 0.5]$. Here, we adopt a popular normalization technique where we substract the mean value from the feature value and divide the result by the width of the original range. +We also observe a huge difference among the value ranges of the 13 features (Figure 2). For instance, the values of feature *B* fall in $[0.32, 396.90]$, whereas those of feature *NOX* have a range of $[0.3850, 0.8170]$. An effective optimization would require data normalization. The goal of data normalization is to scale the values of each feature into roughly the same range, perhaps $[-0.5, 0.5]$. Here, we adopt a popular normalization technique where we subtract the mean value from the feature value and divide the result by the width of the original range.
There are at least three reasons for [Feature Normalization](https://en.wikipedia.org/wiki/Feature_scaling) (Feature Scaling): - A value range that is too large or too small might cause floating number overflow or underflow during computation. @@ -148,7 +148,7 @@ There are at least three reasons for [Feature Normalization](https://en.wikipedi

#### Prepare Training and Test Sets -We split the dataset in two, one for adjusting the model parameters, namely, for model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal in training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$. +We split the dataset in two, one for adjusting the model parameters, namely, for training the model, and the other for testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal in training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$. When training complex models, we usually have one more split: the validation set. Complex models usually have [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process, such as the number of layers in the network. 
Because hyperparameters are not part of the model parameters, they cannot be trained using the same loss function. Thus we will try several sets of hyperparameters to train several models and cross-validate them on the validation set to pick the best one; finally, the selected trained model is tested on the test set. Because our model is relatively simple, we will omit this validation process. @@ -166,7 +166,7 @@ paddle.init(use_gpu=False, trainer_count=1) ### Model Configuration -Logistic regression is essentially a fully-connected layer with linear activation: +Linear regression is essentially a fully-connected layer with linear activation: ```python x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13)) @@ -174,8 +174,19 @@ y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear()) y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1)) -cost = paddle.layer.mse_cost(input=y_predict, label=y) +cost = paddle.layer.square_error_cost(input=y_predict, label=y) ``` + +### Save Topology + +```python +# Save the inference topology to protobuf. +inference_topology = paddle.topology.Topology(layers=y_predict) +with open("inference_topology.pkl", 'wb') as f: + inference_topology.serialize_for_inference(f) +``` + + ### Create Parameters ```python @@ -196,7 +207,7 @@ trainer = paddle.trainer.SGD(cost=cost, PaddlePaddle provides the [reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader) -for loadinng training data. A reader may return multiple columns, and we need a Python dictionary to specify the mapping from column index to data layers. +for loading the training data. A reader may return multiple columns, and we need a Python dictionary to specify the mapping from column index to data layers. 
```python feeding={'x': 0, 'y': 1} @@ -221,7 +232,7 @@ def event_handler(event): ``` ```python -# event_handler to print training and testing info +# event_handler to plot training and testing info from paddle.v2.plot import Ploter train_title = "Train cost" diff --git a/01.fit_a_line/train.py b/01.fit_a_line/train.py index 255180d3c4322e8dd201e96917e288e3ee209d61..79a320fcb1d7fdef53a2254dfdb5d0317227cb2b 100644 --- a/01.fit_a_line/train.py +++ b/01.fit_a_line/train.py @@ -1,10 +1,13 @@ +import os import paddle.v2 as paddle import paddle.v2.dataset.uci_housing as uci_housing +with_gpu = os.getenv('WITH_GPU', '0') != '0' + def main(): # init - paddle.init(use_gpu=False, trainer_count=1) + paddle.init(use_gpu=with_gpu, trainer_count=1) # network config x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13)) @@ -12,6 +15,11 @@ def main(): y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1)) cost = paddle.layer.square_error_cost(input=y_predict, label=y) + # Save the inference topology to protobuf. 
+ inference_topology = paddle.topology.Topology(layers=y_predict) + with open("inference_topology.pkl", 'wb') as f: + inference_topology.serialize_for_inference(f) + # create parameters parameters = paddle.parameters.create(cost) @@ -21,10 +29,6 @@ def main(): trainer = paddle.trainer.SGD( cost=cost, parameters=parameters, update_equation=optimizer) - # save model proto as file - with open("model.proto", "w") as f: - f.write(str(trainer.__topology_in_proto__)) - feeding = {'x': 0, 'y': 1} # event_handler to print training and testing info diff --git a/02.recognize_digits/README.md b/02.recognize_digits/README.md index 198897fb55c731ad8b4210fa4dc6e36978c6f197..b7836415c507e86eb0b627e894a449092fcb5d85 100644 --- a/02.recognize_digits/README.md +++ b/02.recognize_digits/README.md @@ -1,22 +1,20 @@ # Recognize Digits -The source code for this tutorial is live at [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits). For instructions on getting started with Paddle, please refer to [installation instructions](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). +The source code for this tutorial is here: [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits). For instructions on getting started with Paddle, please refer to [installation instructions](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). ## Introduction -When one learns to program, the first task is usually to write a program that prints "Hello World!". In Machine Learning or Deep Learning, the equivalent task is to train a model to recognize hand-written digits on the dataset [MNIST](http://yann.lecun.com/exdb/mnist/). Handwriting recognition is a classic image classification problem. The problem is relatively easy and MNIST is a complete dataset. As a simple Computer Vision dataset, MNIST contains images of handwritten digits and their corresponding labels (Fig. 1). 
The input image is a $28\times28$ matrix, and the label is one of the digits from $0$ to $9$. All images are normalized, meaning that they are both rescaled and centered. +When one learns to program, the first task is usually to write a program that prints "Hello World!". In Machine Learning or Deep Learning, an equivalent task is to train a model to recognize hand-written digits using the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. Handwriting recognition is a classic image classification problem. The problem is relatively easy and MNIST is a complete dataset. As a simple Computer Vision dataset, MNIST contains images of handwritten digits and their corresponding labels (Fig. 1). The input image is a $28\times28$ matrix, and the label is one of the digits from $0$ to $9$. All images are normalized, meaning that they are both rescaled and centered.


Fig. 1. Examples of MNIST images

-The MNIST dataset is created from the [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and the Special Database 1 (SD-1). The SD-3 is labeled by the staff of the U.S. Census Bureau, while SD-1 is labeled by high school students the in U.S. Therefore the SD-3 is cleaner and easier to recognize than the SD-1 dataset. Yann LeCun et al. used half of the samples from each of SD-1 and SD-3 to create the MNIST training set (60,000 samples) and test set (10,000 samples), where training set was labeled by 250 different annotators, and it was guaranteed that there wasn't a complete overlap of annotators of training set and test set. +The MNIST dataset is from the [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and the Special Database 1 (SD-1). The SD-3 is labeled by the staff of the U.S. Census Bureau, while SD-1 is labeled by high school students. Therefore the SD-3 is cleaner and easier to recognize than the SD-1 dataset. Yann LeCun et al. used half of the samples from each of SD-1 and SD-3 to create the MNIST training set of 60,000 samples and test set of 10,000 samples. 250 annotators labeled the training set, and it was guaranteed that the annotators of the training set and the test set did not completely overlap. -Yann LeCun, one of the founders of Deep Learning, have previously made tremendous contributions to handwritten character recognition and proposed the **Convolutional Neural Network** (CNN), which drastically improved recognition capability for handwritten characters. CNNs are now a critical concept in Deep Learning. From the LeNet proposal by Yann LeCun, to those winning models in ImageNet competitions, such as VGGNet, GoogLeNet, and ResNet (See [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification) tutorial), CNNs have achieved a series of impressive results in Image Classification tasks.
+The MNIST dataset has been used for evaluating many image recognition algorithms such as a single layer linear classifier, Multilayer Perceptron (MLP) and Multilayer CNN LeNet\[[1](#references)\], K-Nearest Neighbors (k-NN) \[[2](#references)\], Support Vector Machine (SVM) \[[3](#references)\], Neural Networks \[[4-7](#references)\], Boosting \[[8](#references)\] and preprocessing methods like distortion removal, noise removal, and blurring. Among these algorithms, the *Convolutional Neural Network* (CNN) has achieved a series of impressive results in Image Classification tasks, including VGGNet, GoogLeNet, and ResNet (See [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification) tutorial). -Many algorithms are tested on MNIST. In 1998, LeCun experimented with single layer linear classifier, Multilayer Perceptron (MLP) and Multilayer CNN LeNet. These algorithms quickly reduced test error from 12% to 0.7% \[[1](#references)\]. Since then, researchers have worked on many algorithms such as **K-Nearest Neighbors** (k-NN) \[[2](#references)\], **Support Vector Machine** (SVM) \[[3](#references)\], **Neural Networks** \[[4-7](#references)\] and **Boosting** \[[8](#references)\]. Various preprocessing methods like distortion removal, noise removal, and blurring, have also been applied to increase recognition accuracy. - -In this tutorial, we tackle the task of handwritten character recognition. We start with a simple **softmax** regression model and guide our readers step-by-step to improve this model's performance on the task of recognition. +In this tutorial, we start with a simple **softmax** regression model and go on with MLP and CNN. Readers will see how these methods improve the recognition accuracy step-by-step. 
## Model Overview @@ -36,7 +34,7 @@ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$ where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $ -For an $N$-class classification problem with $N$ output nodes, Softmax normalizes the resulting $N$ dimensional vector so that each of its entries falls in the range $[0,1]\in\math{R}$, representing the probability that the sample belongs to a certain class. Here $y_i$ denotes the predicted probability that an image is of digit $i$. +For an $N$-class classification problem with $N$ output nodes, Softmax normalizes the resulting $N$ dimensional vector so that each of its entries falls in the range $[0,1]\subset\mathbb{R}$, representing the probability that the sample belongs to a certain class. Here $y_i$ denotes the predicted probability that an image is of digit $i$. In such a classification problem, we usually use the cross entropy loss function: @@ -76,7 +74,7 @@ Fig. 4. Convolutional layer
The **convolutional layer** is the core of a Convolutional Neural Network. The parameters in this layer are composed of a set of filters, also called kernels. We could visualize the convolution step in the following fashion: Each kernel slides horizontally and vertically till it covers the whole image. At every window, we compute the dot product of the kernel and the input. Then, we add the bias and apply an activation function. The result is a two-dimensional activation map. For example, some kernel may recognize corners, and some may recognize circles. These convolution kernels may respond strongly to the corresponding features. -Fig. 4 illustrates the dynamic programming of a convolutional layer, where depths are flattened for simplicity. The input is $W_1=5$, $H_1=5$, $D_1=3$. In fact, this is a common representation for colored images. $W_1$ and $H_1$ correspond to the width and height in a colored image. $D_1$ corresponds to the 3 color channels for RGB. The parameters of the convolutional layer are $K=2$, $F=3$, $S=2$, $P=1$. $K$ denotes the number of kernels; specifically, $Filter$ $W_0$ and $Filter$ $W_1$ are the kernels. $F$ is kernel size while $W0$ and $W1$ are both $F\timesF = 3\times3$ matrices in all depths. $S$ is the stride, which is the width of the sliding window; here, kernels move leftwards or downwards by 2 units each time. $P$ is the width of the padding, which denotes an extension of the input; here, the gray area shows zero padding with size 1. +Fig. 4 illustrates the computation process of a convolutional layer, where depths are flattened for simplicity. The input is $W_1=5$, $H_1=5$, $D_1=3$. In fact, this is a common representation for colored images. $W_1$ and $H_1$ correspond to the width and height in a colored image. $D_1$ corresponds to the three color channels for RGB. The parameters of the convolutional layer are $K=2$, $F=3$, $S=2$, $P=1$.
$K$ denotes the number of kernels; specifically, $Filter$ $W_0$ and $Filter$ $W_1$ are the kernels. $F$ is the kernel size, so $W_0$ and $W_1$ are both $F\times F = 3\times 3$ matrices in all depths. $S$ is the stride, which is the step size of the sliding window; here, kernels move rightwards or downwards by two units each time. $P$ is the width of the padding, which denotes an extension of the input; here, the gray area shows zero padding with size 1. #### Pooling Layer @@ -96,11 +94,11 @@ Fig. 6. LeNet-5 Convolutional Neural Network architecture
[**LeNet-5**](http://yann.lecun.com/exdb/lenet/) is one of the simplest Convolutional Neural Networks. Fig. 6. shows its architecture: A 2-dimensional input image is fed into two sets of convolutional layers and pooling layers. This output is then fed to a fully connected layer and a softmax classifier. Compared to multilayer, fully connected perceptrons, the LeNet-5 can recognize images better. This is due to the following three properties of the convolution: -- The 3D nature of the neurons: a convolutional layer is organized by width, height and depth. Neurons in each layer are connected to only a small region in the previous layer. This region is called the receptive field. +- The 3D nature of the neurons: a convolutional layer is organized by width, height, and depth. Neurons in each layer are connected to only a small region in the previous layer. This region is called the receptive field. - Local connectivity: A CNN utilizes the local space correlation by connecting local neurons. This design guarantees that the learned filter has a strong response to local input features. Stacking many such layers generates a non-linear filter that is more global. This enables the network to first obtain good representation for small parts of input and then combine them to represent a larger region. -- Weight sharing: In a CNN, computation is iterated on shared parameters (weights and bias) to form a feature map. This means that all the neurons in the same depth of the output respond to the same feature. This allows the network to detect a feature regardless of its position in the input. In other words, it is shift invariant. +- Weight sharing: In a CNN, computation is iterated on shared parameters (weights and bias) to form a feature map. This means that all the neurons in the same depth of the output respond to the same feature. This allows the network to detect a feature regardless of its position in the input. 
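The convolution parameters described for Fig. 4 and the weight-sharing property can be checked with a small sketch in plain Python. This is only an illustration under stated assumptions: the helper names `conv_output_size` and `conv2d_single_channel` are hypothetical and not part of PaddlePaddle, the output-size formula $(W - F + 2P)/S + 1$ is the standard one and is not stated in the tutorial itself, and the activation function is omitted for brevity.

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension.

    Uses the standard formula (W - F + 2P) / S + 1 (an assumption here;
    the tutorial text does not spell it out).
    """
    span = input_size - kernel_size + 2 * padding
    if span % stride != 0:
        raise ValueError("kernel does not tile the padded input evenly")
    return span // stride + 1


def conv2d_single_channel(image, kernel, bias, stride, padding):
    """Naive convolution of one input channel with one shared kernel.

    The same kernel weights and bias are reused at every window, which is
    the weight sharing described above. No activation is applied.
    """
    height, width = len(image), len(image[0])
    f = len(kernel)

    # Zero-pad the input on all four sides.
    padded = [[0.0] * (width + 2 * padding) for _ in range(height + 2 * padding)]
    for r in range(height):
        for c in range(width):
            padded[r + padding][c + padding] = image[r][c]

    out_h = conv_output_size(height, f, stride, padding)
    out_w = conv_output_size(width, f, stride, padding)

    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product of the kernel with the current window, plus bias.
            total = bias
            for u in range(f):
                for v in range(f):
                    total += kernel[u][v] * padded[i * stride + u][j * stride + v]
            row.append(total)
        output.append(row)
    return output


# Parameters from the Fig. 4 description: W_1 = H_1 = 5, F = 3, S = 2, P = 1.
size = conv_output_size(5, 3, 2, 1)

# Toy data (illustrative only): an all-ones image and an all-ones kernel.
image = [[1.0] * 5 for _ in range(5)]
kernel = [[1.0] * 3 for _ in range(3)]
activation_map = conv2d_single_channel(image, kernel, bias=0.0, stride=2, padding=1)
```

With these parameters each kernel produces a $3\times 3$ activation map, so $K=2$ kernels give an output of depth 2. In the toy example the corner entries are smaller than the center entry because the corner windows overlap the zero padding.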
-For more details on Convolutional Neural Networks, please refer to the tutorial on [Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) and the [relevant lecture](http://cs231n.github.io/convolutional-networks/) from a Stanford open course. +For more details on Convolutional Neural Networks, please refer to the tutorial on [Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) and the [relevant lecture](http://cs231n.github.io/convolutional-networks/) from a Stanford course. ### List of Common Activation Functions - Sigmoid activation function: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $ @@ -221,11 +219,11 @@ trainer = paddle.trainer.SGD(cost=cost, update_equation=optimizer) ``` -Then we specify the training data `paddle.dataset.movielens.train()` and testing data `paddle.dataset.movielens.test()`. These two methods are *reader creators*. Once called, a reader creator returns a *reader*. A reader is a Python method, which, once called, returns a Python generator, which yields instances of data. +Then we specify the training data `paddle.dataset.mnist.train()` and testing data `paddle.dataset.mnist.test()`. These two methods are *reader creators*. Once called, a reader creator returns a *reader*. A reader is a Python method, which, once called, returns a Python generator, which yields instances of data. -`shuffle` is a reader decorator. It takes in a reader A as input and returns a new reader B. Under the hood, B calls A to read data in the following fashion: it copies in `buffer_size` instances at a time into a buffer, shuffles the data, and yields the shuffled instances one at a time. A large buffer size would yield very shuffled data. +`shuffle` is a reader decorator. It takes a reader A as input and returns a new reader B. 
Under the hood, B calls A to read data in the following fashion: it reads `buffer_size` instances at a time into a buffer, shuffles the data, and yields the shuffled instances one at a time. A larger buffer size yields more thoroughly shuffled data. -`batch` is a special decorator, which takes in reader and outputs a *batch reader*, which doesn't yield an instance, but a minibatch at a time. +`batch` is a special decorator, which takes a reader and outputs a *batch reader* that yields a minibatch at a time instead of a single instance. `event_handler_plot` is used to plot a figure like below: @@ -263,6 +261,7 @@ def event_handler_plot(event): ```python lists = [] +# event handler to print the progress def event_handler(event): if isinstance(event, paddle.event.EndIteration): if event.batch_id % 100 == 0: @@ -282,6 +281,7 @@ def event_handler(event): ``` ```python +# Train the model now trainer.train( reader=paddle.batch( paddle.reader.shuffle( @@ -315,7 +315,7 @@ Usually, with MNIST data, the softmax regression model achieves an accuracy arou ## Application -After training is done, user can use the trained model to classify images. The following code shows how to inference MNIST images through `paddle.infer` interface. +After training, users can use the trained model to classify images. The following code shows how to run inference on MNIST images through the `paddle.infer` interface. ```python from PIL import Image @@ -343,15 +343,15 @@ print "Label of image/infer_3.png is: %d" % lab[0][0] This tutorial describes a few common deep learning models using **Softmax regression**, **Multilayer Perceptron Network**, and **Convolutional Neural Network**. Understanding these models is crucial for future learning; the subsequent tutorials derive more sophisticated networks by building on top of them. 
-When our model evolves from a simple softmax regression to a slightly complex Convolutional Neural Network, the recognition accuracy on the MNIST data set achieves a large improvement in accuracy. This is due to the Convolutional layers' local connections and parameter sharing. While learning new models in the future, we encourage the readers to understand the key ideas that lead a new model to improve the results of an old one. +When our model evolves from a simple softmax regression to a slightly complex Convolutional Neural Network, the recognition accuracy on the MNIST dataset achieves a large improvement. This is due to the Convolutional layers' local connections and parameter sharing. While learning new models in the future, we encourage the readers to understand the key ideas that lead a new model to improve the results of an old one. -Moreover, this tutorial introduces the basic flow of PaddlePaddle model design, which starts with a *dataprovider*, a model layer construction, and finally training and prediction. Motivated readers can leverage the flow used in this MNIST handwritten digit classification example and experiment with different data and network architectures to train models for classification tasks of their choice. +Moreover, this tutorial introduces the basic flow of PaddlePaddle model design, which starts with a *data provider*, a model layer construction, and finally training and prediction. Motivated readers can leverage the flow used in this MNIST handwritten digit classification example and experiment with different data and network architectures to train models for classification tasks of their choice. ## References 1. LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. ["Gradient-based learning applied to document recognition."](http://ieeexplore.ieee.org/abstract/document/726791/) Proceedings of the IEEE 86, no. 11 (1998): 2278-2324. -2. Wejéus, Samuel. 
["A Neural Network Approach to Arbitrary SymbolRecognition on Modern Smartphones."](http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A753279&dswid=-434) (2014). +2. Wejéus, Samuel. ["A Neural Network Approach to Arbitrary Symbol Recognition on Modern Smartphones."](http://www.diva-portal.org/smash/record.jsf?pid=diva2:753279&dswid=-434) (2014). 3. Decoste, Dennis, and Bernhard Schölkopf. ["Training invariant support vector machines."](http://link.springer.com/article/10.1023/A:1012454411458) Machine learning 46, no. 1-3 (2002): 161-190. 4. Simard, Patrice Y., David Steinkraus, and John C. Platt. ["Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.160.8494&rep=rep1&type=pdf) In ICDAR, vol. 3, pp. 958-962. 2003. 5. Salakhutdinov, Ruslan, and Geoffrey E. Hinton. ["Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure."](http://www.jmlr.org/proceedings/papers/v2/salakhutdinov07a/salakhutdinov07a.pdf) In AISTATS, vol. 11. 2007. 
diff --git a/02.recognize_digits/client/client.py b/02.recognize_digits/client/client.py new file mode 100644 index 0000000000000000000000000000000000000000..45b338d6de402100e0ea20bd4361c70dc1bdb80a --- /dev/null +++ b/02.recognize_digits/client/client.py @@ -0,0 +1,21 @@ +import requests +from PIL import Image +import numpy as np +import os + + +def load_image(file): + im = Image.open(file).convert('L') + im = im.resize((28, 28), Image.ANTIALIAS) + im = np.array(im).astype(np.float32).flatten() + im = im / 255.0 + return im + + +cur_dir = os.path.dirname(os.path.realpath(__file__)) +data = load_image(cur_dir + '/../image/infer_3.png') +data = data.tolist() + +r = requests.post("http://0.0.0.0:8000", json={'img': data}) + +print(r.text) diff --git a/02.recognize_digits/index.html b/02.recognize_digits/index.html index 635e7fa30d6d57e0b8d81c086c9935c74093160b..4de8d78216850512c44aef94a172aba5f15d0203 100644 --- a/02.recognize_digits/index.html +++ b/02.recognize_digits/index.html @@ -42,23 +42,21 @@