diff --git a/fit_a_line/README.en.md b/fit_a_line/README.en.md
index a804ca9192d4df295bce81d9b95f1c69e9478439..3abb50a37e31930958828a7febc1103bfcfab7cb 100644
--- a/fit_a_line/README.en.md
+++ b/fit_a_line/README.en.md
@@ -43,15 +43,25 @@ After defining our model, we have several major steps for the training:
 3. Backward to [backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors will be propagated from the output layer back to the input layer, during which the model parameters will be updated with the corresponding errors.
 4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of repeats is reached.
 
-## Data Preparation
-Follow the command below to prepare data:
-```bash
-cd data && python prepare_data.py
+## Dataset
+
+### Python Dataset Modules
+
+Our program starts by importing the necessary packages:
+
+```python
+import paddle.v2 as paddle
+import paddle.v2.dataset.uci_housing as uci_housing
 ```
-This line of code will download the dataset from the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) and perform some [preprocessing](#Preprocessing). The dataset is split into a training set and a test set.
-The dataset contains 506 lines in total, each line describing the properties and the median price of a certain type of houses in Boston. The meaning of each line is below:
+We encapsulated the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) in our Python module `uci_housing`. This module can
+
+1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if it is not already cached, and
+2. [preprocess](#preprocessing) the dataset.
+
+### An Introduction to the Dataset
+The UCI housing dataset has 506 instances. Each instance describes a house in the Boston suburban area. Its properties include:
 
 | Property Name | Explanation | Data Type |
 | ------| ------ | ------ |
@@ -90,89 +100,91 @@ There are at least three reasons for [Feature Normalization](https://en.wikipedia.org/wiki/Feature_scaling):

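+To make the effect of this preprocessing concrete, here is a minimal sketch, under the assumption that the `uci_housing.train()` reader introduced above yields `(features, label)` pairs of NumPy arrays. It pulls one instance and inspects it:
+
+```python
+import paddle.v2.dataset.uci_housing as uci_housing
+
+# uci_housing.train() returns a reader creator; calling it yields a generator
+features, label = next(uci_housing.train()())
+print len(features)  # 13 property values, normalized by the module
+print label          # the median house price
+```
+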
 #### Prepare Training and Test Sets
-We split the dataset into two subsets, one for estimating the model parameters, namely, model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal of training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$. You can try different split ratios to observe how the two variances change.
+We split the dataset into two subsets, one for estimating the model parameters, namely, model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal in training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) more training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) more test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$.
 
-Executing the following command to split the dataset and write the training and test set into the `train.list` and `test.list` files, so that later PaddlePaddle can read from them.
-```python
-python prepare_data.py -r 0.8 #8:2 is the default split ratio
-```
 When training complex models, we usually have one more split: the validation set. Complex models usually have [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process begins. These hyperparameters are not part of the model parameters and cannot be trained using the same Loss Function (e.g., the number of layers in the network). Thus we will try several sets of hyperparameters to get several models, compare these trained models on the validation set to pick the best one, and finally evaluate it on the test set. Because our model is relatively simple in this problem, we ignore this validation process for now.
 
-### Provide Data to PaddlePaddle
-After the data is prepared, we use a Python Data Provider to provide data for PaddlePaddle. A Data Provider is a Python function which will be called by PaddlePaddle during training. In this example, the Data Provider only needs to read the data and return it to the training process of PaddlePaddle line by line.
-```python
-from paddle.trainer.PyDataProvider2 import *
-import numpy as np
-#define data type and dimensionality
-@provider(input_types=[dense_vector(13), dense_vector(1)])
-def process(settings, input_file):
-    data = np.load(input_file.strip())
-    for row in data:
-        yield row[:-1].tolist(), row[-1:].tolist()
-
-```
+## Training
+
+`fit_a_line/trainer.py` demonstrates the training using [PaddlePaddle](http://paddlepaddle.org).
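+
+Before walking through the script, it may help to see the loss it minimizes in action. The following is only an illustrative sketch in plain NumPy, independent of the PaddlePaddle API, of the MSE that `regression_cost` computes:
+
+```python
+import numpy as np
+
+# toy predicted and actual prices for four houses
+y_hat = np.array([10.0, 20.0, 30.0, 40.0])
+y = np.array([12.0, 18.0, 33.0, 39.0])
+
+# MSE = ((-2)^2 + 2^2 + (-3)^2 + 1^2) / 4 = 4.5
+print np.mean((y_hat - y) ** 2)
+```
+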
-## Model Configuration
+### Initialize PaddlePaddle
 
-### Data Definition
-We first call the function `define_py_data_sources2` to let PaddlePaddle read training and test data from the `dataprovider.py` in the above. PaddlePaddle can accept configuration info from the command line, for example, here we pass a variable named `is_predict` to control the model to have different structures during training and test.
 ```python
-from paddle.trainer_config_helpers import *
+paddle.init(use_gpu=False, trainer_count=1)
+```
 
-is_predict = get_config_arg('is_predict', bool, False)
+### Model Configuration
 
-define_py_data_sources2(
-    train_list='data/train.list',
-    test_list='data/test.list',
-    module='dataprovider',
-    obj='process')
+Linear regression is essentially a fully-connected layer with linear activation:
+```python
+x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
+y_predict = paddle.layer.fc(input=x,
+                            size=1,
+                            act=paddle.activation.Linear())
+y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
+cost = paddle.layer.regression_cost(input=y_predict, label=y)
 ```
+### Create Parameters
 
-### Algorithm Settings
-Next we need to set the details of the optimization algorithm. Due to the simplicity of the Linear Regression model, we only need to set the `batch_size` which defines how many samples are used every time for updating the parameters.
 ```python
-settings(batch_size=2)
+parameters = paddle.parameters.create(cost)
 ```
 
-### Network
-Finally, we use `fc_layer` and `LinearActivation` to represent the Linear Regression model.
+### Create Trainer
+
 ```python
-#input data of 13 dimensional house information
-x = data_layer(name='x', size=13)
-
-y_predict = fc_layer(
-    input=x,
-    param_attr=ParamAttr(name='w'),
-    size=1,
-    act=LinearActivation(),
-    bias_attr=ParamAttr(name='b'))
-
-if not is_predict: #when training, we use MSE (i.e., regression_cost) as the Loss Function
-    y = data_layer(name='y', size=1)
-    cost = regression_cost(input=y_predict, label=y)
-    outputs(cost) #output MSE to view the loss change
-else: #during test, output the prediction value
-    outputs(y_predict)
+optimizer = paddle.optimizer.Momentum(momentum=0)
+
+trainer = paddle.trainer.SGD(cost=cost,
+                             parameters=parameters,
+                             update_equation=optimizer)
 ```
 
-## Training Model
-We can run the PaddlePaddle command line trainer in the root directory of the code. Here we name the configuration file as `trainer_config.py`. We train 30 passes and save the result in the directory `output`:
-```bash
-./train.sh
+### Feeding Data
+
+PaddlePaddle provides the
+[reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
+for loading training data. A reader may return multiple columns,
+so we need a Python dictionary that maps column numbers to data layers.
+
+```python
+feeding={'x': 0, 'y': 1}
 ```
-## Use Model
-Now we can use the trained model to do prediction.
-```bash
-python predict.py
+Also, we provide an event handler function that prints the training progress:
+
+```python
+# event_handler to print training and testing info
+def event_handler(event):
+    if isinstance(event, paddle.event.EndIteration):
+        if event.batch_id % 100 == 0:
+            print "Pass %d, Batch %d, Cost %f" % (
+                event.pass_id, event.batch_id, event.cost)
+
+    if isinstance(event, paddle.event.EndPass):
+        result = trainer.test(
+            reader=paddle.batch(
+                uci_housing.test(), batch_size=2),
+            feeding=feeding)
+        print "Test %d, Cost %f" % (event.pass_id, result.cost)
 ```
-Here by default we use the model in `output/pass-00029` for prediction, and compare the actual house price with the predicted one. The result is shown in `predictions.png`.
-If you want to use another model or test on other data, you can pass in a new model path or data path:
-```bash
-python predict.py -m output/pass-00020 -t data/housing.test.npy
+
+### Start Training
+
+```python
+trainer.train(
+    reader=paddle.batch(
+        paddle.reader.shuffle(
+            uci_housing.train(), buf_size=500),
+        batch_size=2),
+    feeding=feeding,
+    event_handler=event_handler,
+    num_passes=30)
 ```
 
 ## Summary
diff --git a/fit_a_line/README.md b/fit_a_line/README.md
index f2e3243a3d1b91df5b8c9bfaa5da74fd142a63b3..266c6e91cb4d5249997203cada7ef920d3744386 100644
--- a/fit_a_line/README.md
+++ b/fit_a_line/README.md
@@ -39,16 +39,16 @@ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
 
 ### Training Process
-After defining the model structure, we train the model in the following steps:
- 1. Initialize the parameters, including the weights $\omega_i$ and the bias $b$, e.g., to zero mean and unit variance.
- 2. Run a forward pass to compute the network output and the loss function.
- 3. Propagate the error backward ([backpropagation](https://en.wikipedia.org/wiki/Backpropagation)) from the output layer toward the input layer, updating the network parameters along the way.
- 4. Repeat steps 2~3 until the training error is small enough or the number of training rounds reaches its preset limit.
-
+After defining the model structure, we train the model in the following steps:
+ 1. Initialize the parameters, including the weights $\omega_i$ and the bias $b$, e.g., to zero mean and unit variance.
+ 2. Run a forward pass to compute the network output and the loss function.
+ 3. Propagate the error backward ([backpropagation](https://en.wikipedia.org/wiki/Backpropagation)) from the output layer toward the input layer, updating the network parameters along the way.
+ 4. Repeat steps 2~3 until the training error is small enough or the number of training rounds reaches its preset limit.
+
 ## Dataset
 
 ### Encapsulation of the Dataset Interface
-First, load the required packages
+First, load the required packages
 
 ```python
 import paddle.v2 as paddle
@@ -59,9 +59,8 @@ import paddle.v2.dataset.uci_housing as uci_housing
 
 The `uci_housing` module encapsulates:
 
-1. the process of downloading the data
-   The downloaded data is saved in ~/.cache/paddle/dataset/uci_housing/housing.data
-2. the process of [data preprocessing](#数据预处理)
+1. the process of downloading the data; the downloaded data is saved in ~/.cache/paddle/dataset/uci_housing/housing.data.
+2. the process of [data preprocessing](#数据预处理).
 
 ### Introduction to the Dataset
 
@@ -105,25 +104,23 @@ import paddle.v2.dataset.uci_housing as uci_housing
 
 We split the dataset in two: one part is used to adjust the model parameters, i.e., to train the model, and the model's error on this set is called the **training error**; the other part is used for testing, and the model's error on this set is called the **test error**. Our goal in training a model is to find the underlying patterns in the training data in order to predict unknown new data, so the test error better reflects how well the model performs. We weigh two factors when choosing the split ratio: more training data lowers the variance of the parameter estimates, giving a more trustworthy model; more test data lowers the variance of the test error, giving a more trustworthy test error. In this example we set the split ratio to $8:2$.
-
 When training more complex models, we often use one more dataset: the validation set. Complex models usually come with [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that have to be tuned, so we train multiple models under different hyperparameter combinations, compare their performance on the validation set to pick the relatively best set of hyperparameters, and only then use the model trained with those hyperparameters to evaluate the test error on the test set. Since the model trained in this chapter is fairly simple, we skip this step for now.
 
 ## Training
 
-trainer.py under fit_a_line demonstrates the overall training process
-### Initialize paddlepaddle
+`fit_a_line/trainer.py` demonstrates the overall training process.
+
+### Initialize PaddlePaddle
 
 ```python
-# init
 paddle.init(use_gpu=False, trainer_count=1)
 ```
 
-### Model Configuration
-We use `fc_layer` and `LinearActivation` to represent the linear regression model itself.
+### Model Configuration
+
+The linear regression model is really just a fully-connected layer (`fc_layer`) with a linear activation (`LinearActivation`):
 
 ```python
-# input data: the 13-dimensional house information
 x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
 y_predict = paddle.layer.fc(input=x,
                             size=1,
@@ -131,17 +128,15 @@ y_predict = paddle.layer.fc(input=x,
 y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
 cost = paddle.layer.regression_cost(input=y_predict, label=y)
 ```
-### Create Parameters
+### Create Parameters
 
 ```python
-# create parameters
 parameters = paddle.parameters.create(cost)
 ```
 
-### Create trainer
+### Create Trainer
 
 ```python
-# create optimizer
 optimizer = paddle.optimizer.Momentum(momentum=0)
 
 trainer = paddle.trainer.SGD(cost=cost,
@@ -149,14 +144,20 @@ trainer = paddle.trainer.SGD(cost=cost,
                              update_equation=optimizer)
 ```
 
-### Read the data and print intermediate training information
-In the program, we obtain the training or test data through the reader interface, and print intermediate training information through the event handler
-feeding sets the indices of the training and test data; the reader distinguishes training and test data by these indices.
+### Read the Data and Print Intermediate Training Information
+
+PaddlePaddle provides a
+[reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
+for reading data. The data returned by a reader can contain multiple columns, so we
+need a Python dict that maps column numbers to the data layers in the network.
 
 ```python
-feeding={'x': 0,
-         'y': 1}
+feeding={'x': 0, 'y': 1}
+```
+
+In addition, we can provide an event handler to print the training progress:
+
+```python
 # event_handler to print training and testing info
 def event_handler(event):
     if isinstance(event, paddle.event.EndIteration):
@@ -171,10 +172,10 @@ def event_handler(event):
             feeding=feeding)
         print "Test %d, Cost %f" % (event.pass_id, result.cost)
 ```
-### Start Training
+
+### Start Training
 
 ```python
-# training
 trainer.train(
     reader=paddle.batch(
         paddle.reader.shuffle(
@@ -185,13 +186,6 @@ trainer.train(
     num_passes=30)
 ```
 
-## Run the Training Program in bash
-**Note: set the paddle installation path properly**
-
-```bash
-python train.py
-```
-
 ## Summary
 In this chapter, we used the Boston house price dataset to introduce the basic concepts of the linear regression model and showed how to implement training and testing with PaddlePaddle. Many models and techniques evolve from the simple linear regression model, so it is important to understand the principles and limitations of linear models.
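+
+After training, the model can also be used for prediction. The block below is only a minimal sketch, assuming the `paddle.infer` entry point of the v2 API (not otherwise shown in this chapter) and reusing `y_predict`, `parameters`, and `uci_housing` from above:
+
+```python
+# collect test instances; each item is assumed to be a (features, label) pair
+test_data = []
+test_label = []
+for item in uci_housing.test()():
+    test_data.append((item[0],))
+    test_label.append(item[1])
+
+# estimated prices for every test instance (assumed paddle.infer signature)
+probs = paddle.infer(
+    output_layer=y_predict, parameters=parameters, input=test_data)
+
+for i in xrange(len(probs)):
+    print "Predicted %.2f, actual %.2f" % (probs[i][0], test_label[i][0])
+```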