Commit 4482eef7, authored by helinwang, committed by GitHub

Merge pull request #153 from wangkuiyi/fit_a_line_2

Change fit_a_line/README.en.md according to changes in README.md
...@@ -43,15 +43,25 @@ After defining our model, we have several major steps for the training: ...@@ -43,15 +43,25 @@ After defining our model, we have several major steps for the training:
3. Backward to [backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors will be propagated from the output layer back to the input layer, during which the model parameters will be updated with the corresponding errors. 3. Backward to [backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors will be propagated from the output layer back to the input layer, during which the model parameters will be updated with the corresponding errors.
4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of repeats is reached. 4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of repeats is reached.
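The loop described by these steps can be sketched in plain Python, without PaddlePaddle. This is only an illustration under stated assumptions: the function name `train_linear_model`, the learning rate, the loss threshold, and the toy data are all hypothetical and do not come from the tutorial's code.

```python
def train_linear_model(xs, ys, learning_rate=0.05,
                       loss_threshold=1e-6, max_iters=10000):
    """Fit y ~ w * x + b by batch gradient descent on the mean squared error.

    Hypothetical helper for illustration; not part of PaddlePaddle.
    """
    w, b = 0.0, 0.0
    n = float(len(xs))
    loss = float("inf")
    for _ in range(max_iters):  # stop after a maximum number of repeats
        # Forward: compute the predictions and the loss.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        if loss < loss_threshold:  # stop once the loss is small enough
            break
        # Backward: gradients of the loss with respect to w and b,
        # followed by the parameter update.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss


# Toy data generated from y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, loss = train_linear_model(xs, ys)
```

On this toy data the loop stops through the loss threshold rather than the iteration cap, and `w` and `b` end up close to 2 and 1.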
## Data Preparation ## Dataset
Follow the command below to prepare data:
```bash ### Python Dataset Modules
cd data && python prepare_data.py
Our program starts with importing necessary packages:
```python
import paddle.v2 as paddle
import paddle.v2.dataset.uci_housing as uci_housing
``` ```
This line of code will download the dataset from the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) and perform some [preprocessing](#Preprocessing). The dataset is split into a training set and a test set.
The dataset contains 506 lines in total, each line describing the properties and the median price of a certain type of houses in Boston. The meaning of each line is below: We encapsulated the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) in our Python module `uci_housing`. This module can
1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if it has not been downloaded yet, and
2. [preprocess](#preprocessing) the dataset.
### An Introduction to the Dataset
The UCI housing dataset has 506 instances. Each instance describes a house in the suburban area of Boston. Properties include:
| Property Name | Explanation | Data Type | | Property Name | Explanation | Data Type |
| ------| ------ | ------ | | ------| ------ | ------ |
...@@ -90,89 +100,91 @@ There are at least three reasons for [Feature Normalization](https://en.wikipedi ...@@ -90,89 +100,91 @@ There are at least three reasons for [Feature Normalization](https://en.wikipedi
</p> </p>
#### Prepare Training and Test Sets #### Prepare Training and Test Sets
We split the dataset into two subsets, one for estimating the model parameters, namely, model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal of training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$. You can try different split ratios to observe how the two variances change. We split the dataset into two subsets, one for estimating the model parameters, namely, model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal of training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$.
Execute the following command to split the dataset and write the training and test sets into the `train.list` and `test.list` files, so that PaddlePaddle can read from them later.
```python
python prepare_data.py -r 0.8 #8:2 is the default split ratio
```
When training complex models, we usually have one more split: the validation set. Complex models usually have [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process begins. These hyperparameters are not part of the model parameters and cannot be trained using the same loss function (e.g., the number of layers in the network). Thus we try several sets of hyperparameters to get several models, compare these trained models on the validation set to pick the best one, and finally evaluate it on the test set. Because our model is relatively simple in this problem, we ignore this validation process for now. When training complex models, we usually have one more split: the validation set. Complex models usually have [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process begins. These hyperparameters are not part of the model parameters and cannot be trained using the same loss function (e.g., the number of layers in the network). Thus we try several sets of hyperparameters to get several models, compare these trained models on the validation set to pick the best one, and finally evaluate it on the test set. Because our model is relatively simple in this problem, we ignore this validation process for now.
### Provide Data to PaddlePaddle
After the data is prepared, we use a Python Data Provider to provide data for PaddlePaddle. A Data Provider is a Python function which will be called by PaddlePaddle during training. In this example, the Data Provider only needs to read the data and return it to the training process of PaddlePaddle line by line.
```python ## Training
from paddle.trainer.PyDataProvider2 import *
import numpy as np
#define data type and dimensionality
@provider(input_types=[dense_vector(13), dense_vector(1)])
def process(settings, input_file):
data = np.load(input_file.strip())
for row in data:
yield row[:-1].tolist(), row[-1:].tolist()
``` `fit_a_line/trainer.py` demonstrates the training using [PaddlePaddle](http://paddlepaddle.org).
## Model Configuration ### Initialize PaddlePaddle
### Data Definition
We first call the function `define_py_data_sources2` to let PaddlePaddle read training and test data through the `dataprovider.py` module shown above. PaddlePaddle can accept configuration info from the command line; here, for example, we pass a variable named `is_predict` so that the model has a different structure during training and during testing.
```python ```python
from paddle.trainer_config_helpers import * paddle.init(use_gpu=False, trainer_count=1)
```
is_predict = get_config_arg('is_predict', bool, False) ### Model Configuration
define_py_data_sources2( Linear regression is simply a fully-connected layer with a linear activation:
train_list='data/train.list',
test_list='data/test.list',
module='dataprovider',
obj='process')
```python
x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
y_predict = paddle.layer.fc(input=x,
size=1,
act=paddle.activation.Linear())
y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
cost = paddle.layer.regression_cost(input=y_predict, label=y)
``` ```
### Create Parameters
### Algorithm Settings
Next we need to set the details of the optimization algorithm. Due to the simplicity of the Linear Regression model, we only need to set the `batch_size` which defines how many samples are used every time for updating the parameters.
```python ```python
settings(batch_size=2) parameters = paddle.parameters.create(cost)
``` ```
### Network ### Create Trainer
Finally, we use `fc_layer` and `LinearActivation` to represent the Linear Regression model.
```python ```python
#input data of 13 dimensional house information optimizer = paddle.optimizer.Momentum(momentum=0)
x = data_layer(name='x', size=13)
y_predict = fc_layer( trainer = paddle.trainer.SGD(cost=cost,
input=x, parameters=parameters,
param_attr=ParamAttr(name='w'), update_equation=optimizer)
size=1,
act=LinearActivation(),
bias_attr=ParamAttr(name='b'))
if not is_predict: #when training, we use MSE (i.e., regression_cost) as the Loss Function
y = data_layer(name='y', size=1)
cost = regression_cost(input=y_predict, label=y)
outputs(cost) #output MSE to view the loss change
else: #during test, output the prediction value
outputs(y_predict)
``` ```
## Training Model ### Feeding Data
We can run the PaddlePaddle command line trainer in the root directory of the code. Here we name the configuration file as `trainer_config.py`. We train 30 passes and save the result in the directory `output`:
```bash PaddlePaddle provides the
./train.sh [reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
for loading training data. A reader might return multiple columns,
and we need a Python dictionary to specify the correspondence from
column number to data layers.
```python
feeding={'x': 0, 'y': 1}
``` ```
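To make the role of the `feeding` dictionary concrete, the sketch below routes the columns of one sample to layer names in plain Python. It is not PaddlePaddle's implementation: `route_columns` is a hypothetical helper and the sample values are made up for illustration.

```python
# Same mapping as in the tutorial: layer name -> column index in a sample.
feeding = {'x': 0, 'y': 1}


def route_columns(sample, feeding):
    """Return {layer_name: column_value} for one sample (hypothetical helper)."""
    return dict((name, sample[index]) for name, index in feeding.items())


# A made-up sample: column 0 holds the 13 features, column 1 holds the price.
sample = ([0.1] * 13, [24.0])
routed = route_columns(sample, feeding)
```

Here `routed['x']` holds the 13-dimensional feature vector and `routed['y']` the one-dimensional label, matching the two data layers defined in the model configuration.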
## Use Model Also, we provide an event handler function which prints the training progress:
Now we can use the trained model to do prediction.
```bash ```python
python predict.py # event_handler to print training and testing info
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "Pass %d, Batch %d, Cost %f" % (
event.pass_id, event.batch_id, event.cost)
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=paddle.batch(
uci_housing.test(), batch_size=2),
feeding=feeding)
print "Test %d, Cost %f" % (event.pass_id, result.cost)
``` ```
Here by default we use the model in `output/pass-00029` for prediction, and compare the actual house price with the predicted one. The result is shown in `predictions.png`.
If you want to use another model or test on other data, you can pass in a new model path or data path: ### Start Training
```bash
python predict.py -m output/pass-00020 -t data/housing.test.npy ```python
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(
uci_housing.train(), buf_size=500),
batch_size=2),
feeding=feeding,
event_handler=event_handler,
num_passes=30)
``` ```
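The call above nests a shuffling reader inside a batching reader. The sketch below imitates that composition with plain-Python stand-ins, assuming a reader is a function that returns an iterable of samples. `toy_reader`, `shuffle_reader`, and `batch_reader` are hypothetical names written for this illustration; they are simplified and may differ from what `paddle.reader.shuffle` and `paddle.batch` actually do.

```python
import random


def toy_reader():
    """A made-up reader that yields five (features, label) samples."""
    for i in range(5):
        yield ([float(i)], [2.0 * i])


def shuffle_reader(reader, buf_size):
    """Stand-in: buffer up to buf_size samples, then yield them in random order."""
    def shuffled():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                random.shuffle(buf)
                for s in buf:
                    yield s
                buf = []
        # Flush whatever is left in the buffer.
        random.shuffle(buf)
        for s in buf:
            yield s
    return shuffled


def batch_reader(reader, batch_size):
    """Stand-in: group samples into lists of batch_size (this sketch keeps a short last batch)."""
    def batched():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    return batched


batches = list(batch_reader(shuffle_reader(toy_reader, buf_size=500),
                            batch_size=2)())
```

Because each wrapper takes a reader and returns a reader, the two can be stacked in either order, which is the design choice the nested call in the training code relies on.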
## Summary ## Summary
......
...@@ -59,9 +59,8 @@ import paddle.v2.dataset.uci_housing as uci_housing ...@@ -59,9 +59,8 @@ import paddle.v2.dataset.uci_housing as uci_housing
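The uci_housing module encapsulates: The uci_housing module encapsulates:
1. The data download process<br> 1. The data download process. The downloaded data is saved to ~/.cache/paddle/dataset/uci_housing/housing.data.
The downloaded data is saved to ~/.cache/paddle/dataset/uci_housing/housing.data<br> 2. The [data preprocessing](#数据预处理) process.
2. The [data preprocessing](#数据预处理) process<br>
### Introduction to the Dataset ### Introduction to the Dataset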
...@@ -105,25 +104,23 @@ import paddle.v2.dataset.uci_housing as uci_housing ...@@ -105,25 +104,23 @@ import paddle.v2.dataset.uci_housing as uci_housing
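We split the dataset into two parts: one is used to adjust the model parameters, i.e., to train the model, and the model's error on this part is called the **training error**; the other is used for testing, and the model's error on it is called the **test error**. The goal of training a model is to predict unseen data by finding patterns in the training data, so the test error is the metric that better reflects model performance. Two factors must be considered when choosing the split ratio: more training data reduces the variance of the parameter estimates, yielding a more reliable model; more test data reduces the variance of the test error, yielding a more reliable test error. In this example the split ratio is set to $8:2$. We split the dataset into two parts: one is used to adjust the model parameters, i.e., to train the model, and the model's error on this part is called the **training error**; the other is used for testing, and the model's error on it is called the **test error**. The goal of training a model is to predict unseen data by finding patterns in the training data, so the test error is the metric that better reflects model performance. Two factors must be considered when choosing the split ratio: more training data reduces the variance of the parameter estimates, yielding a more reliable model; more test data reduces the variance of the test error, yielding a more reliable test error. In this example the split ratio is set to $8:2$.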
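A minimal sketch of an 8:2 split in plain Python follows. It assumes the rows are shuffled before cutting, which the text does not state. `split_dataset` is a hypothetical helper and does not describe how the `uci_housing` module splits its data.

```python
import random


def split_dataset(rows, train_ratio=0.8, seed=0):
    """Shuffle a copy of rows and cut it into (train, test) parts.

    Hypothetical helper; the shuffle and the seed are assumptions.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


# Stand-in for the 506 instances of the dataset: just their indices.
rows = list(range(506))
train_rows, test_rows = split_dataset(rows)
```

Changing `train_ratio` is a quick way to see the trade-off described above: a larger training part leaves fewer rows for estimating the test error.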
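When training more complex models, we often use one more dataset: the validation set. Complex models usually have hyperparameters ([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization)) that need tuning, so we try several combinations of hyperparameters to train several models, compare their performance on the validation set to choose the best set of hyperparameters, and only then evaluate the test error of the model trained with those hyperparameters on the test set. Because the model trained in this chapter is relatively simple, we skip this process for now. When training more complex models, we often use one more dataset: the validation set. Complex models usually have hyperparameters ([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization)) that need tuning, so we try several combinations of hyperparameters to train several models, compare their performance on the validation set to choose the best set of hyperparameters, and only then evaluate the test error of the model trained with those hyperparameters on the test set. Because the model trained in this chapter is relatively simple, we skip this process for now.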
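The validation procedure can be sketched on a toy problem. The example below is an assumption-laden illustration, not part of the tutorial: it uses a one-dimensional ridge regression without a bias term, treats the L2 penalty `lam` as the hyperparameter, and uses made-up data. `fit_ridge_1d` and `mse` are hypothetical helpers.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form slope for y ~ w * x with L2 penalty lam (no bias term)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / float(len(xs))


# Made-up data, split into training, validation, and test parts.
train_x, train_y = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
valid_x, valid_y = [4.0, 5.0], [8.0, 10.1]
test_x, test_y = [6.0], [12.0]

# Train one model per candidate hyperparameter and compare on the validation set.
candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = min(
    candidates,
    key=lambda lam: mse(fit_ridge_1d(train_x, train_y, lam), valid_x, valid_y))

# Only the selected model is evaluated on the test set.
best_w = fit_ridge_1d(train_x, train_y, best_lam)
test_error = mse(best_w, test_x, test_y)
```

The test set is touched exactly once, after the hyperparameter has been chosen, so the reported test error is not biased by the selection step.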
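## Training ## Training
`trainer.py` under fit_a_line demonstrates the overall training process
### Initialize paddlepaddle `fit_a_line/trainer.py` demonstrates the overall training process.
### Initialize PaddlePaddle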
```python ```python
# init
paddle.init(use_gpu=False, trainer_count=1) paddle.init(use_gpu=False, trainer_count=1)
``` ```
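### Model Configuration ### Model Configuration
We use `fc_layer` and `LinearActivation` to represent the linear regression model itself. A linear regression model is in fact a fully-connected layer (`fc_layer`) with a linear activation function (`LinearActivation`):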
```python ```python
# input data: 13-dimensional house information
x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13)) x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
y_predict = paddle.layer.fc(input=x, y_predict = paddle.layer.fc(input=x,
size=1, size=1,
...@@ -134,14 +131,12 @@ cost = paddle.layer.regression_cost(input=y_predict, label=y) ...@@ -134,14 +131,12 @@ cost = paddle.layer.regression_cost(input=y_predict, label=y)
### Create Parameters ### Create Parameters
```python ```python
# create parameters
parameters = paddle.parameters.create(cost) parameters = paddle.parameters.create(cost)
``` ```
### Create trainer ### Create Trainer
```python ```python
# create optimizer
optimizer = paddle.optimizer.Momentum(momentum=0) optimizer = paddle.optimizer.Momentum(momentum=0)
trainer = paddle.trainer.SGD(cost=cost, trainer = paddle.trainer.SGD(cost=cost,
...@@ -150,13 +145,19 @@ trainer = paddle.trainer.SGD(cost=cost, ...@@ -150,13 +145,19 @@ trainer = paddle.trainer.SGD(cost=cost,
``` ```
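### Read Data and Print Intermediate Training Information ### Read Data and Print Intermediate Training Information
In the program, we obtain the training or test data through the reader interface and print intermediate training information through the event handler
`feeding` sets the indices of the training data and the test data; the reader uses these indices to tell training data from test data. PaddlePaddle provides a
[reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
for reading data. The data returned by a reader may contain multiple columns, so we need a Python dict that maps column
indices to the data layers in the network.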
```python ```python
feeding={'x': 0, feeding={'x': 0, 'y': 1}
'y': 1} ```
In addition, we can provide an event handler to print the training progress:
```python
# event_handler to print training and testing info # event_handler to print training and testing info
def event_handler(event): def event_handler(event):
if isinstance(event, paddle.event.EndIteration): if isinstance(event, paddle.event.EndIteration):
...@@ -171,10 +172,10 @@ def event_handler(event): ...@@ -171,10 +172,10 @@ def event_handler(event):
feeding=feeding) feeding=feeding)
print "Test %d, Cost %f" % (event.pass_id, result.cost) print "Test %d, Cost %f" % (event.pass_id, result.cost)
``` ```
### Start Training ### Start Training
```python ```python
# training
trainer.train( trainer.train(
reader=paddle.batch( reader=paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
...@@ -185,13 +186,6 @@ trainer.train( ...@@ -185,13 +186,6 @@ trainer.train(
num_passes=30) num_passes=30)
``` ```
## Run the Training Program in bash
**Make sure the path to the paddle installation package is set correctly**
```bash
python train.py
```
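## Summary ## Summary
In this chapter, we used the Boston housing price dataset to introduce the basic concepts of the linear regression model and showed how to train and test it with PaddlePaddle. Many models and techniques evolve from the simple linear regression model, so it is important to understand the principles and limitations of linear models. In this chapter, we used the Boston housing price dataset to introduce the basic concepts of the linear regression model and showed how to train and test it with PaddlePaddle. Many models and techniques evolve from the simple linear regression model, so it is important to understand the principles and limitations of linear models.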
......