# Linear Regression
Let's start this tutorial from the classic Linear Regression ([[1](#References)]) model.
In this chapter, you will build a model to predict house price with real datasets and learn about several important concepts about machine learning.
The source code of this tutorial is in [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). For the new users, please refer to [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book) .
## Background
Given a $n$ dataset ${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$, of which $ x_{i1}, \ldots, x_{id}$ are the values of the $d$th attribute of $i$ sample, and $y_i$ is the target to be predicted for this sample.
The linear regression model assumes that the target $y_i$ can be described by a linear combination among attributes, i.e.
$$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, i=1,\ldots,n$$
For example, in the problem of prediction of house price we are going to explore, $x_{ij}$ is a description of the various attributes of the house $i$ (such as the number of rooms, the number of schools and hospitals around, traffic conditions, etc.). $y_i$ is the price of the house.
At first glance, this assumption is too simple, and the true relationship among variables is unlikely to be linear. However, because the linear regression model has the advantages of simple form and easy to be modeled and analyzed, it has been widely applied in practical problems. Many classic statistical learning and machine learning books \[[2,3,4](#references)\] also focus on linear model in a chapter.
## Result Demo
We used the Boston house price dataset obtained from [UCI Housing dataset](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) to train and predict the model. The scatter plot below shows the result of price prediction for parts of house with model. Each point on x-axis represents the median of the real price of the same type of house, and the y-axis represents the result of the linear regression model based on the feature prediction. When the two values are completely equal, they will fall on the dotted line. So the more accurate the model is predicted, the closer the point is to the dotted line.
Figure One. Predict value V.S Ground-truth value
## Model Overview
### Model Definition
In the dataset of Boston house price, there are 14 values associated with the home: the first 13 are used to describe various information of house, that is $x_i$ in the model; the last value is the medium price of the house we want to predict, which is $y_i$ in the model.
Therefore, our model can be expressed as:
$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$
$\hat{Y}$ represents the predicted result of the model and is used to distinguish it from the real value $Y$. The parameters to be learned by the model are: $\omega_1, \ldots, \omega_{13}, b$.
After building the model, we need to give the model an optimization goal so that the learned parameters can make the predicted value $\hat{Y}$ get as close to the true value $Y$. Here we introduce the concept of loss function ([Loss Function](https://en.wikipedia.org/wiki/Loss_function), or Cost Function. Input the target value $y_{i}$ of any data sample and the predicted value $\hat{y_{i}}$ given by a model. Then the loss function outputs a non-negative real number, which is usually used to represent model error.
For linear regression models, the most common loss function is the Mean Squared Error ([MSE](https://en.wikipedia.org/wiki/Mean_squared_error)), which is:
$$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
That is, for a test set in size of $n$, $MSE$ is the mean of the squared error of the $n$ data prediction results.
The method used to optimize the loss function is generally the gradient descent method. The gradient descent method is a first-order optimization algorithm. If $f(x)$ is defined and divisible at point $x_n$, then $f(x)$ is considered to be the fastest in the negative direction of the gradient $-▽f(x_n)$ at point of $x_n$. Adjust $x$ repeatedly to make $f(x)$ close to the local or global minimum value. The adjustment is as follows:
$$x_n+1=x_n-λ▽f(x), n≧0$$
Where λ represents the learning rate. This method of adjustment is called the gradient descent method.
### Training Process
After defining the model structure, we will train the model through the following steps.
1. Initialize parameters, including weights $\omega_i$ and bias $b$, to initialize them (eg. 0 as mean, 1 as variance).
2. Forward propagation of network calculates network output and loss functions.
3. Reverse error propagation according to the loss function ( [backpropagation](https://en.wikipedia.org/wiki/Backpropagation) ), passing forward the network error from the output layer and updating the parameters in the network.
4. Repeat steps 2~3 until the network training error reaches the specified level or the training round reaches the set value.
## Dataset
### Dataset Introduction
The dataset consists of 506 lines, each containing information about a type of houses in a suburb of Boston and the median price of that type of house. The meaning of each dimensional attribute is as follows:
| Property Name | Explanation | Type |
| ------| ------ | ------ |
CRIM | Per capita crime rate in the town | Continuous value |
| ZN | Proportion of residential land with an area of over 25,000 square feet | Continuous value |
| INDUS | Proportion of non-retail commercial land | Continuous value |
CHAS | Whether it is adjacent to Charles River | Discrete value, 1=proximity; 0=not adjacent |
NOX | Nitric Oxide Concentration | Continuous value |
| RM | Average number of rooms per house | Continuous value |
| AGE | Proportion of self-use units built before 1940 | Continuous value |
| DIS | Weighted Distance to 5 Job Centers in Boston | Continuous value |
| RAD | Accessibility Index to Radial Highway | Continuous value |
| TAX | Tax Rate of Full-value Property | Continuous value |
| PTRATIO | Proportion of Student and Teacher | Continuous value |
| B | 1000(BK - 0.63)^2, where BK is black ratio | Continuous value |
LSTAT | Low-income population ratio | Continuous value |
| MEDV | Median price of a similar home | Continuous value |
### Data Pre-processing
#### Continuous value and discrete value
Analyzing the data, first we find that all 13-dimensional attributes exist 12-dimensional continuous value and 1-dimensional discrete values (CHAS). Discrete value is often represented by numbers like 0, 1, and 2, but its meaning is different from continuous value's because the difference of discrete value here has no meaning. For example, if we use 0, 1, and 2 to represent red, green, and blue, we cannot infer that the distance between blue and red is longer than that between green and red. So usually for a discrete property with $d$ possible values, we will convert them to $d$ binary properties with a value of 0 or 1 or map each possible value to a multidimensional vector. However, there is no this problem for CHAS, since CHAS itself is a binary attribute .
#### Normalization of attributes
Another fact that can be easily found is that the range of values of each dimensional attribute is largely different (as shown in Figure 2). For example, the value range of attribute B is [0.32, 396.90], and the value range of attribute NOX is [0.3850, 0.8170]. Here is a common operation - normalization. The goal of normalization is to scale the value of each attribute to a similar range, such as [-0.5, 0.5]. Here we use a very common operation method: subtract the mean and divide by the range of values.
There are at least three reasons for implementing normalization (or [Feature scaling](https://en.wikipedia.org/wiki/Feature_scaling)):
- A range of values that are too large or too small can cause floating value overflow or underflow during calculation.
- Different ranges of number result in different attributes being different for the model (at least in the initial period of training), and this implicit assumption is often unreasonable. This can make the optimization process difficult and the training time greatly longer.
- Many machine learning techniques/models (such as L1, L2 regular items, Vector Space Model) are based on the assumption that all attribute values are almost zero and their ranges of value are similar.
Figure 2. Value range of attributes for all dimensions
#### Organizing training set and testing set
We split the dataset into two parts: one is used to adjust the parameters of the model, that is, to train the model, the error of the model on this dataset is called ** training error **; the other is used to test.The error of the model on this dataset is called the ** test error**. The goal of our training model is to predict unknown new data by finding the regulation from the training data, so the test error is an better indicator for the performance of the model. When it comes to the ratio of the segmentation data, we should take into account two factors: more training data will reduce the square error of estimated parameters, resulting in a more reliable model; and more test data will reduce the square error of the test error, resulting in more credible test error. The split ratio set in our example is $8:2$
In a more complex model training process, we often need more than one dataset: the validation set. Because complex models often have some hyperparameters ([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization)) that need to be adjusted, we will try a combination of multiple hyperparameters to train multiple models separately and then compare their performance on the validation set to select the relatively best set of hyperparameters, and finally use the model with this set of parameters to evaluate the test error on the test set. Since the model trained in this chapter is relatively simple, we won't talk about this process at present.
## Training
`fit_a_line/train.py` demonstrates the overall process of training.
### Configuring the Data feeder
First we import the libraries:
```python
from __future__ import print_function
import paddle
import paddle.fluid as fluid
import numpy
import math
import sys
```
We introduced the dataset [UCI Housing dataset](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) via the uci_housing module
It is encapsulated in the uci_housing module:
1. The process of data download. The download data is saved in ~/.cache/paddle/dataset/uci_housing/housing.data.
2. The process of [data preprocessing](#data preprocessing).
Next we define the data feeder for training. The data feeder reads a batch of data in the size of `BATCH_SIZE` each time. If the user wants the data to be random, it can define data in size of a batch and a cache. In this case, each time the data feeder randomly reads as same data as the batch size from the cache.
```python
BATCH_SIZE = 20
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.uci_housing.train(), buf_size=500),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.uci_housing.test(), buf_size=500),
batch_size=BATCH_SIZE)
```
If you want to read data directly from \*.txt file, you can refer to the method as follows(need to prepare txt file by yourself).
```text
feature_names = [
'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'convert'
]
feature_num = len(feature_names)
data = numpy.fromfile(filename, sep=' ') # Read primary data from file
data = data.reshape(data.shape[0] // feature_num, feature_num)
maximums, minimums, avgs = data.max(axis=0), data.min(axis=0), data.sum(axis=0)/data.shape[0]
for i in six.moves.range(feature_num-1):
data[:, i] = (data[:, i] - avgs[i]) / (maximums[i] - minimums[i]) # six.moves is compatible to python2 and python3
ratio = 0.8 # distribution ratio of train dataset and verification dataset
offset = int(data.shape[0]\*ratio)
train_data = data[:offset]
test_data = data[offset:]
train_reader = paddle.batch(
paddle.reader.shuffle(
train_data, buf_size=500),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.reader.shuffle(
test_data, buf_size=500),
batch_size=BATCH_SIZE)
```
### Configure Program for Training
The aim of the program for training is to define a network structure of a training model. For linear regression, it is a simple fully connected layer from input to output. More complex result, such as Convolutional Neural Network and Recurrent Neural Network, will be introduced in later chapters. It must return `mean error` as the first return value in program for training, for that `mean error` will be used for BackPropagation.
```python
x = fluid.layers.data(name='x', shape=[13], dtype='float32') # define shape and data type of input
y = fluid.layers.data(name='y', shape=[1], dtype='float32') # define shape and data type of output
y_predict = fluid.layers.fc(input=x, size=1, act=None) # fully connected layer connecting input and output
main_program = fluid.default_main_program() # get default/global main function
startup_program = fluid.default_startup_program() # get default/global launch program
cost = fluid.layers.square_error_cost(input=y_predict, label=y) # use label and output predicted data to estimate square error
avg_loss = fluid.layers.mean(cost) # compute mean value for square error and get mean loss
```
For details, please refer to:
[fluid.default_main_program](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#default-main-program)
[fluid.default_startup_program](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#default-startup-program)
### Optimizer Function Configuration
`SGD optimizer`, `learning_rate` below are learning rate, which is related to rate of convergence for train of network.
```python
#Clone main_program to get test_program
# operations of some operators are different between train and test. For example, batch_norm use parameter for_test to determine whether the program is for training or for testing.
#The api will not delete any operator, please apply it before backward and optimization.
test_program = main_program.clone(for_test=True)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_loss)
```
### Define Training Place
We can define whether an operation runs on the CPU or on the GPU.
```python
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() # define the execution space of executor
###executor can accept input program and add data input operator and result fetch operator based on feed map and fetch list. Use close() to close executor and call run(...) to run the program.
exe = fluid.Executor(place)
```
For details, please refer to:
[fluid.executor](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#permalink-15-executor)
### Create Training Process
To train, it needs a train program and some parameters and creates a function to get test error in the process of train necessary parameters contain executor, program, reader, feeder, fetch_list, executor represents executor created before. Program created before represents program executed by executor. If the parameter is undefined, then it is defined default_main_program by default. Reader represents data read. Feeder represents forward input variable and fetch_list represents variable user wants to get or name.
```python
num_epochs = 100
def train_test(executor, program, reader, feeder, fetch_list):
accumulated = 1 * [0]
count = 0
for data_test in reader():
outs = executor.run(program=program,
feed=feeder.feed(data_test),
fetch_list=fetch_list)
accumulated = [x_c[0] + x_c[1][0] for x_c in zip(accumulated, outs)] # accumulate loss value in the process of test
count += 1 # accumulate samples in test dataset
return [x_d / count for x_d in accumulated] # compute mean loss
```
### Train Main Loop
give name of directory to be stored and initialize an executor
```python
%matplotlib inline
params_dirname = "fit_a_line.inference.model"
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
exe.run(startup_program)
train_prompt = "train cost"
test_prompt = "test cost"
from paddle.utils.plot import Ploter
plot_prompt = Ploter(train_prompt, test_prompt)
step = 0
exe_test = fluid.Executor(place)
```
Paddlepaddle provides reader mechanism to read training data. Reader provide multiple columns of data at one time. Therefore, we need a python list to read sequence. We create a loop to train until the result of train is good enough or time of loop is enough.
If the number of iterations for train is equal to the number of iterations for saving parameters, you can save train parameter into `params_dirname`.
Set main loop for training.
```python
for pass_id in range(num_epochs):
for data_train in train_reader():
avg_loss_value, = exe.run(main_program,
feed=feeder.feed(data_train),
fetch_list=[avg_loss])
if step % 10 == 0: # record and output train loss for every 10 batches.
plot_prompt.append(train_prompt, step, avg_loss_value[0])
plot_prompt.plot()
print("%s, Step %d, Cost %f" %
(train_prompt, step, avg_loss_value[0]))
if step % 100 == 0: # record and output test loss for every 100 batches.
test_metics = train_test(executor=exe_test,
program=test_program,
reader=test_reader,
fetch_list=[avg_loss.name],
feeder=feeder)
plot_prompt.append(test_prompt, step, test_metics[0])
plot_prompt.plot()
print("%s, Step %d, Cost %f" %
(test_prompt, step, test_metics[0]))
if test_metics[0] < 10.0: # If the accuracy is up to the requirement, the train can be stopped.
break
step += 1
if math.isnan(float(avg_loss_value[0])):
sys.exit("got NaN loss, training failed.")
#save train parameters into the path given before
if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, ['x'], [y_predict], exe)
```
## Predict
It needs to create trained parameters to run program for prediction. The trained parameters is in `params_dirname`.
### Prepare Environment for Prediction
Similar to the process of training, predictor needs a program for prediction. We can slightly modify our training program to include the prediction value.
```python
infer_exe = fluid.Executor(place)
inference_scope = fluid.core.Scope()
```
### Predict
Save pictures
```python
def save_result(points1, points2):
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
x1 = [idx for idx in range(len(points1))]
y1 = points1
y2 = points2
l1 = plt.plot(x1, y1, 'r--', label='predictions')
l2 = plt.plot(x1, y2, 'g--', label='GT')
plt.plot(x1, y1, 'ro-', x1, y2, 'g+-')
plt.title('predictions VS GT')
plt.legend()
plt.savefig('./image/prediction_gt.png')
```
Via fluid.io.load_inference_model, predictor will read well-trained model from `params_dirname` to predict unknown data.
```python
with fluid.scope_guard(inference_scope):
[inference_program, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(params_dirname, infer_exe) # load pre-predict model
batch_size = 10
infer_reader = paddle.batch(
paddle.dataset.uci_housing.test(), batch_size=batch_size) # prepare test dataset
infer_data = next(infer_reader())
infer_feat = numpy.array(
[data[0] for data in infer_data]).astype("float32") # extract data in test dataset
infer_label = numpy.array(
[data[1] for data in infer_data]).astype("float32") # extract label in test dataset
assert feed_target_names[0] == 'x'
results = infer_exe.run(inference_program,
feed={feed_target_names[0]: numpy.array(infer_feat)},
fetch_list=fetch_targets) # predict
#print predict result and label and visualize the result
print("infer results: (House Price)")
for idx, val in enumerate(results[0]):
print("%d: %.2f" % (idx, val)) # print predict result
print("\nground truth:")
for idx, val in enumerate(infer_label):
print("%d: %.2f" % (idx, val)) # print label
save_result(results[0], infer_label) # save picture
```
## Summary
In this chapter, we analyzed dataset of Boston House Price to introduce the basic concepts of linear regression model and how to use PaddlePaddle to implement training and testing. A number of models and theories are derived from linear regression model. Therefore, it is not unnecessary to figure out the principle and limitation of linear regression model.
## References
1. https://en.wikipedia.org/wiki/Linear_regression
2. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning[M]. Springer, Berlin: Springer series in statistics, 2001.
3. Murphy K P. Machine learning: a probabilistic perspective[M]. MIT press, 2012.
4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.
This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.