Commit a45248a4 authored by: S swtkiwi

fix lr 9.12

Parent 02b3916d
......@@ -18,7 +18,7 @@
"id": "6lPmRFntXYIp"
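},
"source": [
"# Brief Introduction\n",
"## Brief Introduction\n",
"The classic linear regression model is mainly used to predict datasets that exhibit a linear relationship. A regression model can be understood as the process of fitting a curve to the distribution of a set of points. If the fitted curve is a straight line, it is called linear regression; if it is a quadratic curve, it is called quadratic regression. Linear regression is the simplest kind of regression model. \n",
"This example briefly shows how to predict Boston housing prices with the open-source PaddlePaddle framework. The idea is to assume that the relationship between house attributes and house prices in the uci-housing dataset can be described by a linear combination of the attributes. During training, the error between the model's predictions and the true values is driven smaller and smaller. During prediction, the predictor loads the trained model and predicts prices for house attributes it has never seen."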
]
......@@ -30,7 +30,7 @@
"id": "OEOMtGXCZaRR"
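},
"source": [
"# Dataset Introduction\n",
"## Dataset Introduction\n",
"This example uses the uci-housing dataset, a classic dataset for linear regression. It contains 7084 values in total, which can be split into 506 rows of 14 columns each. The first 13 columns describe various attributes of the houses, and the last column is the median price of that type of house."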
]
},
......@@ -47,14 +47,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training Method 1\n"
"## Training Method 1\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environment Setup"
"### Environment Setup"
]
},
{
......@@ -90,7 +90,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Processing"
"### Data Processing"
]
},
{
......@@ -139,13 +139,13 @@
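],
"source": [
"# Plot the relationships between features, mainly the pairwise relationships between variables (linear or nonlinear, and whether any clear correlation exists)\n",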
"features_np = np.array([x[:13] for x in housing_data],np.float32)\n",
"labels_np = np.array([x[-1] for x in housing_data],np.float32)\n",
"data_np = np.c_[features_np,labels_np]\n",
"df = pd.DataFrame(data_np,columns=feature_names)\n",
"features_np = np.array([x[:13] for x in housing_data], np.float32)\n",
"labels_np = np.array([x[-1] for x in housing_data], np.float32)\n",
"data_np = np.c_[features_np, labels_np]\n",
"df = pd.DataFrame(data_np, columns=feature_names)\n",
"matplotlib.use('TkAgg')\n",
"%matplotlib inline\n",
"sns.pairplot(df.dropna(),y_vars=feature_names[-1],x_vars=feature_names[:])\n",
"sns.pairplot(df.dropna(), y_vars=feature_names[-1], x_vars=feature_names[:])\n",
"plt.show()"
]
},
......@@ -169,11 +169,11 @@
],
"source": [
"# Correlation analysis\n",
"fig, ax = plt.subplots(figsize=(15,1)) \n",
"fig, ax = plt.subplots(figsize=(15, 1)) \n",
"corr_data = df.corr().iloc[-1]\n",
"corr_data = np.asarray(corr_data).reshape(1,14)\n",
"ax = sns.heatmap(corr_data, cbar=True,annot=True)\n",
"plt.show()\n"
"corr_data = np.asarray(corr_data).reshape(1, 14)\n",
"ax = sns.heatmap(corr_data, cbar=True, annot=True)\n",
"plt.show()"
]
},
{
......@@ -183,13 +183,7 @@
"id": "IUhqen8LWAYM"
},
"source": [
"***Data normalization***\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***Data normalization***<br>\n",
"The figure below shows the distribution of each attribute's value range:"
]
},
......@@ -201,7 +195,7 @@
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a324e6910>"
"<matplotlib.axes._subplots.AxesSubplot at 0x1a3e2b4e50>"
]
},
"execution_count": 6,
......@@ -222,7 +216,7 @@
}
],
"source": [
"sns.boxplot(data=df.iloc[:,0:13])"
"sns.boxplot(data=df.iloc[:, 0:13])"
]
},
{
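......@@ -237,13 +231,10 @@
"metadata": {},
"source": [
"\n",
"There are at least 3 reasons to normalize (or apply feature scaling):\n",
"There are at least 2 reasons to normalize (or apply feature scaling):\n",
"\n",
"* Value ranges that are too large or too small can cause floating-point overflow or underflow during computation.\n",
"* Different value ranges give different attributes different importance to the model (at least in the early stage of training), and this implicit assumption is often unreasonable. This makes optimization harder and greatly lengthens training time.\n",
"\n",
"* Many machine learning techniques/models (for example L1 and L2 regularization terms, and the Vector Space Model) rely on the assumption that all attributes have roughly zero mean and similar value ranges.\n",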
"\n",
"\n"
]
},
......@@ -255,7 +246,7 @@
"source": [
"features_max = housing_data.max(axis=0)\n",
"features_min = housing_data.min(axis=0)\n",
"features_avg = housing_data.sum(axis=0) / 506"
"features_avg = housing_data.sum(axis=0) / housing_data.shape[0]"
]
},
{
......@@ -267,7 +258,7 @@
"BATCH_SIZE = 20\n",
"def feature_norm(input):\n",
" f_size = input.shape\n",
" output_features = np.zeros(f_size,np.float32)\n",
" output_features = np.zeros(f_size, np.float32)\n",
" for batch_id in range(f_size[0]):\n",
" for index in range(13):\n",
" output_features[batch_id][index] = (input[batch_id][index] - features_avg[index]) / (features_max[index] - features_min[index])\n",
......@@ -281,9 +272,9 @@
"outputs": [],
"source": [
"# Normalize only the attributes\n",
"housing_features = feature_norm(housing_data[:,:13])\n",
"housing_features = feature_norm(housing_data[:, :13])\n",
"# print(feature_trian.shape)\n",
"housing_data = np.c_[housing_features,housing_data[:,-1]].astype(np.float32)\n",
"housing_data = np.c_[housing_features, housing_data[:, -1]].astype(np.float32)\n",
"# print(training_data[0])"
]
},
......@@ -295,7 +286,7 @@
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a326f51d0>"
"<matplotlib.axes._subplots.AxesSubplot at 0x1a3e4cd4d0>"
]
},
"execution_count": 10,
......@@ -319,9 +310,9 @@
"# Take a look at each attribute of train_data after normalization\n",
"features_np = np.array([x[:13] for x in housing_data],np.float32)\n",
"labels_np = np.array([x[-1] for x in housing_data],np.float32)\n",
"data_np = np.c_[features_np,labels_np]\n",
"df = pd.DataFrame(data_np,columns=feature_names)\n",
"sns.boxplot(data=df.iloc[:,0:13])"
"data_np = np.c_[features_np, labels_np]\n",
"df = pd.DataFrame(data_np, columns=feature_names)\n",
"sns.boxplot(data=df.iloc[:, 0:13])"
]
},
{
......@@ -334,7 +325,7 @@
"ratio = 0.8\n",
"offset = int(housing_data.shape[0] * ratio)\n",
"train_data = housing_data[:offset]\n",
"test_data = housing_data[offset:]\n"
"test_data = housing_data[offset:]"
]
},
{
......@@ -344,7 +335,7 @@
"id": "JkEt541Cl0s8"
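},
"source": [
"# Model Configuration\n",
"### Model Configuration\n",
"Linear regression is simply a fully connected layer from input to output.\n",
"\n",
"For the Boston housing dataset, we assume that the relationship between the attributes and the house price can be described by a linear combination of the attributes."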
......@@ -362,10 +353,10 @@
"source": [
"class Regressor(paddle.nn.Layer):\n",
" def __init__(self):\n",
" super(Regressor,self).__init__()\n",
" self.fc = paddle.nn.Linear(13,1,None)\n",
" super(Regressor, self).__init__()\n",
" self.fc = paddle.nn.Linear(13, 1,)\n",
"\n",
" def forward(self,inputs):\n",
" def forward(self, inputs):\n",
" pred = self.fc(inputs)\n",
" return pred"
]
......@@ -383,17 +374,15 @@
"metadata": {},
"outputs": [],
"source": [
"iter = 0\n",
"iters = []\n",
"train_nums = []\n",
"train_costs = []\n",
"\n",
"def draw_train_process(iters,train_costs):\n",
" plt.title(\"training cost\" ,fontsize=24)\n",
"def draw_train_process(iters, train_costs):\n",
" plt.title(\"training cost\", fontsize=24)\n",
" plt.xlabel(\"iter\", fontsize=14)\n",
" plt.ylabel(\"cost\", fontsize=14)\n",
" plt.plot(iters, train_costs,color='red',label='training cost')\n",
" plt.show()\n",
" "
" plt.plot(iters, train_costs, color='red', label='training cost')\n",
" plt.show()"
]
},
{
......@@ -403,7 +392,7 @@
"id": "oxD989B_cBjF"
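},
"source": [
"# Model Training\n",
"### Model Training\n",
"The code below shows how the model is trained.\n",
"It uses the mean squared error (MSE), the most common loss function for linear regression, to measure the difference between the predicted and actual house prices.\n",
"The loss function is optimized with gradient descent."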
......@@ -427,20 +416,21 @@
"output_type": "stream",
"text": [
"start training ... \n",
"Pass:0,Cost:503.44180\n",
"Pass:50,Cost:79.73357\n",
"Pass:100,Cost:132.61421\n",
"Pass:150,Cost:9.58433\n",
"Pass:200,Cost:39.33120\n",
"Pass:250,Cost:17.30551\n",
"Pass:300,Cost:22.21836\n",
"Pass:350,Cost:55.45938\n",
"Pass:400,Cost:14.99360\n",
"Pass:450,Cost:36.95673\n"
"Pass:0,Cost:740.21814\n",
"Pass:50,Cost:36.40338\n",
"Pass:100,Cost:86.01823\n",
"Pass:150,Cost:50.86654\n",
"Pass:200,Cost:31.14208\n",
"Pass:250,Cost:20.54596\n",
"Pass:300,Cost:22.30817\n",
"Pass:350,Cost:24.18756\n",
"Pass:400,Cost:22.22965\n",
"Pass:450,Cost:39.25978\n"
]
}
],
"source": [
"import paddle.nn.functional as F \n",
"y_preds = []\n",
"labels_list = []\n",
"\n",
......@@ -449,39 +439,36 @@
"    # Switch the model to training mode\n",
"    model.train()\n",
"    EPOCH_NUM = 500\n",
"    iter = 0\n",
"    optimizer = paddle.optimizer.SGD(learning_rate = 0.001, parameters = model.parameters())\n",
"    train_num = 0\n",
"    optimizer = paddle.optimizer.SGD(learning_rate=0.001, parameters=model.parameters())\n",
"    for epoch_id in range(EPOCH_NUM):\n",
"        train_cost = 0\n",
"        # Before each epoch, randomly shuffle the training data\n",
"        np.random.shuffle(train_data)\n",
"        # Split the training data into batches of 20 samples each\n",
"        mini_batches = [train_data[k:k+BATCH_SIZE] for k in range(0, len(train_data), BATCH_SIZE)]\n",
"        for batch_id,data in enumerate(mini_batches):\n",
"            features_np = np.array(data[:,:13],np.float32)\n",
"            labels_np = np.array(data[:,-1:],np.float32)\n",
"        for batch_id, data in enumerate(mini_batches):\n",
"            features_np = np.array(data[:, :13], np.float32)\n",
"            labels_np = np.array(data[:, -1:], np.float32)\n",
"            features = paddle.to_tensor(features_np)\n",
"            labels = paddle.to_tensor(labels_np)\n",
"            # Forward pass\n",
"            y_pred = model(features)\n",
"            cost = paddle.nn.functional.mse_loss(y_pred,label=labels)\n",
"            train_cost = [cost.numpy()]\n",
"            cost = F.mse_loss(y_pred, label=labels)\n",
"            train_cost = cost.numpy()[0]\n",
"            # Backward pass\n",
"            cost.backward()\n",
"            # Minimize the loss and update the parameters\n",
"            optimizer.step()\n",
"            # Clear the gradients\n",
"            optimizer.clear_grad()\n",
" \n",
" if batch_id%30 == 0 and epoch_id%50 == 0:\n",
" print(\"Pass:%d,Cost:%0.5f\"%(epoch_id,train_cost[0][0]))\n",
" print(\"Pass:%d,Cost:%0.5f\"%(epoch_id, train_cost))\n",
"\n",
" iter = iter + BATCH_SIZE\n",
" iters.append(iter)\n",
" train_costs.append(train_cost[0][0])\n",
" train_num = train_num + BATCH_SIZE\n",
" train_nums.append(train_num)\n",
" train_costs.append(train_cost)\n",
" \n",
" \n",
" \n",
"\n",
"model = Regressor()\n",
"train(model)"
]
......@@ -493,7 +480,7 @@
"outputs": [
{
"data": {
"image/png": "\n",
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
......@@ -507,7 +494,7 @@
"source": [
"matplotlib.use('TkAgg')\n",
"%matplotlib inline\n",
"draw_train_process(iters,train_costs)"
"draw_train_process(train_nums, train_costs)"
]
},
{
......@@ -524,7 +511,7 @@
"id": "YC73FnkakWbY"
},
"source": [
"# Model Prediction\n"
"### Model Prediction\n"
]
},
{
......@@ -536,17 +523,17 @@
"name": "stdout",
"output_type": "stream",
"text": [
"No.0: infer result is 11.88,ground truth is 8.50\n",
"No.10: infer result is 5.40,ground truth is 7.00\n",
"No.20: infer result is 14.77,ground truth is 11.70\n",
"No.30: infer result is 16.40,ground truth is 11.70\n",
"No.40: infer result is 13.51,ground truth is 10.80\n",
"No.50: infer result is 15.88,ground truth is 14.90\n",
"No.60: infer result is 18.62,ground truth is 21.40\n",
"No.70: infer result is 15.36,ground truth is 13.80\n",
"No.80: infer result is 18.03,ground truth is 20.60\n",
"No.90: infer result is 21.45,ground truth is 24.50\n",
"平均误差为: 12.579632779346012\n"
"No.0: infer result is 12.15,ground truth is 8.50\n",
"No.10: infer result is 5.21,ground truth is 7.00\n",
"No.20: infer result is 14.32,ground truth is 11.70\n",
"No.30: infer result is 16.11,ground truth is 11.70\n",
"No.40: infer result is 13.42,ground truth is 10.80\n",
"No.50: infer result is 15.50,ground truth is 14.90\n",
"No.60: infer result is 18.81,ground truth is 21.40\n",
"No.70: infer result is 15.42,ground truth is 13.80\n",
"No.80: infer result is 18.16,ground truth is 20.60\n",
"No.90: infer result is 21.48,ground truth is 24.50\n",
"Mean loss is: [12.195988]\n"
]
}
],
......@@ -556,18 +543,21 @@
"\n",
"infer_features_np = np.array([data[:13] for data in test_data]).astype(\"float32\")\n",
"infer_labels_np = np.array([data[-1] for data in test_data]).astype(\"float32\")\n",
"\n",
"infer_features = paddle.to_tensor(infer_features_np)\n",
"fetch_list = model(infer_features).numpy()\n",
"infer_labels = paddle.to_tensor(infer_labels_np)\n",
"fetch_list = model(infer_features)\n",
"\n",
"sum_cost = 0\n",
"for i in range(INFER_BATCH_SIZE):\n",
" infer_result = fetch_list[i][0]\n",
" ground_truth = infer_labels_np[i]\n",
" ground_truth = infer_labels[i]\n",
" if i % 10 == 0:\n",
" print(\"No.%d: infer result is %.2f,ground truth is %.2f\" % (i, infer_result,ground_truth))\n",
" cost = np.power(infer_result-ground_truth,2)\n",
" print(\"No.%d: infer result is %.2f,ground truth is %.2f\" % (i, infer_result, ground_truth))\n",
" cost = paddle.pow(infer_result - ground_truth, 2)\n",
" sum_cost += cost\n",
"print(\"平均误差为:\",sum_cost / INFER_BATCH_SIZE)"
"mean_loss = sum_cost / INFER_BATCH_SIZE\n",
"print(\"Mean loss is:\", mean_loss.numpy())"
]
},
{
......@@ -578,7 +568,7 @@
"source": [
"def plot_pred_ground(pred, ground):\n",
"    plt.figure()   \n",
"    plt.title(\"Predication v.s. Ground\", fontsize=24)\n",
"    plt.title(\"Predication v.s. Ground truth\", fontsize=24)\n",
"    plt.xlabel(\"ground truth price(unit:$1000)\", fontsize=14)\n",
"    plt.ylabel(\"predict price\", fontsize=14)\n",
"    plt.scatter(ground, pred, alpha=0.5)  # scatter: scatter plot; alpha: \"transparency\"\n",
......@@ -593,7 +583,7 @@
"outputs": [
{
"data": {
"image/png": "\n",
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
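......@@ -619,7 +609,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training Method 2\n",
"## Training Method 2\n",
"We can also train the linear regression with the high-level API, which is more concise and convenient than the low-level API."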
]
},
......@@ -633,59 +623,39 @@
"output_type": "stream",
"text": [
"Epoch 1/5\n",
"step 10/51 - loss: 459.0659 - 2ms/step\n",
"step 20/51 - loss: 529.2217 - 2ms/step\n",
"step 30/51 - loss: 632.7692 - 2ms/step\n",
"step 40/51 - loss: 611.4449 - 2ms/step\n",
"step 50/51 - loss: 787.7990 - 2ms/step\n",
"step 51/51 - loss: 616.6230 - 2ms/step\n",
"step 20/51 - loss: 520.8663 - 1ms/step\n",
"step 40/51 - loss: 611.7135 - 1ms/step\n",
"step 51/51 - loss: 620.0662 - 1ms/step\n",
"Eval begin...\n",
"step 10/13 - loss: 412.7979 - 845us/step\n",
"step 13/13 - loss: 394.4999 - 962us/step\n",
"step 13/13 - loss: 389.7871 - 1ms/step\n",
"Eval samples: 102\n",
"Epoch 2/5\n",
"step 10/51 - loss: 498.4369 - 2ms/step\n",
"step 20/51 - loss: 872.9701 - 1ms/step\n",
"step 30/51 - loss: 660.2790 - 2ms/step\n",
"step 40/51 - loss: 1086.9590 - 2ms/step\n",
"step 50/51 - loss: 569.2678 - 3ms/step\n",
"step 51/51 - loss: 416.6243 - 3ms/step\n",
"step 20/51 - loss: 867.4678 - 3ms/step\n",
"step 40/51 - loss: 1081.1701 - 2ms/step\n",
"step 51/51 - loss: 420.8705 - 2ms/step\n",
"Eval begin...\n",
"step 10/13 - loss: 413.6576 - 3ms/step\n",
"step 13/13 - loss: 391.9444 - 3ms/step\n",
"step 13/13 - loss: 387.2432 - 1ms/step\n",
"Eval samples: 102\n",
"Epoch 3/5\n",
"step 10/51 - loss: 639.1314 - 2ms/step\n",
"step 20/51 - loss: 839.7043 - 1ms/step\n",
"step 30/51 - loss: 658.3038 - 1ms/step\n",
"step 40/51 - loss: 855.3226 - 1ms/step\n",
"step 50/51 - loss: 863.4664 - 1ms/step\n",
"step 51/51 - loss: 415.3571 - 1ms/step\n",
"step 20/51 - loss: 810.1555 - 2ms/step\n",
"step 40/51 - loss: 840.3570 - 2ms/step\n",
"step 51/51 - loss: 421.0806 - 2ms/step\n",
"Eval begin...\n",
"step 10/13 - loss: 414.4321 - 868us/step\n",
"step 13/13 - loss: 389.4324 - 892us/step\n",
"step 13/13 - loss: 384.7417 - 693us/step\n",
"Eval samples: 102\n",
"Epoch 4/5\n",
"step 10/51 - loss: 660.5611 - 1ms/step\n",
"step 20/51 - loss: 649.4131 - 1ms/step\n",
"step 30/51 - loss: 578.6218 - 1ms/step\n",
"step 40/51 - loss: 697.6048 - 1ms/step\n",
"step 50/51 - loss: 784.4253 - 1ms/step\n",
"step 51/51 - loss: 423.0613 - 1ms/step\n",
"step 20/51 - loss: 647.1215 - 1ms/step\n",
"step 40/51 - loss: 682.9673 - 1ms/step\n",
"step 51/51 - loss: 422.0570 - 1ms/step\n",
"Eval begin...\n",
"step 10/13 - loss: 415.2260 - 598us/step\n",
"step 13/13 - loss: 386.9349 - 702us/step\n",
"step 13/13 - loss: 382.2546 - 591us/step\n",
"Eval samples: 102\n",
"Epoch 5/5\n",
"step 10/51 - loss: 1080.4787 - 2ms/step\n",
"step 20/51 - loss: 726.1576 - 2ms/step\n",
"step 30/51 - loss: 873.2540 - 1ms/step\n",
"step 40/51 - loss: 566.3094 - 1ms/step\n",
"step 50/51 - loss: 578.0419 - 1ms/step\n",
"step 51/51 - loss: 459.7528 - 1ms/step\n",
"step 20/51 - loss: 713.3719 - 1ms/step\n",
"step 40/51 - loss: 567.0962 - 1ms/step\n",
"step 51/51 - loss: 456.8702 - 1ms/step\n",
"Eval begin...\n",
"step 10/13 - loss: 415.9169 - 707us/step\n",
"step 13/13 - loss: 384.4219 - 805us/step\n",
"step 13/13 - loss: 379.7527 - 985us/step\n",
"Eval samples: 102\n"
]
}
......@@ -707,13 +677,14 @@
"        self.fc = paddle.nn.Linear(13, 1, None)\n",
"\n",
"    def forward(self, input):\n",
"        return self.fc(input)\n",
"        pred = self.fc(input)\n",
"        return pred\n",
"\n",
"#step3: train the model\n",
"model = paddle.Model(UCIHousing())\n",
"model.prepare(paddle.optimizer.Adam(parameters=model.parameters()),\n",
" paddle.nn.loss.MSELoss())\n",
"model.fit(train_dataset, eval_dataset, epochs=5, batch_size=8)"
"model.fit(train_dataset, eval_dataset, epochs=5, batch_size=8, log_freq=20)"
]
}
],
......
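Review note on the `feature_norm` hunk: the nested Python loops subtract the column mean and divide by the column range (max minus min) one element at a time. A minimal sketch of the same mean normalization using NumPy broadcasting is shown below. The function name `feature_norm_vectorized`, the `num_features` parameter, the zero-range guard, and the random demo array are illustrative assumptions, not part of this commit.

```python
import numpy as np


def feature_norm_vectorized(data, num_features=13):
    """Mean-normalize the first `num_features` columns of a 2-D array.

    Hypothetical helper: it mirrors the per-element formula in the
    notebook's feature_norm, (x - avg) / (max - min), column by column.
    """
    features = np.asarray(data)[:, :num_features].astype(np.float32)
    avg = features.mean(axis=0)
    value_range = features.max(axis=0) - features.min(axis=0)
    # Assumed safeguard (not in the notebook): avoid dividing by zero
    # when a column is constant.
    value_range = np.where(value_range == 0, 1.0, value_range)
    # Broadcasting applies the per-column statistics to every row.
    return (features - avg) / value_range


# Demo on random data shaped like uci-housing (506 rows, 14 columns).
rng = np.random.default_rng(0)
demo = rng.uniform(0.0, 100.0, size=(506, 14)).astype(np.float32)
normalized = feature_norm_vectorized(demo)
```

After this transform each column has a mean of roughly zero and a spread (max minus min) of one, which is the property the normalization cell in the notebook relies on.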