Commit f86e2418, authored by wizardforcel

2021-01-15 22:42:17

Parent 9ad21775
......@@ -433,19 +433,24 @@ The EDA process is useful because it helps developers discover, for defining the course of
1. Open a Jupyter Notebook to implement this exercise.
2. Import the pandas library:
```py
import pandas as pd
```
3. Use pandas to read the CSV file containing the dataset we downloaded from the UC Irvine Machine Learning Repository site.
Next, drop the column named `date`, since we do not want to consider it in the following exercises:
```py
data = pd.read_csv("energydata_complete.csv")
data = data.drop(columns=["date"])
```
Finally, print the head of the DataFrame:
```py
data.head()
```
The output should look as follows:
......@@ -455,11 +460,11 @@ The EDA process is useful because it helps developers discover, for defining the course of
4. Check for categorical features in your dataset:
```py
cols = data.columns
num_cols = data._get_numeric_data().columns
list(set(cols) - set(num_cols))
```
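The first line generates a list of all the columns in the dataset. Next, the columns that contain numeric values are stored in a variable as well. Finally, subtracting the numeric columns from the full list of columns yields the non-numeric ones.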
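The same check can also be written with the public pandas `select_dtypes` method rather than the private `_get_numeric_data` helper. The following is a minimal sketch on a small made-up DataFrame; the `toy` frame and its `room` column are assumptions for illustration and are not part of the energy dataset:

```py
import pandas as pd

# Toy DataFrame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "Appliances": [60, 50, 70],
    "T1": [19.9, 20.1, 20.0],
    "room": ["kitchen", "living", "kitchen"],
})

# Keep only the columns whose dtype is not numeric.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

An empty list would indicate that every column is numeric, so no categorical features need to be encoded.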
......@@ -467,7 +472,9 @@ The EDA process is useful because it helps developers discover, for defining the course of
5. Use the pandas **isnull()** and **sum()** methods to find out whether there are any missing values in each column of the dataset:
```py
data.isnull().sum()
```
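This command counts the number of null values in each column. The dataset in use should have no missing values, as shown here:
......@@ -477,31 +484,21 @@ The EDA process is useful because it helps developers discover, for defining the course of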
6. Use three standard deviations as the measure to detect any outliers for all the features in the dataset:
```py
outliers = {}
for i in range(data.shape[1]):
    min_t = data[data.columns[i]].mean() \
            - (3 * data[data.columns[i]].std())
    max_t = data[data.columns[i]].mean() \
            + (3 * data[data.columns[i]].std())
    count = 0
    for j in data[data.columns[i]]:
        if j < min_t or j > max_t:
            count += 1
    percentage = count / data.shape[0]
    outliers[data.columns[i]] = "%.3f" % percentage
outliers
```
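The preceding snippet loops through the columns of the dataset to evaluate the presence of outliers in each one. For each column, it computes the minimum and maximum thresholds and then counts the number of instances that fall outside the range between them.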
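The same three-standard-deviation rule can be expressed without explicit loops. This is a minimal sketch under assumptions: the `toy` DataFrame, its column names, and the planted outlier are made up for illustration, and the result is a pandas Series of fractions rather than the formatted strings used above:

```py
import numpy as np
import pandas as pd

# Synthetic data standing in for the real dataset (hypothetical).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=1000),
                    "b": rng.normal(size=1000)})
toy.loc[0, "a"] = 50.0  # plant one obvious outlier

# Per-column mean and standard deviation (pandas uses ddof=1 by default).
mean = toy.mean()
std = toy.std()

# Boolean mask: True where a value lies outside mean +/- 3 * std.
outside = (toy < mean - 3 * std) | (toy > mean + 3 * std)

# The mean of a boolean column is the fraction of True values.
outlier_fraction = outside.mean()
print(outlier_fraction.round(3))
```

Comparing a DataFrame with a Series aligns the Series index with the column names, so each column is checked against its own thresholds.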
......@@ -561,17 +558,19 @@ The EDA process is useful because it helps developers discover, for defining the course of
1. Separate the features from the target. We are only doing this to rescale the features data:
```py
X = data.iloc[:, 1:]
Y = data.iloc[:, 0]
```
The preceding snippet takes the data and uses slicing to separate the features from the target.
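Position-based slicing with `iloc` is what makes this work when the columns are named with strings: `iloc[:, 0]` selects the first column and `iloc[:, 1:]` selects everything after it, regardless of the labels. A minimal sketch on a made-up frame (the `toy` DataFrame and its column names are assumptions for illustration):

```py
import pandas as pd

# Hypothetical frame: the target is the first column, features follow.
toy = pd.DataFrame({"target": [1, 2], "f1": [0.1, 0.2], "f2": [5, 6]})

X_toy = toy.iloc[:, 1:]  # all rows, columns from position 1 onward
Y_toy = toy.iloc[:, 0]   # all rows, column at position 0 (a Series)

print(list(X_toy.columns), Y_toy.name)
```

Label-based `loc` would instead look for a column labeled `1`, which does not exist when the column labels are strings.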
2. Rescale the features data by using the normalization methodology. Display the head (that is, the top five instances) of the resulting DataFrame to verify the result:
```py
X = (X - X.min()) / (X.max() - X.min())
X.head()
```
The output should look as follows:
......@@ -610,59 +609,62 @@ The EDA process is useful because it helps developers discover, for defining the course of
1. Print the shape of the dataset in order to determine the split ratio to be used:
```py
X.shape
```
The output of this operation should be `(19735, 27)`. This means that a 60:20:20 split ratio can be used for the training, validation, and testing sets.
2. Get the value that you will use as the upper bound of the training and validation sets. This will be used to split the dataset using indexing:
```py
train_end = int(len(X) * 0.6)
dev_end = int(len(X) * 0.8)
```
The preceding code determines the indices of the instances that will be used to divide the dataset through slicing.
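As a worked example of that arithmetic, the sketch below plugs in the 19,735 instances reported by the shape check; the `n_instances` and `sizes` names are introduced here for illustration only:

```py
# Number of instances reported by the shape check above.
n_instances = 19735

train_end = int(n_instances * 0.6)  # upper bound of the training slice
dev_end = int(n_instances * 0.8)    # upper bound of the validation slice

# Sizes of the training, validation, and testing slices, in that order.
sizes = (train_end, dev_end - train_end, n_instances - dev_end)
print(train_end, dev_end, sizes)
```

Because `int()` truncates, the three sizes always add up to the total number of instances, even when the ratios do not divide it evenly.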
3. Shuffle the dataset:
```py
X_shuffle = X.sample(frac=1, random_state=0)
Y_shuffle = Y.sample(frac=1, random_state=0)
```
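Using the pandas `sample` function, it is possible to shuffle the elements in the features and target matrices. By setting `frac` to 1, we ensure that all the instances are shuffled and returned in the output of the function. By passing the same `random_state` value to both calls, we ensure that both datasets are shuffled in the same order, so each row of features stays paired with its target.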
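The alignment between features and targets can be checked directly by comparing the shuffled indices. A minimal sketch, assuming small made-up `X_toy` and `Y_toy` objects in place of the real data:

```py
import pandas as pd

# Hypothetical features and target that share the same index 0..9.
X_toy = pd.DataFrame({"f1": range(10)})
Y_toy = pd.Series(range(10), name="target")

# Same length and same seed give the same permutation for both objects.
X_s = X_toy.sample(frac=1, random_state=0)
Y_s = Y_toy.sample(frac=1, random_state=0)

# True when row i of X_s and element i of Y_s come from the same instance.
print(X_s.index.equals(Y_s.index))
```

If the two calls used different seeds, the indices would generally differ and the features would no longer line up with their targets.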
4. Use indexing to split the shuffled dataset into the three sets for both the features and the target data:
```py
x_train = X_shuffle.iloc[:train_end,:]
y_train = Y_shuffle.iloc[:train_end]
x_dev = X_shuffle.iloc[train_end:dev_end,:]
y_dev = Y_shuffle.iloc[train_end:dev_end]
x_test = X_shuffle.iloc[dev_end:,:]
y_test = Y_shuffle.iloc[dev_end:]
```
5. Print the shapes of all three sets:
```py
print(x_train.shape, y_train.shape)
print(x_dev.shape, y_dev.shape)
print(x_test.shape, y_test.shape)
```
The result of the preceding operations should be:
```py
(11841, 27) (11841,)
(3947, 27) (3947,)
(3947, 27) (3947,)
```
6. Import the **train_test_split()** function from scikit-learn's **model_selection** module:
```py
from sklearn.model_selection import train_test_split
```
Note
......@@ -670,23 +672,17 @@ The EDA process is useful because it helps developers discover, for defining the course of
7. Split the shuffled dataset:
```py
x_new, x_test_2, \
y_new, y_test_2 = train_test_split(X_shuffle, Y_shuffle, \
                                   test_size=0.2, \
                                   random_state=0)
dev_per = x_test_2.shape[0]/x_new.shape[0]
x_train_2, x_dev_2, \
y_train_2, y_dev_2 = train_test_split(x_new, y_new, \
                                      test_size=dev_per, \
                                      random_state=0)
```
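The first line of code performs an initial split. The function takes the following as arguments:
......@@ -704,19 +700,19 @@ The EDA process is useful because it helps developers discover, for defining the course of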
8. Print the shape of all three sets:
```py
print(x_train_2.shape, y_train_2.shape)
print(x_dev_2.shape, y_dev_2.shape)
print(x_test_2.shape, y_test_2.shape)
```
The result of the preceding operations should be:
```py
(11841, 27) (11841,)
(3947, 27) (3947,)
(3947, 27) (3947,)
```
We can see that the resulting sets from both approaches have the same shapes. Using one approach or the other is a matter of preference.
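One way to see why the two-step split lands on the same sizes is to work through the `dev_per` arithmetic with the counts reported above. This is a sketch of the bookkeeping only, not of how scikit-learn computes its splits internally; the `n_total`, `n_test`, and `n_new` names are introduced here for illustration:

```py
n_total = 19735           # instances in the shuffled dataset
n_test = 3947             # size of x_test_2 after the first 20% split
n_new = n_total - n_test  # instances left for training and validation

# Fraction of the remaining data that must go to the validation set
# so that it ends up the same size as the testing set.
dev_per = n_test / n_new

n_dev = round(n_new * dev_per)
n_train = n_new - n_dev
print(dev_per, (n_train, n_dev, n_test))
```

Taking 20% of the full dataset for testing leaves 80%, so the validation set has to be a quarter of what remains for the overall ratio to stay at 60:20:20.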
......@@ -806,9 +802,10 @@ PyTorch was built taking into account the opinions of many developers in the field, with the advantage that it can
1. Import the PyTorch library, called **torch**, as well as the **nn** module from PyTorch:
```py
import torch
import torch.nn as nn
```
Note
......@@ -816,61 +813,53 @@ PyTorch was built taking into account the opinions of many developers in the field, with the advantage that it can
2. Separate the feature columns from the target for each of the sets we created in the previous exercise. Additionally, convert the final DataFrames into tensors:
```py
x_train = torch.tensor(x_train.values).float()
y_train = torch.tensor(y_train.values).float()
x_dev = torch.tensor(x_dev.values).float()
y_dev = torch.tensor(y_dev.values).float()
x_test = torch.tensor(x_test.values).float()
y_test = torch.tensor(y_test.values).float()
```
3. Define the network architecture using the **nn.Sequential** container. Make sure to create a four-layer network. Use ReLU activation functions for the first three layers and leave the last layer without an activation function, considering the fact that we are dealing with a regression problem.
The number of units for each layer should be 100, 50, 25, and 1:
```py
model = nn.Sequential(nn.Linear(x_train.shape[1], 100), \
                      nn.ReLU(), \
                      nn.Linear(100, 50), \
                      nn.ReLU(), \
                      nn.Linear(50, 25), \
                      nn.ReLU(), \
                      nn.Linear(25, 1))
```
4. Define the loss function as the MSE:
```py
loss_function = torch.nn.MSELoss()
```
5. Define the optimizer algorithm as the Adam optimizer:
```py
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```
6. Use a **for** loop to train the network over the training data for 1,000 iteration steps:
```py
for i in range(1000):
    y_pred = model(x_train).squeeze()
    loss = loss_function(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i%100 == 0:
        print(i, loss.item())
```
Note
......@@ -888,15 +877,17 @@ PyTorch was built taking into account the opinions of many developers in the field, with the advantage that it can
7. To test the model, perform a prediction on the first instance of the testing set and compare it with the ground truth (target value):
```py
pred = model(x_test[0])
print("Ground truth:", y_test[0].item(), \
      "Prediction:",pred.item())
```
The output should look similar to the following:
```py
Ground truth: 60.0 Prediction: 69.5818099975586
```
As you can see, the predicted value (`69.58`) is reasonably close to the ground truth value (`60`), differing from it by about 9.58.
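To put this single prediction in context, the sketch below computes its relative error and its squared error, which is the quantity that the MSE loss averages over all instances. The variable names are introduced here for illustration, and one instance says little about overall model quality:

```py
ground_truth = 60.0
prediction = 69.5818099975586  # value printed in the output above

abs_error = abs(prediction - ground_truth)
rel_error = abs_error / ground_truth  # error relative to the true value
squared_error = abs_error ** 2        # this instance's term in the MSE

print(round(abs_error, 2), round(rel_error, 3), round(squared_error, 1))
```

A fuller evaluation would average the squared error over the whole testing set instead of inspecting one instance.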
......