2021-01-15 22:42:17

f86e2418 · wizardforcel · 9ad21775 · f86e2418

Showing 132 additions and 141 deletions

new/dl-pt-workshop/2.md +132 -141

--- a/new/dl-pt-workshop/2.md
+++ b/new/dl-pt-workshop/2.md
@@ -433,19 +433,24 @@ EDA 流程很有用，因为它有助于开发人员发现对于定义操作过
 1.  Open a Jupyter notebook to implement this exercise.
 2.  Import the pandas library:

-    将熊猫作为 pd 导入
+    ```py
+    import pandas as pd
+    ```

 3.  Use pandas to read the CSV file containing the dataset we downloaded from the UC Irvine Machine Learning Repository site.

    Next, drop the column named `date`, since we do not want to consider it in the following exercises:

-    数据= pd.read_csv（“ energydata_complete.csv”）
-
-    数据= data.drop（列= [“日期”]）
+    ```py
+    data = pd.read_csv("energydata_complete.csv")
+    data = data.drop(columns=["date"])
+    ```

    Finally, print the head of the DataFrame:

-    data.head（）
+    ```py
+    data.head()
+    ```

    The output should look as follows:

@@ -455,11 +460,11 @@ EDA 流程很有用，因为它有助于开发人员发现对于定义操作过

 4.  Check for categorical features in your dataset:

+    ```py
    cols = data.columns
-
-    num_cols = data._get_numeric_data（）。列
-
-    清单（设定（cols）-设定（num_cols））
+    num_cols = data._get_numeric_data().columns
+    list(set(cols) - set(num_cols))
+    ```

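    The first line generates a list of all the columns in the dataset. Next, the columns that contain numeric values are stored in a second variable. Finally, subtracting the numeric columns from the full list of columns yields the non-numeric ones.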
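
`_get_numeric_data()` is a private pandas method (note the leading underscore), so its behavior may change between releases. A minimal sketch of the same check using the public `select_dtypes` method, run on a small made-up DataFrame (the `toy` frame and its `room` column are illustrative assumptions and not part of the energy dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Appliances": [60, 50, 70],
    "T1": [19.9, 19.8, 20.1],
    "room": ["kitchen", "living", "kitchen"],
})

# Columns whose dtype is not numeric are candidates for categorical features.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

An empty list here would mean that every column is numeric, which is the situation the exercise expects for the energy dataset.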

@@ -467,7 +472,9 @@ EDA 流程很有用，因为它有助于开发人员发现对于定义操作过

 5.  Use pandas' **isnull()** and **sum()** methods to find out whether there are any missing values in each column of the dataset:

-    data.isnull（）。sum（）
+    ```py
+    data.isnull().sum()
+    ```

    This command counts the number of null values in each column. The dataset in use should have no missing values, as shown here:

@@ -477,31 +484,21 @@ EDA 流程很有用，因为它有助于开发人员发现对于定义操作过

 6.  Use three standard deviations as the measure to detect any outliers for all the features in the dataset:

-    离群值= {}
-
-    对于范围内的 i（data.shape [1]）：
-
-    min_t = data [data.columns [i]]。mean（）\
-
-    -（3 * data [data.columns [i]]。std（））
-
-    max_t = data [data.columns [i]]。mean（）\
-
-    +（3 * data [data.columns [i]]。std（））
-
-    计数= 0
-
-    对于 data [data.columns [i]]中的 j：
-
-    如果 j < min_t or j > max_t：
-
-    计数+ = 1
-
-    百分比=计数/数据.shape [0]
-
-    离群值[data.columns [i]] =“% .3f”% 百分比
-
-    离群值
+    ```py
+    outliers = {}
+    for i in range(data.shape[1]):
+        min_t = data[data.columns[i]].mean() \
+                - (3 * data[data.columns[i]].std())
+        max_t = data[data.columns[i]].mean() \
+                + (3 * data[data.columns[i]].std())
+        count = 0
+        for j in data[data.columns[i]]:
+            if j < min_t or j > max_t:
+                count += 1
+        percentage = count / data.shape[0]
+        outliers[data.columns[i]] = "%.3f" % percentage
+    outliers
+    ```

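    The preceding snippet loops through the columns of the dataset to check each one for outliers. For each column, it computes the minimum and maximum thresholds and counts the instances that fall outside the range between them.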
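
The same three-standard-deviation check can also be written without explicit Python loops. The following is a sketch only, on made-up data: the helper name `outlier_fraction` and the `toy` frame are assumptions introduced for illustration, and the result is a Series of floats rather than the formatted strings used in the exercise.

```python
import pandas as pd

def outlier_fraction(df: pd.DataFrame, n_std: float = 3.0) -> pd.Series:
    """Return, per column, the fraction of values beyond n_std standard deviations from the mean."""
    lower = df.mean() - n_std * df.std()
    upper = df.mean() + n_std * df.std()
    # Comparing a DataFrame with a Series aligns the Series index with the columns.
    mask = (df < lower) | (df > upper)
    # The mean of a boolean column is the share of True values.
    return mask.mean()

# Hypothetical data: column "a" has one extreme value, column "b" has none.
toy = pd.DataFrame({"a": [1.0] * 19 + [100.0], "b": list(range(20))})
fractions = outlier_fraction(toy)
print(fractions.round(3))
```

In this toy frame, one of the twenty values in column `a` lies above the upper threshold, so its fraction is 0.05, while column `b` has no outliers.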

@@ -561,17 +558,19 @@ EDA 流程很有用，因为它有助于开发人员发现对于定义操作过

 1.  Separate the features from the target. We are only doing this to rescale the features data:

-    X = data.loc [：，1：]
-
-    Y = data.iloc [：，0]
+    ```py
+    X = data.iloc[:, 1:]
+    Y = data.iloc[:, 0]
+    ```

    The preceding snippet uses slicing to separate the features from the target.

 2.  Rescale the features data by using the normalization methodology. Display the head (that is, the top five instances) of the resulting DataFrame to verify the result:

-    X =（X-X.min（））/（X.max（）-X.min（））
-
-    X.head（）
+    ```py
+    X = (X - X.min()) / (X.max() - X.min())
+    X.head()
+    ```

    The output should look as follows:

@@ -610,59 +609,62 @@ EDA 流程很有用，因为它有助于开发人员发现对于定义操作过

 1.  Print the shape of the dataset in order to determine the split ratio to be used:

-    形状
+    ```py
+    X.shape
+    ```

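    The output of this operation should be `(19735, 27)`. With this many instances, a 60:20:20 split ratio can be used for the training, validation, and testing sets.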

 2.  Get the value that you will use as the upper bound of the training and validation sets. This will be used to split the dataset using indexing:

-    train_end = int（len（X）* 0.6）
-
-    dev_end = int（len（X）* 0.8）
+    ```py
+    train_end = int(len(X) * 0.6)
+    dev_end = int(len(X) * 0.8)
+    ```

    The preceding code determines the indices that will be used to divide the dataset through slicing.

 3.  Shuffle the dataset:

-    X_shuffle = X.sample（frac = 1，random_state = 0）
-
-    Y_shuffle = Y.sample（frac = 1，random_state = 0）
+    ```py
+    X_shuffle = X.sample(frac=1, random_state=0)
+    Y_shuffle = Y.sample(frac=1, random_state=0)
+    ```

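    Using the pandas `sample` function, the rows of the features and target matrices can be shuffled. Setting `frac` to 1 ensures that all instances are shuffled and returned by the function. Passing the same `random_state` to both calls ensures that both datasets are shuffled identically, so each row of features stays paired with its target.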

 4.  Use indexing to split the shuffled dataset into the three sets for both the features and the target data:

-    x_train = X_shuffle.iloc [：train_end ,:]
-
-    y_train = Y_shuffle.iloc [：train_end]
-
-    x_dev = X_shuffle.iloc [train_end：dev_end ,:]
-
-    y_dev = Y_shuffle.iloc [train_end：dev_end]
-
-    x_test = X_shuffle.iloc [dev_end：，：]
-
-    y_test = Y_shuffle.iloc [dev_end：]
+    ```py
+    x_train = X_shuffle.iloc[:train_end,:]
+    y_train = Y_shuffle.iloc[:train_end]
+    x_dev = X_shuffle.iloc[train_end:dev_end,:]
+    y_dev = Y_shuffle.iloc[train_end:dev_end]
+    x_test = X_shuffle.iloc[dev_end:,:]
+    y_test = Y_shuffle.iloc[dev_end:]
+    ```

 5.  Print the shapes of all three sets:

-    打印（x_train.shape，y_train.shape）
-
-    打印（x_dev.shape，y_dev.shape）
-
-    打印（x_test.shape，y_test.shape）
+    ```py
+    print(x_train.shape, y_train.shape)
+    print(x_dev.shape, y_dev.shape)
+    print(x_test.shape, y_test.shape)
+    ```

    The result of the preceding operation should be as follows:

+    ```py
    (11841, 27) (11841,)
-
    (3947, 27) (3947,)
-
    (3947, 27) (3947,)
+    ```

 6.  Import the **train_test_split()** function from scikit-learn's **model_selection** module:

-    从 sklearn.model_selection 导入 train_test_split
+    ```py
+    from sklearn.model_selection import train_test_split
+    ```

    Note

@@ -670,23 +672,17 @@ EDA 流程很有用，因为它有助于开发人员发现对于定义操作过

 7.  Split the shuffled dataset:

-    x_new，x_test_2，\
-
-    y_new，y_test_2 = train_test_split（X_shuffle，Y_shuffle，\
-
-    test_size = 0.2，\
-
-    random_state = 0）
-
-    dev_per = x_test_2.shape [0] /x_new.shape [0]
-
-    x_train_2，x_dev_2，\
-
-    y_train_2，y_dev_2 = train_test_split（x_new，y_new，\
-
-    test_size = dev_per，\
-
-    random_state = 0）
+    ```py
+    x_new, x_test_2, \
+    y_new, y_test_2 = train_test_split(X_shuffle, Y_shuffle, \
+                                       test_size=0.2, \
+                                       random_state=0)
+    dev_per = x_test_2.shape[0]/x_new.shape[0]
+    x_train_2, x_dev_2, \
+    y_train_2, y_dev_2 = train_test_split(x_new, y_new, \
+                                          test_size=dev_per, \
+                                          random_state=0)
+    ```

    The first statement performs the initial split. The function takes the following as arguments:

@@ -704,19 +700,19 @@ EDA 流程很有用，因为它有助于开发人员发现对于定义操作过

 8.  Print the shape of all three sets:

-    打印（x_train_2.shape，y_train_2.shape）
-
-    打印（x_dev_2.shape，y_dev_2.shape）
-
-    打印（x_test_2.shape，y_test_2.shape）
+    ```py
+    print(x_train_2.shape, y_train_2.shape)
+    print(x_dev_2.shape, y_dev_2.shape)
+    print(x_test_2.shape, y_test_2.shape)
+    ```

    The result of the preceding operation should be as follows:

+    ```py
    (11841, 27) (11841,)
-
    (3947, 27) (3947,)
-
    (3947, 27) (3947,)
+    ```

    We can see that the resulting sets have the same shapes for both approaches. Which approach to use is a matter of preference.
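
The arithmetic behind the two-step split can be checked by hand. The sketch below assumes that the size of each held-out set is rounded up, which matches the shapes printed above. It uses only the row count from the exercise and does not call scikit-learn.

```python
import math

n_samples = 19735  # number of rows, taken from the shape printed earlier

# First split: hold out 20% for testing (assumed to round up).
n_test = math.ceil(n_samples * 0.2)
n_remaining = n_samples - n_test

# Second split: the dev share is expressed relative to the remaining rows,
# so that the dev set ends up the same size as the test set.
dev_per = n_test / n_remaining
n_dev = math.ceil(n_remaining * dev_per)
n_train = n_remaining - n_dev

print(n_train, n_dev, n_test)
```

Because the second split operates on the remaining 80% of the rows, a `test_size` of 0.25 there corresponds to 20% of the full dataset, which gives the 60:20:20 ratio.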

@@ -806,9 +802,10 @@ PyTorch 的构建考虑了该领域许多开发人员的意见，其优点是可

 1.  Import the PyTorch library, called **torch**, as well as the **nn** module from PyTorch:

-    进口火炬
-
-    将 torch.nn 导入为 nn
+    ```py
+    import torch
+    import torch.nn as nn
+    ```

    Note

@@ -816,61 +813,53 @@ PyTorch 的构建考虑了该领域许多开发人员的意见，其优点是可

 2.  Separate the feature columns from the target for each of the sets we created in the previous exercise. Additionally, convert the final DataFrames into tensors:

-    x_train = torch.tensor（x_train.values）.float（）
-
-    y_train =火炬张量（y_train.values）.float（）
-
-    x_dev = torch.tensor（x_dev.values）.float（）
-
-    y_dev = torch.tensor（y_dev.values）.float（）
-
-    x_test = torch.tensor（x_test.values）.float（）
-
-    y_test = torch.tensor（y_test.values）.float（）
+    ```py
+    x_train = torch.tensor(x_train.values).float()
+    y_train = torch.tensor(y_train.values).float()
+    x_dev = torch.tensor(x_dev.values).float()
+    y_dev = torch.tensor(y_dev.values).float()
+    x_test = torch.tensor(x_test.values).float()
+    y_test = torch.tensor(y_test.values).float()
+    ```

 3.  Define the network architecture using the **nn.Sequential** container. Make sure to create a four-layer network. Use ReLU activation functions for the first three layers and leave the last layer without an activation function, since we are dealing with a regression problem.

    The number of units per layer should be 100, 50, 25, and 1:

-    模型= nn.Sequential（nn.Linear（x_train.shape [1]，100），\
-
-    nn.ReLU（），\
-
-    nn.Linear（100，50），\
-
-    nn.ReLU（），\
-
-    nn.Linear（50，25），\
-
-    nn.ReLU（），\
-
-    线性（25，1））
+    ```py
+    model = nn.Sequential(nn.Linear(x_train.shape[1], 100), \
+                          nn.ReLU(), \
+                          nn.Linear(100, 50), \
+                          nn.ReLU(), \
+                          nn.Linear(50, 25), \
+                          nn.ReLU(), \
+                          nn.Linear(25, 1))
+    ```

 4.  Define the loss function as the MSE:

-    loss_function = torch.nn.MSELoss（）
+    ```py
+    loss_function = torch.nn.MSELoss()
+    ```

 5.  Define the optimizer algorithm as the Adam optimizer:

-    优化程序= torch.optim.Adam（model.parameters（），lr = 0.001）
+    ```py
+    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+    ```

 6.  Use a **for** loop to train the network over the training data for 1,000 iteration steps:

-    对于我在范围（1000）中：
-
-    y_pred = model（x_train）.squeeze（）
-
-    损失= loss_function（y_pred，y_train）
-
-    Optimizer.zero_grad（）
-
-    loss.backward（）
-
-    Optimizer.step（）
-
-    如果 i% 100 == 0：
-
-    打印（i，loss.item（））
+    ```py
+    for i in range(1000):
+        y_pred = model(x_train).squeeze()
+        loss = loss_function(y_pred, y_train)
+        optimizer.zero_grad()
+        loss.backward()
+        optimizer.step()
+        if i%100 == 0:
+            print(i, loss.item())
+    ```

    Note

@@ -888,15 +877,17 @@ PyTorch 的构建考虑了该领域许多开发人员的意见，其优点是可

 7.  To test the model, perform a prediction on the first instance of the testing set and compare it with the ground truth (target value):

-    之前=模型（x_test [0]）
-
-    print（“地面真相：”，y_test [0] .item（），\
-
-    “预测：”，pred.item（））
+    ```py
+    pred = model(x_test[0])
+    print("Ground truth:", y_test[0].item(), \
+          "Prediction:",pred.item())
+    ```

    The output should be similar to the following:

-    基本事实：60.0 预测：69.5818099975586
+    ```py
+    Ground truth: 60.0 Prediction: 69.5818099975586
+    ```

    As you can see, the predicted value (`69.58`) is in the neighborhood of the ground truth value (`60`), off by roughly 9.6.
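
A single instance says little about how well the model fits overall. The following sketch summarizes the error over several instances at once. The tensors `y_true` and `y_hat` hold made-up values standing in for `y_test` and `model(x_test).squeeze()`; they are assumptions for illustration and not results from the trained network.

```python
import torch
import torch.nn as nn

# Hypothetical targets and predictions (not output of the trained model).
y_true = torch.tensor([60.0, 50.0, 70.0, 40.0])
y_hat = torch.tensor([69.0, 48.0, 65.0, 44.0])

# Mean squared error, the same criterion used during training.
mse = nn.MSELoss()(y_hat, y_true).item()
# Mean absolute error, expressed in the same units as the target.
mae = nn.L1Loss()(y_hat, y_true).item()

print(mse, mae)
```

When applying this to the real model, wrapping the forward pass in `torch.no_grad()` avoids tracking gradients during evaluation.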