3.2

691ba8b7 · wizardforcel · 834a0d42 · 691ba8b7

Showing with 56 additions and 220 deletions

docs/3.2_dataframes.md docs/3.2_dataframes.md +56 -220

--- a/docs/3.2_dataframes.md
+++ b/docs/3.2_dataframes.md

-# Sniffing data frames
-
-We're going to use a real [kaggle competition](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries) data set to explore Pandas dataframes. Grab the [rent.csv.zip](https://mlbook.explained.ai/data/rent.csv.zip) file and unzip it.
+# Data frames

+We're going to use a real [kaggle competition](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries) data set to explore Pandas dataframes. Grab the [`rent.csv.zip`](https://mlbook.explained.ai/data/rent.csv.zip) file and unzip it.

 ```python
 import pandas as pd
@@ -10,8 +9,6 @@ df = pd.read_csv("data/rent.csv", parse_dates=['created'])
 df.head(2)
 ```

-
-
 |  | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | listing_id | longitude | manager_id | photos | price | street_address |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | 0 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | 7211212 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | ['https://photos.renthop.com/2/7211212_1ed4542... | 3000 | 792 Metropolitan Avenue |
@@ -41,7 +38,7 @@ df.head(2).T
 | price | 3000 | 5465 |
 | street_address | 792 Metropolitan Avenue | 808 Columbus Avenue |

-## Sniff the data
+## Inspect the data


 ```python
@@ -71,13 +68,10 @@ memory usage: 5.6+ MB
 '''
 ```

-
-
 ```python
 df.describe()
 ```

-
 |  | bathrooms | bedrooms | latitude | listing_id | longitude | price |
 | --- | --- | --- | --- | --- | --- | --- |
 | count | 49352.00000 | 49352.000000 | 49352.000000 | 4.935200e+04 | 49352.000000 | 4.935200e+04 |
@@ -108,12 +102,7 @@ Name: price, dtype: int64
 ```


-
-
-
-
-
-### Get column
+### Get a column


 ```python
@@ -129,13 +118,6 @@ Name: price, dtype: int64
 '''
 ```

-
-
-
-
-
-
-
 ```python
 df['price'].head(5)

@@ -149,16 +131,9 @@ Name: price, dtype: int64
 '''
 ```

+## Column computations

-
-
-
-
-
-## Column computations
-
-Can grab values and take average:
-
+We can grab the values and take their average:

 ```python
 prices = df['price']
@@ -169,20 +144,12 @@ print(f"Average rent is ${avg_rent:.0f}")
 ```


-
 ```python
 df.latitude.min(), df.latitude.max()

 # (0.0, 44.8835)
 ```

-
-
-
-
-
-
-
 ```python
 bybaths = df.groupby(['bathrooms']).mean()
 bybaths
@@ -242,7 +209,7 @@ bybaths[['bathrooms','price']] # print just num baths, avg price
 | 13 | 7.0 | 60000.000000 |
 | 14 | 10.0 | 3600.000000 |
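
The grouping step above can be sketched on a toy frame. This is a minimal illustration, not the tutorial's own code: the `toy` data is made up, and the `numeric_only=True` argument is an addition, included because recent pandas releases (2.x, as far as I know) raise a `TypeError` when `mean()` meets string columns such as `description`.

```python
import pandas as pd

# Toy stand-in for the rental data; the values are made up for illustration.
toy = pd.DataFrame({
    "bathrooms": [1.0, 1.0, 2.0, 2.0],
    "price": [2000, 3000, 4000, 6000],
    "interest_level": ["low", "high", "low", "medium"],  # non-numeric column
})

# numeric_only=True skips the string column instead of failing on it.
bybaths = toy.groupby("bathrooms").mean(numeric_only=True)

# The group key becomes the index; reset_index() turns it back into a column
# so that bybaths[['bathrooms', 'price']] works.
bybaths = bybaths.reset_index()
print(bybaths)
```

Resetting the index is what makes `bathrooms` selectable as an ordinary column afterwards; without it, the group key lives only in the index.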

-### Columns vs subsets
+### Columns vs subsets


 ```python
@@ -260,12 +227,6 @@ Name: bedrooms, dtype: int64
 ```


-
-
-
-
-
-
 ```python
 df[['bedrooms','bathrooms']].head(3)
 ```
@@ -276,7 +237,7 @@ df[['bedrooms','bathrooms']].head(3)
 | 1 | 2 | 1.0 |
 | 2 | 1 | 1.0 |

-### Get rows
+### Get rows


 ```python
@@ -303,13 +264,6 @@ Name: 3, dtype: object
 '''
 ```

-
-
-
-
-
-
-
 ```python
 df.iloc[0:2] # first two rows
 ```
@@ -323,14 +277,12 @@ df.iloc[0:2] # first two rows
 df.iloc[0:2][['created','features']] # first two rows, show 2 columns
 ```

-
-
 |  | created | features |
 | --- | --- | --- |
 | 0 | 2016-06-24 07:54:24 | [] |
 | 1 | 2016-06-12 12:19:27 | ['Doorman', 'Elevator', 'Fitness Center', 'Cat... |

-## Indexing, Get rows by index key
+## Indexing: get rows by index key


 ```python
@@ -368,13 +320,6 @@ Name: 7150865, dtype: object
 '''
 ```

-
-
-
-
-
-
-
 ```python
 df = df.reset_index()
 df.head(3)
@@ -393,8 +338,6 @@ df_beds = df.set_index('bedrooms')
 df_beds.loc[3].head(3)
 ```

-
-
 |  | listing_id | bathrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | bedrooms |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
@@ -402,17 +345,13 @@ df_beds.loc[3].head(3)
 | 3 | 6858062 | 1.0 | 205f95d4a78f1f3befda48b89edc9669 | 2016-04-12 02:39:45 | BEAUTIFUL 2 BEDROOM POSSIBLE CONVERSION INTO T... | Madison Avenue | ['Doorman', 'Elevator', 'Dishwasher', 'Hardwoo... | low | 40.7454 | -73.9845 | 3793e58c60343a3fd6846ca2d2ef3c7f | ['https://photos.renthop.com/2/6858062_5cfb9d9... | 4395 | 121 Madison Avenue |
 | 3 | 6890563 | 1.0 | be6b7c3fdf3f63a2756306f4af7788a6 | 2016-04-18 04:46:30 | These pictures are from a similarlisting. | Thompson St | ['Washer/Dryer'] | low | 40.7231 | -74.0044 | 64249f81378907ae7cf65e8ccb4bd8dc | ['https://photos.renthop.com/2/6890563_1b98fae... | 3733 | 25 Thompson St |
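
As a small sketch of index-based lookup, assuming a made-up `toy` frame rather than `rent.csv`: `set_index()` returns a new frame, and `.loc[label]` returns every row carrying that label when the index has duplicates, or a single `Series` when the label is unique.

```python
import pandas as pd

# Made-up rows that mimic a few columns of the rental data.
toy = pd.DataFrame({
    "listing_id": [11, 12, 13, 14],
    "bedrooms": [3, 1, 3, 2],
    "price": [3000, 2100, 4395, 2800],
})

by_beds = toy.set_index("bedrooms")  # returns a new frame; toy is unchanged
three_beds = by_beds.loc[3]          # duplicate label: a DataFrame with two rows
print(three_beds)

by_id = toy.set_index("listing_id")
row = by_id.loc[12]                  # unique label: a single row as a Series
print(row["price"])
```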

-
-
-## Checking for missing values
+## Checking for missing values


 ```python
 df.isnull().head(5)
 ```

-
-
 |  | listing_id | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | 0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
@@ -445,14 +384,7 @@ dtype: bool
 '''
 ```

-
-
-
-
-
-
-Find all rows in data frame where description is missing:
-
+Find all rows in the data frame where the description is missing:

 ```python
 df.description.isnull().head(5)
@@ -467,13 +399,6 @@ Name: description, dtype: bool
 '''
 ```

-
-
-
-
-
-
-
 ```python
 df[df.description.isnull()].head(5)
 ```
@@ -487,8 +412,7 @@ df[df.description.isnull()].head(5)
 | 174 | 6853469 | 1.0 | 1 | 0 | 2016-04-10 05:13:03 | NaN | 83rd Avenue | ['Doorman', 'Elevator', 'Cats Allowed', 'Dogs ... | low | 40.7111 | -73.8272 | e6472c7237327dd3903b3d6f6a94515a | ['https://photos.renthop.com/2/6853469_75548f7... | 1675 | 123-30 83rd Avenue |
 | 210 | 6846567 | 1.0 | 1 | 6289dd7229f0d3b87254860764be70ab | 2016-04-09 01:39:12 | NaN | West 28th Street | ['Doorman', 'Fitness Center', 'Elevator', 'Cat... | low | 40.7512 | -74.0026 | 62b685cc0d876c3a1a51d63a0d6a8082 | [] | 4030 | 525 West 28th Street |

-Another query to get all apt rows with price above 1M$
-
+Another query, to get all apartment rows with a price above $1M:

 ```python
 (df.price>1000000).head(5)
@@ -503,13 +427,6 @@ Name: price, dtype: bool
 '''
 ```

-
-
-
-
-
-
-
 ```python
 df[df.price>1000000]
 ```
@@ -522,21 +439,17 @@ df[df.price>1000000]
 | 29665 | 7013217 | 1.0 | 1 | 37385c8a58176b529964083315c28e32 | 2016-05-14 05:21:28 |  | West 57th Street | ['Doorman', 'Cats Allowed', 'Dogs Allowed'] | low | 40.7676 | -73.9844 | 8f5a9c893f6d602f4953fcc0b8e6e9b4 | [] | 1070000 | 333 West 57th Street |
 | 30689 | 7036279 | 1.0 | 1 | 37385c8a58176b529964083315c28e32 | 2016-05-19 02:37:06 | This 1 Bedroom apartment is located on a prime... | West 57th Street | ['Doorman', 'Elevator', 'Pre-War', 'Dogs Allow... | low | 40.7676 | -73.9844 | 18133bc914e6faf6f8cc1bf29d66fc0d | ['https://photos.renthop.com/2/7036279_924b52f... | 1070000 | 333 West 57th Street |

-
-
 ```python
 df[(df.price>1000) & (df.price<10_000)].head(3) # parentheses are required in query!!!!
 ```

-
-
 |  | listing_id | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | 0 | 7211212 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | ['https://photos.renthop.com/2/7211212_1ed4542... | 3000 | 792 Metropolitan Avenue |
 | 1 | 7150865 | 1.0 | 2 | c5c8a357cba207596b04d1afd1e4f130 | 2016-06-12 12:19:27 |  | Columbus Avenue | ['Doorman', 'Elevator', 'Fitness Center', 'Cat... | low | 40.7947 | -73.9667 | 7533621a882f71e25173b27e3139d83d | ['https://photos.renthop.com/2/7150865_be3306c... | 5465 | 808 Columbus Avenue |
 | 2 | 6887163 | 1.0 | 1 | c3ba40552e2120b0acfc3cb5730bb2aa | 2016-04-17 03:26:41 | Top Top West Village location, beautiful Pre-w... | W 13 Street | ['Laundry In Building', 'Dishwasher', 'Hardwoo... | high | 40.7388 | -74.0018 | d9039c43983f6e564b1482b273bd7b01 | ['https://photos.renthop.com/2/6887163_de85c42... | 2850 | 241 W 13 Street |
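
A minimal sketch of why the parentheses matter in the range query, using a made-up `toy` frame. In Python, `&` binds more tightly than `>` and `<`, so without parentheses the expression is parsed as a chained comparison against `1000 & toy.price`, which pandas typically rejects with a `ValueError` about an ambiguous truth value.

```python
import pandas as pd

# Made-up prices; only the shape of the query matters here.
toy = pd.DataFrame({"price": [800, 3000, 5465, 1_070_000]})

low = toy.price > 1000       # boolean Series, one entry per row
high = toy.price < 10_000
in_range = toy[low & high]   # & combines the two masks element-wise
print(in_range)

# The same filter written inline needs parentheses around each comparison.
inline = toy[(toy.price > 1000) & (toy.price < 10_000)]
```

Naming the masks first, as with `low` and `high`, avoids the precedence issue entirely and can make longer filters easier to read.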

-## Histogram variables
+## Histogram variables


 ```python
@@ -562,13 +475,6 @@ Name: bathrooms, dtype: int64
 '''
 ```

-
-
-
-
-
-
-
 ```python
 import matplotlib.pyplot as plt

@@ -580,10 +486,6 @@ plt.show()
 # <Figure size 640x480 with 1 Axes>
 ```

-
-
-
-
 ```python
 plt.xlabel('Num Bedrooms')
 plt.ylabel('Num Apts')
@@ -591,11 +493,8 @@ plt.hist(df.bedrooms, bins=20)
 plt.show()
 ```

-
 ![png](img/3.2_dataframes_42_0.png)

-
-
 ```python
 plt.xlabel('Price')
 plt.ylabel('Num Apts at that price')
@@ -606,8 +505,6 @@ plt.show()

 ![png](img/3.2_dataframes_43_0.png)

-
-
 ```python
 import numpy as np
 df_log = df.copy()
@@ -625,12 +522,6 @@ Name: price, dtype: float64
 ```


-
-
-
-
-
-
 ```python
 plt.xlabel('Price')
 plt.ylabel('Num Apts at that price')
@@ -642,7 +533,7 @@ plt.show()
 ![png](img/3.2_dataframes_45_0.png)


-## Inter-variable variation
+## Inter-variable variation


 ```python
@@ -650,42 +541,37 @@ bybaths.plot.line('bathrooms','price', style='-o')
 plt.show()
 ```

-
 ![png](img/3.2_dataframes_47_0.png)

-
-
 ```python
 # OR, can do directly
 plt.plot(bybaths.bathrooms, bybaths.price, marker='o') # note slightly different arguments
 plt.show()
 ```

-
 ![png](img/3.2_dataframes_48_0.png)

+# Test your knowledge

-# Test your knowledge
+Get the bathrooms column from `df`

-Get the column of bathrooms from df
+Iterate through the list of columns and print them out

-Iterate through list of columns and print them out
+Get row 6 from `df`

-Get row 6 from df
+Set the index of `df` to bedrooms, then use the index to get all apartments with 3 bedrooms

-Set the index of df to bedrooms then get all apartments with 3 bedrooms using the index
+Get all rows where the price is `> 100_000` per month

-Get all rows where price > 100_000 per month
+Drop the column `building_id` from `df`

-Drop column building_id from df
+Get all rows where the price is between 1000 and 2000.

-Get all rows where price is between 1000 and 2000.
+Apply the `np.log()` function to the price and store the result in a new column called `log_price`

-Apply the `np.log()` function to the price and store in a new column called `log_price`
+Get the latitude and longitude columns into their own dataframe

-Get columns latitude and longitude into their own dataframe
-
-## Solutions
+## Solutions


 ```python
@@ -711,9 +597,9 @@ df['log_price'] = np.log( df.price )
 df[ ['longitude','latitude'] ]
 ```

-# Clean up
+# Clean up

-## Prices
+## Prices


 ```python
@@ -728,11 +614,8 @@ plt.hist(df_clean.price, bins=45)
 plt.show()
 ```

-
 ![png](img/3.2_dataframes_64_0.png)

-
-
 ```python
 plt.scatter(df_clean.bedrooms, df_clean.price, alpha=0.1)
 plt.xlabel("Bedrooms", fontsize=12) 
@@ -740,11 +623,10 @@ plt.ylabel("Rent price", fontsize=12)
 plt.show()
 ```

-
 ![png](img/3.2_dataframes_65_0.png)


-## Location
+## Location


 ```python
@@ -755,13 +637,6 @@ len(df_missing)
 # 11
 ```

-
-
-
-
-
-
-
 ```python
 # only 11 filter out
 df_clean = df_clean[(df_clean.longitude!=0) |
@@ -783,8 +658,7 @@ print(len(df_clean), len(df))
 # 48300 49352
 ```

-
-## Heatmap
+## Heatmap


 ```python
@@ -808,15 +682,9 @@ plt.show()
 ![png](img/3.2_dataframes_73_0.png)


+# Train a model

-```python
-
-```
-
-# Let's train a model
-
-Get numeric fields only:
-
+Get numeric fields only:

 ```python
 df_train = df_clean[['bathrooms', 'bedrooms', 'longitude', 'latitude', 'price']]
@@ -834,7 +702,6 @@ X_train = df_train[['bedrooms','bathrooms','latitude','longitude']]
 y_train = df_train['price']
 ```

-
 ```python
 from sklearn.ensemble import RandomForestRegressor
 rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
@@ -845,8 +712,7 @@ print(f"OOB R^2 score is {rf.oob_score_:.3f} (range is -infinity to 1.0; 1.0 is
 # OOB R^2 score is 0.867 (range is -infinity to 1.0; 1.0 is perfect)
 ```

-
-## What does model tell us about features?
+## What does the model tell us about features?


 ```python
@@ -857,11 +723,10 @@ I.plot(kind='barh', legend=False)
 plt.show()
 ```

-
 ![png](img/3.2_dataframes_80_0.png)


-# Synthesize features
+# Synthesize features


 ```python
@@ -869,15 +734,13 @@ df['one'] = 1
 df.head(3)
 ```

-
 |  | listing_id | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address | log_price | one |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | 0 | 7211212 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | ['https://photos.renthop.com/2/7211212_1ed4542... | 3000 | 792 Metropolitan Avenue | 8.006368 | 1 |
 | 1 | 7150865 | 1.0 | 2 | c5c8a357cba207596b04d1afd1e4f130 | 2016-06-12 12:19:27 |  | Columbus Avenue | ['Doorman', 'Elevator', 'Fitness Center', 'Cat... | low | 40.7947 | -73.9667 | 7533621a882f71e25173b27e3139d83d | ['https://photos.renthop.com/2/7150865_be3306c... | 5465 | 808 Columbus Avenue | 8.606119 | 1 |
 | 2 | 6887163 | 1.0 | 1 | c3ba40552e2120b0acfc3cb5730bb2aa | 2016-04-17 03:26:41 | Top Top West Village location, beautiful Pre-w... | W 13 Street | ['Laundry In Building', 'Dishwasher', 'Hardwoo... | high | 40.7388 | -74.0018 | d9039c43983f6e564b1482b273bd7b01 | ['https://photos.renthop.com/2/6887163_de85c42... | 2850 | 241 W 13 Street | 7.955074 | 1 |

-Add random column of appropriate length
-
+Add a random column of appropriate length:

 ```python
 df2 = df.copy()
@@ -934,8 +797,7 @@ df2.head(2).T
 | random | 0.630259 | 0.124522 |
 | i | 0 | 1 |

-The data set has a `features` attribute (of type string) with a list of features about the apartment.
-
+The data set has a `features` attribute (of type string) with a list of features about the apartment.

 ```python
 df.features.head(5)
@@ -950,22 +812,14 @@ Name: features, dtype: object
 '''
 ```

-
-
-
-
-
-
-Let's create three new boolean columns that indicate whether the apartment has a doorman, parking, or laundry.  Start by making a copy of the data frame because we'll be modifying it (otherwise we'll get error "A value is trying to be set on a copy of a slice from a DataFrame"):
-
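+Let's create three new boolean columns that indicate whether the apartment has a doorman, parking, or laundry. Start by making a copy of the data frame because we'll be modifying it (otherwise pandas may emit the warning "A value is trying to be set on a copy of a slice from a DataFrame"):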

 ```python
 df_aug = df[['bedrooms','bathrooms','latitude','longitude',
             'features','price']].copy()
 ```

-Then we normalize the features column so that missing features values become blanks and we lowercase all of the strings.
-
+Then we normalize the `features` column so that missing values become empty strings and all of the strings are lowercase.

 ```python
 # rewrite features column
@@ -973,8 +827,7 @@ df_aug['features'] = df_aug['features'].fillna('') # fill missing w/blanks
 df_aug['features'] = df_aug['features'].str.lower() # normalize to lower case
 ```

-Create the three boolean columns by checking for the presence or absence of a string in the features column. 
-
+Create the three boolean columns by checking for the presence or absence of a string in the `features` column.

 ```python
 df_aug['doorman'] = df_aug['features'].str.contains("doorman")
@@ -990,8 +843,7 @@ df_aug.head(3)
 | 1 | 2 | 1.0 | 40.7947 | -73.9667 | 5465 | True | False | False |
 | 2 | 1 | 1.0 | 40.7388 | -74.0018 | 2850 | False | False | True |
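
The normalize-then-search steps can be sketched end to end on a toy column. The three `features` strings are made up, and `regex=False` is an addition here so that the search term is treated as a literal substring rather than a pattern.

```python
import pandas as pd

# Made-up feature strings, including one missing value.
toy = pd.DataFrame({
    "features": ["['Doorman', 'Elevator']", None, "['Laundry In Building']"],
})

# Missing values become empty strings, then everything is lowercased.
feats = toy["features"].fillna("").str.lower()

# One boolean column per keyword of interest.
toy["doorman"] = feats.str.contains("doorman", regex=False)
toy["laundry"] = feats.str.contains("laundry", regex=False)
print(toy[["doorman", "laundry"]])
```

Filling missing values first matters: `str.contains()` on a missing entry returns a missing value rather than `False`, which would leave gaps in the new columns.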

-The other way to drop a column other than `del` is with `drop()` function:
-
+Besides `del`, the other way to drop a column is the `drop()` method:

 ```python
 df2 = df.drop('description',axis=1) # drop doesn't affect df in place, returns new one
@@ -1015,8 +867,7 @@ df2.head(2).T # kill this column, return new df without that column
 | price | 3000 | 5465 |
 | street_address | 792 Metropolitan Avenue | 808 Columbus Avenue |

-Let's do some numerical feature stuff
-
+Let's do some work on the numerical features.

 ```python
 df_aug["beds_to_baths"] = df_aug["bedrooms"]/(df_aug["bathrooms"]+1)
@@ -1029,8 +880,7 @@ df_aug.head(3)
 | 1 | 2 | 1.0 | 40.7947 | -73.9667 | 5465 | True | False | False | 1.0 |
 | 2 | 1 | 1.0 | 40.7388 | -74.0018 | 2850 | False | False | True | 0.5 |

-Beyond our scope here, but let's retrain model to see if it improves OOB score.
-
+This is beyond our scope here, but let's retrain the model to see if it improves the OOB score.

 ```python
 df_clean = df_aug[(df.price>1_000) & (df.price<10_000)]
@@ -1047,13 +897,10 @@ print(f"OOB R^2 score is {rf.oob_score_:.3f} (range is -infinity to 1.0; 1.0 is
 # OOB R^2 score is 0.870 (range is -infinity to 1.0; 1.0 is perfect)
 ```

-
-
 ```python
 I = pd.DataFrame(data={'Feature':X_train.columns, 'Importance':rf.feature_importances_})
 ```

-
 ```python
 I.sort_values('Importance',ascending=False)
 ```
@@ -1069,24 +916,20 @@ I.sort_values('Importance',ascending=False)
 | 6 | laundry | 0.010321 |
 | 5 | parking | 0.003859 |

-That score is slightly better but not by much.
-
-## Convert categorical to numeric data
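+That score is slightly better, but not by much.

-This is not general but works for small set of categories:
+## Convert categorical to numeric data

+This is not general, but it works for a small set of (ordered) categories: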

 ```python
 df['interest_level'] = df['interest_level'].map({'low':1,'medium':2,'high':3})
 ```

-
 ```python
 df[['interest_level']].head(5)
 ```

-
-
 |  | interest_level |
 | --- | --- |
 | 0 | 2 |
@@ -1095,7 +938,7 @@ df[['interest_level']].head(5)
 | 3 | 1 |
 | 4 | 1 |
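
A short sketch of the dictionary mapping on a standalone `Series`. The `"unknown"` entry is a made-up value, added only to show what happens to categories that are missing from the dictionary.

```python
import pandas as pd

# "unknown" is a made-up category that the dictionary does not cover.
levels = pd.Series(["low", "medium", "high", "unknown"])
order = {"low": 1, "medium": 2, "high": 3}

encoded = levels.map(order)
print(encoded)

# Values missing from the dictionary become NaN, so count them before training.
print(encoded.isnull().sum())
```

For the same reason, running the `map()` cell twice on the same column turns every value into `NaN`, because the integers 1, 2, 3 are not keys of the dictionary.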

-## Convert types
+## Convert types


 ```python
@@ -1118,13 +961,7 @@ Name: some_boolean, dtype: int8
 '''
 ```

-
-
-
-
-
-
-## Convert dates
+## Convert dates


 ```python
@@ -1159,7 +996,7 @@ df.head(1).T
 | day | 24 |
 | month | 6 |

-# Feather format
+# Feather format


 ```python
@@ -1176,8 +1013,7 @@ Wall time: 113 ms
 '''
 ```

-
-Compare to loading CSV; like 5x slower:
+Compare that to loading the CSV, which is about 5x slower:


 ```python
@@ -1190,18 +1026,18 @@ Wall time: 670 ms
 ```


-# Test your knowledge
+# Test your knowledge

-Filter out all rows with more than 5 bathrooms and put back into same dataframe
+Filter out all rows with more than 5 bathrooms and put the result back into the same dataframe

-Show a scatter plot of bathrooms vs price
+Show a scatter plot of bathrooms vs price

-Show a heatmap whose color is a function of number of bedrooms.
+Show a heatmap whose color is a function of the number of bedrooms.

-Insert a column called `avg_price` that is the average of all prices
+Insert a column called `avg_price` that is the average of all prices

-Using a dictionary, convert all original `interest_level` values to 10, 20, 30 for the low, medium, and high categories
+Using a dictionary, convert all original `interest_level` values to `10, 20, 30` for the low, medium, and high categories

-Convert `manager_id` column to be categorical not string type
+Convert the `manager_id` column to a categorical type instead of a string type

-Create a new column called `day_of_year` from the `created` field with the day of year, 1-365 for each row
+Create a new column called `day_of_year` from the `created` field, holding the day of the year (1-365) for each row
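
One possible way to approach the non-plotting exercises, sketched on a made-up `toy` frame rather than the real data. The column names follow the exercises; the values are hypothetical, and other solutions are equally valid.

```python
import pandas as pd

# Made-up rows with the columns the exercises refer to.
toy = pd.DataFrame({
    "bathrooms": [1.0, 6.0, 2.0],
    "price": [3000, 60000, 4500],
    "interest_level": ["low", "high", "medium"],
    "manager_id": ["a1", "b2", "a1"],
    "created": pd.to_datetime(["2016-01-01", "2016-06-24", "2016-12-31"]),
})

# Keep only rows with at most 5 bathrooms and store back into the same name.
toy = toy[toy.bathrooms <= 5].copy()

# A scalar assigned to a column is broadcast to every row.
toy["avg_price"] = toy.price.mean()

# Dictionary-based recoding of the ordered categories.
toy["interest_level"] = toy["interest_level"].map({"low": 10, "medium": 20, "high": 30})

# Categorical dtype instead of plain strings.
toy["manager_id"] = toy["manager_id"].astype("category")

# Day of the year from the datetime column.
toy["day_of_year"] = toy["created"].dt.dayofyear
print(toy)
```

Note that 2016 is a leap year, so December 31 maps to day 366 rather than 365.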