diff --git a/docs/3.2_dataframes.md b/docs/3.2_dataframes.md index 46a64ef5b735f830e6f280409b97cc487b80b423..3f0b796ffadfd67329e2270707b3f90967adf3f8 100644 --- a/docs/3.2_dataframes.md +++ b/docs/3.2_dataframes.md @@ -1,8 +1,7 @@ -# Sniffing data frames - -We're going to use a real [kaggle competition](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries) data set to explore Pandas dataframes. Grab the [rent.csv.zip](https://mlbook.explained.ai/data/rent.csv.zip) file and unzip it. +# 数据帧 +我们将使用真实的[ kaggle 比赛](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries)数据集来探索 Pandas 数据帧。获取[`rent.csv.zip`](https://mlbook.explained.ai/data/rent.csv.zip)文件并解压缩。 ```python import pandas as pd @@ -10,8 +9,6 @@ df = pd.read_csv("data/rent.csv", parse_dates=['created']) df.head(2) ``` - - | | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | listing_id | longitude | manager_id | photos | price | street_address | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | 7211212 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | ['https://photos.renthop.com/2/7211212_1ed4542... 
| 3000 | 792 Metropolitan Avenue | @@ -41,7 +38,7 @@ df.head(2).T | price | 3000 | 5465 | | street_address | 792 Metropolitan Avenue | 808 Columbus Avenue | -## Sniff the data +## 观察数据 ```python @@ -71,13 +68,10 @@ memory usage: 5.6+ MB ''' ``` - - ```python df.describe() ``` - | | bathrooms | bedrooms | latitude | listing_id | longitude | price | | --- | --- | --- | --- | --- | --- | --- | | count | 49352.00000 | 49352.000000 | 49352.000000 | 4.935200e+04 | 49352.000000 | 4.935200e+04 | @@ -108,12 +102,7 @@ Name: price, dtype: int64 ``` - - - - - -### Get column +### 获取列 ```python @@ -129,13 +118,6 @@ Name: price, dtype: int64 ''' ``` - - - - - - - ```python df['price'].head(5) @@ -149,16 +131,9 @@ Name: price, dtype: int64 ''' ``` +## 列的计算 - - - - - -## Column computations - -Can grab values and take average: - +可以获取值并取平均值: ```python prices = df['price'] @@ -169,20 +144,12 @@ print(f"Average rent is ${avg_rent:.0f}") ``` - ```python df.latitude.min(), df.latitude.max() # (0.0, 44.8835) ``` - - - - - - - ```python bybaths = df.groupby(['bathrooms']).mean() bybaths @@ -242,7 +209,7 @@ bybaths[['bathrooms','price']] # print just num baths, avg price | 13 | 7.0 | 60000.000000 | | 14 | 10.0 | 3600.000000 | -### Columns vs subsets +### 列 VS 子集 ```python @@ -260,12 +227,6 @@ Name: bedrooms, dtype: int64 ``` - - - - - - ```python df[['bedrooms','bathrooms']].head(3) ``` @@ -276,7 +237,7 @@ df[['bedrooms','bathrooms']].head(3) | 1 | 2 | 1.0 | | 2 | 1 | 1.0 | -### Get rows +### 获取行 ```python @@ -303,13 +264,6 @@ Name: 3, dtype: object ''' ``` - - - - - - - ```python df.iloc[0:2] # first two rows ``` @@ -323,14 +277,12 @@ df.iloc[0:2] # first two rows df.iloc[0:2][['created','features']] # first two rows, show 2 columns ``` - - | | created | features | | --- | --- | --- | | 0 | 2016-06-24 07:54:24 | [] | | 1 | 2016-06-12 12:19:27 | ['Doorman', 'Elevator', 'Fitness Center', 'Cat... 
| -## Indexing, Get rows by index key +## 索引,通过索引的键获取行 ```python @@ -368,13 +320,6 @@ Name: 7150865, dtype: object ''' ``` - - - - - - - ```python df = df.reset_index() df.head(3) @@ -393,8 +338,6 @@ df_beds = df.set_index('bedrooms') df_beds.loc[3].head(3) ``` - - | | listing_id | bathrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | bedrooms | | | | | | | | | | | | | | | @@ -402,17 +345,13 @@ df_beds.loc[3].head(3) | 3 | 6858062 | 1.0 | 205f95d4a78f1f3befda48b89edc9669 | 2016-04-12 02:39:45 | BEAUTIFUL 2 BEDROOM POSSIBLE CONVERSION INTO T... | Madison Avenue | ['Doorman', 'Elevator', 'Dishwasher', 'Hardwoo... | low | 40.7454 | -73.9845 | 3793e58c60343a3fd6846ca2d2ef3c7f | ['https://photos.renthop.com/2/6858062_5cfb9d9... | 4395 | 121 Madison Avenue | | 3 | 6890563 | 1.0 | be6b7c3fdf3f63a2756306f4af7788a6 | 2016-04-18 04:46:30 | These pictures are from a similarlisting. | Thompson St | ['Washer/Dryer'] | low | 40.7231 | -74.0044 | 64249f81378907ae7cf65e8ccb4bd8dc | ['https://photos.renthop.com/2/6890563_1b98fae... 
| 3733 | 25 Thompson St | - - -## Checking for missing values +## 检查缺失值 ```python df.isnull().head(5) ``` - - | | listing_id | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | @@ -445,14 +384,7 @@ dtype: bool ''' ``` - - - - - - -Find all rows in data frame where description is missing: - +查找数据帧中缺少描述的所有行: ```python df.description.isnull().head(5) @@ -467,13 +399,6 @@ Name: description, dtype: bool ''' ``` - - - - - - - ```python df[df.description.isnull()].head(5) ``` @@ -487,8 +412,7 @@ df[df.description.isnull()].head(5) | 174 | 6853469 | 1.0 | 1 | 0 | 2016-04-10 05:13:03 | NaN | 83rd Avenue | ['Doorman', 'Elevator', 'Cats Allowed', 'Dogs ... | low | 40.7111 | -73.8272 | e6472c7237327dd3903b3d6f6a94515a | ['https://photos.renthop.com/2/6853469_75548f7... | 1675 | 123-30 83rd Avenue | | 210 | 6846567 | 1.0 | 1 | 6289dd7229f0d3b87254860764be70ab | 2016-04-09 01:39:12 | NaN | West 28th Street | ['Doorman', 'Fitness Center', 'Elevator', 'Cat... 
| low | 40.7512 | -74.0026 | 62b685cc0d876c3a1a51d63a0d6a8082 | [] | 4030 | 525 West 28th Street | -Another query to get all apt rows with price above 1M$ - +另一个查询,获取价格高于 100 万美元的所有公寓行: ```python (df.price>1000000).head(5) @@ -503,13 +427,6 @@ Name: price, dtype: bool ''' ``` - - - - - - - ```python df[df.price>1000000] ``` @@ -522,21 +439,17 @@ df[df.price>1000000] | 29665 | 7013217 | 1.0 | 1 | 37385c8a58176b529964083315c28e32 | 2016-05-14 05:21:28 | | West 57th Street | ['Doorman', 'Cats Allowed', 'Dogs Allowed'] | low | 40.7676 | -73.9844 | 8f5a9c893f6d602f4953fcc0b8e6e9b4 | [] | 1070000 | 333 West 57th Street | | 30689 | 7036279 | 1.0 | 1 | 37385c8a58176b529964083315c28e32 | 2016-05-19 02:37:06 | This 1 Bedroom apartment is located on a prime... | West 57th Street | ['Doorman', 'Elevator', 'Pre-War', 'Dogs Allow... | low | 40.7676 | -73.9844 | 18133bc914e6faf6f8cc1bf29d66fc0d | ['https://photos.renthop.com/2/7036279_924b52f... | 1070000 | 333 West 57th Street | - - ```python df[(df.price>1000) & (df.price<10_000)].head(3) # parentheses are required in query!!!! ``` - - | | listing_id | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | 7211212 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | ['https://photos.renthop.com/2/7211212_1ed4542... | 3000 | 792 Metropolitan Avenue | | 1 | 7150865 | 1.0 | 2 | c5c8a357cba207596b04d1afd1e4f130 | 2016-06-12 12:19:27 | | Columbus Avenue | ['Doorman', 'Elevator', 'Fitness Center', 'Cat... | low | 40.7947 | -73.9667 | 7533621a882f71e25173b27e3139d83d | ['https://photos.renthop.com/2/7150865_be3306c... 
| 5465 | 808 Columbus Avenue | | 2 | 6887163 | 1.0 | 1 | c3ba40552e2120b0acfc3cb5730bb2aa | 2016-04-17 03:26:41 | Top Top West Village location, beautiful Pre-w... | W 13 Street | ['Laundry In Building', 'Dishwasher', 'Hardwoo... | high | 40.7388 | -74.0018 | d9039c43983f6e564b1482b273bd7b01 | ['https://photos.renthop.com/2/6887163_de85c42... | 2850 | 241 W 13 Street | -## Histogram variables +## 直方图变量 ```python @@ -562,13 +475,6 @@ Name: bathrooms, dtype: int64 ''' ``` - - - - - - - ```python import matplotlib.pyplot as plt @@ -580,10 +486,6 @@ plt.show() #
``` - - - - ```python plt.xlabel('Num Bedrooms') plt.ylabel('Num Apts') @@ -591,11 +493,8 @@ plt.hist(df.bedrooms, bins=20) plt.show() ``` - ![png](img/3.2_dataframes_42_0.png) - - ```python plt.xlabel('Price') plt.ylabel('Num Apts at that price') @@ -606,8 +505,6 @@ plt.show() ![png](img/3.2_dataframes_43_0.png) - - ```python import numpy as np df_log = df.copy() @@ -625,12 +522,6 @@ Name: price, dtype: float64 ``` - - - - - - ```python plt.xlabel('Price') plt.ylabel('Num Apts at that price') @@ -642,7 +533,7 @@ plt.show() ![png](img/3.2_dataframes_45_0.png) -## Inter-variable variation +## 变量间的变化 ```python @@ -650,42 +541,37 @@ bybaths.plot.line('bathrooms','price', style='-o') plt.show() ``` - ![png](img/3.2_dataframes_47_0.png) - - ```python # OR, can do directly plt.plot(bybaths.bathrooms, bybaths.price, marker='o') # note slightly different arguments plt.show() ``` - ![png](img/3.2_dataframes_48_0.png) +# 测试你的知识 -# Test your knowledge +从`df`获取浴室一列 -Get the column of bathrooms from df +迭代列的列表并将其打印出来 -Iterate through list of columns and print them out +从`df`获取第 6 行 -Get row 6 from df +将`df`索引设置为卧室,然后使用索引获得所有带 3 间卧室的公寓 -Set the index of df to bedrooms then get all apartments with 3 bedrooms using the index +获取每月价格`> 100_000`的所有行 -Get all rows where price > 100_000 per month +从`df`中删除列`building_id` -Drop column building_id from df +获取价格介于 1000 和 2000 之间的所有行。 -Get all rows where price is between 1000 and 2000. 
+将`np.log()`函数应用于价格并存储在名为`log_price`的新列中 -Apply the `np.log()` function to the price and store in a new column called `log_price` +将纬度和经度列放入到自己的数据帧中 -Get columns latitude and longitude into their own dataframe - -## Solutions +## 答案 ```python @@ -711,9 +597,9 @@ df['log_price'] = np.log( df.price ) df[ ['longitude','latitude'] ] ``` -# Clean up +# 清理 -## Prices +## 价格 ```python @@ -728,11 +614,8 @@ plt.hist(df_clean.price, bins=45) plt.show() ``` - ![png](img/3.2_dataframes_64_0.png) - - ```python plt.scatter(df_clean.bedrooms, df_clean.price, alpha=0.1) plt.xlabel("Bedrooms", fontsize=12) @@ -740,11 +623,10 @@ plt.ylabel("Rent price", fontsize=12) plt.show() ``` - ![png](img/3.2_dataframes_65_0.png) -## Location +## 位置 ```python @@ -755,13 +637,6 @@ len(df_missing) # 11 ``` - - - - - - - ```python # only 11 filter out df_clean = df_clean[(df_clean.longitude!=0) | @@ -783,8 +658,7 @@ print(len(df_clean), len(df)) # 48300 49352 ``` - -## Heatmap +## 热力图 ```python @@ -808,15 +682,9 @@ plt.show() ![png](img/3.2_dataframes_73_0.png) +# 训练模型 -```python - -``` - -# Let's train a model - -Get numeric fields only: - +仅仅获取数值字段 ```python df_train = df_clean[['bathrooms', 'bedrooms', 'longitude', 'latitude', 'price']] @@ -834,7 +702,6 @@ X_train = df_train[['bedrooms','bathrooms','latitude','longitude']] y_train = df_train['price'] ``` - ```python from sklearn.ensemble import RandomForestRegressor rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True) @@ -845,8 +712,7 @@ print(f"OOB R^2 score is {rf.oob_score_:.3f} (range is -infinity to 1.0; 1.0 is # OOB R^2 score is 0.867 (range is -infinity to 1.0; 1.0 is perfect) ``` - -## What does model tell us about features? +## 模型告诉我们特征的什么信息? 
```python @@ -857,11 +723,10 @@ I.plot(kind='barh', legend=False) plt.show() ``` - ![png](img/3.2_dataframes_80_0.png) -# Synthesize features +# 合成特征 ```python @@ -869,15 +734,13 @@ df['one'] = 1 df.head(3) ``` - | | listing_id | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address | log_price | one | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | 7211212 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | ['https://photos.renthop.com/2/7211212_1ed4542... | 3000 | 792 Metropolitan Avenue | 8.006368 | 1 | | 1 | 7150865 | 1.0 | 2 | c5c8a357cba207596b04d1afd1e4f130 | 2016-06-12 12:19:27 | | Columbus Avenue | ['Doorman', 'Elevator', 'Fitness Center', 'Cat... | low | 40.7947 | -73.9667 | 7533621a882f71e25173b27e3139d83d | ['https://photos.renthop.com/2/7150865_be3306c... | 5465 | 808 Columbus Avenue | 8.606119 | 1 | | 2 | 6887163 | 1.0 | 1 | c3ba40552e2120b0acfc3cb5730bb2aa | 2016-04-17 03:26:41 | Top Top West Village location, beautiful Pre-w... | W 13 Street | ['Laundry In Building', 'Dishwasher', 'Hardwoo... | high | 40.7388 | -74.0018 | d9039c43983f6e564b1482b273bd7b01 | ['https://photos.renthop.com/2/6887163_de85c42... | 2850 | 241 W 13 Street | 7.955074 | 1 | -Add random column of appropriate length - +添加适当长度的随机列 ```python df2 = df.copy() @@ -934,8 +797,7 @@ df2.head(2).T | random | 0.630259 | 0.124522 | | i | 0 | 1 | -The data set has a `features` attribute (of type string) with a list of features about the apartment. 
- +数据集有一个`features`属性(类型为字符串),其中包含公寓的特性列表。 ```python df.features.head(5) @@ -950,22 +812,14 @@ Name: features, dtype: object ''' ``` - - - - - - -Let's create three new boolean columns that indicate whether the apartment has a doorman, parking, or laundry. Start by making a copy of the data frame because we'll be modifying it (otherwise we'll get error "A value is trying to be set on a copy of a slice from a DataFrame"): - +让我们创建三个新的布尔列,指示公寓是否有门卫、停车位或洗衣房。首先制作数据帧的副本,因为我们将对其进行修改(否则我们将收到错误“A value is trying to be set on a copy of a slice from a DataFrame”): ```python df_aug = df[['bedrooms','bathrooms','latitude','longitude', 'features','price']].copy() ``` -Then we normalize the features column so that missing features values become blanks and we lowercase all of the strings. - +然后我们规范化`features`列,使缺失的值变为空字符串,并将所有字符串转换为小写。 ```python # rewrite features column @@ -973,8 +827,7 @@ df_aug['features'] = df_aug['features'].fillna('') # fill missing w/blanks df_aug['features'] = df_aug['features'].str.lower() # normalize to lower case ``` -Create the three boolean columns by checking for the presence or absence of a string in the features column. 
- +通过检查`features`列中是否存在字符串来创建三个布尔列。 ```python df_aug['doorman'] = df_aug['features'].str.contains("doorman") @@ -990,8 +843,7 @@ df_aug.head(3) | 1 | 2 | 1.0 | 40.7947 | -73.9667 | 5465 | True | False | False | | 2 | 1 | 1.0 | 40.7388 | -74.0018 | 2850 | False | False | True | -The other way to drop a column other than `del` is with `drop()` function: - +除了`del`之外,删除列的另一种方法是使用`drop()`函数: ```python df2 = df.drop('description',axis=1) # drop doesn't affect df in place, returns new one @@ -1015,8 +867,7 @@ df2.head(2).T # kill this column, return new df without that column | price | 3000 | 5465 | | street_address | 792 Metropolitan Avenue | 808 Columbus Avenue | -Let's do some numerical feature stuff - +让我们对数值特征做一些处理。 ```python df_aug["beds_to_baths"] = df_aug["bedrooms"]/(df_aug["bathrooms"]+1) @@ -1029,8 +880,7 @@ df_aug.head(3) | 1 | 2 | 1.0 | 40.7947 | -73.9667 | 5465 | True | False | False | 1.0 | | 2 | 1 | 1.0 | 40.7388 | -74.0018 | 2850 | False | False | True | 0.5 | -Beyond our scope here, but let's retrain model to see if it improves OOB score. - +这超出了本节的范围,不过让我们重新训练模型,看看 OOB 得分是否有所提高。 ```python df_clean = df_aug[(df.price>1_000) & (df.price<10_000)] @@ -1047,13 +897,10 @@ print(f"OOB R^2 score is {rf.oob_score_:.3f} (range is -infinity to 1.0; 1.0 is # OOB R^2 score is 0.870 (range is -infinity to 1.0; 1.0 is perfect) ``` - - ```python I = pd.DataFrame(data={'Feature':X_train.columns, 'Importance':rf.feature_importances_}) ``` - ```python I.sort_values('Importance',ascending=False) ``` @@ -1069,24 +916,20 @@ I.sort_values('Importance',ascending=False) | 6 | laundry | 0.010321 | | 5 | parking | 0.003859 | -That score is slightly better but not by much. 
- -## Convert categorical to numeric data +这个分数略有提高,但提升不大。 -This is not general but works for small set of categories: +## 将类别转换为数值数据 +这不是通用的,但适用于小型(有序)类别集: ```python df['interest_level'] = df['interest_level'].map({'low':1,'medium':2,'high':3}) ``` - ```python df[['interest_level']].head(5) ``` - - | | interest_level | | --- | --- | | 0 | 2 | @@ -1095,7 +938,7 @@ df[['interest_level']].head(5) | 3 | 1 | | 4 | 1 | -## Convert types +## 转换类型 ```python @@ -1118,13 +961,7 @@ Name: some_boolean, dtype: int8 ''' ``` - - - - - - -## Convert dates +## 转换日期 ```python @@ -1159,7 +996,7 @@ df.head(1).T | day | 24 | | month | 6 | -# Feather format +# Feather 格式 ```python @@ -1176,8 +1013,7 @@ Wall time: 113 ms ''' ``` - -Compare to loading CSV; like 5x slower: +相比之下,加载 CSV 大约慢 5 倍: ```python @@ -1190,18 +1026,18 @@ Wall time: 670 ms ``` -# Test your knowledge +# 测试你的知识 -Filter out all rows with more than 5 bathrooms and put back into same dataframe +过滤掉浴室数超过 5 的所有行,并将结果存回同一个数据帧 -Show a scatter plot of bathrooms vs price +显示浴室与价格的散点图 -Show a heatmap whose color is a function of number of bedrooms. +显示热力图,其颜色是卧室数量的函数。 -Insert a column called `avg_price` that is the average of all prices +插入一个名为`avg_price`的列,它是所有价格的平均值 -Using a dictionary, convert all original `interest_level` values to 10, 20, 30 for the low, medium, and high categories +使用字典,将原始`interest_level`中的低、中、高类别分别转换为 10、20、30 -Convert `manager_id` column to be categorical not string type +将`manager_id`列转换为类别类型而不是字符串类型 -Create a new column called `day_of_year` from the `created` field with the day of year, 1-365 for each row +从`created`字段创建一个名为`day_of_year`的新列,其值为每行对应的一年中的第几天(1-365)
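
Several of the closing exercises can be sketched in a few lines of pandas. The example below is a minimal sketch, not the tutorial's own solution. It assumes a tiny made-up data frame in place of `rent.csv` (the three rows and their values are invented for illustration), and it reads "filter out" as removing the rows with more than 5 bathrooms.

```python
import pandas as pd

# Tiny synthetic stand-in for the rental data; every value here is made up.
df = pd.DataFrame({
    "bathrooms": [1.0, 6.0, 2.0],
    "price": [3000, 60000, 4500],
    "interest_level": ["low", "high", "medium"],
    "manager_id": ["a1", "b2", "a1"],
    "created": pd.to_datetime(["2016-06-24 07:54:24",
                               "2016-01-01 00:00:00",
                               "2016-12-31 23:59:59"]),
})

# Keep rows with at most 5 bathrooms and store the result under the same name.
# The .copy() lets the later column assignments modify an independent frame.
df = df[df["bathrooms"] <= 5].copy()

# A scalar assigned to a column is broadcast to every row.
df["avg_price"] = df["price"].mean()

# Map the ordered categories to numbers with a dictionary.
df["interest_level"] = df["interest_level"].map({"low": 10, "medium": 20, "high": 30})

# Store manager_id as a pandas categorical instead of plain strings.
df["manager_id"] = df["manager_id"].astype("category")

# Day of the year from the datetime column; leap years such as 2016 reach 366.
df["day_of_year"] = df["created"].dt.dayofyear

print(df[["bathrooms", "avg_price", "interest_level", "manager_id", "day_of_year"]])
```

The `.dt.dayofyear` accessor only works on a datetime column, which is why the tutorial's `pd.read_csv(..., parse_dates=['created'])` matters for the last exercise.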