Commit 691ba8b7, authored by wizardforcel

3.2

Parent 834a0d42
# Sniffing data frames
We're going to use a real [Kaggle competition](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries) data set to explore Pandas data frames. Download the [`rent.csv.zip`](https://mlbook.explained.ai/data/rent.csv.zip) file and unzip it; the code below reads it from `data/rent.csv`.
```python
import pandas as pd
......@@ -10,8 +9,6 @@ df = pd.read_csv("data/rent.csv", parse_dates=['created'])
df.head(2)
```
| | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | listing_id | longitude | manager_id | photos | price | street_address |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | 7211212 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | ['https://photos.renthop.com/2/7211212_1ed4542... | 3000 | 792 Metropolitan Avenue |
......@@ -41,7 +38,7 @@ df.head(2).T
| price | 3000 | 5465 |
| street_address | 792 Metropolitan Avenue | 808 Columbus Avenue |
## Sniff the data
```python
......@@ -71,13 +68,10 @@ memory usage: 5.6+ MB
'''
```
```python
df.describe()
```
| | bathrooms | bedrooms | latitude | listing_id | longitude | price |
| --- | --- | --- | --- | --- | --- | --- |
| count | 49352.00000 | 49352.000000 | 49352.000000 | 4.935200e+04 | 49352.000000 | 4.935200e+04 |
......@@ -108,12 +102,7 @@ Name: price, dtype: int64
```
### Get column
```python
......@@ -129,13 +118,6 @@ Name: price, dtype: int64
'''
```
```python
df['price'].head(5)
......@@ -149,16 +131,9 @@ Name: price, dtype: int64
'''
```
## Column computations
We can grab a column's values and take their average:
```python
prices = df['price']
......@@ -169,20 +144,12 @@ print(f"Average rent is ${avg_rent:.0f}")
```
```python
df.latitude.min(), df.latitude.max()
# (0.0, 44.8835)
```
```python
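# numeric_only=True skips string columns, which recent pandas versions refuse to average
bybaths = df.groupby(['bathrooms']).mean(numeric_only=True)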
bybaths
......@@ -242,7 +209,7 @@ bybaths[['bathrooms','price']] # print just num baths, avg price
| 13 | 7.0 | 60000.000000 |
| 14 | 10.0 | 3600.000000 |
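The grouping step can be tried out on a tiny frame. This is a minimal sketch using made-up numbers rather than the rental data; the names `sample` and `by_baths` are illustrative only. It also shows why `reset_index()` is needed before `bathrooms` can be selected as an ordinary column again.

```python
import pandas as pd

# Tiny made-up sample standing in for the rental data (values are illustrative only)
sample = pd.DataFrame({
    "bathrooms": [1.0, 1.0, 2.0, 2.0, 3.0],
    "price":     [2000, 3000, 4000, 6000, 9000],
})

# Group rows by bathroom count and average the numeric columns within each group
by_baths = sample.groupby("bathrooms").mean(numeric_only=True)

# After groupby, 'bathrooms' is the index; reset_index() turns it back into a column
by_baths = by_baths.reset_index()
print(by_baths[["bathrooms", "price"]])
```

Without the `reset_index()` call, `by_baths[['bathrooms', 'price']]` would fail because `bathrooms` would only exist as the index.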
### Columns vs subsets
```python
......@@ -260,12 +227,6 @@ Name: bedrooms, dtype: int64
```
```python
df[['bedrooms','bathrooms']].head(3)
```
......@@ -276,7 +237,7 @@ df[['bedrooms','bathrooms']].head(3)
| 1 | 2 | 1.0 |
| 2 | 1 | 1.0 |
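The difference between the two forms of indexing is the type that comes back. A small sketch on a made-up frame (the name `sample` is an assumption for illustration): single brackets with one name give a `Series`, while a list of names gives a `DataFrame`, even when the list has one entry.

```python
import pandas as pd

# Made-up stand-in for the rental data
sample = pd.DataFrame({"bedrooms": [3, 2, 1], "bathrooms": [1.5, 1.0, 1.0]})

col = sample["bedrooms"]      # one column name -> 1D Series
sub = sample[["bedrooms"]]    # list of column names -> 2D DataFrame

print(type(col).__name__, col.shape)
print(type(sub).__name__, sub.shape)
```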
### Get rows
```python
......@@ -303,13 +264,6 @@ Name: 3, dtype: object
'''
```
```python
df.iloc[0:2] # first two rows
```
......@@ -323,14 +277,12 @@ df.iloc[0:2] # first two rows
df.iloc[0:2][['created','features']] # first two rows, show 2 columns
```
| | created | features |
| --- | --- | --- |
| 0 | 2016-06-24 07:54:24 | [] |
| 1 | 2016-06-12 12:19:27 | ['Doorman', 'Elevator', 'Fitness Center', 'Cat... |
## Indexing: get rows by index key
```python
......@@ -368,13 +320,6 @@ Name: 7150865, dtype: object
'''
```
```python
df = df.reset_index()
df.head(3)
......@@ -393,8 +338,6 @@ df_beds = df.set_index('bedrooms')
df_beds.loc[3].head(3)
```
| | listing_id | bathrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bedrooms | | | | | | | | | | | | | | |
......@@ -402,17 +345,13 @@ df_beds.loc[3].head(3)
| 3 | 6858062 | 1.0 | 205f95d4a78f1f3befda48b89edc9669 | 2016-04-12 02:39:45 | BEAUTIFUL 2 BEDROOM POSSIBLE CONVERSION INTO T... | Madison Avenue | ['Doorman', 'Elevator', 'Dishwasher', 'Hardwoo... | low | 40.7454 | -73.9845 | 3793e58c60343a3fd6846ca2d2ef3c7f | ['https://photos.renthop.com/2/6858062_5cfb9d9... | 4395 | 121 Madison Avenue |
| 3 | 6890563 | 1.0 | be6b7c3fdf3f63a2756306f4af7788a6 | 2016-04-18 04:46:30 | These pictures are from a similarlisting. | Thompson St | ['Washer/Dryer'] | low | 40.7231 | -74.0044 | 64249f81378907ae7cf65e8ccb4bd8dc | ['https://photos.renthop.com/2/6890563_1b98fae... | 3733 | 25 Thompson St |
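To make the contrast between positional and key-based access concrete, here is a small sketch on a made-up frame (the listing ids and prices are illustrative, not taken from the data set). `iloc` selects by row position, while `loc` selects by the value of the current index.

```python
import pandas as pd

# Made-up stand-in for the rental data
sample = pd.DataFrame({
    "listing_id": [101, 102, 103, 104],
    "bedrooms":   [3, 2, 3, 1],
    "price":      [3000, 5465, 4395, 2850],
})

first_two = sample.iloc[0:2]            # rows at positions 0 and 1

by_id = sample.set_index("listing_id")  # unique keys
row_103 = by_id.loc[103]                # a single row comes back as a Series

by_beds = sample.set_index("bedrooms")  # keys may repeat
three_beds = by_beds.loc[3]             # every row whose index key is 3
print(three_beds)
```

When the index key is unique, `loc` returns one row as a `Series`; when the key repeats, as with `bedrooms`, it returns a `DataFrame` holding all matching rows.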
## Checking for missing values
```python
df.isnull().head(5)
```
| | listing_id | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
......@@ -445,14 +384,7 @@ dtype: bool
'''
```
Find all rows in the data frame where the description is missing:
```python
df.description.isnull().head(5)
......@@ -467,13 +399,6 @@ Name: description, dtype: bool
'''
```
```python
df[df.description.isnull()].head(5)
```
......@@ -487,8 +412,7 @@ df[df.description.isnull()].head(5)
| 174 | 6853469 | 1.0 | 1 | 0 | 2016-04-10 05:13:03 | NaN | 83rd Avenue | ['Doorman', 'Elevator', 'Cats Allowed', 'Dogs ... | low | 40.7111 | -73.8272 | e6472c7237327dd3903b3d6f6a94515a | ['https://photos.renthop.com/2/6853469_75548f7... | 1675 | 123-30 83rd Avenue |
| 210 | 6846567 | 1.0 | 1 | 6289dd7229f0d3b87254860764be70ab | 2016-04-09 01:39:12 | NaN | West 28th Street | ['Doorman', 'Fitness Center', 'Elevator', 'Cat... | low | 40.7512 | -74.0026 | 62b685cc0d876c3a1a51d63a0d6a8082 | [] | 4030 | 525 West 28th Street |
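The same pattern can be checked on a small made-up frame (names and values here are assumptions for illustration). The sketch counts missing values per column and then uses the boolean column as a row filter; note that an empty string is not treated as missing.

```python
import numpy as np
import pandas as pd

# Made-up stand-in: one description is NaN, one is an empty string
sample = pd.DataFrame({
    "description": ["Sunny 2BR", np.nan, "", "Near park"],
    "price":       [3000, 1675, 4030, 2850],
})

# isnull() yields True/False per cell; summing counts the True values per column
missing_per_column = sample.isnull().sum()

# A boolean Series inside [] keeps only the rows where it is True
missing_rows = sample[sample.description.isnull()]
print(missing_per_column)
print(missing_rows)
```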
Another query gets all apartment rows with a price above $1M:
```python
(df.price>1000000).head(5)
......@@ -503,13 +427,6 @@ Name: price, dtype: bool
'''
```
```python
df[df.price>1000000]
```
......@@ -522,21 +439,17 @@ df[df.price>1000000]
| 29665 | 7013217 | 1.0 | 1 | 37385c8a58176b529964083315c28e32 | 2016-05-14 05:21:28 | | West 57th Street | ['Doorman', 'Cats Allowed', 'Dogs Allowed'] | low | 40.7676 | -73.9844 | 8f5a9c893f6d602f4953fcc0b8e6e9b4 | [] | 1070000 | 333 West 57th Street |
| 30689 | 7036279 | 1.0 | 1 | 37385c8a58176b529964083315c28e32 | 2016-05-19 02:37:06 | This 1 Bedroom apartment is located on a prime... | West 57th Street | ['Doorman', 'Elevator', 'Pre-War', 'Dogs Allow... | low | 40.7676 | -73.9844 | 18133bc914e6faf6f8cc1bf29d66fc0d | ['https://photos.renthop.com/2/7036279_924b52f... | 1070000 | 333 West 57th Street |
```python
df[(df.price>1000) & (df.price<10_000)].head(3) # parentheses around each comparison are required
```
| | listing_id | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7211212 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | ['https://photos.renthop.com/2/7211212_1ed4542... | 3000 | 792 Metropolitan Avenue |
| 1 | 7150865 | 1.0 | 2 | c5c8a357cba207596b04d1afd1e4f130 | 2016-06-12 12:19:27 | | Columbus Avenue | ['Doorman', 'Elevator', 'Fitness Center', 'Cat... | low | 40.7947 | -73.9667 | 7533621a882f71e25173b27e3139d83d | ['https://photos.renthop.com/2/7150865_be3306c... | 5465 | 808 Columbus Avenue |
| 2 | 6887163 | 1.0 | 1 | c3ba40552e2120b0acfc3cb5730bb2aa | 2016-04-17 03:26:41 | Top Top West Village location, beautiful Pre-w... | W 13 Street | ['Laundry In Building', 'Dishwasher', 'Hardwoo... | high | 40.7388 | -74.0018 | d9039c43983f6e564b1482b273bd7b01 | ['https://photos.renthop.com/2/6887163_de85c42... | 2850 | 241 W 13 Street |
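The parentheses matter because `&` binds more tightly than `>` and `<` in Python, so without them the expression is parsed as a chained comparison on a whole column and typically raises an error. A minimal sketch on made-up prices, with `DataFrame.query()` shown as an alternative spelling of the same filter:

```python
import pandas as pd

# Made-up prices, including one below and one far above the range
sample = pd.DataFrame({"price": [800, 3000, 5465, 1070000]})

# Each comparison is wrapped in parentheses before combining with &
mask = (sample.price > 1000) & (sample.price < 10_000)
in_range = sample[mask]

# query() takes the condition as a string and uses 'and' instead of '&'
same_rows = sample.query("price > 1000 and price < 10000")
print(in_range)
```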
## Histogram variables
```python
......@@ -562,13 +475,6 @@ Name: bathrooms, dtype: int64
'''
```
```python
import matplotlib.pyplot as plt
......@@ -580,10 +486,6 @@ plt.show()
# <Figure size 640x480 with 1 Axes>
```
```python
plt.xlabel('Num Bedrooms')
plt.ylabel('Num Apts')
......@@ -591,11 +493,8 @@ plt.hist(df.bedrooms, bins=20)
plt.show()
```
![png](img/3.2_dataframes_42_0.png)
```python
plt.xlabel('Price')
plt.ylabel('Num Apts at that price')
......@@ -606,8 +505,6 @@ plt.show()
![png](img/3.2_dataframes_43_0.png)
```python
import numpy as np
df_log = df.copy()
......@@ -625,12 +522,6 @@ Name: price, dtype: float64
```
```python
plt.xlabel('Price')
plt.ylabel('Num Apts at that price')
......@@ -642,7 +533,7 @@ plt.show()
![png](img/3.2_dataframes_45_0.png)
## Inter-variable variation
```python
......@@ -650,42 +541,37 @@ bybaths.plot.line('bathrooms','price', style='-o')
plt.show()
```
![png](img/3.2_dataframes_47_0.png)
```python
# OR, can do directly
plt.plot(bybaths.bathrooms, bybaths.price, marker='o') # note slightly different arguments
plt.show()
```
![png](img/3.2_dataframes_48_0.png)
# Test your knowledge
Get the bathrooms column from `df`.
Iterate through the list of columns and print them out.
Get row 6 from `df`.
Set the index of `df` to bedrooms, then use the index to get all apartments with 3 bedrooms.
Get all rows where the price is greater than 100_000 per month.
Drop the column `building_id` from `df`.
Get all rows where the price is between 1000 and 2000.
Apply the `np.log()` function to the price and store the result in a new column called `log_price`.
Put the latitude and longitude columns into their own data frame.
## Solutions
```python
......@@ -711,9 +597,9 @@ df['log_price'] = np.log( df.price )
df[ ['longitude','latitude'] ]
```
# Clean up
## Prices
```python
......@@ -728,11 +614,8 @@ plt.hist(df_clean.price, bins=45)
plt.show()
```
![png](img/3.2_dataframes_64_0.png)
```python
plt.scatter(df_clean.bedrooms, df_clean.price, alpha=0.1)
plt.xlabel("Bedrooms", fontsize=12)
......@@ -740,11 +623,10 @@ plt.ylabel("Rent price", fontsize=12)
plt.show()
```
![png](img/3.2_dataframes_65_0.png)
## Location
```python
......@@ -755,13 +637,6 @@ len(df_missing)
# 11
```
```python
# only 11 such rows, so filter them out
df_clean = df_clean[(df_clean.longitude!=0) |
......@@ -783,8 +658,7 @@ print(len(df_clean), len(df))
# 48300 49352
```
## Heatmap
```python
......@@ -808,15 +682,9 @@ plt.show()
![png](img/3.2_dataframes_73_0.png)
# Let's train a model
Get numeric fields only:
```python
df_train = df_clean[['bathrooms', 'bedrooms', 'longitude', 'latitude', 'price']]
......@@ -834,7 +702,6 @@ X_train = df_train[['bedrooms','bathrooms','latitude','longitude']]
y_train = df_train['price']
```
```python
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
......@@ -845,8 +712,7 @@ print(f"OOB R^2 score is {rf.oob_score_:.3f} (range is -infinity to 1.0; 1.0 is
# OOB R^2 score is 0.867 (range is -infinity to 1.0; 1.0 is perfect)
```
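The "-infinity to 1.0" range in the printed message follows from how R^2 is defined. The sketch below computes it by hand with NumPy on made-up numbers; `r2_score_manual` is a hypothetical helper written for illustration, not part of scikit-learn, and it uses the standard coefficient-of-determination formula, 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always guessing the mean
    return 1.0 - ss_res / ss_tot

y = [2000, 3000, 4000, 5000]                 # made-up rents
perfect = r2_score_manual(y, y)              # exact predictions
baseline = r2_score_manual(y, [3500] * 4)    # always predict the mean rent
bad = r2_score_manual(y, [9000] * 4)         # predictions far from every value
print(perfect, baseline, bad)
```

Perfect predictions score 1.0, always predicting the mean scores 0.0, and anything worse than the mean goes negative without a lower bound.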
## What does the model tell us about features?
```python
......@@ -857,11 +723,10 @@ I.plot(kind='barh', legend=False)
plt.show()
```
![png](img/3.2_dataframes_80_0.png)
# Synthesize features
```python
......@@ -869,15 +734,13 @@ df['one'] = 1
df.head(3)
```
| | listing_id | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | longitude | manager_id | photos | price | street_address | log_price | one |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7211212 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | ['https://photos.renthop.com/2/7211212_1ed4542... | 3000 | 792 Metropolitan Avenue | 8.006368 | 1 |
| 1 | 7150865 | 1.0 | 2 | c5c8a357cba207596b04d1afd1e4f130 | 2016-06-12 12:19:27 | | Columbus Avenue | ['Doorman', 'Elevator', 'Fitness Center', 'Cat... | low | 40.7947 | -73.9667 | 7533621a882f71e25173b27e3139d83d | ['https://photos.renthop.com/2/7150865_be3306c... | 5465 | 808 Columbus Avenue | 8.606119 | 1 |
| 2 | 6887163 | 1.0 | 1 | c3ba40552e2120b0acfc3cb5730bb2aa | 2016-04-17 03:26:41 | Top Top West Village location, beautiful Pre-w... | W 13 Street | ['Laundry In Building', 'Dishwasher', 'Hardwoo... | high | 40.7388 | -74.0018 | d9039c43983f6e564b1482b273bd7b01 | ['https://photos.renthop.com/2/6887163_de85c42... | 2850 | 241 W 13 Street | 7.955074 | 1 |
Add a random column of the appropriate length:
```python
df2 = df.copy()
......@@ -934,8 +797,7 @@ df2.head(2).T
| random | 0.630259 | 0.124522 |
| i | 0 | 1 |
The data set has a `features` column (of type string) that holds a list of features of the apartment.
```python
df.features.head(5)
......@@ -950,22 +812,14 @@ Name: features, dtype: object
'''
```
Let's create three new boolean columns that indicate whether the apartment has a doorman, parking, or laundry. Start by making a copy of the data frame because we'll be modifying it (otherwise we'll get the warning "A value is trying to be set on a copy of a slice from a DataFrame"):
```python
df_aug = df[['bedrooms','bathrooms','latitude','longitude',
'features','price']].copy()
```
Then we normalize the `features` column so that missing values become empty strings, and we lowercase all of the strings.
```python
# rewrite features column
......@@ -973,8 +827,7 @@ df_aug['features'] = df_aug['features'].fillna('') # fill missing w/blanks
df_aug['features'] = df_aug['features'].str.lower() # normalize to lower case
```
Create the three boolean columns by checking whether a given string is present in the `features` column.
```python
df_aug['doorman'] = df_aug['features'].str.contains("doorman")
......@@ -990,8 +843,7 @@ df_aug.head(3)
| 1 | 2 | 1.0 | 40.7947 | -73.9667 | 5465 | True | False | False |
| 2 | 1 | 1.0 | 40.7388 | -74.0018 | 2850 | False | False | True |
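The three steps (fill missing values, lowercase, test for a substring) can be run end to end on a tiny made-up frame; the feature strings below are illustrative only. Filling missing values first matters because `str.contains` would otherwise return a missing value, not `False`, for those rows.

```python
import numpy as np
import pandas as pd

# Made-up stand-in: the features are stored as one string per row, and one is missing
sample = pd.DataFrame({
    "features": ["['Doorman', 'Elevator']", np.nan, "['Laundry In Building']"],
})

# Fill missing entries with an empty string, then lowercase so matching ignores case
feats = sample["features"].fillna("").str.lower()

# Substring tests yield one True/False value per row
sample["doorman"] = feats.str.contains("doorman")
sample["laundry"] = feats.str.contains("laundry")
print(sample[["doorman", "laundry"]])
```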
Besides `del`, you can drop a column with the `drop()` method, which returns a new data frame:
```python
df2 = df.drop('description',axis=1) # drop doesn't affect df in place, returns new one
......@@ -1015,8 +867,7 @@ df2.head(2).T # kill this column, return new df without that column
| price | 3000 | 5465 |
| street_address | 792 Metropolitan Avenue | 808 Columbus Avenue |
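The practical difference between the two approaches is whether the original frame changes. A short sketch on a made-up frame (column values are placeholders):

```python
import pandas as pd

# Made-up stand-in for the rental data
sample = pd.DataFrame({
    "price":       [3000, 5465],
    "description": ["a", "b"],
    "building_id": ["x", "y"],
})

# drop() returns a new frame; 'sample' still has the column afterwards
smaller = sample.drop("description", axis=1)

# del removes the column from 'sample' itself
del sample["building_id"]

print(list(smaller.columns))
print(list(sample.columns))
```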
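Let's synthesize a numeric feature: a bedrooms-to-bathrooms ratio, with 1 added to the denominator so that listings with zero bathrooms do not cause a division by zero.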
```python
df_aug["beds_to_baths"] = df_aug["bedrooms"]/(df_aug["bathrooms"]+1)
......@@ -1029,8 +880,7 @@ df_aug.head(3)
| 1 | 2 | 1.0 | 40.7947 | -73.9667 | 5465 | True | False | False | 1.0 |
| 2 | 1 | 1.0 | 40.7388 | -74.0018 | 2850 | False | False | True | 0.5 |
Feature engineering is beyond our scope here, but let's retrain the model to see whether the new features improve the OOB score.
```python
df_clean = df_aug[(df.price>1_000) & (df.price<10_000)]
......@@ -1047,13 +897,10 @@ print(f"OOB R^2 score is {rf.oob_score_:.3f} (range is -infinity to 1.0; 1.0 is
# OOB R^2 score is 0.870 (range is -infinity to 1.0; 1.0 is perfect)
```
```python
I = pd.DataFrame(data={'Feature':X_train.columns, 'Importance':rf.feature_importances_})
```
```python
I.sort_values('Importance',ascending=False)
```
......@@ -1069,24 +916,20 @@ I.sort_values('Importance',ascending=False)
| 6 | laundry | 0.010321 |
| 5 | parking | 0.003859 |
That score is slightly better, but not by much (0.870 versus 0.867).
## Convert categorical to numeric data
This approach is not general, but it works for a small set of (ordered) categories:
```python
df['interest_level'] = df['interest_level'].map({'low':1,'medium':2,'high':3})
```
```python
df[['interest_level']].head(5)
```
| | interest_level |
| --- | --- |
| 0 | 2 |
......@@ -1095,7 +938,7 @@ df[['interest_level']].head(5)
| 3 | 1 |
| 4 | 1 |
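One thing to watch with `map()` and a dictionary is that any value not found among the keys becomes `NaN`. The sketch below uses a made-up series with an extra `"unknown"` label to show this. For the same reason, running the mapping cell above a second time would turn the whole column into `NaN`, because the values are by then integers, not the original strings.

```python
import pandas as pd

# Made-up labels; "unknown" is deliberately absent from the dictionary
levels = pd.Series(["medium", "low", "high", "unknown"])

# Each value is looked up in the dict; values without a key become NaN
codes = levels.map({"low": 1, "medium": 2, "high": 3})
print(codes)
```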
## Convert types
```python
......@@ -1118,13 +961,7 @@ Name: some_boolean, dtype: int8
'''
```
## Convert dates
```python
......@@ -1159,7 +996,7 @@ df.head(1).T
| day | 24 |
| month | 6 |
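Date parts are available through the `.dt` accessor once a column has a datetime type. A minimal sketch, assuming a made-up frame with two timestamps copied from the listing output above; the `day_of_year` column name is illustrative.

```python
import pandas as pd

# Two timestamps as plain strings, as they would appear in a CSV file
sample = pd.DataFrame({"created": ["2016-06-24 07:54:24", "2016-04-17 03:26:41"]})

# Parse the strings into datetime values (read_csv's parse_dates does this at load time)
sample["created"] = pd.to_datetime(sample["created"])

# The .dt accessor exposes the components of each timestamp
sample["day"] = sample["created"].dt.day
sample["month"] = sample["created"].dt.month
sample["day_of_year"] = sample["created"].dt.dayofyear
print(sample)
```

Because 2016 is a leap year, `dayofyear` can reach 366 for dates at the end of December.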
# Feather format
```python
......@@ -1176,8 +1013,7 @@ Wall time: 113 ms
'''
```
Compare that to loading the CSV, which is roughly 6x slower in this run (670 ms versus 113 ms):
```python
......@@ -1190,18 +1026,18 @@ Wall time: 670 ms
```
# Test your knowledge
Filter out all rows with more than 5 bathrooms and store the result back in the same data frame.
Show a scatter plot of bathrooms vs price.
Show a heatmap whose color is a function of the number of bedrooms.
Insert a column called `avg_price` that holds the average of all prices.
Using a dictionary, convert all original `interest_level` values to 10, 20, 30 for the low, medium, and high categories.
Convert the `manager_id` column to a categorical type instead of a string type.
Create a new column called `day_of_year` from the `created` field, holding the day of the year (1-365) for each row.