提交 0d084b14 编写于 作者: W wizardforcel

ch11.

上级 f3477b20
......@@ -156,3 +156,121 @@ percentile(75, scores)
## 自举法
一个数据科学家正在使用随机样本中的数据来估计未知参数。她使用样本来计算用作估计值的统计量。
一旦她计算出了统计量的观察值,她就可以把它作为她的估计值,然后顺其自然。 但她是一名数据科学家。 她知道她的随机样本只是众多可能的随机样本之一,因此她的估计只是众多合理估算之一。
这些估计的变化有多大? 为了回答这个问题,似乎她需要从总体中抽取另一个样本,并根据新样本计算一个新的估计值。 但是她没有资源来回到总体中,再抽取一个样本。
这个数据科学家看起来好像卡住了。
幸运的是,一个叫做自举法的好主意可以帮助她。 由于从总体中生成新样本是不可行的,自举法通过称为重采样的方法生成新的随机样本:新样本从原始样本中随机抽取。
在本节中,我们将看到自举法的工作方式和原因。 在本章的其余部分,我们将使用自举法进行推理。
### 旧金山市的雇员薪资
[SF OpenData](https://data.sfgov.org/) 是一个网站,旧金山市和县在上面公开提供他们的一些数据。 其中一个数据集包含城市雇员的薪资数据。 其中包括市营医院的医疗专业人员,警察,消防员,运输工人,民选官员以及市内所有其他雇员。
2015 日历年的薪资数据见表`sf2015`
```py
sf2015 = Table.read_table('san_francisco_2015.csv')
sf2015
```
| Year Type | Year | Organization Group Code | Organization Group | Department Code | Department | Union Code | Union | Job Family Code | Job Family | Job Code | Job | Employee Identifier | Salaries | Overtime | Other Salaries | Total Salary | Retirement | Health/Dental | Other Benefits | Total Benefits | Total Compensation |
| --- | --- |
| Calendar | 2015 | 2 | Public Works, Transportation & Commerce | WTR | PUC Water Department | 21 | Prof & Tech Engineers - Miscellaneous, Local 21 | 2400 | Lab, Pharmacy & Med Techs | 2481 | Water Qualitytech I/II | 21538 | 82146 | 0 | 0 | 82146 | 16942.2 | 12340.9 | 6337.73 | 35620.8 | 117767 |
| Calendar | 2015 | 2 | Public Works, Transportation & Commerce | DPW | General Services Agency - Public Works | 12 | Carpet, Linoleum and Soft Tile Workers, Local 12 | 7300 | Journeyman Trade | 7393 | Soft Floor Coverer | 5459 | 32165.8 | 973.19 | 848.96 | 33987.9 | 0 | 4587.51 | 2634.42 | 7221.93 | 41209.8 |
| Calendar | 2015 | 4 | Community Health | DPH | Public Health | 790 | SEIU - Miscellaneous, Local 1021 | 1600 | Payroll, Billing & Accounting | 1636 | Health Care Billing Clerk 2 | 41541 | 71311 | 5757.98 | 0 | 77069 | 14697.6 | 12424.5 | 6370.06 | 33492.2 | 110561 |
| Calendar | 2015 | 4 | Community Health | DPH | Public Health | 351 | Municipal Executive Association - Miscellaneous | 0900 | Management | 2620 | Food Service Mgr Administrator | 26718 | 28430.2 | 0 | 763.07 | 29193.3 | 0 | 4223.14 | 5208.51 | 9431.65 | 38625 |
| Calendar | 2015 | 2 | Public Works, Transportation & Commerce | MTA | Municipal Transportation Agency | 790 | SEIU - Miscellaneous, Local 1021 | 8200 | Protection & Apprehension | 8201 | School Crossing Guard | 45810 | 7948.75 | 0 | 0 | 7948.75 | 0 | 2873.17 | 616.24 | 3489.41 | 11438.2 |
| Calendar | 2015 | 1 | Public Protection | POL | Police | 911 | Police Officers' Association | Q000 | Police Services | Q002 | Police Officer | 32906 | 2235 | 0 | 0 | 2235 | 490.36 | 286.72 | 176.57 | 953.65 | 3188.65 |
| Calendar | 2015 | 4 | Community Health | DPH | Public Health | 791 | SEIU - Staff and Per Diem Nurses, Local 1021 | 2300 | Nursing | 2328 | Nurse Practitioner | 7506 | 187247 | 0 | 11704.1 | 198951 | 37683.7 | 12424.5 | 11221.7 | 61329.9 | 260281 |
| Calendar | 2015 | 2 | Public Works, Transportation & Commerce | MTA | Municipal Transportation Agency | 253 | Transport Workers - Transit Operators, Local 250-A | 9100 | Street Transit | 9163 | Transit Operator | 36773 | 66988.5 | 3512.88 | 2770.39 | 73271.8 | 19127.2 | 13203 | 5455.1 | 37785.3 | 111057 |
| Calendar | 2015 | 6 | General Administration & Finance | CAT | City Attorney | 311 | Municipal Attorneys' Association | 8100 | Legal & Court | 8177 | Attorney (Civil/Criminal) | 12963 | 135190 | 0 | 1562.5 | 136752 | 27501.8 | 12424.5 | 10103 | 50029.3 | 186781 |
| Calendar | 2015 | 3 | Human Welfare & Neighborhood Development | DSS | Human Services | 535 | SEIU - Human Services, Local 1021 | 9700 | Community Development | 9703 | Emp & Training Spec 2 | 35179 | 70474.8 | 147.28 | 1647.24 | 72269.3 | 14650.3 | 10696.9 | 5993.11 | 31340.3 | 103610 |
(省略了 42979 行)
共有 42,979 名员工,每个人一行。 有许多列包含市政部门隶属关系的信息,以及员工薪酬方案不同部分的详细信息。 这是对应市长 Ed Lee 的一行。
```py
sf2015.where('Job', are.equal_to('Mayor'))
```
| Year Type | Year | Organization Group Code | Organization Group | Department Code | Department | Union Code | Union | Job Family Code | Job Family | Job Code | Job | Employee Identifier | Salaries | Overtime | Other Salaries | Total Salary | Retirement | Health/Dental | Other Benefits | Total Benefits | Total Compensation |
| --- | --- |
| Calendar | 2015 | 6 | General Administration & Finance | MYR | Mayor | 556 | Elected Officials | 1100 | Administrative & Mgmt (Unrep) | 1190 | Mayor | 22433 | 288964 | 0 | 0 | 288964 | 58117 | 12424.5 | 20293 | 90834.5 | 379798 |
我们要研究最后一栏,总薪酬。 这是员工的工资加上市政府对退休和福利计划的贡献。
日历年的财务方案有时难以理解,因为它们取决于雇用日期,员工是否在城市内部换工作等等。 例如,`Total Compensation`列中的最低值看起来有点奇怪。
```py
sf2015.sort('Total Compensation')
```
| Year Type | Year | Organization Group Code | Organization Group | Department Code | Department | Union Code | Union | Job Family Code | Job Family | Job Code | Job | Employee Identifier | Salaries | Overtime | Other Salaries | Total Salary | Retirement | Health/Dental | Other Benefits | Total Benefits | Total Compensation |
| --- | --- |
| Calendar | 2015 | 1 | Public Protection | FIR | Fire Department | 798 | Firefighters - Miscellaneous, Local 798 | H000 | Fire Services | H002 | Firefighter | 43833 | 0 | 0 | 0 | 0 | 0 | 0 | -423.76 | -423.76 | -423.76 |
| Calendar | 2015 | 4 | Community Health | DPH | Public Health | 790 | SEIU - Miscellaneous, Local 1021 | 9900 | Public Service Aide | 9924 | PS Aide Health Services | 27871 | -292.4 | 0 | 0 | -292.4 | 0 | -95.58 | -22.63 | -118.21 | -410.61 |
| Calendar | 2015 | 1 | Public Protection | JUV | Juvenile Probation | 790 | SEIU - Miscellaneous, Local 1021 | 8300 | Correction & Detention | 8320 | Counselor, Juvenile Hall | 10517 | 0 | 0 | 0 | 0 | 0 | 0 | -159.12 | -159.12 | -159.12 |
| Calendar | 2015 | 6 | General Administration & Finance | CPC | City Planning | 21 | Prof & Tech Engineers - Miscellaneous, Local 21 | 1000 | Information Systems | 1053 | IS Business Analyst-Senior | 18961 | 0 | 0 | 0 | 0 | 0 | 0 | -26.53 | -26.53 | -26.53 |
| Calendar | 2015 | 6 | General Administration & Finance | CPC | City Planning | 21 | Prof & Tech Engineers - Miscellaneous, Local 21 | 5200 | Professional Engineering | 5277 | Planner 1 | 19387 | 0 | 0 | 0 | 0 | 0 | 0 | -9.51 | -9.51 | -9.51 |
| Calendar | 2015 | 2 | Public Works, Transportation & Commerce | PUC | PUC Public Utilities Commission | 21 | Prof & Tech Engineers - Miscellaneous, Local 21 | 1000 | Information Systems | 1044 | IS Engineer-Principal | 28988 | 0 | 0 | 0 | 0 | 0 | 0 | -3.1 | -3.1 | -3.1 |
| Calendar | 2015 | 1 | Public Protection | JUV | Juvenile Probation | 39 | Stationary Engineers, Local 39 | 7300 | Journeyman Trade | 7335 | Senior Stationary Engineer | 19125 | 0 | 0 | 0 | 0 | 0 | 0 | -0.01 | -0.01 | -0.01 |
| Calendar | 2015 | 1 | Public Protection | ECD | Department of Emergency Management | 351 | Municipal Executive Association - Miscellaneous | 0900 | Management | 0922 | Manager I | 30025 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Calendar | 2015 | 7 | General City Responsibilities | UNA | General Fund Unallocated | 790 | SEIU - Miscellaneous, Local 1021 | 3200 | Recreation | 3280 | Assistant Recreation Director | 49784 | 0 | 0 | 0 | 0 | 0 | 0 | 1.27 | 1.27 | 1.27 |
| Calendar | 2015 | 4 | Community Health | DPH | Public Health | 250 | SEIU - Health Workers, Local 1021 | 2600 | Dietary & Food | 2654 | Cook | 26768 | 0 | 0 | 2.21 | 2.21 | 0 | 0 | 0.17 | 0.17 | 2.38 |
(省略了 42979 行)
为了便于比较,我们将专注于那些工作时间相当于至少半年的人。 最低工资约为每小时 10 美元,52 周每周 20 小时,工资约为 1 万美元。
```py
sf2015 = sf2015.where('Salaries', are.above(10000))
sf2015.num_rows
36569
```
### 总体和参数
让这张超过 36500 行的表格成为我们的总体。 这是总薪资的直方图。
```py
sf_bins = np.arange(0, 700000, 25000)
sf2015.select('Total Compensation').hist(bins=sf_bins)
```
虽然大部分值都低于 300,000 美元,但有一些还是比较高的。 例如,首席投资官的总薪资不多是 65 万美元。 这就是为什么横轴延伸到了 700,000 美元。
```py
sf2015.sort('Total Compensation', descending=True).show(2)
```
| Year Type | Year | Organization Group Code | Organization Group | Department Code | Department | Union Code | Union | Job Family Code | Job Family | Job Code | Job | Employee Identifier | Salaries | Overtime | Other Salaries | Total Salary | Retirement | Health/Dental | Other Benefits | Total Benefits | Total Compensation |
| --- | --- |
| Calendar | 2015 | 6 | General Administration & Finance | RET | Retirement System | 351 | Municipal Executive Association - Miscellaneous | 1100 | Administrative & Mgmt (Unrep) | 1119 | Chief Investment Officer | 46881 | 507832 | 0 | 0 | 507832 | 105053 | 12424.5 | 23566.2 | 141044 | 648875 |
| Calendar | 2015 | 6 | General Administration & Finance | ADM | General Services Agency - City Admin | 164 | Physicians and Dentists - Miscellaneous | 2500 | Med Therapy & Auxiliary | 2598 | Asst Med Examiner | 1016 | 279311 | 3829.36 | 114434 | 397574 | 56211.6 | 12424.5 | 14299.1 | 82935.2 | 480509 |
(省略了 36567 行)
现在让参数为总薪资的中位数。
既然我们有能力从总体中得到所有数据,我们可以简单计算参数:
```py
pop_median = percentile(50, sf2015.column('Total Compensation'))
pop_median
110305.78999999999
```
所有员工的薪酬总额的中位数刚刚超过 110,300 美元。
从实际的角度来看,我们没有理由抽取样本来估计这个参数,因为我们只是知道它的值。 但在本节中,我们假装不知道这个值,看看我们如何根据随机样本来估计它。
在后面的章节中,我们将回到现实,在参数未知的情况下工作。 就目前而言,我们是无所不知的。
### 随机样本和估计
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册