ch12.

3ae270b0 · wizardforcel · fb9501a1 · 3ae270b0

Showing with 143 additions and 0 deletions

12.md 12.md +143 -0

--- a/12.md
+++ b/12.md
@@ -935,3 +935,146 @@ results.hist(bins=np.arange(0.65, 0.85, 0.01))

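 How exactly does the sample size affect the variability of a sample mean or proportion? That is the question we will examine in the next section.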

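+## Variability of the Sample Mean
+
+By the Central Limit Theorem, the probability distribution of the mean of a large random sample is roughly normal. The bell curve is centered at the population mean. Some sample means are higher and some are lower, but the deviations from the population mean are roughly symmetric on either side, as we have already seen. Formally, probability theory shows that the sample mean is an unbiased estimate of the population mean.
+
+In our simulations, we also noticed that the means of larger samples tend to cluster more tightly around the population mean than the means of smaller samples. In this section, we will quantify the variability of the sample mean and establish a relation between that variability and the sample size.
+
+Let's start with our table of flight delays. The mean delay is about 16.7 minutes, and the distribution of delays is skewed to the right.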
+
+```py
+united = Table.read_table('united_summer2015.csv')
+delay = united.select('Delay')
+pop_mean = np.mean(delay.column('Delay'))
+pop_mean
+# 16.658155515370705
+```
+
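+Now let's take random samples and look at the probability distribution of the sample mean. As usual, we will use simulation to get an empirical approximation to this distribution.
+
+We will define a function `simulate_sample_mean` to do this, because we will vary the sample size later. Its arguments are the name of the table, the label of the column containing the variable, the sample size, and the number of simulations.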
+
+```py
+def simulate_sample_mean(table, label, sample_size, repetitions):
+    """Empirical distribution of random sample means"""
+
+    # Collect one sample mean per repetition
+    means = make_array()
+
+    for i in range(repetitions):
+        new_sample = table.sample(sample_size)
+        new_sample_mean = np.mean(new_sample.column(label))
+        means = np.append(means, new_sample_mean)
+
+    sample_means = Table().with_column('Sample Means', means)
+
+    # Display empirical histogram and print all relevant quantities
+    sample_means.hist(bins=20)
+    plots.xlabel('Sample Means')
+    plots.title('Sample Size ' + str(sample_size))
+    print("Sample size: ", sample_size)
+    print("Population mean:", np.mean(table.column(label)))
+    print("Average of sample means: ", np.mean(means))
+    print("Population SD:", np.std(table.column(label)))
+    print("SD of sample means:", np.std(means))
+```
+
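+Let's simulate the mean of a random sample of 100 delays, then of 400 delays, and finally of 625 delays. We will perform 10,000 repetitions of each of these processes. The `xlim` and `ylim` lines set the axes consistently in all the plots for ease of comparison. You can ignore those two lines of code in each cell.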
+
+```py
+simulate_sample_mean(delay, 'Delay', 100, 10000)
+plots.xlim(5, 35)
+plots.ylim(0, 0.25);
+# Sample size:  100
+# Population mean: 16.6581555154
+# Average of sample means:  16.662059
+# Population SD: 39.4801998516
+# SD of sample means: 3.90507237968
+```
+
+```py
+simulate_sample_mean(delay, 'Delay', 400, 10000)
+plots.xlim(5, 35)
+plots.ylim(0, 0.25);
+# Sample size:  400
+# Population mean: 16.6581555154
+# Average of sample means:  16.67117625
+# Population SD: 39.4801998516
+# SD of sample means: 1.98326299651
+```
+
+```py
+simulate_sample_mean(delay, 'Delay', 625, 10000)
+plots.xlim(5, 35)
+plots.ylim(0, 0.25);
+# Sample size:  625
+# Population mean: 16.6581555154
+# Average of sample means:  16.68523712
+# Population SD: 39.4801998516
+# SD of sample means: 1.60089096006
+```
+
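+You can see the Central Limit Theorem in action: the histograms of the sample means are roughly normal, even though the histogram of the delays themselves is far from normal.
+
+You can also see that each of the three histograms of sample means is centered very close to the population mean. In each case, the "average of sample means" is very close to 16.66 minutes, the population mean. Both values appear in the printout accompanying each histogram. As expected, the sample mean is an unbiased estimate of the population mean.
+
+### The SD of All the Sample Means
+
+You can also see that the histograms get narrower, and hence taller, as the sample size increases. We have seen that before, but now we will pay closer attention to the measure of spread.
+
+The SD of the population of all delays is about 40 minutes.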
+
+```py
+pop_sd = np.std(delay.column('Delay'))
+pop_sd
+# 39.480199851609314
+```
+
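+Take a look at the SDs printed with the sample mean histograms above. In all three of them, the SD of the population of delays is about 40 minutes, because all the samples were drawn from the same population.
+
+Now look at the SD of all 10,000 sample means when the sample size is 100. That SD is about one-tenth of the population SD. When the sample size is 400, the SD of all the sample means is about one-twentieth of the population SD. When the sample size is 625, the SD of the sample means is about one-twenty-fifth of the population SD. Note that 10, 20, and 25 are the square roots of 100, 400, and 625.
+
+
+It therefore seems like a good idea to compare the SD of the empirical distribution of the sample means to the quantity "population SD divided by the square root of the sample size."
+
+Here are the values. For each sample size in the first column, 10,000 random samples of that size were drawn, and the 10,000 sample means were calculated. The second column contains the SD of those 10,000 sample means. The third column contains the result of the calculation "population SD divided by the square root of the sample size."
+
+The cell takes a while to run, because it is a large simulation. But you will soon see that it is worth the wait.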
+
+
+```py
+repetitions = 10000
+sample_sizes = np.arange(25, 626, 25)
+
+sd_means = make_array()
+
+for n in sample_sizes:
+    means = make_array()
+    for i in np.arange(repetitions):
+        means = np.append(means, np.mean(delay.sample(n).column('Delay')))
+    sd_means = np.append(sd_means, np.std(means))
+
+sd_comparison = Table().with_columns(
+    'Sample Size n', sample_sizes,
+    'SD of 10,000 Sample Means', sd_means,
+    'pop_sd/sqrt(n)', pop_sd/np.sqrt(sample_sizes)
+)
+sd_comparison
+```
+
+
+| Sample Size n | SD of 10,000 Sample Means | pop_sd/sqrt(n) |
+| --- | --- | --- |
+| 25 | 7.95017 | 7.89604 |
+| 50 | 5.53425 | 5.58334 |
+| 75 | 4.54429 | 4.55878 |
+| 100 | 3.96157 | 3.94802 |
+| 125 | 3.51095 | 3.53122 |
+| 150 | 3.23949 | 3.22354 |
+| 175 | 3.00694 | 2.98442 |
+| 200 | 2.74606 | 2.79167 |
+| 225 | 2.63865 | 2.63201 |
+| 250 | 2.51853 | 2.49695 |
+
+(15 rows omitted)
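
The pattern in the table can also be checked without the flight data. The sketch below is an illustration only and is not part of the diff: it uses plain NumPy instead of the `datascience` table methods, the exponential population is a synthetic right-skewed stand-in for the delays, and the helper name `sd_of_sample_means` is hypothetical. It uses fewer repetitions than the chapter so that it runs quickly.

```python
import numpy as np

def sd_of_sample_means(population, sample_size, repetitions, rng):
    """SD of `repetitions` sample means, each sample drawn with replacement."""
    # One row per simulated sample; each row holds `sample_size` values
    samples = rng.choice(population, size=(repetitions, sample_size), replace=True)
    return np.std(samples.mean(axis=1))

rng = np.random.default_rng(0)

# Assumption: a synthetic right-skewed population, NOT the real delay data
population = rng.exponential(scale=40, size=10000)
pop_sd = np.std(population)

results = {}
for n in (100, 400, 625):
    simulated = sd_of_sample_means(population, n, 2000, rng)
    predicted = pop_sd / np.sqrt(n)  # population SD / sqrt(sample size)
    results[n] = (simulated, predicted)
    print(n, round(simulated, 3), round(predicted, 3))
```

For each sample size, the two printed numbers should be close to each other, mirroring the second and third columns of the table above.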