3.8

6f2541ad · wizardforcel · ebc356b2 · 6f2541ad

Showing 125 additions and 1 deletion

3.md +125 -1

--- a/3.md
+++ b/3.md
@@ -650,7 +650,7 @@ array([[ 0. ,  0.25],
 >>> w_train = w[~random_sample] 
 ```

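-Now we need to obtain the evidential distribution of male and female heights, based on the training set:
+Now we need to obtain the empirical distribution of male and female heights, based on the training set: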

 ```py
 >>> from scipy import stats 
@@ -780,3 +780,127 @@ array([0, 0, 0, 0, 0])

 For example, using `score_examples`, we can actually get the likelihood of each sample for each label.

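+## 3.8 Using KMeans for outlier detection
+
+In this recipe, we look at both the mechanics of KMeans outlier detection and the debate around it. It is useful for isolating some types of errors, but it should be used with care.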
+
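+### Getting ready
+
+In this recipe, we use KMeans to perform outlier detection on a cluster of points. Note that there are many "camps" when it comes to outliers and outlier detection. On one hand, by removing outliers we may be removing points that were genuinely produced by the data-generating process. On the other hand, outliers may come from measurement error or some other outside factor.
+
+That is the crux of the debate. The rest of this recipe is about finding outliers, and we work under the assumption that our choice to remove them is justified.
+
+Outlier detection here means finding the centroid of the cluster and then identifying potential outliers by their distance from that centroid.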
+
+### How to do it
+
+First, we generate a single blob of 100 points, and then identify the 5 points farthest from the centroid. These are the potential outliers.
+
+```py
+>>> from sklearn.datasets import make_blobs 
+>>> X, labels = make_blobs(100, centers=1) 
+>>> import numpy as np 
+```
+
+It is important that the KMeans model has exactly one centroid. The idea is similar to a one-class SVM used for outlier detection.
+
+```py
+>>> import matplotlib.pyplot as plt  # used by the plotting code that follows
+>>> from sklearn.cluster import KMeans
+>>> kmeans = KMeans(n_clusters=1)
+>>> kmeans.fit(X)
+```
+
+Now, let's look at the plot. Looking at the points far from the center, try to guess which ones will be identified as the five outliers:
+
+```py
+>>> f, ax = plt.subplots(figsize=(7, 5)) 
+>>> ax.set_title("Blob") 
+>>> ax.scatter(X[:, 0], X[:, 1], label='Points') 
+>>> ax.scatter(kmeans.cluster_centers_[:, 0],
+                kmeans.cluster_centers_[:, 1],
+                label='Centroid',
+                color='r') 
+>>> ax.legend()
+```
+
+The following is the output:
+
+![](img/3-8-1.jpg)
+
+Now, let's identify the five points farthest from the centroid:
+
+```py
+>>> distances = kmeans.transform(X) 
+# argsort returns an array of indexes which will sort the array in ascending order 
+# so we reverse it via [::-1] and take the top five with [:5] 
+>>> sorted_idx = np.argsort(distances.ravel())[::-1][:5]
+
+```
+
+Now, let's see which points are farthest away:
+
+```py
+>>> f, ax = plt.subplots(figsize=(7, 5)) 
+>>> ax.set_title("Single Cluster") 
+>>> ax.scatter(X[:, 0], X[:, 1], label='Points') 
+>>> ax.scatter(kmeans.cluster_centers_[:, 0],
+                kmeans.cluster_centers_[:, 1],
+                label='Centroid', color='r') 
+>>> ax.scatter(X[sorted_idx][:, 0], X[sorted_idx][:, 1],
+                label='Extreme Value', edgecolors='g',
+                facecolors='none', s=100) 
+>>> ax.legend(loc='best') 
+```
+
+The following is the output:
+
+![](img/3-8-2.jpg)
+
+If we like, removing these points is easy.
+
+```py
+>>> new_X = np.delete(X, sorted_idx, axis=0)
+```
+
+Removing these points also shifts the centroid, so we refit KMeans on the pruned data.
+
+```py
+>>> new_kmeans = KMeans(n_clusters=1) 
+>>> new_kmeans.fit(new_X) 
+```
+
+Let's visualize the old and new centroids:
+
+```py
+>>> f, ax = plt.subplots(figsize=(7, 5)) 
+>>> ax.set_title("Extreme Values Removed") 
+>>> ax.scatter(new_X[:, 0], new_X[:, 1], label='Pruned Points') 
+>>> ax.scatter(kmeans.cluster_centers_[:, 0],
+               kmeans.cluster_centers_[:, 1], label='Old Centroid',
+               color='r', s=80, alpha=.5) 
+>>> ax.scatter(new_kmeans.cluster_centers_[:, 0],
+               new_kmeans.cluster_centers_[:, 1], label='New Centroid',
+               color='m', s=80, alpha=.5) 
+>>> ax.legend(loc='best') 
+```
+
+The following is the output:
+
+![](img/3-8-3.jpg)
+
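+Clearly, the centroid has not moved much, which is what we expect when removing only the five most extreme points. This process can be repeated until we are satisfied that the data is representative.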
+
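+### How it works
+
+As we have already seen, there is a fundamental connection between Gaussian distributions and KMeans clustering. Let's create an empirical Gaussian centered on the centroid and look at the probability density of each point, in particular the five points we removed. The code leaves the covariance at its default, the identity matrix, rather than estimating it from the sample. This shows that we in fact removed the values with the lowest likelihood. This relationship between distance and likelihood is very important, and it will come up often in your machine learning training.
+
+Use the following commands to create the empirical Gaussian: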
+
+```py
+>>> from scipy import stats 
+>>> emp_dist = stats.multivariate_normal(kmeans.cluster_centers_.ravel())
+>>> lowest_prob_idx = np.argsort(emp_dist.pdf(X))[:5]
+>>> np.all(X[sorted_idx] == X[lowest_prob_idx])
+True
+```
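
The link between distance and likelihood in the last step can be checked with plain NumPy. The following is a minimal sketch, not part of the original recipe: it assumes synthetic Gaussian data with a fixed seed in place of the `make_blobs` output, uses the sample mean as the centroid (which is what KMeans with a single cluster converges to), and writes out the log-density of a Gaussian with identity covariance by hand. The names `sq_dist`, `log_pdf`, and `maha` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed: an assumption, for reproducibility
X = rng.normal(size=(100, 2))        # stand-in for the single blob of 100 points
centroid = X.mean(axis=0)            # one-cluster KMeans centroid equals the sample mean

diff = X - centroid
sq_dist = np.einsum("ij,ij->i", diff, diff)   # squared Euclidean distance to the centroid

# Log-density of N(centroid, I): -0.5 * ||x - mu||^2 - (d / 2) * log(2 * pi).
# It is a decreasing function of the distance, so ranking by density
# is the same as ranking by distance in reverse.
d = X.shape[1]
log_pdf = -0.5 * sq_dist - 0.5 * d * np.log(2 * np.pi)

farthest = np.argsort(sq_dist)[::-1][:5]      # five largest distances
least_likely = np.argsort(log_pdf)[:5]        # five smallest densities
same_points = set(farthest.tolist()) == set(least_likely.tolist())

# With a covariance estimated from the sample, the density ranks points by
# Mahalanobis distance instead, which may pick a different top five.
cov = np.cov(X, rowvar=False)
maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
least_likely_cov = np.argsort(maha)[::-1][:5]
```

Under the identity covariance, `same_points` holds because the density depends on each point only through its Euclidean distance to the centroid. With an estimated covariance, `least_likely_cov` need not match `farthest`, since elongated clusters stretch the distance measure along their long axis.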