Unverified commit 7be43a0f, authored by 飞龙, committed by GitHub

Merge pull request #6 from alohahahaha/alohahahaha-patch-1

# Visualizing the distribution of a dataset

When dealing with a set of data, often the first thing you’ll want to do is get a sense for how the variables are distributed. This chapter of the tutorial gives a brief introduction to some of the tools in seaborn for examining univariate and bivariate distributions. You may also want to look at the [categorical plots](categorical.html#categorical-tutorial) chapter for examples of functions that make it easy to compare the distribution of a variable across levels of other variables.
```py
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

sns.set(color_codes=True)
```
## Plotting univariate distributions

The most convenient way to take a quick look at a univariate distribution in seaborn is the [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot") function. By default, this draws a [histogram](https://en.wikipedia.org/wiki/Histogram) and fits a [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation) (KDE).

```py
x = np.random.normal(size=100)
sns.distplot(x);
```
![http://seaborn.pydata.org/_images/distributions_6_0.png](img/fea324aca2ed4416872749b8352a5412.jpg)
### Histograms

Histograms are likely familiar, and a `hist` function already exists in matplotlib. A histogram represents the distribution of data by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin.

To illustrate this, let’s remove the density curve and add a rug plot, which draws a small vertical tick at each observation. You can make the rug plot itself with the [`rugplot()`](../generated/seaborn.rugplot.html#seaborn.rugplot "seaborn.rugplot") function, but it is also available in [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot"):

```py
sns.distplot(x, kde=False, rug=True);
```
![http://seaborn.pydata.org/_images/distributions_8_0.png](img/3a0a2053efeea3a9932d764e2d71470d.jpg)
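The binning step described above can be sketched without any plotting library. The following is a minimal illustration that assumes equal-width bins spanning the observed range; the helper name `histogram_counts` is hypothetical and is not part of seaborn or matplotlib.

```python
import random


def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        raise ValueError("all values are identical; no range to bin")
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of a bin of its own.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return counts, edges


rng = random.Random(0)
sample = [rng.gauss(0, 1) for _ in range(100)]
counts, edges = histogram_counts(sample, 10)
```

Each entry of `counts` is the height of one bar, and consecutive pairs in `edges` give the left and right side of that bar.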
When drawing histograms, the main choice you have is the number of bins to use and where to place them. [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot") uses a simple rule to guess a reasonable number by default, but trying more or fewer bins might reveal other features in the data:

```py
sns.distplot(x, bins=20, kde=False, rug=True);
```
![http://seaborn.pydata.org/_images/distributions_10_0.png](img/5193c672119d848c7926379d43f7f0cc.jpg)
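The text does not say which rule is used to pick the default number of bins. As an illustration of what such a rule can look like, here is a sketch of the Freedman-Diaconis rule, which derives a bin width from the interquartile range and the sample size. Treating this as the rule behind `distplot()` is an assumption, and the function name below is hypothetical.

```python
import math
import random
import statistics


def freedman_diaconis_bins(values):
    """Suggest a bin count from the interquartile range and the sample size."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    bin_width = 2 * (q3 - q1) / n ** (1 / 3)
    if bin_width == 0:
        # Degenerate spread: fall back to a square-root rule.
        return max(1, int(math.sqrt(n)))
    return max(1, math.ceil((max(values) - min(values)) / bin_width))


rng = random.Random(0)
sample = [rng.gauss(0, 1) for _ in range(100)]
n_bins = freedman_diaconis_bins(sample)
```

A rule like this only gives a starting point; passing an explicit `bins` value, as in the example above, remains the way to explore alternatives.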
### Kernel density estimation

The kernel density estimate may be less familiar, but it can be a useful tool for plotting the shape of a distribution. Like a histogram, a KDE plot encodes the density of observations on one axis with height along the other axis:

```py
sns.distplot(x, hist=False, rug=True);
```
![http://seaborn.pydata.org/_images/distributions_12_0.png](img/a6d422236da60cc9bd01d12080b60453.jpg)
Drawing a KDE is more computationally involved than drawing a histogram. Each observation is first replaced with a normal (Gaussian) curve centered at that value:

```py
x = np.random.normal(0, 1, size=30)
bandwidth = 1.06 * x.std() * x.size ** (-1 / 5.)
# ... (lines collapsed in the diff view)
sns.rugplot(x, color=".2", linewidth=3);
```
![http://seaborn.pydata.org/_images/distributions_14_0.png](img/31ee2d7a3dfda467565a2053ac19a38f.jpg)
Next, these curves are summed to compute the value of the density at each point in the support grid. The resulting curve is then normalized so that the area under it is equal to 1:

```py
from scipy.integrate import trapz  # newer SciPy releases expose this as `trapezoid`
# ... (lines collapsed in the diff view)
plt.plot(support, density);
```
![http://seaborn.pydata.org/_images/distributions_16_0.png](img/d0ff3115fb5935fe56c1bb8123d5ddce.jpg)
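The two steps described above (one Gaussian curve per observation, then summing and normalizing) can be sketched with the standard library alone. The bandwidth formula is the one shown in the tutorial code; the support grid from -4 to 4 with 200 points and the helper names are assumptions made for this sketch.

```python
import math
import random
import statistics


def gaussian_pdf(u, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at u."""
    return math.exp(-0.5 * ((u - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def trapezoid_area(ys, xs):
    """Area under the points (xs, ys) using the trapezoid rule."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1))


rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(30)]

# Same reference rule as in the tutorial; pstdev matches NumPy's default x.std().
bandwidth = 1.06 * statistics.pstdev(x) * len(x) ** (-1 / 5)
support = [-4 + 8 * i / 199 for i in range(200)]  # assumed evaluation grid

# Step 1: one Gaussian kernel per observation, evaluated on the grid.
kernels = [[gaussian_pdf(s, x_i, bandwidth) for s in support] for x_i in x]

# Step 2: sum the kernels point by point, then rescale to unit area.
density = [sum(column) for column in zip(*kernels)]
area = trapezoid_area(density, support)
density = [d / area for d in density]
```

Plotting `density` against `support` would give the kind of smooth curve shown in the figure above.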
If we use the [`kdeplot()`](../generated/seaborn.kdeplot.html#seaborn.kdeplot "seaborn.kdeplot") function in seaborn, we get the same curve. This function is used by [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot"), but it provides a more direct interface with easier access to other options when you just want the density estimate:

```py
sns.kdeplot(x, shade=True);
```
![http://seaborn.pydata.org/_images/distributions_18_0.png](img/247df80468d3edbc28836cb1cc56c81c.jpg)
The bandwidth (`bw`) parameter of the KDE controls how tightly the estimation is fit to the data, much like the bin size in a histogram. It corresponds to the width of the kernels we plotted above. The default behavior tries to guess a good value using a common reference rule, but it may be helpful to try larger or smaller values:

```py
sns.kdeplot(x)
sns.kdeplot(x, bw=.2, label="bw: 0.2")
# ... (lines collapsed in the diff view)
plt.legend();
```
![http://seaborn.pydata.org/_images/distributions_20_0.png](img/8a713fe4da039acf9c3a4e70b274b60a.jpg)
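To make the effect of the bandwidth concrete, the sketch below evaluates a Gaussian KDE for three well-separated points and counts the local maxima of the resulting curve. The data, the grid, and the helper names are made up for this illustration.

```python
import math


def kde(grid, data, bw):
    """Gaussian kernel density estimate of data, evaluated at each grid point."""
    norm = len(data) * bw * math.sqrt(2 * math.pi)
    return [sum(math.exp(-0.5 * ((g - d) / bw) ** 2) for d in data) / norm for g in grid]


def count_peaks(ys):
    """Number of interior points that are strictly higher than both neighbours."""
    return sum(1 for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] > ys[i + 1])


data = [-2.0, 0.0, 2.0]
grid = [-6 + 0.05 * i for i in range(241)]

narrow = count_peaks(kde(grid, data, bw=0.2))  # one bump per observation
wide = count_peaks(kde(grid, data, bw=2.0))    # bumps merge into a single mode
```

A small bandwidth follows individual observations closely, while a large one smooths them into a single broad shape, which is the trade-off the paragraph above describes.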
As you can see above, the nature of the Gaussian KDE process means that estimation extends past the largest and smallest values in the dataset. The `cut` parameter controls how far past the extreme values the curve is drawn; however, this only influences how the curve is drawn and not how it is fit:

```py
sns.kdeplot(x, shade=True, cut=0)
sns.rugplot(x);
```
![http://seaborn.pydata.org/_images/distributions_22_0.png](img/63e498131614f726dd72a90161b58971.jpg)
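One way to picture `cut` is as a multiplier on the bandwidth that sets where the evaluation grid starts and stops. The helper below is a hypothetical sketch of that idea, based on the assumption that the curve is drawn out to `cut * bw` beyond the extreme data points; it is not seaborn's own code.

```python
def kde_support_limits(data, bw, cut=3):
    """Range over which the density curve is drawn; the fit itself is unchanged."""
    return min(data) - cut * bw, max(data) + cut * bw


data = [1.0, 4.0]
limits_default = kde_support_limits(data, bw=0.5)      # extends past the data
limits_cut0 = kde_support_limits(data, bw=0.5, cut=0)  # stops at the data limits
```

The bandwidth is the same in both calls, so the estimated density at any shared point is identical; only the plotted range differs.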
### Fitting parametric distributions

You can also use [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot") to fit a parametric distribution to a dataset and visually evaluate how closely it corresponds to the observed data:

```py
x = np.random.gamma(6, size=200)
sns.distplot(x, kde=False, fit=stats.gamma);
```
![http://seaborn.pydata.org/_images/distributions_24_0.png](img/cf48dc45f5484db58f3d310e434b11a2.jpg)
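To give a feel for what fitting a parametric distribution involves, here is a small sketch that estimates gamma parameters with the method of moments. This is a simpler technique swapped in for illustration, not the maximum-likelihood fit that `scipy.stats` distributions perform, and the function name is hypothetical.

```python
import random
import statistics


def fit_gamma_moments(values):
    """Method-of-moments estimate of gamma shape and scale."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    shape = mean ** 2 / var  # for a gamma: mean = shape * scale
    scale = var / mean       # and variance = shape * scale ** 2
    return shape, scale


rng = random.Random(0)
x = [rng.gammavariate(6, 1) for _ in range(200)]  # shape 6, scale 1
shape, scale = fit_gamma_moments(x)
```

With 200 observations the estimated shape should land in the neighbourhood of 6, and overlaying the fitted density on the histogram is what allows the visual check described above.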
## Plotting bivariate distributions

It can also be useful to visualize the bivariate distribution of two variables. The easiest way to do this in seaborn is the [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function, which creates a multi-panel figure that shows both the bivariate (or joint) relationship between two variables and the univariate (or marginal) distribution of each on separate axes.

```py
mean, cov = [0, 1], [(1, .5), (.5, 1)]
# ... (lines collapsed in the diff view)
df = pd.DataFrame(data, columns=["x", "y"])
```
### Scatterplots

The most familiar way to visualize a bivariate distribution is a scatterplot, where each observation is shown with a point at the _x_ and _y_ values. This is analogous to a rug plot in two dimensions. You can draw a scatterplot with the matplotlib `plt.scatter` function, and it is also the default kind of plot shown by the [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function:

```py
sns.jointplot(x="x", y="y", data=df);
```
![http://seaborn.pydata.org/_images/distributions_28_0.png](img/66ba868aeef60b82d90c872e188217ed.jpg)
### Hexbin plots

The bivariate analogue of a histogram is known as a “hexbin” plot, because it shows the counts of observations that fall within hexagonal bins. This plot works best with relatively large datasets. It’s available through the matplotlib `plt.hexbin` function and as a style in [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot"). It looks best with a white background:

```py
x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("white"):
    ...  # body collapsed in the diff view
```
![http://seaborn.pydata.org/_images/distributions_30_0.png](img/621cac508b507f43ba50f91290aea5fd.jpg)
### Kernel density estimation

It is also possible to use the kernel density estimation procedure described above to visualize a bivariate distribution. In seaborn, this kind of plot is shown with a contour plot and is available as a style in [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot"):

```py
sns.jointplot(x="x", y="y", data=df, kind="kde");
```
![http://seaborn.pydata.org/_images/distributions_32_0.png](img/3fa9b8716f00e81aa6ca6864cb110e2b.jpg)
You can also draw a two-dimensional kernel density plot with the [`kdeplot()`](../generated/seaborn.kdeplot.html#seaborn.kdeplot "seaborn.kdeplot") function. This allows you to draw this kind of plot onto a specific (and possibly already existing) matplotlib axes, whereas the [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function manages its own figure:

```py
f, ax = plt.subplots(figsize=(6, 6))
# ... (lines collapsed in the diff view)
sns.rugplot(df.y, vertical=True, ax=ax);
```
![http://seaborn.pydata.org/_images/distributions_34_0.png](img/5bbf1afea90de1dcab11584fb0169efe.jpg)
If you wish to show the bivariate density more continuously, you can simply increase the number of contour levels:

```py
f, ax = plt.subplots(figsize=(6, 6))
# ... (lines collapsed in the diff view)
sns.kdeplot(df.x, df.y, cmap=cmap, n_levels=60, shade=True);
```
![http://seaborn.pydata.org/_images/distributions_36_0.png](img/fd8b7fa16dccb291fe1a2148a45e3eba.jpg)
The [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function uses a [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid") to manage the figure. For more flexibility, you may want to draw your figure by using [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid") directly. [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") returns the [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid") object after plotting, which you can use to add more layers or to tweak other aspects of the visualization:

```py
g = sns.jointplot(x="x", y="y", data=df, kind="kde", color="m")
# ... (lines collapsed in the diff view)
g.set_axis_labels("$X$", "$Y$");
```
![http://seaborn.pydata.org/_images/distributions_38_0.png](img/aeaafccce597b72105feb6cf712b0ca2.jpg)
## Visualizing pairwise relationships in a dataset

To plot multiple pairwise bivariate distributions in a dataset, you can use the [`pairplot()`](../generated/seaborn.pairplot.html#seaborn.pairplot "seaborn.pairplot") function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes:

```py
iris = sns.load_dataset("iris")
sns.pairplot(iris);
```
![http://seaborn.pydata.org/_images/distributions_40_0.png](img/bea67bf34fcd01d7b6f454ae5f563460.jpg)
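The layout of that matrix of axes can be sketched as a plain data structure. The snippet below assumes the four numeric columns of the iris dataset and uses a hypothetical helper, `pair_grid_layout`, to record which variables and which kind of plot belong in each cell.

```python
from itertools import product

# Numeric columns of the iris dataset as loaded by seaborn.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def pair_grid_layout(cols):
    """Map each (row, col) cell to its x variable, y variable, and plot kind."""
    layout = {}
    for (row, y_var), (col, x_var) in product(enumerate(cols), repeat=2):
        # Diagonal cells pair a variable with itself, so they show its distribution.
        kind = "univariate" if row == col else "bivariate"
        layout[(row, col)] = (x_var, y_var, kind)
    return layout


layout = pair_grid_layout(columns)
```

Rows share a y variable and columns share an x variable, so the grid for four columns has 16 cells, 4 of which sit on the diagonal.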
Much like the relationship between [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") and [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid"), the [`pairplot()`](../generated/seaborn.pairplot.html#seaborn.pairplot "seaborn.pairplot") function is built on top of a [`PairGrid`](../generated/seaborn.PairGrid.html#seaborn.PairGrid "seaborn.PairGrid") object, which can be used directly for more flexibility:

```py
g = sns.PairGrid(iris)
# ... (lines collapsed in the diff view)
g.map_offdiag(sns.kdeplot, n_levels=6);
```
![http://seaborn.pydata.org/_images/distributions_42_0.png](img/c65d91122f8de69b16659df5ab31214e.jpg)