5.md

# 可视化数据集的分布

在处理一组数据时，您通常想做的第一件事就是了解变量的分布情况。本教程的这一章将简要介绍seaborn中用于检查单变量和双变量分布的一些工具。 您可能还需要查看[categorical.html]（categorical.html #categical-tutorial）章节中的函数示例，这些函数可以轻松地比较变量在其他变量级别上的分布。

import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

```

```py
sns.set(color_codes=True)

```

## 绘制单变量分布

在seaborn中想要快速查看单变量分布的最方便的方法是使用[`distplot()`]函数(../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot")。默认情况下，该方法将会绘制直方图[histogram](https://en.wikipedia.org/wiki/Histogram)并拟合[内核密度估计] [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation) (KDE).

```py
x = np.random.normal(size=100)
sns.distplot(x);

```

![http://seaborn.pydata.org/_images/distributions_6_0.png](img/fea324aca2ed4416872749b8352a5412.jpg)

### 直方图

Histograms are likely familiar, and a `hist` function already exists in matplotlib. A histogram represents the distribution of data by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin.

To illustrate this, let’s remove the density curve and add a rug plot, which draws a small vertical tick at each observation. You can make the rug plot itself with the [`rugplot()`](../generated/seaborn.rugplot.html#seaborn.rugplot "seaborn.rugplot") function, but it is also available in [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot"):

```py
sns.distplot(x, kde=False, rug=True);

```

![http://seaborn.pydata.org/_images/distributions_8_0.png](img/3a0a2053efeea3a9932d764e2d71470d.jpg)

When drawing histograms, the main choice you have is the number of bins to use and where to place them. [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot") uses a simple rule to make a good guess for what the right number is by default, but trying more or fewer bins might reveal other features in the data:

```py
sns.distplot(x, bins=20, kde=False, rug=True);

```

![http://seaborn.pydata.org/_images/distributions_10_0.png](img/5193c672119d848c7926379d43f7f0cc.jpg)

### Kernel density estimation

The kernel density estimate may be less familiar, but it can be a useful tool for plotting the shape of a distribution. Like the histogram, the KDE plots encode the density of observations on one axis with height along the other axis:

```py
sns.distplot(x, hist=False, rug=True);

```

![http://seaborn.pydata.org/_images/distributions_12_0.png](img/a6d422236da60cc9bd01d12080b60453.jpg)

Drawing a KDE is more computationally involved than drawing a histogram. What happens is that each observation is first replaced with a normal (Gaussian) curve centered at that value:

```py
x = np.random.normal(0, 1, size=30)
bandwidth = 1.06 * x.std() * x.size ** (-1 / 5.)
support = np.linspace(-4, 4, 200)

kernels = []
for x_i in x:

    kernel = stats.norm(x_i, bandwidth).pdf(support)
    kernels.append(kernel)
    plt.plot(support, kernel, color="r")

sns.rugplot(x, color=".2", linewidth=3);

```

![http://seaborn.pydata.org/_images/distributions_14_0.png](img/31ee2d7a3dfda467565a2053ac19a38f.jpg)

Next, these curves are summed to compute the value of the density at each point in the support grid. The resulting curve is then normalized so that the area under it is equal to 1:

```py
from scipy.integrate import trapz
density = np.sum(kernels, axis=0)
density /= trapz(density, support)
plt.plot(support, density);

```

![http://seaborn.pydata.org/_images/distributions_16_0.png](img/d0ff3115fb5935fe56c1bb8123d5ddce.jpg)

We can see that if we use the [`kdeplot()`](../generated/seaborn.kdeplot.html#seaborn.kdeplot "seaborn.kdeplot") function in seaborn, we get the same curve. This function is used by [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot"), but it provides a more direct interface with easier access to other options when you just want the density estimate:

```py
sns.kdeplot(x, shade=True);

```

![http://seaborn.pydata.org/_images/distributions_18_0.png](img/247df80468d3edbc28836cb1cc56c81c.jpg)

The bandwidth (`bw`) parameter of the KDE controls how tightly the estimation is fit to the data, much like the bin size in a histogram. It corresponds to the width of the kernels we plotted above. The default behavior tries to guess a good value using a common reference rule, but it may be helpful to try larger or smaller values:

```py
sns.kdeplot(x)
sns.kdeplot(x, bw=.2, label="bw: 0.2")
sns.kdeplot(x, bw=2, label="bw: 2")
plt.legend();

```

![http://seaborn.pydata.org/_images/distributions_20_0.png](img/8a713fe4da039acf9c3a4e70b274b60a.jpg)

As you can see above, the nature of the Gaussian KDE process means that estimation extends past the largest and smallest values in the dataset. It’s possible to control how far past the extreme values the curve is drawn with the `cut` parameter; however, this only influences how the curve is drawn and not how it is fit:

```py
sns.kdeplot(x, shade=True, cut=0)
sns.rugplot(x);

```

![http://seaborn.pydata.org/_images/distributions_22_0.png](img/63e498131614f726dd72a90161b58971.jpg)

### Fitting parametric distributions

You can also use [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot") to fit a parametric distribution to a dataset and visually evaluate how closely it corresponds to the observed data:

```py
x = np.random.gamma(6, size=200)
sns.distplot(x, kde=False, fit=stats.gamma);

```

![http://seaborn.pydata.org/_images/distributions_24_0.png](img/cf48dc45f5484db58f3d310e434b11a2.jpg)

## Plotting bivariate distributions

It can also be useful to visualize a bivariate distribution of two variables. The easiest way to do this in seaborn is to just use the [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function, which creates a multi-panel figure that shows both the bivariate (or joint) relationship between two variables along with the univariate (or marginal) distribution of each on separate axes.

```py
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])

```

### Scatterplots

The most familiar way to visualize a bivariate distribution is a scatterplot, where each observation is shown with point at the _x_ and _y_ values. This is analgous to a rug plot on two dimensions. You can draw a scatterplot with the matplotlib `plt.scatter` function, and it is also the default kind of plot shown by the [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function:

```py
sns.jointplot(x="x", y="y", data=df);

```

![http://seaborn.pydata.org/_images/distributions_28_0.png](img/66ba868aeef60b82d90c872e188217ed.jpg)

### Hexbin plots

The bivariate analogue of a histogram is known as a “hexbin” plot, because it shows the counts of observations that fall within hexagonal bins. This plot works best with relatively large datasets. It’s available through the matplotlib `plt.hexbin` function and as a style in [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot"). It looks best with a white background:

```py
x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("white"):
    sns.jointplot(x=x, y=y, kind="hex", color="k");

```

![http://seaborn.pydata.org/_images/distributions_30_0.png](img/621cac508b507f43ba50f91290aea5fd.jpg)

### Kernel density estimation

It is also possible to use the kernel density estimation procedure described above to visualize a bivariate distribution. In seaborn, this kind of plot is shown with a contour plot and is available as a style in [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot"):

```py
sns.jointplot(x="x", y="y", data=df, kind="kde");

```

![http://seaborn.pydata.org/_images/distributions_32_0.png](img/3fa9b8716f00e81aa6ca6864cb110e2b.jpg)

You can also draw a two-dimensional kernel density plot with the [`kdeplot()`](../generated/seaborn.kdeplot.html#seaborn.kdeplot "seaborn.kdeplot") function. This allows you to draw this kind of plot onto a specific (and possibly already existing) matplotlib axes, whereas the [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function manages its own figure:

```py
f, ax = plt.subplots(figsize=(6, 6))
sns.kdeplot(df.x, df.y, ax=ax)
sns.rugplot(df.x, color="g", ax=ax)
sns.rugplot(df.y, vertical=True, ax=ax);

```

![http://seaborn.pydata.org/_images/distributions_34_0.png](img/5bbf1afea90de1dcab11584fb0169efe.jpg)

If you wish to show the bivariate density more continuously, you can simply increase the number of contour levels:

```py
f, ax = plt.subplots(figsize=(6, 6))
cmap = sns.cubehelix_palette(as_cmap=True, dark=0, light=1, reverse=True)
sns.kdeplot(df.x, df.y, cmap=cmap, n_levels=60, shade=True);

```

![http://seaborn.pydata.org/_images/distributions_36_0.png](img/fd8b7fa16dccb291fe1a2148a45e3eba.jpg)

The [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function uses a [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid") to manage the figure. For more flexibility, you may want to draw your figure by using [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid") directly. [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") returns the [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid") object after plotting, which you can use to add more layers or to tweak other aspects of the visualization:

```py
g = sns.jointplot(x="x", y="y", data=df, kind="kde", color="m")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$X$", "$Y$");

```

![http://seaborn.pydata.org/_images/distributions_38_0.png](img/aeaafccce597b72105feb6cf712b0ca2.jpg)

## Visualizing pairwise relationships in a dataset

To plot multiple pairwise bivariate distributions in a dataset, you can use the [`pairplot()`](../generated/seaborn.pairplot.html#seaborn.pairplot "seaborn.pairplot") function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. by default, it also draws the univariate distribution of each variable on the diagonal Axes:

```py
iris = sns.load_dataset("iris")
sns.pairplot(iris);

```

![http://seaborn.pydata.org/_images/distributions_40_0.png](img/bea67bf34fcd01d7b6f454ae5f563460.jpg)

Much like the relationship between [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") and [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid"), the [`pairplot()`](../generated/seaborn.pairplot.html#seaborn.pairplot "seaborn.pairplot") function is built on top of a [`PairGrid`](../generated/seaborn.PairGrid.html#seaborn.PairGrid "seaborn.PairGrid") object, which can be used directly for more flexibility:

```py
g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels=6);

```

![http://seaborn.pydata.org/_images/distributions_42_0.png](img/c65d91122f8de69b16659df5ab31214e.jpg)