# Visualizing the distribution of a dataset

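When dealing with a set of data, often the first thing you will want to do is get a sense of how the variables are distributed. This chapter of the tutorial gives a brief introduction to some of the tools in seaborn for examining univariate and bivariate distributions. You may also want to look at the [categorical plots](categorical.html#categorical-tutorial) chapter for examples of functions that make it easy to compare the distribution of a variable across levels of other variables.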

```py
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

```

```py
sns.set(color_codes=True)

```

## Plotting univariate distributions

The most convenient way to take a quick look at a univariate distribution in seaborn is the [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot") function. By default, this will draw a [histogram](https://en.wikipedia.org/wiki/Histogram) and fit a [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation) (KDE).

```py
x = np.random.normal(size=100)
sns.distplot(x);

```

![http://seaborn.pydata.org/_images/distributions_6_0.png](img/fea324aca2ed4416872749b8352a5412.jpg)

### Histograms

Histograms are likely familiar, and a `hist` function already exists in matplotlib. A histogram represents the distribution of data by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin.

To illustrate this, let’s remove the density curve and add a rug plot, which draws a small vertical tick at each observation. You can make the rug plot itself with the [`rugplot()`](../generated/seaborn.rugplot.html#seaborn.rugplot "seaborn.rugplot") function, but it is also available in [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot"):

```py
sns.distplot(x, kde=False, rug=True);

```

![http://seaborn.pydata.org/_images/distributions_8_0.png](img/3a0a2053efeea3a9932d764e2d71470d.jpg)
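
The binning step itself can be reproduced without any plotting. The following is a minimal sketch that uses NumPy's `np.histogram` rather than seaborn; the seed and the choice of 10 bins are arbitrary assumptions made for illustration:

```python
import numpy as np

# Illustrative data: 100 draws from a standard normal (seed chosen arbitrarily).
rng = np.random.default_rng(0)
x = rng.normal(size=100)

# Split the data range into 10 equal-width bins and count observations per bin.
counts, edges = np.histogram(x, bins=10)

# Each bar of the histogram corresponds to one (left edge, right edge, count) triple.
# Bins are half-open [left, right), except the last one, which includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count}")
```

The bar heights drawn with `kde=False` are counts of this kind; every observation lands in exactly one bin, so the counts sum to the number of observations.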

When drawing histograms, the main choice you have is the number of bins to use and where to place them. [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot") uses a simple rule to make a good guess for what the right number is by default, but trying more or fewer bins might reveal other features in the data:

```py
sns.distplot(x, bins=20, kde=False, rug=True);

```

![http://seaborn.pydata.org/_images/distributions_10_0.png](img/5193c672119d848c7926379d43f7f0cc.jpg)
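
As an illustration of what a "simple rule" for the bin count can look like, here is a sketch of the Freedman-Diaconis rule, computed by hand and through NumPy's `np.histogram_bin_edges`. That this is the rule your seaborn version applies by default is an assumption, not something stated in this chapter:

```python
import numpy as np

# Illustrative data (seed chosen arbitrarily).
rng = np.random.default_rng(0)
x = rng.normal(size=100)

# Freedman-Diaconis: bin width = 2 * IQR * n^(-1/3).
q75, q25 = np.percentile(x, [75, 25])
bin_width = 2.0 * (q75 - q25) * x.size ** (-1.0 / 3.0)

# Number of bins needed to cover the data range at that width.
n_bins = int(np.ceil((x.max() - x.min()) / bin_width))

# NumPy implements the same rule under the name "fd".
edges = np.histogram_bin_edges(x, bins="fd")
print(n_bins, len(edges) - 1)
```

Because the width depends on the interquartile range and the sample size, larger or more tightly clustered samples get narrower bins.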

### Kernel density estimation

The kernel density estimate may be less familiar, but it can be a useful tool for plotting the shape of a distribution. Like the histogram, the KDE plots encode the density of observations on one axis with height along the other axis:

```py
sns.distplot(x, hist=False, rug=True);

```

![http://seaborn.pydata.org/_images/distributions_12_0.png](img/a6d422236da60cc9bd01d12080b60453.jpg)

Drawing a KDE is more computationally involved than drawing a histogram. What happens is that each observation is first replaced with a normal (Gaussian) curve centered at that value:

```py
x = np.random.normal(0, 1, size=30)
bandwidth = 1.06 * x.std() * x.size ** (-1 / 5.)
support = np.linspace(-4, 4, 200)

kernels = []
for x_i in x:
    kernel = stats.norm(x_i, bandwidth).pdf(support)
    kernels.append(kernel)
    plt.plot(support, kernel, color="r")

sns.rugplot(x, color=".2", linewidth=3);

```

![http://seaborn.pydata.org/_images/distributions_14_0.png](img/31ee2d7a3dfda467565a2053ac19a38f.jpg)

Next, these curves are summed to compute the value of the density at each point in the support grid. The resulting curve is then normalized so that the area under it is equal to 1:

```py
from scipy.integrate import trapz
density = np.sum(kernels, axis=0)
density /= trapz(density, support)
plt.plot(support, density);

```

![http://seaborn.pydata.org/_images/distributions_16_0.png](img/d0ff3115fb5935fe56c1bb8123d5ddce.jpg)
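
The two steps above correspond to the usual closed form of a Gaussian kernel density estimate. In the notation assumed here (not used elsewhere in this chapter), $x_1, \dots, x_n$ are the observations, $h$ is the bandwidth, and $t$ is a point on the support grid:

$$
\hat{f}(t) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h\sqrt{2\pi}} \exp\left(-\frac{(t - x_i)^2}{2h^2}\right)
$$

Each kernel integrates to 1 over the whole real line, so the sum of $n$ kernels integrates to $n$. Dividing by the numerically computed area, as the code does, is therefore equivalent to dividing by $n$, up to the small amount of mass that falls outside the finite support grid.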

We can see that if we use the [`kdeplot()`](../generated/seaborn.kdeplot.html#seaborn.kdeplot "seaborn.kdeplot") function in seaborn, we get the same curve. This function is used by [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot"), but it provides a more direct interface with easier access to other options when you just want the density estimate:

```py
sns.kdeplot(x, shade=True);

```

![http://seaborn.pydata.org/_images/distributions_18_0.png](img/247df80468d3edbc28836cb1cc56c81c.jpg)

The bandwidth (`bw`) parameter of the KDE controls how tightly the estimation is fit to the data, much like the bin size in a histogram. It corresponds to the width of the kernels we plotted above. The default behavior tries to guess a good value using a common reference rule, but it may be helpful to try larger or smaller values:

```py
sns.kdeplot(x)
sns.kdeplot(x, bw=.2, label="bw: 0.2")
sns.kdeplot(x, bw=2, label="bw: 2")
plt.legend();

```

![http://seaborn.pydata.org/_images/distributions_20_0.png](img/8a713fe4da039acf9c3a4e70b274b60a.jpg)
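
The effect of the bandwidth can also be checked numerically. The sketch below uses a hypothetical helper, `gaussian_kde_curve`, which treats `bw` directly as the standard deviation of each Gaussian kernel; seaborn may interpret a scalar `bw` differently depending on the backend it uses, so take this as an illustration of the idea rather than a reproduction of `kdeplot()`:

```python
import numpy as np

def gaussian_kde_curve(data, support, bw):
    """Average of Gaussian kernels with standard deviation `bw`, one per observation."""
    z = (support[:, None] - data[None, :]) / bw
    kernels = np.exp(-0.5 * z ** 2) / (bw * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

def area(density, support):
    """Trapezoidal approximation of the area under the curve."""
    return float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(support)))

# Illustrative data (seed chosen arbitrarily) and a grid wide enough for both curves.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
support = np.linspace(-15, 15, 3001)

narrow = gaussian_kde_curve(x, support, bw=0.2)  # follows individual points closely
wide = gaussian_kde_curve(x, support, bw=2.0)    # heavily smoothed

print(narrow.max(), wide.max())
print(area(narrow, support), area(wide, support))
```

Both curves enclose the same area; the narrow bandwidth concentrates it in tall, bumpy peaks around the observations, while the wide bandwidth spreads it into a low, smooth curve.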

As you can see above, the nature of the Gaussian KDE process means that estimation extends past the largest and smallest values in the dataset. It’s possible to control how far past the extreme values the curve is drawn with the `cut` parameter; however, this only influences how the curve is drawn and not how it is fit:

```py
sns.kdeplot(x, shade=True, cut=0)
sns.rugplot(x);

```

![http://seaborn.pydata.org/_images/distributions_22_0.png](img/63e498131614f726dd72a90161b58971.jpg)
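
One way to picture `cut` is as a setting that only changes the grid on which the fitted curve is evaluated. The sketch below assumes the grid extends `cut * bw` beyond the smallest and largest observations; `support_grid` is a hypothetical helper written for this illustration, and the bandwidth is the reference rule used earlier in this chapter:

```python
import numpy as np

# Illustrative data (seed chosen arbitrarily).
rng = np.random.default_rng(0)
x = rng.normal(size=30)

# Same reference-rule bandwidth as in the kernel example above.
bw = 1.06 * x.std() * x.size ** (-1 / 5)

def support_grid(data, bw, cut, gridsize=200):
    """Evaluation grid reaching `cut * bw` past the extreme observations (assumed behavior)."""
    return np.linspace(data.min() - cut * bw, data.max() + cut * bw, gridsize)

wide = support_grid(x, bw, cut=3)   # curve drawn well past the data
tight = support_grid(x, bw, cut=0)  # curve stops exactly at the extremes

print(wide[0], wide[-1])
print(tight[0], tight[-1])
```

The bandwidth is identical in both cases, which is why the shape of the estimate does not change; only the portion of the curve that gets drawn does.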

### Fitting parametric distributions

You can also use [`distplot()`](../generated/seaborn.distplot.html#seaborn.distplot "seaborn.distplot") to fit a parametric distribution to a dataset and visually evaluate how closely it corresponds to the observed data:

```py
x = np.random.gamma(6, size=200)
sns.distplot(x, kde=False, fit=stats.gamma);

```

![http://seaborn.pydata.org/_images/distributions_24_0.png](img/cf48dc45f5484db58f3d310e434b11a2.jpg)
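
To see roughly what fitting a parametric distribution involves, the following sketch calls SciPy directly: `stats.gamma.fit` estimates the parameters and `stats.gamma.pdf` evaluates the fitted density on a grid. That `distplot()` performs equivalent steps internally is an assumption here; the seed and grid size are arbitrary:

```python
import numpy as np
from scipy import stats

# Illustrative data drawn from a gamma distribution with shape 6 (seed chosen arbitrarily).
rng = np.random.default_rng(0)
x = rng.gamma(6, size=200)

# Maximum-likelihood fit; for the gamma distribution this returns (shape, loc, scale).
params = stats.gamma.fit(x)

# Evaluate the fitted density across the observed range of the data.
grid = np.linspace(x.min(), x.max(), 200)
pdf = stats.gamma.pdf(grid, *params)

print(params)
```

Plotting `pdf` against `grid` on top of a density-normalized histogram gives the kind of visual comparison described above.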

## Plotting bivariate distributions

It can also be useful to visualize the bivariate distribution of two variables. The easiest way to do this in seaborn is the [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function, which creates a multi-panel figure that shows both the bivariate (or joint) relationship between two variables and the univariate (or marginal) distribution of each on separate axes.

```py
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])

```
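
The distinction between the joint and the marginal distributions can be made concrete with counts. This is a minimal sketch using NumPy's `np.histogram2d` rather than seaborn; the seed and the 10-by-10 binning are arbitrary assumptions:

```python
import numpy as np

# Same mean and covariance as above, drawn with an arbitrarily seeded generator.
rng = np.random.default_rng(0)
mean, cov = [0, 1], [(1, .5), (.5, 1)]
x, y = rng.multivariate_normal(mean, cov, 200).T

# Joint distribution: counts of observations in a 10 x 10 grid of (x, y) bins.
joint, x_edges, y_edges = np.histogram2d(x, y, bins=10)

# Marginal distributions: collapse the joint counts over the other variable.
marginal_x = joint.sum(axis=1)  # sum over y bins
marginal_y = joint.sum(axis=0)  # sum over x bins

print(marginal_x)
print(marginal_y)
```

The marginal counts match the one-dimensional histograms of `x` and `y` computed on the same bin edges, which is the relationship between the central panel and the side panels of a joint plot.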

### Scatterplots

The most familiar way to visualize a bivariate distribution is a scatterplot, where each observation is shown with a point at the _x_ and _y_ values. This is analogous to a rug plot in two dimensions. You can draw a scatterplot with the matplotlib `plt.scatter` function, and it is also the default kind of plot shown by the [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function:

```py
sns.jointplot(x="x", y="y", data=df);

```

![http://seaborn.pydata.org/_images/distributions_28_0.png](img/66ba868aeef60b82d90c872e188217ed.jpg)

### Hexbin plots

The bivariate analogue of a histogram is known as a “hexbin” plot, because it shows the counts of observations that fall within hexagonal bins. This plot works best with relatively large datasets. It’s available through the matplotlib `plt.hexbin` function and as a style in [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot"). It looks best with a white background:

```py
x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("white"):
    sns.jointplot(x=x, y=y, kind="hex", color="k");

```

![http://seaborn.pydata.org/_images/distributions_30_0.png](img/621cac508b507f43ba50f91290aea5fd.jpg)

### Kernel density estimation

It is also possible to use the kernel density estimation procedure described above to visualize a bivariate distribution. In seaborn, this kind of plot is shown with a contour plot and is available as a style in [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot"):

```py
sns.jointplot(x="x", y="y", data=df, kind="kde");

```

![http://seaborn.pydata.org/_images/distributions_32_0.png](img/3fa9b8716f00e81aa6ca6864cb110e2b.jpg)
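
The univariate procedure extends to two dimensions by placing a two-dimensional Gaussian bump on each observation. The sketch below is a NumPy-only illustration with a hypothetical helper, `kde_2d`, that uses a product of two one-dimensional Gaussian kernels and applies the earlier reference-rule bandwidth to each axis separately; both choices are assumptions and need not match what seaborn does internally:

```python
import numpy as np

def kde_2d(points, grid_x, grid_y, bw_x, bw_y):
    """Average of product-Gaussian kernels, evaluated on a (len(grid_x), len(grid_y)) grid."""
    zx = (grid_x[:, None] - points[None, :, 0]) / bw_x
    zy = (grid_y[:, None] - points[None, :, 1]) / bw_y
    kx = np.exp(-0.5 * zx ** 2) / (bw_x * np.sqrt(2 * np.pi))
    ky = np.exp(-0.5 * zy ** 2) / (bw_y * np.sqrt(2 * np.pi))
    # Entry (i, j) sums kx[i, k] * ky[j, k] over observations k, then averages.
    return kx @ ky.T / len(points)

# Same mean and covariance as above, drawn with an arbitrarily seeded generator.
rng = np.random.default_rng(0)
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = rng.multivariate_normal(mean, cov, 200)

# Per-axis bandwidths from the reference rule used in the univariate example.
n = len(data)
bw_x = 1.06 * data[:, 0].std() * n ** (-1 / 5)
bw_y = 1.06 * data[:, 1].std() * n ** (-1 / 5)

grid_x = np.linspace(-6, 6, 121)
grid_y = np.linspace(-5, 7, 121)
density = kde_2d(data, grid_x, grid_y, bw_x, bw_y)

# The surface should enclose a volume of approximately 1.
step_x = grid_x[1] - grid_x[0]
step_y = grid_y[1] - grid_y[0]
volume = float(density.sum() * step_x * step_y)
print(volume)
```

A contour plot of `density` over the grid shows level curves of this surface, which is the kind of picture `kind="kde"` produces.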

You can also draw a two-dimensional kernel density plot with the [`kdeplot()`](../generated/seaborn.kdeplot.html#seaborn.kdeplot "seaborn.kdeplot") function. This allows you to draw this kind of plot onto a specific (and possibly already existing) matplotlib axes, whereas the [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function manages its own figure:

```py
f, ax = plt.subplots(figsize=(6, 6))
sns.kdeplot(df.x, df.y, ax=ax)
sns.rugplot(df.x, color="g", ax=ax)
sns.rugplot(df.y, vertical=True, ax=ax);

```

![http://seaborn.pydata.org/_images/distributions_34_0.png](img/5bbf1afea90de1dcab11584fb0169efe.jpg)

If you wish to show the bivariate density more continuously, you can simply increase the number of contour levels:

```py
f, ax = plt.subplots(figsize=(6, 6))
cmap = sns.cubehelix_palette(as_cmap=True, dark=0, light=1, reverse=True)
sns.kdeplot(df.x, df.y, cmap=cmap, n_levels=60, shade=True);

```

![http://seaborn.pydata.org/_images/distributions_36_0.png](img/fd8b7fa16dccb291fe1a2148a45e3eba.jpg)

The [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function uses a [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid") to manage the figure. For more flexibility, you may want to draw your figure by using [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid") directly. [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") returns the [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid") object after plotting, which you can use to add more layers or to tweak other aspects of the visualization:

```py
g = sns.jointplot(x="x", y="y", data=df, kind="kde", color="m")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$X$", "$Y$");

```

![http://seaborn.pydata.org/_images/distributions_38_0.png](img/aeaafccce597b72105feb6cf712b0ca2.jpg)

## Visualizing pairwise relationships in a dataset

To plot multiple pairwise bivariate distributions in a dataset, you can use the [`pairplot()`](../generated/seaborn.pairplot.html#seaborn.pairplot "seaborn.pairplot") function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes:

```py
iris = sns.load_dataset("iris")
sns.pairplot(iris);

```

![http://seaborn.pydata.org/_images/distributions_40_0.png](img/bea67bf34fcd01d7b6f454ae5f563460.jpg)
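
The layout of that matrix can be sketched in plain Python. The snippet below only enumerates which pair of variables each panel would show, using the four numeric column names of the iris dataset; it illustrates the grid structure and is not how seaborn builds the figure:

```python
from itertools import product

# Numeric columns of the iris dataset loaded above.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One panel per (row variable, column variable) pair; the diagonal pairs a variable with itself.
panels = [
    {
        "row": y_var,
        "col": x_var,
        "kind": "univariate" if x_var == y_var else "bivariate",
    }
    for y_var, x_var in product(columns, repeat=2)
]

diagonal = [p for p in panels if p["kind"] == "univariate"]
print(len(panels), len(diagonal))
```

With four numeric variables this gives a 4-by-4 grid: four diagonal panels for the univariate distributions and twelve off-diagonal panels for the pairwise relationships.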

Much like the relationship between [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") and [`JointGrid`](../generated/seaborn.JointGrid.html#seaborn.JointGrid "seaborn.JointGrid"), the [`pairplot()`](../generated/seaborn.pairplot.html#seaborn.pairplot "seaborn.pairplot") function is built on top of a [`PairGrid`](../generated/seaborn.PairGrid.html#seaborn.PairGrid "seaborn.PairGrid") object, which can be used directly for more flexibility:

```py
g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels=6);

```

![http://seaborn.pydata.org/_images/distributions_42_0.png](img/c65d91122f8de69b16659df5ab31214e.jpg)