From f4c826a707fe28d70b533caf3553d04cfc13a365 Mon Sep 17 00:00:00 2001 From: wizardforcel <562826179@qq.com> Date: Mon, 29 Apr 2019 16:14:21 +0800 Subject: [PATCH] 3.3. --- docs/3.3_data.md | 220 +++++++++++++---------------------------------- 1 file changed, 61 insertions(+), 159 deletions(-) diff --git a/docs/3.3_data.md b/docs/3.3_data.md index 4153a4c..1a8db9a 100644 --- a/docs/3.3_data.md +++ b/docs/3.3_data.md @@ -1,22 +1,17 @@ -# Manipulating and Visualizing Data +# 操纵和可视化数据 -We've learned the basics of [Loading files](files.ipynb) and now it's time to reorganize the loaded data into commonly-used data structures from [NumPy](http://www.numpy.org/) and [Pandas](http://pandas.pydata.org/). To motivate the various data structures, we're going to feed them into [matplotlib](https://matplotlib.org/) for visualization. This lecture-lab is then all about steps 2, 3, and 5 from our generic data science program template: +我们已经学习了[加载文件](files.ipynb)的基础知识,现在是时候将加载的数据,从 [NumPy](http://www.numpy.org/) 和 [Pandas](http://pandas.pydata.org/) 重新组织为常用的数据结构了。 为了产生各种数据结构,我们将把它们提供给 [matplotlib](https://matplotlib.org/) 进行可视化。 然后,本实验将全面介绍我们的通用数据科学规划模板中的第 2,3 和 5 步: -1. Acquire data, which means finding a suitable file or collecting data from the web and storing in a file -2. Load data from disk and place into memory **organized into data structures** -3. Normalize, clean, or otherwise **prepare data** -4. Process the data, which can mean training a machine learning model, computing summary statistics, or optimizing a cost function -5. Emit results, which can be anything from simply printing an answer to saving data to the disk to generating a fancy **visualization** +1. 获取数据,这意味着找到合适的文件,或从 Web 收集数据并存储在文件中 +2. 从磁盘加载数据并放入**组织成数据结构**的内存 +3. 规范化,清理或以其他方式**准备数据** +4. 处理数据,这可能意味着训练机器学习模型,计算摘要统计量或优化成本函数 +5. 输出结果,可以是简单地将答案保存到磁盘上,也可以生成奇特的**可视化** -You'll learn more about step 4 in the courses on machine learning, timeseries analysis, and so on. - -TODO: missing values - delete row - insert value - -Let's get started by importing all of the packages we're going to need and setting a few parameters that make this Jupyter notebook look better: +您将在机器学习,时间序列分析等课程中了解第 4 步的更多信息。 +让我们开始导入我们需要的所有软件包,并设置一些参数,使这个 Jupyter 笔记本看起来更好: ```python import pandas @@ -27,12 +22,11 @@ pandas.options.display.max_rows = 7 # Don't display too much data (Pandas) np.set_printoptions(threshold=4) # Don't display too much data (NumPy) ``` -## Your new BFFs +## 你的新 BFF -Analytics programs tend to use lots of one- and two-dimensional arrays. 2D arrays are matrices and tables of data. 1D arrays are vectors, such as points in Euclidean space. A column or row of a table is also a 1D array. Python has lists and lists of lists that would suffice for 1D and 2D arrays, but Pandas and NumPy define similar but more capable data structures. - -Let's start with Pandas *data frames*, which are powerful tables very much like Excel tables. I also think that Pandas' `read_csv()` is the easiest way to load most kinds of data organized into rows and columns. Here's a sample data file with a list of prices over time, one data point per line and no header row: +分析程序倾向于使用大量的一维和二维数组。2D 数组是矩阵和数据表。1D 数组是向量,例如欧几里德空间中的点。表的列或行也是一维数组。 Python 有列表和列表的列表,可以满足 1D 和 2D 数组,但 Pandas 和 NumPy 定义了类似但功能更强的数据结构。 +让我们从 Pandas *数据帧*开始,它们是非常像 Excel 表的强大表格。 我还认为 Pandas 的`read_csv()`是加载组织成行和列的大多数数据的最简单方法。 这是一个示例数据文件,包含一段时间内的价格列表,每行一个数据点且没有标题行: ```python ! wc data/prices.txt @@ -53,11 +47,9 @@ Let's start with Pandas *data frames*, which are powerful tables very much like ''' ``` +(`wc`和`head`是 bash 命令,你可能会发现它们将来很有用。) -(The `wc` and `head` are bash commands that you might find useful in the future.) - -Here's how to load that file using Pandas: - +以下是使用 Pandas 加载该文件的方法: ```python prices = pandas.read_csv('data/prices.txt', header=None) @@ -76,10 +68,7 @@ prices # jupyter notebooks know how to display this nicely 345 rows × 1 columns - - -The numbers in the left column are just the index and are displayed by Pandas for your information; they are not stored in memory as part of the data structure. Let's look at the type and shape of this data structure: - +左栏中的数字只是索引,是 Pandas 会显示给您的信息;它们不作为数据结构的一部分存储在内存中。 我们来看看这个数据结构的类型和形状: ```python print "type is", type(prices) @@ -91,11 +80,9 @@ shape is (345, 1) ''' ``` +该输出表明,数据存储在`DataFrame`对象中,并且有 344 行和 1 列。 -That output indicates that the data is stored in a `DataFrame` object and there are 344 rows and one column. - -While Pandas is great for loading the data, and a few other things we'll see below, I prefer working with NumPy arrays; the actual type is called `ndarray`. Let's convert that list of prices from a data frame to a NumPy array: - +虽然 Pandas 非常适合加载数据,而我们将在下面看到其他一些内容,但我更喜欢使用 NumPy 数组;实际的类型称为`ndarray`。让我们将价格列表从数据帧转换为 NumPy 数组: ```python m = prices.as_matrix() # Convert data frame to numpy array @@ -117,10 +104,11 @@ shape is (345, 1) ``` -The printed array looks like a list of lists but it is a different data type. Just because two data structures print out in the same way, doesn't mean that they are the same kinds of objects. + -We can access the 2D NumPy arrays using array index notation *array*`[`*row*, *column*`]`: +打印的数组看起来像列表的列表,但它是不同的数据类型。 仅仅因为两个数据结构以相同的方式打印出来,并不意味着它们是相同类型的对象。 +我们可以使用数组索引表示法`array[row, column]`访问 2D NumPy 数组: ```python print m[0] # Access the first row @@ -136,9 +124,7 @@ print m[1,0] # Access the first column of the 2nd row ''' ``` - -That is a little weird though. We think of that as a 1D array or just a list, not a 2D array with a single column (shape is 345 x 1). To get NumPy to treat that as 1D, we use the `shape` attribute of the array: - +虽然这有点奇怪。 我们将其视为一维数组或仅仅是一个列表,而不是具有单列的二维数组(形状为 345 x 1)。 为了让 NumPy 将其视为 1D,我们使用数组的`shape`属性: ```python m.shape = (345,) # len(m)==345 @@ -147,14 +133,7 @@ m # array([ 0.605, 0.6 , 0.594, ..., 1.939, 1.898, 1.891]) ``` - - - - - - -Now, we can access the elements using a single index as we would expect: - +现在,我们可以像我们期望的那样使用单个索引访问元素: ```python print m[0] @@ -168,9 +147,7 @@ print m[2] ''' ``` - -A shape with an empty second parameter indicates a 1D array, which is how NumPy converts a regular Python list to an array: - +第二个参数为空的形状表示 1D 数组,这是 NumPy 将常规 Python 列表转换为数组的方式: ```python sizes = [28, 32, 34, 36, 38, 39, 40, 41] # Plain old Python list @@ -184,11 +161,7 @@ array([28, 32, 34, ..., 39, 40, 41]) ''' ``` - - - -While we're at it, here's how to convert a list of lists to a 2D NumPy array: - +虽然我们在这里,但这里是将列表的列表转换为 2D NumPy 数组的方式: ```python stuff = [ @@ -209,21 +182,11 @@ array([[ 18, 8, 307, 3504], ''' ``` +现在多个索引是有意义的。 例如,要访问包含值 3436 的元素,我们使用`m[2,3]`(第 3 行,第 4 列)。 +## 类型问题 - - - - - -Now multiple indices make sense. For example, to access the element containing value 3436, we'd use `m[2,3]` (3rd row, 4th column). - -## Types matter - -*TODO*: show `x+y` could be string, int, float, list, or numpy array. overloaded operators. e.g., x*y could be a string if x is string and y is int - - - +*TODO*:展示`x + y`可以是字符串,`int`,`float`,`list`或 numpy 数组。 重载运算符。例如,如果`x`是字符串且`y`是`int`,则`x * y`可以是字符串 ```python import numpy as np @@ -235,19 +198,20 @@ print f(3.4) X = np.array([1.2,3.0]) # a numpy array is more flexible than list of numbers print f(X) # returns array due to vector math in f()! print [f(x) for x in X] # manually apply f() to X -``` - 0.237946174816 - [ 0.25751416 -0.33333333] - [0.25751416197912252, -0.33333333333333331] - +''' +0.237946174816 +[ 0.25751416 -0.33333333] +[0.25751416197912252, -0.33333333333333331] +''' +``` -## Plotting Time Series Data -Our list of prices is representative of timeseries data, such as stock price, temperature, or population fluctuations. Matplotlib is a great library for visualizing data and in this section we're going to use it to display the prices as a timeseries using `plot()`. That function takes the X and Y coordinates as separate arrays. +## 绘制时间序列数据 -I find Matplotlib kind of mysterious, but I have learned patterns that I use over and over again, such as this timeseries plot. +我们的价格列表表示时间序列数据,例如股票价格,温度或人口波动。Matplotlib 是一个很好的数据可视化库,在本节中我们将使用它的`plot()`,将价格显示为时间序列。 该函数接受单独的数组`X`和`Y`坐标。 +我发现 Matplotlib 有点神秘,但我学会了一遍又一遍地使用的模式,比如这个时间序列的绘图。 ```python m = prices.as_matrix() # Let's convert pandas data frame to numpy array @@ -259,14 +223,11 @@ plt.ylabel("Price (dollars)") plt.show() # Show the actual plot ``` - ![png](img/3.3_data_25_0.png) +**绘制函数** -**Plotting functions** - -Sometimes we have a smooth function such as a cosine that we'd like to plot. To do that, we need to sample the function at regular intervals to collect a list of Y coordinates (like prices from before). Let's start by defining the function that maps X coordinates to Y values and then get a sample of X values at regular intervals between 0.1 and 1.1, stepping by 0.01: - +有时我们有一个平滑的函数,比如我们想绘制的余弦。为此,我们需要定期采样函数,来收集`Y`坐标列表(如之前的价格)。让我们首先定义将`X`坐标映射到`Y`值的函数,然后以 0.1 到 1.1 之间,步长 0.01 的固定间隔获取`X`值的样本: ```python def f(x): @@ -275,8 +236,7 @@ def f(x): X = np.arange(.1, 1.1, 0.01) # from .1 to 1.1 by step 0.01 ``` -There are three ways to sample the function `f()` at the coordinates contained in X, which I've delineated here. All of these 3 methods employ our Map pattern: - +有三种方法可以在`X`中包含的坐标处,采样函数`f()`,我在这里已经描述过了。 所有这三种方法都使用我们的映射模式: ```python # Get f(x) values for all x in three different ways @@ -301,9 +261,7 @@ print Y ''' ``` - -Given X and Y coordinates, we can plot the function: - +给定`X`和`Y`坐标,我们可以绘制函数: ```python plt.figure(figsize=(5, 2)) @@ -314,14 +272,12 @@ plt.ylabel("cos(3 * pi * x) / x") plt.show() ``` - ![png](img/3.3_data_31_0.png) -## Visualizing the relationship between variables - -Let's move beyond one dimensional arrays now to 2D arrays and plot one column versus another. Here is some sample car data (with a header row) with columns for miles per gallon, number of cylinders, engine horsepower, and weight in pounds: +## 可视化变量之间的关系 +让我们现在超越一维数组到二维数组并绘制一列与另一列。以下是一些样本汽车数据(带标题行),带有每加仑英里数,汽缸数,发动机马力和重量(磅)的列: ```python ! head data/cars.csv @@ -340,9 +296,7 @@ MPG,CYL,ENG,WGT ''' ``` - -We can use Pandas again to load the data into a data frame and then convert to a NumPy 2D array (a matrix): - +我们可以再次使用 Pandas 将数据加载到数据帧中,然后转换为 NumPy 2D 数组(矩阵): ```python cars = pandas.read_csv('data/cars.csv') @@ -384,12 +338,9 @@ array([[ 18., 8., 307., 3504.], ''' ``` +假设我们对汽车重量和燃油效率之间的关系感兴趣。 我们可以通过使用散点图绘制重量与效率来直观地检查这种关系。 - -Let's say we're interested in the relationship between the weight of the car and the fuel efficiency. We can examine that relationship visually by plotting weight against efficiency using a scatterplot. - -This brings us to the question of how to extract columns from numpy arrays, where each column represents the data associated with one attribute of all cars. The idea is to fix the column number but use a *wildcard* (the colon character) to indicate we want all rows: - +这就引出了如何从 numpy 数组中提取列的问题,其中每列代表与所有汽车的一个属性关联的数据。 想法是固定列号,但使用*通配符*(冒号字符)表示我们想要所有行: ```python # can do this: @@ -425,9 +376,7 @@ array([ 18., 15., 18., ..., 32., 28., 31.]) ''' ``` - -Once we have the two columns, we can use matplotlib's `scatter()` function: - +一旦我们有了两列,我们就可以使用 matplotlib 的`scatter()`函数: ```python plt.figure(figsize=(5, 2)) @@ -437,17 +386,13 @@ plt.ylabel('Miles per gallon') plt.show() ``` - ![png](img/3.3_data_40_0.png) +很棒!这显示了重量和效率之间的明确关系:汽车越重,效率越低。 -Great! This shows a clear relationship between weight and efficiency: the heavier the car, the lower the efficiency. - -It would also be interesting to know how the number of engine cylinders is related to weight and efficiency. We could go to a three-dimensional graph or even multiple graphs, but it's better to add another attribute to a single graph in this case. We could change the color according to the number of cylinders, but a better visualization would change the size of the plotted point. - -We can pull out the number of cylinders with `m[:,1]` like we did before, but we need to plot each point individually now because we have to specify different sizes. That means we need a loop around `scatter()` to pass individual X and Y coordinates rather than a list. The `s` parameter to `scatter()` is actually proportional to the area of the circle we want (see [pyplot scatter plot marker size](https://stackoverflow.com/questions/14827650/pyplot-scatter-plot-marker-size)). To accentuate the difference between engine size, I scale by .7. Here is the code to illustrate the relationship between three variables: - +了解发动机汽缸的数量与重量和效率有何关系也很有趣。 我们可以转到三维图形甚至多个图形,但在这种情况下最好将另一个属性添加到单个图形中。我们可以根据气缸数改变颜色,但更好的可视化会改变所绘制点的大小。 +我们可以像以前一样用`m[:,1]`获取气缸的数量,但我们现在需要单独绘制每个点,因为我们必须指定不同的尺寸。 这意味着我们需要一个循环在`scatter()`周围,来传递单独的`X`和`Y`坐标而不是列表。 `scatter()`的`s`参数实际上与我们想要的圆的面积成正比(参见 [pyplot 散点图的标记大小](https://stackoverflow.com/questions/14827650/pyplot-scatter-plot-marker-size))。 为了突出引擎尺寸之间的差异,我按比例 .7 缩放。以下是用于说明三个变量之间关系的代码: ```python plt.figure(figsize=(5, 2)) @@ -458,12 +403,9 @@ plt.ylabel('Miles per gallon') plt.show() ``` - ![png](img/3.3_data_42_0.png) - -When exploring data, it's often useful to know the unique set of values. For example, if would like to know the set of number of cylinders, we can use `set(`*mylist*`)`: - +在探索数据时,了解唯一值的集合通常很有用。例如,如果想知道气缸数的集合,我们可以使用`set(mylist)`: ```python m = cars.as_matrix() @@ -471,15 +413,13 @@ cyl = m[:,1] print set(cyl) ``` +有趣。 我不知道有 3 个汽缸的汽车。 -Interesting. I did not know there were cars with 3 cylinders. - -**Exercise**: Convert `cyl` to a set of integers (not floating-point values) using a map pattern. Hint: `int(3.0)` gives 3. - -## Histograms +**练习**:使用映射模式将`cyl`转换为整数集合(不是浮点值)。 提示:`int(3.0)`为 3。 -Instead of just a unique set of attribute values, we can count how many of each unique value appear in a data set. Python's `Counter` object, a kind of `dict`, knows how to count the elements. For example, here is how we'd get a dictionary-like object mapping number of cylinders to number of cars with that many cylinders: +## 直方图 +我们可以计算每个唯一值中有多少出现在数据集中,而不仅仅是属性值的唯一集合。 Python 的`Counter`对象,一种`dict`,知道如何计算元素。 例如,以下是我们如何获得类似字典的对象,将气缸数映射到具有多个气缸的汽车数量: ```python from collections import Counter @@ -490,14 +430,7 @@ Counter(cyl) # Counter({3.0: 4, 4.0: 199, 5.0: 3, 6.0: 83, 8.0: 103}) ``` - - - - - - -That works great for categorical variables, which the number of cylinders kind of is, but `Counter` is not as appropriate for numerical values. Besides, looking at the cylinder counts, it's hard to quickly understand the relative populations. Visualizing the data with a *histogram* is a much better idea and works for numerical values too. Here is the code to make a histogram of the cylinder attribute: - +这对于类别变量非常有用,类别变量的种类数是这样,但是“计数器”不适合数值。 此外,观察气缸计数,很难快速了解相关的总体。 用*直方图*可视化数据是一个更好的想法,也适用于数值。 以下是为气缸属性制作直方图的代码: ```python plt.figure(figsize=(5, 2)) @@ -507,12 +440,9 @@ plt.ylabel('Number of cars') plt.show() ``` - ![png](img/3.3_data_49_0.png) - -The same pattern gives us a histogram for numerical data as well: - +相同的模式也为我们提供了数值数据的直方图: ```python m = cars.as_matrix() @@ -524,12 +454,9 @@ plt.ylabel('Number of cars') plt.show() ``` - ![png](img/3.3_data_51_0.png) - -A histogram is really a chunky estimate of a variable's density function and so it's often useful to normalize the histogram so that the area integrates (sums) to 1. To get a normalized histogram, use argument `normed=True`: - +直方图实际上是变量密度函数的粗略估计,因此将直方图标准化来使区域积分(总和)为 1,通常很有用。要获得标准化的直方图,请使用参数`normed=True`: ```python plt.figure(figsize=(5, 2)) @@ -539,23 +466,19 @@ plt.ylabel('Probability') plt.show() ``` - ![png](img/3.3_data_53_0.png) +请注意,桶的高度之和不等于 1;高度乘以宽度之和等于 1。 -Note that it is not the sum of the heights of the bins that equals 1; it is the height * binwidth summed that equals 1. - -## Slicing and dicing - -So far, all of the data we've loaded has been numerical but it's very common to load categorical or textual variables in the form of strings. Pandas data frames are very useful in this case. Let's load some sample sales data that has three string columns: +## 切片和切块 +到目前为止,我们加载的所有数据都是数字的,但以字符串的形式加载分类或文本变量是很常见的。在这种情况下,Pandas 数据帧非常有用。 让我们加载一些示例销售数据,包含三个字符串列: ```python sales = pandas.read_csv('data/sales-small.csv') sales ``` - | | Date | Quantity | Unit Price | Shipping | Customer Name | Product Category | Product Name | | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | 10/13/10 | 6 | 38.94 | 35.00 | Muhammed MacIntyre | Office Supplies | Eldon Base for stackable storage shelf, platinum | @@ -568,8 +491,7 @@ sales 31 rows × 7 columns -The nice thing about the data frames is that we can access the columns by name, using either *table*`.`*attribute* or array indexing notation *table*`[` *attribute* `]`: - +数据帧的好处是我们可以使用`table.attribute`或数组索引表示法`table[attribute ]`来按名称访问列。 ```python sales.Date @@ -586,13 +508,6 @@ Name: Date, dtype: object ''' ``` - - - - - - - ```python sales['Date'] @@ -608,13 +523,6 @@ Name: Date, dtype: object ''' ``` - - - - - - - ```python sales['Customer Name'] @@ -630,12 +538,6 @@ Name: Customer Name, dtype: object ''' ``` - - - - - - Accessing rows via `sales[0]` then doesn't work because Pandas wants to use array indexing notation for getting columns. Instead, we have to use slightly more awkward notation: -- GitLab