3.md Dataset overview

```
./bin/pyspark
```
Spark’s primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python’s dynamic nature, we don’t need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it `DataFrame` to be consistent with the data frame concept in Pandas and R. Let’s make a new DataFrame from the text of the README file in the Spark source directory:
```
>>> textFile = spark.read.text("README.md")
```
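Because Python DataFrames are untyped collections of `Row` objects, the text file is read into a single string column named `value`. As a quick check, the shell's printout of the DataFrame and its schema looks roughly like this (exact formatting can vary between Spark versions):
```
>>> textFile                 # a DataFrame with one string column named "value"
DataFrame[value: string]
>>> textFile.printSchema()   # print the schema in a tree format
root
 |-- value: string (nullable = true)
```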
You can get values from DataFrame directly, by calling some actions, or transform the DataFrame to get a new one. For more details, please read the _[API doc](api/python/index.html#pyspark.sql.DataFrame)_.
```
>>> textFile.count()  # Number of rows in this DataFrame
126

>>> textFile.first()  # First row in this DataFrame
Row(value=u'# Apache Spark')
```
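Actions such as `take` likewise pull rows back to the driver, and individual columns of a `Row` can be read by name. A minimal sketch in the same pyspark session (the row's value matches the `first()` output above):
```
>>> rows = textFile.take(1)   # Return the first row(s) as a list of Row objects
>>> rows[0].value             # Read the "value" column of a Row by name
u'# Apache Spark'
```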
Now let’s transform this DataFrame to a new one. We call `filter` to return a new DataFrame with a subset of the lines in the file.
```
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
```
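Transformations and actions can also be chained together. A minimal sketch continuing the same session (the exact count depends on the contents of README.md):
```
>>> textFile.filter(textFile.value.contains("Spark")).count()  # How many lines contain "Spark"?
15
```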