3.md Dataset overview

```
./bin/pyspark
```
Spark’s primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python’s dynamic nature, we don’t need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it `DataFrame` to be consistent with the data frame concept in Pandas and R. Let’s make a new DataFrame from the text of the README file in the Spark source directory:
```
>>> textFile = spark.read.text("README.md")
```
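Because Python DataFrames are untyped collections of `Row` objects, the text file is read into a single string column named `value`. As a quick check, the shell's printout of the DataFrame and its schema looks roughly like this (exact formatting can vary between Spark versions):
```
>>> textFile                 # a DataFrame with one string column named "value"
DataFrame[value: string]
>>> textFile.printSchema()   # print the schema in a tree format
root
 |-- value: string (nullable = true)
```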
You can get values from DataFrame directly, by calling some actions, or transform the DataFrame to get a new one. For more details, please read the _[API doc](api/python/index.html#pyspark.sql.DataFrame)_.
```
>>> textFile.count()  # Number of rows in this DataFrame
126

>>> textFile.first()  # First row in this DataFrame
Row(value=u'# Apache Spark')
```
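Actions such as `take` likewise pull rows back to the driver, and individual columns of a `Row` can be read by name. A minimal sketch in the same pyspark session (the row's value matches the `first()` output above):
```
>>> rows = textFile.take(1)   # Return the first row(s) as a list of Row objects
>>> rows[0].value             # Read the "value" column of a Row by name
u'# Apache Spark'
```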
Now let’s transform this DataFrame to a new one. We call `filter` to return a new DataFrame with a subset of the lines in the file.
```
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
```
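Transformations and actions can also be chained together. A minimal sketch continuing the same session (the exact count depends on the contents of README.md):
```
>>> textFile.filter(textFile.value.contains("Spark")).count()  # How many lines contain "Spark"?
15
```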