The Spark application should include the `hadoop-openstack` dependency. For example, for Maven support, add the following to the `pom.xml` file:
```
<dependencyManagement>
...
...
...
</dependencyManagement>
```
# Configuration Parameters
Create `core-site.xml` and place it inside Spark’s `conf` directory. There are two main categories of parameters that need to be configured: the declaration of the Swift driver and the parameters required by Keystone.
...
...
Additional parameters are required by Keystone (v2.0) and should be provided to the Swift driver.
For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password `testing` defined for tenant `test`. Then `core-site.xml` should include:
```
<configuration>
<property>
...
...
</configuration>
```
Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`, and `fs.swift.service.PROVIDER.password` contain sensitive information, and keeping them in `core-site.xml` is not always a good approach. We suggest keeping these parameters in `core-site.xml` only for testing purposes, when running Spark via `spark-shell`. For job submissions they should be provided via `sparkContext.hadoopConfiguration`.
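A minimal PySpark sketch of that approach, reusing the `PROVIDER=SparkTest` values from the example above (the application name, container name, and object path are hypothetical, and the Hadoop configuration is reached through PySpark's internal `_jsc` handle):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SwiftJob").getOrCreate()

# Hadoop configuration of the underlying JVM SparkContext
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.swift.service.SparkTest.tenant", "test")
hadoop_conf.set("fs.swift.service.SparkTest.username", "tester")
hadoop_conf.set("fs.swift.service.SparkTest.password", "testing")

# Swift paths take the form swift://CONTAINER.PROVIDER/object
df = spark.read.text("swift://some-container.SparkTest/data.txt")
```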
```
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
```
```
./bin/pyspark
```
Spark’s primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python’s dynamic nature, we don’t need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it `DataFrame` to be consistent with the data frame concept in Pandas and R. Let’s make a new DataFrame from the text of the README file in the Spark source directory:
```
>>> textFile = spark.read.text("README.md")
```
You can get values from DataFrame directly, by calling some actions, or transform the DataFrame to get a new one. For more details, please read the _[API doc](api/python/index.html#pyspark.sql.DataFrame)_.
```
>>> textFile.count() # Number of rows in this DataFrame
126
...
...
Row(value=u'# Apache Spark')
```
Now let’s transform this DataFrame to a new one. We call `filter` to return a new DataFrame with a subset of the lines in the file.
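A minimal sketch of that call, assuming the `textFile` DataFrame created above and a hypothetical name `linesWithSpark` for the result:
```
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
```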
To find the line with the most words, we first map each line to an integer value and alias it as “numWords”, creating a new DataFrame; `agg` is then called on that DataFrame to find the largest word count. The arguments to `select` and `agg` are both _[Column](api/python/index.html#pyspark.sql.Column)_; we can use `df.colName` to get a column from a DataFrame. We can also import `pyspark.sql.functions`, which provides a lot of convenient functions for building a new Column from an old one.
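A hedged sketch of that computation on the `textFile` DataFrame from above (the alias `F` for `pyspark.sql.functions` is an assumption):
```
>>> from pyspark.sql import functions as F
>>> textFile.select(F.size(F.split(textFile.value, r"\s+")).alias("numWords")) \
...     .agg(F.max(F.col("numWords"))) \
...     .collect()  # returns a one-row list containing the largest word count
```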
One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
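One hedged sketch of such a flow, computing per-word counts from the `textFile` DataFrame (the name `wordCounts` and the alias `F` are assumptions; the next paragraph walks through the steps):
```
>>> from pyspark.sql import functions as F
>>> wordCounts = textFile.select(F.explode(F.split(textFile.value, r"\s+")).alias("word")) \
...     .groupBy("word") \
...     .count()
```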
Here, we use the `explode` function in `select` to transform a Dataset of lines to a Dataset of words, and then combine `groupBy` and `count` to compute the per-word counts in the file as a DataFrame of two columns: “word” and “count”. To collect the word counts in our shell, we can call `collect`:
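For example, on the `wordCounts` DataFrame sketched above:
```
>>> wordCounts.collect()  # returns a list of Row(word=..., count=...) objects
```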
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting `bin/pyspark` to a cluster, as described in the [RDD programming guide](rdd-programming-guide.html#using-the-shell).
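As a hedged aside, marking a DataFrame for caching in the Python API is a one-liner; reusing the hypothetical `linesWithSpark` DataFrame from the filter sketch above:
```
>>> linesWithSpark.cache()
>>> linesWithSpark.count()  # the first action after cache() materializes the data in memory
```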
# Self-Contained Applications
...
...
Now we will show how to write an application using the Python API (PySpark).
As an example, we’ll create a simple Spark application, `SimpleApp.py`:
```
"""SimpleApp.py"""
from pyspark.sql import SparkSession
...
...
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
spark.stop()
```
This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala and Java examples, we use a SparkSession to create Datasets. For applications that use custom classes or third-party libraries, we can also add code dependencies to `spark-submit` through its `--py-files` argument by packaging them into a .zip file (see `spark-submit --help` for details). `SimpleApp` is simple enough that we do not need to specify any code dependencies.
We can run this application using the `bin/spark-submit` script:
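A hedged sketch of the invocation; the `--master local[4]` setting is illustrative, and YOUR_SPARK_HOME is the same placeholder as in the text above:
```
YOUR_SPARK_HOME/bin/spark-submit \
  --master local[4] \
  SimpleApp.py
```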