The Spark application should include the `hadoop-openstack` dependency. For example, for Maven support, add the following to the `pom.xml` file:
```
<dependencyManagement>
...
...
...
</dependencyManagement>
```
# Configuration Parameters
Create `core-site.xml` and place it inside Spark’s `conf` directory. There are two main categories of parameters that need to be configured: the declaration of the Swift driver and the parameters required by Keystone.
...
...
Additional parameters are required by Keystone (v2.0) and should be provided to the Swift driver.
For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password `testing` defined for tenant `test`. Then `core-site.xml` should include:
```
<configuration>
<property>
...
...
</configuration>
```
Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`, and `fs.swift.service.PROVIDER.password` contain sensitive information, and keeping them in `core-site.xml` is not always a good approach. We suggest keeping these parameters in `core-site.xml` only for testing purposes, when running Spark via `spark-shell`. For job submissions they should be provided via `sparkContext.hadoopConfiguration`.
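A minimal PySpark sketch of that approach, reusing the `PROVIDER=SparkTest` values from the example above (the application name, container name, and object path are hypothetical, and the Hadoop configuration is reached through PySpark's internal `_jsc` handle):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SwiftJob").getOrCreate()

# Hadoop configuration of the underlying JVM SparkContext
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.swift.service.SparkTest.tenant", "test")
hadoop_conf.set("fs.swift.service.SparkTest.username", "tester")
hadoop_conf.set("fs.swift.service.SparkTest.password", "testing")

# Swift paths take the form swift://CONTAINER.PROVIDER/object
df = spark.read.text("swift://some-container.SparkTest/data.txt")
```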
```
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
```
```
./bin/pyspark
```
Spark’s primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python’s dynamic nature, we don’t need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it `DataFrame` to be consistent with the data frame concept in Pandas and R. Let’s make a new DataFrame from the text of the README file in the Spark source directory:
```
>>> textFile = spark.read.text("README.md")
```
You can get values from DataFrame directly, by calling some actions, or transform the DataFrame to get a new one. For more details, please read the _[API doc](api/python/index.html#pyspark.sql.DataFrame)_.
```
>>> textFile.count() # Number of rows in this DataFrame
126
...
...
Row(value=u'# Apache Spark')
```
Now let’s transform this DataFrame to a new one. We call `filter` to return a new DataFrame with a subset of the lines in the file.
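A minimal sketch of that call, assuming the `textFile` DataFrame created above and a hypothetical name `linesWithSpark` for the result:
```
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
```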
To find the line with the most words, we first map each line to an integer value and alias it as “numWords”, creating a new DataFrame; `agg` is then called on that DataFrame to find the largest word count. The arguments to `select` and `agg` are both _[Column](api/python/index.html#pyspark.sql.Column)_; we can use `df.colName` to get a column from a DataFrame. We can also import `pyspark.sql.functions`, which provides a lot of convenient functions for building a new Column from an old one.
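A hedged sketch of that computation on the `textFile` DataFrame from above (the alias `F` for `pyspark.sql.functions` is an assumption):
```
>>> from pyspark.sql import functions as F
>>> textFile.select(F.size(F.split(textFile.value, r"\s+")).alias("numWords")) \
...     .agg(F.max(F.col("numWords"))) \
...     .collect()  # returns a one-row list containing the largest word count
```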
One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
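One hedged sketch of such a flow, computing per-word counts from the `textFile` DataFrame (the name `wordCounts` and the alias `F` are assumptions; the next paragraph walks through the steps):
```
>>> from pyspark.sql import functions as F
>>> wordCounts = textFile.select(F.explode(F.split(textFile.value, r"\s+")).alias("word")) \
...     .groupBy("word") \
...     .count()
```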
Here, we use the `explode` function in `select` to transform a Dataset of lines to a Dataset of words, and then combine `groupBy` and `count` to compute the per-word counts in the file as a DataFrame of two columns: “word” and “count”. To collect the word counts in our shell, we can call `collect`:
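For example, on the `wordCounts` DataFrame sketched above:
```
>>> wordCounts.collect()  # returns a list of Row(word=..., count=...) objects
```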
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting `bin/pyspark` to a cluster, as described in the [RDD programming guide](rdd-programming-guide.html#using-the-shell).
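As a hedged aside, marking a DataFrame for caching in the Python API is a one-liner; reusing the hypothetical `linesWithSpark` DataFrame from the filter sketch above:
```
>>> linesWithSpark.cache()
>>> linesWithSpark.count()  # the first action after cache() materializes the data in memory
```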
# Self-Contained Applications
...
...
Now we will show how to write an application using the Python API (PySpark).
As an example, we’ll create a simple Spark application, `SimpleApp.py`:
```
"""SimpleApp.py"""
from pyspark.sql import SparkSession
...
...
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
spark.stop()
```
This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala and Java examples, we use a SparkSession to create Datasets. For applications that use custom classes or third-party libraries, we can also add code dependencies to `spark-submit` through its `--py-files` argument by packaging them into a .zip file (see `spark-submit --help` for details). `SimpleApp` is simple enough that we do not need to specify any code dependencies.
We can run this application using the `bin/spark-submit` script:
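A hedged sketch of the invocation; the `--master local[4]` setting is illustrative, and YOUR_SPARK_HOME is the same placeholder as in the text above:
```
YOUR_SPARK_HOME/bin/spark-submit \
  --master local[4] \
  SimpleApp.py
```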