It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting `bin/pyspark` to a cluster, as described in the [RDD programming guide](rdd-programming-guide.html#using-the-shell).
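As a minimal sketch, connecting the shell to a cluster only requires pointing it at the master URL; `spark://HOST:PORT` below is a placeholder for your cluster's actual master address:

```
# Launch the PySpark shell against a cluster master instead of local mode.
# Replace spark://HOST:PORT with the master URL of your cluster.
$ ./bin/pyspark --master spark://HOST:PORT
```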
Now we will show how to write an application using the Python API (PySpark).
As an example, we’ll create a simple Spark application, `SimpleApp.py`:
```
"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()
```
This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala and Java examples, we use a SparkSession to create Datasets. For applications that use custom classes or third-party libraries, we can also add code dependencies to `spark-submit` through its `--py-files` argument by packaging them into a .zip file (see `spark-submit --help` for details). `SimpleApp` is simple enough that we do not need to specify any code dependencies.
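To run it, a minimal `spark-submit` sketch is shown below; `YOUR_SPARK_HOME` is again a placeholder for your installation directory, and `local[4]` asks for a local master with four worker threads:

```
# Run the application with spark-submit on a local master using four threads.
# A --py-files argument would only be needed if the app had packaged dependencies.
$ YOUR_SPARK_HOME/bin/spark-submit --master "local[4]" SimpleApp.py
```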