Commit c9088a49 authored by Kostas Tzoumas, committed by Aljoscha Krettek

[FLINK-2779] Update documentation to reflect new Stream/Window API

Parent 9a21ab11
@@ -76,22 +76,22 @@ under the License.
<li class="dropdown{% if page.url contains '/apis/' %} active{% endif %}">
<a href="{{ apis }}" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">Programming Guides <span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
<li><a href="{{ apis }}/programming_guide.html"><strong>Batch: DataSet API</strong></a></li>
<li><a href="{{ apis }}/streaming_guide.html"><strong>Streaming: DataStream API</strong> <span class="badge">Beta</span></a></li>
<li><a href="{{ apis }}/programming_guide.html"><strong>DataSet API</strong></a></li>
<li><a href="{{ apis }}/streaming_guide.html"><strong>DataStream API</strong></a></li>
<li><a href="{{ apis }}/python.html">Python API <span class="badge">Beta</span></a></li>
<li class="divider"></li>
<li><a href="{{ apis }}/scala_shell.html">Interactive Scala Shell</a></li>
<li><a href="{{ apis }}/dataset_transformations.html">Dataset Transformations</a></li>
<li><a href="{{ apis }}/dataset_transformations.html">DataSet Transformations</a></li>
<li><a href="{{ apis }}/best_practices.html">Best Practices</a></li>
<li><a href="{{ apis }}/example_connectors.html">Connectors (Batch)</a></li>
<li><a href="{{ apis }}/kafka.html">Kafka Connector <span class="badge">Beta</span></a></li>
<li><a href="{{ apis }}/example_connectors.html">Connectors (DataSet API)</a></li>
<!--<li><a href="{{ apis }}/kafka.html">Kafka Connector <span class="badge">Beta</span></a></li>-->
<li><a href="{{ apis }}/examples.html">Examples</a></li>
<li><a href="{{ apis }}/local_execution.html">Local Execution</a></li>
<li><a href="{{ apis }}/cluster_execution.html">Cluster Execution</a></li>
<li><a href="{{ apis }}/cli.html">Command Line Interface</a></li>
<li><a href="{{ apis }}/web_client.html">Web Client</a></li>
<li><a href="{{ apis }}/iterations.html">Iterations</a></li>
<li><a href="{{ apis }}/iterations.html">Iterations (DataSet API)</a></li>
<li><a href="{{ apis }}/java8.html">Java 8</a></li>
<li><a href="{{ apis }}/hadoop_compatibility.html">Hadoop Compatibility <span class="badge">Beta</span></a></li>
<li><a href="{{ apis }}/storm_compatibility.html">Storm Compatibility <span class="badge">Beta</span></a></li>
---
title: "Flink Programming Guide"
title: "Flink DataSet API Programming Guide"
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
@@ -22,14 +22,14 @@ under the License.
<a href="#top"></a>
-Analysis programs in Flink are regular programs that implement transformations on data sets
+DataSet programs in Flink are regular programs that implement transformations on data sets
(e.g., filtering, mapping, joining, grouping). The data sets are initially created from certain
sources (e.g., by reading files, or from local collections). Results are returned via sinks, which may for
example write the data to (distributed) files, or to standard output (for example the command line
terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs.
The execution can happen in a local JVM, or on clusters of many machines.
-In order to create your own Flink program, we encourage you to start with the
+In order to create your own Flink DataSet program, we encourage you to start with the
[program skeleton](#program-skeleton) and gradually add your own
[transformations](#transformations). The remaining sections act as references for additional
operations and advanced features.
@@ -221,7 +221,7 @@ Program Skeleton
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
-As we already saw in the example, Flink programs look like regular Java
+As we already saw in the example, Flink DataSet programs look like regular Java
programs with a `main()` method. Each program consists of the same basic parts:
1. Obtain an `ExecutionEnvironment`,
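For illustration, a minimal sketch of a complete program with these parts (assuming the standard Java DataSet API; not part of this commit's diff):

{% highlight java %}
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCountSkeleton {
    public static void main(String[] args) throws Exception {
        // 1. Obtain an ExecutionEnvironment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 2. Load/create the initial data
        DataSet<String> text = env.fromElements("to be", "or not to be");

        // 3. Specify transformations on this data
        DataSet<Tuple2<String, Integer>> counts = text
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.split(" ")) {
                        out.collect(new Tuple2<>(word, 1));
                    }
                }
            })
            .groupBy(0)
            .sum(1);

        // 4. Specify where to put the results; print() also triggers execution.
        counts.print();
    }
}
{% endhighlight %}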
@@ -233,7 +233,7 @@ programs with a `main()` method. Each program consists of the same basic parts:
We will now give an overview of each of those steps, please refer to the respective sections for
more details. Note that all core classes of the Java API are found in the package {% gh_link /flink-java/src/main/java/org/apache/flink/api/java "org.apache.flink.api.java" %}.
-The `ExecutionEnvironment` is the basis for all Flink programs. You can
+The `ExecutionEnvironment` is the basis for all Flink DataSet programs. You can
obtain one using these static methods on class `ExecutionEnvironment`:
{% highlight java %}
@@ -253,7 +253,7 @@ Typically, you only need to use `getExecutionEnvironment()`, since this
will do the right thing depending on the context: if you are executing
your program inside an IDE or as a regular Java program it will create
a local environment that will execute your program on your local machine. If
-you created a JAR file from you program, and invoke it through the [command line](cli.html)
+you created a JAR file from your program, and invoke it through the [command line](cli.html)
or the [web interface](web_client.html),
the Flink cluster manager will execute your main method and `getExecutionEnvironment()` will return
an execution environment for executing your program on a cluster.
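For illustration, a sketch of the common factory methods (the host name, port, and JAR path below are placeholders):

{% highlight java %}
// Picks the right environment for the current context (IDE, local, or cluster).
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Explicitly local execution, e.g. for debugging in the IDE.
ExecutionEnvironment localEnv = ExecutionEnvironment.createLocalEnvironment();

// Explicitly remote execution against a running cluster (placeholder values).
ExecutionEnvironment remoteEnv =
    ExecutionEnvironment.createRemoteEnvironment("jobmanager-host", 6123, "/path/to/your-program.jar");
{% endhighlight %}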
@@ -276,7 +276,7 @@ more information on data sources and input formats, please refer to
Once you have a DataSet you can apply transformations to create a new
DataSet which you can then write to a file, transform again, or
combine with other DataSets. You apply transformations by calling
-methods on DataSet with your own custom transformation function. For example,
+methods on DataSet with your own custom transformation functions. For example,
a map transformation looks like this:
{% highlight java %}
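// Illustrative sketch (not taken from this commit's diff): parse each
// String element of a DataSet into an Integer.
DataSet<String> input = env.fromElements("1", "2", "3");

DataSet<Integer> parsed = input.map(new MapFunction<String, Integer>() {
    @Override
    public Integer map(String value) {
        return Integer.parseInt(value);
    }
});
{% endhighlight %}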
@@ -447,18 +447,14 @@ accessed from the `getLastJobExecutionResult()` method.
DataSet abstraction
---------------
-The batch processing APIs of Flink are centered around the `DataSet` abstraction. A `DataSet` is only
-an abstract representation of a set of data that can contain duplicates.
-Also note that Flink is not always physically creating (materializing) each DataSet at runtime. This
-depends on the used runtime, the configuration and optimizer decisions.
-The Flink runtime does not need to always materialize the DataSets because it is using a streaming runtime model.
-DataSets are only materialized to avoid distributed deadlocks (at points where the data flow graph branches out and joins again later) or if the execution mode has explicitly been set to a batched execution.
+A `DataSet` is an abstract representation of a finite immutable collection of data of the same type that may contain duplicates.
-When using Flink on Tez, all DataSets are materialized.
+Note that Flink is not always physically creating (materializing) each DataSet at runtime. This
+depends on the used runtime, the configuration and optimizer decisions. DataSets may be "streamed through"
+operations during execution, as under the hood Flink uses a streaming data processing engine.
+Some DataSets are materialized automatically to avoid distributed deadlocks (at points where the data flow graph branches
+out and joins again later) or if the execution mode has explicitly been set to blocking execution.
[Back to top](#top)
@@ -466,7 +462,7 @@ When using Flink on Tez, all DataSets are materialized.
Lazy Evaluation
---------------
-All Flink programs are executed lazily: When the program's main method is executed, the data loading
+All Flink DataSet programs are executed lazily: When the program's main method is executed, the data loading
and transformations do not happen directly. Rather, each operation is created and added to the
program's plan. The operations are actually executed when the execution is explicitly triggered by
an `execute()` call on the ExecutionEnvironment object. Also, `collect()` and `print()` will trigger
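A sketch of the lazy-evaluation contract (illustrative, assuming the Java DataSet API; not part of this commit's diff):

{% highlight java %}
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Nothing runs yet: these calls only add operations to the program's plan.
DataSet<Integer> doubled = env
    .fromElements(1, 2, 3)
    .map(new MapFunction<Integer, Integer>() {
        @Override
        public Integer map(Integer value) {
            return value * 2;
        }
    });

// Execution is triggered here: print() runs the plan and prints the result.
doubled.print();
{% endhighlight %}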
@@ -1323,7 +1319,7 @@ data.map (new MyMapFunction());
#### Anonymous classes
-You can pass a function as an anonmymous class:
+You can pass a function as an anonymous class:
{% highlight java %}
data.map(new MapFunction<String, Integer> () {
public Integer map(String value) { return Integer.parseInt(value); }
@@ -1451,7 +1447,7 @@ for a complete example.
Data Types
----------
-Flink places some restrictions on the type of elements that are used in DataSets and as results
+Flink places some restrictions on the type of elements that are used in DataSets and in results
of transformations. The reason for this is that the system analyzes the types to determine
efficient execution strategies.
@@ -1473,7 +1469,7 @@ Tuples are composite types that contain a fixed number of fields with various ty
The Java API provides classes from `Tuple1` up to `Tuple25`. Every field of a tuple
can be an arbitrary Flink type including further tuples, resulting in nested tuples. Fields of a
tuple can be accessed directly using the field's name as `tuple.f4`, or using the generic getter method
-`tuple.getField(int position)`. The field indicies start at 0. Note that this stands in contrast
+`tuple.getField(int position)`. The field indices start at 0. Note that this stands in contrast
to the Scala tuples, but it is more consistent with Java's general indexing.
{% highlight java %}
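// Illustrative tuple sketch (not taken from this commit's diff).
Tuple2<String, Integer> person = new Tuple2<>("Alice", 42);

// Direct, typed access via the field names f0, f1, ...
String name = person.f0;
Integer age = person.f1;

// Generic getter; field indices start at 0.
String alsoName = person.getField(0);
{% endhighlight %}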
@@ -2239,7 +2235,7 @@ result data. This section give some hints how to ease the development of Flink p
### Local Execution Environment
A `LocalEnvironment` starts a Flink system within the same JVM process it was created in. If you
-start the LocalEnvironement from an IDE, you can set breakpoint in your code and easily debug your
+start the LocalEnvironment from an IDE, you can set breakpoints in your code and easily debug your
program.
A LocalEnvironment is created and used as follows:
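A sketch of such usage (illustrative; the file paths are placeholders, not part of this commit's diff):

{% highlight java %}
// Starts a Flink system within the same JVM, e.g. for debugging in an IDE.
ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();

DataSet<String> lines = env.readTextFile("file:///path/to/input");  // placeholder path
lines.filter(new FilterFunction<String>() {
        @Override
        public boolean filter(String value) {
            return value.startsWith("http://");
        }
    })
    .writeAsText("file:///path/to/result");  // placeholder path

env.execute("Local filter sketch");
{% endhighlight %}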
@@ -2957,7 +2953,7 @@ public static final class Tokenizer extends RichFlatMapFunction<String, Tuple2<S
[Back to top](#top)
-Program Packaging & Distributed Execution
+Program Packaging and Distributed Execution
-----------------------------------------
As described in the [program skeleton](#program-skeleton) section, Flink programs can be executed on
This diff is collapsed.
@@ -20,18 +20,17 @@ specific language governing permissions and limitations
under the License.
-->
-Apache Flink is a platform for efficient, distributed, general-purpose data processing.
-It features powerful programming abstractions in Java and Scala, a high-performance runtime, and
-automatic program optimization. It has native support for iterations, incremental iterations, and
-programs consisting of large DAGs of operations.
+Apache Flink is an open source platform for distributed stream and batch data processing. Flink’s core is
+a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed
+computations over data streams. Flink also builds batch processing on top of the streaming engine, overlaying
+native iteration support, managed memory, and program optimization.

-If you quickly want to try out the system, please look at one of the available quickstarts. For
-a thorough introduction of the Flink API please refer to the
-[Programming Guide](apis/programming_guide.html).
+If you want to write your first program, look at one of the available quickstarts, and refer to the
+[DataSet API guide](apis/programming_guide.html) or the [DataStream API guide](apis/streaming_guide.html).
## Stack
-This is an overview of Flink's stack. Click on any component to go to the respective documentation.
+This is an overview of Flink's stack. Click on any component to go to the respective documentation page.
<img src="fig/overview-stack-0.9.png" width="893" height="450" alt="Stack" usemap="#overview-stack">
@@ -62,13 +61,10 @@ This documentation is for Apache Flink version {{ site.version }}, which is the
You can download the latest pre-built snapshot version from the [downloads]({{ site.download_url }}#latest) page of the [project website]({{ site.website_url }}).
-The Scala API uses Scala {{ site.scala_version }}. Please make sure to use a compatible version.
+<!--The Scala API uses Scala {{ site.scala_version }}. Please make sure to use a compatible version.
-Basically, the Scala API uses Scala 2.10. But you can use the API with Scala 2.11. To use Flink with
+The Scala API uses Scala 2.10, but you can use the API with Scala 2.11. To use Flink with
Scala 2.11, please check [build guide](/setup/building.html#build-flink-for-scala-211)
-and [programming guide](/apis/programming_guide.html#scala-dependency-versions).
+and [programming guide](/apis/programming_guide.html#scala-dependency-versions).-->
## Flink Architecture
<img src="fig/process_model.svg" width="100%" alt="Flink Process Model">
@@ -32,31 +32,53 @@ up within the same JVM.
When a program is submitted, a client is created that performs the pre-processing and turns the program
into the parallel data flow form that is executed by the JobManager and TaskManagers. The figure below
-illustrates the different actors in the system very coarsely.
+illustrates the different actors in the system and their interactions.
<div style="text-align: center;">
<img src="fig/ClientJmTm.svg" alt="The Interactions between Client, JobManager and TaskManager" height="400px" style="text-align: center;"/>
<img src="../fig/process_model.svg" width="100%" alt="Flink Process Model">
</div>
## Component Stack
-An alternative view on the system is given by the stack below. The different layers of the stack build on
+As a software stack, Flink is a layered system. The different layers of the stack build on
top of each other and raise the abstraction level of the program representations they accept:
- The **runtime** layer receives a program in the form of a *JobGraph*. A JobGraph is a generic parallel
data flow with arbitrary tasks that consume and produce data streams.
-- The **optimizer** and **common api** layer takes programs in the form of operator DAGs. The operators are
-specific (e.g., Map, Join, Filter, Reduce, ...), but are data type agnostic. The concrete types and their
-interaction with the runtime is specified by the higher layers.
+- Both the **DataStream API** and the **DataSet API** generate JobGraphs through separate compilation
+processes. The DataSet API uses an optimizer to determine the optimal plan for the program, while
+the DataStream API uses a stream builder.
-- The **API layer** implements multiple APIs that create operator DAGs for their programs. Each API needs
-to provide utilities (serializers, comparators) that describe the interaction between its data types and
-the runtime.
- The JobGraph is executed according to a variety of deployment options available in Flink (e.g., local,
remote, YARN, etc)
<div style="text-align: center;">
<img src="fig/stack.svg" alt="The Flink component stack" width="800px" />
</div>
- Libraries and APIs that are bundled with Flink generate DataSet or DataStream API programs. These are
Table for queries on logical tables, FlinkML for Machine Learning, and Gelly for graph processing.
+You can click on the components in the figure to learn more.
+
+<img src="../fig/overview-stack-0.9.png" width="893" height="450" alt="Stack" usemap="#overview-stack">
+
+<map name="overview-stack">
+<area shape="rect" coords="188,0,263,200" alt="Graph API: Gelly" href="libs/gelly_guide.html">
+<area shape="rect" coords="268,0,343,200" alt="Flink ML" href="libs/ml/">
+<area shape="rect" coords="348,0,423,200" alt="Table" href="libs/table.html">
+<area shape="rect" coords="188,205,538,260" alt="DataSet API (Java/Scala)" href="apis/programming_guide.html">
+<area shape="rect" coords="543,205,893,260" alt="DataStream API (Java/Scala)" href="apis/streaming_guide.html">
+<!-- <area shape="rect" coords="188,275,538,330" alt="Optimizer" href="optimizer.html"> -->
+<!-- <area shape="rect" coords="543,275,893,330" alt="Stream Builder" href="streambuilder.html"> -->
+<area shape="rect" coords="188,335,893,385" alt="Flink Runtime" href="internals/general_arch.html">
+<area shape="rect" coords="188,405,328,455" alt="Local" href="apis/local_execution.html">
+<area shape="rect" coords="333,405,473,455" alt="Remote" href="apis/cluster_execution.html">
+<area shape="rect" coords="478,405,638,455" alt="Embedded" href="apis/local_execution.html">
+<area shape="rect" coords="643,405,765,455" alt="YARN" href="setup/yarn_setup.html">
+<area shape="rect" coords="770,405,893,455" alt="Tez" href="setup/flink_on_tez.html">
+</map>
## Projects and Dependencies