Commit cf913476 authored by Stephan Ewen

[FLINK-3083] [docs] Add docs on how to configure streaming fault tolerance.

This closes #1413
Parent 2b358cde
......@@ -81,8 +81,9 @@ under the License.
<li><a href="{{ apis }}/python.html">Python API <span class="badge">Beta</span></a></li>
<li class="divider"></li>
<li><a href="{{ apis }}/scala_shell.html">Interactive Scala Shell</a></li>
<li><a href="{{ apis }}/fault_tolerance.html">Fault Tolerance</a></li>
<li><a href="{{ apis }}/state_backends.html">State in Streaming Programs</a></li>
<li><a href="{{ apis }}/scala_shell.html">Interactive Scala Shell</a></li>
<li><a href="{{ apis }}/dataset_transformations.html">DataSet Transformations</a></li>
<li><a href="{{ apis }}/best_practices.html">Best Practices</a></li>
<li><a href="{{ apis }}/example_connectors.html">Connectors (DataSet API)</a></li>
......
---
title: "Fault Tolerance"
is_beta: false
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<a href="#top"></a>
Flink's fault tolerance mechanism recovers programs in the presence of failures and
continues to execute them. Such failures include machine hardware failures, network failures,
transient program failures, etc.
* This will be replaced by the TOC
{:toc}
Streaming Fault Tolerance (DataStream API)
------------------------------------------
Flink has a checkpointing mechanism that recovers streaming jobs after failures. The checkpointing mechanism requires a *persistent* (or *durable*) source that
can be asked for prior records again (Apache Kafka is a good example of such a source).
The checkpointing mechanism stores the progress in the data sources and data sinks, the state of windows, as well as the user-defined state (see [Working with State]({{ site.baseurl }}/apis/streaming_guide.html#working-with-state)) consistently to provide *exactly once* processing semantics. Where the checkpoints are stored (e.g., JobManager memory, file system, database) depends on the configured [state backend]({{ site.baseurl }}/apis/state_backends.html).
The [docs on streaming fault tolerance]({{ site.baseurl }}/internals/stream_checkpointing.html) describe in detail the technique behind Flink's streaming fault tolerance mechanism.
To enable checkpointing, call `enableCheckpointing(n)` on the `StreamExecutionEnvironment`, where *n* is the checkpoint interval in milliseconds.
Other parameters for checkpointing include:
- *Number of retries*: The `setNumberOfExecutionRetries()` method defines how many times the job is restarted after a failure (see the sketch after the example below).
When checkpointing is activated but this value is not explicitly set, the job is restarted infinitely often.
- *exactly-once vs. at-least-once*: You can optionally pass a mode to the `enableCheckpointing(n)` method to choose between the two guarantee levels.
Exactly-once is preferable for most applications. At-least-once may be relevant for certain super-low-latency (consistently a few milliseconds) applications.
- *number of concurrent checkpoints*: By default, the system will not trigger another checkpoint while one is still in progress. This ensures that the topology does not spend too much time on checkpoints at the expense of making progress with processing the streams. It is possible to allow multiple overlapping checkpoints, which is interesting for pipelines that have a certain processing delay (for example, because the functions call external services that need some time to respond) but that still want to do very frequent checkpoints (100s of milliseconds) to re-process very little upon failures.
- *checkpoint timeout*: The time after which an in-progress checkpoint is aborted if it has not completed by then.
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000);
// advanced options:
// set mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(60000);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
val env = StreamExecutionEnvironment.getExecutionEnvironment()
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000)
// advanced options:
// set mode to exactly-once (this is the default)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig.setCheckpointTimeout(60000)
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
{% endhighlight %}
</div>
</div>
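The example above covers the mode, timeout, and concurrency settings. The retry count and the at-least-once mode from the list above can be set in the same way; the following is a minimal Java sketch (the Scala API is analogous), reusing the same `env`:

{% highlight java %}
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// restart the job up to 3 times after a failure before declaring it failed
env.setNumberOfExecutionRetries(3);

// start a checkpoint every 1000 ms, with at-least-once guarantees
// instead of the default exactly-once mode
env.enableCheckpointing(1000, CheckpointingMode.AT_LEAST_ONCE);
{% endhighlight %}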
### Fault Tolerance Guarantees of Data Sources and Sinks
Flink can guarantee exactly-once state updates to user-defined state only when the source participates in the
snapshotting mechanism. This is currently guaranteed for the Kafka source (and internal number generators), but
not for other sources. The following table lists the state update guarantees of Flink coupled with the bundled sources:
<table class="table table-bordered">
<thead>
<tr>
<th class="text-left" style="width: 25%">Source</th>
<th class="text-left" style="width: 25%">Guarantees</th>
<th class="text-left">Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Apache Kafka</td>
<td>exactly once</td>
<td>Use the appropriate Kafka connector for your version</td>
</tr>
<tr>
<td>RabbitMQ</td>
<td>at most once (v 0.10) / exactly once (v 1.0) </td>
<td></td>
</tr>
<tr>
<td>Twitter Streaming API</td>
<td>at most once</td>
<td></td>
</tr>
<tr>
<td>Collections</td>
<td>exactly once</td>
<td></td>
</tr>
<tr>
<td>Files</td>
<td>at least once</td>
<td>On failure, the file will be read from the beginning</td>
</tr>
<tr>
<td>Sockets</td>
<td>at most once</td>
<td></td>
</tr>
</tbody>
</table>
To guarantee end-to-end exactly-once record delivery (in addition to exactly-once state semantics), the data sink needs
to take part in the checkpointing mechanism. The following table lists the delivery guarantees (assuming exactly-once
state updates) of Flink coupled with bundled sinks:
<table class="table table-bordered">
<thead>
<tr>
<th class="text-left" style="width: 25%">Sink</th>
<th class="text-left" style="width: 25%">Guarantees</th>
<th class="text-left">Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>HDFS rolling sink</td>
<td>exactly once</td>
<td>Implementation depends on Hadoop version</td>
</tr>
<tr>
<td>Elasticsearch</td>
<td>at least once</td>
<td></td>
</tr>
<tr>
<td>Kafka producer</td>
<td>at least once</td>
<td></td>
</tr>
<tr>
<td>File sinks</td>
<td>at least once</td>
<td></td>
</tr>
<tr>
<td>Socket sinks</td>
<td>at least once</td>
<td></td>
</tr>
<tr>
<td>Standard output</td>
<td>at least once</td>
<td></td>
</tr>
</tbody>
</table>
[Back to top](#top)
Batch Processing Fault Tolerance (DataSet API)
----------------------------------------------
Fault tolerance for programs in the *DataSet API* works by retrying failed executions.
The number of times that Flink retries the execution before the job is declared as failed is configurable
via the *execution retries* parameter. A value of *0* effectively means that fault tolerance is deactivated.
To activate fault tolerance, set the *execution retries* to a value larger than zero. A common choice is a value
of three.
This example shows how to configure the execution retries for a Flink DataSet program.
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setNumberOfExecutionRetries(3);
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
val env = ExecutionEnvironment.getExecutionEnvironment()
env.setNumberOfExecutionRetries(3)
{% endhighlight %}
</div>
</div>
You can also define default values for the number of execution retries and the retry delay in the `flink-conf.yaml`:
~~~
execution-retries.default: 3
~~~
Retry Delays
------------
Execution retries can be configured to be delayed. Delaying the retry means that after a failed execution, the re-execution does not start
immediately, but only after a certain delay.
Delaying the retries can be helpful when the program interacts with external systems where, for example, connections or pending transactions should time out before re-execution is attempted.
You can set the retry delay for each program as follows (the sample shows the DataStream API - the DataSet API works similarly):
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setExecutionRetryDelay(5000); // 5000 milliseconds delay
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.getConfig.setExecutionRetryDelay(5000) // 5000 milliseconds delay
{% endhighlight %}
</div>
</div>
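As noted above, the DataSet API works similarly. Below is a brief Java sketch of the equivalent setting on the batch `ExecutionEnvironment`, assuming the delay is configured through the same `ExecutionConfig` as in the streaming example:

{% highlight java %}
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// wait 5000 milliseconds after a failed execution before retrying
env.getConfig().setExecutionRetryDelay(5000);
{% endhighlight %}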
You can also define the default value for the retry delay in the `flink-conf.yaml`:
~~~
execution-retries.delay: 10 s
~~~
[Back to top](#top)
......@@ -2889,136 +2889,7 @@ Execution Parameters
### Fault Tolerance
Flink has a checkpointing mechanism that recovers streaming jobs after failures. The checkpointing mechanism requires a *persistent* or *durable* source that
can be asked for prior records again (Apache Kafka is a good example of a durable source).
The checkpointing mechanism stores the progress in the source as well as the user-defined state (see [Working with State](#working_with_state))
consistently to provide *exactly once* processing guarantees.
To enable checkpointing, call `enableCheckpointing(n)` on the `StreamExecutionEnvironment`, where *n* is the checkpoint interval in milliseconds.
Other parameters for checkpointing include:
- *Number of retries*: The `setNumberOfExecutionRetries()` method defines how many times the job is restarted after a failure.
When checkpointing is activated but this value is not explicitly set, the job is restarted infinitely often.
- *exactly-once vs. at-least-once*: You can optionally pass a mode to the `enableCheckpointing(n)` method to choose between the two guarantee levels.
Exactly-once is preferable for most applications. At-least-once may be relevant for certain super-low-latency (consistently a few milliseconds) applications.
The [docs on streaming fault tolerance](../internals/stream_checkpointing.html) describe in detail the technique behind Flink's streaming fault tolerance mechanism.
Flink can guarantee exactly-once state updates to user-defined state only when the source participates in the
snapshotting mechanism. This is currently guaranteed for the Kafka source (and internal number generators), but
not for other sources. The following table lists the state update guarantees of Flink coupled with the bundled sources:
<table class="table table-bordered">
<thead>
<tr>
<th class="text-left" style="width: 25%">Source</th>
<th class="text-left" style="width: 25%">Guarantees</th>
<th class="text-left">Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Apache Kafka</td>
<td>exactly once</td>
<td>Use the appropriate Kafka connector for your version</td>
</tr>
<tr>
<td>RabbitMQ</td>
<td>at most once</td>
<td></td>
</tr>
<tr>
<td>Twitter Streaming API</td>
<td>at most once</td>
<td></td>
</tr>
<tr>
<td>Collections</td>
<td>at most once</td>
<td></td>
</tr>
<tr>
<td>Files</td>
<td>at least once</td>
<td>On failure, the file will be read from the beginning</td>
</tr>
<tr>
<td>Sockets</td>
<td>at most once</td>
<td></td>
</tr>
</tbody>
</table>
<!--
| Source | Strongest guarantees | Notes |
|-----------------------|-----------------------|-------|
| Apache Kafka | exactly once | Use the appropriate Kafka connector |
| RabbitMQ | at most once | |
| Twitter Streaming API | at most once | |
| Collection sources | at most once | |
| File sources | at least once | Restarts from beginning of the file |
| Socket sources | at most once | |
-->
To guarantee end-to-end exactly-once record delivery (in addition to exactly-once updates), the data sink needs
to take part in the snapshotting mechanism. The following table lists the delivery guarantees (assuming exactly-once
state updates) of Flink coupled with bundled sinks:
<table class="table table-bordered">
<thead>
<tr>
<th class="text-left" style="width: 25%">Sink</th>
<th class="text-left" style="width: 25%">Guarantees</th>
<th class="text-left">Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>HDFS rolling sink</td>
<td>exactly once</td>
<td>Implementation depends on Hadoop version</td>
</tr>
<tr>
<td>Elasticsearch</td>
<td>at least once</td>
<td></td>
</tr>
<tr>
<td>Kafka producer</td>
<td>at least once</td>
<td></td>
</tr>
<tr>
<td>File sinks</td>
<td>at least once</td>
<td></td>
</tr>
<tr>
<td>Socket sinks</td>
<td>at least once</td>
<td></td>
</tr>
<tr>
<td>Standard output</td>
<td>at least once</td>
<td></td>
</tr>
</tbody>
</table>
<!--
| Sink | Strongest guarantees | Notes |
|-----------------------|-----------------------|-------|
| HDFS rolling sink | exactly once | Implementation depends on Hadoop version |
| Elasticsearch | at least once | Duplicates need to be handled in Elasticsearch
| File sinks | at least once | |
| Socket sinks | at least once | |
| Standard output | at least once | |
-->
The [Fault Tolerance Documentation]({{ site.baseurl }}/apis/fault_tolerance.html) describes the options and parameters to enable and configure Flink's checkpointing mechanism.
### Parallelism
......