After configuring the masters and the ZooKeeper quorum, you can use the provided cluster startup scripts as usual. They will start an HA-cluster. Keep in mind that the **ZooKeeper quorum has to be running** when you call the scripts and make sure to **configure a separate ZooKeeper root path** for each HA cluster you are starting.
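For example, a separate root path and cluster ID can be given per cluster in `conf/flink-conf.yaml` (key names as in the Flink 1.7 configuration reference; the values are placeholders):

```
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /cluster_one # important: customize per cluster
```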
#### Example: Standalone Cluster with 2 JobManagers
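A minimal sketch of such a setup, assuming two JobManagers on `localhost` and an HDFS recovery directory (hostnames, ports, and paths are placeholders):

```
# conf/flink-conf.yaml
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /cluster_one # important: customize per cluster
high-availability.storageDir: hdfs:///flink/recovery

# conf/masters
localhost:8081
localhost:8082
```

With the ZooKeeper quorum already running, `bin/start-cluster.sh` then brings up both JobManagers.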
When running a highly available YARN cluster, **we don’t run multiple JobManager (ApplicationMaster) instances**, but only one, which is restarted by YARN on failures. The exact behaviour depends on the specific YARN version you are using.
### Configuration
#### Maximum Application Master Attempts (yarn-site.xml)
You have to configure the maximum number of attempts for the application masters for **your** YARN setup in `yarn-site.xml`:
```
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>4</value>
  <description>
    The maximum number of application master execution attempts.
  </description>
</property>
```
The default for current YARN versions is 2 (meaning a single JobManager failure is tolerated).
In addition to the HA configuration ([see above](#configuration)), you have to configure the maximum attempts in `conf/flink-conf.yaml`:
```
yarn.application-attempts: 10
```
This means that the application can be restarted 9 times for failed attempts before YARN fails the application (9 retries + 1 initial attempt). Additional restarts can be performed by YARN if required by YARN operations: Preemption, node hardware failures or reboots, or NodeManager resyncs. These restarts are not counted against `yarn.application-attempts`, see [Jian Fang’s blog post](http://johnjianfang.blogspot.de/2015/04/the-number-of-maximum-attempts-of-yarn.html). It’s important to note that `yarn.resourcemanager.am.max-attempts` is an upper bound for the application restarts. Therefore, the number of application attempts set within Flink cannot exceed the YARN cluster setting with which YARN was started.
#### Container Shutdown Behaviour
* **YARN 2.3.0 < version < 2.4.0**. All containers are restarted if the application master fails.
* **YARN 2.4.0 < version < 2.6.0**. TaskManager containers are kept alive across application master failures. This has the advantage that the startup time is faster and that the user does not have to wait for obtaining the container resources again.
* **YARN 2.6.0 <= version**: Sets the attempt failure validity interval to Flink’s Akka timeout value. The attempt failure validity interval says that an application is only killed after the system has seen the maximum number of application attempts during one interval. This prevents a long-lasting job from depleting its application attempts.
**Note**: Hadoop YARN 2.4.0 has a major bug (fixed in 2.5.0) preventing container restarts from a restarted Application Master/Job Manager container. See [FLINK-4142](https://issues.apache.org/jira/browse/FLINK-4142) for details. We recommend using at least Hadoop 2.5.0 for high availability setups on YARN.
If ZooKeeper is running in secure mode with Kerberos, you can override the following configurations in `flink-conf.yaml` as necessary:
```
zookeeper.sasl.service-name: zookeeper     # default is "zookeeper". If the ZooKeeper quorum is configured
                                           # with a different service name then it can be supplied here.
zookeeper.sasl.login-context-name: Client  # default is "Client". The value needs to match one of the values
                                           # configured in "security.kerberos.login.contexts".
```
For more information on Flink configuration for Kerberos security, please see [here](//ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html). You can also find [here](//ci.apache.org/projects/flink/flink-docs-release-1.7/ops/security-kerberos.html) further details on how Flink internally sets up Kerberos-based security.
If you don’t have a running ZooKeeper installation, you can use the helper scripts, which ship with Flink.
There is a ZooKeeper configuration template in `conf/zoo.cfg`. You can configure the hosts to run ZooKeeper on with the `server.X` entries, where X is a unique ID of each server:
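For instance, a three-node quorum might look like this in `conf/zoo.cfg` (the hostnames are placeholders; `2888` and `3888` are the conventional ZooKeeper peer and leader-election ports):

```
server.1=zkhost1:2888:3888
server.2=zkhost2:2888:3888
server.3=zkhost3:2888:3888
```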
The script `bin/start-zookeeper-quorum.sh` starts a ZooKeeper server on each of the configured hosts via a Flink wrapper, which reads the configuration from `conf/zoo.cfg` and sets some required configuration values for convenience. In production setups, it is recommended to manage your own ZooKeeper installation.
Programs written in the [Data Stream API](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/datastream_api.html) often hold state in various forms:
* Windows gather elements or aggregates until they are triggered
See also [state section](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/index.html) in the streaming API guide.
When checkpointing is activated, such state is persisted upon checkpoints to guard against data loss and recover consistently. How the state is represented internally, and how and where it is persisted upon checkpoints depends on the chosen **State Backend**.
## Available State Backends
Out of the box, Flink bundles these state backends:

* _MemoryStateBackend_
* _FsStateBackend_
* _RocksDBStateBackend_

If nothing else is configured, the system will use the MemoryStateBackend.
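A cluster-wide default backend can be chosen in `conf/flink-conf.yaml`; a sketch using the standard configuration keys (the HDFS path is a placeholder):

```
state.backend: filesystem
state.checkpoints.dir: hdfs://namenode:40010/flink/checkpoints
```

The accepted shortcut names are `jobmanager` (MemoryStateBackend), `filesystem` (FsStateBackend), and `rocksdb` (RocksDBStateBackend).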
### The MemoryStateBackend
The _MemoryStateBackend_ holds data internally as objects on the Java heap. Key/value state and window operators hold hash tables that store the values, triggers, etc.