diff --git a/docs/_includes/navbar.html b/docs/_includes/navbar.html
index 740ec9f07aeb34bedba959f28437e6dfaa949341..dc7ef30d630da4b99d2dc2f324abd02768837f74 100644
--- a/docs/_includes/navbar.html
+++ b/docs/_includes/navbar.html
@@ -53,6 +53,7 @@ under the License.
YARN
GCloud
Flink on Tez Beta
+ JobManager High Availability
Configuration
diff --git a/docs/page/css/flink.css b/docs/page/css/flink.css
index 9074e23b07684aee05cfa1204a16c5f2a3493cd2..3b09e54db0d8b59ead3c0e754b38db1c078996a8 100644
--- a/docs/page/css/flink.css
+++ b/docs/page/css/flink.css
@@ -113,11 +113,19 @@ h2, h3 {
code {
background: #f5f5f5;
- padding: 0;
- color: #333333;
- font-family: "Menlo", "Lucida Console", monospace;
+ padding: 0;
+ color: #333333;
+ font-family: "Menlo", "Lucida Console", monospace;
}
pre {
font-size: 85%;
}
+
+img.center {
+ display: block;
+ margin-left: auto;
+ margin-right: auto;
+}
+
+
diff --git a/docs/setup/fig/jobmanager_ha_overview.png b/docs/setup/fig/jobmanager_ha_overview.png
new file mode 100644
index 0000000000000000000000000000000000000000..ff82cae5f8e5f7d32953b5f4aa8a375968e4d52c
Binary files /dev/null and b/docs/setup/fig/jobmanager_ha_overview.png differ
diff --git a/docs/setup/jobmanager_high_availability.md b/docs/setup/jobmanager_high_availability.md
new file mode 100644
index 0000000000000000000000000000000000000000..dec0cdcbeef3441d70eff7993425f60b9a572cef
--- /dev/null
+++ b/docs/setup/jobmanager_high_availability.md
@@ -0,0 +1,121 @@
+---
+title: "JobManager High Availability (HA)"
+---
+
+
+The JobManager is the coordinator of each Flink deployment. It is responsible for both *scheduling* and *resource management*.
+
+By default, there is a single JobManager instance per Flink cluster. This creates a *single point of failure* (SPOF): if the JobManager crashes, no new programs can be submitted and running programs fail.
+
+With JobManager High Availability, you can run multiple JobManager instances per Flink cluster and thereby circumvent the *SPOF*.
+
+The general idea of JobManager high availability is that there is a **single leading JobManager** at any time and **multiple standby JobManagers** to take over leadership in case the leader fails. This guarantees that there is **no single point of failure** and programs can make progress as soon as a standby JobManager has taken leadership. There is no explicit distinction between standby and master JobManager instances. Each JobManager can take the role of master or standby.
+
+As an example, consider the following setup with three JobManager instances:
+
+
+
+## Configuration
+
+To enable JobManager High Availability you have to configure a **ZooKeeper quorum** and set up a **masters file** with all JobManagers hosts.
+
+Flink leverages **[ZooKeeper](http://zookeeper.apache.org)** for *distributed coordination* between all running JobManager instances. ZooKeeper is a separate service from Flink, which provides highly reliable distirbuted coordination via leader election and light-weight consistent state storage. Check out [ZooKeeper's Getting Started Guide](http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html) for more information about ZooKeeper.
+
+Configuring a ZooKeeper quorum in `conf/flink-conf.yaml` *enables* high availability mode and all Flink components try to connect to a JobManager via coordination through ZooKeeper.
+
+- **ZooKeeper quorum** (required): A *ZooKeeper quorum* is a replicated group of ZooKeeper servers, which provide the distributed coordination service.
+
+ ha.zookeeper.quorum: address1:2181[,...],addressX:2181
+
+ Each *addressX:port* refers to a ZooKeeper server, which is reachable by Flink at the given address and port.
+
+- The following configuration keys are optional:
+
+ - `ha.zookeeper.dir: /flink [default]`: ZooKeeper directory to use for coordination
+ - TODO Add client configuration keys
+
+## Starting an HA-cluster
+
+In order to start an HA-cluster configure the *masters* file in `conf/masters`:
+
+- **masters file**: The *masters file* contains all hosts, on which JobManagers are started.
+
+
+jobManagerAddress1
+[...]
+jobManagerAddressX
+
+
+After configuring the masters and the ZooKeeper quorum, you can use the provided cluster startup scripts as usual. They will start a HA-cluster. **Keep in mind that the ZooKeeper quorum has to be running when you call the scripts**.
+
+## Running ZooKeeper
+
+If you don't have a running ZooKeeper installation, you can use the helper scripts, which ship with Flink.
+
+There is a ZooKeeper configuration template in `conf/zoo.cfg`. You can configure the hosts to run ZooKeeper on with the `server.X` entries, where X is a unique ID of each server:
+
+
+server.X=addressX:peerPort:leaderPort
+[...]
+server.Y=addressY:peerPort:leaderPort
+
+
+The script `bin/start-zookeeper-quorum.sh` will start a ZooKeeper server on each of the configured hosts. The started processes start ZooKeeper servers via a Flink wrapper, which reads the configuration from `conf/zoo.cfg` and makes sure to set some rqeuired configuration values for convenience. In production setups, it is recommended to manage your own ZooKeeper installation.
+
+## Example: Start and stop a local HA-cluster with 2 JobManagers
+
+1. **Configure ZooKeeper quorum** in `conf/flink.yaml`:
+
+ ha.zookeeper.quorum: localhost
+
+2. **Configure masters** in `conf/masters`:
+
+
+localhost
+localhost
+
+3. **Configure ZooKeeper server** in `conf/zoo.cfg` (currently it's only possible to run a single ZooKeeper server per machine):
+
+ server.0=localhost:2888:3888
+
+4. **Start ZooKeeper quorum**:
+
+
+$ bin/start-zookeeper-quorum.sh
+Starting zookeeper daemon on host localhost.
+
+5. **Start an HA-cluster**:
+
+
+$ bin/start-cluster-streaming.sh
+Starting HA cluster (streaming mode) with 2 masters and 1 peers in ZooKeeper quorum.
+Starting jobmanager daemon on host localhost.
+Starting jobmanager daemon on host localhost.
+Starting taskmanager daemon on host localhost.
+
+6. **Stop ZooKeeper quorum and cluster**:
+
+
+$ bin/stop-cluster.sh
+Stopping taskmanager daemon (pid: 7647) on localhost.
+Stopping jobmanager daemon (pid: 7495) on host localhost.
+Stopping jobmanager daemon (pid: 7349) on host localhost.
+$ bin/stop-zookeeper-quorum.sh
+Stopping zookeeper daemon (pid: 7101) on host localhost.