diff --git a/docs/_includes/navbar.html b/docs/_includes/navbar.html
index 740ec9f07aeb34bedba959f28437e6dfaa949341..dc7ef30d630da4b99d2dc2f324abd02768837f74 100644
--- a/docs/_includes/navbar.html
+++ b/docs/_includes/navbar.html
@@ -53,6 +53,7 @@ under the License.
  • YARN
  • GCloud
  • Flink on Tez Beta
+ • JobManager High Availability
  • Configuration
diff --git a/docs/page/css/flink.css b/docs/page/css/flink.css
index 9074e23b07684aee05cfa1204a16c5f2a3493cd2..3b09e54db0d8b59ead3c0e754b38db1c078996a8 100644
--- a/docs/page/css/flink.css
+++ b/docs/page/css/flink.css
@@ -113,11 +113,19 @@ h2, h3 {
 code {
   background: #f5f5f5;
-  padding: 0;
-  color: #333333;
-  font-family: "Menlo", "Lucida Console", monospace;
+  padding: 0;
+  color: #333333;
+  font-family: "Menlo", "Lucida Console", monospace;
 }

 pre {
   font-size: 85%;
 }
+
+img.center {
+  display: block;
+  margin-left: auto;
+  margin-right: auto;
+}
diff --git a/docs/setup/fig/jobmanager_ha_overview.png b/docs/setup/fig/jobmanager_ha_overview.png
new file mode 100644
index 0000000000000000000000000000000000000000..ff82cae5f8e5f7d32953b5f4aa8a375968e4d52c
Binary files /dev/null and b/docs/setup/fig/jobmanager_ha_overview.png differ
diff --git a/docs/setup/jobmanager_high_availability.md b/docs/setup/jobmanager_high_availability.md
new file mode 100644
index 0000000000000000000000000000000000000000..dec0cdcbeef3441d70eff7993425f60b9a572cef
--- /dev/null
+++ b/docs/setup/jobmanager_high_availability.md
@@ -0,0 +1,121 @@
---
title: "JobManager High Availability (HA)"
---

The JobManager is the coordinator of each Flink deployment. It is responsible for both *scheduling* and *resource management*.

By default, there is a single JobManager instance per Flink cluster. This creates a *single point of failure* (SPOF): if the JobManager crashes, no new programs can be submitted and running programs fail.

With JobManager High Availability, you can run multiple JobManager instances per Flink cluster and thereby circumvent the *SPOF*.

The general idea of JobManager high availability is that there is a **single leading JobManager** at any time and **multiple standby JobManagers** that take over leadership in case the leader fails. This guarantees that there is **no single point of failure**, and programs can make progress as soon as a standby JobManager has taken leadership. There is no explicit distinction between standby and master JobManager instances: each JobManager can take the role of master or standby.

As an example, consider the following setup with three JobManager instances:

<img src="fig/jobmanager_ha_overview.png" class="center" alt="Overview of a JobManager high availability setup with three JobManager instances" />

## Configuration

To enable JobManager High Availability, you have to configure a **ZooKeeper quorum** and set up a **masters file** with all JobManager hosts.

Flink leverages **[ZooKeeper](http://zookeeper.apache.org)** for *distributed coordination* between all running JobManager instances. ZooKeeper is a separate service from Flink, which provides highly reliable distributed coordination via leader election and light-weight consistent state storage. Check out [ZooKeeper's Getting Started Guide](http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html) for more information about ZooKeeper.

Configuring a ZooKeeper quorum in `conf/flink-conf.yaml` *enables* high availability mode, and all Flink components try to connect to a JobManager via coordination through ZooKeeper.

- **ZooKeeper quorum** (required): A *ZooKeeper quorum* is a replicated group of ZooKeeper servers, which provide the distributed coordination service.

  <pre>
    ha.zookeeper.quorum: address1:2181[,...],addressX:2181
  </pre>

  Each *addressX:port* refers to a ZooKeeper server, which is reachable by Flink at the given address and port. A concrete sketch follows below this list.

- The following configuration keys are optional:

  - `ha.zookeeper.dir: /flink [default]`: The ZooKeeper directory to use for coordination.
  - TODO Add client configuration keys
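For example, a minimal sketch of the HA entries in `conf/flink-conf.yaml` for a three-node quorum might look as follows (the hostnames `zk-host-1` to `zk-host-3` are placeholders for your own ZooKeeper machines; 2181 is ZooKeeper's default client port):

<pre>
ha.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181,zk-host-3:2181
ha.zookeeper.dir: /flink
</pre>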
## Starting an HA-cluster

In order to start an HA-cluster, configure the *masters* file in `conf/masters`:

- **masters file**: The *masters file* contains all hosts on which JobManagers are started. A concrete sketch follows below.

  <pre>
jobManagerAddress1
[...]
jobManagerAddressX
  </pre>
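For instance, a masters file for a setup with two JobManagers on the hypothetical hosts `master-1` and `master-2` could simply read:

<pre>
master-1
master-2
</pre>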
After configuring the masters and the ZooKeeper quorum, you can use the provided cluster startup scripts as usual. They will start an HA-cluster; a minimal start sequence is sketched below. **Keep in mind that the ZooKeeper quorum has to be running when you call the scripts.**
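Assuming you also use the bundled ZooKeeper helper described in the next section, a complete start sequence is just the two script calls below; both script names are taken from the example at the end of this page:

<pre>
$ bin/start-zookeeper-quorum.sh    # start ZooKeeper first; the JobManagers need it for leader election
$ bin/start-cluster-streaming.sh   # then start the HA-cluster as usual
</pre>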
## Running ZooKeeper

If you don't have a running ZooKeeper installation, you can use the helper scripts that ship with Flink.

There is a ZooKeeper configuration template in `conf/zoo.cfg`. You can configure the hosts to run ZooKeeper on with the `server.X` entries, where X is a unique ID for each server:

  <pre>
server.X=addressX:peerPort:leaderPort
[...]
server.Y=addressY:peerPort:leaderPort
  </pre>
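For example, a sketch of the `server.X` entries for a three-server quorum, with placeholder hostnames and ZooKeeper's conventional peer and leader-election ports (2888 and 3888), might look like this:

<pre>
server.0=zk-host-0:2888:3888
server.1=zk-host-1:2888:3888
server.2=zk-host-2:2888:3888
</pre>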
The script `bin/start-zookeeper-quorum.sh` will start a ZooKeeper server on each of the configured hosts. The started processes launch ZooKeeper servers via a Flink wrapper, which reads the configuration from `conf/zoo.cfg` and makes sure to set some required configuration values for convenience. In production setups, it is recommended to manage your own ZooKeeper installation.

## Example: Start and stop a local HA-cluster with 2 JobManagers

1. **Configure ZooKeeper quorum** in `conf/flink-conf.yaml`:

   <pre>
    ha.zookeeper.quorum: localhost
   </pre>

2. **Configure masters** in `conf/masters`:

   <pre>
localhost
localhost
   </pre>

3. **Configure ZooKeeper server** in `conf/zoo.cfg` (currently it's only possible to run a single ZooKeeper server per machine):

   <pre>
    server.0=localhost:2888:3888
   </pre>

4. **Start ZooKeeper quorum**:

   <pre>
$ bin/start-zookeeper-quorum.sh
Starting zookeeper daemon on host localhost.
   </pre>

5. **Start an HA-cluster**:

   <pre>
$ bin/start-cluster-streaming.sh
Starting HA cluster (streaming mode) with 2 masters and 1 peers in ZooKeeper quorum.
Starting jobmanager daemon on host localhost.
Starting jobmanager daemon on host localhost.
Starting taskmanager daemon on host localhost.
   </pre>

6. **Stop ZooKeeper quorum and cluster**:

   <pre>
$ bin/stop-cluster.sh
Stopping taskmanager daemon (pid: 7647) on localhost.
Stopping jobmanager daemon (pid: 7495) on host localhost.
Stopping jobmanager daemon (pid: 7349) on host localhost.
$ bin/stop-zookeeper-quorum.sh
Stopping zookeeper daemon (pid: 7101) on host localhost.
   </pre>