Commit 844c874b authored by Ufuk Celebi

[FLINK-4317, FLIP-3] [docs] Restructure docs

- Add redirect layout
- Remove Maven artifact name warning
- Add info box if stable, but not latest
- Add font-awesome 4.6.3
- Add sidenav layout

This closes #2387.
Parent c09ff035
......@@ -225,7 +225,7 @@ The Apache Flink project bundles the following files under the MIT License:
- dagre v0.7.4 (https://github.com/cpettitt/dagre) - Copyright (c) 2012-2014 Chris Pettitt
- dagre-d3 v0.4.17 (https://github.com/cpettitt/dagre-d3) - Copyright (c) 2013 Chris Pettitt
- EvEmitter v1.0.2 (https://github.com/metafizzy/ev-emitter) - Copyright (C) 2016 David DeSandro
- Font Awesome (code) v4.5.0 (http://fontawesome.io) - Copyright (c) 2014 Dave Gandy
- Font Awesome (code) v4.5.0, v4.6.3 (http://fontawesome.io) - Copyright (c) 2014 Dave Gandy
- graphlib v1.0.7 (https://github.com/cpettitt/graphlib) - Copyright (c) 2012-2014 Chris Pettitt
- imagesloaded v4.1.0 (https://github.com/desandro/imagesloaded) - Copyright (C) 2016 David DeSandro
- JQuery v2.2.0 (http://jquery.com/) - Copyright 2014 jQuery Foundation and other contributors
......@@ -300,7 +300,8 @@ The Apache Flink project bundles the following fonts under the
Open Font License (OFT) - http://scripts.sil.org/OFL
- Font Awesome (http://fortawesome.github.io/Font-Awesome/) - Created by Dave Gandy
-> fonts in "flink-runtime-web/web-dashboard/assets/fonts"
-> fonts in "flink-runtime-web/web-dashboard/web/fonts"
-> fonts in "docs/page/font-awesome/fonts"
-----------------------------------------------------------------------
The ISC License
......
......@@ -109,43 +109,19 @@ These will be replaced by an info or warning label. You can change the text of th
### Documentation
#### Top Navigation
#### Navigation
You can modify the top-level navigation in two places. You can either edit the `_includes/navbar.html` file or add tags to your page frontmatter (recommended).
The navigation on the left side of the docs is automatically generated when building the docs. You can modify the markup in `_includes/sidenav.html`.
# Top-level navigation
top-nav-group: apis
top-nav-pos: 2
top-nav-title: <strong>Batch Guide</strong> (DataSet API)
The structure of the navigation is determined by the front matter of all pages. The fields used to determine the structure are:
This adds the page to the group `apis` (via `top-nav-group`) at position `2` (via `top-nav-pos`). Furthermore, it specifies a custom title for the navigation via `top-nav-title`. If this field is missing, the regular page title (via `title`) will be used. If no position is specified, the element will be added to the end of the group. If no group is specified, the page will not show up.
- `nav-id` => ID of this page. Other pages can use this ID as their parent ID.
- `nav-parent_id` => ID of the parent. This page will be listed under the page with id `nav-parent_id`.
Currently, there are groups `quickstart`, `setup`, `deployment`, `apis`, `libs`, and `internals`.
Level 0 is made up of all pages that have `nav-parent_id` set to `root`. There is no limit on how many levels you can nest.
#### Sub Navigation
The `title` of the page is used as the default link text. You can override this via `nav-title`. The relative position per navigational level is determined by `nav-pos`.
A sub navigation is shown if the field `sub-nav-group` is specified. A sub navigation groups all pages with the same `sub-nav-group`. Check out the streaming or batch guide as an example.
If you have a page with sub pages, the link target will be used to expand the sub level navigation. If you want to actually add a link to the page as well, you can add the `nav-show_overview: true` field to the front matter. This will then add an `Overview` sub page to the expanded list.
# Sub-level navigation
sub-nav-group: batch
sub-nav-id: dataset_api
sub-nav-pos: 1
sub-nav-title: DataSet API
The fields work similarly to their `top-nav-*` counterparts.
In addition, you can specify a hierarchy via `sub-nav-id` and `sub-nav-parent`:
# Sub-level navigation
sub-nav-group: batch
sub-nav-parent: dataset_api
sub-nav-pos: 1
sub-nav-title: Transformations
This will show the `Transformations` page under the `DataSet API` page. The `sub-nav-parent` field has to have a matching `sub-nav-id`.
#### Breadcrumbs
Pages with sub navigations can use breadcrumbs like `Batch Guide > Libraries > Machine Learning > Optimization`.
The breadcrumbs for the last page are generated from the front matter. For a sub navigation root to appear (like `Batch Guide` in the example above), you have to specify `sub-nav-group-title`. This field designates a group page as the root.
The nesting is also used for the breadcrumbs like `Application Development > Libraries > Machine Learning > Optimization`.
......@@ -29,6 +29,7 @@
version: "1.2-SNAPSHOT"
version_hadoop1: "1.2-hadoop1-SNAPSHOT"
version_short: "1.2" # Used for the top navbar w/o snapshot suffix
is_snapshot_version: true
# This suffix is appended to the Scala-dependent Maven artifact names
scala_version_suffix: "_2.10"
......@@ -40,6 +41,16 @@ jira_url: "https://issues.apache.org/jira/browse/FLINK"
github_url: "https://github.com/apache/flink"
download_url: "http://flink.apache.org/downloads.html"
# Flag whether this is the latest stable version or not. If not, a warning
# will be printed pointing to the docs of the latest stable version.
is_latest: true
is_stable: false
latest_stable_url: http://ci.apache.org/projects/flink/flink-docs-release-1.1
previous_docs:
1.1: http://ci.apache.org/projects/flink/flink-docs-release-1.1
1.0: http://ci.apache.org/projects/flink/flink-docs-release-1.0
#------------------------------------------------------------------------------
# BUILD CONFIG
#------------------------------------------------------------------------------
......@@ -47,14 +58,16 @@ download_url: "http://flink.apache.org/downloads.html"
# to change anything here.
#------------------------------------------------------------------------------
# Used in some documents to initialize arrays. Don't delete.
array: []
defaults:
-
scope:
path: ""
values:
layout: plain
top-nav-pos: 99999 # Move to end
sub-nav-pos: 99999 # Move to end
nav-pos: 99999 # Move to end if no pos specified
markdown: KramdownPygments
highlighter: pygments
......
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
{% capture quickstart %}{{site.baseurl}}/quickstart{% endcapture %}
{% capture setup %}{{site.baseurl}}/setup{% endcapture %}
{% capture apis %}{{site.baseurl}}/apis{% endcapture %}
{% capture libs %}{{site.baseurl}}/libs{% endcapture %}
{% capture internals %}{{site.baseurl}}/internals{% endcapture %}
<!-- Top navbar. -->
<nav class="navbar navbar-default navbar-fixed-top">
<div class="container">
<!-- The logo. -->
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<div class="navbar-logo">
<a href="http://flink.apache.org"><img alt="Apache Flink" src="{{ site.baseurl }}/page/img/navbar-brand-logo.jpg"></a>
</div>
</div><!-- /.navbar-header -->
<!-- The navigation links. -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav">
<li class="hidden-sm {% if page.url == '/' %}active{% endif %}"><a href="{{ site.baseurl}}/">Docs v{{ site.version_short }}</a></li>
<li class="{% if page.url == '/concepts/concepts.html' %}active{% endif %}"><a href="{{ site.baseurl}}/concepts/concepts.html">Concepts</a></li>
<!-- Setup -->
<li class="dropdown{% if page.url contains '/setup/' %} active{% endif %}">
<a href="{{ setup }}" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">Setup <span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
{% assign setup_group = (site.pages | where: "top-nav-group" , "setup" | sort: "top-nav-pos") %}
{% for setup_group_page in setup_group %}
<li class="{% if page.url contains setup_group_page.url %}active{% endif %}"><a href="{{ site.baseurl }}{{ setup_group_page.url }}">{% if setup_group_page.top-nav-title %}{{ setup_group_page.top-nav-title }}{% else %}{{ setup_group_page.title }}{% endif %}</a></li>
{% endfor %}
<li class="divider"></li>
<li role="presentation" class="dropdown-header"><strong>Quickstart</strong></li>
<!-- Quickstart -->
{% assign quickstart_group = (site.pages | where: "top-nav-group" , "quickstart" | sort: "top-nav-pos") %}
{% for quickstart_page in quickstart_group %}
<li class="{% if page.url contains quickstart_page.url %}active{% endif %}"><a href="{{ site.baseurl }}{{ quickstart_page.url }}">{% if quickstart_page.top-nav-title %}{{ quickstart_page.top-nav-title }}{% else %}{{ quickstart_page.title }}{% endif %}</a></li>
{% endfor %}
<li class="divider"></li>
<li role="presentation" class="dropdown-header"><strong>Deployment</strong></li>
{% assign deployment_group = (site.pages | where: "top-nav-group" , "deployment" | sort: "top-nav-pos") %}
{% for deployment_group_page in deployment_group %}
<li class="{% if page.url contains deployment_group_page.url %}active{% endif %}"><a href="{{ site.baseurl }}{{ deployment_group_page.url }}">{% if deployment_group_page.top-nav-title %}{{ deployment_group_page.top-nav-title }}{% else %}{{ deployment_group_page.title }}{% endif %}</a></li>
{% endfor %}
</ul>
</li>
<!-- Programming Guides -->
<li class="dropdown{% unless page.url contains '/libs/' %}{% if page.url contains '/apis/' %} active{% endif %}{% endunless %}">
<a href="{{ apis }}" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">Programming Guides <span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
{% assign apis_group = (site.pages | where: "top-nav-group" , "apis" | sort: "top-nav-pos") %}
{% for apis_group_page in apis_group %}
<li class="{% if page.url contains apis_group_page.url %}active{% endif %}"><a href="{{ site.baseurl }}{{ apis_group_page.url }}">{% if apis_group_page.top-nav-title %}{{ apis_group_page.top-nav-title }}{% else %}{{ apis_group_page.title }}{% endif %}</a></li>
{% endfor %}
</ul>
</li>
<!-- Libraries -->
<li class="dropdown{% if page.url contains '/libs/' %} active{% endif %}">
<a href="{{ libs }}" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">Libraries <span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
{% assign libs_group = (site.pages | where: "top-nav-group" , "libs" | sort: "top-nav-pos") %}
{% for libs_page in libs_group %}
<li class="{% if page.url contains libs_page.url %}active{% endif %}"><a href="{{ site.baseurl }}{{ libs_page.url }}">{% if libs_page.top-nav-title %}{{ libs_page.top-nav-title }}{% else %}{{ libs_page.title }}{% endif %}</a></li>
{% endfor %}
</ul>
</li>
<!-- Internals -->
<li class="dropdown{% if page.url contains '/internals/' %} active{% endif %}">
<a href="{{ internals }}" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">Internals <span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
<li role="presentation" class="dropdown-header"><strong>Contribute</strong></li>
<li><a href="http://flink.apache.org/how-to-contribute.html"><small><span class="glyphicon glyphicon-new-window"></span></small> How to Contribute</a></li>
<li><a href="http://flink.apache.org/contribute-code.html#coding-guidelines"><small><span class="glyphicon glyphicon-new-window"></span></small> Coding Guidelines</a></li>
{% assign internals_group = (site.pages | where: "top-nav-group" , "internals" | sort: "top-nav-pos") %}
{% for internals_page in internals_group %}
<li class="{% if page.url contains internals_page.url %}active{% endif %}"><a href="{{ site.baseurl }}{{ internals_page.url }}">{% if internals_page.top-nav-title %}{{ internals_page.top-nav-title }}{% else %}{{ internals_page.title }}{% endif %}</a></li>
{% endfor %}
</ul>
</li>
</ul>
<form class="navbar-form navbar-right hidden-sm hidden-md" role="search" action="{{site.baseurl}}/search-results.html">
<div class="form-group">
<input type="text" class="form-control" size="16px" name="q" placeholder="Search all pages">
</div>
<button type="submit" class="btn btn-default">Search</button>
</form>
</div><!-- /.navbar-collapse -->
</div><!-- /.container -->
</nav>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
{% comment %}
==============================================================================
Extract the active nav IDs.
==============================================================================
{% endcomment %}
{% assign active_nav_ids = site.array %}
{% assign parent_id = page.nav-parent_id %}
{% for i in (1..10) %}
{% if parent_id %}
{% assign active_nav_ids = active_nav_ids | push: parent_id %}
{% assign current = (site.pages | where: "nav-id" , parent_id | sort: "nav-pos") %}
{% if current.size > 0 %}
{% assign parent_id = current[0].nav-parent_id %}
{% else %}
{% break %}
{% endif %}
{% else %}
{% break %}
{% endif %}
{% endfor %}
{% comment %}
==============================================================================
Build the nested list from nav-id and nav-parent_id relations.
==============================================================================
This builds a nested list from all pages. The fields used to determine the
structure are:
- 'nav-id' => ID of this page. Other pages can use this ID as their
parent ID.
- 'nav-parent_id' => ID of the parent. This page will be listed under
the page with id 'nav-parent_id'.
Level 0 is made up of all pages that have nav-parent_id set to 'root'.
The 'title' of the page is used as the default link text. You can
override this via 'nav-title'. The relative position per navigational
level is determined by 'nav-pos'.
{% endcomment %}
{% assign elementsPosStack = site.array %}
{% assign posStack = site.array %}
{% assign elements = site.array %}
{% assign children = (site.pages | where: "nav-parent_id" , "root" | sort: "nav-pos") %}
{% if children.size > 0 %}
{% assign elements = elements | push: children %}
{% endif %}
{% assign elementsPos = 0 %}
{% assign pos = 0 %}
<div class="sidenav-logo">
<p><a href="{{ site.baseurl }}"><img class="bottom" alt="Apache Flink" src="{{ site.baseurl }}/page/img/navbar-brand-logo.jpg"></a> v{{ site.version }}</p>
</div>
<ul id="sidenav">
{% for i in (1..10000) %}
{% if pos >= elements[elementsPos].size %}
{% if elementsPos == 0 %}
{% break %}
{% else %}
{% assign elementsPos = elementsPosStack | last %}
{% assign pos = posStack | last %}
</li></ul></div>
{% assign elementsPosStack = elementsPosStack | pop %}
{% assign posStack = posStack | pop %}
{% endif %}
{% else %}
{% assign this = elements[elementsPos][pos] %}
{% if this.url == page.url %}
{% assign active = true %}
{% elsif this.nav-id and active_nav_ids contains this.nav-id %}
{% assign active = true %}
{% else %}
{% assign active = false %}
{% endif %}
{% capture title %}{% if this.nav-title %}{{ this.nav-title }}{% else %}{{ this.title }}{% endif %}{% endcapture %}
{% capture target %}"{{ site.baseurl }}{{ this.url }}"{% if active %} class="active"{% endif %}{% endcapture %}
{% capture overview_target %}"{{ site.baseurl }}{{ this.url }}"{% if this.url == page.url %} class="active"{% endif %}{% endcapture %}
{% assign pos = pos | plus: 1 %}
{% if this.nav-id %}
{% assign children = (site.pages | where: "nav-parent_id" , this.nav-id | sort: "nav-pos") %}
{% if children.size > 0 %}
{% capture collapse_target %}"#collapse-{{ i }}" data-toggle="collapse"{% if active %} class="active"{% endif %}{% endcapture %}
{% capture expand %}{% unless active %} <i class="fa fa-caret-down pull-right" aria-hidden="true" style="padding-top: 4px"></i>{% endunless %}{% endcapture %}
<li><a href={{ collapse_target }}>{{ title }}{{ expand }}</a><div class="collapse{% if active %} in{% endif %}" id="collapse-{{ i }}"><ul>
{% if this.nav-show_overview %}<li><a href={{ overview_target }}>Overview</a></li>{% endif %}
{% assign elements = elements | push: children %}
{% assign elementsPosStack = elementsPosStack | push: elementsPos %}
{% assign posStack = posStack | push: pos %}
{% assign elementsPos = elements.size | minus: 1 %}
{% assign pos = 0 %}
{% else %}
<li><a href={{ target }}>{{ title }}</a></li>
{% endif %}
{% else %}
<li><a href={{ target }}>{{ title }}</a></li>
{% endif %}
{% endif %}
{% endfor %}
<li class="divider"></li>
<li><a href="http://flink.apache.org"><i class="fa fa-external-link" aria-hidden="true"></i> Project Page</a></li>
</ul>
<div class="sidenav-search-box">
<form class="navbar-form" role="search" action="{{site.baseurl}}/search-results.html">
<div class="form-group">
<input type="text" class="form-control" size="16px" name="q" placeholder="Search Docs">
</div>
<button type="submit" class="btn btn-default">Go</button>
</form>
</div>
<div class="sidenav-versions">
<div class="dropdown">
<button class="btn btn-default dropdown-toggle" type="button" data-toggle="dropdown">Pick Docs Version
<span class="caret"></span></button>
<ul class="dropdown-menu">
{% for d in site.previous_docs %}
<li><a href="{{ d[1] }}">v{{ d[0] }}</a></li>
{% endfor %}
</ul>
</div>
</div>
......@@ -32,6 +32,7 @@ under the License.
<link rel="stylesheet" href="{{ site.baseurl }}/page/css/flink.css">
<link rel="stylesheet" href="{{ site.baseurl }}/page/css/syntax.css">
<link rel="stylesheet" href="{{ site.baseurl }}/page/css/codetabs.css">
<link rel="stylesheet" href="{{ site.baseurl }}/page/font-awesome/css/font-awesome.min.css">
{% if page.mathjax %}
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
......@@ -53,20 +54,36 @@ under the License.
<![endif]-->
</head>
<body>
{% comment %} Includes are found in the _includes directory. {% endcomment %}
{% include navbar.html %}
{% if page.mathjax %}
{% include latex_commands.html %}
{% endif %}
<!-- Main content. -->
<div class="container">
{% if site.is_stable %}
{% unless site.is_latest %}
<div class="row">
<div class="col-sm-12">
<div class="alert alert-info">
<strong>Note</strong>: This documentation is for Flink version <strong>{{ site.version }}</strong>. A more recent stable version is available. Please consider updating, and <a href="{{ site.latest_stable_url }}">check out the documentation for the latest stable version</a>.
</div>
</div>
</div>
{% endunless %}
{% endif %}
{% comment %}
This is the base for all content. The content from the layouts found in
the _layouts directory goes here.
{% endcomment %}
{{ content }}
<div class="row">
<div class="col-lg-3">
{% include sidenav.html %}
</div>
<div class="col-lg-9 content">
{% if page.mathjax %}
{% include latex_commands.html %}
{% endif %}
{{ content }}
</div>
</div>
</div><!-- /.container -->
<!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
......
......@@ -19,103 +19,39 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<div class="row">
{% if page.sub-nav-group %}
{% comment %}
The plain layout with a sub navigation.
- This is activated via the 'sub-nav-group' field in the preamble.
- All pages of this sub nav group will be displayed in the sub navigation:
* Each element without a 'sub-nav-parent' field will be displayed on the 1st level, where the position is defined via 'sub-nav-pos'.
* If the page should be displayed as a child element, it needs to specify a 'sub-nav-parent' field, which matches the 'sub-nav-id' of its parent. The parent only needs to specify this if it expects child nodes.
{% endcomment %}
<!-- Sub Navigation -->
<div class="col-sm-3">
<ul id="sub-nav">
{% comment %} Get all pages belonging to this group sorted by their position {% endcomment %}
{% assign group = (site.pages | where: "sub-nav-group" , page.sub-nav-group | where: "sub-nav-parent" , nil | sort: "sub-nav-pos") %}
{% for group_page in group %}
{% if group_page.sub-nav-id %}
{% assign sub_group = (site.pages | where: "sub-nav-group" , page.sub-nav-group | where: "sub-nav-parent" , group_page.sub-nav-id | sort: "sub-nav-pos") %}
{% else %}
{% assign sub_group = nil %}
{% endif %}
<li><a href="{{ site.baseurl }}{{ group_page.url }}" class="{% if page.url contains group_page.url %}active{% endif %}">{% if group_page.sub-nav-title %}{{ group_page.sub-nav-title }}{% else %}{{ group_page.title }}{% endif %}</a>
{% if sub_group and sub_group.size() > 0 %}
<ul>
{% for sub_group_page in sub_group %}
<li><a href="{{ site.baseurl }}{{ sub_group_page.url }}" class="{% if page.url contains sub_group_page.url or (sub_group_page.sub-nav-id and page.sub-nav-parent and sub_group_page.sub-nav-id == page.sub-nav-parent) %}active{% endif %}">{% if sub_group_page.sub-nav-title %}{{ sub_group_page.sub-nav-title }}{% else %}{{ sub_group_page.title }}{% endif %}</a></li>
{% endfor %}
</ul>
{% endif %}
</li>
{% endfor %}
</ul>
</div>
<!-- Main -->
<div class="col-sm-9">
<!-- Top anchor -->
<a href="#top"></a>
{% assign active_pages = site.array %}
{% assign active = page %}
{% for i in (1..10) %}
{% assign active_pages = active_pages | push: active %}
{% if active.nav-parent_id %}
{% assign next = (site.pages | where: "nav-id" , active.nav-parent_id ) %}
{% if next.size > 0 %}
{% assign active = next[0] %}
{% else %}
{% break %}
{% endif %}
{% else %}
{% break %}
{% endif %}
{% endfor %}
{% assign active_pages = active_pages | reverse %}
<ol class="breadcrumb">
{% for p in active_pages %}
{% capture title %}{% if p.nav-title %}{{ p.nav-title }}{% else %}{{ p.title }}{% endif %}{% endcapture %}
{% if forloop.last == true %}
<li class="active">{{ title }}</li>
{% elsif p.nav-show_overview %}
<li><a href="{{ site.baseurl }}{{ p.url }}">{{ title }}</a></li>
{% else %}
<li>{{ title }}</li>
{% endif %}
{% endfor %}
</ol>
<h1>{{ page.title }}{% if page.is_beta %} <span class="beta">Beta</span>{% endif %}</h1>
<!-- Artifact name change warning. Remove for the 1.0 release. -->
<div class="panel panel-default">
<div class="panel-body"><strong>Important</strong>: Maven artifacts which depend on Scala are now suffixed with the Scala major version, e.g. "2.10" or "2.11". Please consult the <a href="https://cwiki.apache.org/confluence/display/FLINK/Maven+artifact+names+suffixed+with+Scala+version">migration guide on the project Wiki</a>.</div>
</div>
<!-- Breadcrumbs above the main heading -->
<ol class="breadcrumb">
{% for group_page in group %}
{% if group_page.sub-nav-group-title %}
<li><a href="{{ site.baseurl }}{{ group_page.url }}">{{ group_page.sub-nav-group-title }}</a></li>
{% endif %}
{% endfor %}
{% if page.sub-nav-parent %}
{% assign parent = (site.pages | where: "sub-nav-group" , page.sub-nav-group | where: "sub-nav-id" , page.sub-nav-parent | first) %}
{% if parent %}
{% if parent.sub-nav-parent %}
{% assign grandparent = (site.pages | where: "sub-nav-group" , page.sub-nav-group | where: "sub-nav-id" , parent.sub-nav-parent | first) %}
{% if grandparent %}
<li><a href="{{ site.baseurl }}{{ grandparent.url }}">{% if grandparent.sub-nav-title %}{{ grandparent.sub-nav-title }}{% else %}{{ grandparent.title }}{% endif %}</a></li>
{% endif %}
{% endif %}
<li><a href="{{ site.baseurl }}{{ parent.url }}">{% if parent.sub-nav-title %}{{ parent.sub-nav-title }}{% else %}{{ parent.title }}{% endif %}</a></li>
{% endif %}
{% endif %}
<li class="active">{% if page.sub-nav-title %}{{ page.sub-nav-title }}{% else %}{{ page.title }}{% endif %}</li>
</ol>
<div class="text">
<!-- Main heading -->
<h1>{{ page.title }}{% if page.is_beta %} <span class="beta">(Beta)</span>{% endif %}</h1>
<!-- Content -->
{{ content }}
</div>
</div>
{% else %}
{% comment %}
The plain layout without a sub navigation (only text).
{% endcomment %}
<div class="col-md-8 col-md-offset-2 text">
<!-- Artifact name change warning. Remove for the 1.0 release. -->
<div class="panel panel-default">
<div class="panel-body"><strong>Important</strong>: Maven artifacts which depend on Scala are now suffixed with the Scala major version, e.g. "2.10" or "2.11". Please consult the <a href="https://cwiki.apache.org/confluence/display/FLINK/Maven+artifact+names+suffixed+with+Scala+version">migration guide on the project Wiki</a>.</div>
</div>
<h1>{{ page.title }}{% if page.is_beta %} <span class="beta">Beta</span>{% endif %}</h1>
{{ content }}
</div>
{% endif %}
{% comment %}
Removed until Robert complains... ;)
<div class="col-sm-8 col-sm-offset-2">
<!-- Disqus thread and some vertical offset -->
<div style="margin-top: 75px; margin-bottom: 50px" id="disqus_thread"></div>
</div>
{% endcomment %}
</div>
---
title: DataSet API
layout: base
---
<meta http-equiv="refresh" content="1; url={{ site.baseurl }}/apis/batch/index.html" />
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
......@@ -23,4 +20,8 @@ specific language governing permissions and limitations
under the License.
-->
The *DataSet API guide* has been moved. Redirecting to [{{ site.baseurl }}/apis/batch/index.html]({{ site.baseurl }}/apis/batch/index.html) in 1 second.
<meta http-equiv="refresh" content="1; url={{ site.baseurl }}{{ page.redirect }}" />
<h1>Page '{{ page.title }}' Has Moved</h1>
The <strong>{{ page.title }}</strong> has been moved. Redirecting to <a href="{{ site.baseurl }}{{ page.redirect }}">{{ site.baseurl }}{{ page.redirect }}</a> in 1 second.
All image files in the folder and its subfolders are
licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\ No newline at end of file
---
title: Gelly
---
<meta http-equiv="refresh" content="1; url={{ site.baseurl }}/apis/batch/libs/gelly/index.html" />
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The *Gelly guide* has been moved. Redirecting to [{{ site.baseurl }}/apis/batch/libs/gelly/index.html]({{ site.baseurl }}/apis/batch/libs/gelly/index.html) in 1 second.
---
title: "Table API and SQL"
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<meta http-equiv="refresh" content="1; url={{ site.baseurl }}/apis/table.html" />
The *Table API guide* has been moved. Redirecting to [{{ site.baseurl }}/apis/table.html]({{ site.baseurl }}/apis/table.html) in 1 second.
\ No newline at end of file
All image files in the folder and its subfolders are
licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\ No newline at end of file
---
title: DataStream API
---
<meta http-equiv="refresh" content="1; url={{ site.baseurl }}/apis/streaming/index.html" />
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The *DataStream API guide* has been moved. Redirecting to [{{ site.baseurl }}/apis/streaming/index.html]({{ site.baseurl }}/apis/streaming/index.html) in 1 second.
---
title: "Concepts"
title: Concepts
nav-id: concepts
nav-pos: 1
nav-title: '<i class="fa fa-map-o" aria-hidden="true"></i> Concepts'
nav-parent_id: root
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
......@@ -38,7 +41,7 @@ omit this here for simplicity).
In most cases, there is a one-to-one correspondence between the transformations in the programs and the operators
in the dataflow. Sometimes, however, one transformation may consist of multiple transformation operators.
<img src="fig/program_dataflow.svg" alt="A DataStream program, and its dataflow." class="offset" width="80%" />
<img src="{{ site.baseurl }}/fig/program_dataflow.svg" alt="A DataStream program, and its dataflow." class="offset" width="80%" />
{% top %}
......@@ -51,7 +54,7 @@ in different threads and on different machines or containers.
The number of operator subtasks is the **parallelism** of that particular operator. The parallelism of a stream
is always that of its producing operator. Different operators of the program may have a different parallelism.
<img src="fig/parallel_dataflow.svg" alt="A parallel dataflow" class="offset" width="80%" />
<img src="{{ site.baseurl }}/fig/parallel_dataflow.svg" alt="A parallel dataflow" class="offset" width="80%" />
Streams can transport data between two operators in a *one-to-one* (or *forwarding*) pattern, or in a *redistributing* pattern:
......@@ -77,7 +80,7 @@ The chaining behavior can be configured in the APIs.
The sample dataflow in the figure below is executed with five subtasks, and hence with five parallel threads.
<img src="fig/tasks_chains.svg" alt="Operator chaining into Tasks" class="offset" width="80%" />
<img src="{{ site.baseurl }}/fig/tasks_chains.svg" alt="Operator chaining into Tasks" class="offset" width="80%" />
{% top %}
......@@ -105,7 +108,7 @@ The **client** is not part of the runtime and program execution, but is used to
After that, the client can disconnect, or stay connected to receive progress reports. The client runs either as part of the
Java/Scala program that triggers the execution, or in the command line process `./bin/flink run ...`.
<img src="fig/processes.svg" alt="The processes involved in executing a Flink dataflow" class="offset" width="80%" />
<img src="{{ site.baseurl }}/fig/processes.svg" alt="The processes involved in executing a Flink dataflow" class="offset" width="80%" />
{% top %}
......@@ -125,7 +128,7 @@ separate container, for example). Having multiple slots
means more subtasks share the same JVM. Tasks in the same JVM share TCP connections (via multiplexing) and
heartbeat messages. They may also share data sets and data structures, thus reducing the per-task overhead.
<img src="fig/tasks_slots.svg" alt="A TaskManager with Task Slots and Tasks" class="offset" width="80%" />
<img src="{{ site.baseurl }}/fig/tasks_slots.svg" alt="A TaskManager with Task Slots and Tasks" class="offset" width="80%" />
By default, Flink allows subtasks to share slots, if they are subtasks of different tasks, but from the same
job. The result is that one slot may hold an entire pipeline of the job. Allowing this *slot sharing*
......@@ -146,7 +149,7 @@ The mechanism for that are the *resource groups*, which define what (sub)tasks m
As a rule of thumb, a good default number of task slots would be the number of CPU cores.
With hyper-threading, each slot then takes 2 or more hardware thread contexts.
<img src="fig/slot_sharing.svg" alt="TaskManagers with shared Task Slots" class="offset" width="80%" />
<img src="{{ site.baseurl }}/fig/slot_sharing.svg" alt="TaskManagers with shared Task Slots" class="offset" width="80%" />
{% top %}
......@@ -161,7 +164,7 @@ Windows can be *time driven* (example: every 30 seconds) or *data driven* (examp
One typically distinguishes different types of windows, such as *tumbling windows* (no overlap),
*sliding windows* (with overlap), and *session windows* (gap of activity).
<img src="fig/windows.svg" alt="Time- and Count Windows" class="offset" width="80%" />
<img src="{{ site.baseurl }}/fig/windows.svg" alt="Time- and Count Windows" class="offset" width="80%" />
More window examples can be found in this [blog post](https://flink.apache.org/news/2015/12/04/Introducing-windows.html).
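As a minimal, illustrative sketch (class name and toy source data are made up), a tumbling time window on a keyed stream could look like this:

{% highlight java %}
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Toy source; in practice this would be Kafka, a socket, etc.
        DataStream<Tuple2<String, Integer>> events =
                env.fromElements(Tuple2.of("sensor-1", 1), Tuple2.of("sensor-2", 1));

        events
            .keyBy(0)                      // partition by the sensor id (tuple field 0)
            .timeWindow(Time.seconds(30))  // tumbling, time-driven window of 30 seconds
            .sum(1)                        // aggregate the count field per key and window
            .print();

        env.execute("Windowed count");
    }
}
{% endhighlight %}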
......@@ -174,15 +177,15 @@ of time:
- **Event Time** is the time when an event was created. It is usually described by a timestamp in the events,
for example attached by the producing sensor, or the producing service. Flink accesses event timestamps
via [timestamp assigners]({{ site.baseurl }}/apis/streaming/event_timestamps_watermarks.html).
via [timestamp assigners]({{ site.baseurl }}/dev/event_timestamps_watermarks.html).
- **Ingestion time** is the time when an event enters the Flink dataflow at the source operator.
- **Processing Time** is the local time at each operator that performs a time-based operation.
<img src="fig/event_ingestion_processing_time.svg" alt="Event Time, Ingestion Time, and Processing Time" class="offset" width="80%" />
<img src="{{ site.baseurl }}/fig/event_ingestion_processing_time.svg" alt="Event Time, Ingestion Time, and Processing Time" class="offset" width="80%" />
More details on how to handle time are in the [event time docs]({{ site.baseurl }}/apis/streaming/event_time.html).
More details on how to handle time are in the [event time docs]({{ site.baseurl }}/dev/event_time.html).
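A minimal sketch of selecting the notion of time for a program (event time here; processing time is the default):

{% highlight java %}
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeCharacteristicSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Switch the whole program to event time; ingestion time and
        // processing time are the other two options.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // ... sources with timestamp assigners, transformations, sinks ...
    }
}
{% endhighlight %}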
{% top %}
......@@ -199,7 +202,7 @@ and is restricted to the values of the current event's key. Aligning the keys of
makes sure that all state updates are local operations, guaranteeing consistency without transaction overhead.
This alignment also allows Flink to redistribute the state and adjust the stream partitioning transparently.
<img src="fig/state_partitioning.svg" alt="State and Partitioning" class="offset" width="50%" />
<img src="{{ site.baseurl }}/fig/state_partitioning.svg" alt="State and Partitioning" class="offset" width="50%" />
{% top %}
......@@ -214,7 +217,7 @@ of events that need to be replayed).
More details on checkpoints and fault tolerance are in the [fault tolerance docs]({{ site.baseurl }}/internals/stream_checkpointing.html).
<img src="fig/checkpoints.svg" alt="checkpoints and snapshots" class="offset" width="60%" />
<img src="{{ site.baseurl }}/fig/checkpoints.svg" alt="checkpoints and snapshots" class="offset" width="60%" />
{% top %}
......@@ -241,6 +244,6 @@ same way as well as they apply to streaming programs, with minor exceptions:
key/value indexes.
- The DataSet API introduces special synchronized (superstep-based) iterations, which are only possible on
bounded streams. For details, check out the [iteration docs]({{ site.baseurl }}/apis/batch/iterations.html).
bounded streams. For details, check out the [iteration docs]({{ site.baseurl }}/dev/batch/iterations.html).
{% top %}
---
title: "Basic API Concepts"
# Top-level navigation
top-nav-group: apis
top-nav-pos: 1
top-nav-title: <strong>Basic API Concepts</strong>
nav-parent_id: apis
nav-pos: 1
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -37,8 +34,8 @@ Depending on the type of data sources, i.e. bounded or unbounded sources you wou
write a batch program or a streaming program where the DataSet API is used for the former
and the DataStream API is used for the latter. This guide will introduce the basic concepts
that are common to both APIs but please see our
[Streaming Guide]({{ site.baseurl }}/apis/streaming/index.html) and
[Batch Guide]({{ site.baseurl }}/apis/batch/index.html) for concrete information about
[Streaming Guide]({{ site.baseurl }}/dev/datastream_api.html) and
[Batch Guide]({{ site.baseurl }}/dev/batch/index.html) for concrete information about
writing programs with each API.
**NOTE:** When showing actual examples of how the APIs can be used we will use
......@@ -167,7 +164,7 @@ In order to link against the latest SNAPSHOT versions of the code, please follow
The *flink-clients* dependency is only necessary to invoke the Flink program locally (for example to
run it standalone for testing and debugging). If you intend to only export the program as a JAR
file and [run it on a cluster]({{ site.baseurl }}/apis/cluster_execution.html), you can skip that dependency.
file and [run it on a cluster]({{ site.baseurl }}/dev/cluster_execution.html), you can skip that dependency.
{% top %}
......@@ -224,7 +221,7 @@ will do the right thing depending on the context: if you are executing
your program inside an IDE or as a regular Java program it will create
a local environment that will execute your program on your local machine. If
you created a JAR file from your program, and invoke it through the
[command line]({{ site.baseurl }}/apis/cli.html), the Flink cluster manager
[command line]({{ site.baseurl }}/setup/cli.html), the Flink cluster manager
will execute your main method and `getExecutionEnvironment()` will return
an execution environment for executing your program on a cluster.
......@@ -343,11 +340,11 @@ machine or submit your program for execution on a cluster.
The `execute()` method returns a `JobExecutionResult`, which contains execution
times and accumulator results.
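A minimal sketch of using that result (output path and job name are made up):

{% highlight java %}
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ExecuteResultSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> text = env.fromElements("to", "be", "or", "not", "to", "be");
        text.writeAsText("/tmp/words-out");   // hypothetical output path

        // execute() triggers the program and hands back a JobExecutionResult.
        JobExecutionResult result = env.execute("Word sink");
        System.out.println("Net runtime: " + result.getNetRuntime() + " ms");
    }
}
{% endhighlight %}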
Please see the [Streaming Guide]({{ site.baseurl }}/apis/streaming/index.html)
Please see the [Streaming Guide]({{ site.baseurl }}/dev/datastream_api.html)
for information about streaming data sources and sinks and for more in-depth information
about the supported transformations on DataStream.
Check out the [Batch Guide]({{ site.baseurl }}/apis/batch/index.html)
Check out the [Batch Guide]({{ site.baseurl }}/dev/batch/index.html)
for information about batch data sources and sinks and for more in-depth information
about the supported transformations on DataSet.
......@@ -634,7 +631,7 @@ data.map(new MapFunction<String, Integer> () {
#### Java 8 Lambdas
Flink also supports Java 8 Lambdas in the Java API. Please see the full [Java 8 Guide]({{ site.baseurl }}/apis/java8.html).
Flink also supports Java 8 Lambdas in the Java API. Please see the full [Java 8 Guide]({{ site.baseurl }}/dev/java8.html).
{% highlight java %}
data.filter(s -> s.startsWith("http://"));
......@@ -732,12 +729,12 @@ data.map (new RichMapFunction[String, Int] {
Rich functions provide, in addition to the user-defined function (map,
reduce, etc), four methods: `open`, `close`, `getRuntimeContext`, and
`setRuntimeContext`. These are useful for parameterizing the function
(see [Passing Parameters to Functions]({{ site.baseurl }}/apis/batch/index.html#passing-parameters-to-functions)),
(see [Passing Parameters to Functions]({{ site.baseurl }}/dev/batch/index.html#passing-parameters-to-functions)),
creating and finalizing local state, accessing broadcast variables (see
[Broadcast Variables]({{ site.baseurl }}/apis/batch/index.html#broadcast-variables), and for accessing runtime
[Broadcast Variables]({{ site.baseurl }}/dev/batch/index.html#broadcast-variables)), and for accessing runtime
information such as accumulators and counters (see
[Accumulators and Counters](#accumulators--counters)), and information
on iterations (see [Iterations]({{ site.baseurl }}/apis/batch/iterations.html)).
on iterations (see [Iterations]({{ site.baseurl }}/dev/batch/iterations.html)).
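A minimal sketch of a rich function using `open()`, `close()`, and the `RuntimeContext` (the class name and the prefix logic are made up):

{% highlight java %}
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class PrefixingMapper extends RichMapFunction<String, String> {

    private transient String prefix;   // per-task value built in open()

    @Override
    public void open(Configuration parameters) {
        // Called once per parallel task instance before the first record.
        prefix = "subtask-" + getRuntimeContext().getIndexOfThisSubtask() + ": ";
    }

    @Override
    public String map(String value) {
        return prefix + value;
    }

    @Override
    public void close() {
        prefix = null;   // release per-task resources
    }
}
{% endhighlight %}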
{% top %}
......@@ -1005,7 +1002,7 @@ Program Packaging and Distributed Execution
As described earlier, Flink programs can be executed on
clusters by using a `remote environment`. Alternatively, programs can be packaged into JAR Files
(Java Archives) for execution. Packaging the program is a prerequisite to executing them through the
[command line interface]({{ site.baseurl }}/apis/cli.html).
[command line interface]({{ site.baseurl }}/setup/cli.html).
#### Packaging Programs
......@@ -1337,7 +1334,7 @@ To visualize the execution plan, do the following:
After these steps, a detailed execution plan will be visualized.
<img alt="A flink job execution graph." src="fig/plan_visualizer.png" width="80%">
<img alt="A flink job execution graph." src="{{ site.baseurl }}/fig/plan_visualizer.png" width="80%">
__Web Interface__
......
---
title: "Programming Guides"
title: "APIs"
nav-id: apis
nav-parent_id: dev
nav-pos: 2
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......
---
title: "Connectors"
# Sub-level navigation
sub-nav-group: batch
sub-nav-pos: 4
nav-parent_id: batch
nav-pos: 4
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -34,7 +32,7 @@ Flink has built-in support for the following file systems:
| Filesystem | Scheme | Notes |
| ------------------------------------- |--------------| ------ |
| Hadoop Distributed File System (HDFS) &nbsp; | `hdfs://` | All HDFS versions are supported |
| Amazon S3 | `s3://` | Support through Hadoop file system implementation (see below) |
| MapR file system | `maprfs://` | The user has to manually place the required jar files in the `lib/` dir |
| Alluxio | `alluxio://` &nbsp; | Support through Hadoop file system implementation (see below) |
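A minimal sketch of how the URI scheme selects the file system (host, port, and paths are placeholders):

{% highlight java %}
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class FileSystemSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The scheme of the path decides which file system implementation is used.
        DataSet<String> hdfsLines  = env.readTextFile("hdfs://namenode:9000/path/to/input");
        DataSet<String> localLines = env.readTextFile("file:///tmp/input.txt");

        hdfsLines.union(localLines).writeAsText("hdfs://namenode:9000/path/to/output");
        env.execute("File system access");
    }
}
{% endhighlight %}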
......@@ -104,7 +102,7 @@ One implementation of these `InputFormat`s is the `HadoopInputFormat`. This is a
users to use all existing Hadoop input formats with Flink.
This section shows some examples for connecting Flink to other systems.
[Read more about Hadoop compatibility in Flink]({{ site.baseurl }}/apis/batch/hadoop_compatibility.html).
[Read more about Hadoop compatibility in Flink]({{ site.baseurl }}/dev/batch/hadoop_compatibility.html).
## Avro support in Flink
......@@ -197,17 +195,17 @@ public class AzureTableExample {
public static void main(String[] args) throws Exception {
// set up the execution environment
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// create a AzureTableInputFormat, using a Hadoop input format wrapper
HadoopInputFormat<Text, WritableEntity> hdIf = new HadoopInputFormat<Text, WritableEntity>(new AzureTableInputFormat(), Text.class, WritableEntity.class, new Job());
// set the Account URI, something like: https://apacheflink.table.core.windows.net
hdIf.getConfiguration().set(AzureTableConfiguration.Keys.ACCOUNT_URI.getKey(), "TODO");
// set the secret storage key here
hdIf.getConfiguration().set(AzureTableConfiguration.Keys.STORAGE_KEY.getKey(), "TODO");
// set the table name here
hdIf.getConfiguration().set(AzureTableConfiguration.Keys.TABLE_NAME.getKey(), "TODO");
DataSet<Tuple2<Text, WritableEntity>> input = env.createInput(hdIf);
// a little example how to use the data in a mapper.
DataSet<String> fin = input.map(new MapFunction<Tuple2<Text,WritableEntity>, String>() {
......@@ -238,5 +236,3 @@ The example shows how to access an Azure table and turn data into Flink's `DataS
## Access MongoDB
This [GitHub repository documents how to use MongoDB with Apache Flink (starting from 0.7-incubating)](https://github.com/okkam-it/flink-mongodb-test).
---
title: "DataSet Transformations"
# Sub-level navigation
sub-nav-group: batch
sub-nav-parent: dataset_api
sub-nav-pos: 1
sub-nav-title: Transformations
nav-title: Transformations
nav-parent_id: batch
nav-pos: 1
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......
---
title: "Bundled Examples"
# Sub-level navigation
sub-nav-group: batch
sub-nav-pos: 5
sub-nav-title: Examples
nav-title: Examples
nav-parent_id: batch
nav-pos: 5
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......
---
title: "Fault Tolerance"
# Sub-level navigation
sub-nav-group: batch
sub-nav-pos: 2
nav-parent_id: batch
nav-pos: 2
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......
---
title: "Hadoop Compatibility"
is_beta: true
# Sub-level navigation
sub-nav-group: batch
sub-nav-pos: 7
nav-parent_id: batch
nav-pos: 7
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -36,7 +35,7 @@ You can:
- use a Hadoop `Reducer` as [GroupReduceFunction](dataset_transformations.html#groupreduce-on-grouped-dataset).
This document shows how to use existing Hadoop MapReduce code with Flink. Please refer to the
[Connecting to other systems]({{ site.baseurl }}/apis/connectors.html) guide for reading from Hadoop supported file systems.
[Connecting to other systems]({{ site.baseurl }}/dev/batch/connectors.html) guide for reading from Hadoop supported file systems.
* This will be replaced by the TOC
{:toc}
......
---
title: "Flink DataSet API Programming Guide"
# Top-level navigation
top-nav-group: apis
top-nav-pos: 3
top-nav-title: <strong>Batch Guide</strong> (DataSet API)
# Sub-level navigation
sub-nav-group: batch
sub-nav-group-title: Batch Guide
sub-nav-id: dataset_api
sub-nav-pos: 1
sub-nav-title: DataSet API
nav-id: batch
nav-title: Batch (DataSet API)
nav-parent_id: apis
nav-pos: 3
nav-show_overview: true
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -39,11 +32,11 @@ example write the data to (distributed) files, or to standard output (for exampl
terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs.
The execution can happen in a local JVM, or on clusters of many machines.
Please see [basic concepts]({{ site.baseurl }}/apis/common/index.html) for an introduction
Please see [basic concepts]({{ site.baseurl }}/dev/api_concepts.html) for an introduction
to the basic concepts of the Flink API.
In order to create your own Flink DataSet program, we encourage you to start with the
[anatomy of a Flink Program]({{ site.baseurl }}/apis/common/index.html#anatomy-of-a-flink-program)
[anatomy of a Flink Program]({{ site.baseurl }}/dev/api_concepts.html#anatomy-of-a-flink-program)
and gradually add your own
[transformations](#dataset-transformations). The remaining sections act as references for additional
operations and advanced features.
......@@ -56,7 +49,7 @@ Example Program
The following program is a complete, working example of WordCount. You can copy &amp; paste the code
to run it locally. You only have to include the correct Flink library in your project
(see Section [Linking with Flink]({{ site.baseurl }}/apis/common/index.html#linking-with-flink)) and specify the imports. Then you are ready
(see Section [Linking with Flink]({{ site.baseurl }}/dev/api_concepts.html#linking-with-flink)) and specify the imports. Then you are ready
to go!
<div class="codetabs" markdown="1">
......@@ -787,7 +780,7 @@ is not supported by the API out-of-the-box. To use this feature, you should use
</div>
</div>
The [parallelism]({{ site.baseurl }}/apis/common/index.html#parallel-execution) of a transformation can be defined by `setParallelism(int)` while
The [parallelism]({{ site.baseurl }}/dev/api_concepts.html#parallel-execution) of a transformation can be defined by `setParallelism(int)` while
`name(String)` assigns a custom name to a transformation which is helpful for debugging. The same is
possible for [Data Sources](#data-sources) and [Data Sinks](#data-sinks).
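For example, a minimal sketch (operator name and parallelism are arbitrary):

{% highlight java %}
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class NamingSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> text = env.fromElements("to", "be", "or", "not");

        text.map(new MapFunction<String, String>() {
                @Override
                public String map(String value) {
                    return value.toUpperCase();
                }
            })
            .name("to-upper")      // custom name shown in logs and the web UI
            .setParallelism(4)     // four parallel subtasks for this operator
            .print();              // print() triggers execution for DataSet programs
    }
}
{% endhighlight %}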
......@@ -1208,7 +1201,7 @@ myResult.output(
#### Locally Sorted Output
The output of a data sink can be locally sorted on specified fields in specified orders using [tuple field positions]({{ site.baseurl }}/apis/common/index.html#define-keys-for-tuples) or [field expressions]({{ site.baseurl }}/apis/common/index.html#define-keys-using-field-expressions). This works for every output format.
The output of a data sink can be locally sorted on specified fields in specified orders using [tuple field positions]({{ site.baseurl }}/dev/api_concepts.html#define-keys-for-tuples) or [field expressions]({{ site.baseurl }}/dev/api_concepts.html#define-keys-using-field-expressions). This works for every output format.
The following examples show how to use this feature:
......@@ -1291,7 +1284,7 @@ values map { tuple => tuple._1 + " - " + tuple._2 }
#### Locally Sorted Output
The output of a data sink can be locally sorted on specified fields in specified orders using [tuple field positions]({{ site.baseurl }}/apis/common/index.html#define-keys-for-tuples) or [field expressions]({{ site.baseurl }}/apis/common/index.html#define-keys-using-field-expressions). This works for every output format.
The output of a data sink can be locally sorted on specified fields in specified orders using [tuple field positions]({{ site.baseurl }}/dev/api_concepts.html#define-keys-for-tuples) or [field expressions]({{ site.baseurl }}/dev/api_concepts.html#define-keys-using-field-expressions). This works for every output format.
The following examples show how to use this feature:
......@@ -1540,10 +1533,10 @@ Operating on data objects in functions
--------------------------------------
Flink's runtime exchanges data with user functions in the form of Java objects. Functions receive input objects from the runtime as method parameters and return output objects as results. Because these objects are accessed by user functions and runtime code, it is very important to understand and follow the rules about how the user code may access, i.e., read and modify, these objects.
User functions receive objects from Flink's runtime either as regular method parameters (like a `MapFunction`) or through an `Iterable` parameter (like a `GroupReduceFunction`). We refer to objects that the runtime passes to a user function as *input objects*. User functions can emit objects to the Flink runtime either as a method return value (like a `MapFunction`) or through a `Collector` (like a `FlatMapFunction`). We refer to objects which have been emitted by the user function to the runtime as *output objects*.
Flink's DataSet API features two modes that differ in how Flink's runtime creates or reuses input objects. This behavior affects the guarantees and constraints for how user functions may interact with input and output objects. The following sections define these rules and give coding guidelines to write safe user function code.
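The mode is switched on the `ExecutionConfig`; a minimal sketch (object reuse stays off unless you enable it):

{% highlight java %}
import org.apache.flink.api.java.ExecutionEnvironment;

public class ObjectReuseSketch {
    public static void main(String[] args) {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Object reuse is disabled by default; enabling it reduces object
        // allocations but imposes the stricter access rules described below.
        env.getConfig().enableObjectReuse();
    }
}
{% endhighlight %}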
### Object-Reuse Disabled (DEFAULT)
......@@ -1780,7 +1773,7 @@ This information is used by the optimizer to infer whether a data property such
partitioning is preserved by a function.
For functions that operate on groups of input elements such as `GroupReduce`, `GroupCombine`, `CoGroup`, and `MapPartition`, all fields that are defined as forwarded fields must always be jointly forwarded from the same input element. The forwarded fields of each element that is emitted by a group-wise function may originate from a different element of the function's input group.
Field forward information is specified using [field expressions]({{ site.baseurl }}/apis/common/index.html#define-keys-using-field-expressions).
Field forward information is specified using [field expressions]({{ site.baseurl }}/dev/api_concepts.html#define-keys-using-field-expressions).
Fields that are forwarded to the same position in the output can be specified by their position.
The specified position must be valid for the input and output data type and have the same type.
For example the String `"f2"` declares that the third field of a Java input tuple is always equal to the third field in the output tuple.
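A minimal sketch of that `"f2"` declaration as a class annotation (the function itself is made up):

{% highlight java %}
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields;
import org.apache.flink.api.java.tuple.Tuple3;

// Declares that the third field (f2) is copied to the output unchanged, so the
// optimizer knows that partitioning or sorting on f2 is preserved.
@ForwardedFields("f2")
public class ScaleFirstField
        implements MapFunction<Tuple3<Integer, Long, String>, Tuple3<Integer, Long, String>> {

    @Override
    public Tuple3<Integer, Long, String> map(Tuple3<Integer, Long, String> value) {
        return new Tuple3<>(value.f0 * 2, value.f1, value.f2);   // f2 forwarded in place
    }
}
{% endhighlight %}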
......@@ -1849,7 +1842,7 @@ Non-forwarded field information for group-wise operators such as `GroupReduce`,
**IMPORTANT**: The specification of non-forwarded fields information is optional. However if used,
**ALL!** non-forwarded fields must be specified, because all other fields are considered to be forwarded in place. It is safe to declare a forwarded field as non-forwarded.
Non-forwarded fields are specified as a list of [field expressions]({{ site.baseurl }}/apis/common/index.html#define-keys-using-field-expressions). The list can be either given as a single String with field expressions separated by semicolons or as multiple Strings.
Non-forwarded fields are specified as a list of [field expressions]({{ site.baseurl }}/dev/api_concepts.html#define-keys-using-field-expressions). The list can be either given as a single String with field expressions separated by semicolons or as multiple Strings.
For example both `"f1; f3"` and `"f1", "f3"` declare that the second and fourth field of a Java tuple
are not preserved in place and all other fields are preserved in place.
Non-forwarded field information can only be specified for functions which have identical input and output types.
......@@ -1900,7 +1893,7 @@ Fields which are only unmodified forwarded to the output without evaluating thei
**IMPORTANT**: The specification of read fields information is optional. However if used,
**ALL!** read fields must be specified. It is safe to declare a non-read field as read.
Read fields are specified as a list of [field expressions]({{ site.baseurl }}/apis/common/index.html#define-keys-using-field-expressions). The list can be either given as a single String with field expressions separated by semicolons or as multiple Strings.
Read fields are specified as a list of [field expressions]({{ site.baseurl }}/dev/api_concepts.html#define-keys-using-field-expressions). The list can be either given as a single String with field expressions separated by semicolons or as multiple Strings.
For example both `"f1; f3"` and `"f1", "f3"` declare that the second and fourth field of a Java tuple are read and evaluated by the function.
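For illustration, a sketch of a function that only evaluates some of its input fields, using the `@ReadFields` class annotation introduced below (field names and types are examples):

{% highlight java %}
// f0 and f2 are read and evaluated; f1 is only forwarded unmodified and therefore not declared as read
@ReadFields("f0; f2")
public static class SumFields implements MapFunction<Tuple3<Integer, String, Integer>, Tuple2<String, Integer>> {
    @Override
    public Tuple2<String, Integer> map(Tuple3<Integer, String, Integer> value) {
        return new Tuple2<>(value.f1, value.f0 + value.f2);
    }
}
{% endhighlight %}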
Read field information is specified as function class annotations using the following annotations:
......@@ -2028,7 +2021,7 @@ Distributed Cache
Flink offers a distributed cache, similar to Apache Hadoop, to make files locally accessible to parallel instances of user functions. This functionality can be used to share files that contain static external data such as dictionaries or machine-learned regression models.
The cache works as follows. A program registers a file or directory of a [local or remote filesystem such as HDFS or S3]({{ site.baseurl }}/apis/batch/connectors.html#reading-from-file-systems) under a specific name in its `ExecutionEnvironment` as a cached file. When the program is executed, Flink automatically copies the file or directory to the local filesystem of all workers. A user function can look up the file or directory under the specified name and access it from the worker's local filesystem.
The cache works as follows. A program registers a file or directory of a [local or remote filesystem such as HDFS or S3]({{ site.baseurl }}/dev/batch/connectors.html#reading-from-file-systems) under a specific name in its `ExecutionEnvironment` as a cached file. When the program is executed, Flink automatically copies the file or directory to the local filesystem of all workers. A user function can look up the file or directory under the specified name and access it from the worker's local filesystem.
The distributed cache is used as follows:
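The registration step typically looks like the following sketch (the path and the name `hdfsFile` are placeholders):

{% highlight java %}
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// register a file from HDFS under the name "hdfsFile"
env.registerCachedFile("hdfs:///path/to/your/file", "hdfsFile");
{% endhighlight %}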
......@@ -2054,16 +2047,16 @@ DataSet<Integer> result = input.map(new MyMapper());
env.execute();
{% endhighlight %}
Access the cached file or directory in a user function (here a `MapFunction`). The function must extend a [RichFunction]({{ site.baseurl }}/apis/common/index.html#rich-functions) class because it needs access to the `RuntimeContext`.
Access the cached file or directory in a user function (here a `MapFunction`). The function must extend a [RichFunction]({{ site.baseurl }}/dev/api_concepts.html#rich-functions) class because it needs access to the `RuntimeContext`.
{% highlight java %}
// extend a RichFunction to have access to the RuntimeContext
public final class MyMapper extends RichMapFunction<String, Integer> {
    @Override
    public void open(Configuration config) {
        // access cached file via RuntimeContext and DistributedCache
        File myFile = getRuntimeContext().getDistributedCache().getFile("hdfsFile");
        // read the file (or navigate the directory)
......@@ -2132,7 +2125,7 @@ Passing Parameters to Functions
Parameters can be passed to functions using either the constructor or the `withParameters(Configuration)` method. The parameters are serialized as part of the function object and shipped to all parallel task instances.
Check also the [best practices guide on how to pass command line arguments to functions]({{ site.baseurl }}/apis/best_practices.html#parsing-command-line-arguments-and-passing-them-around-in-your-flink-application).
Check also the [best practices guide on how to pass command line arguments to functions]({{ site.baseurl }}/monitoring/best_practices.html#parsing-command-line-arguments-and-passing-them-around-in-your-flink-application).
#### Via Constructor
......@@ -2175,7 +2168,7 @@ class MyFilter(limit: Int) extends FilterFunction[Int] {
#### Via `withParameters(Configuration)`
This method takes a Configuration object as an argument, which will be passed to the [rich function]({{ site.baseurl }}/apis/common/index.html#rich-functions)'s `open()`
This method takes a Configuration object as an argument, which will be passed to the [rich function]({{ site.baseurl }}/dev/api_concepts.html#rich-functions)'s `open()`
method. The Configuration object is a Map from String keys to different value types.
<div class="codetabs" markdown="1">
......
......@@ -28,11 +28,10 @@ Iterative algorithms occur in many domains of data analysis, such as *machine le
Flink programs implement iterative algorithms by defining a **step function** and embedding it into a special iteration operator. There are two variants of this operator: **Iterate** and **Delta Iterate**. Both operators repeatedly invoke the step function on the current iteration state until a certain termination condition is reached.
Here, we provide background on both operator variants and outline their usage. The [programming guide](index.html) explains how to implement the operators in both Scala and Java. We also support **vertex-centric, scatter-gather, and gather-sum-apply iterations** through Flink's graph processing API, [Gelly]({{site.baseurl}}/libs/gelly_guide.html).
Here, we provide background on both operator variants and outline their usage. The [programming guide](index.html) explains how to implement the operators in both Scala and Java. We also support both **vertex-centric and gather-sum-apply iterations** through Flink's graph processing API, [Gelly]({{site.baseurl}}/dev/libs/gelly/index.html).
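As a minimal sketch of the **Iterate** variant in the Java DataSet API (the step function here simply increments every element for a fixed number of iterations):

{% highlight java %}
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// start a bulk iteration with 10 iterations around an initial data set
IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10);

// the step function: increment every element once per iteration
DataSet<Integer> iterationBody = initial.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer value) {
        return value + 1;
    }
});

// close the iteration and obtain the result
DataSet<Integer> result = initial.closeWith(iterationBody);
{% endhighlight %}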
The following table provides an overview of both operators:
<table class="table table-striped table-hover table-bordered">
<thead>
<th></th>
......
---
title: "Python Programming Guide"
is_beta: true
# Sub-level navigation
sub-nav-group: batch
sub-nav-id: python_api
sub-nav-pos: 4
sub-nav-title: Python API
nav-title: Python API
nav-parent_id: batch
nav-pos: 4
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -76,7 +73,7 @@ env.execute(local=True)
Program Skeleton
----------------
As we already saw in the example, Flink programs look like regular Python programs.
Each program consists of the same basic parts:
1. Obtain an `Environment`,
......@@ -484,7 +481,7 @@ File-based:
Collection-based:
- `from_elements(*args)` - Creates a data set from a Seq. All elements
- `generate_sequence(from, to)` - Generates the sequence of numbers in the given interval, in parallel.
**Examples**
......@@ -569,7 +566,7 @@ toBroadcast = env.from_elements(1, 2, 3)
data = env.from_elements("a", "b")
# 2. Broadcast the DataSet
data.map(MapperBcv()).with_broadcast_set("bcv", toBroadcast)
{% endhighlight %}
Make sure that the names (`bcv` in the previous example) match when registering and
......
---
title: "Zipping Elements in a DataSet"
# Sub-level navigation
sub-nav-group: batch
sub-nav-parent: dataset_api
sub-nav-pos: 2
sub-nav-title: Zipping Elements
nav-title: Zipping Elements
nav-parent_id: batch
nav-pos: 2
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......
---
title: "Cluster Execution"
# Top-level navigation
top-nav-group: apis
top-nav-pos: 8
nav-parent_id: dev
nav-pos: 12
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -34,7 +33,7 @@ are two ways to send a program to a cluster for execution:
The command line interface lets you submit packaged programs (JARs) to a cluster
(or single machine setup).
Please refer to the [Command Line Interface](cli.html) documentation for
Please refer to the [Command Line Interface]({{ site.baseurl }}/setup/cli.html) documentation for
details.
## Remote Environment
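A minimal sketch of what such a setup typically looks like (host, port, and jar path are placeholders):

{% highlight java %}
ExecutionEnvironment env = ExecutionEnvironment
    .createRemoteEnvironment("flink-master", 6123, "/path/to/your-program.jar");

DataSet<String> data = env.readTextFile("hdfs:///path/to/input");
data.writeAsText("hdfs:///path/to/output");

env.execute("Remote job");
{% endhighlight %}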
......@@ -102,7 +101,7 @@ The latter version is recommended as it respects the classloader management in F
To provide these dependencies not included by Flink we suggest two options with Maven.
1. The maven assembly plugin builds a so-called uber-jar (executable jar) containing all your dependencies.
The assembly configuration is straight-forward, but the resulting jar might become bulky.
See [maven-assembly-plugin](http://maven.apache.org/plugins/maven-assembly-plugin/usage.html) for further information.
2. The maven unpack plugin unpacks the relevant parts of the dependencies and
then packages it with your code.
......
---
title: "Apache Cassandra Connector"
# Sub-level navigation
sub-nav-group: streaming
sub-nav-parent: connectors
sub-nav-pos: 1
sub-nav-title: Cassandra
nav-title: Cassandra
nav-parent_id: connectors
nav-pos: 2
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -38,7 +35,7 @@ To use this connector, add the following dependency to your project:
</dependency>
{% endhighlight %}
Note that the streaming connectors are currently not part of the binary distribution. See how to link with them for cluster execution [here]({{ site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
Note that the streaming connectors are currently not part of the binary distribution. See how to link with them for cluster execution [here]({{ site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
#### Installing Apache Cassandra
Follow the instructions from the [Cassandra Getting Started page](http://wiki.apache.org/cassandra/GettingStarted).
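For orientation, a sketch of wiring a stream of tuples into Cassandra (the keyspace, table, and contact point are placeholders):

{% highlight java %}
DataStream<Tuple2<String, Long>> input = ...;

CassandraSink.addSink(input)
    .setQuery("INSERT INTO example.counts (word, count) VALUES (?, ?);")
    .setClusterBuilder(new ClusterBuilder() {
        @Override
        protected Cluster buildCluster(Cluster.Builder builder) {
            return builder.addContactPoint("127.0.0.1").build();
        }
    })
    .build();
{% endhighlight %}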
......@@ -76,7 +73,7 @@ checkpoint will be replayed completely.
Furthermore, for non-deterministic programs the write-ahead log has to be enabled. For such a program
the replayed checkpoint may be completely different from the previous attempt, which may leave the
database in an inconsistent state since part of the first attempt may already be written.
The write-ahead log guarantees that the replayed checkpoint is identical to the first attempt.
Note that enabling this feature will have an adverse impact on latency.
<p style="border-radius: 5px; padding: 5px" class="bg-danger"><b>Note</b>: The write-ahead log functionality is currently experimental. In many cases it is sufficient to use the connector without enabling it. Please report problems to the development mailing list.</p>
......
---
title: "Elasticsearch Connector"
# Sub-level navigation
sub-nav-group: streaming
sub-nav-parent: connectors
sub-nav-pos: 2
sub-nav-title: Elasticsearch
nav-title: Elasticsearch
nav-parent_id: connectors
nav-pos: 4
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -40,7 +37,7 @@ following dependency to your project:
Note that the streaming connectors are currently not part of the binary
distribution. See
[here]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
[here]({{site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
for information about how to package the program with the libraries for
cluster execution.
......
---
title: "Elasticsearch 2.x Connector"
# Sub-level navigation
sub-nav-group: streaming
sub-nav-parent: connectors
sub-nav-pos: 2
sub-nav-title: Elasticsearch 2.x
nav-title: Elasticsearch 2.x
nav-parent_id: connectors
nav-pos: 5
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -40,7 +37,7 @@ following dependency to your project:
Note that the streaming connectors are currently not part of the binary
distribution. See
[here]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
[here]({{site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
for information about how to package the program with the libraries for
cluster execution.
......
---
title: "HDFS Connector"
# Sub-level navigation
sub-nav-group: streaming
sub-nav-parent: connectors
sub-nav-pos: 3
sub-nav-title: Filesystem Sink
nav-title: Rolling File Sink
nav-parent_id: connectors
nav-pos: 6
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -40,7 +37,7 @@ following dependency to your project:
Note that the streaming connectors are currently not part of the binary
distribution. See
[here]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
[here]({{site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
for information about how to package the program with the libraries for
cluster execution.
......
---
title: "Streaming Connectors"
# Sub-level navigation
sub-nav-group: streaming
sub-nav-id: connectors
sub-nav-pos: 6
sub-nav-title: Connectors
nav-id: connectors
nav-title: Connectors
nav-parent_id: dev
nav-pos: 7
nav-show_overview: true
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......
---
title: "Apache Kafka Connector"
# Sub-level navigation
sub-nav-group: streaming
sub-nav-parent: connectors
sub-nav-pos: 1
sub-nav-title: Kafka
nav-title: Kafka
nav-parent_id: connectors
nav-pos: 1
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -85,7 +82,7 @@ Then, import the connector in your maven project:
</dependency>
{% endhighlight %}
Note that the streaming connectors are currently not part of the binary distribution. See how to link with them for cluster execution [here]({{ site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
Note that the streaming connectors are currently not part of the binary distribution. See how to link with them for cluster execution [here]({{ site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
### Installing Apache Kafka
......@@ -144,7 +141,7 @@ If you experience any issues with the Kafka consumer on the client side, the cli
#### The `DeserializationSchema`
The Flink Kafka Consumer needs to know how to turn the binary data in Kafka into Java/Scala objects. The
`DeserializationSchema` allows users to specify such a schema. The `T deserialize(byte[] message)`
method gets called for each Kafka message, passing the value from Kafka.
......@@ -157,13 +154,13 @@ the following deserialize method ` T deserialize(byte[] messageKey, byte[] messa
For convenience, Flink provides the following schemas:
1. `TypeInformationSerializationSchema` (and `TypeInformationKeyValueSerializationSchema`) which creates
a schema based on Flink's `TypeInformation`. This is useful if the data is both written and read by Flink.
This schema is a performant Flink-specific alternative to other generic serialization approaches.
2. `JsonDeserializationSchema` (and `JSONKeyValueDeserializationSchema`) which turns the serialized JSON
into an ObjectNode object, from which fields can be accessed using objectNode.get("field").as(Int/String/...)().
The KeyValue objectNode contains a "key" and "value" field which contain all fields, as well as
an optional "metadata" field that exposes the offset/partition/topic for this message.
#### Kafka Consumers and Fault Tolerance
......@@ -200,14 +197,14 @@ If checkpointing is not enabled, the Kafka consumer will periodically commit the
#### Kafka Consumers and Timestamp Extraction/Watermark Emission
In many scenarios, the timestamp of a record is embedded (explicitly or implicitly) in the record itself.
In addition, the user may want to emit watermarks either periodically, or in an irregular fashion, e.g. based on
special records in the Kafka stream that contain the current event-time watermark. For these cases, the Flink Kafka
Consumer allows the specification of an `AssignerWithPeriodicWatermarks` or an `AssignerWithPunctuatedWatermarks`.
You can specify your custom timestamp extractor/watermark emitter as described
[here]({{ site.baseurl }}/apis/streaming/event_timestamps_watermarks.html), or use one from the
[predefined ones]({{ site.baseurl }}/apis/streaming/event_timestamp_extractors.html). After doing so, you
can pass it to your consumer in the following way:
<div class="codetabs" markdown="1">
......@@ -219,7 +216,7 @@ properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
FlinkKafkaConsumer08<String> myConsumer =
new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
myConsumer.assignTimestampsAndWatermarks(new CustomWatermarkEmitter());
......@@ -244,12 +241,12 @@ stream = env
{% endhighlight %}
</div>
</div>
Internally, an instance of the assigner is executed per Kafka partition.
When such an assigner is specified, for each record read from Kafka, the
`extractTimestamp(T element, long previousElementTimestamp)` is called to assign a timestamp to the record and
the `Watermark getCurrentWatermark()` (for periodic) or the
`Watermark checkAndGetNextWatermark(T lastElement, long extractedTimestamp)` (for punctuated) is called to determine
if a new watermark should be emitted and with which timestamp.
### Kafka Producer
......@@ -285,9 +282,8 @@ The interface of the serialization schema is called `KeyedSerializationSchema`.
**Note**: By default, the number of retries is set to "0". This means that the producer fails immediately on errors,
including leader changes. The value is set to "0" by default to avoid duplicate messages in the target topic.
For most production environments with frequent broker changes, we recommend setting the number of retries to a
higher value.
There is currently no transactional producer for Kafka, so Flink can not guarantee exactly-once delivery
into a Kafka topic.
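For orientation, a sketch of attaching a Kafka producer as a sink (the broker address and topic name are placeholders):

{% highlight java %}
DataStream<String> stream = ...;

stream.addSink(new FlinkKafkaProducer08<String>(
    "localhost:9092",            // comma-separated broker list
    "my-topic",                  // target topic
    new SimpleStringSchema()));  // serialization schema
{% endhighlight %}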
---
title: "Amazon AWS Kinesis Streams Connector"
# Sub-level navigation
sub-nav-group: streaming
sub-nav-parent: connectors
sub-nav-pos: 5
sub-nav-title: Amazon Kinesis Streams
nav-title: Kinesis
nav-parent_id: connectors
nav-pos: 3
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -26,7 +23,7 @@ specific language governing permissions and limitations
under the License.
-->
The Kinesis connector provides access to [Amazon AWS Kinesis Streams](http://aws.amazon.com/kinesis/streams/).
To use the connector, add the following Maven dependency to your project:
......@@ -47,14 +44,14 @@ Flink releases because of the licensing issue. Therefore, you need to build the
Download the Flink source or check it out from the git repository. Then, use the following Maven command to build the module:
{% highlight bash %}
mvn clean install -Pinclude-kinesis -DskipTests
# In Maven 3.3 the shading of flink-dist doesn't work properly in one run, so we need to run mvn for flink-dist again.
cd flink-dist
mvn clean install -Pinclude-kinesis -DskipTests
{% endhighlight %}
The streaming connectors are not part of the binary distribution. See how to link with them for cluster
execution [here]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
The streaming connectors are not part of the binary distribution. See how to link with them for cluster
execution [here]({{site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
### Using the Amazon Kinesis Streams Service
Follow the instructions from the [Amazon Kinesis Streams Developer Guide](https://docs.aws.amazon.com/streams/latest/dev/learning-kinesis-module-one-create-stream.html)
......@@ -234,8 +231,8 @@ consumer when calling this API can also be modified by using the other keys pref
### Kinesis Producer
The `FlinkKinesisProducer` is used for putting data from a Flink stream into a Kinesis stream. Note that the producer is not participating in
Flink's checkpointing and doesn't provide exactly-once processing guarantees.
Also, the Kinesis producer does not guarantee that records are written in order to the shards (See [here](https://github.com/awslabs/amazon-kinesis-producer/issues/23) and [here](http://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecord.html#API_PutRecord_RequestSyntax) for more details).
In case of a failure or a resharding, data will be written again to Kinesis, leading to duplicates. This behavior is usually called "at-least-once" semantics.
......@@ -281,13 +278,13 @@ simpleStringStream.addSink(kinesis);
The above is a simple example of using the producer. Configuration for the producer with the mandatory configuration values is supplied with a `java.util.Properties`
instance as described above for the consumer. The example demonstrates producing a single Kinesis stream in the AWS region "us-east-1".
Instead of a `SerializationSchema`, it also supports a `KinesisSerializationSchema`. The `KinesisSerializationSchema` allows sending the data to multiple streams. This is
done using the `KinesisSerializationSchema.getTargetStream(T element)` method. Returning `null` there will instruct the producer to write the element to the default stream.
Otherwise, the returned stream name is used.
Other optional configuration keys for the producer can be found in `ProducerConfigConstants`.
### Using Non-AWS Kinesis Endpoints for Testing
It is sometimes desirable to have Flink operate as a consumer or producer against a non-AWS Kinesis endpoint such as
......
---
title: "Apache NiFi Connector"
# Sub-level navigation
sub-nav-group: streaming
sub-nav-parent: connectors
sub-nav-pos: 7
sub-nav-title: Apache NiFi
nav-title: NiFi
nav-parent_id: connectors
nav-pos: 8
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -26,7 +23,7 @@ specific language governing permissions and limitations
under the License.
-->
This connector provides a Source and Sink that can read from and write to
[Apache NiFi](https://nifi.apache.org/). To use this connector, add the
following dependency to your project:
......@@ -40,7 +37,7 @@ following dependency to your project:
Note that the streaming connectors are currently not part of the binary
distribution. See
[here]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
[here]({{site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
for information about how to package the program with the libraries for
cluster execution.
......@@ -57,10 +54,10 @@ The class `NiFiSource(…)` provides 2 constructors for reading data from NiFi.
- `NiFiSource(SiteToSiteConfig config)` - Constructs a `NiFiSource(…)` given the client's SiteToSiteConfig and a
default wait time of 1000 ms.
- `NiFiSource(SiteToSiteConfig config, long waitTimeMs)` - Constructs a `NiFiSource(…)` given the client's
SiteToSiteConfig and the specified wait time (in milliseconds).
Example:
<div class="codetabs" markdown="1">
......@@ -86,15 +83,15 @@ val clientConfig: SiteToSiteClientConfig = new SiteToSiteClient.Builder()
.portName("Data for Flink")
.requestBatchCount(5)
.buildConfig()
val nifiSource = new NiFiSource(clientConfig)
{% endhighlight %}
</div>
</div>
Here data is read from the Apache NiFi Output Port called "Data for Flink" which is part of Apache NiFi
Site-to-site protocol configuration.
#### Apache NiFi Sink
The connector provides a Sink for writing data from Apache Flink to Apache NiFi.
......@@ -102,9 +99,9 @@ The connector provides a Sink for writing data from Apache Flink to Apache NiFi.
The class `NiFiSink(…)` provides a constructor for instantiating a `NiFiSink`.
- `NiFiSink(SiteToSiteClientConfig, NiFiDataPacketBuilder<T>)` constructs a `NiFiSink(…)` given the client's `SiteToSiteConfig` and a `NiFiDataPacketBuilder` that converts data from Flink to `NiFiDataPacket` to be ingested by NiFi.
Example:
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
......@@ -130,7 +127,7 @@ val clientConfig: SiteToSiteClientConfig = new SiteToSiteClient.Builder()
.portName("Data from Flink")
.requestBatchCount(5)
.buildConfig()
val nifiSink: NiFiSink[NiFiDataPacket] = new NiFiSink[NiFiDataPacket](clientConfig, new NiFiDataPacketBuilder<T>() {...})
streamExecEnv.addSink(nifiSink)
......
---
title: "RabbitMQ Connector"
# Sub-level navigation
sub-nav-group: streaming
sub-nav-parent: connectors
sub-nav-pos: 4
sub-nav-title: RabbitMQ
nav-title: RabbitMQ
nav-parent_id: connectors
nav-pos: 7
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -36,7 +33,7 @@ This connector provides access to data streams from [RabbitMQ](http://www.rabbit
</dependency>
{% endhighlight %}
Note that the streaming connectors are currently not part of the binary distribution. See linking with them for cluster execution [here]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
Note that the streaming connectors are currently not part of the binary distribution. See linking with them for cluster execution [here]({{site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
#### Installing RabbitMQ
Follow the instructions from the [RabbitMQ download page](http://www.rabbitmq.com/download.html). After the installation the server automatically starts, and the application connecting to RabbitMQ can be launched.
......
---
title: "Redis Connector"
# Sub-level navigation
sub-nav-group: streaming
sub-nav-parent: connectors
sub-nav-pos: 6
sub-nav-title: Redis
nav-title: Redis
nav-parent_id: connectors
nav-pos: 8
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -38,13 +35,13 @@ following dependency to your project:
{% endhighlight %}
Version Compatibility: This module is compatible with Redis 2.8.5.
Note that the streaming connectors are currently not part of the binary distribution. You need to link them for cluster execution [explicitly]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
Note that the streaming connectors are currently not part of the binary distribution. You need to link them for cluster execution [explicitly]({{site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
#### Installing Redis
Follow the instructions from the [Redis download page](http://redis.io/download).
#### Redis Sink
A class providing an interface for sending data to Redis.
The sink can use three different methods for communicating with different types of Redis environments:
1. Single Redis Server
2. Redis Cluster
......@@ -153,7 +150,7 @@ This section gives a description of all the available data types and what Redis
</tr>
<tr>
<td>LIST</td><td>
<a href="http://redis.io/commands/rpush">RPUSH</a>,
<a href="http://redis.io/commands/lpush">LPUSH</a>
</td><td>--NA--</td>
</tr>
......
---
title: "Twitter Connector"
# Sub-level navigation
sub-nav-group: streaming
sub-nav-parent: connectors
sub-nav-pos: 5
sub-nav-title: Twitter
nav-title: Twitter
nav-parent_id: connectors
nav-pos: 9
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -26,8 +23,8 @@ specific language governing permissions and limitations
under the License.
-->
The Twitter Streaming API provides access to the stream of tweets made available by Twitter.
Flink Streaming comes with a built-in `TwitterSource` class for establishing a connection to this stream.
To use this connector, add the following dependency to your project:
{% highlight xml %}
......@@ -38,21 +35,21 @@ To use this connector, add the following dependency to your project:
</dependency>
{% endhighlight %}
Note that the streaming connectors are currently not part of the binary distribution.
See linking with them for cluster execution [here]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
Note that the streaming connectors are currently not part of the binary distribution.
See linking with them for cluster execution [here]({{site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
#### Authentication
In order to connect to the Twitter stream the user has to register their program and acquire the necessary information for the authentication. The process is described below.
#### Acquiring the authentication information
First of all, a Twitter account is needed. Sign up for free at [twitter.com/signup](https://twitter.com/signup)
or sign in at Twitter's [Application Management](https://apps.twitter.com/) and register the application by
clicking on the "Create New App" button. Fill out a form about your program and accept the Terms and Conditions.
After selecting the application, the API key and API secret (called `twitter-source.consumerKey` and `twitter-source.consumerSecret` in `TwitterSource` respectively) are located on the "API Keys" tab.
The necessary OAuth Access Token data (`twitter-source.token` and `twitter-source.tokenSecret` in `TwitterSource`) can be generated and acquired on the "Keys and Access Tokens" tab.
Remember to keep these pieces of information secret and do not push them to public repositories.
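A sketch of how these four values are passed to the source (the literal credential values are placeholders and `env` is the streaming execution environment):

{% highlight java %}
Properties props = new Properties();
props.setProperty(TwitterSource.CONSUMER_KEY, "<your consumer key>");
props.setProperty(TwitterSource.CONSUMER_SECRET, "<your consumer secret>");
props.setProperty(TwitterSource.TOKEN, "<your token>");
props.setProperty(TwitterSource.TOKEN_SECRET, "<your token secret>");

DataStream<String> streamSource = env.addSource(new TwitterSource(props));
{% endhighlight %}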
#### Usage
In contrast to other connectors, the `TwitterSource` depends on no additional services. For example the following code should run gracefully:
......@@ -86,4 +83,3 @@ The `TwitterExample` class in the `flink-examples-streaming` package shows a ful
By default, the `TwitterSource` uses the `StatusesSampleEndpoint`. This endpoint returns a random sample of Tweets.
There is a `TwitterSource.EndpointInitializer` interface allowing users to provide a custom endpoint.
---
title: "Flink DataStream API Programming Guide"
# Top-level navigation
top-nav-group: apis
top-nav-pos: 2
top-nav-title: <strong>Streaming Guide</strong> (DataStream API)
# Sub-level navigation
sub-nav-group: streaming
sub-nav-group-title: Streaming Guide
sub-nav-pos: 1
sub-nav-title: DataStream API
nav-title: Streaming (DataStream API)
nav-parent_id: apis
nav-pos: 2
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -38,11 +30,11 @@ example write the data to files, or to standard output (for example the command
terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs.
The execution can happen in a local JVM, or on clusters of many machines.
Please see [basic concepts]({{ site.baseurl }}/apis/common/index.html) for an introduction
Please see [basic concepts]({{ site.baseurl }}/dev/api_concepts.html) for an introduction
to the basic concepts of the Flink API.
In order to create your own Flink DataStream program, we encourage you to start with
[anatomy of a Flink Program]({{ site.baseurl }}/apis/common/index.html#anatomy-of-a-flink-program)
[anatomy of a Flink Program]({{ site.baseurl }}/dev/api_concepts.html#anatomy-of-a-flink-program)
and gradually add your own
[transformations](#datastream-transformations). The remaining sections act as references for additional
operations and advanced features.
......@@ -545,7 +537,7 @@ DataStream<Long> output = iterationBody.filter(new FilterFunction<Long>(){
<td>
<p>
Extracts timestamps from records in order to work with windows
that use event time semantics. See <a href="{{ site.baseurl }}/apis/streaming/event_time.html">Event Time</a>.
that use event time semantics. See <a href="{{ site.baseurl }}/dev/event_time.html">Event Time</a>.
{% highlight java %}
stream.assignTimestamps (new TimeStampExtractor() {...});
{% endhighlight %}
......@@ -1034,7 +1026,7 @@ dataStream.rebalance();
</p>
<div style="text-align: center">
<img src="{{ site.baseurl }}/apis/streaming/fig/rescale.svg" alt="Checkpoint barriers in data streams" />
<img src="{{ site.baseurl }}/fig/rescale.svg" alt="Checkpoint barriers in data streams" />
</div>
......@@ -1142,7 +1134,7 @@ dataStream.rebalance()
</p>
<div style="text-align: center">
<img src="{{ site.baseurl }}/apis/streaming/fig/rescale.svg" alt="Checkpoint barriers in data streams" />
<img src="{{ site.baseurl }}/fig/rescale.svg" alt="Checkpoint barriers in data streams" />
</div>
......@@ -1310,10 +1302,10 @@ Data Sources
<br />
Sources are where your program reads its input from. You can attach a source to your program by
using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented
source functions, but you can always write your own custom sources by implementing the `SourceFunction`
for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the
`RichParallelSourceFunction` for parallel sources.
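As an illustration, a minimal non-parallel source implementing `SourceFunction` (it emits a fixed range of numbers and then finishes):

{% highlight java %}
public static class CountSource implements SourceFunction<Long> {

    private volatile boolean isRunning = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long counter = 0;
        while (isRunning && counter < 1000) {
            // emit the next element
            ctx.collect(counter++);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
{% endhighlight %}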
There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
......@@ -1325,17 +1317,17 @@ File-based:
- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the specified file input format.
- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - This is the method called internally by the two previous ones. It reads files in the `path` based on the given `fileInputFormat`. Depending on the provided `watchType`, this source may periodically monitor (every `interval` ms) the path for new data (`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process once the data currently in the path and exit (`FileProcessingMode.PROCESS_ONCE`). Using the `pathFilter`, the user can further exclude files from being processed.
*IMPLEMENTATION:*
Under the hood, Flink splits the file reading process into two sub-tasks, namely *directory monitoring* and *data reading*. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, **non-parallel** (parallelism = 1) task, while reading is performed by multiple tasks running in parallel. The parallelism of the latter is equal to the job parallelism. The role of the single monitoring task is to scan the directory (periodically or only once depending on the `watchType`), find the files to be processed, divide them in *splits*, and assign these splits to the downstream readers. The readers are the ones who will read the actual data. Each split is read by only one reader, while a reader can read multiple splits, one-by-one.
*IMPORTANT NOTES:*
1. If the `watchType` is set to `FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its contents are re-processed entirely. This can break the "exactly-once" semantics, as appending data at the end of a file will lead to **all** its contents being re-processed.
2. If the `watchType` is set to `FileProcessingMode.PROCESS_ONCE`, the source scans the path **once** and exits, without waiting for the readers to finish reading the file contents. Of course the readers will continue reading until all file contents are read. Closing the source leads to no more checkpoints after that point. This may lead to slower recovery after a node failure, as the job will resume reading from the last checkpoint.
Socket-based:
- `socketTextStream` - Reads from a socket. Elements can be separated by a delimiter.
......@@ -1360,7 +1352,7 @@ Collection-based:
Custom:
- `addSource` - Attach a new source function. For example, to read from Apache Kafka you can use
`addSource(new FlinkKafkaConsumer08<>(...))`. See [connectors]({{ site.baseurl }}/apis/streaming/connectors/) for more details.
`addSource(new FlinkKafkaConsumer08<>(...))`. See [connectors]({{ site.baseurl }}/dev/connectors/index.html) for more details.
</div>
......@@ -1368,10 +1360,10 @@ Custom:
<br />
Sources are where your program reads its input from. You can attach a source to your program by
using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented
source functions, but you can always write your own custom sources by implementing the `SourceFunction`
for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the
`RichParallelSourceFunction` for parallel sources.
There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
......@@ -1386,9 +1378,9 @@ File-based:
*IMPLEMENTATION:*
Under the hood, Flink splits the file reading process into two sub-tasks, namely *directory monitoring* and *data reading*. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, **non-parallel** (parallelism = 1) task, while reading is performed by multiple tasks running in parallel. The parallelism of the latter is equal to the job parallelism. The role of the single monitoring task is to scan the directory (periodically or only once depending on the `watchType`), find the files to be processed, divide them in *splits*, and assign these splits to the downstream readers. The readers are the ones who will read the actual data. Each split is read by only one reader, while a reader can read multiple splits, one-by-one.
*IMPORTANT NOTES:*
1. If the `watchType` is set to `FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its contents are re-processed entirely. This can break the "exactly-once" semantics, as appending data at the end of a file will lead to **all** its contents being re-processed.
......@@ -1617,7 +1609,7 @@ Execution Parameters
The `StreamExecutionEnvironment` contains the `ExecutionConfig` which allows you to set job-specific configuration values for the runtime.
Please refer to [execution configuration]({{ site.baseurl }}/apis/common/index.html#execution-configuration)
Please refer to [execution configuration]({{ site.baseurl }}/dev/api_concepts.html#execution-configuration)
for an explanation of most parameters. These parameters pertain specifically to the DataStream API:
- `enableTimestamps()` / **`disableTimestamps()`**: Attach a timestamp to each event emitted from a source.
......@@ -1630,7 +1622,7 @@ for an explanation of most parameters. These parameters pertain specifically to
### Fault Tolerance
The [Fault Tolerance Documentation](fault_tolerance.html) describes the options and parameters to enable and configure Flink's checkpointing mechanism.
The [Fault Tolerance Documentation]({{ site.baseurl }}/setup/fault_tolerance.html) describes the options and parameters to enable and configure Flink's checkpointing mechanism.
### Controlling Latency
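One relevant knob is the network buffer timeout on the environment; a minimal sketch (the 100 ms value is only an example):

{% highlight java %}
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// flush network buffers at least every 100 milliseconds, trading some throughput for lower latency
env.setBufferTimeout(100);
{% endhighlight %}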
......
---
title: "Event Time"
sub-nav-id: eventtime
sub-nav-group: streaming
sub-nav-pos: 3
nav-id: event_time
nav-show_overview: true
nav-parent_id: dev
nav-pos: 4
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -77,7 +77,7 @@ Flink supports different notions of *time* in streaming programs.
Internally, *Ingestion Time* is treated much like event time, with automatic timestamp assignment and
automatic Watermark generation.
<img src="fig/times_clocks.svg" class="center" width="80%" />
<img src="{{ site.baseurl }}/fig/times_clocks.svg" class="center" width="80%" />
### Setting a Time Characteristic
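A minimal sketch of selecting event time for a job (processing time and ingestion time are selected analogously via `TimeCharacteristic`):

{% highlight java %}
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// use event time for all windows and timers of this job
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
{% endhighlight %}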
......@@ -137,7 +137,7 @@ the event timestamps, and what timely out-of-orderness the event stream exhibits
The section below describes the general mechanism behind *Timestamps* and *Watermarks*. For a guide on how
to use timestamp assignment and watermark generation in the Flink DataStream API, please refer to
[Generating Timestamps / Watermarks]({{ site.baseurl }}/apis/streaming/event_timestamps_watermarks.html)
[Generating Timestamps / Watermarks]({{ site.baseurl }}/dev/event_timestamps_watermarks.html)
# Event Time and Watermarks
......@@ -148,7 +148,7 @@ to use timestamp assignment and watermark generation in the Flink DataStream API
- The [Dataflow Model paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf)
A stream processor that supports *event time* needs a way to measure the progress of event time.
For example, a window operator that builds hourly windows needs to be notified when event time has reached the
next full hour, such that the operator can close the next window.
......@@ -167,13 +167,13 @@ Watermarks flow as part of the data stream and carry a timestamp *t*. A *Waterma
The figure below shows a stream of events with (logical) timestamps, and watermarks flowing inline. The events are in order
(with respect to their timestamp), meaning that watermarks are simply periodic markers in the stream with an in-order timestamp.
<img src="fig/stream_watermark_in_order.svg" alt="A data stream with events (in order) and watermarks" class="center" width="65%" />
<img src="{{ site.baseurl }}/fig/stream_watermark_in_order.svg" alt="A data stream with events (in order) and watermarks" class="center" width="65%" />
Watermarks are crucial for *out-of-order* streams, as shown in the figure below, where the events do not occur ordered by their timestamps.
Watermarks establish points in the stream where all events up to a certain timestamp have occurred. Once these watermarks reach an
operator, the operator can advance its internal *event time clock* to the value of the watermark.
<img src="fig/stream_watermark_out_of_order.svg" alt="A data stream with events (out of order) and watermarks" class="center" width="65%" />
<img src="{{ site.baseurl }}/fig/stream_watermark_out_of_order.svg" alt="A data stream with events (out of order) and watermarks" class="center" width="65%" />
## Watermarks in Parallel Streams
......@@ -190,7 +190,7 @@ update their event time, so does the operator.
The figure below shows an example of events and watermarks flowing through parallel streams, and operators tracking event time.
<img src="fig/parallel_streams_watermarks.svg" alt="Parallel data streams and operators with events and watermarks" class="center" width="80%" />
<img src="{{ site.baseurl }}/fig/parallel_streams_watermarks.svg" alt="Parallel data streams and operators with events and watermarks" class="center" width="80%" />
## Late Elements
......@@ -204,5 +204,3 @@ the evaluation of the event time windows by too much.
Due to that, some streaming programs will explicitly expect a number of *late* elements. Late elements are elements that
arrive after the system's event time clock (as signaled by the watermarks) has already passed the time of the late element's
timestamp.
---
title: "Pre-defined Timestamp Extractors / Watermark Emitters"
sub-nav-group: streaming
sub-nav-pos: 2
sub-nav-parent: eventtime
nav-parent_id: event_time
nav-pos: 2
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -27,20 +25,20 @@ under the License.
* toc
{:toc}
As described in [timestamps and watermark handling]({{ site.baseurl }}/apis/streaming/event_timestamps_watermarks.html),
Flink provides abstractions that allow the programmer to assign their own timestamps and emit their own watermarks. More specifically,
one can do so by implementing one of the `AssignerWithPeriodicWatermarks` and `AssignerWithPunctuatedWatermarks` interfaces, depending
on their use-case. In a nutshell, the first will emit watermarks periodically, while the second does so based on some property of
As described in [timestamps and watermark handling]({{ site.baseurl }}/dev/event_timestamps_watermarks.html),
Flink provides abstractions that allow the programmer to assign their own timestamps and emit their own watermarks. More specifically,
one can do so by implementing one of the `AssignerWithPeriodicWatermarks` and `AssignerWithPunctuatedWatermarks` interfaces, depending
on their use-case. In a nutshell, the first will emit watermarks periodically, while the second does so based on some property of
the incoming records, e.g. whenever a special element is encountered in the stream.
In order to further ease the programming effort for such tasks, Flink comes with some pre-implemented timestamp assigners.
This section provides a list of them. Apart from their out-of-the-box functionality, their implementation can serve as an example
for custom assigner implementations.
#### **Assigner with Ascending Timestamps**
The simplest special case for *periodic* watermark generation is the case where timestamps seen by a given source task
occur in ascending order. In that case, the current timestamp can always act as a watermark, because no earlier timestamps will
arrive.
Note that it is only necessary that timestamps are ascending *per parallel data source task*. For example, if
......@@ -53,7 +51,7 @@ watermarks whenever parallel streams are shuffled, unioned, connected, or merged
{% highlight java %}
DataStream<MyEvent> stream = ...
DataStream<MyEvent> withTimestampsAndWatermarks =
stream.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<MyEvent>() {
@Override
......@@ -75,12 +73,12 @@ val withTimestampsAndWatermarks = stream.assignAscendingTimestamps( _.getCreatio
#### **Assigner which allows a fixed amount of record lateness**
Another example of periodic watermark generation is when the watermark lags behind the maximum (event-time) timestamp
seen in the stream by a fixed amount of time. This case covers scenarios where the maximum lateness that can be encountered in a
stream is known in advance, e.g. when creating a custom source containing elements with timestamps spread within a fixed period of
time for testing. For these cases, Flink provides the `BoundedOutOfOrdernessTimestampExtractor` which takes as an argument
the `maxOutOfOrderness`, i.e. the maximum amount of time an element is allowed to be late before being ignored when computing the
final result for the given window. Lateness corresponds to the result of `t - t_w`, where `t` is the (event-time) timestamp of an
element, and `t_w` that of the previous watermark. If `lateness > 0` then the element is considered late and is ignored when computing
the result of the job for its corresponding window.
<div class="codetabs" markdown="1">
......@@ -88,7 +86,7 @@ the result of the job for its corresponding window.
{% highlight java %}
DataStream<MyEvent> stream = ...
DataStream<MyEvent> withTimestampsAndWatermarks =
stream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)) {
@Override
......
---
title: "Generating Timestamps / Watermarks"
sub-nav-group: streaming
sub-nav-pos: 1
sub-nav-parent: eventtime
nav-parent_id: event_time
nav-pos: 1
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -29,7 +27,7 @@ under the License.
This section is relevant for programs running on **Event Time**. For an introduction to *Event Time*,
*Processing Time*, and *Ingestion Time*, please refer to the [event time introduction]({{ site.baseurl }}/apis/streaming/event_time.html)
*Processing Time*, and *Ingestion Time*, please refer to the [event time introduction]({{ site.baseurl }}/dev/event_time.html)
To work with *Event Time*, streaming programs need to set the *time characteristic* accordingly.
......@@ -123,13 +121,13 @@ In any case, the timestamp assigner needs to be specified before the first opera
(such as the first window operation). As a special case, when using Kafka as the source of a streaming job,
Flink allows the specification of a timestamp assigner / watermark emitter inside
the source (or consumer) itself. More information on how to do so can be found in the
[Kafka Connector documentation]({{ site.baseurl }}/apis/streaming/connectors/kafka.html).
[Kafka Connector documentation]({{ site.baseurl }}/dev/connectors/kafka.html).
**NOTE:** The remainder of this section presents the main interfaces a programmer has
to implement in order to create her own timestamp extractors/watermark emitters.
To see the pre-implemented extractors that ship with Flink, please refer to the
[Pre-defined Timestamp Extractors / Watermark Emitters]({{ site.baseurl }}/apis/streaming/event_timestamp_extractors.html) page.
[Pre-defined Timestamp Extractors / Watermark Emitters]({{ site.baseurl }}/dev/event_timestamp_extractors.html) page.
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
......@@ -329,4 +327,3 @@ class PunctuatedAssigner extends AssignerWithPunctuatedWatermarks[MyEvent] {
*Note:* It is possible to generate a watermark on every single event. However, because each watermark causes some
computation downstream, an excessive number of watermarks slows down performance.
---
title: "Libraries"
sub-nav-group: batch
sub-nav-id: libs
sub-nav-pos: 6
sub-nav-title: Libraries
title: "Application Development"
nav-id: dev
nav-title: '<i class="fa fa-code" aria-hidden="true"></i> Application Development'
nav-parent_id: root
nav-pos: 3
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -23,7 +23,3 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
- Graph processing: [Gelly](gelly/index.html)
- Machine Learning: [FlinkML](ml/index.html)
- Relational Queries: [Table and SQL](table.html)
---
title: "Java 8 Programming Guide"
# Top-level navigation
top-nav-group: apis
top-nav-pos: 12
top-nav-title: Java 8
title: "Java 8"
nav-parent_id: apis
nav-pos: 105
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -30,7 +28,7 @@ passing functions in a straightforward way without having to declare additional
The newest version of Flink supports the usage of Lambda Expressions for all operators of the Java API.
This document shows how to use Lambda Expressions and describes current limitations. For a general introduction to the
Flink API, please refer to the [Programming Guide](programming_guide.html)
Flink API, please refer to the [Programming Guide]({{ site.baseurl }}/dev/api_concepts.html)
* TOC
{:toc}
......
---
title: "Streaming Libraries"
sub-nav-group: streaming
sub-nav-id: libs
sub-nav-pos: 7
sub-nav-title: Libraries
title: "Libraries"
nav-id: libs
nav-parent_id: dev
nav-pos: 8
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -23,5 +22,3 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
- Complex event processing: [CEP](cep.html)
---
title: "FlinkCEP - Complex event processing for Flink"
# Top navigation
top-nav-group: libs
top-nav-pos: 2
top-nav-title: CEP
# Sub navigation
sub-nav-group: streaming
sub-nav-id: cep
sub-nav-pos: 1
sub-nav-parent: libs
sub-nav-title: Event Processing (CEP)
nav-title: Event Processing (CEP)
nav-parent_id: libs
nav-pos: 1
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -44,7 +37,7 @@ because these are used for comparing and matching events.
## Getting Started
If you want to jump right in, you have to [set up a Flink program]({{ site.baseurl }}/apis/common/index.html#linking-with-flink).
If you want to jump right in, you have to [set up a Flink program]({{ site.baseurl }}/dev/api_concepts.html#linking-with-flink).
Next, you have to add the FlinkCEP dependency to the `pom.xml` of your project.
<div class="codetabs" markdown="1">
......@@ -70,7 +63,7 @@ Next, you have to add the FlinkCEP dependency to the `pom.xml` of your project.
</div>
Note that FlinkCEP is currently not part of the binary distribution.
See linking with it for cluster execution [here]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
See linking with it for cluster execution [here]({{site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
Now you can start writing your first CEP program using the pattern API.
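As a first taste of the pattern API, the following sketch builds a simple two-event pattern. The `MonitoringEvent` type, field names, and threshold are hypothetical; the resulting pattern would then be applied to a `DataStream` via `CEP.pattern(...)`.

{% highlight scala %}
import org.apache.flink.api.common.functions.FilterFunction
import org.apache.flink.cep.pattern.Pattern
import org.apache.flink.streaming.api.windowing.time.Time

// hypothetical event type used only for illustration
case class MonitoringEvent(rackId: Int, temperature: Double)

// two consecutive high-temperature readings within ten seconds
val warningPattern: Pattern[MonitoringEvent, MonitoringEvent] = Pattern
  .begin[MonitoringEvent]("first")
  .where(new FilterFunction[MonitoringEvent] {
    override def filter(e: MonitoringEvent): Boolean = e.temperature > 100.0
  })
  .next("second")
  .where(new FilterFunction[MonitoringEvent] {
    override def filter(e: MonitoringEvent): Boolean = e.temperature > 100.0
  })
  .within(Time.seconds(10))
{% endhighlight %}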
......
---
title: Graph Algorithms
# Sub navigation
sub-nav-group: batch
sub-nav-parent: gelly
sub-nav-title: Graph Algorithms
nav-parent_id: graphs
nav-pos: 4
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......
---
title: Graph API
# Sub navigation
sub-nav-group: batch
sub-nav-parent: gelly
sub-nav-title: Graph API
nav-parent_id: graphs
nav-pos: 1
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -469,7 +466,7 @@ graph.subgraph((vertex => vertex.getValue > 0), (edge => edge.getValue < 0))
</div>
<p class="text-center">
<img alt="Filter Transformations" width="80%" src="fig/gelly-filter.png"/>
<img alt="Filter Transformations" width="80%" src="{{ site.baseurl }}/fig/gelly-filter.png"/>
</p>
* <strong>Join</strong>: Gelly provides specialized methods for joining the vertex and edge datasets with other input datasets. `joinWithVertices` joins the vertices with a `Tuple2` input data set. The join is performed using the vertex ID and the first field of the `Tuple2` input as the join keys. The method returns a new `Graph` where the vertex values have been updated according to a provided user-defined transformation function.
......@@ -512,7 +509,7 @@ val networkWithWeights = network.joinWithEdgesOnSource(vertexOutDegrees, (v1: Do
* <strong>Union</strong>: Gelly's `union()` method performs a union operation on the vertex and edge sets of the specified graph and the current graph. Duplicate vertices are removed from the resulting `Graph`, while duplicate edges are preserved.
<p class="text-center">
<img alt="Union Transformation" width="50%" src="fig/gelly-union.png"/>
<img alt="Union Transformation" width="50%" src="{{ site.baseurl }}/fig/gelly-union.png"/>
</p>
* <strong>Difference</strong>: Gelly's `difference()` method performs a difference on the vertex and edge sets of the current graph and the specified graph.
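For example, a minimal sketch of both operations on two toy graphs (the graph names, vertex values, and edge weights are arbitrary):

{% highlight scala %}
import org.apache.flink.api.scala._
import org.apache.flink.graph.{Edge, Vertex}
import org.apache.flink.graph.scala.Graph

val env = ExecutionEnvironment.getExecutionEnvironment

// two small graphs built from toy vertex and edge lists
val vertices = env.fromElements(new Vertex(1L, 0L), new Vertex(2L, 0L), new Vertex(3L, 0L))
val edges = env.fromElements(new Edge(1L, 2L, 1.0), new Edge(2L, 3L, 2.0))
val network = Graph.fromDataSet(vertices, edges, env)

val otherVertices = env.fromElements(new Vertex(3L, 0L), new Vertex(4L, 0L))
val otherEdges = env.fromElements(new Edge(3L, 4L, 3.0))
val otherNetwork = Graph.fromDataSet(otherVertices, otherEdges, env)

// union de-duplicates vertices but keeps duplicate edges;
// difference removes the common vertices and edges from the current graph
val combined = network.union(otherNetwork)
val remainder = network.difference(otherNetwork)
{% endhighlight %}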
......@@ -645,7 +642,7 @@ The neighborhood scope is defined by the `EdgeDirection` parameter, which takes
For example, assume that you want to select the minimum weight of all out-edges for each vertex in the following graph:
<p class="text-center">
<img alt="reduceOnEdges Example" width="50%" src="fig/gelly-example-graph.png"/>
<img alt="reduceOnEdges Example" width="50%" src="{{ site.baseurl }}/fig/gelly-example-graph.png"/>
</p>
The following code will collect the out-edges for each vertex and apply the `SelectMinWeight()` user-defined function on each of the resulting neighborhoods:
......@@ -685,7 +682,7 @@ final class SelectMinWeight extends ReduceEdgesFunction[Double] {
</div>
<p class="text-center">
<img alt="reduceOnEdges Example" width="50%" src="fig/gelly-reduceOnEdges.png"/>
<img alt="reduceOnEdges Example" width="50%" src="{{ site.baseurl }}/fig/gelly-reduceOnEdges.png"/>
</p>
Similarly, assume that you would like to compute the sum of the values of all incoming neighbors for every vertex. The following code will collect the incoming neighbors for each vertex and apply the `SumValues()` user-defined function on each neighborhood:
......@@ -725,7 +722,7 @@ final class SumValues extends ReduceNeighborsFunction[Long] {
</div>
<p class="text-center">
<img alt="reduceOnNeighbors Example" width="70%" src="fig/gelly-reduceOnNeighbors.png"/>
<img alt="reduceOnNeighbors Example" width="70%" src="{{ site.baseurl }}/fig/gelly-reduceOnNeighbors.png"/>
</p>
When the aggregation function is not associative and commutative or when it is desirable to return more than one value per vertex, one can use the more general
......
---
title: Graph Generators
# Sub navigation
sub-nav-group: batch
sub-nav-parent: gelly
sub-nav-title: Graph Generators
nav-parent_id: graphs
nav-pos: 5
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......
---
title: "Gelly: Flink Graph API"
# Top navigation
top-nav-group: libs
top-nav-pos: 1
top-nav-title: "Graphs: Gelly"
# Sub navigation
sub-nav-group: batch
sub-nav-id: gelly
sub-nav-pos: 1
sub-nav-parent: libs
sub-nav-title: Gelly
nav-id: graphs
nav-show_overview: true
nav-title: "Graphs: Gelly"
nav-parent_id: libs
nav-pos: 3
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -67,7 +62,7 @@ Add the following dependency to your `pom.xml` to use Gelly.
</div>
</div>
Note that Gelly is currently not part of the binary distribution. See linking with it for cluster execution [here]({{ site.baseurl }}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
Note that Gelly is currently not part of the binary distribution. See linking with it for cluster execution [here]({{ site.baseurl }}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
The remaining sections provide a description of available methods and present several examples of how to use Gelly and how to mix it with the Flink DataSet API. After reading this guide, you might also want to check the {% gh_link /flink-libraries/flink-gelly-examples/ "Gelly examples" %}.
......
---
title: Iterative Graph Processing
# Sub navigation
sub-nav-group: batch
sub-nav-parent: gelly
sub-nav-title: Iterative Graph Processing
nav-parent_id: graphs
nav-pos: 2
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -40,7 +37,7 @@ In each superstep, all active vertices execute the
same user-defined computation in parallel. Supersteps are executed synchronously, so that messages sent during one superstep are guaranteed to be delivered at the beginning of the next superstep.
<p class="text-center">
<img alt="Vertex-Centric Computational Model" width="70%" src="fig/vertex-centric supersteps.png"/>
<img alt="Vertex-Centric Computational Model" width="70%" src="{{ site.baseurl }}/fig/vertex-centric supersteps.png"/>
</p>
To use vertex-centric iterations in Gelly, the user only needs to define the vertex compute function, `ComputeFunction`.
......@@ -176,7 +173,7 @@ and can be specified using the `setName()` method.
* <strong>Aggregators</strong>: Iteration aggregators can be registered using the `registerAggregator()` method. An iteration aggregator combines
all aggregates globally once per superstep and makes them available in the next superstep. Registered aggregators can be accessed inside the user-defined `ComputeFunction`.
* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/apis/batch/index.html#broadcast-variables) to the `ComputeFunction`, using the `addBroadcastSet()` method.
* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/dev/batch/index.html#broadcast-variables) to the `ComputeFunction`, using the `addBroadcastSet()` method.
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
......@@ -294,7 +291,7 @@ Additionally, the neighborhood type (in/out/all) over which to run the scatter-
Let us consider computing Single-Source-Shortest-Paths with scatter-gather iterations on the following graph and let vertex 1 be the source. In each superstep, each vertex sends a candidate distance message to all its neighbors. The message value is the sum of the current value of the vertex and the edge weight connecting this vertex with its neighbor. Upon receiving candidate distance messages, each vertex calculates the minimum distance and, if a shorter path has been discovered, it updates its value. If a vertex does not change its value during a superstep, then it does not produce messages for its neighbors for the next superstep. The algorithm converges when there are no value updates.
<p class="text-center">
<img alt="Scatter-gather SSSP superstep 1" width="70%" src="fig/gelly-vc-sssp1.png"/>
<img alt="Scatter-gather SSSP superstep 1" width="70%" src="{{ site.baseurl }}/fig/gelly-vc-sssp1.png"/>
</p>
<div class="codetabs" markdown="1">
......@@ -412,7 +409,7 @@ and can be specified using the `setName()` method.
* <strong>Aggregators</strong>: Iteration aggregators can be registered using the `registerAggregator()` method. An iteration aggregator combines
all aggregates globally once per superstep and makes them available in the next superstep. Registered aggregators can be accessed inside the user-defined `ScatterFunction` and `GatherFunction`.
* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/apis/batch/index.html#broadcast-variables) to the `ScatterFunction` and `GatherFunction`, using the `addBroadcastSetForUpdateFunction()` and `addBroadcastSetForMessagingFunction()` methods, respectively.
* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/dev/batch/index.html#broadcast-variables) to the `ScatterFunction` and `GatherFunction`, using the `addBroadcastSetForUpdateFunction()` and `addBroadcastSetForMessagingFunction()` methods, respectively.
* <strong>Number of Vertices</strong>: Accessing the total number of vertices within the iteration. This property can be set using the `setOptNumVertices()` method.
The number of vertices can then be accessed in the vertex update function and in the messaging function using the `getNumberOfVertices()` method. If the option is not set in the configuration, this method will return -1.
......@@ -664,7 +661,7 @@ Like in the scatter-gather model, Gather-Sum-Apply also proceeds in synchronized
Let us consider computing Single-Source-Shortest-Paths with GSA on the following graph and let vertex 1 be the source. During the `Gather` phase, we calculate the new candidate distances, by adding each vertex value with the edge weight. In `Sum`, the candidate distances are grouped by vertex ID and the minimum distance is chosen. In `Apply`, the newly calculated distance is compared to the current vertex value and the minimum of the two is assigned as the new value of the vertex.
<p class="text-center">
<img alt="GSA SSSP superstep 1" width="70%" src="fig/gelly-gsa-sssp1.png"/>
<img alt="GSA SSSP superstep 1" width="70%" src="{{ site.baseurl }}/fig/gelly-gsa-sssp1.png"/>
</p>
Notice that, if a vertex does not change its value during a superstep, it will not calculate candidate distance during the next superstep. The algorithm converges when no vertex changes value.
......@@ -784,7 +781,7 @@ Currently, the following parameters can be specified:
* <strong>Aggregators</strong>: Iteration aggregators can be registered using the `registerAggregator()` method. An iteration aggregator combines all aggregates globally once per superstep and makes them available in the next superstep. Registered aggregators can be accessed inside the user-defined `GatherFunction`, `SumFunction` and `ApplyFunction`.
* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/apis/index.html#broadcast-variables) to the `GatherFunction`, `SumFunction` and `ApplyFunction`, using the methods `addBroadcastSetForGatherFunction()`, `addBroadcastSetForSumFunction()` and `addBroadcastSetForApplyFunction` methods, respectively.
* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/dev/index.html#broadcast-variables) to the `GatherFunction`, `SumFunction` and `ApplyFunction`, using the methods `addBroadcastSetForGatherFunction()`, `addBroadcastSetForSumFunction()` and `addBroadcastSetForApplyFunction` methods, respectively.
* <strong>Number of Vertices</strong>: Accessing the total number of vertices within the iteration. This property can be set using the `setOptNumVertices()` method.
The number of vertices can then be accessed in the gather, sum and/or apply functions by using the `getNumberOfVertices()` method. If the option is not set in the configuration, this method will return -1.
......
---
title: Library Methods
# Sub navigation
sub-nav-group: batch
sub-nav-parent: gelly
sub-nav-title: Library Methods
nav-parent_id: graphs
nav-pos: 3
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......
---
mathjax: include
title: Alternating Least Squares
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: ALS
nav-title: ALS
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -39,7 +36,7 @@ The matrix $R$ can be called the ratings matrix with $$(R)_{i,j} = r_{i,j}$$.
In order to find the user and item matrix, the following problem is solved:
$$\arg\min_{U,V} \sum_{\{i,j\mid r_{i,j} \not= 0\}} \left(r_{i,j} - u_{i}^Tv_{j}\right)^2 +
$$\arg\min_{U,V} \sum_{\{i,j\mid r_{i,j} \not= 0\}} \left(r_{i,j} - u_{i}^Tv_{j}\right)^2 +
\lambda \left(\sum_{i} n_{u_i} \left\lVert u_i \right\rVert^2 + \sum_{j} n_{v_j} \left\lVert v_j \right\rVert^2 \right)$$
with $\lambda$ being the regularization factor, $$n_{u_i}$$ being the number of items the user $i$ has rated and $$n_{v_j}$$ being the number of times the item $j$ has been rated.
......@@ -59,13 +56,13 @@ As such, it supports the `fit` and `predict` operation.
### Fit
ALS is trained on the sparse representation of the rating matrix:
ALS is trained on the sparse representation of the rating matrix:
* `fit: DataSet[(Int, Int, Double)] => Unit`
* `fit: DataSet[(Int, Int, Double)] => Unit`
### Predict
ALS predicts for each tuple of row and column index the rating:
ALS predicts for each tuple of row and column index the rating:
* `predict: DataSet[(Int, Int)] => DataSet[(Int, Int, Double)]`
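Putting `fit` and `predict` together, a minimal usage sketch could look like the following; the toy ratings, the parameter values, and the unrated pair are assumptions for illustration:

{% highlight scala %}
import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

val env = ExecutionEnvironment.getExecutionEnvironment

// toy ratings: (userId, itemId, rating)
val ratings: DataSet[(Int, Int, Double)] = env.fromElements(
  (1, 10, 4.0), (1, 11, 2.0), (2, 10, 5.0))

val als = ALS()
  .setIterations(10)
  .setNumFactors(10)
  .setLambda(0.9)

als.fit(ratings)

// predict the rating for a (user, item) pair that has not been rated yet
val unrated: DataSet[(Int, Int)] = env.fromElements((2, 11))
val predictions: DataSet[(Int, Int, Double)] = als.predict(unrated)
{% endhighlight %}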
......@@ -115,9 +112,9 @@ The alternating least squares implementation can be controlled by the following
<td>
<p>
The number of blocks into which the user and item matrix are grouped.
The fewer blocks one uses, the less data is sent redundantly.
However, bigger blocks entail bigger update messages which have to be stored on the heap.
If the algorithm fails because of an OutOfMemoryException, then try to increase the number of blocks.
The fewer blocks one uses, the less data is sent redundantly.
However, bigger blocks entail bigger update messages which have to be stored on the heap.
If the algorithm fails because of an OutOfMemoryException, then try to increase the number of blocks.
(Default value: <strong>None</strong>)
</p>
</td>
......@@ -139,7 +136,7 @@ The alternating least squares implementation can be controlled by the following
If this value is set, then the algorithm is split into two preprocessing steps, the ALS iteration and a post-processing step which calculates a last ALS half-step.
The preprocessing steps calculate the <code>OutBlockInformation</code> and <code>InBlockInformation</code> for the given rating matrix.
The results of the individual steps are stored in the specified directory.
By splitting the algorithm into multiple smaller steps, Flink does not have to split the available memory amongst too many operators.
By splitting the algorithm into multiple smaller steps, Flink does not have to split the available memory amongst too many operators.
This allows the system to process bigger individual messages and improves the overall performance.
(Default value: <strong>None</strong>)
</p>
......
---
mathjax: include
title: How to Contribute
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: How To Contribute
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......
---
mathjax: include
title: Cross Validation
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Cross Validation
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -172,4 +168,4 @@ val dataKFolded: Array[TrainTestDataSet] = Splitter.kFoldSplit(data, 10)
// create an array of 5 datasets
val dataMultiRandom: Array[DataSet[T]] = Splitter.multiRandomSplit(data, Array(0.5, 0.1, 0.1, 0.1, 0.1))
{% endhighlight %}
\ No newline at end of file
{% endhighlight %}
---
mathjax: include
title: Distance Metrics
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Distance Metrics
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -87,7 +83,7 @@ Currently, FlinkML supports the following metrics:
<tr>
<td><strong>Tanimoto Distance</strong></td>
<td>
$$d(\x, \y) = 1 - \frac{\x^T\y}{\Vert \x \Vert^2 + \Vert \y \Vert^2 - \x^T\y}$$
$$d(\x, \y) = 1 - \frac{\x^T\y}{\Vert \x \Vert^2 + \Vert \y \Vert^2 - \x^T\y}$$
with $\x$ and $\y$ being bit-vectors
</td>
</tr>
......
---
title: "FlinkML - Machine Learning for Flink"
# Top navigation
top-nav-group: libs
top-nav-pos: 2
top-nav-title: Machine Learning
# Sub navigation
sub-nav-group: batch
sub-nav-id: flinkml
sub-nav-pos: 2
sub-nav-parent: libs
sub-nav-title: Machine Learning
nav-id: ml
nav-show_overview: true
nav-title: Machine Learning
nav-parent_id: libs
nav-pos: 4
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -73,7 +68,7 @@ FlinkML currently supports the following algorithms:
You can check out our [quickstart guide](quickstart.html) for a comprehensive getting started
example.
If you want to jump right in, you have to [set up a Flink program]({{ site.baseurl }}/apis/batch/index.html#linking-with-flink).
If you want to jump right in, you have to [set up a Flink program]({{ site.baseurl }}/dev/api_concepts.html#linking-with-flink).
Next, you have to add the FlinkML dependency to the `pom.xml` of your project.
{% highlight xml %}
......@@ -85,14 +80,12 @@ Next, you have to add the FlinkML dependency to the `pom.xml` of your project.
{% endhighlight %}
Note that FlinkML is currently not part of the binary distribution.
See linking with it for cluster execution [here]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
See linking with it for cluster execution [here]({{site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
Now you can start solving your analysis task.
The following code snippet shows how easy it is to train a multiple linear regression model.
{% highlight scala %}
// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
......@@ -148,4 +141,4 @@ If one wants to chain a `Predictor` to a `Transformer` or a set of chained `Tran
The Flink community welcomes all contributors who want to get involved in the development of Flink and its libraries.
In order to get quickly started with contributing to FlinkML, please read our official
[contribution guide]({{site.baseurl}}/libs/ml/contribution_guide.html).
[contribution guide]({{site.baseurl}}/dev/libs/ml/contribution_guide.html).
---
mathjax: include
htmlTitle: FlinkML - k-Nearest neighbors join
title: <a href="../ml">FlinkML</a> - k-Nearest neighbors join
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: k-Nearest neighbors join
title: k-Nearest Neighbors Join
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -37,11 +32,11 @@ $$
KNNJ(A, B, k) = \{ \left( b, KNN(b, A, k) \right) \text{ where } b \in B \text{ and } KNN(b, A, k) \text{ are the k-nearest points to }b\text{ in }A \}
$$
The brute-force approach is to compute the distance between every training and testing point. To ease the brute-force computation of computing the distance between every training point a quadtree is used. The quadtree scales well in the number of training points, though poorly in the spatial dimension. The algorithm will automatically choose whether or not to use the quadtree, though the user can override that decision by setting a parameter to force use or not use a quadtree.
The brute-force approach is to compute the distance between every training and testing point. To ease this brute-force computation, a quadtree is used. The quadtree scales well in the number of training points, though poorly in the spatial dimension. The algorithm automatically chooses whether or not to use the quadtree, though the user can override that decision by setting a parameter that forces it to be used or not used.
## Operations
`KNN` is a `Predictor`.
`KNN` is a `Predictor`.
As such, it supports the `fit` and `predict` operation.
### Fit
......
---
mathjax: include
title: MinMax Scaler
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: MinMax Scaler
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......
---
mathjax: include
title: Multiple linear regression
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Multiple Linear Regression
title: Multiple Linear Regression
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -63,7 +59,7 @@ under the License.
The convergence criterion is the relative change in the sum of squared residuals:
$$\frac{S_{k-1} - S_k}{S_{k-1}} < \rho$$
## Operations
`MultipleLinearRegression` is a `Predictor`.
......@@ -71,13 +67,13 @@ As such, it supports the `fit` and `predict` operation.
### Fit
MultipleLinearRegression is trained on a set of `LabeledVector`:
MultipleLinearRegression is trained on a set of `LabeledVector`:
* `fit: DataSet[LabeledVector] => Unit`
### Predict
MultipleLinearRegression predicts for all subtypes of `Vector` the corresponding regression value:
MultipleLinearRegression predicts for all subtypes of `Vector` the corresponding regression value:
* `predict[T <: Vector]: DataSet[T] => DataSet[LabeledVector]`
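A minimal end-to-end sketch, with toy training points and arbitrarily chosen parameter values, could look like this:

{% highlight scala %}
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.regression.MultipleLinearRegression

val env = ExecutionEnvironment.getExecutionEnvironment

// toy training data: a label plus a single feature
val training: DataSet[LabeledVector] = env.fromElements(
  LabeledVector(1.0, DenseVector(1.0)),
  LabeledVector(2.0, DenseVector(2.0)),
  LabeledVector(3.0, DenseVector(3.0)))

val mlr = MultipleLinearRegression()
  .setIterations(10)
  .setStepsize(0.5)

mlr.fit(training)

// predict the regression value for an unseen feature vector
val test: DataSet[DenseVector] = env.fromElements(DenseVector(4.0))
val predictions: DataSet[LabeledVector] = mlr.predict(test)
{% endhighlight %}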
......@@ -92,7 +88,7 @@ the algorithm's performance.
## Parameters
The multiple linear regression implementation can be controlled by the following parameters:
<table class="table table-bordered">
<thead>
<tr>
......@@ -116,7 +112,7 @@ the algorithm's performance.
<p>
Initial step size for the gradient descent method.
This value controls how far the gradient descent method moves in the opposite direction of the gradient.
Tuning this parameter might be crucial to make it stable and to obtain a better performance.
Tuning this parameter might be crucial to make it stable and to obtain a better performance.
(Default value: <strong>0.1</strong>)
</p>
</td>
......
---
mathjax: include
title: Optimization
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Optimization
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -310,7 +307,7 @@ Where:
<tr>
<td><strong>Constant</strong></td>
<td>
<p>
<p>
The step size is constant throughout the learning task.
</p>
</td>
......@@ -321,10 +318,10 @@ Where:
<td><strong>Leon Bottou's Method</strong></td>
<td>
<p>
This is the <code>'optimal'</code> method of sklearn.
This is the <code>'optimal'</code> method of sklearn.
The optimal initial value $t_0$ has to be provided.
Sklearn uses the following heuristic: $t_0 = \max(1.0, L^\prime(-\beta, 1.0) / (\alpha \cdot \beta)$
with $\beta = \sqrt{\frac{1}{\sqrt{\alpha}}}$ and $L^\prime(prediction, truth)$ being the derivative of the loss function.
with $\beta = \sqrt{\frac{1}{\sqrt{\alpha}}}$ and $L^\prime(prediction, truth)$ being the derivative of the loss function.
</p>
</td>
<td class="text-center">$\eta_j = 1 / (\lambda \cdot (t_0 + j -1)) $</td>
......
---
mathjax: include
title: Looking under the hood of pipelines
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Pipelines
nav-title: Pipelines
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -84,7 +82,7 @@ finding the correct weights in a linear regression task, or the mean and standar
the data in a feature scaler.
As evident by the naming, classes that implement
`Transformer` are transform operations like [scaling the input](standard_scaler.html) and
`Predictor` implementations are learning algorithms such as [Multiple Linear Regression]({{site.baseurl}}/libs/ml/multiple_linear_regression.html).
`Predictor` implementations are learning algorithms such as [Multiple Linear Regression]({{site.baseurl}}/dev/libs/ml/multiple_linear_regression.html).
Pipelines can be created by chaining together a number of Transformers, and the final link in a pipeline can be a Predictor or another Transformer.
Pipelines that end with Predictor cannot be chained any further.
Below is an example of how a pipeline can be formed:
......@@ -141,12 +139,12 @@ method of `FitOperation`. The `predict` method of `Predictor` and the `transform
In these methods the operation object is provided as an implicit parameter.
Scala will [look for implicits](http://docs.scala-lang.org/tutorials/FAQ/finding-implicits.html)
in the companion object of a type, so classes that implement these interfaces should provide these
in the companion object of a type, so classes that implement these interfaces should provide these
objects as implicit objects inside the companion object.
As an example we can look at the `StandardScaler` class. `StandardScaler` extends `Transformer`, so it has access to its `fit` and `transform` functions.
These two functions expect objects of `FitOperation` and `TransformOperation` as implicit parameters,
for the `fit` and `transform` methods respectively, which `StandardScaler` provides in its companion
These two functions expect objects of `FitOperation` and `TransformOperation` as implicit parameters,
for the `fit` and `transform` methods respectively, which `StandardScaler` provides in its companion
object, through `transformVectors` and `fitVectorStandardScaler`:
{% highlight scala %}
......@@ -192,10 +190,10 @@ This allows us to use the algorithm for input that is labeled or unlabeled, and
automatically, depending on the type of the input that we give to the fit and transform
operations. The correct implicit operation is chosen by the compiler, depending on the input type.
If we try to call the `fit` or `transform` methods with types that are not supported we will get a
runtime error before the job is launched.
While it would be possible to catch these kinds of errors at compile time as well, the error
messages that we are able to provide the user would be much less informative, which is why we chose
If we try to call the `fit` or `transform` methods with types that are not supported we will get a
runtime error before the job is launched.
While it would be possible to catch these kinds of errors at compile time as well, the error
messages that we are able to provide the user would be much less informative, which is why we chose
to throw runtime exceptions instead.
### Chaining
......@@ -237,7 +235,7 @@ object MeanTransformer {
case object Mean extends Parameter[Double] {
override val defaultValue: Option[Double] = Some(0.0)
}
def apply(): MeanTransformer = new MeanTransformer
}
{% endhighlight %}
......@@ -280,7 +278,7 @@ Thus, all the program logic takes place within the `FitOperation`.
The `FitOperation` has two type parameters.
The first defines the pipeline operator type for which this `FitOperation` shall work and the second type parameter defines the type of the data set elements.
If we first wanted to implement the `MeanTransformer` to work on `DenseVector`, we would, thus, have to provide an implementation for `FitOperation[MeanTransformer, DenseVector]`.
{% highlight scala %}
val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
......@@ -288,8 +286,8 @@ val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector]
val meanTrainingData: DataSet[DenseVector] = input
.map{ x => (x.asBreeze, 1) }
.reduce{
(left, right) =>
(left._1 + right._1, left._2 + right._2)
(left, right) =>
(left._1 + right._1, left._2 + right._2)
}
.map{ p => (p._1/p._2).fromBreeze }
}
......@@ -320,12 +318,12 @@ class MeanTransformer extends Transformer[Centering] {
val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
import org.apache.flink.ml.math.Breeze._
instance.meanOption = Some(input
.map{ x => (x.asBreeze, 1) }
.reduce{
(left, right) =>
(left._1 + right._1, left._2 + right._2)
(left, right) =>
(left._1 + right._1, left._2 + right._2)
}
.map{ p => (p._1/p._2).fromBreeze })
}
......@@ -339,14 +337,14 @@ A possible mean transforming implementation could look like the following.
val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] {
override def transform(
instance: MeanTransformer,
transformParameters: ParameterMap,
input: DataSet[DenseVector])
instance: MeanTransformer,
transformParameters: ParameterMap,
input: DataSet[DenseVector])
: DataSet[DenseVector] = {
val resultingParameters = parameters ++ transformParameters
val resultingMean = resultingParameters(MeanTransformer.Mean)
instance.meanOption match {
case Some(trainingMean) => {
input.map{ new MeanTransformMapper(resultingMean) }.withBroadcastSet(trainingMean, "trainingMean")
......@@ -362,12 +360,12 @@ class MeanTransformMapper(resultingMean: Double) extends RichMapFunction[DenseVe
override def open(parameters: Configuration): Unit = {
trainingMean = getRuntimeContext().getBroadcastVariable[DenseVector]("trainingMean").get(0)
}
override def map(vector: DenseVector): DenseVector = {
import org.apache.flink.ml.math.Breeze._
val result = vector.asBreeze - trainingMean.asBreeze + resultingMean
result.fromBreeze
}
}
......@@ -391,7 +389,7 @@ In order to make the compiler aware of our implementation, we have to define it
{% highlight scala %}
object MeanTransformer{
implicit val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] ...
implicit val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] ...
}
{% endhighlight %}
......@@ -434,12 +432,10 @@ Consequently, we have to define a `FitOperation[MeanTransformer, LabeledVector]`
{% highlight scala %}
object MeanTransformer {
implicit val labeledVectorFitOperation = new FitOperation[MeanTransformer, LabeledVector] ...
implicit val labeledVectorTransformOperation = new TransformOperation[MeanTransformer, LabeledVector, LabeledVector] ...
}
{% endhighlight %}
If we wanted to implement a `Predictor` instead of a `Transformer`, then we would have to provide a `FitOperation`, too.
Moreover, a `Predictor` requires a `PredictOperation` which implements how predictions are calculated from testing data.
---
mathjax: include
title: Polynomial Features
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Polynomial Features
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -39,7 +36,7 @@ $$\left(x, y, z, x^2, xy, y^2, yz, z^2, x^3, x^2y, x^2z, xy^2, xyz, xz^2, y^3, \
Flink's implementation orders the polynomials in decreasing order of their degree.
Given the vector $\left(3,2\right)^T$, the polynomial features vector of degree 3 would look like
$$\left(3^3, 3^2\cdot2, 3\cdot2^2, 2^3, 3^2, 3\cdot2, 2^2, 3, 2\right)^T$$
This transformer can be prepended to all `Transformer` and `Predictor` implementations which expect an input of type `LabeledVector` or any sub-type of `Vector`.
......@@ -55,7 +52,7 @@ PolynomialFeatures is not trained on data and, thus, supports all types of input
### Transform
PolynomialFeatures transforms all subtypes of `Vector` and `LabeledVector` into their respective types:
PolynomialFeatures transforms all subtypes of `Vector` and `LabeledVector` into their respective types:
* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
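A short sketch of the transformation on a toy `LabeledVector` (the degree and data are arbitrary):

{% highlight scala %}
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.preprocessing.PolynomialFeatures

val env = ExecutionEnvironment.getExecutionEnvironment

val input: DataSet[LabeledVector] = env.fromElements(
  LabeledVector(1.0, DenseVector(3.0, 2.0)))

// map the two input features onto the polynomial feature space of degree 3
val polyFeatures = PolynomialFeatures().setDegree(3)
val expanded: DataSet[LabeledVector] = polyFeatures.transform(input)
{% endhighlight %}

In a pipeline, the transformer would typically be chained in front of a predictor rather than applied in isolation.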
......@@ -77,7 +74,7 @@ The polynomial features transformer can be controlled by the following parameter
<td><strong>Degree</strong></td>
<td>
<p>
The maximum polynomial degree.
The maximum polynomial degree.
(Default value: <strong>10</strong>)
</p>
</td>
......
---
mathjax: include
title: Quickstart Guide
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Quickstart Guide
nav-title: Quickstart
nav-parent_id: ml
nav-pos: 0
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -56,7 +55,7 @@ through [principal components analysis](https://en.wikipedia.org/wiki/Principal_
## Linking with FlinkML
In order to use FlinkML in your project, first you have to
[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
[set up a Flink program]({{ site.baseurl }}/dev/api_concepts.html#linking-with-flink).
Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
{% highlight xml %}
......@@ -127,7 +126,7 @@ You can also save datasets in the LibSVM format using the `writeLibSVM` function
Let's import the svmguide1 dataset. You can download the
[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
This is an astroparticle binary classification dataset, used by Hsu et al. [[3]](#hsu) in their
This is an astroparticle binary classification dataset, used by Hsu et al. [[3]](#hsu) in their
practical Support Vector Machine (SVM) guide. It contains 4 numerical features, and the class label.
We can simply import the dataset then using:
......@@ -148,7 +147,7 @@ create a classifier.
Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
We can set a number of parameters for the classifier. Here we set the `Blocks` parameter,
which is used to split the input by the underlying CoCoA algorithm [[2]](#jaggi) uses. The
which is used to split the input for the underlying CoCoA algorithm [[2]](#jaggi). The
regularization parameter determines the amount of $l_2$ regularization applied, which is used
to avoid overfitting. The step size determines the contribution of the weight vector updates to
the next weight vector value. This parameter sets the initial step size.
......@@ -222,7 +221,7 @@ The result of the prediction on `LabeledVector`s is a data set of tuples where t
This quickstart guide can act as an introduction to the basic concepts of FlinkML, but there's a lot
more you can do.
We recommend going through the [FlinkML documentation](index.html), and trying out the different
We recommend going through the [FlinkML documentation]({{ site.baseurl }}/dev/libs/ml/index.html), and trying out the different
algorithms.
A very good way to get started is to play around with interesting datasets from the UCI ML
repository and the LibSVM datasets.
......@@ -234,10 +233,10 @@ If you would like to contribute some new algorithms take a look at our
**References**
<a name="murphy"></a>[1] Murphy, Kevin P. *Machine learning: a probabilistic perspective.* MIT
<a name="murphy"></a>[1] Murphy, Kevin P. *Machine learning: a probabilistic perspective.* MIT
press, 2012.
<a name="jaggi"></a>[2] Jaggi, Martin, et al. *Communication-efficient distributed dual
<a name="jaggi"></a>[2] Jaggi, Martin, et al. *Communication-efficient distributed dual
coordinate ascent.* Advances in Neural Information Processing Systems. 2014.
<a name="hsu"></a>[3] Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin.
......
---
mathjax: include
title: Standard Scaler
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: Standard Scaler
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -30,14 +27,14 @@ under the License.
## Description
The standard scaler scales the given data set, so that all features will have a user specified mean and variance.
The standard scaler scales the given data set so that all features will have a user-specified mean and variance.
In case the user does not provide a specific mean and standard deviation, the standard scaler transforms the features of the input data set to have mean equal to 0 and standard deviation equal to 1.
Given a set of input data $x_1, x_2,... x_n$, with mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$$
and standard deviation:
Given a set of input data $x_1, x_2,... x_n$, with mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$$
and standard deviation:
$$\sigma_{x}=\sqrt{ \frac{1}{n} \sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}$$
The scaled data set $z_1, z_2,...,z_n$ will be:
......@@ -53,16 +50,16 @@ As such, it supports the `fit` and `transform` operation.
### Fit
StandardScaler is trained on all subtypes of `Vector` or `LabeledVector`:
StandardScaler is trained on all subtypes of `Vector` or `LabeledVector`:
* `fit[T <: Vector]: DataSet[T] => Unit`
* `fit[T <: Vector]: DataSet[T] => Unit`
* `fit: DataSet[LabeledVector] => Unit`
### Transform
StandardScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
StandardScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
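For example, a minimal sketch that scales toy feature vectors to a user-specified mean and standard deviation:

{% highlight scala %}
import org.apache.flink.api.scala._
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.preprocessing.StandardScaler

val env = ExecutionEnvironment.getExecutionEnvironment

val features: DataSet[DenseVector] = env.fromElements(
  DenseVector(1.0, 2.0), DenseVector(3.0, 6.0), DenseVector(5.0, 10.0))

// scale to mean 10 and standard deviation 2 instead of the default (0, 1)
val scaler = StandardScaler()
  .setMean(10.0)
  .setStd(2.0)

scaler.fit(features)
val scaled: DataSet[DenseVector] = scaler.transform(features)
{% endhighlight %}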
## Parameters
......
---
mathjax: include
title: SVM using CoCoA
# Sub navigation
sub-nav-group: batch
sub-nav-parent: flinkml
sub-nav-title: SVM (CoCoA)
nav-parent_id: ml
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -172,8 +169,8 @@ The SVM implementation can be controlled by the following parameters:
<td>
<p>
Determines whether the predict and evaluate functions of the SVM should return the distance
to the separating hyperplane, or binary class labels. Setting this to true will
return the raw distance to the hyperplane for each example. Setting it to false will
to the separating hyperplane, or binary class labels. Setting this to true will
return the raw distance to the hyperplane for each example. Setting it to false will
return the binary class label (+1.0, -1.0) (Default value: <strong>false</strong>)
</p>
</td>
......
---
title: "Storm Compatibility"
is_beta: true
sub-nav-group: streaming
sub-nav-pos: 9
nav-parent_id: libs
nav-pos: 2
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -23,7 +23,7 @@ specific language governing permissions and limitations
under the License.
-->
[Flink streaming](index.html) is compatible with Apache Storm interfaces and therefore allows
[Flink streaming]({{ site.baseurl }}/dev/datastream_api.html) is compatible with Apache Storm interfaces and therefore allows
reusing code that was implemented for Storm.
You can:
......@@ -104,7 +104,7 @@ if(runLocal) { // submit to test cluster
As an alternative, Spouts and Bolts can be embedded into regular streaming programs.
The Storm compatibility layer offers wrapper classes for each, namely `SpoutWrapper` and `BoltWrapper` (`org.apache.flink.storm.wrappers`).
Per default, both wrappers convert Storm output tuples to Flink's [Tuple]({{site.baseurl}}/apis/batch/index.html#tuples-and-case-classes) types (ie, `Tuple0` to `Tuple25` according to the number of fields of the Storm tuples).
By default, both wrappers convert Storm output tuples to Flink's [Tuple]({{site.baseurl}}/dev/api_concepts.html#tuples-and-case-classes) types (i.e., `Tuple0` to `Tuple25` according to the number of fields of the Storm tuples).
For single-field output tuples, a conversion to the field's data type is also possible (e.g., `String` instead of `Tuple1<String>`).
Because Flink cannot infer the output field types of Storm operators, it is required to specify the output type manually.
......@@ -134,7 +134,7 @@ DataStream<String> rawInput = env.addSource(
If a Spout emits a finite number of tuples, `SpoutWrapper` can be configured to terminate automatically by setting the `numberOfInvocations` parameter in its constructor.
This allows the Flink program to shut down automatically after all data is processed.
Per default the program will run until it is [canceled]({{site.baseurl}}/apis/cli.html) manually.
By default the program will run until it is [canceled]({{site.baseurl}}/setup/cli.html) manually.
## Embed Bolts
......@@ -165,8 +165,8 @@ DataStream<Tuple2<String, Integer>> counts = text.transform(
Bolts can access input tuple fields by name (in addition to access by index).
To use this feature with embedded Bolts, you need to have either a
1. [POJO]({{site.baseurl}}/apis/batch/index.html#pojos) type input stream or
2. [Tuple]({{site.baseurl}}/apis/batch/index.html#tuples-and-case-classes) type input stream and specify the input schema (i.e. name-to-index-mapping)
1. [POJO]({{site.baseurl}}/dev/api_concepts.html#pojos) type input stream or
2. [Tuple]({{site.baseurl}}/dev/api_concepts.html#tuples-and-case-classes) type input stream and specify the input schema (i.e. name-to-index-mapping)
For POJO input types, Flink accesses the fields via reflection.
For this case, Flink expects either a corresponding public member variable or public getter method.
......
---
title: "Local Execution"
# Top-level navigation
top-nav-group: apis
top-nav-pos: 7
nav-parent_id: dev
nav-pos: 11
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -38,7 +37,7 @@ The `CollectionEnvironment` is executing the Flink program on Java collections.
## Debugging
If you are running Flink programs locally, you can also debug your program like any other Java program. You can either use `System.out.println()` to write out some internal variables or you can use the debugger. It is possible to set breakpoints within `map()`, `reduce()` and all the other methods.
Please also refer to the [debugging section](programming_guide.html#debugging) in the Java API documentation for a guide to testing and local debugging utilities in the Java API.
Please also refer to the [debugging section]({{ site.baseurl }}/dev/batch/index.html#debugging) in the Java API documentation for a guide to testing and local debugging utilities in the Java API.
## Maven Dependency
......@@ -58,7 +57,7 @@ The `LocalEnvironment` is a handle to local execution for Flink programs. Use it
The local environment is instantiated via the method `ExecutionEnvironment.createLocalEnvironment()`. By default, it will use as many local threads for execution as your machine has CPU cores (hardware contexts). You can alternatively specify the desired parallelism. The local environment can be configured to log to the console using `enableLogging()`/`disableLogging()`.
In most cases, calling `ExecutionEnvironment.getExecutionEnvironment()` is the even better way to go. That method returns a `LocalEnvironment` when the program is started locally (outside the command line interface), and it returns a pre-configured environment for cluster execution, when the program is invoked by the [command line interface](cli.html).
In most cases, calling `ExecutionEnvironment.getExecutionEnvironment()` is an even better way to go. That method returns a `LocalEnvironment` when the program is started locally (outside the command line interface), and it returns a pre-configured environment for cluster execution when the program is invoked by the [command line interface]({{ site.baseurl }}/setup/cli.html).
~~~java
public static void main(String[] args) throws Exception {
......
All image files in the folder and its subfolders are
licensed to the Apache Software Foundation (ASF) under one
---
title: "Quickstarts"
nav-id: quickstarts
nav-parent_id: dev
nav-pos: 1
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
......@@ -7,7 +13,7 @@ to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
......@@ -15,3 +21,4 @@ software distributed under the License is distributed on an
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
---
title: "Scala API Extensions"
# Top-level navigation
top-nav-group: apis
top-nav-pos: 11
nav-parent_id: apis
nav-pos: 104
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -23,11 +22,11 @@ specific language governing permissions and limitations
under the License.
-->
In order to keep a fair amount of consistency between the Scala and Java APIs, some
In order to keep a fair amount of consistency between the Scala and Java APIs, some
of the features that allow a high level of expressiveness in Scala have been left
out from the standard APIs for both batch and streaming.
If you want to _enjoy the full Scala experience_ you can choose to opt-in to
If you want to _enjoy the full Scala experience_ you can choose to opt-in to
extensions that enhance the Scala API via implicit conversions.
To use all the available extensions, you can just add a simple `import` for the
......@@ -62,7 +61,7 @@ data.map {
{% endhighlight %}
This extension introduces new methods in both the DataSet and DataStream Scala API
that have a one-to-one correspondance in the extended API. These delegating methods
that have a one-to-one correspondence in the extended API. These delegating methods
do support anonymous pattern matching functions.
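For instance, with the DataSet extensions in scope (assuming the `org.apache.flink.api.scala.extensions._` import mentioned above), `mapWith` accepts a pattern-matching function directly. A minimal sketch:

{% highlight scala %}
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.extensions._

val env = ExecutionEnvironment.getExecutionEnvironment
val pairs = env.fromElements((1, "flink"), (2, "scala"))

// mapWith takes an anonymous pattern-matching function,
// which the overloaded map of the standard API cannot accept directly
val names = pairs.mapWith {
  case (_, name) => name
}
{% endhighlight %}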
#### DataSet API
......@@ -368,8 +367,8 @@ data1.join(data2).
For more information on the semantics of each method, please refer to the
[DataStream](batch/index.html) and [DataSet](streaming/index.html) API documentation.
For more information on the semantics of each method, please refer to the
[DataSet]({{ site.baseurl }}/dev/batch/index.html) and [DataStream]({{ site.baseurl }}/dev/datastream_api.html) API documentation.
To use this extension exclusively, you can add the following `import`:
......
---
title: "Scala Shell"
# Top-level navigation
top-nav-group: apis
top-nav-pos: 10
nav-parent_id: dev
nav-pos: 10
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -23,11 +22,9 @@ specific language governing permissions and limitations
under the License.
-->
Flink comes with an integrated interactive Scala Shell.
It can be used in a local setup as well as in a cluster setup.
To use the shell with an integrated Flink cluster just execute:
~~~bash
......@@ -37,7 +34,6 @@ bin/start-scala-shell.sh local
in the root directory of your Flink binary distribution. To run the Shell on a
cluster, please see the Setup section below.
## Usage
The shell supports Batch and Streaming.
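For instance, once the shell is running, the pre-bound batch environment `benv` can be used directly. A minimal word-count sketch (with `print()` triggering eager execution):

{% highlight scala %}
// inside the shell, the batch environment is pre-bound as `benv`
val text = benv.fromElements(
  "To be, or not to be,--that is the question:--")

val counts = text
  .flatMap { _.toLowerCase.split("\\W+").filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)

counts.print()
{% endhighlight %}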
......@@ -134,10 +130,10 @@ shell. The number of YARN containers can be controlled by the parameter `-n <arg
The shell deploys a new Flink cluster on YARN and connects to the
cluster. You can also specify options for the YARN cluster, such as memory for
the JobManager, the name of the YARN application, etc.
For example, to start a YARN cluster for the Scala Shell with two TaskManagers
use the following:
~~~bash
bin/start-scala-shell.sh yarn -n 2
~~~
......
---
title: "Working with State"
sub-nav-parent: fault_tolerance
sub-nav-group: streaming
sub-nav-pos: 1
nav-parent_id: dev
nav-pos: 3
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -41,7 +39,7 @@ Flink's state interface.
By default state checkpoints will be stored in-memory at the JobManager. For proper persistence of large
state, Flink supports storing the checkpoints on file systems (HDFS, S3, or any mounted POSIX file system),
which can be configured in the `flink-conf.yaml` or via `StreamExecutionEnvironment.setStateBackend(…)`.
See [state backends]({{ site.baseurl }}/apis/streaming/state_backends.html) for information
See [state backends]({{ site.baseurl }}/dev/state_backends.html) for information
about the available state backends and how to configure them.
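For example, a streaming job could enable checkpointing and switch to a file system backend programmatically. A minimal sketch, in which the checkpoint interval and the HDFS path are placeholders:

{% highlight scala %}
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// checkpoint every 10 seconds and keep the checkpointed state on a (placeholder) HDFS path
env.enableCheckpointing(10000)
env.setStateBackend(new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"))
{% endhighlight %}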
* ToC
......@@ -292,4 +290,4 @@ Flink currently only provides processing guarantees for jobs without iterations.
Please note that records in flight in the loop edges (and the state changes associated with them) will be lost during failure.
{% top %}
\ No newline at end of file
{% top %}
---
title: "State Backends"
sub-nav-group: streaming
sub-nav-pos: 2
sub-nav-parent: fault_tolerance
title: "State Backends"
nav-parent_id: dev
nav-pos: 5
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -23,13 +22,13 @@ specific language governing permissions and limitations
under the License.
-->
Programs written in the [Data Stream API](index.html) often hold state in various forms:
Programs written in the [Data Stream API]({{ site.baseurl }}/dev/datastream_api.html) often hold state in various forms:
- Windows gather elements or aggregates until they are triggered
- Transformation functions may use the key/value state interface to store values
- Transformation functions may implement the `Checkpointed` interface to make their local variables fault tolerant
See also [Working with State](state.html) in the streaming API guide.
See also [Working with State]({{ site.baseurl }}/dev/state.html) in the streaming API guide.
When checkpointing is activated, such state is persisted upon checkpoints to guard against data loss and recover consistently.
How the state is represented internally, and how and where it is persisted upon checkpoints depends on the
......@@ -111,7 +110,7 @@ project:
{% endhighlight %}
The backend is currently not part of the binary distribution. See
[here]({{ site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
[here]({{ site.baseurl}}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
for an explanation of how to include it for cluster execution.
## Configuring a State Backend
......
---
title: "Table API and SQL"
title: "Table and SQL"
is_beta: true
# Top-level navigation
top-nav-group: apis
top-nav-pos: 4
top-nav-title: "<strong>Table API and SQL</strong>"
nav-parent_id: apis
nav-pos: 3
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -25,7 +23,6 @@ specific language governing permissions and limitations
under the License.
-->
**Table API and SQL are experimental features**
The Table API is a SQL-like expression language for relational stream and batch processing that can be easily embedded in Flink's DataSet and DataStream APIs (Java and Scala).
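As a quick illustration, a `DataSet` can be converted into a `Table` with named fields and queried with relational operators. The following is a minimal sketch with toy data:

{% highlight scala %}
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.table._
import org.apache.flink.api.table.TableEnvironment

val env = ExecutionEnvironment.getExecutionEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(env)

// toy input: (category, amount)
val orders: DataSet[(String, Int)] = env.fromElements(
  ("books", 1), ("books", 2), ("games", 3))

val table = orders.toTable(tableEnv, 'category, 'amount)

// group by category and sum up the amounts
val result = table
  .groupBy('category)
  .select('category, 'amount.sum as 'total)
{% endhighlight %}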
......@@ -50,7 +47,7 @@ The following dependency must be added to your project in order to use the Table
</dependency>
{% endhighlight %}
*Note: The Table API is currently not part of the binary distribution. See linking with it for cluster execution [here]({{ site.baseurl }}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).*
*Note: The Table API is currently not part of the binary distribution. See linking with it for cluster execution [here]({{ site.baseurl }}/dev/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).*
Registering Tables
......@@ -205,7 +202,7 @@ tableEnv.registerTableSource("Customers", custTS)
A `TableSource` can provide access to data stored in various storage systems such as databases (MySQL, HBase, ...), file formats (CSV, Apache Parquet, Avro, ORC, ...), or messaging systems (Apache Kafka, RabbitMQ, ...).
Currently, Flink provides the `CsvTableSource` to read CSV files and the `Kafka08JsonTableSource`/`Kafka09JsonTableSource` to read JSON objects from Kafka.
Currently, Flink provides the `CsvTableSource` to read CSV files and the `Kafka08JsonTableSource`/`Kafka09JsonTableSource` to read JSON objects from Kafka.
A custom `TableSource` can be defined by implementing the `BatchTableSource` or `StreamTableSource` interface.
### Available Table Sources
......@@ -693,12 +690,12 @@ Table result = in.orderBy("a.asc");
<p>Similar to a SQL LIMIT clause. Limits a sorted result to a specified number of records from an offset position. Limit is technically part of the Order By operator and thus must be preceded by it.</p>
{% highlight java %}
Table in = tableEnv.fromDataSet(ds, "a, b, c");
Table result = in.orderBy("a.asc").limit(3); // returns unlimited number of records beginning with the 4th record
Table result = in.orderBy("a.asc").limit(3); // returns unlimited number of records beginning with the 4th record
{% endhighlight %}
or
{% highlight java %}
Table in = tableEnv.fromDataSet(ds, "a, b, c");
Table result = in.orderBy("a.asc").limit(3, 5); // returns 5 records beginning with the 4th record
Table result = in.orderBy("a.asc").limit(3, 5); // returns 5 records beginning with the 4th record
{% endhighlight %}
</td>
</tr>
......@@ -915,12 +912,12 @@ val result = in.orderBy('a.asc);
<p>Similar to a SQL LIMIT clause. Limits a sorted result to a specified number of records from an offset position. Limit is technically part of the Order By operator and thus must be preceded by it.</p>
{% highlight scala %}
val in = ds.toTable(tableEnv, 'a, 'b, 'c);
val result = in.orderBy('a.asc).limit(3); // returns unlimited number of records beginning with the 4th record
val result = in.orderBy('a.asc).limit(3); // returns unlimited number of records beginning with the 4th record
{% endhighlight %}
or
{% highlight scala %}
val in = ds.toTable(tableEnv, 'a, 'b, 'c);
val result = in.orderBy('a.asc).limit(3, 5); // returns 5 records beginning with the 4th record
val result = in.orderBy('a.asc).limit(3, 5); // returns 5 records beginning with the 4th record
{% endhighlight %}
</td>
</tr>
......@@ -1936,7 +1933,7 @@ EXTRACT(TIMEINTERVALUNIT FROM TEMPORAL)
### User-defined Scalar Functions
If a required scalar function is not contained in the built-in functions, it is possible to define custom, user-defined scalar functions for both the Table API and SQL. A user-defined scalar functions maps zero, one, or multiple scalar values to a new scalar value.
If a required scalar function is not contained in the built-in functions, it is possible to define custom, user-defined scalar functions for both the Table API and SQL. A user-defined scalar function maps zero, one, or multiple scalar values to a new scalar value.
In order to define a scalar function one has to extend the base class `ScalarFunction` in `org.apache.flink.api.table.functions` and implement (one or more) evaluation methods. The behavior of a scalar function is determined by the evaluation method. An evaluation method must be declared publicly and named `eval`. The parameter types and return type of the evaluation method also determine the parameter and return types of the scalar function. Evaluation methods can also be overloaded by implementing multiple methods named `eval`.
......
---
title: "Type Extraction and Serialization"
# Top navigation
top-nav-group: internals
top-nav-pos: 5
title: "Data Types"
nav-id: types
nav-parent_id: dev
nav-pos: 9
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -23,13 +23,12 @@ specific language governing permissions and limitations
under the License.
-->
Flink handles types in a unique way, containing its own type descriptors,
generic type extraction, and type serialization framework.
This document describes the concepts and the rationale behind them.
There are fundamental differences in the way that the Scala API and
the Java API handle type information, so most of the issues described
here relate only to one of the two APIs.
* This will be replaced by the TOC
......@@ -68,14 +67,14 @@ Internally, Flink makes the following distinctions between types:
* Primitive arrays and Object arrays
* Composite types
* Flink Java Tuples (part of the Flink Java API)
* Scala *case classes* (including Scala tuples)
* POJOs: classes that follow a certain bean-like pattern
* Scala auxiliary types (Option, Either, Lists, Maps, ...)
* Generic types: These will not be serialized by Flink itself, but by Kryo.
......@@ -144,7 +143,7 @@ for every call and are not known at the site where the method is defined. The co
in an error that not enough implicit evidence is available.
In such cases, the type information has to be generated at the invocation site and passed to the
method. Scala offers *implicit parameters* for that.
The following code tells Scala to bring a type information for *T* into the function. The type
information will then be generated at the sites where the method is invoked, rather than where the
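The snippet referenced above is folded in this diff; a minimal sketch of the pattern, using a hypothetical helper `selectFirst` (name and signature are illustrative, not taken from the documentation), might look like this:

{% highlight scala %}
import scala.reflect.ClassTag

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._

// The context bounds request an implicit TypeInformation[T] (and ClassTag[T]).
// With the Scala API import above, the TypeInformation is generated by the
// compiler at each invocation site, where the concrete type of T is known.
def selectFirst[T: TypeInformation: ClassTag](input: DataSet[(T, String)]): DataSet[T] =
  input.map(pair => pair._1)
{% endhighlight %}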
......@@ -216,7 +215,7 @@ onwards)
**Improving Type information for Java Lambdas**
One of the Flink committers (Timo Walther) has actually become active in the Eclipse JDT compiler community and
in the OpenJDK community and submitted patches to the compiler to improve the availability of type
information for Java 8 lambdas.
The Eclipse JDT compiler has added support for this as of version 4.5 M4. Discussion about the feature in the
......@@ -225,7 +224,7 @@ OpenJDK compiler is pending.
#### Serialization of POJO types
The PojoTypeInformation creates serializers for all the fields inside the POJO. Standard types such as
int, long, String etc. are handled by serializers we ship with Flink.
For all other types, we fall back to Kryo.
......@@ -252,7 +251,3 @@ env.getConfig().addDefaultKryoSerializer(Class<?> type, Class<? extends Serializ
{% endhighlight %}
There are different variants of these methods available.
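For example (the types `MyCustomType` and `MyCustomSerializer` below are made-up placeholders, not classes shipped with Flink), registering a default Kryo serializer from the Scala API could look like this:

{% highlight scala %}
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}

import org.apache.flink.api.scala._

// Hypothetical type that Flink cannot serialize itself and hands to Kryo.
class MyCustomType(var value: String)

// Hypothetical Kryo serializer for that type.
class MyCustomSerializer extends Serializer[MyCustomType] {
  override def write(kryo: Kryo, output: Output, obj: MyCustomType): Unit =
    output.writeString(obj.value)

  override def read(kryo: Kryo, input: Input, clazz: Class[MyCustomType]): MyCustomType =
    new MyCustomType(input.readString())
}

// Inside your program's main method:
val env = ExecutionEnvironment.getExecutionEnvironment

// Instruct the fallback Kryo serializer to use MyCustomSerializer for MyCustomType.
env.getConfig.addDefaultKryoSerializer(classOf[MyCustomType], classOf[MyCustomSerializer])
{% endhighlight %}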
---
title: "Windows"
sub-nav-id: windows
sub-nav-group: streaming
sub-nav-pos: 4
nav-parent_id: dev
nav-id: windows
nav-pos: 3
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
......@@ -42,7 +41,7 @@ windows work.
## Basics
For a windowed transformation you must at least specify a *key*
(see [specifying keys](/apis/common/index.html#specifying-keys)),
(see [specifying keys]({{ site.baseurl }}/dev/api_concepts.html#specifying-keys)),
a *window assigner* and a *window function*. The *key* divides the infinite, non-keyed, stream
into logical keyed streams while the *window assigner* assigns elements to finite per-key windows.
Finally, the *window function* is used to process the elements of each window.
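As a minimal sketch of how these three parts fit together (the key selector, window size, and aggregation below are illustrative assumptions, not the documented example):

{% highlight scala %}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Inside your program's main method:
val env = StreamExecutionEnvironment.getExecutionEnvironment

// Hypothetical input of (word, count) records. Event-time windows additionally
// require timestamps and watermarks to be assigned (see the event time docs).
val input: DataStream[(String, Int)] = env.fromElements(("a", 1), ("b", 2), ("a", 3))

val counts = input
  .keyBy(_._1)                                            // key: splits the stream into keyed streams
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))   // window assigner: finite per-key windows
  .sum(1)                                                 // window function: here a simple aggregation
{% endhighlight %}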
......@@ -90,7 +89,7 @@ with pre-implemented window assigners for the most typical use cases, namely *tu
*sliding windows*, *session windows* and *global windows*, but you can implement your own by
extending the `WindowAssigner` class. All the built-in window assigners, except for the global
windows one, assign elements to windows based on time, which can either be processing time or event
time. Please take a look at our section on [event time](/apis/streaming/event_time.html) for more
time. Please take a look at our section on [event time]({{ site.baseurl }}/dev/event_time.html) for more
information about how Flink deals with time.
Let's first look at how each of these window assigners works before looking at how they can be used
......@@ -107,7 +106,7 @@ This windowing scheme is only useful if you also specify a custom [trigger](#tri
no computation is ever going to be performed, as the global window does not have a natural end at
which we could process the aggregated elements.
<img src="non-windowed.svg" class="center" style="width: 80%;" />
<img src="{{ site.baseurl }}/fig/non-windowed.svg" class="center" style="width: 80%;" />
### Tumbling Windows
......@@ -115,7 +114,7 @@ A *tumbling windows* assigner assigns elements to fixed length, non-overlapping
specified *window size*. For example, if you specify a window size of 5 minutes, the window
function will get 5 minutes worth of elements in each invocation.
<img src="tumbling-windows.svg" class="center" style="width: 80%;" />
<img src="{{ site.baseurl }}/fig/tumbling-windows.svg" class="center" style="width: 80%;" />
### Sliding Windows
......@@ -128,7 +127,7 @@ For example, you could have windows of size 10 minutes that slide by 5 minutes.
minutes worth of elements in each invocation of the window function and it will be invoked for every
5 minutes of data.
<img src="sliding-windows.svg" class="center" style="width: 80%;" />
<img src="{{ site.baseurl }}/fig/sliding-windows.svg" class="center" style="width: 80%;" />
### Session Windows
......@@ -139,7 +138,7 @@ to have windows that start at individual points in time for each key and that en
been a certain period of inactivity. The configuration parameter is the *session gap* that specifies
how long to wait for new data before considering a session as closed.
<img src="session-windows.svg" class="center" style="width: 80%;" />
<img src="{{ site.baseurl }}/fig/session-windows.svg" class="center" style="width: 80%;" />
### Specifying a Window Assigner
......@@ -147,7 +146,7 @@ The built-in window assigners (except `GlobalWindows`) come in two versions. One
windowing and one for event-time windowing. The processing-time assigners assign elements to
windows based on the current clock of the worker machines while the event-time assigners assign
windows based on the timestamps of elements. Please have a look at
[event time](/apis/streaming/event_time.html) to learn about the difference between processing time
[event time]({{ site.baseurl }}/dev/event_time.html) to learn about the difference between processing time
and event time and about how timestamps can be assigned to elements.
The following code snippets show how each of the window assigners can be used in a program:
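The snippets themselves are folded in this diff; a sketch of what using the different built-in assigners could look like (the key selector, sizes, and gap are illustrative assumptions):

{% highlight scala %}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.{
  EventTimeSessionWindows, SlidingProcessingTimeWindows, TumblingEventTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

// Inside your program's main method:
val env = StreamExecutionEnvironment.getExecutionEnvironment

// Hypothetical input of (word, count) records. The event-time assigners
// additionally require timestamps and watermarks (see the event time docs).
val input: DataStream[(String, Int)] = env.fromElements(("a", 1), ("b", 2))

// Tumbling event-time windows of 5 seconds.
input.keyBy(_._1)
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  .sum(1)

// Sliding processing-time windows of 10 seconds, sliding by 5 seconds.
input.keyBy(_._1)
  .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
  .sum(1)

// Event-time session windows with a 30-minute inactivity gap.
input.keyBy(_._1)
  .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
  .sum(1)
{% endhighlight %}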
......
(The diffs of the remaining changed files are collapsed.)