streaming_guide.md 140.8 KB
Newer Older
1
---
2 3
title: "Flink DataStream API Programming Guide"
is_beta: false
4
---
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
23

24 25
<a href="#top"></a>

26 27 28 29 30 31
DataStream programs in Flink are regular programs that implement transformations on data streams
(e.g., filtering, updating state, defining windows, aggregating). The data streams are initially created from various
sources (e.g., message queues, socket streams, files). Results are returned via sinks, which may for
example write the data to files, or to standard output (for example the command line
terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs.
The execution can happen in a local JVM, or on clusters of many machines.
32

33 34 35 36
In order to create your own Flink DataStream program, we encourage you to start with the
[program skeleton](#program-skeleton) and gradually add your own
[transformations](#transformations). The remaining sections act as references for additional
operations and advanced features.
37 38


39 40
* This will be replaced by the TOC
{:toc}
41 42 43 44 45


Example Program
---------------

46 47
The following program is a complete, working example of streaming window word count application, that counts the
words coming from a web socket in 5 second windows. You can copy &amp; paste the code to run it locally.
48

49 50 51 52
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">

{% highlight java %}
53
public class WindowWordCount {
54

55
    public static void main(String[] args) throws Exception {
56

57
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
58

59
        DataStream<Tuple2<String, Integer>> dataStream = env
60
                .socketTextStream("localhost", 9999)
61
                .flatMap(new Splitter())
62 63
                .keyBy(0)
                .timeWindow(Time.of(5, TimeUnit.SECONDS))
64
                .sum(1);
65

66
        dataStream.print();
67 68

        env.execute("Window WordCount");
69 70 71 72 73 74 75 76 77 78 79
    }
    
    public static class Splitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
            for (String word: sentence.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
    
80
}
81 82 83 84 85 86 87
{% endhighlight %}

</div>

<div data-lang="scala" markdown="1">
{% highlight scala %}

88
object WindowWordCount {
89 90 91 92 93 94 95
  def main(args: Array[String]) {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)

    val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
96 97
      .keyBy(0)
      .timeWindow(Time.of(5, TimeUnit.SECONDS))
98 99 100 101
      .sum(1)

    counts.print

102
    env.execute("Window Stream WordCount")
103 104 105 106 107 108
  }
}
{% endhighlight %}
</div>

</div>
109

110
To run the example program, start the input stream with netcat first from a terminal:
111

112
~~~bash
113 114 115
nc -lk 9999
~~~

116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
Just type some words hitting return for a new word. These will be the input to the
word count program. If you want to see counts greater than 1, type the same word again and again within 
5 seconds (increase the window size from 5 seconds if you cannot type that fast &#9786;).

[Back to top](#top)


Linking with Flink
------------------

To write programs with Flink, you need to include the Flink DataStream library corresponding to
your programming language in your project.

The simplest way to do this is to use one of the quickstart scripts: either for
[Java]({{ site.baseurl }}/quickstart/java_api_quickstart.html) or for [Scala]({{ site.baseurl }}/quickstart/scala_api_quickstart.html). They
create a blank project from a template (a Maven Archetype), which sets up everything for you. To
manually create the project, you can use the archetype and create a project by calling:

<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight bash %}
mvn archetype:generate /
    -DarchetypeGroupId=org.apache.flink/
    -DarchetypeArtifactId=flink-quickstart-java /
    -DarchetypeVersion={{site.version }}
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight bash %}
mvn archetype:generate /
    -DarchetypeGroupId=org.apache.flink/
    -DarchetypeArtifactId=flink-quickstart-scala /
    -DarchetypeVersion={{site.version }}
{% endhighlight %}
</div>
</div>

The archetypes are working for stable releases and preview versions (`-SNAPSHOT`).

If you want to add Flink to an existing Maven project, add the following entry to your
*dependencies* section in the *pom.xml* file of your project:

<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight xml %}
<dependency>
  <groupId>org.apache.flink</groupId>
163
  <artifactId>flink-streaming-java</artifactId>
164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
  <version>{{site.version }}</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-clients</artifactId>
  <version>{{site.version }}</version>
</dependency>
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight xml %}
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-scala</artifactId>
  <version>{{site.version }}</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-clients</artifactId>
  <version>{{site.version }}</version>
</dependency>
{% endhighlight %}
</div>
</div>

In order to create your own Flink program, we encourage you to start with the
[program skeleton](#program-skeleton) and gradually add your own
[transformations](#transformations).
192

193 194 195 196 197
[Back to top](#top)

Program Skeleton
----------------

198 199 200
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">

201 202 203 204
<br />

As presented in the [example](#example-program), Flink DataStream programs look like regular Java
programs with a `main()` method. Each program consists of the same basic parts:
205

206
1. Obtaining a `StreamExecutionEnvironment`,
207 208 209 210 211
2. Connecting to data stream sources,
3. Specifying transformations on the data streams,
4. Specifying output for the processed data,
5. Executing the program.

212 213 214 215 216
We will now give an overview of each of those steps, please refer to the respective sections for
more details. 

The `StreamExecutionEnvironment` is the basis for all Flink DataStream programs. You can
obtain one using these static methods on class `StreamExecutionEnvironment`:
217 218

{% highlight java %}
219 220 221 222 223 224 225 226
getExecutionEnvironment()

createLocalEnvironment()
createLocalEnvironment(int parallelism)
createLocalEnvironment(int parallelism, Configuration customConfiguration)

createRemoteEnvironment(String host, int port, String... jarFiles)
createRemoteEnvironment(String host, int port, int parallelism, String... jarFiles)
227 228
{% endhighlight %}

229 230 231 232 233 234 235 236
Typically, you only need to use `getExecutionEnvironment()`, since this
will do the right thing depending on the context: if you are executing
your program inside an IDE or as a regular Java program it will create
a local environment that will execute your program on your local machine. If
you created a JAR file from your program, and invoke it through the [command line](cli.html)
or the [web interface](web_client.html),
the Flink cluster manager will execute your main method and `getExecutionEnvironment()` will return
an execution environment for executing your program on a cluster.
237

238 239 240
For specifying data sources the execution environment has several methods
to read from files, sockets, and external systems using various methods. To just read
data from a socket (useful also for debugging), you can use:
241

242
{% highlight java %}
243 244 245
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> lines = env.socketTextStream("localhost", 9999)
246 247
{% endhighlight %}

248 249 250
This will give you a DataStream on which you can then apply transformations. For
more information on data sources and input formats, please refer to
[Data Sources](#data-sources).
251

252 253 254 255 256 257
Once you have a DataStream you can apply transformations to create a new
DataStream which you can then write to a socket, transform again,
combine with other DataStreams, or push to an external system (e.g., a message queue, or a file system).
You apply transformations by calling
methods on DataStream with your own custom transformation functions. For example,
a map transformation looks like this:
258

259
{% highlight java %}
260 261 262 263 264 265 266 267
DataStream<String> input = ...;

DataStream<Integer> intValues = input.map(new MapFunction<String, Integer>() {
    @Override
    public Integer map(String value) {
        return Integer.parseInt(value);
    }
});
268 269
{% endhighlight %}

270 271 272
This will create a new DataStream by converting every String in the original
stream to an Integer. For more information and a list of all the transformations,
please refer to [Transformations](#transformations).
273

274 275 276
Once you have a DataStream containing your final results, you can push the result
to an external system (HDFS, Kafka, Elasticsearch), write it to a socket, write to a file,
or print it.
277

278
{% highlight java %}
279 280 281 282 283
writeAsText(String path, ...)
writeAsCsv(String path, ...)
writeToSocket(String hostname, int port, ...)

print()
284

285 286
addSink(...)
{% endhighlight %}
287

288 289 290 291
Once you specified the complete program you need to **trigger the program execution** by 
calling `execute()` on `StreamExecutionEnvironment`. This will either execute on
the local machine or submit the program for execution on a cluster, depending on the chosen execution environment.
        
292
{% highlight java %}
293
env.execute();
294 295 296 297 298
{% endhighlight %}

</div>
<div data-lang="scala" markdown="1">

299 300 301 302
<br />

As presented in the [example](#example-program), Flink DataStream programs look like regular Scala
programs with a `main()` method. Each program consists of the same basic parts:
303

304
1. Obtaining a `StreamExecutionEnvironment`,
305 306 307 308 309
2. Connecting to data stream sources,
3. Specifying transformations on the data streams,
4. Specifying output for the processed data,
5. Executing the program.

310 311 312 313 314
We will now give an overview of each of those steps, please refer to the respective sections for
more details.

The `StreamExecutionEnvironment` is the basis for all Flink DataStream programs. You can
obtain one using these static methods on class `StreamExecutionEnvironment`:
315

316
{% highlight scala %}
317 318 319 320 321 322
def getExecutionEnvironment

def createLocalEnvironment(parallelism: Int =  Runtime.getRuntime.availableProcessors())

def createRemoteEnvironment(host: String, port: Int, jarFiles: String*)
def createRemoteEnvironment(host: String, port: Int, parallelism: Int, jarFiles: String*)
323
{% endhighlight %}
324

325 326 327 328
Typically, you only need to use `getExecutionEnvironment`, since this
will do the right thing depending on the context: if you are executing
your program inside an IDE or as a regular Java program it will create
a local environment that will execute your program on your local machine. If
329
you created a JAR file from your program, and invoke it through the [command line](cli.html)
330 331 332
or the [web interface](web_client.html),
the Flink cluster manager will execute your main method and `getExecutionEnvironment()` will return
an execution environment for executing your program on a cluster.
333

334 335
For specifying data sources the execution environment has several methods
to read from files, sockets, and external systems using various methods. To just read
336
data from a socket (useful also for debugging), you can use:
337

338
{% highlight scala %}
339 340 341
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment

DataStream<String> lines = env.socketTextStream("localhost", 9999)
342
{% endhighlight %}
343

344 345 346
This will give you a DataStream on which you can then apply transformations. For
more information on data sources and input formats, please refer to
[Data Sources](#data-sources).
347

348 349 350 351 352 353
Once you have a DataStream you can apply transformations to create a new
DataStream which you can then write to a file, transform again,
combine with other DataStreams, or push to an external system.
You apply transformations by calling
methods on DataStream with your own custom transformation function. For example,
a map transformation looks like this:
354

355
{% highlight scala %}
356 357 358
val input: DataStream[String] = ...

val mapped = input.map { x => x.toInt }
359
{% endhighlight %}
360

361 362 363
This will create a new DataStream by converting every String in the original
set to an Integer. For more information and a list of all the transformations,
please refer to [Transformations](#transformations).
364

365 366 367
Once you have a DataStream containing your final results, you can push the result
to an external system (HDFS, Kafka, Elasticsearch), write it to a socket, write to a file,
or print it.
368

369
{% highlight scala %}
370 371 372 373 374 375 376
writeAsText(path: String, ...)
writeAsCsv(path: String, ...)
writeToSocket(hostname: String, port: Int, ...)

print()

addSink(...)
377
{% endhighlight %}
378

379 380 381
Once you specified the complete program you need to **trigger the program execution** by
calling `execute` on `StreamExecutionEnvironment`. This will either execute on
the local machine or submit the program for execution on a cluster, depending on the chosen execution environment.
382

383
{% highlight scala %}
384
env.execute()
385 386 387 388
{% endhighlight %}

</div>
</div>
389 390 391

[Back to top](#top)

392 393
DataStream Abstraction
----------------------
394

395
A `DataStream` is a possibly unbounded immutable collection of data items of a the same type.
396

397 398 399
Transformations may return different subtypes of `DataStream` allowing specialized transformations.
For example the `keyBy(…)` method returns a `KeyedDataStream` which is a stream of data that
is logically partitioned by a certain key, and can be further windowed.
400

401
[Back to top](#top)
402

403 404
Lazy Evaluation
---------------
405

406 407 408 409 410
All Flink DataStream programs are executed lazily: When the program's main method is executed, the data loading
and transformations do not happen directly. Rather, each operation is created and added to the
program's plan. The operations are actually executed when the execution is explicitly triggered by 
an `execute()` call on the `StreamExecutionEnvironment` object. Whether the program is executed locally 
or on a cluster depends on the type of `StreamExecutionEnvironment`.
411

412 413
The lazy evaluation lets you construct sophisticated programs that Flink executes as one
holistically planned unit.
414

415
[Back to top](#top)
416 417


418 419
Transformations
---------------
420

421 422
Data transformations transform one or more DataStreams into a new DataStream. Programs can combine
multiple transformations into sophisticated topologies.
423

424
This section gives a description of all the available transformations.
425

426

427 428
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
429

430
<br />
431

432 433 434 435 436 437 438 439 440 441 442 443 444 445 446
<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
          <td><strong>Map</strong><br>DataStream &rarr; DataStream</td>
          <td>
            <p>Takes one element and produces one element. A map function that doubles the values of the input stream:</p>
    {% highlight java %}
DataStream<Integer> dataStream = //...
dataStream.map(new MapFunction<Integer, Integer>() {
447
    @Override
448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466
    public Integer map(Integer value) throws Exception {
        return 2 * value;
    }
});
    {% endhighlight %}
          </td>
        </tr>

        <tr>
          <td><strong>FlatMap</strong><br>DataStream &rarr; DataStream</td>
          <td>
            <p>Takes one element and produces zero, one, or more elements. A flatmap function that splits sentences to words:</p>
    {% highlight java %}
dataStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out)
        throws Exception {
        for(String word: value.split(" ")){
            out.collect(word);
467 468
        }
    }
469 470 471 472 473 474 475 476 477 478 479 480
});
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Filter</strong><br>DataStream &rarr; DataStream</td>
          <td>
            <p>Evaluates a boolean function for each element and retains those for which the function returns true.
            A filter that filters out zero values:
            </p>
    {% highlight java %}
dataStream.filter(new FilterFunction<Integer>() {
481
    @Override
482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521
    public boolean filter(Integer value) throws Exception {
        return value != 0;
    }
});
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>KeyBy</strong><br>DataStream &rarr; KeyedStream</td>
          <td>
            <p>Logically partitions a stream into disjoint partitions, each partition containing elements of the same key. 
            Internally, this is implemented with hash partitioning. See <a href="#specifying-keys">keys</a> on how to specify keys.
            This transformation returns a KeyedDataStream.</p>
    {% highlight java %}
dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Reduce</strong><br>KeyedStream &rarr; DataStream</td>
          <td>
            <p>A "rolling" reduce on a keyed data stream. Combines the current element with the last reduced value and
            emits the new value.
                    <br/>
            	<br/>
            A reduce function that creates a stream of partial sums:</p>
            {% highlight java %}
keyedStream.reduce(new ReduceFunction<Integer>() {
    @Override
    public Integer reduce(Integer value1, Integer value2)
    throws Exception {
        return value1 + value2;
    }
});
            {% endhighlight %}
            </p>
          </td>
        </tr>
        <tr>
522
          <td><strong>Fold</strong><br>KeyedStream &rarr; DataStream</td>
523 524 525 526 527 528
          <td>
          <p>A "rolling" fold on a keyed data stream with an initial value. 
          Combines the current element with the last folded value and
          emits the new value.
          <br/>
          <br/>
529 530
          <p>A fold function that, when applied on the sequence (1,2,3,4,5), 
          emits the sequence "start-1", "start-1-2", "start-1-2-3", ...</p>
531
          {% highlight java %}
532 533 534 535 536 537 538
DataStream<String> result = 
  keyedStream.fold("start", new FoldFunction<Integer, String>() {
    @Override
    public String fold(String current, Integer value) {
        return current + "-" + value;
    }
  });
539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575
          {% endhighlight %}
          </p>
          </td>
        </tr>
        <tr>
          <td><strong>Aggregations</strong><br>KeyedStream &rarr; DataStream</td>
          <td>
            <p>Rolling aggregations on a keyed data stream. The difference between min 
	    and minBy is that min returns the minimun value, whereas minBy returns
	    the element that has the minimum value in this field (same for max and maxBy).</p>
    {% highlight java %}
keyedStream.sum(0);
keyedStream.sum("key");
keyedStream.min(0);
keyedStream.min("key");
keyedStream.max(0);
keyedStream.max("key");
keyedStream.minBy(0);
keyedStream.minBy("key");
keyedStream.maxBy(0);
keyedStream.maxBy("key");
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Window</strong><br>KeyedStream &rarr; WindowedStream</td>
          <td>
            <p>Windows can be defined on already partitioned KeyedStreams. Windows group the data in each
            key according to some characteristic (e.g., the data that arrived within the last 5 seconds).
            See <a href="#windows">windows</a> for a complete description of windows.
    {% highlight java %}
dataStream.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS)); // Last 5 seconds of data
    {% endhighlight %}
        </p>
          </td>
        </tr>
        <tr>
576
          <td><strong>WindowAll</strong><br>DataStream &rarr; AllWindowedStream</td>
577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624
          <td>
              <p>Windows can be defined on regular DataStreams. Windows group all the stream events
              according to some characteristic (e.g., the data that arrived within the last 5 seconds).
              See <a href="#windows">windows</a> for a complete description of windows.</p>
              <p><strong>WARNING:</strong> This is in many cases a <strong>non-parallel</strong> transformation. All records will be
               gathered in one task for the windowAll operator.</p>
  {% highlight java %}
dataStream.windowAll(TumblingTimeWindows.of(Time.of(5, TimeUnit.SECONDS))); // Last 5 seconds of data
  {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Window Apply</strong><br>WindowedStream &rarr; DataStream<br>AllWindowedStream &rarr; DataStream</td>
          <td>
            <p>Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window.</p>
            <p><strong>Note:</strong> If you are using a windowAll transformation, you need to use an AllWindowFunction instead.</p>
    {% highlight java %}
windowedStream.apply (new WindowFunction<Tuple2<String,Integer>,Integer>, Tuple, Window>() {
    public void apply (Tuple tuple,
            Window window,
            Iterable<Tuple2<String, Integer>> values,
            Collector<Integer> out) throws Exception {
        int sum = 0;
        for (value t: values) {
            sum += t.f1;
        }
        out.collect (new Integer(sum));
    }
};
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Window Reduce</strong><br>WindowedStream &rarr; DataStream</td>
          <td>
            <p>Applies a functional reduce function to the window and returns the reduced value.</p>
    {% highlight java %}
windowedStream.reduce (new ReduceFunction<Tuple2<String,Integer>() {
    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
        return new Tuple2<String,Integer>(value1.f0, value1.f1 + value2.f1);
    }
};
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Window Fold</strong><br>WindowedStream &rarr; DataStream</td>
          <td>
625 626 627
            <p>Applies a functional fold function to the window and returns the folded value.
               The example function, when applied on the sequence (1,2,3,4,5),
               folds the sequence into the string "start-1-2-3-4-5":</p>
628
    {% highlight java %}
629 630 631
windowedStream.fold("start-", new FoldFunction<Integer, String>() {
    public String fold(String current, Integer value) {
        return current + "-" + value;
632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712
    }
};
    {% endhighlight %}
          </td>
        </tr>	
        <tr>
          <td><strong>Aggregations on windows</strong><br>WindowedStream &rarr; DataStream</td>
          <td>
            <p>Aggregates the contents of a window. The difference between min 
	    and minBy is that min returns the minimun value, whereas minBy returns
	    the element that has the minimum value in this field (same for max and maxBy).</p>
    {% highlight java %}
windowedStream.sum(0);
windowedStream.sum("key");
windowedStream.min(0);
windowedStream.min("key");
windowedStream.max(0);
windowedStream.max("key");
windowedStream.minBy(0);
windowedStream.minBy("key");
windowedStream.maxBy(0);
windowedStream.maxBy("key");
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Union</strong><br>DataStream* &rarr; DataStream</td>
          <td>
            <p>Union of two or more data streams creating a new stream containing all the elements from all the streams. Node: If you union a data stream
            with itself you will still only get each element once.</p>
    {% highlight java %}
dataStream.union(otherStream1, otherStream2, ...);
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Window Join</strong><br>DataStream,DataStream &rarr; DataStream</td>
          <td>
            <p>Join two data streams on a given key and a common window.</p>
    {% highlight java %}
dataStream.join(otherStream)
    .where(0).equalTo(1)
    .window(TumblingTimeWindows.of(Time.of(3, TimeUnit.SECONDS)))
    .apply (new JoinFunction () {...});
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Window CoGroup</strong><br>DataStream,DataStream &rarr; DataStream</td>
          <td>
            <p>Cogroups two data streams on a given key and a common window.</p>
    {% highlight java %}
dataStream.coGroup(otherStream)
    .where(0).equalTo(1)
    .window(TumblingTimeWindows.of(Time.of(3, TimeUnit.SECONDS)))
    .apply (new CoGroupFunction () {...});
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Connect</strong><br>DataStream,DataStream &rarr; ConnectedStreams</td>
          <td>
            <p>"Connects" two data streams retaining their types. Connect allowing for shared state between
            the two streams.</p>
    {% highlight java %}
DataStream<Integer> someStream = //...
DataStream<String> otherStream = //...

ConnectedStreams<Integer, String> connectedStreams = someStream.connect(otherStream);
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>CoMap, CoFlatMap</strong><br>ConnectedStreams &rarr; DataStream</td>
          <td>
            <p>Similar to map and flatMap on a connected data stream</p>
    {% highlight java %}
connectedStreams.map(new CoMapFunction<Integer, String, Boolean>() {
    @Override
    public Boolean map1(Integer value) {
        return true;
713 714
    }

715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817
    @Override
    public Boolean map2(String value) {
        return false;
    }
});
connectedStreams.flatMap(new CoFlatMapFunction<Integer, String, String>() {

   @Override
   public void flatMap1(Integer value, Collector<String> out) {
       out.collect(value.toString());
   }

   @Override
   public void flatMap2(String value, Collector<String> out) {
       for (String word: value.split(" ")) {
         out.collect(word);
       }
   }
});
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Split</strong><br>DataStream &rarr; SplitStream</td>
          <td>
            <p>
                Split the stream into two or more streams according to some criterion.
                {% highlight java %}
SplitStream<Integer> split = someDataStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        List<String> output = new ArrayList<String>();
        if (value % 2 == 0) {
            output.add("even");
        }
        else {
            output.add("odd");
        }
        return output;
    }
});
                {% endhighlight %}
            </p>
          </td>
        </tr>
        <tr>
          <td><strong>Select</strong><br>SplitStream &rarr; DataStream</td>
          <td>
            <p>
                Select one or more streams from a split stream.
                {% highlight java %}
SplitStream<Integer> split;
DataStream<Integer> even = split.select("even");
DataStream<Integer> odd = split.select("odd");
DataStream<Integer> all = split.select("even","odd");
                {% endhighlight %}
            </p>
          </td>
        </tr>
        <tr>
          <td><strong>Iterate</strong><br>DataStream &rarr; IterativeStream &rarr; DataStream</td>
          <td>
            <p>
                Creates a "feedback" loop in the flow, by redirecting the output of one operator
                to some previous operator. This is especially useful for defining algorithms that
                continuously update a model. The following code starts with a stream and applies
		the iteration body continuously. Elements that are greater than 0 are sent back
		to the feedback channel, and the rest of the elements are forwarded downstream.
		See <a href="#iterations">iterations</a> for a complete description.
                {% highlight java %}
IterativeStream<Long> iteration = initialStream.iterate();
DataStream<Long> iterationBody = iteration.map (/*do something*/);
DataStream<Long> feedback = iterationBody.filter(new FilterFunction<Long>(){
    @Override
    public boolean filter(Integer value) throws Exception {
        return value > 0;
    }
});
iteration.closeWith(feedback);
DataStream<Long> output = iterationBody.filter(new FilterFunction<Long>(){
    @Override
    public boolean filter(Integer value) throws Exception {
        return value <= 0;
    }
});
                {% endhighlight %}
            </p>
          </td>
        </tr>
        <tr>
          <td><strong>Extract Timestamps</strong><br>DataStream &rarr; DataStream</td>
          <td>
            <p>
                Extracts timestamps from records in order to work with windows
                that use event time semantics. See <a href="#working-with-time">working with time</a>.
                {% highlight java %}
stream.assignTimestamps (new TimeStampExtractor() {...});
                {% endhighlight %}
            </p>
          </td>
        </tr>
  </tbody>
</table>
818

819
</div>
820

821
<div data-lang="scala" markdown="1">
822

823
<br />
824

825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889
<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
          <td><strong>Map</strong><br>DataStream &rarr; DataStream</td>
          <td>
            <p>Takes one element and produces one element. A map function that doubles the values of the input stream:</p>
    {% highlight scala %}
dataStream.map { x => x * 2 }
    {% endhighlight %}
          </td>
        </tr>

        <tr>
          <td><strong>FlatMap</strong><br>DataStream &rarr; DataStream</td>
          <td>
            <p>Takes one element and produces zero, one, or more elements. A flatmap function that splits sentences to words:</p>
    {% highlight scala %}
dataStream.flatMap { str => str.split(" ") }
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Filter</strong><br>DataStream &rarr; DataStream</td>
          <td>
            <p>Evaluates a boolean function for each element and retains those for which the function returns true.
            A filter that filters out zero values:
            </p>
    {% highlight scala %}
dataStream.filter { _ != 0 }
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>KeyBy</strong><br>DataStream &rarr; KeyedStream</td>
          <td>
            <p>Logically partitions a stream into disjoint partitions, each partition containing elements of the same key.
            Internally, this is implemented with hash partitioning. See <a href="#specifying-keys">keys</a> on how to specify keys.
            This transformation returns a KeyedDataStream.</p>
    {% highlight scala %}
dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Reduce</strong><br>KeyedStream &rarr; DataStream</td>
          <td>
            <p>A "rolling" reduce on a keyed data stream. Combines the current element with the last reduced value and
            emits the new value.
                    <br/>
            	<br/>
            A reduce function that creates a stream of partial sums:</p>
            {% highlight scala %}
keyedStream.reduce { _ + _ }
            {% endhighlight %}
            </p>
          </td>
        </tr>
        <tr>
890
          <td><strong>Fold</strong><br>KeyedStream &rarr; DataStream</td>
891 892 893 894 895 896
          <td>
          <p>A "rolling" fold on a keyed data stream with an initial value.
          Combines the current element with the last folded value and
          emits the new value.
          <br/>
          <br/>
897 898
          <p>A fold function that, when applied on the sequence (1,2,3,4,5),
          emits the sequence "start-1", "start-1-2", "start-1-2-3", ...</p>
899
          {% highlight scala %}
900 901
val result: DataStream[String] = 
    keyedStream.fold("start", (str, i) => { str + "-" + i })
902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938
          {% endhighlight %}
          </p>
          </td>
        </tr>
        <tr>
          <td><strong>Aggregations</strong><br>KeyedStream &rarr; DataStream</td>
          <td>
            <p>Rolling aggregations on a keyed data stream. The difference between min 
	    and minBy is that min returns the minimun value, whereas minBy returns
	    the element that has the minimum value in this field (same for max and maxBy).</p>
    {% highlight scala %}
keyedStream.sum(0)
keyedStream.sum("key")
keyedStream.min(0)
keyedStream.min("key")
keyedStream.max(0)
keyedStream.max("key")
keyedStream.minBy(0)
keyedStream.minBy("key")
keyedStream.maxBy(0)
keyedStream.maxBy("key")
    {% endhighlight %}
          </td>
        </tr>	
        <tr>
          <td><strong>Window</strong><br>KeyedStream &rarr; WindowedStream</td>
          <td>
            <p>Windows can be defined on already partitioned KeyedStreams. Windows group the data in each
            key according to some characteristic (e.g., the data that arrived within the last 5 seconds).
            See <a href="#windows">windows</a> for a description of windows.
    {% highlight scala %}
dataStream.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS)) // Last 5 seconds of data // Last 5 seconds of data
    {% endhighlight %}
        </p>
          </td>
        </tr>
        <tr>
939
          <td><strong>WindowAll</strong><br>DataStream &rarr; AllWindowedStream</td>
940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972
          <td>
              <p>Windows can be defined on regular DataStreams. Windows group all the stream events
              according to some characteristic (e.g., the data that arrived within the last 5 seconds).
              See <a href="#windows">windows</a> for a complete description of windows.</p>
              <p><strong>WARNING:</strong> This is in many cases a <strong>non-parallel</strong> transformation. All records will be
               gathered in one task for the windowAll operator.</p>
  {% highlight scala %}
dataStream.windowAll(TumblingTimeWindows.of(Time.of(5, TimeUnit.SECONDS))) // Last 5 seconds of data
  {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Window Apply</strong><br>WindowedStream &rarr; DataStream<br>AllWindowedStream &rarr; DataStream</td>
          <td>
            <p>Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window.</p>
            <p><strong>Note:</strong> If you are using a windowAll transformation, you need to use an AllWindowFunction instead.</p>
    {% highlight scala %}
windowedStream.apply { applyFunction }
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Window Reduce</strong><br>WindowedStream &rarr; DataStream</td>
          <td>
            <p>Applies a functional reduce function to the window and returns the reduced value.</p>
    {% highlight scala %}
windowedStream.reduce { _ + _ }
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Window Fold</strong><br>WindowedStream &rarr; DataStream</td>
          <td>
973 974 975 976 977 978 979
            <p>Applies a functional fold function to the window and returns the folded value.
               The example function, when applied on the sequence (1,2,3,4,5),
               folds the sequence into the string "start-1-2-3-4-5":</p>
          {% highlight scala %}
val result: DataStream[String] = 
    windowedStream.fold("start", (str, i) => { str + "-" + i })
          {% endhighlight %}
980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018
          </td>
	</tr>
        <tr>
          <td><strong>Aggregations on windows</strong><br>WindowedStream &rarr; DataStream</td>
          <td>
            <p>Aggregates the contents of a window. The difference between min 
	    and minBy is that min returns the minimun value, whereas minBy returns
	    the element that has the minimum value in this field (same for max and maxBy).</p>
    {% highlight scala %}
windowedStream.sum(0)
windowedStream.sum("key")
windowedStream.min(0)
windowedStream.min("key")
windowedStream.max(0)
windowedStream.max("key")
windowedStream.minBy(0)
windowedStream.minBy("key")
windowedStream.maxBy(0)
windowedStream.maxBy("key")
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Union</strong><br>DataStream* &rarr; DataStream</td>
          <td>
            <p>Union of two or more data streams creating a new stream containing all the elements from all the streams. Node: If you union a data stream
            with itself you will still only get each element once.</p>
    {% highlight scala %}
dataStream.union(otherStream1, otherStream2, ...)
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Window Join</strong><br>DataStream,DataStream &rarr; DataStream</td>
          <td>
            <p>Join two data streams on a given key and a common window.</p>
    {% highlight scala %}
dataStream.join(otherStream)
    .where(0).equalTo(1)
1019
    .window(TumblingTimeWindows.of(Time.of(3, TimeUnit.SECONDS)))
1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135
    .apply { ... }
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Window CoGroup</strong><br>DataStream,DataStream &rarr; DataStream</td>
          <td>
            <p>Cogroups two data streams on a given key and a common window.</p>
    {% highlight scala %}
dataStream.coGroup(otherStream)
    .where(0).equalTo(1)
    .window(TumblingTimeWindows.of(Time.of(3, TimeUnit.SECONDS)))
    .apply {}
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Connect</strong><br>DataStream,DataStream &rarr; ConnectedStreams</td>
          <td>
            <p>"Connects" two data streams retaining their types, allowing for shared state between
            the two streams.</p>
    {% highlight scala %}
someStream : DataStream[Int] = ...
otherStream : DataStream[String] = ...

val connectedStreams = someStream.connect(otherStream)
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>CoMap, CoFlatMap</strong><br>ConnectedStreams &rarr; DataStream</td>
          <td>
            <p>Similar to map and flatMap on a connected data stream</p>
    {% highlight scala %}
connectedStreams.map(
    (_ : Int) => true,
    (_ : String) => false
)
connectedStreams.flatMap(
    (_ : Int) => true,
    (_ : String) => false
)
    {% endhighlight %}
          </td>
        </tr>
        <tr>
          <td><strong>Split</strong><br>DataStream &rarr; SplitStream</td>
          <td>
            <p>
                Split the stream into two or more streams according to some criterion.
                {% highlight scala %}
val split = someDataStream.split(
  (num: Int) =>
    (num % 2) match {
      case 0 => List("even")
      case 1 => List("odd")
    }
)
                {% endhighlight %}
            </p>
          </td>
        </tr>
        <tr>
          <td><strong>Select</strong><br>SplitStream &rarr; DataStream</td>
          <td>
            <p>
                Select one or more streams from a split stream.
                {% highlight scala %}

val even = split select "even"
val odd = split select "odd"
val all = split.select("even","odd")
                {% endhighlight %}
            </p>
          </td>
        </tr>
        <tr>
          <td><strong>Iterate</strong><br>DataStream &rarr; IterativeStream  &rarr; DataStream</td>
          <td>
            <p>
                Creates a "feedback" loop in the flow, by redirecting the output of one operator
                to some previous operator. This is especially useful for defining algorithms that
                continuously update a model. The following code starts with a stream and applies
		the iteration body continuously. Elements that are greater than 0 are sent back
		to the feedback channel, and the rest of the elements are forwarded downstream.
		See <a href="#iterations">iterations</a> for a complete description.
                {% highlight java %}
initialStream. iterate {
  iteration => {
    val iterationBody = iteration.map {/*do something*/}
    (iterationBody.filter(_ > 0), iterationBody.filter(_ <= 0))
  }
}
IterativeStream<Long> iteration = initialStream.iterate();
DataStream<Long> iterationBody = iteration.map (/*do something*/);
DataStream<Long> feedback = iterationBody.filter ( _ > 0);
iteration.closeWith(feedback);
                {% endhighlight %}
            </p>
          </td>
        </tr>
        <tr>
          <td><strong>Extract Timestamps</strong><br>DataStream &rarr; DataStream</td>
          <td>
            <p>
                Extracts timestamps from records in order to work with windows
                that use event time semantics.
                See <a href="#working-with-time">working with time</a>.
                {% highlight scala %}
stream.assignTimestamps { timestampExtractor }
                {% endhighlight %}
            </p>
          </td>
        </tr>
  </tbody>
</table>
1136

1137 1138
</div>
</div>
1139

1140
The following transformations are available on data streams of Tuples:
1141 1142


1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">

<br />

<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 20%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
1156 1157
   <tr>
      <td><strong>Project</strong><br>DataStream &rarr; DataStream</td>
1158
      <td>
1159
        <p>Selects a subset of fields from the tuples
1160
{% highlight java %}
1161 1162
DataStream<Tuple3<Integer, Double, String>> in = // [...]
DataStream<Tuple2<String, Integer>> out = in.project(2,0);
1163
{% endhighlight %}
1164
        </p>
1165 1166
      </td>
    </tr>
1167 1168 1169 1170 1171 1172 1173 1174
  </tbody>
</table>

</div>

<div data-lang="scala" markdown="1">

<br />
1175

1176 1177
<table class="table table-bordered">
  <thead>
1178
    <tr>
1179 1180 1181 1182 1183 1184 1185
      <th class="text-left" style="width: 20%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
   <tr>
      <td><strong>Project</strong><br>DataStream &rarr; DataStream</td>
1186
      <td>
1187 1188 1189 1190
        <p>Selects a subset of fields from the tuples
{% highlight scala %}
val in : DataStream[(Int,Double,String)] = // [...]
val out = in.project(2,0)
1191
{% endhighlight %}
1192
        </p>
1193 1194
      </td>
    </tr>
1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210
  </tbody>
</table>

</div>
</div>


### Physical partitioning

Flink also gives low-level control (if desired) on the exact stream partitioning after a transformation,
via the following functions.

<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">

<br />
1211

1212 1213
<table class="table table-bordered">
  <thead>
1214
    <tr>
1215 1216 1217 1218 1219 1220 1221
      <th class="text-left" style="width: 20%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
   <tr>
      <td><strong>Hash partitioning</strong><br>DataStream &rarr; DataStream</td>
1222
      <td>
1223 1224 1225 1226 1227 1228
        <p>
            Identical to keyBy but returns a DataStream instead of a KeyedStream.
            {% highlight java %}
dataStream.partitionByHash("someKey");
dataStream.partitionByHash(0);
            {% endhighlight %}
1229 1230 1231
        </p>
      </td>
    </tr>
1232 1233
   <tr>
      <td><strong>Custom partitioning</strong><br>DataStream &rarr; DataStream</td>
1234
      <td>
1235 1236 1237 1238 1239 1240 1241
        <p>
            Uses a user-defined Partitioner to select the target task for each element.
            {% highlight java %}
dataStream.partitionCustom(new Partitioner(){...}, "someKey");
dataStream.partitionCustom(new Partitioner(){...}, 0);
            {% endhighlight %}
        </p>
1242 1243
      </td>
    </tr>
1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256
   <tr>
     <td><strong>Random partitioning</strong><br>DataStream &rarr; DataStream</td>
     <td>
       <p>
            Partitions elements randomly according to a uniform distribution.
            {% highlight java %}
dataStream.partitionRandom();
            {% endhighlight %}
       </p>
     </td>
   </tr>
   <tr>
      <td><strong>Rebalancing (Round-robin partitioning)</strong><br>DataStream &rarr; DataStream</td>
1257
      <td>
1258 1259 1260 1261 1262 1263 1264
        <p>
            Partitions elements round-robin, creating equal load per partition. Useful for performance
            optimization in the presence of data skew.
            {% highlight java %}
dataStream.rebalance();
            {% endhighlight %}
        </p>
1265 1266
      </td>
    </tr>
1267 1268
   <tr>
      <td><strong>Broadcasting</strong><br>DataStream &rarr; DataStream</td>
1269
      <td>
1270 1271 1272 1273 1274 1275
        <p>
            Broadcasts elements to every partition.
            {% highlight java %}
dataStream.broadcast();
            {% endhighlight %}
        </p>
1276 1277 1278 1279 1280
      </td>
    </tr>
  </tbody>
</table>

1281
</div>
1282

1283 1284 1285
<div data-lang="scala" markdown="1">

<br />
1286 1287 1288 1289 1290 1291 1292 1293 1294 1295

<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 20%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
   <tr>
1296
      <td><strong>Hash partitioning</strong><br>DataStream &rarr; DataStream</td>
1297
      <td>
1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350
        <p>
            Identical to keyBy but returns a DataStream instead of a KeyedStream.
            {% highlight scala %}
dataStream.partitionByHash("someKey")
dataStream.partitionByHash(0)
            {% endhighlight %}
        </p>
      </td>
    </tr>
   <tr>
      <td><strong>Custom partitioning</strong><br>DataStream &rarr; DataStream</td>
      <td>
        <p>
            Uses a user-defined Partitioner to select the target task for each element.
            {% highlight scala %}
dataStream.partitionCustom(partitioner, "someKey")
dataStream.partitionCustom(partitioner, 0)
            {% endhighlight %}
        </p>
      </td>
    </tr>
   <tr>
     <td><strong>Random partitioning</strong><br>DataStream &rarr; DataStream</td>
     <td>
       <p>
            Partitions elements randomly according to a uniform distribution.
            {% highlight scala %}
dataStream.partitionRandom()
            {% endhighlight %}
       </p>
     </td>
   </tr>
   <tr>
      <td><strong>Rebalancing (Round-robin partitioning)</strong><br>DataStream &rarr; DataStream</td>
      <td>
        <p>
            Partitions elements round-robin, creating equal load per partition. Useful for performance
            optimization in the presence of data skew.
            {% highlight scala %}
dataStream.rebalance()
            {% endhighlight %}
        </p>
      </td>
    </tr>
   <tr>
      <td><strong>Broadcasting</strong><br>DataStream &rarr; DataStream</td>
      <td>
        <p>
            Broadcasts elements to every partition.
            {% highlight scala %}
dataStream.broadcast()
            {% endhighlight %}
        </p>
1351 1352 1353 1354
      </td>
    </tr>
  </tbody>
</table>
1355

1356
</div>
1357 1358
</div>

1359
### Task chaining and resource groups
1360

1361
Chaining two subsequent transformations means co-locating them within the same thread for better
1362 1363 1364 1365 1366 1367 1368 1369 1370 1371
performance. Flink by default chains operators if this is possible (e.g., two subsequent map
transformations). The API gives fine-grained control over chaining if desired:

Use `StreamExecutionEnvironment.disableOperatorChaining()` if you want to disable chaining in
the whole job. For more fine grained control, the following functions are available. Note that
these functions can only be used right after a DataStream transformation as they refer to the
previous transformation. For example, you can use `someStream.map(...).startNewChain()`, but
you cannot use `someStream.startNewChain()`.

A resource group is a slot in Flink, see
1372
[slots]({{site.baseurl}}/setup/config.html#configuring-taskmanager-processing-slots). You can
1373
manually isolate operators in separate slots if desired.
1374

1375 1376
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
1377

1378
<br />
1379 1380 1381 1382 1383 1384 1385 1386 1387

<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 20%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
1388 1389
   <tr>
      <td>Start new chain</td>
1390
      <td>
1391 1392 1393 1394 1395
        <p>Begin a new chain, starting with this operator. The two
	mappers will be chained, and filter will not be chained to
	the first mapper.
{% highlight java %}
someStream.filter(...).map(...).startNewChain().map(...);
1396
{% endhighlight %}
1397
        </p>
1398 1399
      </td>
    </tr>
1400 1401
   <tr>
      <td>Disable chaining</td>
1402
      <td>
1403 1404 1405
        <p>Do not chain the map operator
{% highlight java %}
someStream.map(...).disableChaining();
1406 1407 1408
{% endhighlight %}
        </p>
      </td>
1409 1410 1411
    </tr>    
   <tr>
      <td>Start a new resource group</td>
1412
      <td>
1413 1414 1415
        <p>Start a new resource group containing the map and the subsequent operators.
{% highlight java %}
someStream.filter(...).startNewResourceGroup();
1416
{% endhighlight %}
1417
        </p>
1418 1419
      </td>
    </tr>
1420 1421
   <tr>
      <td>Isolate resources</td>
1422
      <td>
1423 1424 1425
        <p>Isolate the operator in its own slot.
{% highlight java %}
someStream.map(...).isolateResources();
1426
{% endhighlight %}
1427
        </p>
1428
      </td>
1429 1430 1431 1432 1433 1434 1435 1436 1437
    </tr>        
  </tbody>
</table>

</div>

<div data-lang="scala" markdown="1">

<br />
1438

1439 1440
<table class="table table-bordered">
  <thead>
1441
    <tr>
1442 1443 1444 1445 1446 1447 1448
      <th class="text-left" style="width: 20%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
   <tr>
      <td>Start new chain</td>
1449
      <td>
1450 1451 1452
        <p>Begin a new chain, starting with this operator. The two
	mappers will be chained, and filter will not be chained to
	the first mapper.
1453
{% highlight scala %}
1454
someStream.filter(...).map(...).startNewChain().map(...)
1455
{% endhighlight %}
1456
        </p>
1457 1458
      </td>
    </tr>
1459 1460
   <tr>
      <td>Disable chaining</td>
1461
      <td>
1462
        <p>Do not chain the map operator
1463
{% highlight scala %}
1464
someStream.map(...).disableChaining()
1465
{% endhighlight %}
1466
        </p>
1467
      </td>
1468 1469 1470 1471 1472
    </tr>    
   <tr>
      <td>Start a new resource group</td>
      <td>
        <p>Start a new resource group containing the map and the subsequent operators.
1473
{% highlight scala %}
1474
someStream.filter(...).startNewResourceGroup()
1475
{% endhighlight %}
1476
        </p>
1477 1478
      </td>
    </tr>
1479 1480
   <tr>
      <td>Isolate resources</td>
1481
      <td>
1482
        <p>Isolate the operator in its own slot.
1483
{% highlight scala %}
1484
someStream.map(...).isolateResources()
1485
{% endhighlight %}
1486
        </p>
1487
      </td>
1488
    </tr>        
1489 1490 1491 1492 1493
  </tbody>
</table>

</div>
</div>
1494

1495

1496
[Back to top](#top)
1497

1498 1499
Specifying Keys
----------------
1500

1501 1502
The `keyBy` transformation requires that a key is defined on
its argument DataStream.
1503

1504 1505 1506 1507 1508 1509 1510
A DataStream is keyed as
{% highlight java %}
DataStream<...> input = // [...]
DataStream<...> windowed = input
	.keyBy(/*define key here*/)
	.window(/*define window here*/);
{% endhighlight %}
1511

1512 1513 1514 1515
The data model of Flink is not based on key-value pairs. Therefore,
you do not need to physically pack the data stream types into keys and
values. Keys are "virtual": they are defined as functions over the
actual data to guide the grouping operator.
1516

1517 1518
See [the relevant section of the DataSet API documentation](programming_guide.html#specifying-keys) on how to specify keys.
Just replace `DataSet` with `DataStream`, and `groupBy` with `keyBy`.
1519 1520 1521



1522 1523
Passing Functions to Flink
--------------------------
1524

1525
Some transformations take user-defined functions as arguments. 
1526

1527
See [the relevant section of the DataSet API documentation](programming_guide.html#passing-functions-to-flink).
1528

1529

1530
[Back to top](#top)
1531

1532

1533 1534
Data Types
----------
1535

1536 1537 1538
Flink places some restrictions on the type of elements that are used in DataStreams and in results
of transformations. The reason for this is that the system analyzes the types to determine
efficient execution strategies.
1539

1540
See [the relevant section of the DataSet API documentation](programming_guide.html#data-types).
1541

1542
[Back to top](#top)
1543

1544

1545 1546
Data Sources
------------
1547

1548 1549
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
1550

1551
<br />
1552

1553 1554 1555 1556
Sources can by created by using `StreamExecutionEnvironment.addSource(sourceFunction)`.
You can either use one of the source functions that come with Flink or write a custom source
by implementing the `SourceFunction` for non-parallel sources, or by implementing the
`ParallelSourceFunction` interface or extending `RichParallelSourceFunction` for parallel sources.
1557

1558
There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
1559

1560
File-based:
1561

1562
- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
1563

1564 1565
- `readTextFileWithValue(path)` / `TextValueInputFormat` - Reads files line wise and returns them as
  StringValues. StringValues are mutable strings.
1566

1567
- `readFile(path)` / Any input format - Reads files as dictated by the input format.
1568

1569
- `readFileOfPrimitives(path, Class)` / `PrimitiveInputFormat` - Parses files of new-line (or another char sequence) delimited primitive data types such as `String` or `Integer`.
1570

1571
- `readFileStream` - create a stream by appending elements when there are changes to a file
1572

1573
Socket-based:
1574

1575
- `socketTextStream` - Reads from a socket. Elements can be separated by a delimiter.
1576

1577
Collection-based:
1578

1579 1580
- `fromCollection(Collection)` - Creates a data stream from the Java Java.util.Collection. All elements
  in the collection must be of the same type.
1581

1582 1583
- `fromCollection(Iterator, Class)` - Creates a data stream from an iterator. The class specifies the
  data type of the elements returned by the iterator.
1584

1585 1586
- `fromElements(T ...)` - Creates a data stream from the given sequence of objects. All objects must be
  of the same type.
1587

1588 1589 1590 1591 1592 1593 1594 1595 1596 1597
- `fromParallelCollection(SplittableIterator, Class)` - Creates a data stream from an iterator, in
  parallel. The class specifies the data type of the elements returned by the iterator.

- `generateSequence(from, to)` - Generates the sequence of numbers in the given interval, in
  parallel.

Custom:

- `addSource` - Attache a new source function. For example, to read from Apache Kafka you can use
    `addSource(new FlinkKafkaConsumer082<>(...))`. See [connectors](#connectors) for more details.
1598

1599 1600 1601 1602
</div>

<div data-lang="scala" markdown="1">

1603
<br />
1604

1605 1606 1607 1608
Sources can by created by using `StreamExecutionEnvironment.addSource(sourceFunction)`.
You can either use one of the source functions that come with Flink or write a custom source
by implementing the `SourceFunction` for non-parallel sources, or by implementing the
`ParallelSourceFunction` interface or extending `RichParallelSourceFunction` for parallel sources.
1609

1610
There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
1611

1612
File-based:
1613

1614
- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
1615

1616 1617
- `readTextFileWithValue(path)` / `TextValueInputFormat` - Reads files line wise and returns them as
  StringValues. StringValues are mutable strings.
1618

1619
- `readFile(path)` / Any input format - Reads files as dictated by the input format.
1620

1621
- `readFileOfPrimitives(path, Class)` / `PrimitiveInputFormat` - Parses files of new-line (or another char sequence) delimited primitive data types such as `String` or `Integer`.
1622

1623
- `readFileStream` - create a stream by appending elements when there are changes to a file
1624

1625
Socket-based:
1626

1627
- `socketTextStream` - Reads from a socket. Elements can be separated by a delimiter.
1628

1629
Collection-based:
1630

1631 1632
- `fromCollection(Seq)` - Creates a data stream from the Java Java.util.Collection. All elements
  in the collection must be of the same type.
1633

1634 1635
- `fromCollection(Iterator)` - Creates a data stream from an iterator. The class specifies the
  data type of the elements returned by the iterator.
1636

1637 1638
- `fromElements(elements: _*)` - Creates a data stream from the given sequence of objects. All objects must be
  of the same type.
1639

1640 1641 1642 1643 1644 1645 1646 1647 1648 1649
- `fromParallelCollection(SplittableIterator)` - Creates a data stream from an iterator, in
  parallel. The class specifies the data type of the elements returned by the iterator.

- `generateSequence(from, to)` - Generates the sequence of numbers in the given interval, in
  parallel.

Custom:

- `addSource` - Attache a new source function. For example, to read from Apache Kafka you can use
    `addSource(new FlinkKafkaConsumer082<>(...))`. See [connectors](#connectors) for more details.
1650 1651

</div>
1652 1653 1654 1655 1656 1657 1658
</div>

[Back to top](#top)


Execution Configuration
----------
1659

1660
The `StreamExecutionEnvironment` also contains the `ExecutionConfig` which allows to set job specific configuration values for the runtime.
1661

1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675
See [the relevant section of the DataSet API documentation](programming_guide.html#execution-configuration).

Parameters in the `ExecutionConfig` that pertain specifically to the DataStream API are:

- `enableTimestamps()` / **`disableTimestamps()`**: Attach a timestamp to each event emitted from a source.
    `areTimestampsEnabled()` returns the current value.

- `setAutoWatermarkInterval(long milliseconds)`: Set the interval for automatic watermark emission. You can 
    get the current value with `long getAutoWatermarkInterval()`

[Back to top](#top)

Data Sinks
----------
1676

1677 1678 1679
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">

1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704
<br />

Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them.
Flink comes with a variety of built-in output formats that are encapsulated behind operations on the
DataStreams:

- `writeAsText()` / `TextOuputFormat` - Writes elements line-wise as Strings. The Strings are
  obtained by calling the *toString()* method of each element.

- `writeAsCsv(...)` / `CsvOutputFormat` - Writes tuples as comma-separated value files. Row and field
  delimiters are configurable. The value for each field comes from the *toString()* method of the objects.

- `print()` / `printToErr()`  - Prints the *toString()* value
of each element on the standard out / strandard error stream. Optionally, a prefix (msg) can be provided which is
prepended to the output. This can help to distinguish between different calls to *print*. If the parallelism is
greater than 1, the output will also be prepended with the identifier of the task which produced the output.

- `write()` / `FileOutputFormat` - Method and base class for custom file outputs. Supports
  custom object-to-bytes conversion.

- `writeToSocket` - Writes elements to a socket according to a `SerializationSchema`

- `addSink` - Invokes a custom sink function. Flink comes bundled with connectors to other systems (such as 
    Apache Kafka) that are implemented as sink functions.

1705
</div>
1706
<div data-lang="scala" markdown="1">
1707

1708
<br />
1709

1710 1711 1712
Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them.
Flink comes with a variety of built-in output formats that are encapsulated behind operations on the
DataStreams:
1713

1714 1715
- `writeAsText()` / `TextOuputFormat` - Writes elements line-wise as Strings. The Strings are
  obtained by calling the *toString()* method of each element.
1716

1717 1718
- `writeAsCsv(...)` / `CsvOutputFormat` - Writes tuples as comma-separated value files. Row and field
  delimiters are configurable. The value for each field comes from the *toString()* method of the objects.
1719

1720 1721 1722 1723
- `print()` / `printToErr()`  - Prints the *toString()* value
of each element on the standard out / strandard error stream. Optionally, a prefix (msg) can be provided which is
prepended to the output. This can help to distinguish between different calls to *print*. If the parallelism is
greater than 1, the output will also be prepended with the identifier of the task which produced the output.
1724

1725 1726
- `write()` / `FileOutputFormat` - Method and base class for custom file outputs. Supports
  custom object-to-bytes conversion.
1727

1728
- `writeToSocket` - Writes elements to a socket according to a `SerializationSchema`
1729

1730 1731
- `addSink` - Invokes a custom sink function. Flink comes bundled with connectors to other systems (such as
    Apache Kafka) that are implemented as sink functions.
1732

1733 1734
</div>
</div>
1735

1736

1737
[Back to top](#top)
1738

1739 1740
Debugging
---------
1741

1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756
Before running a streaming program in a distributed cluster, it is a good
idea to make sure that the implemented algorithm works as desired. Hence, implementing data analysis
programs is usually an incremental process of checking results, debugging, and improving.

Flink provides features to significantly ease the development process of data analysis
programs by supporting local debugging from within an IDE, injection of test data, and collection of
result data. This section give some hints how to ease the development of Flink programs.

### Local Execution Environment

A `LocalStreamEnvironment` starts a Flink system within the same JVM process it was created in. If you
start the LocalEnvironement from an IDE, you can set breakpoints in your code and easily debug your
program.

A LocalEnvironment is created and used as follows:
1757

1758 1759 1760
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
1761 1762 1763 1764 1765 1766
final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

DataStream<String> lines = env.addSource(/* some source */);
// build your program

env.execute();
1767 1768 1769
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
1770

1771
{% highlight scala %}
1772 1773 1774 1775 1776 1777
val env = StreamExecutionEnvironment.createLocalEnvironment()

val lines = env.addSource(/* some source */)
// build your program

env.execute()
1778 1779 1780
{% endhighlight %}
</div>
</div>
1781

1782 1783 1784 1785 1786 1787 1788
### Collection Data Sources

Flink provides special data sources which are backed
by Java collections to ease testing. Once a program has been tested, the sources and sinks can be
easily replaced by sources and sinks that read from / write to external systems.

Collection data sources can be used as follows:
1789

1790 1791 1792
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804
final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

// Create a DataStream from a list of elements
DataStream<Integer> myInts = env.fromElements(1, 2, 3, 4, 5);

// Create a DataStream from any Java collection
List<Tuple2<String, Integer>> data = ...
DataStream<Tuple2<String, Integer>> myTuples = env.fromCollection(data);

// Create a DataStream from an Iterator
Iterator<Long> longIt = ...
DataStream<Long> myLongs = env.fromCollection(longIt, Long.class);
1805 1806 1807 1808
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820
val env = StreamExecutionEnvironment.createLocalEnvironment()

// Create a DataStream from a list of elements
val myInts = env.fromElements(1, 2, 3, 4, 5)

// Create a DataStream from any Collection
val data: Seq[(String, Int)] = ...
val myTuples = env.fromCollection(data)

// Create a DataStream from an Iterator
val longIt: Iterator[Long] = ...
val myLongs = env.fromCollection(longIt)
1821 1822 1823
{% endhighlight %}
</div>
</div>
1824

1825 1826 1827
**Note:** Currently, the collection data source requires that data types and iterators implement
`Serializable`. Furthermore, collection data sources can not be executed in parallel (
parallelism = 1).
1828

1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843
[Back to top](#top)


Windows
-------

### Working with Time

Windows are typically groups of events within a certain time period. Reasoning about time and windows assumes
a definition of time. Flink has support for three kinds of time:

- *Processing time:* Processing time is simply the wall clock time of the machine that happens to be
    executing the transformation. Processing time is the simplest notion of time and provides the best
    performance. However, in distributed and asynchronous environments processing time does not provide
    determinism.
1844

1845 1846 1847 1848 1849 1850
- *Event time:* Event time is the time that each individual event occurred. This time is 
    typically embedded within the records before they enter Flink or can be extracted from their contents.
    When using event time, out-of-order events can be properly handled. For example, an event with a lower
    timestamp may arrive after an event with a higher timestamp, but transformations will handle these events
    correctly. Event time processing provides predictable results, but incurs more latency, as out-of-order
    events need to be buffered
1851

1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865
- *Ingestion time:* Ingestion time is the time that events enter Flink. In particular, the timestamp of
    an event is assigned by the source operator as the current wall clock time of the machine that executes
    the source task at the time the records enter the Flink source. Ingestion time is more predictable
    than processing time, and gives lower latencies than event time as the latency does not depend on 
    external systems. Ingestion time provides thus a middle ground between processing time and event time.
    Ingestion time is a special case of event time (and indeed, it is treated by Flink identically to
    event time).

When dealing with event time, transformations need to avoid indefinite
wait times for events to arrive. *Watermarks* provide the mechanism to control the event time-processing time skew. Watermarks
are emitted by the sources. A watermark with a certain timestamp denotes the knowledge that no event
with timestamp lower than the timestamp of the watermark will ever arrive.
 
You can specify the semantics of time in a Flink DataStream program using `StreamExecutionEnviroment`, as
1866

1867 1868 1869
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
1870 1871 1872
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
1873 1874 1875
{% endhighlight %}
</div>

1876 1877 1878 1879 1880
<div data-lang="scala" markdown="1">
{% highlight java %}
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
1881 1882 1883
{% endhighlight %}
</div>
</div>
1884

1885 1886 1887
The default value is `TimeCharacteristic.ProcessingTime`, so in order to write a program with processing
time semantics nothing needs to be specified (e.g., the first [example](#example-program) in this guide follows processing
time semantics).
1888

1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902
In order to work with event time semantics, you need to follow four steps:

- Set `env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)`

- Use `DataStream.assignTimestamps(...)` in order to tell Flink how timestamps relate to events (e.g., which
    record field is the timestamp)

- Set `enableTimestamps()`, as well the interval for watermark emission (`setAutoWatermarkInterval(long milliseconds)`)
    in `ExecutionConfig`.

For example, assume that we have a data stream of tuples, in which the first field is the timestamp (assigned
by the system that generates these data streams), and we know that the lag between the current processing
time and the timestamp of an event is never more than 1 second:
    
1903 1904 1905
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
1906 1907 1908 1909 1910 1911
DataStream<Tuple4<Long,Integer,Double,String>> stream = //...
stream.assignTimestamps(new TimestampExtractor<Tuple4<Long,Integer,Double,String>>{
    @Override
    public long extractTimestamp(Tuple4<Long,Integer,Double,String> element, long currentTimestamp) {
        return element.f0;
    }
1912

1913 1914 1915 1916
    @Override
    public long extractWatermark(Tuple4<Long,Integer,Double,String> element, long currentTimestamp) {
        return element.f0 - 1000;
    }
1917

1918 1919 1920 1921 1922
    @Override
    public long getCurrentWatermark() {
        return Long.MIN_VALUE;
    }
});
1923 1924
{% endhighlight %}
</div>
1925

1926 1927
<div data-lang="scala" markdown="1">
{% highlight scala %}
1928 1929 1930
val stream: DataStream[(Long,Int,Double,String)] = null;
stream.assignTimestampts(new TimestampExtractor[(Long, Int, Double, String)] {
  override def extractTimestamp(element: (Long, Int, Double, String), currentTimestamp: Long): Long = element._1
1931

1932 1933 1934 1935
  override def extractWatermark(element: (Long, Int, Double, String), currentTimestamp: Long): Long = element._1 - 1000

  override def getCurrentWatermark: Long = Long.MinValue
})
1936 1937 1938
{% endhighlight %}
</div>
</div>
1939

1940 1941
If you know that timestamps of events are always ascending, i.e., elements arrive in order, you can use
the `AscendingTimestampExtractor`, and the system generates watermarks automatically:
1942

1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
DataStream<Tuple4<Long,Integer,Double,String>> stream = //...
stream.assignTimestamps(new AscendingTimestampExtractor<Tuple4<Long,Integer,Double,String>>{
    @Override
    public long extractAscendingTimestamp(Tuple4<Long,Integer,Double,String> element, long currentTimestamp) {
        return element.f0;
    }
});
{% endhighlight %}
</div>

<div data-lang="scala" markdown="1">
{% highlight scala %}
stream.extractAscendingTimestamp(record => record._1)
{% endhighlight %}
</div>
</div>

In order to write a program with ingestion time semantics, you need to
set `env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)`. You can think of this setting as a
shortcut for writing a `TimestampExtractor` which assignes timestamps to events at the sources
based on the current source wall-clock time. Flink injects this timestamp extractor automatically.


### Windows on Keyed Data Streams

Flink offers a variety of methods for defining windows on a `KeyedStream`. All of these group elements *per key*,
i.e., each window will contain elements with the same key value.

#### Basic Window Constructs

Flink offers a general window mechanism that provides flexibility, as well as a number of pre-defined windows
for common use cases. See first if your use case can be served by the pre-defined windows below before moving
to defining your own windows.
1979

1980
<div class="codetabs" markdown="1">
1981
<div data-lang="java" markdown="1">
1982

1983
<br />
1984

1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047
<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
      <tr>
        <td><strong>Tumbling time window</strong><br>KeyedStream &rarr; WindowedStream</td>
        <td>
          <p>
          Defines a window of 5 seconds, that "tumbles". This means that elements are
          grouped according to their timestamp in groups of 5 second duration, and every element belongs to exactly one window.
	  The notion of time is specified by the selected TimeCharacteristic (see <a href="#working-with-time">time</a>).
    {% highlight java %}
keyedStream.timeWindow(Time.of(5, TimeUnit.SECONDS));
    {% endhighlight %}
          </p>
        </td>
      </tr>
      <tr>
          <td><strong>Sliding time window</strong><br>KeyedStream &rarr; WindowedStream</td>
          <td>
            <p>
             Defines a window of 5 seconds, that "slides" by 1 seconds. This means that elements are
             grouped according to their timestamp in groups of 5 second duration, and elements can belong to more than
             one window (since windows overlap by at most 4 seconds)
             The notion of time is specified by the selected TimeCharacteristic (see <a href="#working-with-time">time</a>).
      {% highlight java %}
keyedStream.timeWindow(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS));
      {% endhighlight %}
            </p>
          </td>
        </tr>
      <tr>
        <td><strong>Tumbling count window</strong><br>KeyedStream &rarr; WindowedStream</td>
        <td>
          <p>
          Defines a window of 1000 elements, that "tumbles". This means that elements are
          grouped according to their arrival time (equivalent to processing time) in groups of 1000 elements, 
          and every element belongs to exactly one window.
    {% highlight java %}
keyedStream.countWindow(1000);
    {% endhighlight %}
        </p>
        </td>
      </tr>
      <tr>
      <td><strong>Sliding count window</strong><br>KeyedStream &rarr; WindowedStream</td>
      <td>
        <p>
          Defines a window of 1000 elements, that "slides" every 100 elements. This means that elements are
          grouped according to their arrival time (equivalent to processing time) in groups of 1000 elements, 
          and every element can belong to more than one window (as windows overlap by at most 900 elements).
  {% highlight java %}
keyedStream.countWindow(1000, 100)
  {% endhighlight %}
        </p>
      </td>
    </tr>
  </tbody>
</table>
2048

2049
</div>
2050

2051
<div data-lang="scala" markdown="1">
2052

2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130
<br />

<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
      <tr>
        <td><strong>Tumbling time window</strong><br>KeyedStream &rarr; WindowedStream</td>
        <td>
          <p>
          Defines a window of 5 seconds, that "tumbles". This means that elements are
          grouped according to their timestamp in groups of 5 second duration, and every element belongs to exactly one window.
          The notion of time is specified by the selected TimeCharacteristic (see <a href="#working-with-time">time</a>).
    {% highlight scala %}
keyedStream.timeWindow(Time.of(5, TimeUnit.SECONDS))
    {% endhighlight %}
          </p>
        </td>
      </tr>
      <tr>
          <td><strong>Sliding time window</strong><br>KeyedStream &rarr; WindowedStream</td>
          <td>
            <p>
             Defines a window of 5 seconds, that "slides" by 1 seconds. This means that elements are
             grouped according to their timestamp in groups of 5 second duration, and elements can belong to more than
             one window (since windows overlap by at most 4 seconds)
             The notion of time is specified by the selected TimeCharacteristic (see <a href="#working-with-time">time</a>).
      {% highlight scala %}
keyedStream.timeWindow(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS))
      {% endhighlight %}
            </p>
          </td>
        </tr>
      <tr>
        <td><strong>Tumbling count window</strong><br>KeyedStream &rarr; WindowedStream</td>
        <td>
          <p>
          Defines a window of 1000 elements, that "tumbles". This means that elements are
          grouped according to their arrival time (equivalent to processing time) in groups of 1000 elements,
          and every element belongs to exactly one window.
    {% highlight scala %}
keyedStream.countWindow(1000)
    {% endhighlight %}
        </p>
        </td>
      </tr>
      <tr>
      <td><strong>Sliding count window</strong><br>KeyedStream &rarr; WindowedStream</td>
      <td>
        <p>
          Defines a window of 1000 elements, that "slides" every 100 elements. This means that elements are
          grouped according to their arrival time (equivalent to processing time) in groups of 1000 elements,
          and every element can belong to more than one window (as windows overlap by at most 900 elements).
  {% highlight scala %}
keyedStream.countWindow(1000, 100)
  {% endhighlight %}
        </p>
      </td>
    </tr>
  </tbody>
</table>

</div>
</div>

#### Advanced Window Constructs
  
The general mechanism can define more powerful windows at the cost of more verbose syntax. For example,
below is a window definition where windows hold elements of the last 5 seconds and slides every 1 second,
but the execution of the window function is triggered when 100 elements have been added to the
window, and every time execution is triggered, 10 elements are retained in the window:

<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
2131
{% highlight java %}
2132 2133
keyedStream
    .window(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS))
2134 2135
    .trigger(CountTrigger.of(100))
    .evictor(CountEvictor.of(10));
2136
{% endhighlight %}
2137
</div>
2138

2139 2140 2141 2142
<div data-lang="scala" markdown="1">
{% highlight scala %}
keyedStream
    .window(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS))
2143 2144
    .trigger(CountTrigger.of(100))
    .evictor(CountEvictor.of(10))
2145 2146 2147
{% endhighlight %}
</div>
</div>
2148

2149 2150
The general recipe for building a custom window is to specify (1) a `WindowAssigner`, (2) a `Trigger` (optionally),
and (3) an `Evictor` (optionally).
2151

2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162
The `WindowAssigner` defines how incoming elements are assigned to windows. A window is a logical group of elements
that has a begin-value, and an end-value corresponding to a begin-time and end-time. Elements with timestamp (according
to some notion of time described above within these values are part of the window).

For example, the `SlidingTimeWindows`
assigner in the code above defines a window of size 5 seconds, and a slide of 1 second. Assume that
time starts from 0 and is measured in milliseconds. Then, we have 6 windows
that overlap: [0,5000], [1000,6000], [2000,7000], [3000, 8000], [4000, 9000], and [5000, 10000]. Each incoming
element is assigned to the windows according to its timestamp. For example, an element with timestamp 2000 will be 
assigned to the first three windows. Flink comes bundled with window assigners that cover the most common use cases. You can write your
own window types by extending the `WindowAssigner` class.
2163

2164
<div class="codetabs" markdown="1">
2165

2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318
<div data-lang="java" markdown="1">
<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
      <tr>
        <td><strong>Global window</strong><br>KeyedStream &rarr; WindowedStream</td>
        <td>
          <p>
	    All incoming elements of a given key are assigned to the same window.
	    The window does not contain a default trigger, hence it will never be triggered
	    if a trigger is not explicitly specified.
          </p>
    {% highlight java %}
stream.window(GlobalWindows.create());
    {% endhighlight %}
        </td>
      </tr>
      <tr>
          <td><strong>Tumbling time windows</strong><br>KeyedStream &rarr; WindowedStream</td>
          <td>
            <p>
              Incoming elements are assigned to a window of a certain size (1 second below) based on
              their timestamp. Windows do not overlap, i.e., each element is assigned to exactly one window.
	      The notion of time is picked from the specified TimeCharacteristic (see <a href="#working-with-time">time</a>).
	      The window comes with a default trigger. For event/ingestion time, a window is triggered when a
	      watermark with value higher than its end-value is received, whereas for processing time
	      when the current processing time exceeds its current end value.
            </p>
      {% highlight java %}
stream.window(TumblingTimeWindows.of(Time.of(1, TimeUnit.SECONDS)));
      {% endhighlight %}
          </td>
        </tr>
      <tr>
        <td><strong>Sliding time windows</strong><br>KeyedStream &rarr; WindowedStream</td>
        <td>
          <p>
            Incoming elements are assigned to a window of a certain size (5 seconds below) based on
            their timestamp. Windows "slide" by the provided value (1 second in the example), and hence
            overlap. The window comes with a default trigger. For event/ingestion time, a window is triggered when a
	    watermark with value higher than its end-value is received, whereas for processing time
	    when the current processing time exceeds its current end value.
          </p>
    {% highlight java %}
stream.window(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)));
    {% endhighlight %}
        </td>
      </tr>
  </tbody>
</table>
</div>

<div data-lang="scala" markdown="1">
<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
      <tr>
        <td><strong>Global window</strong><br>KeyedStream &rarr; WindowedStream</td>
        <td>
          <p>
            All incoming elements of a given key are assigned to the same window.
	    The window does not contain a default trigger, hence it will never be triggered
	    if a trigger is not explicitly specified.
          </p>
    {% highlight scala %}
stream.window(GlobalWindows.create)
    {% endhighlight %}
        </td>
      </tr>
      <tr>
          <td><strong>Tumbling time windows</strong><br>KeyedStream &rarr; WindowedStream</td>
          <td>
            <p>
              Incoming elements are assigned to a window of a certain size (1 second below) based on
              their timestamp. Windows do not overlap, i.e., each element is assigned to exactly one window.
	      The notion of time is specified by the selected TimeCharacteristic (see <a href="#working-with-time">time</a>).
	      The window comes with a default trigger. For event/ingestion time, a window is triggered when a
	      watermark with value higher than its end-value is received, whereas for processing time
	      when the current processing time exceeds its current end value.
            </p>
      {% highlight scala %}
stream.window(TumblingTimeWindows.of(Time.of(1, TimeUnit.SECONDS)))
      {% endhighlight %}
          </td>
        </tr>
      <tr>
        <td><strong>Sliding time windows</strong><br>KeyedStream &rarr; WindowedStream</td>
        <td>
          <p>
            Incoming elements are assigned to a window of a certain size (5 seconds below) based on
            their timestamp. Windows "slide" by the provided value (1 second in the example), and hence
            overlap. The window comes with a default trigger. For event/ingestion time, a window is triggered when a
	    watermark with value higher than its end-value is received, whereas for processing time
	    when the current processing time exceeds its current end value.
          </p>
    {% highlight scala %}
stream.window(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)))
    {% endhighlight %}
        </td>
      </tr>
  </tbody>
</table>
</div>

</div>

The `Trigger` specifies when the function that comes after the window clause (e.g., `sum`, `count`) is evaluated ("fires")
for each window. If a trigger is not specified, a default trigger for each window type is used (that is part of the
definition of the `WindowAssigner`). Flink comes bundled with a set of triggers if the ones that windows use by
default do not fit the application. You can write your own trigger by implementing the `Trigger` interface. Note that
specifying a trigger will override the default trigger of the window assigner.

<div class="codetabs" markdown="1">

<div data-lang="java" markdown="1">
<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
  <tr>
    <td><strong>Processing time trigger</strong></td>
    <td>
      <p>
        A window is fired when the current processing time exceeds its end-value.
        The elements on the triggered window are henceforth discarded.
      </p>
{% highlight java %}
windowedStream.trigger(ProcessingTimeTrigger.create());
{% endhighlight %}
    </td>
  </tr>
  <tr>
    <td><strong>Watermark trigger</strong></td>
    <td>
      <p>
        A window is fired when a watermark with value that exceeds the window's end-value has been received.
        The elements on the triggered window are henceforth discarded.
      </p>
{% highlight java %}
2319
windowedStream.trigger(EventTimeTrigger.create());
2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344
{% endhighlight %}
    </td>
  </tr>
  <tr>
    <td><strong>Continuous processing time trigger</strong></td>
    <td>
      <p>
        A window is periodically considered for being fired (every 5 seconds in the example). 
        The window is actually fired only when the current processing time exceeds its end-value.
        The elements on the triggered window are retained.
      </p>
{% highlight java %}
windowedStream.trigger(ContinuousProcessingTimeTrigger.of(Time.of(5, TimeUnit.SECONDS)));
{% endhighlight %}
    </td>
  </tr>
  <tr>
    <td><strong>Continuous watermark time trigger</strong></td>
    <td>
      <p>
        A window is periodically considered for being fired (every 5 seconds in the example).
        A window is actually fired when a watermark with value that exceeds the window's end-value has been received.
        The elements on the triggered window are retained.
      </p>
{% highlight java %}
2345
windowedStream.trigger(ContinuousEventTimeTrigger.of(Time.of(5, TimeUnit.SECONDS)));
2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385
{% endhighlight %}
    </td>
  </tr>
  <tr>
    <td><strong>Count trigger</strong></td>
    <td>
      <p>
        A window is fired when it has more than a certain number of elements (1000 below).
        The elements of the triggered window are retained.
      </p>
{% highlight java %}
windowedStream.trigger(CountTrigger.of(1000));
{% endhighlight %}
    </td>
  </tr>
  <tr>
    <td><strong>Purging trigger</strong></td>
    <td>
      <p>
        Takes any trigger as an argument and forces the triggered window elements to be
        "purged" (discarded) after triggering.
      </p>
{% highlight java %}
windowedStream.trigger(PurgingTrigger.of(CountTrigger.of(1000)));
{% endhighlight %}
    </td>
  </tr>
  <tr>
    <td><strong>Delta trigger</strong></td>
    <td>
      <p>
        A window is periodically considered for being fired (every 5000 milliseconds in the example).
        A window is actually fired when the value of the last added element exceeds the value of
        the first element inserted in the window according to a `DeltaFunction`.
      </p>
{% highlight java %}
windowedStream.trigger(new DeltaTrigger.of(5000.0, new DeltaFunction<Double>() {
    @Override
    public double getDelta (Double old, Double new) {
        return (new - old > 0.01);
2386
    }
2387 2388 2389 2390 2391 2392 2393
}));
{% endhighlight %}
    </td>
  </tr>
 </tbody>
</table>
</div>
2394

2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424

<div data-lang="scala" markdown="1">
<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
  <tr>
    <td><strong>Processing time trigger</strong></td>
    <td>
      <p>
        A window is fired when the current processing time exceeds its end-value.
        The elements on the triggered window are henceforth discarded.
      </p>
{% highlight scala %}
windowedStream.trigger(ProcessingTimeTrigger.create);
{% endhighlight %}
    </td>
  </tr>
  <tr>
    <td><strong>Watermark trigger</strong></td>
    <td>
      <p>
        A window is fired when a watermark with value that exceeds the window's end-value has been received.
        The elements on the triggered window are henceforth discarded.
      </p>
{% highlight scala %}
2425
windowedStream.trigger(EventTimeTrigger.create);
2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450
{% endhighlight %}
    </td>
  </tr>
  <tr>
    <td><strong>Continuous processing time trigger</strong></td>
    <td>
      <p>
        A window is periodically considered for being fired (every 5 seconds in the example).
        The window is actually fired only when the current processing time exceeds its end-value.
        The elements on the triggered window are retained.
      </p>
{% highlight scala %}
windowedStream.trigger(ContinuousProcessingTimeTrigger.of(Time.of(5, TimeUnit.SECONDS)));
{% endhighlight %}
    </td>
  </tr>
  <tr>
    <td><strong>Continuous watermark time trigger</strong></td>
    <td>
      <p>
        A window is periodically considered for being fired (every 5 seconds in the example).
        A window is actually fired when a watermark with value that exceeds the window's end-value has been received.
        The elements on the triggered window are retained.
      </p>
{% highlight scala %}
2451
windowedStream.trigger(ContinuousEventTimeTrigger.of(Time.of(5, TimeUnit.SECONDS)));
2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475
{% endhighlight %}
    </td>
  </tr>
  <tr>
    <td><strong>Count trigger</strong></td>
    <td>
      <p>
        A window is fired when it has more than a certain number of elements (1000 below).
        The elements of the triggered window are retained.
      </p>
{% highlight scala %}
windowedStream.trigger(CountTrigger.of(1000));
{% endhighlight %}
    </td>
  </tr>
  <tr>
    <td><strong>Purging trigger</strong></td>
    <td>
      <p>
        Takes any trigger as an argument and forces the triggered window elements to be
        "purged" (discarded) after triggering.
      </p>
{% highlight scala %}
windowedStream.trigger(PurgingTrigger.of(CountTrigger.of(1000)));
2476
{% endhighlight %}
2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496
    </td>
  </tr>
  <tr>
    <td><strong>Delta trigger</strong></td>
    <td>
      <p>
        A window is periodically considered for being fired (every 5000 milliseconds in the example).
        A window is actually fired when the value of the last added element exceeds the value of
        the first element inserted in the window according to a `DeltaFunction`.
      </p>
{% highlight scala %}
windowedStream.trigger(DeltaTrigger.of(5000.0, { (old,new) => new - old > 0.01 }))
{% endhighlight %}
    </td>
  </tr>
 </tbody>
</table>
</div>

</div>
2497

2498 2499 2500 2501 2502 2503
After the trigger fires, and before the function (e.g., `sum`, `count`) is applied to the window contents, an
optional `Evictor` removes some elements from the beginning of the window before the remaining elements
are passed on to the function. Flink comes bundled with a set of evictors You can write your own evictor by 
implementing the `Evictor` interface.

<div class="codetabs" markdown="1">
2504

2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521
<div data-lang="java" markdown="1">
<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
  <tr>
      <td><strong>Time evictor</strong></td>
      <td>
        <p>
         Evict all elements from the beginning of the window, so that elements from end-value - 1 second
         until end-value are retained (the resulting window size is 1 second).
        </p>
  {% highlight java %}
2522
triggeredStream.evictor(TimeEvictor.of(Time.of(1, TimeUnit.SECONDS)));
2523 2524 2525 2526 2527 2528 2529 2530 2531 2532
  {% endhighlight %}
      </td>
    </tr>
   <tr>
       <td><strong>Count evictor</strong></td>
       <td>
         <p>
          Retain 1000 elements from the end of the window backwards, evicting all others.
         </p>
   {% highlight java %}
2533
triggeredStream.evictor(CountEvictor.of(1000));
2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545
   {% endhighlight %}
       </td>
     </tr>
    <tr>
        <td><strong>Delta evictor</strong></td>
        <td>
          <p>
            Starting from the beginning of the window, evict elements until an element with
            value lower than the value of the last element is found (by a threshold and a 
            DeltaFunction).
          </p>
    {% highlight java %}
2546
triggeredStream.evictor(DeltaEvictor.of(5000, new DeltaFunction<Double>() {
2547 2548
  public double (Double oldValue, Double newValue) {
      return newValue - oldValue;
2549 2550 2551 2552 2553 2554 2555
  }
}));
    {% endhighlight %}
        </td>
      </tr>
 </tbody>
</table>
2556
</div>
2557

2558
<div data-lang="scala" markdown="1">
2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574
<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
  <tr>
      <td><strong>Time evictor</strong></td>
      <td>
        <p>
         Evict all elements from the beginning of the window, so that elements from end-value - 1 second
         until end-value are retained (the resulting window size is 1 second).
        </p>
  {% highlight scala %}
2575
triggeredStream.evictor(TimeEvictor.of(Time.of(1, TimeUnit.SECONDS)));
2576 2577 2578 2579 2580 2581 2582 2583 2584 2585
  {% endhighlight %}
      </td>
    </tr>
   <tr>
       <td><strong>Count evictor</strong></td>
       <td>
         <p>
          Retain 1000 elements from the end of the window backwards, evicting all others.
         </p>
   {% highlight scala %}
2586
triggeredStream.evictor(CountEvictor.of(1000));
2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598
   {% endhighlight %}
       </td>
     </tr>
    <tr>
        <td><strong>Delta evictor</strong></td>
        <td>
          <p>
            Starting from the beginning of the window, evict elements until an element with
            value lower than the value of the last element is found (by a threshold and a
            DeltaFunction).
          </p>
    {% highlight scala %}
2599
windowedStream.evictor(DeltaEvictor.of(5000.0, { (old,new) => new - old > 0.01 }))
2600 2601 2602 2603 2604 2605
    {% endhighlight %}
        </td>
      </tr>
 </tbody>
</table>
</div>
2606

2607
</div>
2608

2609
#### Recipes for Building Windows
2610

2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634
The mechanism of window assigner, trigger, and evictor is very powerful, and it allows you to define
many different kinds of windows. Flink's basic window constructs are, in fact, syntactic
sugar on top of the general mechanism. Below is how some common types of windows can be
constructed using the general mechanism

<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 35%">Window type</th>
      <th class="text-center">Definition</th>
    </tr>
  </thead>
  <tbody>
      <tr>
        <td>
	  <strong>Tumbling count window</strong><br>
    {% highlight java %}
stream.countWindow(1000)    
    {% endhighlight %}
	</td>
        <td>
    {% highlight java %}
stream.window(GlobalWindows.create())
  .trigger(CountTrigger.of(1000)
2635
  .evictor(CountEvictor.of(1000)))
2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648
    {% endhighlight %}
        </td>
      </tr>
      <tr>
        <td>
	  <strong>Sliding count window</strong><br>
    {% highlight java %}
stream.countWindow(1000, 100)    
    {% endhighlight %}
	</td>
        <td>
    {% highlight java %}
stream.window(GlobalWindows.create())
2649 2650
  .evictor(CountEvictor.of(1000))
  .trigger(CountTrigger.of(100))
2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663
    {% endhighlight %}
        </td>
      </tr>
      <tr>
        <td>
	  <strong>Tumbling event time window</strong><br>
    {% highlight java %}
stream.timeWindow(Time.of(5, TimeUnit.SECONDS))    
    {% endhighlight %}
	</td>	
        <td>
    {% highlight java %}
stream.window(TumblingTimeWindows.of((Time.of(5, TimeUnit.SECONDS)))
2664
  .trigger(EventTimeTrigger.create())
2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677
    {% endhighlight %}
        </td>
      </tr>
      <tr>
        <td>
	  <strong>Sliding event time window</strong><br>
    {% highlight java %}
stream.timeWindow(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS))    
    {% endhighlight %}
	</td>	
        <td>
    {% highlight java %}
stream.window(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)))
2678
  .trigger(EventTimeTrigger.create())
2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725
    {% endhighlight %}
        </td>
      </tr>
      <tr>
        <td>
	  <strong>Tumbling processing time window</strong><br>
    {% highlight java %}
stream.timeWindow(Time.of(5, TimeUnit.SECONDS))    
    {% endhighlight %}
	</td>	
        <td>
    {% highlight java %}
stream.window(TumblingTimeWindows.of((Time.of(5, TimeUnit.SECONDS)))
  .trigger(ProcessingTimeTrigger.create())
    {% endhighlight %}
        </td>
      </tr>
      <tr>
        <td>
	  <strong>Sliding processing time window</strong><br>
    {% highlight java %}
stream.timeWindow(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS))    
    {% endhighlight %}
	</td>	
        <td>
    {% highlight java %}
stream.window(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)))
  .trigger(ProcessingTimeTrigger.create())
    {% endhighlight %}
        </td>
      </tr>            
  </tbody>
</table>


### Windows on Unkeyed Data Streams 

You can also define windows on regular (non-keyed) data streams using the `windowAll` transformation. These 
windowed data streams have all the capabilities of keyed windowed data streams, but are evaluated at a single
task (and hence at a single computing node). The syntax for defining triggers and evictors is exactly the
same: 

<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
nonKeyedStream
    .windowAll(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS))
2726 2727
    .trigger(CountTrigger.of(100))
    .evictor(CountEvictor.of(10));
2728 2729 2730 2731 2732 2733 2734
{% endhighlight %}
</div>

<div data-lang="scala" markdown="1">
{% highlight scala %}
nonKeyedStream
    .windowAll(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS))
2735 2736
    .trigger(CountTrigger.of(100))
    .evictor(CountEvictor.of(10))
2737
{% endhighlight %}
2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810
</div>
</div>

Basic window definitions are also available for windows on non-keyed streams:

<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">

<br />

<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
      <tr>
        <td><strong>Tumbling time window all</strong><br>DataStream &rarr; WindowedStream</td>
        <td>
          <p>
          Defines a window of 5 seconds, that "tumbles". This means that elements are
          grouped according to their timestamp in groups of 5 second duration, and every element belongs to exactly one window.
          The notion of time used is controlled by the StreamExecutionEnvironment.
    {% highlight java %}
nonKeyedStream.timeWindowAll(Time.of(5, TimeUnit.SECONDS));
    {% endhighlight %}
          </p>
        </td>
      </tr>
      <tr>
          <td><strong>Sliding time window all</strong><br>DataStream &rarr; WindowedStream</td>
          <td>
            <p>
             Defines a window of 5 seconds, that "slides" by 1 seconds. This means that elements are
             grouped according to their timestamp in groups of 5 second duration, and elements can belong to more than
             one window (since windows overlap by at least 4 seconds)
             The notion of time used is controlled by the StreamExecutionEnvironment.
      {% highlight java %}
nonKeyedStream.timeWindowAll(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS));
      {% endhighlight %}
            </p>
          </td>
        </tr>
      <tr>
        <td><strong>Tumbling count window all</strong><br>DataStream &rarr; WindowedStream</td>
        <td>
          <p>
          Defines a window of 1000 elements, that "tumbles". This means that elements are
          grouped according to their arrival time (equivalent to processing time) in groups of 1000 elements,
          and every element belongs to exactly one window.
    {% highlight java %}
nonKeyedStream.countWindowAll(1000)
    {% endhighlight %}
        </p>
        </td>
      </tr>
      <tr>
      <td><strong>Sliding count window all</strong><br>DataStream &rarr; WindowedStream</td>
      <td>
        <p>
          Defines a window of 1000 elements, that "slides" every 100 elements. This means that elements are
          grouped according to their arrival time (equivalent to processing time) in groups of 1000 elements,
          and every element can belong to more than one window (as windows overlap by at least 900 elements).
  {% highlight java %}
nonKeyedStream.countWindowAll(1000, 100)
  {% endhighlight %}
        </p>
      </td>
    </tr>
  </tbody>
</table>
2811

2812
</div>
2813

2814
<div data-lang="scala" markdown="1">
2815

2816
<br />
2817

2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880
<table class="table table-bordered">
  <thead>
    <tr>
      <th class="text-left" style="width: 25%">Transformation</th>
      <th class="text-center">Description</th>
    </tr>
  </thead>
  <tbody>
      <tr>
        <td><strong>Tumbling time window all</strong><br>DataStream &rarr; WindowedStream</td>
        <td>
          <p>
          Defines a window of 5 seconds, that "tumbles". This means that elements are
          grouped according to their timestamp in groups of 5 second duration, and every element belongs to exactly one window.
          The notion of time used is controlled by the StreamExecutionEnvironment.
    {% highlight scala %}
nonKeyedStream.timeWindowAll(Time.of(5, TimeUnit.SECONDS));
    {% endhighlight %}
          </p>
        </td>
      </tr>
      <tr>
          <td><strong>Sliding time window all</strong><br>DataStream &rarr; WindowedStream</td>
          <td>
            <p>
             Defines a window of 5 seconds, that "slides" by 1 seconds. This means that elements are
             grouped according to their timestamp in groups of 5 second duration, and elements can belong to more than
             one window (since windows overlap by at least 4 seconds)
             The notion of time used is controlled by the StreamExecutionEnvironment.
      {% highlight scala %}
nonKeyedStream.timeWindowAll(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS));
      {% endhighlight %}
            </p>
          </td>
        </tr>
      <tr>
        <td><strong>Tumbling count window all</strong><br>DataStream &rarr; WindowedStream</td>
        <td>
          <p>
          Defines a window of 1000 elements, that "tumbles". This means that elements are
          grouped according to their arrival time (equivalent to processing time) in groups of 1000 elements,
          and every element belongs to exactly one window.
    {% highlight scala %}
nonKeyedStream.countWindowAll(1000)
    {% endhighlight %}
        </p>
        </td>
      </tr>
      <tr>
      <td><strong>Sliding count window all</strong><br>DataStream &rarr; WindowedStream</td>
      <td>
        <p>
          Defines a window of 1000 elements, that "slides" every 100 elements. This means that elements are
          grouped according to their arrival time (equivalent to processing time) in groups of 1000 elements,
          and every element can belong to more than one window (as windows overlap by at least 900 elements).
  {% highlight scala %}
nonKeyedStream.countWindowAll(1000, 100)
  {% endhighlight %}
        </p>
      </td>
    </tr>
  </tbody>
</table>
2881

2882 2883 2884
</div>
</div>

2885
[Back to top](#top)
2886

2887 2888
Execution Parameters
--------------------
2889

2890
### Fault Tolerance
2891

2892
The [Fault Tolerance Documentation]({{ site.baseurl }}/apis/fault_tolerance.html) describes the options and parameters to enable and configure Flink's checkpointing mechanism.
2893

2894
### Parallelism
2895

2896 2897
You can control the number of parallel instances created for each operator by 
calling the `operator.setParallelism(int)` method.
2898

2899
### Controlling Latency
2900

2901 2902 2903 2904 2905 2906
By default, elements are not transferred on the network one-by-one (which would cause unnecessary network traffic)
but are buffered. The size of the buffers (which are actually transferred between machines) can be set in the Flink config files.
While this method is good for optimizing throughput, it can cause latency issues when the incoming stream is not fast enough.
To control throughput and latency, you can use `env.setBufferTimeout(timeoutMillis)` on the execution environment
(or on individual operators) to set a maximum wait time for the buffers to fill up. After this time, the 
buffers are sent automatically even if they are not full. The default value for this timeout is 100 ms.
2907

2908
Usage:
2909

2910 2911 2912
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
2913 2914
LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
env.setBufferTimeout(timeoutMillis);
2915

2916
env.genereateSequence(1,10).map(new MyMapper()).setBufferTimeout(timeoutMillis);
2917 2918 2919 2920
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
2921 2922 2923 2924
LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment
env.setBufferTimeout(timeoutMillis)

env.genereateSequence(1,10).map(myMap).setBufferTimeout(timeoutMillis)
2925 2926 2927
{% endhighlight %}
</div>
</div>
2928

2929 2930 2931
To maximize throughput, set `setBufferTimeout(-1)` which will remove the timeout and buffers will only be
flushed when they are full. To minimize latency, set the timeout to a value close to 0 (for example 5 or 10 ms). 
A buffer timeout of 0 should be avoided, because it can cause severe performance degradation.
2932 2933 2934

[Back to top](#top)

2935 2936
Working with State
------------------
2937

2938 2939 2940 2941 2942 2943
All transformations in Flink may look like functions (in the functional processing terminology), but
are in fact stateful operators. You can make *every* transformation (`map`, `filter`, etc) stateful
by declaring local variables or using Flink's state interface. You can register any local variable
as ***managed*** state by implementing an interface. In this case, and also in the case of using
Flink's native state interface, Flink will automatically take consistent snapshots of your state
periodically, and restore its value in the case of a failure.
2944

2945 2946
The end effect is that updates to any form of state are the same under failure-free execution and
execution under failures. 
2947

2948 2949
First, we look at how to make local variables consistent under failures, and then we look at
Flink's state interface.
2950

2951 2952 2953
By default state checkpoints will be stored in-memory at the JobManager. For proper persistence of large
state, Flink supports storing the checkpoints on file systems (HDFS, S3, or any mounted POSIX file system),
which can be configured in the `flink-conf.yaml` or via `StreamExecutionEnvironment.setStateBackend(…)`. 
2954 2955


2956
### Checkpointing Local Variables
2957

2958 2959 2960
Local variables can be checkpointed by using the `Checkpointed` interface.

When the user-defined function implements the `Checkpointed` interface, the `snapshotState(…)` and `restoreState(…)` 
2961
methods will be executed to draw and restore function state.
2962

2963 2964 2965 2966 2967
In addition to that, user functions can also implement the `CheckpointNotifier` interface to receive notifications on 
completed checkpoints via the `notifyCheckpointComplete(long checkpointId)` method.
Note that there is no guarantee for the user function to receive a notification if a failure happens between
checkpoint completion and notification. The notifications should hence be treated in a way that notifications from 
later checkpoints can subsume missing notifications.
2968

2969
For example the same counting, reduce function shown for `OperatorState`s by using the `Checkpointed` interface instead:
2970

2971
{% highlight java %}
2972
public class CounterSum extends ReduceFunction<Long>, Checkpointed<Long> {
2973

2974
    // persistent counter
2975
    private long counter = 0;
2976

2977
    @Override
2978
    public Long reduce(Long value1, Long value2) {
2979 2980 2981 2982 2983 2984
        counter++;
        return value1 + value2;
    }

    // regularly persists state during normal operation
    @Override
2985
    public Serializable snapshotState(long checkpointId, long checkpointTimestamp) {
2986 2987 2988 2989 2990 2991 2992 2993 2994 2995
        return counter;
    }

    // restores state on recovery from failure
    @Override
    public void restoreState(Long state) {
        counter = state;
    }
}
{% endhighlight %}
2996

2997
### Using the Key/Value State Interface
2998

2999 3000 3001 3002
The state interface gives access to key/value states, which are a collection of key/value pairs.
Because the state is partitioned by the keys (distributed accross workers), it can only be used
on the `KeyedStream`, created via `stream.keyBy(…)` (which means also that it is usable in all
types of functions on keyed windows).
3003

3004 3005 3006
The handle to the state can be obtained from the function's `RuntimeContext`. The state handle will
then give access to the value mapped under the key of the current record or window - each key consequently
has its own value.
3007

3008 3009 3010 3011
The following code sample shows how to use the key/value state inside a reduce function.
When creating the state handle, one needs to supply a name for that state (a function can have multiple states
of different types), the type of the state (used to create efficient serializers), and the default value (returned
as a value for keys that do not yet have a value associated).
3012 3013

{% highlight java %}
3014
public class CounterSum extends RichReduceFunction<Long> {
3015 3016

    /** The state handle */
3017
    private OperatorState<Long> counter;
3018 3019

    @Override
3020
    public Long reduce(Long value1, Long value2) {
3021
        counter.update(counter.value() + 1);
3022 3023 3024 3025
        return value1 + value2;
    }

    @Override
3026
    public void open(Configuration config) {
3027
        counter = getRuntimeContext().getKeyValueState("myCounter", Long.class, 0L);
3028 3029 3030 3031
    }
}
{% endhighlight %} 

3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059
State updated by this is usually kept locally inside the flink process (unless one configures explicitly
an external state backend). This means that lookups and updates are process local and this very fast.

The important implication of having the keys set implicitly is that it forces programs to group the stream
by key (via the `keyBy()` function), making the key partitioning transparent to Flink. That allows the system
to efficiently restore and redistribute keys and state.

The Scala API has shortcuts that for stateful `map()` or `flatMap()` functions on `KeyedStream`, which give the
state of the current key as an option directly into the function, and return the result with a state update:

{% highlight scala %}
val stream: DataStream[(String, Int)] = ...

val counts: DataStream[(String, Int)] = stream
  .keyBy(_._1)
  .mapWithState((in: (String, Int), count: Option[Int]) =>
    count match {
      case Some(c) => ( (in._1, c), Some(c + in._2) )
      case None => ( (in._1, 0), Some(in._2) )
    })
{% endhighlight %} 


### Stateful Source Functions

Stateful sources require a bit more care as opposed to other operators. 
In order to make the updates to the state and output collection atomic (required for exactly-once semantics
on failure/recovery), the user is required to get a lock from the source's context.
3060 3061

{% highlight java %}
3062
public static class CounterSource extends RichParallelSourceFunction<Long>, Checkpointed<Long> {
3063

3064 3065 3066 3067 3068
    /**  current offset for exactly once semantics */
    private long offset;

    /** flag for job cancellation */
    private volatile boolean isRunning = true;
3069 3070
    
    @Override
3071 3072
    public void run(SourceContext<Long> ctx) {
        final Object lock = ctx.getCheckpointLock();
3073 3074 3075
        
        while (isRunning) {
            // output and state update are atomic
3076
            synchronized (lock) {
3077
                ctx.collect(offset);
3078
                offset += 1;
3079 3080 3081 3082 3083
            }
        }
    }

    @Override
3084 3085
    public void cancel() {
        isRunning = false;
3086 3087 3088
    }

    @Override
3089 3090 3091 3092 3093 3094 3095 3096
    public Long snapshotState(long checkpointId, long checkpointTimestamp) {
        return offset;

    }

    @Override
	public void restoreState(Long state) {
        offset = state;
3097 3098 3099 3100
    }
}
{% endhighlight %}

3101
Some operators might need the information when a checkpoint is fully acknowledged by Flink to communicate that with the outside world. In this case see the `flink.streaming.api.checkpoint.CheckpointNotifier` interface.
3102

3103
### State Checkpoints in Iterative Jobs
3104

3105
Flink currently only provides processing guarantees for jobs without iterations. Enabling checkpointing on an iterative job causes an exception. In order to force checkpointing on an iterative program the user needs to set a special flag when enabling checkpointing: `env.enableCheckpointing(interval, force = true)`.
3106

3107
Please note that records in flight in the loop edges (and the state changes associated with them) will be lost during failure.
3108

3109
[Back to top](#top)
3110

3111 3112
Iterations
----------
3113

3114 3115
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
3116

3117
<br />
3118

3119 3120 3121 3122
Iterative streaming programs implement a step function and embed it into an `IterativeStream`. As a DataStream
program may never finish, there is no maximum number of iterations. Instead, you need to specify which part
of the stream is fed back to the iteration and which part is forwarded downstream using a `split` transformation
or a `filter`. Here, we show an example using filters. First, we define an `IterativeStream`
3123

3124 3125 3126
{% highlight java %}
IterativeStream<Integer> iteration = input.iterate();
{% endhighlight %}
3127

3128 3129
Then, we specify the logic that will be executed inside the loop using a series of trasformations (here
a simple `map` transformation)
3130

3131 3132 3133
{% highlight java %}
DataStream<Integer> iterationBody = iteration.map(/* this is executed many times */);
{% endhighlight %}
3134

3135 3136 3137 3138 3139 3140
To close an iteration and define the iteration tail, call the `closeWith(feedbackStream)` method of the `IterativeStream`.
The DataStream given to the `closeWith` function will be fed back to the iteration head. 
A common pattern is to use a filter to separate the part of the strem that is fed back,
and the part of the stream which is propagated forward. These filters can, e.g., define
the "termination" logic, where an element is allowed to propagate downstream rather
than being fed back.
3141

3142
{% highlight java %}
3143
iteration.closeWith(iterationBody.filter(/* one part of the stream */));
3144 3145
DataStream<Integer> output = iterationBody.filter(/* some other part of the stream */);
{% endhighlight %}
3146

3147 3148
By default the partitioning of the feedback stream will be automatically set to be the same as the input of the 
iteration head. To override this the user can set an optional boolean flag in the `closeWith` method.
3149

3150
For example, here is program that continuously subtracts 1 from a series of integers until they reach zero:
3151

3152
{% highlight java %}
3153 3154 3155 3156 3157 3158 3159 3160 3161 3162
DataStream<Long> someIntegers = env.generateSequence(0, 1000);
		
IterativeStream<Long> iteration = someIntegers.iterate();

DataStream<Long> minusOne = iteration.map(new MapFunction<Long, Long>() {
  @Override
  public Long map(Long value) throws Exception {
    return value - 1 ;
  }
});
3163

3164 3165 3166 3167 3168 3169
DataStream<Long> stillGreaterThanZero = minusOne.filter(new FilterFunction<Long>() {
  @Override
  public boolean filter(Long value) throws Exception {
    return (value > 0);
  }
});
3170

3171 3172 3173 3174 3175 3176 3177 3178
iteration.closeWith(stillGreaterThanZero);

DataStream<Long> lessThanZero = minusOne.filter(new FilterFunction<Long>() {
  @Override
  public boolean filter(Long value) throws Exception {
    return (value <= 0);
  }
});
3179 3180
{% endhighlight %}
</div>
3181
<div data-lang="scala" markdown="1">
3182

3183
<br />
3184

3185 3186 3187 3188 3189 3190
Iterative streaming programs implement a step function and embed it into an `IterativeStream`. As a DataStream
program may never finish, there is no maximum number of iterations. Instead, you need to specify which part
of the stream is fed back to the iteration and which part is forwarded downstream using a `split` transformation
or a `filter`. Here, we show an example iteration where the body (the part of the computation that is repeated)
is a simple map transformation, and the elements that are fed back are distinguished by the elements that
are forwarded downstream using filters.
3191

3192 3193 3194 3195 3196 3197 3198
{% highlight scala %}
val iteratedStream = someDataStream.iterate(
  iteration => {
    val iterationBody = iteration.map(/* this is executed many times */)
    (tail.filter(/* one part of the stream */), tail.filter(/* some other part of the stream */))
})
{% endhighlight %}
3199

3200

3201 3202
By default the partitioning of the feedback stream will be automatically set to be the same as the input of the
iteration head. To override this the user can set an optional boolean flag in the `closeWith` method.
3203

3204
For example, here is program that continuously subtracts 1 from a series of integers until they reach zero:
3205

3206
{% highlight scala %}
3207
val someIntegers: DataStream[Long] = env.generateSequence(0, 1000)
3208

3209 3210 3211 3212 3213 3214 3215 3216
val iteratedStream = someIntegers.iterate(
  iteration => {
    val minusOne = iteration.map( v => v - 1)
    val stillGreaterThanZero = minusOne.filter (_ > 0)
    val lessThanZero = minusOne.filter(_ <= 0)
    (stillGreaterThanZero, lessThanZero)
  }
)
3217
{% endhighlight %}
3218

3219 3220
</div>
</div>
3221 3222

[Back to top](#top)
3223

3224 3225
Connectors
----------
3226

3227
<!-- TODO: reintroduce flume -->
3228
Connectors provide code for interfacing with various third-party systems.
3229

3230
Currently these systems are supported:
3231

3232 3233
 * [Apache Kafka](https://kafka.apache.org/) (sink/source)
 * [Elasticsearch](https://elastic.co/) (sink)
3234
 * [Hadoop FileSystem](http://hadoop.apache.org) (sink)
3235 3236 3237 3238 3239 3240 3241 3242 3243
 * [RabbitMQ](http://www.rabbitmq.com/) (sink/source)
 * [Twitter Streaming API](https://dev.twitter.com/docs/streaming-apis) (source)

To run an application using one of these connectors, additional third party
components are usually required to be installed and launched, e.g. the servers
for the message queues. Further instructions for these can be found in the
corresponding subsections. [Docker containers](#docker-containers-for-connectors)
are also provided encapsulating these services to aid users getting started
with connectors.
3244 3245 3246

### Apache Kafka

3247 3248
This connector provides access to event streams served by [Apache Kafka](https://kafka.apache.org/).

3249 3250 3251 3252
Flink provides special Kafka Connectors for reading and writing data from/to Kafka topics.
The Flink Kafka Consumer integrates with Flink's checkpointing mechanism to provide
exactly-once processing semantics. To achieve that, Flink does not purely rely on Kafka's consumer group
offset tracking, but tracks and checkpoints these offsets internally as well.
3253 3254

Please pick a package (maven artifact id) and class name for your use-case and environment.
3255
For most users, the `FlinkKafkaConsumer082` (part of `flink-connector-kafka`) is appropriate.
3256 3257 3258 3259 3260


<table class="table table-bordered">
  <thead>
    <tr>
3261
      <th class="text-left">Maven Dependency</th>
3262 3263
      <th class="text-left">Supported since</th>
      <th class="text-left">Class name</th>
3264 3265
      <th class="text-left">Kafka version</th>
      <th class="text-left">Notes</th>
3266 3267 3268 3269 3270 3271 3272
    </tr>
  </thead>
  <tbody>
    <tr>
        <td>flink-connector-kafka</td>
        <td>0.9.1, 0.10</td>
        <td>FlinkKafkaConsumer081</td>
3273 3274
        <td>0.8.1</td>	
        <td>Uses the <a href="https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example">SimpleConsumer</a> API of Kafka internally. Offsets are committed to ZK by Flink.</td>	
3275 3276
    </tr>
    <tr>
3277
        <td>flink-connector-kafka</td>
3278 3279
        <td>0.9.1, 0.10</td>
        <td>FlinkKafkaConsumer082</td>
3280 3281 3282
        <td>0.8.2</td>	
        <td>Uses the <a href="https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example">SimpleConsumer</a> API of Kafka internally. Offsets are committed to ZK by Flink.</td>	
    </tr>
3283 3284 3285 3286
  </tbody>
</table>

Then, import the connector in your maven project:
3287 3288 3289 3290

{% highlight xml %}
<dependency>
  <groupId>org.apache.flink</groupId>
3291
  <artifactId>flink-connector-kafka</artifactId>
3292 3293 3294
  <version>{{site.version }}</version>
</dependency>
{% endhighlight %}
3295

3296
Note that the streaming connectors are currently not part of the binary distribution. See how to link with them for cluster execution [here](cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).
3297

3298
#### Installing Apache Kafka
3299

3300
* Follow the instructions from [Kafka's quickstart](https://kafka.apache.org/documentation.html#quickstart) to download the code and launch a server (launching a Zookeeper and a Kafka server is required every time before starting the application).
3301
* On 32 bit computers [this](http://stackoverflow.com/questions/22325364/unrecognized-vm-option-usecompressedoops-when-running-kafka-from-my-ubuntu-in) problem may occur.
3302
* If the Kafka and Zookeeper servers are running on a remote machine, then the `advertised.host.name` setting in the `config/server.properties` file must be set to the machine's IP address.
3303

3304 3305
#### Kafka Consumer

3306
The standard `FlinkKafkaConsumer082` is a Kafka consumer providing access to one topic. It takes the following parameters to the constructor:
3307

3308 3309 3310 3311 3312 3313 3314
1. The topic name
2. A DeserializationSchema
3. Properties for the Kafka consumer.
  The following properties are required:
  - "bootstrap.servers" (comma separated list of Kafka brokers)
  - "zookeeper.connect" (comma separated list of Zookeeper servers)
  - "group.id" the id of the consumer group
3315

3316
Example:
3317

3318 3319 3320
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
3321 3322 3323 3324
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
3325
DataStream<String> stream = env
3326
	.addSource(new FlinkKafkaConsumer082<>("topic", new SimpleStringSchema(), properties))
3327
	.print();
3328 3329 3330 3331
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
3332 3333 3334 3335
val properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
3336
stream = env
3337
    .addSource(new FlinkKafkaConsumer082[String]("topic", new SimpleStringSchema(), properties))
3338 3339 3340 3341
    .print
{% endhighlight %}
</div>
</div>
3342

3343
#### Kafka Consumers and Fault Tolerance
3344

3345 3346 3347 3348
With Flink's checkpointing enabled, the Flink Kafka Consumer will consume records from a topic and periodically checkpoint all
its Kafka offsets, together with the state of other operations, in a consistent manner. In case of a job failure, Flink will restore
the streaming program to the state of the latest checkpoint and re-consume the records from Kafka, starting from the offsets that where
stored in the checkpoint.
3349

3350
The interval of drawing checkpoints therefore defines how much the program may have to go back at most, in case of a failure.
3351

3352
To use fault tolerant Kafka Consumers, checkpointing of the topology needs to be enabled at the execution environment:
3353 3354 3355 3356 3357

<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
3358 3359 3360 3361 3362 3363 3364
env.enableCheckpointing(5000); // checkpoint every 5000 msecs
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.enableCheckpointing(5000) // checkpoint every 5000 msecs
3365 3366 3367 3368 3369
{% endhighlight %}
</div>
</div>

Also note that Flink can only restart the topology if enough processing slots are available to restart the topology.
3370
So if the topology fails due to loss of a TaskManager, there must still be enough slots available afterwards.
3371 3372
Flink on YARN supports automatic restart of lost YARN containers.

3373
If checkpointing is not enabled, the Kafka consumer will periodically commit the offsets to Zookeeper.
3374

3375
#### Kafka Producer
3376

3377 3378
The `FlinkKafkaProducer` writes data to a Kafka topic. The producer can specify a custom partitioner that assigns
recors to partitions.
3379

3380
Example:
3381

3382 3383 3384
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
3385
stream.addSink(new FlinkKafkaProducer<String>("localhost:9092", "my-topic", new SimpleStringSchema()));
3386 3387 3388 3389
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
3390
stream.addSink(new FlinkKafkaProducer[String]("localhost:9092", "my-topic", new SimpleStringSchema()))
3391 3392 3393
{% endhighlight %}
</div>
</div>
3394

3395 3396 3397
You can also define a custom Kafka producer configuration for the KafkaSink with the constructor. Please refer to
the [Apache Kafka documentation](https://kafka.apache.org/documentation.html) for details on how to configure
Kafka Producers.
3398 3399 3400

[Back to top](#top)

3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485
### Elasticsearch

This connector provides a Sink that can write to an
[Elasticsearch](https://elastic.co/) Index. To use this connector, add the
following dependency to your project:

{% highlight xml %}
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-elasticsearch</artifactId>
  <version>{{site.version }}</version>
</dependency>
{% endhighlight %}

Note that the streaming connectors are currently not part of the binary
distribution. See
[here](cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
for information about how to package the program with the libraries for
cluster execution.

#### Installing Elasticsearch

Instructions for setting up an Elasticsearch cluster can be found
[here](https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html).
Make sure to set and remember a cluster name. This must be set when
creating a Sink for writing to your cluster

#### Elasticsearch Sink
The connector provides a Sink that can send data to an Elasticsearch Index.

The sink can use two different methods for communicating with Elasticsearch:

1. An embedded Node
2. The TransportClient

See [here](https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/client.html)
for information about the differences between the two modes.

This code shows how to create a sink that uses an embedded Node for
communication:

<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
DataStream<String> input = ...;

Map<String, String> config = Maps.newHashMap();
// This instructs the sink to emit after every element, otherwise they would be buffered
config.put("bulk.flush.max.actions", "1");
config.put("cluster.name", "my-cluster-name");

input.addSink(new ElasticsearchSink<>(config, new IndexRequestBuilder<String>() {
    @Override
    public IndexRequest createIndexRequest(String element, RuntimeContext ctx) {
        Map<String, Object> json = new HashMap<>();
        json.put("data", element);

        return Requests.indexRequest()
                .index("my-index")
                .type("my-type")
                .source(json);
    }
}));
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
val input: DataStream[String] = ...

val config = new util.HashMap[String, String]
config.put("bulk.flush.max.actions", "1")
config.put("cluster.name", "my-cluster-name")

text.addSink(new ElasticsearchSink(config, new IndexRequestBuilder[String] {
  override def createIndexRequest(element: String, ctx: RuntimeContext): IndexRequest = {
    val json = new util.HashMap[String, AnyRef]
    json.put("data", element)
    println("SENDING: " + element)
    Requests.indexRequest.index("my-index").`type`("my-type").source(json)
  }
}))
{% endhighlight %}
</div>
</div>

3486
Note how a Map of Strings is used to configure the Sink. The configuration keys
3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500
are documented in the Elasticsearch documentation
[here](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html).
Especially important is the `cluster.name` parameter that must correspond to
the name of your cluster.

Internally, the sink uses a `BulkProcessor` to send index requests to the cluster.
This will buffer elements before sending a request to the cluster. The behaviour of the
`BulkProcessor` can be configured using these config keys:
 * **bulk.flush.max.actions**: Maximum amount of elements to buffer
 * **bulk.flush.max.size.mb**: Maximum amount of data (in megabytes) to buffer
 * **bulk.flush.interval.ms**: Interval at which to flush data regardless of the other two
  settings in milliseconds

This example code does the same, but with a `TransportClient`:
3501

3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
DataStream<String> input = ...;

Map<String, String> config = Maps.newHashMap();
// This instructs the sink to emit after every element, otherwise they would be buffered
config.put("bulk.flush.max.actions", "1");
config.put("cluster.name", "my-cluster-name");

List<TransportAddress> transports = new ArrayList<String>();
transports.add(new InetSocketTransportAddress("node-1", 9300));
transports.add(new InetSocketTransportAddress("node-2", 9300));

input.addSink(new ElasticsearchSink<>(config, transports, new IndexRequestBuilder<String>() {
    @Override
    public IndexRequest createIndexRequest(String element, RuntimeContext ctx) {
        Map<String, Object> json = new HashMap<>();
        json.put("data", element);

        return Requests.indexRequest()
                .index("my-index")
                .type("my-type")
                .source(json);
    }
}));
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
val input: DataStream[String] = ...

val config = new util.HashMap[String, String]
config.put("bulk.flush.max.actions", "1")
config.put("cluster.name", "my-cluster-name")

val transports = new ArrayList[String]
transports.add(new InetSocketTransportAddress("node-1", 9300))
transports.add(new InetSocketTransportAddress("node-2", 9300))

text.addSink(new ElasticsearchSink(config, transports, new IndexRequestBuilder[String] {
  override def createIndexRequest(element: String, ctx: RuntimeContext): IndexRequest = {
    val json = new util.HashMap[String, AnyRef]
    json.put("data", element)
    println("SENDING: " + element)
    Requests.indexRequest.index("my-index").`type`("my-type").source(json)
  }
}))
{% endhighlight %}
</div>
</div>

The difference is that we now need to provide a list of Elasticsearch Nodes
to which the sink should connect using a `TransportClient`.

More about information about Elasticsearch can be found [here](https://elastic.co).

[Back to top](#top)

3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634
### Hadoop FileSystem

This connector provides a Sink that writes rolling files to any filesystem supported by
Hadoop FileSystem. To use this connector, add the
following dependency to your project:

{% highlight xml %}
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-filesystem</artifactId>
  <version>{{site.version}}</version>
</dependency>
{% endhighlight %}

Note that the streaming connectors are currently not part of the binary
distribution. See
[here](cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution)
for information about how to package the program with the libraries for
cluster execution.

#### Rolling File Sink

The rolling behaviour as well as the writing can be configured but we will get to that later.
This is how you can create a default rolling sink:

<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
DataStream<String> input = ...;

input.addSink(new RollingSink<String>("/base/path"));

{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
val input: DataStream[String] = ...

input.addSink(new RollingSink("/base/path"))

{% endhighlight %}
</div>
</div>

The only required parameter is the base path where the rolling files (buckets) will be
stored. The sink can be configured by specifying a custom bucketer, writer and batch size.

By default the rolling sink will use the pattern `"yyyy-MM-dd--HH"` to name the rolling buckets.
This pattern is passed to `SimpleDateFormat` with the current system time to form a bucket path. A
new bucket will be created whenever the bucket path changes. For example, if you have a pattern
that contains minutes as the finest granularity you will get a new bucket every minute.
Each bucket is itself a directory that contains several part files: Each parallel instance
of the sink will create its own part file and when part files get too big the sink will also
create a new part file next to the others. To specify a custom bucketer use `setBucketer()`
on a `RollingSink`.

The default writer is `StringWriter`. This will call `toString()` on the incoming elements
and write them to part files, separated by newline. To specify a custom writer use `setWriter()`
on a `RollingSink`. If you want to write Hadoop SequenceFiles you can use the provided
`SequenceFileWriter` which can also be configured to use compression.

The last configuration option is the batch size. This specifies when a part file should be closed
and a new one started. (The default part file size is 384 MB).

Example:

<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
DataStream<Tuple2<IntWritable,Text>> input = ...;

RollingSink sink = new RollingSink<String>("/base/path");
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"));
sink.setWriter(new SequenceFileWriter<IntWritable, Text>());
3635
sink.setBatchSize(1024 * 1024 * 400); // this is 400 MB,
3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647

input.addSink(sink);

{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
val input: DataStream[Tuple2[IntWritable, Text]] = ...

val sink = new RollingSink[String]("/base/path")
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
sink.setWriter(new SequenceFileWriter[IntWritable, Text]())
3648
sink.setBatchSize(1024 * 1024 * 400) // this is 400 MB,
3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663

input.addSink(sink)

{% endhighlight %}
</div>
</div>

This will create a sink that writes to bucket files that follow this schema:

```
/base/path/{date-time}/part-{parallel-task}-{count}
```

Where `date-time` is the string that we get from the date/time format, `parallel-task` is the index
of the parallel sink instance and `count` is the running number of part files that where created
3664
because of the batch size.
3665

3666
For in-depth information, please refer to the JavaDoc for
3667 3668 3669 3670
[RollingSink](http://flink.apache.org/docs/latest/api/java/org/apache/flink/streaming/connectors/fs/RollingSink.html).

[Back to top](#top)

3671 3672
### RabbitMQ

3673
This connector provides access to data streams from [RabbitMQ](http://www.rabbitmq.com/). To use this connector, add the following dependency to your project:
3674 3675 3676 3677

{% highlight xml %}
<dependency>
  <groupId>org.apache.flink</groupId>
3678
  <artifactId>flink-connector-rabbitmq</artifactId>
3679 3680 3681
  <version>{{site.version }}</version>
</dependency>
{% endhighlight %}
3682

3683 3684
Note that the streaming connectors are currently not part of the binary distribution. See linking with them for cluster execution [here](cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).

3685 3686
#### Installing RabbitMQ
Follow the instructions from the [RabbitMQ download page](http://www.rabbitmq.com/download.html). After the installation the server automatically starts, and the application connecting to RabbitMQ can be launched.
3687 3688 3689

#### RabbitMQ Source

3690
A class providing an interface for receiving data from RabbitMQ.
3691

3692
The followings have to be provided for the `RMQSource(…)` constructor in order:
3693

3694 3695
1. The hostname
2. The queue name
3696
3. Deserialization schema
3697

3698
Example:
3699

3700 3701 3702
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
3703
DataStream<String> stream = env
3704
	.addSource(new RMQSource<String>("localhost", "hello", new SimpleStringSchema()))
3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715
	.print
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
stream = env
    .addSource(new RMQSource[String]("localhost", "hello", new SimpleStringSchema))
    .print
{% endhighlight %}
</div>
</div>
3716 3717

#### RabbitMQ Sink
3718
A class providing an interface for sending data to RabbitMQ.
3719

3720
The followings have to be provided for the `RMQSink(…)` constructor in order:
3721 3722 3723

1. The hostname
2. The queue name
3724
3. Serialization schema
3725

3726
Example:
3727

3728 3729 3730
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
3731
stream.addSink(new RMQSink<String>("localhost", "hello", new StringToByteSerializer()));
3732 3733 3734 3735 3736 3737 3738 3739
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
stream.addSink(new RMQSink[String]("localhost", "hello", new StringToByteSerializer))
{% endhighlight %}
</div>
</div>
3740 3741 3742 3743 3744

More about RabbitMQ can be found [here](http://www.rabbitmq.com/).

[Back to top](#top)

3745 3746
### Twitter Streaming API

3747
Twitter Streaming API provides opportunity to connect to the stream of tweets made available by Twitter. Flink Streaming comes with a built-in `TwitterSource` class for establishing a connection to this stream. To use this connector, add the following dependency to your project:
3748 3749 3750 3751

{% highlight xml %}
<dependency>
  <groupId>org.apache.flink</groupId>
3752
  <artifactId>flink-connector-twitter</artifactId>
3753 3754 3755
  <version>{{site.version }}</version>
</dependency>
{% endhighlight %}
3756

3757 3758
Note that the streaming connectors are currently not part of the binary distribution. See linking with them for cluster execution [here](cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution).

3759 3760 3761 3762
#### Authentication
In order to connect to Twitter stream the user has to register their program and acquire the necessary information for the authentication. The process is described below.

#### Acquiring the authentication information
3763
First of all, a Twitter account is needed. Sign up for free at [twitter.com/signup](https://twitter.com/signup) or sign in at Twitter's [Application Management](https://apps.twitter.com/) and register the application by clicking on the "Create New App" button. Fill out a form about your program and accept the Terms and Conditions.
3764
After selecting the application, the API key and API secret (called `consumerKey` and `consumerSecret` in `TwitterSource` respectively) is located on the "API Keys" tab. The necessary OAuth Access Token data (`token` and `secret` in `TwitterSource`) can be generated and acquired on the "Keys and Access Tokens" tab.
3765
Remember to keep these pieces of information secret and do not push them to public repositories.
3766 3767

#### Accessing the authentication information
3768
Create a properties file, and pass its path in the constructor of `TwitterSource`. The content of the file should be similar to this:
3769

3770
~~~bash
3771 3772 3773 3774 3775 3776 3777 3778 3779 3780
#properties file for my app
secret=***
consumerSecret=***
token=***-***
consumerKey=***
~~~

#### Constructors
The `TwitterSource` class has two constructors.

3781
1. `public TwitterSource(String authPath, int numberOfTweets);`
3782
to emit a finite number of tweets
3783
2. `public TwitterSource(String authPath);`
3784 3785
for streaming

3786
Both constructors expect a `String authPath` argument determining the location of the properties file containing the authentication information. In the first case, `numberOfTweets` determines how many tweet the source emits.
3787 3788

#### Usage
3789
In contrast to other connectors, the `TwitterSource` depends on no additional services. For example the following code should run gracefully:
3790

3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
DataStream<String> streamSource = env.addSource(new TwitterSource("/PATH/TO/myFile.properties"));
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
streamSource = env.addSource(new TwitterSource("/PATH/TO/myFile.properties"))
{% endhighlight %}
</div>
</div>
3803

3804
The `TwitterSource` emits strings containing a JSON code.
3805
To retrieve information from the JSON code you can add a FlatMap or a Map function handling JSON code. For example, there is an implementation `JSONParseFlatMap` abstract class among the examples. `JSONParseFlatMap` is an extension of the `FlatMapFunction` and has a
3806

3807 3808 3809
<div class="codetabs" markdown="1">
<div data-lang="java" markdown="1">
{% highlight java %}
3810
String getField(String jsonText, String field);
3811 3812 3813 3814 3815 3816 3817 3818
{% endhighlight %}
</div>
<div data-lang="scala" markdown="1">
{% highlight scala %}
getField(jsonText : String, field : String) : String
{% endhighlight %}
</div>
</div>
3819

3820
function which can be use to acquire the value of a given field.
3821 3822 3823 3824

There are two basic types of tweets. The usual tweets contain information such as date and time of creation, id, user, language and many more details. The other type is the delete information.

#### Example
3825
`TwitterStream` is an example of how to use `TwitterSource`. It implements a language frequency counter program.
3826 3827 3828

[Back to top](#top)

3829
### Docker containers for connectors
3830

3831
A Docker container is provided with all the required configurations for test running the connectors of Apache Flink. The servers for the message queues will be running on the docker container while the example topology can be run on the user's computer.
3832 3833 3834

#### Installing Docker
The official Docker installation guide can be found [here](https://docs.docker.com/installation/).
3835 3836 3837
After installing Docker an image can be pulled for each connector. Containers can be started from these images where all the required configurations are set.

#### Creating a jar with all the dependencies
3838
For the easiest setup, create a jar with all the dependencies of the *flink-streaming-connectors* project.
3839

3840
~~~bash
3841
cd /PATH/TO/GIT/flink/flink-staging/flink-streaming-connectors
3842
mvn assembly:assembly
3843
~~~bash
3844

3845
This creates an assembly jar under *flink-streaming-connectors/target*.
3846 3847

#### RabbitMQ
3848
Pull the docker image:
3849

3850
~~~bash
3851
sudo docker pull flinkstreaming/flink-connectors-rabbitmq
3852
~~~
3853

3854
To run the container, type:
3855

3856
~~~bash
3857
sudo docker run -p 127.0.0.1:5672:5672 -t -i flinkstreaming/flink-connectors-rabbitmq
3858
~~~
3859

3860
Now a terminal has started running from the image with all the necessary configurations to test run the RabbitMQ connector. The -p flag binds the localhost's and the Docker container's ports so RabbitMQ can communicate with the application through these.
3861 3862 3863

To start the RabbitMQ server:

3864
~~~bash
3865
sudo /etc/init.d/rabbitmq-server start
3866
~~~
3867

3868
To launch the example on the host computer, execute:
3869

3870
~~~bash
3871 3872
java -cp /PATH/TO/JAR-WITH-DEPENDENCIES org.apache.flink.streaming.connectors.rabbitmq.RMQTopology \
> log.txt 2> errorlog.txt
3873
~~~
3874

3875
There are two connectors in the example. One that sends messages to RabbitMQ, and one that receives messages from the same queue. In the logger messages, the arriving messages can be observed in the following format:
3876

3877
~~~
3878
<DATE> INFO rabbitmq.RMQTopology: String: <one> arrived from RMQ
3879 3880 3881 3882
<DATE> INFO rabbitmq.RMQTopology: String: <two> arrived from RMQ
<DATE> INFO rabbitmq.RMQTopology: String: <three> arrived from RMQ
<DATE> INFO rabbitmq.RMQTopology: String: <four> arrived from RMQ
<DATE> INFO rabbitmq.RMQTopology: String: <five> arrived from RMQ
3883
~~~
3884 3885 3886 3887 3888

#### Apache Kafka

Pull the image:

3889
~~~bash
3890
sudo docker pull flinkstreaming/flink-connectors-kafka
3891
~~~
3892 3893 3894

To run the container type:

3895
~~~bash
3896 3897
sudo docker run -p 127.0.0.1:2181:2181 -p 127.0.0.1:9092:9092 -t -i \
flinkstreaming/flink-connectors-kafka
3898
~~~
3899

3900
Now a terminal has started running from the image with all the necessary configurations to test run the Kafka connector. The -p flag binds the localhost's and the Docker container's ports so Kafka can communicate with the application through these.
3901 3902
First start a zookeeper in the background:

3903
~~~bash
3904 3905
/kafka_2.9.2-0.8.1.1/bin/zookeeper-server-start.sh /kafka_2.9.2-0.8.1.1/config/zookeeper.properties \
> zookeeperlog.txt &
3906
~~~
3907 3908 3909

Then start the kafka server in the background:

3910
~~~bash
3911 3912
/kafka_2.9.2-0.8.1.1/bin/kafka-server-start.sh /kafka_2.9.2-0.8.1.1/config/server.properties \
 > serverlog.txt 2> servererr.txt &
3913
~~~
3914 3915 3916

To launch the example on the host computer execute:

3917
~~~bash
3918 3919
java -cp /PATH/TO/JAR-WITH-DEPENDENCIES org.apache.flink.streaming.connectors.kafka.KafkaTopology \
> log.txt 2> errorlog.txt
3920
~~~
3921 3922


3923
In the example there are two connectors. One that sends messages to Kafka, and one that receives messages from the same queue. In the logger messages, the arriving messages can be observed in the following format:
3924

3925
~~~
3926 3927 3928 3929 3930 3931 3932 3933 3934 3935
<DATE> INFO kafka.KafkaTopology: String: (0) arrived from Kafka
<DATE> INFO kafka.KafkaTopology: String: (1) arrived from Kafka
<DATE> INFO kafka.KafkaTopology: String: (2) arrived from Kafka
<DATE> INFO kafka.KafkaTopology: String: (3) arrived from Kafka
<DATE> INFO kafka.KafkaTopology: String: (4) arrived from Kafka
<DATE> INFO kafka.KafkaTopology: String: (5) arrived from Kafka
<DATE> INFO kafka.KafkaTopology: String: (6) arrived from Kafka
<DATE> INFO kafka.KafkaTopology: String: (7) arrived from Kafka
<DATE> INFO kafka.KafkaTopology: String: (8) arrived from Kafka
<DATE> INFO kafka.KafkaTopology: String: (9) arrived from Kafka
3936
~~~
3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960


[Back to top](#top)

Program Packaging & Distributed Execution
-----------------------------------------

See [the relevant section of the DataSet API documentation](programming_guide.html#program-packaging-and-distributed-execution).

[Back to top](#top)

Parallel Execution
------------------

See [the relevant section of the DataSet API documentation](programming_guide.html#parallel-execution).

[Back to top](#top)

Execution Plans
---------------

See [the relevant section of the DataSet API documentation](programming_guide.html#execution-plans).

[Back to top](#top)