# Flink DataStream API Programming Guide

DataStream programs in Flink are regular programs that implement transformations on data streams (e.g., filtering, updating state, defining windows, aggregating). The data streams are initially created from various sources (e.g., message queues, socket streams, files). Results are returned via sinks, which may for example write the data to files, or to standard output (for example the command line terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs. The execution can happen in a local JVM, or on clusters of many machines.

Please see [basic concepts](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/api_concepts.html) for an introduction to the basic concepts of the Flink API.

In order to create your own Flink DataStream program, we encourage you to start with [anatomy of a Flink Program](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/api_concepts.html#anatomy-of-a-flink-program) and gradually add your own [stream transformations](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/index.html). The remaining sections act as references for additional operations and advanced features.

## Example Program

The following program is a complete, working example of a streaming windowed word count application that counts the words coming from a web socket in 5-second windows. You can copy & paste the code to run it locally.
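A minimal Java version of such a windowed word count, built only on the standard DataStream API, might look like the following (the class and helper names are illustrative):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowWordCount {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Count words arriving on the socket in tumbling 5-second windows.
        DataStream<Tuple2<String, Integer>> counts = env
                .socketTextStream("localhost", 9999)
                .flatMap(new Splitter())
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .sum(1);

        counts.print();

        env.execute("Window WordCount");
    }

    // Splits each input line into (word, 1) pairs.
    public static class Splitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) {
            for (String word : sentence.split(" ")) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
}
```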
To run the example program, start the input stream with netcat first from a terminal: `nc -lk 9999`

Just type some words, hitting return for a new word. These will be the input to the word count program. If you want to see counts greater than 1, type the same word again and again within 5 seconds (increase the window size from 5 seconds if you cannot type that fast ☺).

## Data Sources

Sources are where your program reads its input from. You can attach a source to your program by using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented source functions, but you can always write your own custom sources by implementing the `SourceFunction` for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the `RichParallelSourceFunction` for parallel sources.

There are several predefined stream sources accessible from the `StreamExecutionEnvironment` (a short usage sketch follows the list below):

File-based:

* `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line-by-line and returns them as Strings.

* `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the specified file input format.

* `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - This is the method called internally by the two previous ones. It reads files in the `path` based on the given `fileInputFormat`. Depending on the provided `watchType`, this source may periodically monitor (every `interval` ms) the path for new data (`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process the data currently in the path once and exit (`FileProcessingMode.PROCESS_ONCE`). Using the `pathFilter`, the user can further exclude files from being processed.

    _IMPLEMENTATION:_

    Under the hood, Flink splits the file reading process into two sub-tasks, namely _directory monitoring_ and _data reading_. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, **non-parallel** (parallelism = 1) task, while reading is performed by multiple tasks running in parallel. The parallelism of the latter is equal to the job parallelism. The role of the single monitoring task is to scan the directory (periodically or only once, depending on the `watchType`), find the files to be processed, divide them in _splits_, and assign these splits to the downstream readers. The readers are the ones who will read the actual data. Each split is read by only one reader, while a reader can read multiple splits, one-by-one.

    _IMPORTANT NOTES:_

    1. If the `watchType` is set to `FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its contents are re-processed entirely. This can break the "exactly-once" semantics, as appending data at the end of a file will lead to **all** of its contents being re-processed.
    2. If the `watchType` is set to `FileProcessingMode.PROCESS_ONCE`, the source scans the path **once** and exits, without waiting for the readers to finish reading the file contents. Of course the readers will continue reading until all file contents are read. Closing the source leads to no more checkpoints after that point. This may lead to slower recovery after a node failure, as the job will resume reading from the last checkpoint.

Socket-based:

* `socketTextStream` - Reads from a socket. Elements can be separated by a delimiter.

Collection-based:

* `fromCollection(Collection)` - Creates a data stream from the Java `java.util.Collection`. All elements in the collection must be of the same type.

* `fromCollection(Iterator, Class)` - Creates a data stream from an iterator. The class specifies the data type of the elements returned by the iterator.

* `fromElements(T ...)` - Creates a data stream from the given sequence of objects. All objects must be of the same type.

* `fromParallelCollection(SplittableIterator, Class)` - Creates a data stream from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.

* `generateSequence(from, to)` - Generates the sequence of numbers in the given interval, in parallel.

Custom:

* `addSource` - Attach a new source function. For example, to read from Apache Kafka you can use `addSource(new FlinkKafkaConsumer08<>(...))`. See [connectors](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/index.html) for more details.
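A short sketch of a few of these predefined sources (the file paths and the socket host/port are illustrative placeholders):

```java
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class SourceExamples {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read a text file once, line by line.
        DataStream<String> lines = env.readTextFile("file:///tmp/input.txt");

        // Monitor a directory every 10 seconds and re-process files as they appear or change.
        DataStream<String> monitored = env.readFile(
                new TextInputFormat(new Path("file:///tmp/dir")),
                "file:///tmp/dir",
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                10000L);

        // Read newline-delimited records from a socket.
        DataStream<String> fromSocket = env.socketTextStream("localhost", 9999);

        // Generate the numbers 1..100 in parallel.
        DataStream<Long> numbers = env.generateSequence(1, 100);

        lines.print();
        monitored.print();
        fromSocket.print();
        numbers.print();

        env.execute("Predefined sources");
    }
}
```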
The Scala API offers the same predefined sources; the collection-based and element-based variants take Scala types:

* `fromCollection(Seq)` - Creates a data stream from a collection. All elements in the collection must be of the same type.

* `fromCollection(Iterator)` - Creates a data stream from an iterator.
The class specifies the data type of the elements returned by the iterator.

* `fromElements(elements: _*)` - Creates a data stream from the given sequence of objects. All objects must be of the same type.

* `fromParallelCollection(SplittableIterator)` - Creates a data stream from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.

## DataStream Transformations

Please see [operators](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/index.html) for an overview of the available stream transformations.

## Data Sinks

Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them. Flink comes with a variety of built-in output formats that are encapsulated behind operations on the DataStreams (a short usage sketch follows the list below):

* `writeAsText()` / `TextOutputFormat` - Writes elements line-wise as Strings. The Strings are obtained by calling the _toString()_ method of each element.

* `writeAsCsv(...)` / `CsvOutputFormat` - Writes tuples as comma-separated value files. Row and field delimiters are configurable. The value for each field comes from the _toString()_ method of the objects.

* `print()` / `printToErr()` - Prints the _toString()_ value of each element on the standard out / standard error stream. Optionally, a prefix (msg) can be provided which is prepended to the output. This can help to distinguish between different calls to _print_. If the parallelism is greater than 1, the output will also be prepended with the identifier of the task which produced the output.

* `writeUsingOutputFormat()` / `FileOutputFormat` - Method and base class for custom file outputs. Supports custom object-to-bytes conversion.

* `writeToSocket` - Writes elements to a socket according to a `SerializationSchema`.

* `addSink` - Invokes a custom sink function. Flink comes bundled with connectors to other systems (such as Apache Kafka) that are implemented as sink functions.
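A minimal sketch of two of the built-in sinks above (the output path is an illustrative placeholder):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BuiltInSinkExamples {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Long> numbers = env.generateSequence(1, 100);

        // Write each element's toString() value as one line of a text file.
        numbers.writeAsText("file:///tmp/numbers.txt");

        // Print each element's toString() value to standard out.
        numbers.print();

        env.execute("Built-in sinks");
    }
}
```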
Note that the `write*()` methods on `DataStream` are mainly intended for debugging purposes. They do not participate in Flink's checkpointing, which means these functions usually have at-least-once semantics. The data flushing to the target system depends on the implementation of the OutputFormat. This means that not all elements sent to the OutputFormat are immediately showing up in the target system. Also, in failure cases, those records might be lost.

For reliable, exactly-once delivery of a stream into a file system, use the `flink-connector-filesystem`. Also, custom implementations through the `.addSink(...)` method can participate in Flink's checkpointing for exactly-once semantics (a trivial custom sink is sketched below).
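As a sketch of the `.addSink(...)` route, a trivial custom sink might look like the following (a production sink would typically also integrate with checkpointing):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class CustomSinkExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Long> numbers = env.generateSequence(1, 10);

        // A minimal custom sink: forwards every record to standard error.
        numbers.addSink(new RichSinkFunction<Long>() {
            @Override
            public void invoke(Long value, Context context) {
                System.err.println("sink received: " + value);
            }
        });

        env.execute("Custom sink");
    }
}
```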
## Iterations

Iterative streaming programs implement a step function and embed it into an `IterativeStream`. As a DataStream program may never finish, there is no maximum number of iterations. Instead, you need to specify which part of the stream is fed back to the iteration and which part is forwarded downstream using a `split` transformation or a `filter`. Here, we show an example using filters. First, we define an `IterativeStream` with `iterate()`. Then, we specify the logic that will be executed inside the loop using a series of transformations (here a simple `map` transformation). To close an iteration and define the iteration tail, call the `closeWith(feedbackStream)` method of the `IterativeStream`. The DataStream given to the `closeWith` function will be fed back to the iteration head. A common pattern is to use a filter to separate the part of the stream that is fed back and the part of the stream which is propagated forward. These filters can, e.g., define the "termination" logic, where an element is allowed to propagate downstream rather than being fed back.

For example, here is a program that continuously subtracts 1 from a series of integers until they reach zero:
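A minimal Java sketch of that program (stream and variable names are illustrative):

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IterationExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Long> someIntegers = env.generateSequence(0, 1000);

        // Define the iteration head.
        IterativeStream<Long> iteration = someIntegers.iterate();

        // The loop body: subtract one in every pass.
        DataStream<Long> minusOne = iteration.map(new MapFunction<Long, Long>() {
            @Override
            public Long map(Long value) {
                return value - 1;
            }
        });

        // Elements that are still greater than zero are fed back into the iteration.
        DataStream<Long> stillGreaterThanZero = minusOne.filter(new FilterFunction<Long>() {
            @Override
            public boolean filter(Long value) {
                return value > 0;
            }
        });
        iteration.closeWith(stillGreaterThanZero);

        // Elements that reached zero leave the loop and are forwarded downstream.
        DataStream<Long> lessThanZero = minusOne.filter(new FilterFunction<Long>() {
            @Override
            public boolean filter(Long value) {
                return value <= 0;
            }
        });
        lessThanZero.print();

        env.execute("Streaming iteration");
    }
}
```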
## Execution Parameters

The `StreamExecutionEnvironment` contains the `ExecutionConfig` which allows setting job-specific configuration values for the runtime.

Please refer to [execution configuration](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/execution_configuration.html) for an explanation of most parameters. These parameters pertain specifically to the DataStream API:

* `setAutoWatermarkInterval(long milliseconds)`: Set the interval for automatic watermark emission. You can get the current value with `long getAutoWatermarkInterval()`.

### Fault Tolerance

[State & Checkpointing](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/checkpointing.html) describes how to enable and configure Flink's checkpointing mechanism.

### Controlling Latency

By default, elements are not transferred on the network one-by-one (which would cause unnecessary network traffic) but are buffered. The size of the buffers (which are actually transferred between machines) can be set in the Flink config files. While this method is good for optimizing throughput, it can cause latency issues when the incoming stream is not fast enough. To control throughput and latency, you can use `env.setBufferTimeout(timeoutMillis)` on the execution environment (or on individual operators, e.g. `env.generateSequence(1,10).map(myMap).setBufferTimeout(timeoutMillis)`) to set a maximum wait time for the buffers to fill up. After this time, the buffers are sent automatically even if they are not full. The default value for this timeout is 100 ms.

To maximize throughput, set `setBufferTimeout(-1)`, which removes the timeout; buffers will then only be flushed when they are full. To minimize latency, set the timeout to a value close to 0 (for example 5 or 10 ms). A buffer timeout of 0 should be avoided, because it can cause severe performance degradation.

## Debugging

Before running a streaming program in a distributed cluster, it is a good idea to make sure that the implemented algorithm works as desired. Hence, implementing data analysis programs is usually an incremental process of checking results, debugging, and improving.

Flink provides features to significantly ease the development process of data analysis programs by supporting local debugging from within an IDE, injection of test data, and collection of result data. This section gives some hints on how to ease the development of Flink programs.

### Local Execution Environment

A `LocalStreamEnvironment` starts a Flink system within the same JVM process it was created in. If you start the LocalEnvironment from an IDE, you can set breakpoints in your code and easily debug your program. A LocalEnvironment is created and used as follows:
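This is a minimal sketch, assuming the standard `createLocalEnvironment()` factory (the pipeline itself is illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalDebuggingExample {

    public static void main(String[] args) throws Exception {
        // Starts an embedded Flink runtime in this JVM; breakpoints set in
        // user functions will be hit when running this from an IDE.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

        env.generateSequence(1, 10).print();

        env.execute("Local debugging");
    }
}
```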
### Collection Data Sources

Flink provides special data sources which are backed by Java collections to ease testing. Once a program has been tested, the sources and sinks can be easily replaced by sources and sinks that read from / write to external systems. Collection data sources can be used as shown in the combined sketch at the end of this page.

**Note:** Currently, the collection data source requires that data types and iterators implement `Serializable`. Furthermore, collection data sources cannot be executed in parallel (parallelism = 1).

### Iterator Data Sink

Flink also provides a sink to collect DataStream results for testing and debugging purposes, via `DataStreamUtils.collect(...)`; it is also shown in the combined sketch at the end of this page.

**Note:** The `flink-streaming-contrib` module was removed in Flink 1.5.0. Its classes have been moved into `flink-streaming-java` and `flink-streaming-scala`.

## Where to go next?

* [Operators](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/index.html): Specification of available streaming operators.
* [Event Time](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html): Introduction to Flink's notion of time.
* [State & Fault Tolerance](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/index.html): Explanation of how to develop stateful applications.
* [Connectors](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/index.html): Description of available input and output connectors.
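For completeness, here is the combined sketch referenced in the Debugging section above: a collection-backed test source read back through the iterator data sink. The element values and the `DataStreamUtils` import location are assumptions based on the classes mentioned above.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.flink.contrib.streaming.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CollectionSourceAndIteratorSink {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

        // Collection-based test input; the elements must all be of the same type.
        List<String> input = Arrays.asList("to", "be", "or", "not", "to", "be");
        DataStream<String> stream = env.fromCollection(input);

        // Collect the results back into the client program for inspection;
        // collect(...) runs the job and returns the results as an iterator.
        Iterator<String> results = DataStreamUtils.collect(stream);
        while (results.hasNext()) {
            System.out.println(results.next());
        }
    }
}
```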