提交 dd3a99bf 编写于 2019-02-17 15:32:56,作者:wizardforcel

上级 fcde1692
......@@ -17,7 +17,7 @@ JobManager协调每个Flink部署。它负责_调度_和_资源管理_。
例如,请考虑以下三个JobManager实例的设置:
![](img/jobmanager_ha_overview.png)
![](../img/jobmanager_ha_overview.png)
### 配置
......
......@@ -45,7 +45,7 @@
此持续时间是在最新检查点结束和下一个检查点开始之间必须经过的最小时间间隔。下图说明了这对检查点的影响。
![插图检查点之间的最小时间参数如何影响检查点行为。](img/checkpoint_tuning.svg)
![插图检查点之间的最小时间参数如何影响检查点行为。](../img/checkpoint_tuning.svg)
_注意:_可以配置应用程序(通过`CheckpointConfig`)以允许多个检查点同时进行。对于Flink中具有大状态的应用程序,这通常会将太多资源绑定到检查点。当手动触发保存点时,它可能正在与正在进行的检查点同时进行。
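作为补充说明,下面给出一个最小的示意代码(其中的检查点间隔与暂停时间均为假设值),演示如何通过`CheckpointConfig`同时设置检查点之间的最小暂停时间和并发检查点数量:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointPauseExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 每 60 秒触发一次检查点(间隔为假设值)
        env.enableCheckpointing(60_000);

        // 两次检查点之间至少间隔 30 秒,避免检查点持续占用资源
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // 默认只允许一个检查点同时进行;确有需要时才放宽
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
    }
}
```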
......@@ -186,7 +186,7 @@ Flink为所有检查点和保存点提供可选压缩(默认:关闭)。目
请注意,根据所选的状态后端和检查点策略,每个检查点可能会产生一些额外开销,用于创建和存储辅助的本地状态副本。例如,在大多数情况下,实现只是简单地将写入分布式存储的内容同时复制到本地文件。
![带任务本地恢复的检查点示意图。](img/local_recovery.png)
![带任务本地恢复的检查点示意图。](../img/local_recovery.png)
### 主(分布式存储)和辅助(任务 - 本地)状态SNAPSHOT的关系
......
......@@ -1611,5 +1611,5 @@ Flink通过将程序拆分为子任务并将这些子任务调度到处理槽来
启动Flink应用程序时,用户可以指定该作业默认使用的插槽数,对应的命令行参数为`-p`(表示并行度)。此外,可以为整个应用程序和各个算子[在编程API中设置插槽数](https://flink.sojb.cn/dev/parallel.html)。
![](img/slots_parallelism.svg)
![](../img/slots_parallelism.svg)
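下面是一个最小的示意代码(数据源与具体并行度数值均为假设),展示与命令行`-p`等效的默认并行度设置,以及为单个算子单独指定并行度的方式:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 为整个应用设置默认并行度(相当于命令行参数 -p)
        env.setParallelism(3);

        env.socketTextStream("localhost", 9999)   // 假设的数据源
           .map(String::toUpperCase)
           .setParallelism(6)                     // 为单个算子单独设置并行度
           .print();

        env.execute("parallelism-example");
    }
}
```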
......@@ -13,7 +13,7 @@
为了获得更大的灵活性,可以单独启用和配置内部和外部连接的安全性。
![内部和外部连接](img/ssl_internal_external.svg)
![内部和外部连接](../img/ssl_internal_external.svg)
#### 内部连接
......
......@@ -32,7 +32,7 @@ Flink的Web界面提供了一个监视作业检查点的选项卡。作业终止
检查点历史记录保存有关最近触发的检查点的统计信息,包括当前正在进行的检查点。
<center>![检查点监控:历史](img/checkpoint_monitoring-history.png)</center>
<center>![检查点监控:历史](../img/checkpoint_monitoring-history.png)</center>
* **ID**:触发的检查点的ID。每个检查点的ID从1开始递增。
* **状态**:检查点的当前状态:_正在进行中_、_已完成_或_失败_。如果触发的检查点是保存点,您还会看到一个保存点符号。
......@@ -62,7 +62,7 @@ web.checkpoints.history: 15
摘要针对所有已完成的检查点,计算端到端持续时间、状态大小以及对齐期间缓冲字节数的简单最小/平均/最大统计数据(这些指标的含义详见[历史记录](#history))。
<center>![检查点监控:摘要](img/checkpoint_monitoring-summary.png)</center>
<center>![检查点监控:摘要](../img/checkpoint_monitoring-summary.png)</center>
请注意,这些统计信息不会在JobManager丢失后保留;如果JobManager发生故障转移,这些统计将被重置。
......@@ -81,13 +81,13 @@ web.checkpoints.history: 15
单击检查点的“ _更多详细信息”_链接时,将获得所有 算子的“最小/平均/最大”摘要以及每个子任务的详细数字。
<center>![检查点监控:详细信息](img/checkpoint_monitoring-details.png)</center>
<center>![检查点监控:详细信息](../img/checkpoint_monitoring-details.png)</center>
#### 每个算子摘要
<center>![检查点监控:详细信息摘要](img/checkpoint_monitoring-details_summary.png)</center>
<center>![检查点监控:详细信息摘要](../img/checkpoint_monitoring-details_summary.png)</center>
#### 所有子任务统计
<center>![检查点监控:子任务](img/checkpoint_monitoring-details_subtasks.png)</center>
<center>![检查点监控:子任务](../img/checkpoint_monitoring-details_subtasks.png)</center>
......@@ -17,7 +17,7 @@ Flink的Web界面提供了一个选项卡来监控正在运行的作业的背压
背压监测通过反复获取正在运行任务的堆栈跟踪样本来工作。JobManager会对作业的各个任务反复触发`Thread.getStackTrace()`调用。
![](img/back_pressure_sampling.png)
![](../img/back_pressure_sampling.png)
如果样本显示某个任务线程卡在某个内部方法调用中(向网络堆栈请求缓冲区),则表示该任务存在背压。
......@@ -47,13 +47,13 @@ Flink的Web界面提供了一个选项卡来监控正在运行的作业的背压
请注意,单击该行可触发此 算子的所有子任务的样本。
![](img/back_pressure_sampling_in_progress.png)
![](../img/back_pressure_sampling_in_progress.png)
### 背压状态
如果任务的状态显示为 **OK**,则表示没有背压迹象。反之,**HIGH** 意味着任务承受背压。
![](img/back_pressure_sampling_ok.png)
![](../img/back_pressure_sampling_ok.png)
![](img/back_pressure_sampling_high.png)
![](../img/back_pressure_sampling_high.png)
......@@ -17,6 +17,6 @@
您可以单击图中的组件以了解更多信息。
<center>![Apache Flink:Stack](img/stack.png)</center>
<center>![Apache Flink:Stack](../img/stack.png)</center>
<map name="overview-stack"><area id="lib-datastream-cep" title="CEP: Complex Event Processing" href="https://flink.sojb.cn/dev/libs/cep.html" shape="rect" coords="63,0,143,177"> <area id="lib-datastream-table" title="Table: Relational DataStreams" href="https://flink.sojb.cn/dev/table_api.html" shape="rect" coords="143,0,223,177"> <area id="lib-dataset-ml" title="FlinkML: Machine Learning" href="https://flink.sojb.cn/dev/libs/ml/index.html" shape="rect" coords="382,2,462,176"> <area id="lib-dataset-gelly" title="Gelly: Graph Processing" href="https://flink.sojb.cn/dev/libs/gelly/index.html" shape="rect" coords="461,0,541,177"> <area id="lib-dataset-table" title=" Table API and SQL" href="https://flink.sojb.cn/dev/table_api.html" shape="rect" coords="544,0,624,177"> <area id="datastream" title="DataStream API" href="https://flink.sojb.cn/dev/datastream_api.html" shape="rect" coords="64,177,379,255"> <area id="dataset" title="DataSet API" href="https://flink.sojb.cn/dev/batch/index.html" shape="rect" coords="382,177,697,255"> <area id="runtime" title="Runtime" href="https://flink.sojb.cn/concepts/runtime.html" shape="rect" coords="63,257,700,335"> <area id="local" title="Local" href="https://flink.sojb.cn/tutorials/local_setup.html" shape="rect" coords="62,337,275,414"> <area id="cluster" title="Cluster" href="https://flink.sojb.cn/ops/deployment/cluster_setup.html" shape="rect" coords="273,336,486,413"> <area id="cloud" title="Cloud" href="https://flink.sojb.cn/ops/deployment/gce_setup.html" shape="rect" coords="485,336,700,414"></map>
\ No newline at end of file
......@@ -29,7 +29,7 @@ Flink的容错机制的核心部分是绘制分布式数据流和算子状态的
Flink分布式SNAPSHOT的核心元素是_流屏障_。这些屏障被注入数据流,并作为数据流的一部分与记录一起流动。屏障永远不会超越记录,而是严格按顺序流动。屏障将数据流中的记录分为进入当前SNAPSHOT的记录集和进入下一个SNAPSHOT的记录集。每个屏障都携带其前面记录所属SNAPSHOT的ID。屏障不会中断流的流动,因此非常轻量。来自不同SNAPSHOT的多个屏障可以同时存在于流中,这意味着多个SNAPSHOT可能并发进行。
![数据流中的检查点障碍](img/stream_barriers.svg)
![数据流中的检查点障碍](../img/stream_barriers.svg)
流屏障在流源处被注入并行数据流中。SNAPSHOT_n_的屏障被注入的点(我们称之为_S<sub>n</sub>_)就是源流中该SNAPSHOT所覆盖数据的截止位置。例如,在Apache Kafka中,此位置就是分区中最后一条记录的偏移量。该位置_S<sub>n</sub>_会被报告给_检查点协调器_(Flink的JobManager)。
......@@ -37,7 +37,7 @@ Flink分布式SNAPSHOT的核心数据元是_流障碍_。这些障碍被注入
一旦SNAPSHOT_n_完成,作业将不再向源请求_S<sub>n</sub>_之前的记录,因为此时这些记录(及其衍生记录)已经通过了整个数据流拓扑。
![在具有多个输入的 算子处对齐数据流](img/stream_aligning.svg)
![在具有多个输入的 算子处对齐数据流](../img/stream_aligning.svg)
接收多个输入流的 算子需要在SNAPSHOT屏障上_对齐_输入流。上图说明了这一点:
......@@ -60,7 +60,7 @@ Flink分布式SNAPSHOT的核心数据元是_流障碍_。这些障碍被注入
* 对于每个并行流数据源,启动SNAPSHOT时流中的偏移/位置
* 对于每个 算子,指向作为SNAPSHOT的一部分存储的状态的指针
![检查点机制的例证](img/checkpointing.svg)
![检查点机制的例证](../img/checkpointing.svg)
### 完全一次与至少一次
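下面给出一个示意代码(检查点间隔为假设值),展示在启用检查点时如何在这两种语义之间进行选择;具体取舍见本节的说明:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 默认语义:完全一次(需要屏障对齐)
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // 如果可以接受记录被重复处理,可以改用至少一次,跳过对齐以降低延迟
        // env.enableCheckpointing(10_000, CheckpointingMode.AT_LEAST_ONCE);
    }
}
```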
......
......@@ -13,7 +13,7 @@ Flink中的执行资源通过_任务槽_定义。每个TaskManager都有一个
下图说明了这一点。考虑一个包含数据源、_MapFunction_ 和 _ReduceFunction_ 的程序。源和MapFunction以4的并行度执行,而ReduceFunction以3的并行度执行。管道由 Source - Map - Reduce 这一序列组成。在具有2个TaskManager、每个有3个插槽的集群上,程序将按如下方式执行。
![将任务管道分配给插槽](img/slots.svg)
![将任务管道分配给插槽](../img/slots.svg)
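下面是一段与上述描述对应的示意代码(输入路径为假设值):Source 和 Map 以默认并行度 4 执行,而类似 Reduce 的聚合被显式设置为并行度 3:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // Source 与 MapFunction 的并行度

        env.readTextFile("hdfs:///path/to/input")    // 假设的输入路径
           .map(line -> line.toLowerCase())          // MapFunction,继承并行度 4
           .keyBy(line -> line)
           .reduce((a, b) -> a + "|" + b)            // ReduceFunction
           .setParallelism(3)                        // 显式设置为并行度 3
           .print();

        env.execute("slots-example");
    }
}
```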
在内部,Flink通过[SlotSharingGroup](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/jobmanager/scheduler/SlotSharingGroup.java)和[CoLocationGroup](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/jobmanager/scheduler/CoLocationGroup.java)来定义哪些任务可以共享一个插槽(宽松约束),以及哪些任务必须被严格放置到同一个插槽中。
......@@ -25,7 +25,7 @@ Flink中的执行资源通过_任务槽_定义。每个TaskManager都有一个
JobManager将JobGraph转换为[ExecutionGraph](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/)。ExecutionGraph是JobGraph的并行版本:对于每个JobVertex,它包含每个并行子任务的[ExecutionVertex](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionVertex.java)。并行度为100的 算子将具有一个JobVertex和100个ExecutionVertices。ExecutionVertex跟踪特定子任务的执行状态。来自一个JobVertex所有ExecutionVertices都保存在 [ExecutionJobVertex中](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionJobVertex.java),它跟踪整个算子的状态。除了顶点之外,ExecutionGraph还包含[IntermediateResult](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/IntermediateResult.java)[IntermediateResultPartition](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/IntermediateResultPartition.java)。前者跟踪_IntermediateDataSet_的状态,后者是每个分区的状态。
![JobGraph和ExecutionGraph](img/job_and_execution_graph.svg)
![JobGraph和ExecutionGraph](../img/job_and_execution_graph.svg)
每个ExecutionGraph都有一个与之关联的作业状态。此作业状态指示作业执行的当前状态。
......@@ -35,8 +35,8 @@ Flink作业首先处于_创建_状态,然后切换到_运行,_并在完成
与表示全局终止状态并因此触发作业清理的_完成_、_取消_和_失败_状态不同,_暂停_状态仅是本地终止的。本地终止意味着作业的执行已在相应的JobManager上终止,但Flink集群的另一个JobManager可以从持久化的HA存储中检索该作业并重新启动它。因此,到达_暂停_状态的作业不会被完全清理。
![Flink作业的状态和转换](img/job_status.svg)
![Flink作业的状态和转换](../img/job_status.svg)
在执行ExecutionGraph期间,每个并行任务都会经历多个阶段,从_创建_到_完成_或_失败_。下图说明了这些状态以及它们之间可能的转换。任务可能会被执行多次(例如,在故障恢复过程中)。因此,ExecutionVertex的每次执行都由一个[Execution](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java)来跟踪。每个ExecutionVertex都有一个当前的Execution以及若干先前的Execution。
![任务执行的状态和转变](img/state_machine.svg)
\ No newline at end of file
![任务执行的状态和转变](../img/state_machine.svg)
\ No newline at end of file
......@@ -34,7 +34,7 @@ Flink 在流处理节目中支持不同的_时间_概念。
在内部,_摄取时间_的处理方式与_事件时间_非常相似,但具有自动的时间戳分配和自动的水印生成功能。
![](img/times_clocks.svg)
![](../img/times_clocks.svg)
### 设定时间特征
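下面是一个最小的示意代码,展示如何为整个作业设置时间特征;选择哪一种取决于应用的需求:

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeCharacteristicExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 默认是处理时间;这里切换为事件时间
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
        // env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
    }
}
```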
......@@ -112,11 +112,11 @@ Flink中用于衡量事件时间进度的机制是**水印**。水印作为数
下图显示了带有(逻辑)时间戳的事件流,以及内联流动的水印。在该示例中,事件(就其时间戳而言)是有序到达的,这意味着水印只是流中的周期性标记。
![包含事件(按顺序)和水印的数据流](img/stream_watermark_in_order.svg)
![包含事件(按顺序)和水印的数据流](../img/stream_watermark_in_order.svg)
水印对于_乱序_流至关重要,如下图所示,其中的事件并未按时间戳排序。一般而言,水印是一种声明:到流中的这一点为止,某个时间戳之前的所有事件都应当已经到达。一旦水印到达某个算子,该算子就可以将其内部_事件时间时钟_推进到水印的值。
![包含事件(乱序)和水印的数据流](img/stream_watermark_out_of_order.svg)
![包含事件(乱序)和水印的数据流](../img/stream_watermark_out_of_order.svg)
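下面是一个示意代码(事件类型、时间戳字段以及最大乱序程度均为假设),展示为乱序流生成周期性水印的一种常见方式:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WatermarkExample {
    // 假设的事件类型,带有一个毫秒时间戳字段
    public static class MyEvent {
        public long timestamp;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<MyEvent> events = env.fromElements(new MyEvent());

        // 假设事件最多乱序 5 秒:水印 = 目前观察到的最大时间戳 - 5 秒
        events.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(MyEvent event) {
                        return event.timestamp;
                    }
                })
              .print();

        env.execute("watermark-example");
    }
}
```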
请注意,事件时间由新生成的流数据元(或多个数据元)继承,这些数据元来自生成它们的事件或触发创建这些数据元的水印。
......@@ -130,7 +130,7 @@ Flink中用于衡量事件时间进度的机制是**水印**。水印作为数
下图显示了流经并行流的事件和水印的示例,以及跟踪事件时间的 算子。
![具有事件和水印的并行数据流和 算子](img/parallel_streams_watermarks.svg)
![具有事件和水印的并行数据流和 算子](../img/parallel_streams_watermarks.svg)
请注意,Kafka源支持每分区水印,您可以[在此处](https://flink.sojb.cn/dev/event_timestamps_watermarks.html#timestamps-per-kafka-partition)详细了解。
......
......@@ -343,5 +343,5 @@ val stream: DataStream[MyType] = env.addSource(kafkaSource)
![生成具有Kafka分区感知的水印](img/parallel_kafka_watermarks.svg)
![生成具有Kafka分区感知的水印](../img/parallel_kafka_watermarks.svg)
......@@ -9,7 +9,7 @@
Flink提供不同级别的抽象来开发流/批处理应用程序。
![编程抽象级别](img/levels_of_abstraction.svg)
![编程抽象级别](../img/levels_of_abstraction.svg)
* 最低级抽象只提供**有状态流**。它 通过[Process Function](https://flink.sojb.cn/dev/stream/operators/process_function.html)嵌入到[DataStream API中](https://flink.sojb.cn/dev/datastream_api.html)。它允许用户自由处理来自一个或多个流的事件,并使用一致的容错_状态_。此外,用户可以注册事件时间和处理时间回调,允许程序实现复杂的计算。[](https://flink.sojb.cn/dev/stream/operators/process_function.html)
......@@ -29,7 +29,7 @@ Flink程序的基本构建块是**流**和**转换**。(请注意,Flink的Da
执行时,Flink程序被映射为**流式数据流**,由**流**和转换**算子**组成。每个数据流都以一个或多个**源**开始,并以一个或多个**接收器**结束。数据流类似于任意的**有向无环图**_(DAG)_。尽管通过_迭代_结构允许特殊形式的循环,但为了简单起见,我们在大多数情况下不展开讨论。
![DataStream程序及其数据流。](img/program_dataflow.svg)
![DataStream程序及其数据流。](../img/program_dataflow.svg)
通常,程序中的转换与数据流中的 算子之间存在一对一的对应关系。但是,有时一个转换可能包含多个转换 算子。
......@@ -41,7 +41,7 @@ Flink中的程序本质上是并行和分布式的。在执行期间,_流_具
算子子任务的数量是该特定 算子的**并行**度。流的并行性始终是其生成 算子的并行性。同一程序的不同 算子可能具有不同的并行级别。
![并行数据流](img/parallel_dataflow.svg)
![并行数据流](../img/parallel_dataflow.svg)
流可以_以一对一_(或_转发_)模式或以_重新分发_模式在两个算子之间传输数据:
......@@ -57,7 +57,7 @@ Flink中的程序本质上是并行和分布式的。在执行期间,_流_具
窗口可以是_时间驱动的_(例如:每30秒)或_数据驱动的_(例如:每100个数据元)。通常会区分不同类型的窗口,例如_翻滚窗口_(无重叠)、_滑动窗口_(有重叠)和_会话窗口_(由不活动间隙分隔)。
![时间和计数Windows](img/windows.svg)
![时间和计数Windows](../img/windows.svg)
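下面是一个示意代码(事件类型与字段均为假设),分别展示时间驱动的翻滚窗口和数据驱动的计数窗口:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowKindsExample {
    // 假设的事件类型
    public static class Click {
        public String user = "alice";
        public long count = 1L;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Click> clicks = env.fromElements(new Click());

        // 时间驱动:每 30 秒一个翻滚窗口
        clicks.keyBy(c -> c.user)
              .timeWindow(Time.seconds(30))
              .sum("count")
              .print();

        // 数据驱动:每 100 个数据元一个计数窗口
        clicks.keyBy(c -> c.user)
              .countWindow(100)
              .sum("count")
              .print();

        env.execute("window-kinds");
    }
}
```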
更多窗口示例可以在此[博客文章中](https://flink.apache.org/news/2015/12/04/Introducing-windows.html)找到。更多详细信息在[窗口文档中](https://flink.sojb.cn/dev/stream/operators/windows.html)
......@@ -71,7 +71,7 @@ Windows可以是_时间驱动的_(例如:每30秒)或_数据驱动_(例
* **处理时间**是执行基于时间的 算子操作的每个算子的本地时间。
![事件时间,摄取时间和处理时间](img/event_ingestion_processing_time.svg)
![事件时间,摄取时间和处理时间](../img/event_ingestion_processing_time.svg)
有关如何处理时间的更多详细信息,请参阅[事件时间文档](https://flink.sojb.cn/dev/event_time.html)
......@@ -81,7 +81,7 @@ Windows可以是_时间驱动的_(例如:每30秒)或_数据驱动_(例
状态 算子操作的状态保持在可以被认为是嵌入式键/值存储的状态中。状态被分区并严格地与有状态算子读取的流一起分发。因此,只有在_keyBy()_函数之后才能在_被Key化的数据流_上访问键/值状态,并且限制为与当前事件的键相关联的值。对齐流和状态的Keys可确保所有状态更新都是本地 算子操作,从而保证一致性而无需事务开销。此对齐还允许Flink重新分配状态并透明地调整流分区。
![状态和分区](img/state_partitioning.svg)
![状态和分区](../img/state_partitioning.svg)
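下面是一个示意代码(输入类型与字段均为假设),展示如何在_被Key化的数据流_上通过键控`ValueState`为每个Key维护一个累计和;状态的访问范围天然限制在当前事件的Key上:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// 输入与输出都是 (key, value) 形式的二元组
public class RunningSum extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> sumState;

    @Override
    public void open(Configuration parameters) {
        sumState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("running-sum", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = sumState.value();            // 只能看到当前Key对应的状态
        long updated = (current == null ? 0L : current) + in.f1;
        sumState.update(updated);
        out.collect(Tuple2.of(in.f0, updated));
    }
}
```

使用时先对流执行 `keyBy(t -> t.f0)`,再调用 `.flatMap(new RunningSum())`。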
有关更多信息,请参阅有关[状态](https://flink.sojb.cn/dev/stream/state/index.html)的文档。
......
......@@ -862,7 +862,7 @@ dataStream.rebalance();
|
| **重新调整**
DataStream→DataStream | 分区数据元,循环,到下游 算子操作的子集。如果您希望拥有管道,例如,从源的每个并行实例扇出到多个映射器的子集以分配负载但又不希望发生rebalance()会产生完全Rebalance ,那么这非常有用。这将仅需要本地数据传输而不是通过网络传输数据,具体取决于其他配置值,例如TaskManagers的插槽数。上游 算子操作发送数据元的下游 算子操作的子集取决于上游和下游 算子操作的并行度。例如,如果上游 算子操作具有并行性2并且下游 算子操作具有并行性6,则一个上游 算子操作将分配元件到三个下游 算子操作,而另一个上游 算子操作将分配到其他三个下游 算子操作。另一方面,如果下游 算子操作具有并行性2而上游 算子操作具有并行性6,则三个上游 算子操作将分配到一个下游 算子操作,而其他三个上游 算子操作将分配到另一个下游 算子操作。在不同并行度不是彼此的倍数的情况下,一个或多个下游 算子操作将具有来自上游 算子操作的不同数量的输入。请参阅此图以获取上例中连接模式的可视化:![数据流中的检查点障碍](img/rescale.svg)
DataStream→DataStream | 分区数据元,循环,到下游 算子操作的子集。如果您希望拥有管道,例如,从源的每个并行实例扇出到多个映射器的子集以分配负载但又不希望发生rebalance()会产生完全Rebalance ,那么这非常有用。这将仅需要本地数据传输而不是通过网络传输数据,具体取决于其他配置值,例如TaskManagers的插槽数。上游 算子操作发送数据元的下游 算子操作的子集取决于上游和下游 算子操作的并行度。例如,如果上游 算子操作具有并行性2并且下游 算子操作具有并行性6,则一个上游 算子操作将分配元件到三个下游 算子操作,而另一个上游 算子操作将分配到其他三个下游 算子操作。另一方面,如果下游 算子操作具有并行性2而上游 算子操作具有并行性6,则三个上游 算子操作将分配到一个下游 算子操作,而其他三个上游 算子操作将分配到另一个下游 算子操作。在不同并行度不是彼此的倍数的情况下,一个或多个下游 算子操作将具有来自上游 算子操作的不同数量的输入。请参阅此图以获取上例中连接模式的可视化:![数据流中的检查点障碍](../img/rescale.svg)
&lt;figure class="highlight"&gt;
......@@ -926,7 +926,7 @@ dataStream.rebalance()
|
| **Rescaling**
DataStream → DataStream | Partitions elements, round-robin, to a subset of downstream operations. This is useful if you want to have pipelines where you, for example, fan out from each parallel instance of a source to a subset of several mappers to distribute load but don't want the full rebalance that rebalance() would incur. This would require only local data transfers instead of transferring data over network, depending on other configuration values such as the number of slots of TaskManagers.The subset of downstream operations to which the upstream operation sends elements depends on the degree of parallelism of both the upstream and downstream operation. For example, if the upstream operation has parallelism 2 and the downstream operation has parallelism 4, then one upstream operation would distribute elements to two downstream operations while the other upstream operation would distribute to the other two downstream operations. If, on the other hand, the downstream operation has parallelism 2 while the upstream operation has parallelism 4 then two upstream operations would distribute to one downstream operation while the other two upstream operations would distribute to the other downstream operations.In cases where the different parallelisms are not multiples of each other one or several downstream operations will have a differing number of inputs from upstream operations.&lt;/p&gt; Please see this figure for a visualization of the connection pattern in the above example: &lt;/p&gt;![Checkpoint barriers in data streams](img/rescale.svg)
DataStream → DataStream | Partitions elements, round-robin, to a subset of downstream operations. This is useful if you want to have pipelines where you, for example, fan out from each parallel instance of a source to a subset of several mappers to distribute load but don't want the full rebalance that rebalance() would incur. This would require only local data transfers instead of transferring data over network, depending on other configuration values such as the number of slots of TaskManagers.The subset of downstream operations to which the upstream operation sends elements depends on the degree of parallelism of both the upstream and downstream operation. For example, if the upstream operation has parallelism 2 and the downstream operation has parallelism 4, then one upstream operation would distribute elements to two downstream operations while the other upstream operation would distribute to the other two downstream operations. If, on the other hand, the downstream operation has parallelism 2 while the upstream operation has parallelism 4 then two upstream operations would distribute to one downstream operation while the other two upstream operations would distribute to the other downstream operations.In cases where the different parallelisms are not multiples of each other one or several downstream operations will have a differing number of inputs from upstream operations.&lt;/p&gt; Please see this figure for a visualization of the connection pattern in the above example: &lt;/p&gt;![Checkpoint barriers in data streams](../img/rescale.svg)
&lt;figure class="highlight"&gt;
......
......@@ -70,7 +70,7 @@ A `WindowAssigner`负责将每个传入数据元分配给一个或多个窗口
_翻滚窗口_分配器将每个数据元分配到具有指定_窗口大小_的窗口中。翻滚窗口具有固定的大小且互不重叠。例如,如果指定大小为5分钟的翻滚窗口,则会对当前窗口进行求值,并且每五分钟启动一个新窗口,如下图所示。
![](img/tumbling-windows.svg)
![](../img/tumbling-windows.svg)
以下代码段显示了如何使用翻滚窗口。
......@@ -136,7 +136,7 @@ val input: DataStream[T] = ...
例如,您可以将大小为10分钟的窗口滑动5分钟。有了这个,你每隔5分钟就会得到一个窗口,其中包含过去10分钟内到达的事件,如下图所示。
![](img/sliding-windows.svg)
![](../img/sliding-windows.svg)
以下代码段显示了如何使用滑动窗口。
......@@ -200,7 +200,7 @@ val input: DataStream[T] = ...
_会话窗口_分配器按活动会话对数据元进行分组。与_翻滚窗口_和_滑动窗口_相比,会话窗口不重叠,也没有固定的开始和结束时间。相反,当会话窗口在一段时间内没有接收到数据元时,_即_出现不活动间隙时,会话窗口就会关闭。会话窗口分配器可以配置静态的_会话间隙_,或者配置_会话间隙提取器_函数来定义不活动时段的长度。当该时段到期时,当前会话关闭,后续数据元将被分配到新的会话窗口。
![](img/session-windows.svg)
![](../img/session-windows.svg)
以下代码段显示了如何使用会话窗口。
......@@ -289,7 +289,7 @@ val input: DataStream[T] = ...
_全局窗口_分配器将具有相同Key的所有数据元分配到同一个_全局窗口_中。此窗口方案仅在您同时指定自定义[触发器](#triggers)时才有用。否则不会执行任何计算,因为全局窗口没有一个可供我们处理聚合数据元的自然终点。
![](img/non-windowed.svg)
![](../img/non-windowed.svg)
以下代码段显示了如何使用全局窗口。
......
......@@ -39,7 +39,7 @@ stream.join(otherStream)
当执行翻滚窗口连接时,具有公共Key且处于同一翻滚窗口中的所有数据元以成对组合的形式连接,并传递给`JoinFunction`或`FlatJoinFunction`。由于其行为类似于内连接,如果一个流的数据元在其翻滚窗口中没有来自另一个流的数据元,则这些数据元不会被发出!
![](img/tumbling-window-join.svg)
![](../img/tumbling-window-join.svg)
如图所示,我们定义了一个大小为2毫秒的翻滚窗口,从而得到形如`[0,1], [2,3], ...`的窗口。图中显示了每个窗口中所有数据元的成对组合,这些组合将被传递给`JoinFunction`。请注意,翻滚窗口`[6,7]`中没有任何结果被发出,因为绿色流中不存在可与橙色数据元⑥和⑦连接的数据元。
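下面是一个与上图对应的示意代码(流的元素类型为假设的二元组,实际使用时还需要先分配事件时间戳和水印):

```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TumblingWindowJoinExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // 假设的两个输入流:(key, value)
        DataStream<Tuple2<String, Integer>> orangeStream =
                env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2));
        DataStream<Tuple2<String, Integer>> greenStream =
                env.fromElements(Tuple2.of("a", 10), Tuple2.of("b", 20));

        orangeStream
            .join(greenStream)
            .where(t -> t.f0)                                   // 左流的Key
            .equalTo(t -> t.f0)                                 // 右流的Key
            .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
            .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
                @Override
                public String join(Tuple2<String, Integer> left, Tuple2<String, Integer> right) {
                    return left.f1 + "," + right.f1;
                }
            })
            .print();

        env.execute("tumbling-window-join");
    }
}
```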
......@@ -96,7 +96,7 @@ orangeStream.join(greenStream)
执行滑动窗口连接时,具有公共Key且处于同一滑动窗口中的所有数据元以成对组合的形式连接,并传递给`JoinFunction`或`FlatJoinFunction`。如果一个流的数据元在当前滑动窗口中没有来自另一个流的数据元,则这些数据元不会被发出!请注意,某些数据元可能在一个滑动窗口中被连接,而在另一个滑动窗口中没有被连接。
![](img/sliding-window-join.svg)
![](../img/sliding-window-join.svg)
在这个例子中,我们使用大小为2毫秒的滑动窗口,并以1毫秒为步长滑动,从而产生滑动窗口`[-1, 0],[0,1],[1,2],[2,3], …`。x轴下方显示的是每个滑动窗口中传递给`JoinFunction`的连接数据元。在这里,您还可以看到橙色②在窗口`[2,3]`中与绿色③连接,但在窗口`[1,2]`中没有与任何数据元连接。
......@@ -153,7 +153,7 @@ orangeStream.join(greenStream)
在执行会话窗口连接时,具有相同Key的所有数据元,只要_“组合”_后满足会话条件,就会以成对组合的方式连接并传递给`JoinFunction`或`FlatJoinFunction`。这同样是内连接,因此如果某个会话窗口只包含来自一个流的数据元,则不会发出任何输出!
![](img/session-window-join.svg)
![](../img/session-window-join.svg)
这里我们定义了一个会话窗口连接,其中会话之间至少以1毫秒的间隙分隔。共有三个会话;在前两个会话中,来自两个流的连接数据元都被传递给`JoinFunction`。在第三个会话中,绿色流中没有数据元,所以⑧和⑨没有被连接!
......@@ -221,7 +221,7 @@ orangeStream.join(greenStream)
注意间隔连接当前仅支持事件时间。
![](img/interval-join.svg)
![](../img/interval-join.svg)
在上面的例子中,我们连接'orange'和'green'两个流,下界为-2毫秒,上界为+1毫秒。默认情况下,这些边界是包含的,但可以通过`.lowerBoundExclusive()`和`.upperBoundExclusive()`来改变该行为。
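下面给出一个与上例对应的示意代码(元素类型简化为字符串,且假设已按事件时间运行并分配了时间戳和水印):

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class IntervalJoinExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // 假设的两个输入流,直接用元素本身作为Key
        KeyedStream<String, String> orange = env.fromElements("a", "b").keyBy(v -> v);
        KeyedStream<String, String> green  = env.fromElements("a", "b").keyBy(v -> v);

        orange.intervalJoin(green)
              // 下界 -2 毫秒,上界 +1 毫秒;默认两端都包含
              .between(Time.milliseconds(-2), Time.milliseconds(1))
              .process(new ProcessJoinFunction<String, String, String>() {
                  @Override
                  public void processElement(String left, String right, Context ctx, Collector<String> out) {
                      out.collect(left + "," + right);
                  }
              })
              .print();

        env.execute("interval-join");
    }
}
```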
......
......@@ -11,7 +11,7 @@
下图中的示例数据流由五个子任务执行,因此具有五个并行线程。
![算子链接到任务](img/tasks_chains.svg)
![算子链接到任务](../img/tasks_chains.svg)
## TaskManager,JobManager,客户端
......@@ -29,7 +29,7 @@ JobManagers和TaskManagers可以通过多种方式启动:作为[独立集群](
**客户端**不是运行时和程序执行的一部分,而是用于准备数据流并将其发送给JobManager。之后,客户端可以断开连接,或者保持连接以接收进度报告。客户端既可以作为触发执行的Java/Scala程序的一部分运行,也可以在命令行进程中运行`./bin/flink run ...`。
![执行Flink数据流所涉及的过程](img/processes.svg)
![执行Flink数据流所涉及的过程](../img/processes.svg)
## 任务槽和资源
......@@ -39,7 +39,7 @@ JobManagers和TaskManagers可以通过多种方式启动:作为[独立集群](
通过调整任务槽的数量,用户可以定义子任务如何相互隔离。每个TaskManager只有一个插槽意味着每个任务组都在一个单独的JVM中运行(例如,可以在一个单独的容器中启动)。拥有多个插槽意味着更多子任务共享同一个JVM。同一JVM中的任务共享TCP连接(通过多路复用)和心跳消息。它们还可以共享数据集和数据结构,从而减少每个任务的开销。
![具有任务槽和任务的TaskManager](img/tasks_slots.svg)
![具有任务槽和任务的TaskManager](../img/tasks_slots.svg)
默认情况下,Flink允许子任务共享插槽,即使它们是不同任务的子任务,只要它们来自同一个作业。结果是一个槽可以保存作业的整个管道。允许此_插槽共享_有两个主要好处:
......@@ -47,7 +47,7 @@ JobManagers和TaskManagers可以通过多种方式启动:作为[独立集群](
* 更容易获得更好的资源利用率。如果没有插槽共享,非密集型的_源/map()_子任务会占用与资源密集型_窗口_子任务同样多的资源。通过插槽共享,将示例中的基础并行度从2提高到6可以充分利用插槽资源,同时确保繁重的子任务在TaskManager之间公平分配。
![具有共享任务槽的TaskManagers](img/slot_sharing.svg)
![具有共享任务槽的TaskManagers](../img/slot_sharing.svg)
API还包括可用于防止不期望的时隙共享的_[资源组](https://flink.sojb.cn/dev/stream/operators/#task-chaining-and-resource-groups)_机制。
......@@ -57,7 +57,7 @@ API还包括可用于防止不期望的时隙共享的_[资源组](https://flink
存储键/值索引的确切数据结构取决于所选的[状态后端](https://flink.sojb.cn/ops/state/state_backends.html)。一种状态后端将数据存储在内存中的哈希映射里,另一种状态后端使用[RocksDB](http://rocksdb.org)作为键/值存储。除了定义保存状态的数据结构之外,状态后端还实现了获取键/值状态时间点SNAPSHOT的逻辑,并将该SNAPSHOT存储为检查点的一部分。
![检查点和SNAPSHOT](img/checkpoints.svg)
![检查点和SNAPSHOT](../img/checkpoints.svg)
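下面是一个示意代码(检查点路径为假设值),展示如何在程序中选择状态后端;生产环境通常也可以通过`flink-conf.yaml`统一配置:

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 文件系统状态后端:工作状态保存在内存中,检查点写入(假设的)分布式存储路径
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        // 状态非常大时也可以改用 RocksDB 状态后端(需要引入 flink-statebackend-rocksdb 依赖)
        // env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
    }
}
```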
## 保存点
......
......@@ -17,7 +17,7 @@
与数据库的异步交互意味着单个并行函数实例可以同时处理许多请求并同时接收响应。这样,等待时间就可以与发送其他请求和接收响应的时间重叠。至少,等待时间被摊销到多个请求上。在大多数情况下,这会带来更高的流吞吐量。
![](img/async_io.svg)
![](../img/async_io.svg)
_注意:_通过仅扩展`MapFunction`到非常高的并行度来提高吞吐量在某些情况下也是可能的,但通常会产生非常高的资源成本:拥有更多并行MapFunction实例意味着更多任务,线程,Flink内部网络连接,网络连接到数据库,缓冲区和一般内部副本记录开销。
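下面是一个示意代码(其中的异步查询逻辑是假设的,用`CompletableFuture`模拟异步数据库客户端),展示异步I/O API的基本用法:

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncIoExample {

    // 假设的异步查询函数:真实场景中这里应使用支持异步请求的数据库客户端
    public static class AsyncLookup extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
            CompletableFuture
                .supplyAsync(() -> key + ":value")
                .thenAccept(result -> resultFuture.complete(Collections.singleton(result)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> keys = env.fromElements("a", "b", "c");

        // 无序模式,超时 1 秒,最多 100 个在途请求
        AsyncDataStream.unorderedWait(keys, new AsyncLookup(), 1, TimeUnit.SECONDS, 100)
                       .print();

        env.execute("async-io");
    }
}
```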
......
......@@ -398,11 +398,11 @@ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wiki-result
您还可以查看运行在 [http://localhost:8081](http://localhost:8081) 上的Flink仪表板。您将看到集群资源和正在运行的作业的概览:
[![JobManager概述](img/quickstart-example/jobmanager-overview.png)](img/quickstart-example/jobmanager-overview.png)
[![JobManager概述](../img/quickstart-example/jobmanager-overview.png)](../img/quickstart-example/jobmanager-overview.png)
如果单击正在运行的作业,您将获得一个视图,您可以在其中检查各个 算子操作,例如,查看已处理数据元的数量:
[![作业视图示例](img/quickstart-example/jobmanager-job.png)](img/quickstart-example/jobmanager-job.png)
[![作业视图示例](../img/quickstart-example/jobmanager-job.png)](../img/quickstart-example/jobmanager-job.png)
这就结束了我们对Flink的小游览。如果您有任何疑问,请随时询问我们的[邮件列表](http://flink.apache.org/community.html#mailing-lists)
......@@ -60,7 +60,7 @@ $ ./bin/start-cluster.sh # Start Flink
检查**分派器的web前端**[HTTP://localhost:8081](http://localhost:8081),并确保一切都正常运行。Web前端应报告单个可用的TaskManager实例。
[![调度员:概述](img/jobmanager-1.png)](img/jobmanager-1.png)
[![调度员:概述](../img/jobmanager-1.png)](../img/jobmanager-1.png)
您还可以通过检查`logs`目录中的日志文件来验证系统是否正在运行:
......@@ -231,7 +231,7 @@ Starting execution of program
程序连接到套接字并等待输入。您可以检查Web界面以验证作业是否按预期运行:
[![调度员:概述(续)](img/jobmanager-2.png)](img/jobmanager-2.png)[![调度程序:运行作业](img/jobmanager-3.png)](img/jobmanager-3.png)
[![调度员:概述(续)](../img/jobmanager-2.png)](../img/jobmanager-2.png)[![调度程序:运行作业](../img/jobmanager-3.png)](../img/jobmanager-3.png)
* 单词在5秒的时间窗口(处理时间,翻滚窗口)中进行计数并打印到`stdout`。监视TaskManager的输出文件,并在`nc`中输入一些文本(输入在按下回车后逐行发送到Flink):
......
......@@ -40,7 +40,7 @@ Flink程序通过定义**步进函数**并将其嵌入到特殊的迭代 算子
**迭代算子**涵盖了_迭代的简单形式_:在每次迭代中,**步骤函数**消耗**整个输入**(_上一次迭代的结果_或_初始数据集_),并计算出**部分解的下一个版本**(例如`map`、`reduce`、`join`等)。
![迭代 算子](img/iterations_iterate_operator.png)
![迭代 算子](../img/iterations_iterate_operator.png)
1. **迭代输入**:来自_数据源_或_先前 算子_的_第一次迭代的_初始输入。
2. **步骤函数**:步骤函数将在每次迭代中执行。它是由`map`、`reduce`、`join`等算子组成的任意数据流,具体取决于手头的任务。
......@@ -74,7 +74,7 @@ setFinalState(state);
在以下示例中,我们**迭代地递增一组数字**
![迭代 算子示例](img/iterations_iterate_operator_example.png)
![迭代 算子示例](../img/iterations_iterate_operator_example.png)
1. **迭代输入**:初始输入从数据源读取和由五个单字段记录(整数`1``5`)。
2. **步进函数**:步进函数是单个`map` 算子,它将整数字段从`i`增加到`i+1`。它将应用于输入的每个记录。
......@@ -102,7 +102,7 @@ map(5) -> 6 map(6) -> 7 ... map(14) -> 15
在适用的情况下,这会导致**更高效的算法**,因为解决方案集中的每个数据元都不会在每次迭代中发生变化。这样可以**专注于**解决方案**的热部件**,并**保持冷部件不受影响**。通常,大多数解决方案相对较快地冷却,后来的迭代仅在一小部分数据上运行。
![Delta迭代算子](img/iterations_delta_iterate_operator.png)
![Delta迭代算子](../img/iterations_delta_iterate_operator.png)
1. **迭代输入**:从_数据源_或_先前的 算子_读取初始工作集和解决方案集作为第一次迭代的输入。
2. **步骤函数**:步骤函数将在每次迭代中执行。它是由`map`、`reduce`、`join`等算子组成的任意数据流,具体取决于手头的任务。
......@@ -136,7 +136,7 @@ setFinalState(solution);
在以下示例中,每个顶点都有一个**ID**和一个**着色**。每个顶点将其顶点ID传播到相邻顶点。该**目标**是_最小ID分配给子图的每个顶点_。如果接收的ID小于当前的ID,则它将变为具有接收到的ID的顶点的颜色。其中一个应用可以在_社区分析_或_连通组件_计算中找到。
![Delta迭代 算子示例](img/iterations_delta_iterate_operator_example.png)
![Delta迭代 算子示例](../img/iterations_delta_iterate_operator_example.png)
**初始输入**同时被设置为**工作集和解决方案集**。上图中,颜色可视化了**解决方案集**的**演变**。每次迭代时,最小ID的颜色在相应的子图中传播。同时,每次迭代的工作量(交换和比较顶点ID)都会减少。这对应于**工作集大小的减小**:三次迭代之后,工作集从全部七个顶点减少为零,此时迭代终止。**重要的观察**是:_下半部分的子图在上半部分之前收敛_,而增量迭代能够通过工作集抽象捕捉到这一点。
......@@ -154,5 +154,5 @@ setFinalState(solution);
我们将迭代算子的步骤函数的每次执行称为_单次迭代_。在并行设置中,**步骤函数的多个实例会在迭代状态的不同分区上并行求值**。在许多设置中,对所有并行实例的一次步骤函数求值构成所谓的**超步(superstep)**,这也是同步的粒度。因此,一次迭代的_所有_并行任务都需要完成当前超步,才能初始化下一个超步。**终止条件**也会在超步屏障处进行评估。
![超级步](img/iterations_supersteps.png)
![超级步](../img/iterations_supersteps.png)
......@@ -37,7 +37,7 @@ _动态表_是Flink的 Table API和SQL支持流数据的核心概念。与表示
下图显示了流,动态表和连续查询的关系:
<center>![动态表格](img/stream-query-stream.png)</center>
<center>![动态表格](../img/stream-query-stream.png)</center>
1. 流转换为动态表。
2. 在动态表上评估连续查询,生成新的动态表。
......@@ -65,7 +65,7 @@ _动态表_是Flink的 Table API和SQL支持流数据的核心概念。与表示
下图显示了click事件流(左侧)如何转换为表(右侧)。随着更多点击流记录的插入,生成的表不断增长。
<center>![追加模式](img/append-mode.png)</center>
<center>![追加模式](../img/append-mode.png)</center>
**注意:**在流上定义的表在内部未实现。
......@@ -77,13 +77,13 @@ _动态表_是Flink的 Table API和SQL支持流数据的核心概念。与表示
第一个查询是一个简单的`GROUP-BY COUNT`聚合查询。这组`clicks`对表`user`字段和计数访问的网址的数量。下图显示了在`clicks`使用其他行更新表时,如何评估查询。
<center>![连续非窗口查询](img/table-streaming/query-groupBy-cnt.png)</center>
<center>![连续非窗口查询](../img/table-streaming/query-groupBy-cnt.png)</center>
查询启动时,`clicks`表(左侧)为空。当第一行`[Mary, ./home]`插入`clicks`表时,查询开始计算结果表,结果表(右侧,顶部)由一行`[Mary, 1]`组成。当第二行`[Bob, ./cart]`插入`clicks`表时,查询更新结果表并插入新行`[Bob, 1]`。第三行`[Mary, ./prod?id=1]`使已计算的结果行`[Mary, 1]`更新为`[Mary, 2]`。最后,当第四行追加到`clicks`表时,查询将第三行`[Liz, 1]`插入结果表。
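下面是一段示意代码(表名、字段与数据均沿用上文示例中的假设),展示如何把上述第一个查询写成持续查询并输出其更新:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class ContinuousQueryExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

        // 假设的点击流:(user, url)
        DataStream<Tuple2<String, String>> clicks =
                env.fromElements(Tuple2.of("Mary", "./home"), Tuple2.of("Bob", "./cart"));

        tableEnv.registerDataStream("clicks", clicks, "user, url");

        // 持续查询:按 user 分组统计 url 数量
        Table result = tableEnv.sqlQuery(
                "SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user");

        // 结果表会被不断更新,因此转换为撤回流输出
        tableEnv.toRetractStream(result, Row.class).print();

        env.execute("continuous-query");
    }
}
```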
第二个查询类似于第一个查询,但`clicks`除了`user`属性之外还在[每小时滚动窗口](https://flink.sojb.cn/sql.html#group-windows)上对表进行分组,然后计算URL的数量(基于时间的计算,例如窗口基于特殊[时间属性](#time-attributes),这将在下面讨论) )。同样,该图显示了不同时间点的输入和输出,以显示动态表的变化性质。
<center>![连续组窗口查询](img/table-streaming/query-groupBy-window-cnt.png)</center>
<center>![连续组窗口查询](../img/table-streaming/query-groupBy-window-cnt.png)</center>
和以前一样,输入表`clicks`显示在左侧。查询每小时持续计算结果并更新结果表。`clicks`表包含四行记录,其时间戳(`cTime`)介于`12:00:00`和`12:59:59`之间。查询从该输入计算出两个结果行(每个`user`一行)并将它们追加到结果表中。对于`13:00:00`到`13:59:59`之间的下一个窗口,`clicks`表包含三行,这使得另外两行被追加到结果表中。随着时间推移,`clicks`表会不断追加新行,结果表也会相应更新。
......@@ -140,11 +140,11 @@ FROM (
* **撤回流:**撤回流是包含两种类型消息的流:_添加消息_和_撤回消息_。将动态表转换为撤回流的方式是:将`INSERT`更改编码为添加消息,将`DELETE`更改编码为撤回消息,将`UPDATE`更改编码为对更新前(旧)行的撤回消息和对更新后(新)行的添加消息。下图显示了动态表到撤回流的转换。
<center>![动态表格](img/table-streaming/undo-redo-mode.png)</center>
<center>![动态表格](../img/table-streaming/undo-redo-mode.png)</center>
* **Upsert流:** upsert流是包含两种消息的流:_upsert消息_和_删除消息_。转换为upsert流的动态表需要一个(可能是复合的)唯一键。具有唯一键的动态表转换为流的方式是:将`INSERT`和`UPDATE`更改编码为upsert消息,将`DELETE`更改编码为删除消息。消费该流的算子需要知道唯一键属性才能正确应用消息。与撤回流的主要区别在于,`UPDATE`更改使用单条消息进行编码,因此效率更高。下图显示了动态表到upsert流的转换。
<center>![动态表格](img/table-streaming/redo-mode.png)</center>
<center>![动态表格](../img/table-streaming/redo-mode.png)</center>
`DataStream`[Common Concepts](https://flink.sojb.cn/common.html#convert-a-table-into-a-datastream)页面上讨论了将动态表转换为a的API 。请注意,将动态表格转换为a时,仅支持附加和撤消流`DataStream`[TableSources和TableSinks](https://flink.sojb.cn/sourceSinks.html#define-a-tablesink)页面`TableSink`讨论了向外部系统发出动态表的接口。[](https://flink.sojb.cn/sourceSinks.html#define-a-tablesink)
......
......@@ -248,7 +248,7 @@ class CustomTypeSplit extends TableFunction[Row] {
用户定义的聚合函数(UDAGG)将一个表(一个或多个具有一个或多个属性的行)聚合到标量值。
<center>![UDAGG机制](img/udagg-mechanism.png)</center>
<center>![UDAGG机制](../img/udagg-mechanism.png)</center>
上图显示了一个聚合的示例。假设您有一个包含饮料数据的表。该表由`id`、`name`、`price`三列和5行组成。想象一下,您需要找到表中所有饮料的最高价格,即执行一次`max()`聚合。您需要检查这5行中的每一行,结果是一个标量数值。
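作为示意(饮料表与`max()`需求沿用上文假设),下面给出一个自定义聚合函数的最小骨架;`accumulate`方法按约定命名,由运行时调用:

```java
import org.apache.flink.table.functions.AggregateFunction;

// 假设的 UDAGG:求所有饮料的最高价格
public class MaxPrice extends AggregateFunction<Long, MaxPrice.MaxAccum> {

    // 累加器:保存目前遇到的最高价格
    public static class MaxAccum {
        public long max = Long.MIN_VALUE;
    }

    @Override
    public MaxAccum createAccumulator() {
        return new MaxAccum();
    }

    @Override
    public Long getValue(MaxAccum acc) {
        return acc.max;
    }

    // 每处理一行输入调用一次
    public void accumulate(MaxAccum acc, long price) {
        if (price > acc.max) {
            acc.max = price;
        }
    }
}
```

注册后即可在 Table API 或 SQL 中使用,例如 `tableEnv.registerFunction("maxPrice", new MaxPrice())`。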
......
......@@ -9,7 +9,7 @@ Flink的Table&SQL API可以处理用SQL语言编写的查询,但是这些查
_SQL客户端_旨在提供一种简单的方式,让用户无需编写一行Java或Scala代码,即可编写、调试表程序并将其提交到Flink集群。_SQL客户端CLI_允许在命令行中检索并可视化分布式应用的实时运行结果。
[![在群集上运行表程序的Flink SQL Client CLI的动画演示](img/sql_client_demo.gif)](img/sql_client_demo.gif)
[![在群集上运行表程序的Flink SQL Client CLI的动画演示](../img/sql_client_demo.gif)](../img/sql_client_demo.gif)
注意 SQL客户端处于早期开发阶段。即使应用程序还没有生产就绪,它可以是一个非常有用的工具,用于原型设计和Flink SQL。在未来,社区计划通过提供基于REST的[SQL客户端网关](sqlClient.html#limitations--future)来扩展其函数。
......
......@@ -48,7 +48,7 @@ println(env.getExecutionPlan())
完成这些步骤后,将显示详细的执行计划。
![flink作业执行图。](img/plan_visualizer.png)
![flink作业执行图。](../img/plan_visualizer.png)
**Web界面**
......
......@@ -465,7 +465,7 @@ val graph: Graph[Long, Long, Long] = ...
![过滤转换](img/gelly-filter.png)
![过滤转换](../img/gelly-filter.png)
* **Join**:Gelly提供了将顶点和边缘数据集与其他输入数据集连接的专用方法。`joinWithVertices`使用`Tuple2`输入数据集连接顶点。使用顶点ID和`Tuple2`输入的第一个字段作为连接键来执行连接。该方法返回一个新的`Graph`,其中顶点值已根据提供的用户定义的转换函数进行更新。类似地,输入数据集可以使用三种方法之一与边连接。`joinWithEdges`期望的输入`DataSet``Tuple3`,并关联对源和目标顶点ID的组合键。`joinWithEdgesOnSource`期望一个`DataSet``Tuple2`并关联上边缘和输入数据集的所述第一属性的源Keys和`joinWithEdgesOnTarget`期望一个`DataSet``Tuple2`并连接边的目标键和输入数据集的第一个属性。所有这三种方法都在边缘和输入数据集值上应用变换函数。请注意,如果输入数据集多次包含键,则所有Gelly连接方法将仅考虑遇到的第一个值。
......@@ -508,7 +508,7 @@ val vertexOutDegrees: DataSet[(Long, LongValue)] = network.outDegrees
* **Union**:Gelly的`union()`方法对指定图与当前图的顶点集和边集执行并集 算子操作。结果`Graph`中的重复顶点会被删除,而如果存在重复的边,则这些边会被保留。
![Union转型](img/gelly-union.png)
![Union转型](../img/gelly-union.png)
* **差异**:Gelly的`difference()`方法对当前图形和指定图形的顶点和边集进行差异。
......@@ -624,7 +624,7 @@ Graph<K, VV, EV> removeEdges(List<Edge<K, EV>> edgesToBeRemoved)
例如,假设您要在下图中为每个顶点选择所有外边的最小权重:
![reduceOnEdges示例](img/gelly-example-graph.png)
![reduceOnEdges示例](../img/gelly-example-graph.png)
以下代码将收集每个顶点的外边缘,并`SelectMinWeight()`在每个结果邻域上应用用户定义的函数:
......@@ -666,7 +666,7 @@ val minWeights = graph.reduceOnEdges(new SelectMinWeight, EdgeDirection.OUT)
![reduceOnEdges示例](img/gelly-reduceOnEdges.png)
![reduceOnEdges示例](../img/gelly-reduceOnEdges.png)
类似地,假设您想为每个顶点计算所有进入邻居的值的总和。以下代码将收集每个顶点的进入邻居,并`SumValues()`在每个邻域上应用用户定义的函数:
......@@ -708,7 +708,7 @@ val verticesWithSum = graph.reduceOnNeighbors(new SumValues, EdgeDirection.IN)
![reduceOnNeighbors示例](img/gelly-reduceOnNeighbors.png)
![reduceOnNeighbors示例](../img/gelly-reduceOnNeighbors.png)
当聚合函数不满足结合律和交换律,或者希望每个顶点返回多个值时,可以使用更通用的`groupReduceOnEdges()`和`groupReduceOnNeighbors()`方法。这些方法为每个顶点返回零个、一个或多个值,并提供对整个邻域的访问。
......
......@@ -13,7 +13,7 @@ Gelly利用Flink的高效迭代 算子来支持大规模迭代图处理。目前
计算模型如下图所示。虚线框对应于并行化单元。在每个超级步骤中,所有活动顶点并行执行相同的用户定义计算。超级步骤是同步执行的,因此保证在一个超级步骤期间发送的消息在下一个超级步骤的开始时传递。
![以顶点为中心的计算模型](img/vertex-centric supersteps.png)
![以顶点为中心的计算模型](../img/vertex-centric supersteps.png)
要在Gelly中使用以顶点为中心的迭代,用户只需要定义顶点计算函数,`ComputeFunction`。该函数和最大运行迭代次数作为Gelly的参数给出`runVertexCentricIteration`。此方法将在输入Graph上执行以顶点为中心的迭代,并返回具有更新顶点值的新Graph。`MessageCombiner`可以定义可选的消息组合器以降低通信成本。
......@@ -246,7 +246,7 @@ Gelly提供了分散 - 聚集迭代的方法。用户只需要实现两个函数
让我们考虑在下图中使用散射 - 聚集迭代计算单源最短路径,并让顶点1成为源。在每个超级步骤中,每个顶点向其所有邻居发送候选距离消息。消息值是顶点的当前值与连接此顶点与其邻居的边缘权重之和。在接收候选距离消息时,每个顶点计算最小距离,并且如果已发现较短路径,则其更新其值。如果顶点在超级步骤期间没有改变其值,则它不会为其下一个超级步的邻居生成消息。当没有值更新时,算法收敛。
![Scatter-gather SSSP superstep 1](img/gelly-vc-sssp1.png)
![Scatter-gather SSSP superstep 1](../img/gelly-vc-sssp1.png)
* [**Java**](#tab_java_2)
* [**Scala**](#tab_scala_2)
......@@ -584,7 +584,7 @@ final class VertexUpdater extends GatherFunction {...}
让我们考虑在下图中用GSA计算单源最短路径,并让顶点1成为源。在该`Gather`阶段期间,我们通过将每个顶点值与边缘权重相加来计算新的候选距离。在`Sum`,候选距离按顶点ID分组,并选择最小距离。在`Apply`,将新计算的距离与当前顶点值进行比较,并将两者中的最小值指定为顶点的新值。
![GSA SSSP超越1](img/gelly-gsa-sssp1.png)
![GSA SSSP超越1](../img/gelly-gsa-sssp1.png)
请注意,如果顶点在超级步骤期间未更改其值,则在下一个超级步骤期间不会计算候选距离。当没有顶点改变值时,算法收敛。
......
......@@ -88,7 +88,7 @@ Graph<String, String, Long, Long, Double> graph = BipartiteGraph.fromDataSet(top
* **Projection**:Projection是二分图的常见 算子操作,可将二分图转换为常规图。有两种类型的Projection:顶部和底部Projection。顶部Projection仅保存结果图中的顶部节点,并且仅当顶部节点在原始图中连接到中间底部节点时才在新图中创建它们之间的链接。底部Projection与顶部Projection相反,即仅保存底部节点并连接一对节点(如果它们在原始图形中连接)。
![二分图Projection](img/bipartite_graph_projections.png)
![二分图Projection](../img/bipartite_graph_projections.png)
Gelly支持两种子类型的Projection:简单Projection和完整Projection。它们之间的唯一区别是数据与结果图中的边相关联。
......
......@@ -53,7 +53,7 @@ cd flink-*
以下示例说明了具有三个节点(IP地址从_10.0.0.1_ 到_10.0.0.3_以及主机名_master_,_worker1_,_worker2_)的设置,并显示了配置文件的内容(需要在所有计算机上的相同路径上访问) ):
![](img/quickstart_cluster.png)
![](../img/quickstart_cluster.png)
/path/to/**flink/conf/
flink-conf.yaml**
......
......@@ -361,7 +361,7 @@ yarn logs -applicationId <application ID>
本节简要介绍Flink和YARN如何交互。
![](img/FlinkOnYarn.svg)
![](../img/FlinkOnYarn.svg)
YARN客户端需要访问Hadoop配置以连接到YARN资源管理器和HDFS。它使用以下策略确定Hadoop配置:
......
......@@ -50,7 +50,7 @@ Finally, you must provide a list of all nodes in your cluster which shall be use
The following example illustrates the setup with three nodes (with IP addresses from _10.0.0.1_ to _10.0.0.3_ and hostnames _master_, _worker1_, _worker2_) and shows the contents of the configuration files (which need to be accessible at the same path on all machines):
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart_cluster.png)
![](../img/quickstart_cluster.png)
/path/to/**flink/conf/
flink-conf.yaml**
......
......@@ -346,7 +346,7 @@ These two configuration options accept single ports (for example: “50010”),
This section briefly describes how Flink and YARN interact.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/FlinkOnYarn.svg)
![](../img/FlinkOnYarn.svg)
The YARN client needs to access the Hadoop configuration to connect to the YARN resource manager and to HDFS. It determines the Hadoop configuration using the following strategy:
......
......@@ -14,7 +14,7 @@ The general idea of JobManager high availability for standalone clusters is that
As an example, consider the following setup with three JobManager instances:
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/jobmanager_ha_overview.png)
![](../img/jobmanager_ha_overview.png)
### Configuration
......
......@@ -42,7 +42,7 @@ To prevent such a situation, applications can define a _minimum duration between
This duration is the minimum time interval that must pass between the end of the latest checkpoint and the beginning of the next. The figure below illustrates how this impacts checkpointing.
![Illustration how the minimum-time-between-checkpoints parameter affects checkpointing behavior.](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/checkpoint_tuning.svg)
![Illustration how the minimum-time-between-checkpoints parameter affects checkpointing behavior.](../img/checkpoint_tuning.svg)
_Note:_ Applications can be configured (via the `CheckpointConfig`) to allow multiple checkpoints to be in progress at the same time. For applications with large state in Flink, this often ties up too many resources into the checkpointing. When a savepoint is manually triggered, it may be in process concurrently with an ongoing checkpoint.
......@@ -179,7 +179,7 @@ However, for each task that can be rescheduled to the previous location for reco
Please note that this can come at some additional costs per checkpoint for creating and storing the secondary local state copy, depending on the chosen state backend and checkpointing strategy. For example, in most cases the implementation will simply duplicate the writes to the distributed store to a local file.
![Illustration of checkpointing with task-local recovery.](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/local_recovery.png)
![Illustration of checkpointing with task-local recovery.](../img/local_recovery.png)
### Relationship of primary (distributed store) and secondary (task-local) state snapshots
......
......@@ -1776,5 +1776,5 @@ Each Flink TaskManager provides processing slots in the cluster. The number of s
When starting a Flink application, users can supply the default number of slots to use for that job. The command line value therefore is called `-p` (for parallelism). In addition, it is possible to [set the number of slots in the programming APIs](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/parallel.html) for the whole application and for individual operators.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/slots_parallelism.svg)
![](../img/slots_parallelism.svg)
......@@ -10,7 +10,7 @@ When securing network connections between machines processes through authenticat
For more flexibility, security for internal and external connectivity can be enabled and configured separately.
![Internal and External Connectivity](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/ssl_internal_external.svg)
![Internal and External Connectivity](../img/ssl_internal_external.svg)
### Internal Connectivity
......
......@@ -29,7 +29,7 @@ The overview tabs lists the following statistics. Note that these statistics don
The checkpoint history keeps statistics about recently triggered checkpoints, including those that are currently in progress.
<center>![Checkpoint Monitoring: History](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/checkpoint_monitoring-history.png)</center>
<center>![Checkpoint Monitoring: History](../img/checkpoint_monitoring-history.png)</center>
* **ID**: The ID of the triggered checkpoint. The IDs are incremented for each checkpoint, starting at 1.
* **Status**: The current status of the checkpoint, which is either _In Progress_, _Completed_, or _Failed_. If the triggered checkpoint is a savepoint, you will see a savepoint symbol.
......@@ -56,7 +56,7 @@ web.checkpoints.history: 15
The summary computes a simple min/average/maximum statistics over all completed checkpoints for the End to End Duration, State Size, and Bytes Buffered During Alignment (see [History](#history) for details about what these mean).
<center>![Checkpoint Monitoring: Summary](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/checkpoint_monitoring-summary.png)</center>
<center>![Checkpoint Monitoring: Summary](../img/checkpoint_monitoring-summary.png)</center>
Note that these statistics don’t survive a JobManager loss and are reset if your JobManager fails over.
......@@ -75,13 +75,13 @@ The configuration list your streaming configuration:
When you click on a _More details_ link for a checkpoint, you get a Minimum/Average/Maximum summary over all its operators and also the detailed numbers per single subtask.
<center>![Checkpoint Monitoring: Details](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/checkpoint_monitoring-details.png)</center>
<center>![Checkpoint Monitoring: Details](../img/checkpoint_monitoring-details.png)</center>
#### Summary per Operator
<center>![Checkpoint Monitoring: Details Summary](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/checkpoint_monitoring-details_summary.png)</center>
<center>![Checkpoint Monitoring: Details Summary](../img/checkpoint_monitoring-details_summary.png)</center>
#### All Subtask Statistics
<center>![Checkpoint Monitoring: Subtasks](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/checkpoint_monitoring-details_subtasks.png)</center>
<center>![Checkpoint Monitoring: Subtasks](../img/checkpoint_monitoring-details_subtasks.png)</center>
......@@ -14,7 +14,7 @@ Take a simple `Source -&gt; Sink` job as an example. If you see a warning for `S
Back pressure monitoring works by repeatedly taking stack trace samples of your running tasks. The JobManager triggers repeated calls to `Thread.getStackTrace()` for the tasks of your job.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/back_pressure_sampling.png)
![](../img/back_pressure_sampling.png)
If the samples show that a task Thread is stuck in a certain internal method call (requesting buffers from the network stack), this indicates that there is back pressure for the task.
......@@ -44,13 +44,13 @@ This means that the JobManager triggered a stack trace sample of the running tas
Note that clicking the row, you trigger the sample for all subtasks of this operator.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/back_pressure_sampling_in_progress.png)
![](../img/back_pressure_sampling_in_progress.png)
### Back Pressure Status
If you see status **OK** for the tasks, there is no indication of back pressure. **HIGH** on the other hand means that the tasks are back pressured.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/back_pressure_sampling_ok.png)
![](../img/back_pressure_sampling_ok.png)
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/back_pressure_sampling_high.png)
![](../img/back_pressure_sampling_high.png)
......@@ -14,6 +14,6 @@ As a software stack, Flink is a layered system. The different layers of the stac
You can click on the components in the figure to learn more.
<center>![Apache Flink: Stack](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/stack.png)</center>
<center>![Apache Flink: Stack](../img/stack.png)</center>
<map name="overview-stack"><area id="lib-datastream-cep" title="CEP: Complex Event Processing" href="//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/libs/cep.html" shape="rect" coords="63,0,143,177"> <area id="lib-datastream-table" title="Table: Relational DataStreams" href="//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/table_api.html" shape="rect" coords="143,0,223,177"> <area id="lib-dataset-ml" title="FlinkML: Machine Learning" href="//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/libs/ml/index.html" shape="rect" coords="382,2,462,176"> <area id="lib-dataset-gelly" title="Gelly: Graph Processing" href="//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/libs/gelly/index.html" shape="rect" coords="461,0,541,177"> <area id="lib-dataset-table" title="Table API and SQL" href="//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/table_api.html" shape="rect" coords="544,0,624,177"> <area id="datastream" title="DataStream API" href="//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/datastream_api.html" shape="rect" coords="64,177,379,255"> <area id="dataset" title="DataSet API" href="//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/batch/index.html" shape="rect" coords="382,177,697,255"> <area id="runtime" title="Runtime" href="//ci.apache.org/projects/flink/flink-docs-release-1.7/concepts/runtime.html" shape="rect" coords="63,257,700,335"> <area id="local" title="Local" href="//ci.apache.org/projects/flink/flink-docs-release-1.7/tutorials/local_setup.html" shape="rect" coords="62,337,275,414"> <area id="cluster" title="Cluster" href="//ci.apache.org/projects/flink/flink-docs-release-1.7/ops/deployment/cluster_setup.html" shape="rect" coords="273,336,486,413"> <area id="cloud" title="Cloud" href="//ci.apache.org/projects/flink/flink-docs-release-1.7/ops/deployment/gce_setup.html" shape="rect" coords="485,336,700,414"></map>
\ No newline at end of file
......@@ -26,7 +26,7 @@ The central part of Flink’s fault tolerance mechanism is drawing consistent sn
A core element in Flink’s distributed snapshotting are the _stream barriers_. These barriers are injected into the data stream and flow with the records as part of the data stream. Barriers never overtake records, the flow strictly in line. A barrier separates the records in the data stream into the set of records that goes into the current snapshot, and the records that go into the next snapshot. Each barrier carries the ID of the snapshot whose records it pushed in front of it. Barriers do not interrupt the flow of the stream and are hence very lightweight. Multiple barriers from different snapshots can be in the stream at the same time, which means that various snapshots may happen concurrently.
![Checkpoint barriers in data streams](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/stream_barriers.svg)
![Checkpoint barriers in data streams](../img/stream_barriers.svg)
Stream barriers are injected into the parallel data flow at the stream sources. The point where the barriers for snapshot _n_ are injected (let’s call it _S<sub>n</sub>_) is the position in the source stream up to which the snapshot covers the data. For example, in Apache Kafka, this position would be the last record’s offset in the partition. This position _S<sub>n</sub>_ is reported to the _checkpoint coordinator_ (Flink’s JobManager).
......@@ -34,7 +34,7 @@ The barriers then flow downstream. When an intermediate operator has received a
Once snapshot _n_ has been completed, the job will never again ask the source for records from before _S<sub>n</sub>_, since at that point these records (and their descendant records) will have passed through the entire data flow topology.
![Aligning data streams at operators with multiple inputs](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/stream_aligning.svg)
![Aligning data streams at operators with multiple inputs](../img/stream_aligning.svg)
Operators that receive more than one input stream need to _align_ the input streams on the snapshot barriers. The figure above illustrates this:
......@@ -57,7 +57,7 @@ The resulting snapshot now contains:
* For each parallel stream data source, the offset/position in the stream when the snapshot was started
* For each operator, a pointer to the state that was stored as part of the snapshot
![Illustration of the Checkpointing Mechanism](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/checkpointing.svg)
![Illustration of the Checkpointing Mechanism](../img/checkpointing.svg)
### Exactly Once vs. At Least Once
......
......@@ -10,7 +10,7 @@ Execution resources in Flink are defined through _Task Slots_. Each TaskManager
The figure below illustrates that. Consider a program with a data source, a _MapFunction_, and a _ReduceFunction_. The source and MapFunction are executed with a parallelism of 4, while the ReduceFunction is executed with a parallelism of 3\. A pipeline consists of the sequence Source - Map - Reduce. On a cluster with 2 TaskManagers with 3 slots each, the program will be executed as described below.
![Assigning Pipelines of Tasks to Slots](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/slots.svg)
![Assigning Pipelines of Tasks to Slots](../img/slots.svg)
Internally, Flink defines through [SlotSharingGroup](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/jobmanager/scheduler/SlotSharingGroup.java) and [CoLocationGroup](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/jobmanager/scheduler/CoLocationGroup.java) which tasks may share a slot (permissive), respectively which tasks must be strictly placed into the same slot.
......@@ -22,7 +22,7 @@ The JobManager receives the [JobGraph](https://github.com/apache/flink/blob/mast
The JobManager transforms the JobGraph into an [ExecutionGraph](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/). The ExecutionGraph is a parallel version of the JobGraph: For each JobVertex, it contains an [ExecutionVertex](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionVertex.java) per parallel subtask. An operator with a parallelism of 100 will have one JobVertex and 100 ExecutionVertices. The ExecutionVertex tracks the state of execution of a particular subtask. All ExecutionVertices from one JobVertex are held in an [ExecutionJobVertex](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionJobVertex.java), which tracks the status of the operator as a whole. Besides the vertices, the ExecutionGraph also contains the [IntermediateResult](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/IntermediateResult.java) and the [IntermediateResultPartition](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/IntermediateResultPartition.java). The former tracks the state of the _IntermediateDataSet_, the latter the state of each of its partitions.
![JobGraph and ExecutionGraph](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/job_and_execution_graph.svg)
![JobGraph and ExecutionGraph](../img/job_and_execution_graph.svg)
Each ExecutionGraph has a job status associated with it. This job status indicates the current state of the job execution.
......@@ -32,8 +32,8 @@ In case that the user cancels the job, it will go into the _cancelling_ state. T
Unlike the states _finished_, _canceled_ and _failed_ which denote a globally terminal state and, thus, trigger the clean up of the job, the _suspended_ state is only locally terminal. Locally terminal means that the execution of the job has been terminated on the respective JobManager but another JobManager of the Flink cluster can retrieve the job from the persistent HA store and restart it. Consequently, a job which reaches the _suspended_ state won’t be completely cleaned up.
![States and Transitions of Flink job](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/job_status.svg)
![States and Transitions of Flink job](../img/job_status.svg)
During the execution of the ExecutionGraph, each parallel task goes through multiple stages, from _created_ to _finished_ or _failed_. The diagram below illustrates the states and possible transitions between them. A task may be executed multiple times (for example in the course of failure recovery). For that reason, the execution of an ExecutionVertex is tracked in an [Execution](https://github.com/apache/flink/blob/master//flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java). Each ExecutionVertex has a current Execution, and prior Executions.
![States and Transitions of Task Executions](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/state_machine.svg)
\ No newline at end of file
![States and Transitions of Task Executions](../img/state_machine.svg)
\ No newline at end of file
......@@ -28,7 +28,7 @@ Flink supports different notions of _time_ in streaming programs.
Internally, _ingestion time_ is treated much like _event time_, but with automatic timestamp assignment and automatic watermark generation.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/times_clocks.svg)
![](../img/times_clocks.svg)
### Setting a Time Characteristic
......@@ -100,11 +100,11 @@ The mechanism in Flink to measure progress in event time is **watermarks**. Wate
The figure below shows a stream of events with (logical) timestamps, and watermarks flowing inline. In this example the events are in order (with respect to their timestamps), meaning that the watermarks are simply periodic markers in the stream.
![A data stream with events (in order) and watermarks](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/stream_watermark_in_order.svg)
![A data stream with events (in order) and watermarks](../img/stream_watermark_in_order.svg)
Watermarks are crucial for _out-of-order_ streams, as illustrated below, where the events are not ordered by their timestamps. In general a watermark is a declaration that by that point in the stream, all events up to a certain timestamp should have arrived. Once a watermark reaches an operator, the operator can advance its internal _event time clock_ to the value of the watermark.
![A data stream with events (out of order) and watermarks](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/stream_watermark_out_of_order.svg)
![A data stream with events (out of order) and watermarks](../img/stream_watermark_out_of_order.svg)
Note that event time is inherited by a freshly created stream element (or elements) from either the event that produced them or from watermark that triggered creation of those elements.
......@@ -118,7 +118,7 @@ Some operators consume multiple input streams; a union, for example, or operator
The figure below shows an example of events and watermarks flowing through parallel streams, and operators tracking event time.
![Parallel data streams and operators with events and watermarks](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/parallel_streams_watermarks.svg)
![Parallel data streams and operators with events and watermarks](../img/parallel_streams_watermarks.svg)
Note that the Kafka source supports per-partition watermarking, which you can read more about [here](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_timestamps_watermarks.html#timestamps-per-kafka-partition).
......
......@@ -322,5 +322,5 @@ val stream: DataStream[MyType] = env.addSource(kafkaSource)
![Generating Watermarks with awareness for Kafka-partitions](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/parallel_kafka_watermarks.svg)
![Generating Watermarks with awareness for Kafka-partitions](../img/parallel_kafka_watermarks.svg)
......@@ -6,7 +6,7 @@
Flink offers different levels of abstraction to develop streaming/batch applications.
![Programming levels of abstraction](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/levels_of_abstraction.svg)
![Programming levels of abstraction](../img/levels_of_abstraction.svg)
* The lowest level abstraction simply offers **stateful streaming**. It is embedded into the [DataStream API](../dev/datastream_api.html) via the [Process Function](../dev/stream/operators/process_function.html). It allows users freely process events from one or more streams, and use consistent fault tolerant _state_. In addition, users can register event time and processing time callbacks, allowing programs to realize sophisticated computations.
......@@ -26,7 +26,7 @@ The basic building blocks of Flink programs are **streams** and **transformation
When executed, Flink programs are mapped to **streaming dataflows**, consisting of **streams** and transformation **operators**. Each dataflow starts with one or more **sources** and ends in one or more **sinks**. The dataflows resemble arbitrary **directed acyclic graphs** _(DAGs)_. Although special forms of cycles are permitted via _iteration_ constructs, for the most part we will gloss over this for simplicity.
![A DataStream program, and its dataflow.](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/program_dataflow.svg)
![A DataStream program, and its dataflow.](../img/program_dataflow.svg)
Often there is a one-to-one correspondence between the transformations in the programs and the operators in the dataflow. Sometimes, however, one transformation may consist of multiple transformation operators.
......@@ -38,7 +38,7 @@ Programs in Flink are inherently parallel and distributed. During execution, a _
The number of operator subtasks is the **parallelism** of that particular operator. The parallelism of a stream is always that of its producing operator. Different operators of the same program may have different levels of parallelism.
![A parallel dataflow](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/parallel_dataflow.svg)
![A parallel dataflow](../img/parallel_dataflow.svg)
Streams can transport data between two operators in a _one-to-one_ (or _forwarding_) pattern, or in a _redistributing_ pattern:
......@@ -54,7 +54,7 @@ Aggregating events (e.g., counts, sums) works differently on streams than in bat
Windows can be _time driven_ (example: every 30 seconds) or _data driven_ (example: every 100 elements). One typically distinguishes different types of windows, such as _tumbling windows_ (no overlap), _sliding windows_ (with overlap), and _session windows_ (punctuated by a gap of inactivity).
![Time- and Count Windows](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/windows.svg)
![Time- and Count Windows](../img/windows.svg)
More window examples can be found in this [blog post](https://flink.apache.org/news/2015/12/04/Introducing-windows.html). More details are in the [window docs](../dev/stream/operators/windows.html).
......@@ -68,7 +68,7 @@ When referring to time in a streaming program (for example to define windows), o
* **Processing Time** is the local time at each operator that performs a time-based operation.
![Event Time, Ingestion Time, and Processing Time](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/event_ingestion_processing_time.svg)
![Event Time, Ingestion Time, and Processing Time](../img/event_ingestion_processing_time.svg)
More details on how to handle time are in the [event time docs](//ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html).
......@@ -78,7 +78,7 @@ While many operations in a dataflow simply look at one individual _event at a ti
The state of stateful operations is maintained in what can be thought of as an embedded key/value store. The state is partitioned and distributed strictly together with the streams that are read by the stateful operators. Hence, access to the key/value state is only possible on _keyed streams_, after a _keyBy()_ function, and is restricted to the values associated with the current event’s key. Aligning the keys of streams and state makes sure that all state updates are local operations, guaranteeing consistency without transaction overhead. This alignment also allows Flink to redistribute the state and adjust the stream partitioning transparently.
![State and Partitioning](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/state_partitioning.svg)
![State and Partitioning](../img/state_partitioning.svg)
For more information, see the documentation on [state](../dev/stream/state/index.html).
......
......@@ -845,7 +845,7 @@ dataStream.rebalance();
|
| **Rescaling**
DataStream → DataStream | Partitions elements, round-robin, to a subset of downstream operations. This is useful if you want to have pipelines where you, for example, fan out from each parallel instance of a source to a subset of several mappers to distribute load but don't want the full rebalance that rebalance() would incur. This would require only local data transfers instead of transferring data over network, depending on other configuration values such as the number of slots of TaskManagers.The subset of downstream operations to which the upstream operation sends elements depends on the degree of parallelism of both the upstream and downstream operation. For example, if the upstream operation has parallelism 2 and the downstream operation has parallelism 6, then one upstream operation would distribute elements to three downstream operations while the other upstream operation would distribute to the other three downstream operations. If, on the other hand, the downstream operation has parallelism 2 while the upstream operation has parallelism 6 then three upstream operations would distribute to one downstream operation while the other three upstream operations would distribute to the other downstream operation.In cases where the different parallelisms are not multiples of each other one or several downstream operations will have a differing number of inputs from upstream operations.Please see this figure for a visualization of the connection pattern in the above example:![Checkpoint barriers in data streams](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/rescale.svg)
DataStream → DataStream | Partitions elements, round-robin, to a subset of downstream operations. This is useful if you want to have pipelines where you, for example, fan out from each parallel instance of a source to a subset of several mappers to distribute load but don't want the full rebalance that rebalance() would incur. This would require only local data transfers instead of transferring data over network, depending on other configuration values such as the number of slots of TaskManagers.The subset of downstream operations to which the upstream operation sends elements depends on the degree of parallelism of both the upstream and downstream operation. For example, if the upstream operation has parallelism 2 and the downstream operation has parallelism 6, then one upstream operation would distribute elements to three downstream operations while the other upstream operation would distribute to the other three downstream operations. If, on the other hand, the downstream operation has parallelism 2 while the upstream operation has parallelism 6 then three upstream operations would distribute to one downstream operation while the other three upstream operations would distribute to the other downstream operation.In cases where the different parallelisms are not multiples of each other one or several downstream operations will have a differing number of inputs from upstream operations.Please see this figure for a visualization of the connection pattern in the above example:![Checkpoint barriers in data streams](../img/rescale.svg)
......@@ -909,7 +909,7 @@ dataStream.rebalance()
|
| **Rescaling**
DataStream → DataStream | Partitions elements, round-robin, to a subset of downstream operations. This is useful if you want to have pipelines where you, for example, fan out from each parallel instance of a source to a subset of several mappers to distribute load but don't want the full rebalance that rebalance() would incur. This would require only local data transfers instead of transferring data over network, depending on other configuration values such as the number of slots of TaskManagers. The subset of downstream operations to which the upstream operation sends elements depends on the degree of parallelism of both the upstream and downstream operation. For example, if the upstream operation has parallelism 2 and the downstream operation has parallelism 4, then one upstream operation would distribute elements to two downstream operations while the other upstream operation would distribute to the other two downstream operations. If, on the other hand, the downstream operation has parallelism 2 while the upstream operation has parallelism 4 then two upstream operations would distribute to one downstream operation while the other two upstream operations would distribute to the other downstream operations. In cases where the different parallelisms are not multiples of each other one or several downstream operations will have a differing number of inputs from upstream operations. Please see this figure for a visualization of the connection pattern in the above example: ![Checkpoint barriers in data streams](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/rescale.svg)
DataStream → DataStream | Partitions elements, round-robin, to a subset of downstream operations. This is useful if you want to have pipelines where you, for example, fan out from each parallel instance of a source to a subset of several mappers to distribute load but don't want the full rebalance that rebalance() would incur. This would require only local data transfers instead of transferring data over network, depending on other configuration values such as the number of slots of TaskManagers. The subset of downstream operations to which the upstream operation sends elements depends on the degree of parallelism of both the upstream and downstream operation. For example, if the upstream operation has parallelism 2 and the downstream operation has parallelism 4, then one upstream operation would distribute elements to two downstream operations while the other upstream operation would distribute to the other two downstream operations. If, on the other hand, the downstream operation has parallelism 2 while the upstream operation has parallelism 4 then two upstream operations would distribute to one downstream operation while the other two upstream operations would distribute to the other downstream operations. In cases where the different parallelisms are not multiples of each other one or several downstream operations will have a differing number of inputs from upstream operations. Please see this figure for a visualization of the connection pattern in the above example: ![Checkpoint barriers in data streams](../img/rescale.svg)
......
......@@ -67,7 +67,7 @@ In the following, we show how Flink’s pre-defined window assigners work and ho
A _tumbling windows_ assigner assigns each element to a window of a specified _window size_. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window with a size of 5 minutes, the current window will be evaluated and a new window will be started every five minutes as illustrated by the following figure.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/tumbling-windows.svg)
![](../img/tumbling-windows.svg)
The following code snippets show how to use tumbling windows.
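The snippets themselves are not part of this diff; assuming `input` is a `DataStream` of tuples keyed on field `0`, they look roughly as follows (the `sum(1)` aggregation is a placeholder):

```java
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// tumbling event-time windows of 5 minutes
input
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .sum(1);

// tumbling processing-time windows of 5 minutes
input
    .keyBy(0)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    .sum(1);

// daily tumbling event-time windows, offset by -8 hours (e.g. local days in UTC+8)
input
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
    .sum(1);
```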
......@@ -130,7 +130,7 @@ The _sliding windows_ assigner assigns elements to windows of fixed length. Simi
For example, you could have windows of size 10 minutes that slide by 5 minutes. With this you get a new window every 5 minutes that contains the events that arrived during the last 10 minutes, as depicted by the following figure.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/sliding-windows.svg)
![](../img/sliding-windows.svg)
The following code snippets show how to use sliding windows.
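Again assuming a tuple stream keyed on field `0`, a sketch of the sliding-window variants:

```java
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// sliding event-time windows of 10 minutes, sliding by 5 minutes
input
    .keyBy(0)
    .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(5)))
    .sum(1);

// sliding processing-time windows of 10 minutes, sliding by 5 minutes, offset by -8 hours
input
    .keyBy(0)
    .window(SlidingProcessingTimeWindows.of(Time.minutes(10), Time.minutes(5), Time.hours(-8)))
    .sum(1);
```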
......@@ -191,7 +191,7 @@ As shown in the last example, sliding window assigners also take an optional `of
The _session windows_ assigner groups elements by sessions of activity. Session windows do not overlap and do not have a fixed start and end time, in contrast to _tumbling windows_ and _sliding windows_. Instead, a session window closes when it does not receive elements for a certain period of time, _i.e._, when a gap of inactivity occurs. A session window assigner can be configured with either a static _session gap_ or with a _session gap extractor_ function which defines how long the period of inactivity is. When this period expires, the current session closes and subsequent elements are assigned to a new session window.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/session-windows.svg)
![](../img/session-windows.svg)
The following code snippets show how to use session windows.
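A sketch with a static session gap (the dynamic-gap variant takes a gap extractor function instead of a fixed `Time`):

```java
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// event-time session windows with a static gap of 10 minutes
input
    .keyBy(0)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .sum(1);

// processing-time session windows with a static gap of 10 minutes
input
    .keyBy(0)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .sum(1);
```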
......@@ -277,7 +277,7 @@ Attention Since session windows do not have a fixed start and end, they are eval
A _global windows_ assigner assigns all elements with the same key to the same single _global window_. This windowing scheme is only useful if you also specify a custom [trigger](#triggers). Otherwise, no computation will be performed, as the global window does not have a natural end at which we could process the aggregated elements.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/non-windowed.svg)
![](../img/non-windowed.svg)
The following code snippets show how to use a global window.
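Roughly, with a count trigger standing in for whatever custom trigger the job actually needs:

```java
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;

// all elements of a key go into the same global window; the trigger decides when to evaluate it
input
    .keyBy(0)
    .window(GlobalWindows.create())
    .trigger(CountTrigger.of(100))      // fire every 100 elements per key
    .sum(1);
```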
......
......@@ -33,7 +33,7 @@ In the following section we are going to give an overview over how different kin
When performing a tumbling window join, all elements with a common key and a common tumbling window are joined as pairwise combinations and passed on to a `JoinFunction` or `FlatJoinFunction`. Because this behaves like an inner join, elements of one stream that do not have elements from another stream in their tumbling window are not emitted!
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/tumbling-window-join.svg)
![](../img/tumbling-window-join.svg)
As illustrated in the figure, we define a tumbling window with the size of 2 milliseconds, which results in windows of the form `[0,1], [2,3], ...`. The image shows the pairwise combinations of all elements in each window which will be passed on to the `JoinFunction`. Note that in the tumbling window `[6,7]` nothing is emitted because no elements exist in the green stream to be joined with the orange elements ⑥ and ⑦.
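A sketch of such a join, assuming two `Tuple2<String, Integer>` streams with event-time timestamps already assigned (key field and output are placeholders):

```java
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream<Integer> joined = orangeStream.join(greenStream)
    .where(orange -> orange.f0)                                  // key of the first input
    .equalTo(green -> green.f0)                                  // key of the second input
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply((first, second) -> first.f1 + second.f1);             // called once per pair, per window
```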
......@@ -87,7 +87,7 @@ orangeStream.join(greenStream)
When performing a sliding window join, all elements with a common key and common sliding window are joined as pairwise combinations and passed on to the `JoinFunction` or `FlatJoinFunction`. Elements of one stream that do not have elements from the other stream in the current sliding window are not emitted! Note that some elements might be joined in one sliding window but not in another!
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/sliding-window-join.svg)
![](../img/sliding-window-join.svg)
In this example we are using sliding windows with a size of two milliseconds and slide them by one millisecond, resulting in the sliding windows `[-1, 0],[0,1],[1,2],[2,3], …`. The joined elements below the x-axis are the ones that are passed to the `JoinFunction` for each sliding window. Here you can also see how for example the orange ② is joined with the green ③ in the window `[2,3]`, but is not joined with anything in the window `[1,2]`.
......@@ -141,7 +141,7 @@ orangeStream.join(greenStream)
When performing a session window join, all elements with the same key that when _“combined”_ fulfill the session criteria are joined in pairwise combinations and passed on to the `JoinFunction` or `FlatJoinFunction`. Again this performs an inner join, so if there is a session window that only contains elements from one stream, no output will be emitted!
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/session-window-join.svg)
![](../img/session-window-join.svg)
Here we define a session window join where each session is divided by a gap of at least 1ms. There are three sessions, and in the first two sessions the joined elements from both streams are passed to the `JoinFunction`. In the third session there are no elements in the green stream, so ⑧ and ⑨ are not joined!
......@@ -203,7 +203,7 @@ When a pair of elements are passed to the `ProcessJoinFunction`, they will be as
Note The interval join currently only supports event time.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/interval-join.svg)
![](../img/interval-join.svg)
In the example above, we join two streams ‘orange’ and ‘green’ with a lower bound of -2 milliseconds and an upper bound of +1 millisecond. By default, these boundaries are inclusive, but `.lowerBoundExclusive()` and `.upperBoundExclusive()` can be applied to change the behaviour.
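A sketch of the interval join described above, again assuming keyed `Tuple2<String, Integer>` streams with event-time timestamps:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

orangeStream
    .keyBy(orange -> orange.f0)
    .intervalJoin(greenStream.keyBy(green -> green.f0))
    .between(Time.milliseconds(-2), Time.milliseconds(1))        // bounds are inclusive by default
    .process(new ProcessJoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
        @Override
        public void processElement(Tuple2<String, Integer> left,
                                   Tuple2<String, Integer> right,
                                   Context ctx,
                                   Collector<String> out) {
            out.collect(left + "," + right);
        }
    });
```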
......
......@@ -8,7 +8,7 @@ For distributed execution, Flink _chains_ operator subtasks together into _tasks
The sample dataflow in the figure below is executed with five subtasks, and hence with five parallel threads.
![Operator chaining into Tasks](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/tasks_chains.svg)
![Operator chaining into Tasks](../img/tasks_chains.svg)
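Chaining can be influenced from the API; a sketch (the stream and operators are placeholders):

```java
// turn chaining off for the whole job
env.disableOperatorChaining();

// or tune it per operator
someStream
    .filter(s -> !s.isEmpty())
    .map(String::toUpperCase)
    .startNewChain()                 // begin a new chain at this map()
    .map(String::trim)
    .disableChaining();              // never chain this operator to any other
```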
## Job Managers, Task Managers, Clients
......@@ -26,7 +26,7 @@ The JobManagers and TaskManagers can be started in various ways: directly on the
The **client** is not part of the runtime and program execution, but is used to prepare and send a dataflow to the JobManager. After that, the client can disconnect, or stay connected to receive progress reports. The client runs either as part of the Java/Scala program that triggers the execution, or in the command line process `./bin/flink run ...`.
![The processes involved in executing a Flink dataflow](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/processes.svg)
![The processes involved in executing a Flink dataflow](../img/processes.svg)
## Task Slots and Resources
......@@ -36,7 +36,7 @@ Each _task slot_ represents a fixed subset of resources of the TaskManager. A Ta
By adjusting the number of task slots, users can define how subtasks are isolated from each other. Having one slot per TaskManager means each task group runs in a separate JVM (which can be started in a separate container, for example). Having multiple slots means more subtasks share the same JVM. Tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages. They may also share data sets and data structures, thus reducing the per-task overhead.
![A TaskManager with Task Slots and Tasks](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/tasks_slots.svg)
![A TaskManager with Task Slots and Tasks](../img/tasks_slots.svg)
By default, Flink allows subtasks to share slots even if they are subtasks of different tasks, so long as they are from the same job. The result is that one slot may hold an entire pipeline of the job. Allowing this _slot sharing_ has two main benefits:
......@@ -44,7 +44,7 @@ By default, Flink allows subtasks to share slots even if they are subtasks of di
* It is easier to get better resource utilization. Without slot sharing, the non-intensive _source/map()_ subtasks would block as many resources as the resource intensive _window_ subtasks. With slot sharing, increasing the base parallelism in our example from two to six yields full utilization of the slotted resources, while making sure that the heavy subtasks are fairly distributed among the TaskManagers.
![TaskManagers with shared Task Slots](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/slot_sharing.svg)
![TaskManagers with shared Task Slots](../img/slot_sharing.svg)
The APIs also include a _[resource group](../dev/stream/operators/#task-chaining-and-resource-groups)_ mechanism which can be used to prevent undesirable slot sharing.
......@@ -54,7 +54,7 @@ As a rule-of-thumb, a good default number of task slots would be the number of C
The exact data structures in which the key/values indexes are stored depends on the chosen [state backend](../ops/state/state_backends.html). One state backend stores data in an in-memory hash map, another state backend uses [RocksDB](http://rocksdb.org) as the key/value store. In addition to defining the data structure that holds the state, the state backends also implement the logic to take a point-in-time snapshot of the key/value state and store that snapshot as part of a checkpoint.
![checkpoints and snapshots](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/checkpoints.svg)
![checkpoints and snapshots](../img/checkpoints.svg)
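A sketch of selecting a state backend from the program (the checkpoint URI is a placeholder; the same choice can also be made in `flink-conf.yaml`):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// heap-based state, snapshots written to the given checkpoint directory
env.setStateBackend(new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"));

// alternatively, keep state in embedded RocksDB instances
// (requires the flink-statebackend-rocksdb dependency; this constructor may throw IOException)
// env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:40010/flink/checkpoints"));
```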
## Savepoints
......
......@@ -14,7 +14,7 @@ Naively accessing data in the external database, for example in a `MapFunction`,
Asynchronous interaction with the database means that a single parallel function instance can handle many requests concurrently and receive the responses concurrently. That way, the waiting time can be overlaid with sending other requests and receiving responses. At the very least, the waiting time is amortized over multiple requests. In most cases this leads to much higher streaming throughput.
![](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/async_io.svg)
![](../img/async_io.svg)
_Note:_ Improving throughput by just scaling the `MapFunction` to a very high parallelism is in some cases possible as well, but usually comes at a very high resource cost: Having many more parallel MapFunction instances means more tasks, threads, Flink-internal network connections, network connections to the database, buffers, and general internal bookkeeping overhead.
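A sketch of the asynchronous pattern; `MyAsyncClient` and its `lookup()` method are hypothetical stand-ins for whatever non-blocking client the external system provides:

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

class AsyncDatabaseRequest extends RichAsyncFunction<String, String> {

    private transient MyAsyncClient client;          // hypothetical non-blocking client

    @Override
    public void open(Configuration parameters) {
        client = new MyAsyncClient();
    }

    @Override
    public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
        CompletableFuture<String> lookup = client.lookup(key);   // returns immediately
        lookup.whenComplete((value, error) ->
            resultFuture.complete(Collections.singleton(key + ": " + value)));
    }
}

// at most 100 requests in flight, 1 second timeout, results may arrive out of order
DataStream<String> result = AsyncDataStream.unorderedWait(
    input, new AsyncDatabaseRequest(), 1000, TimeUnit.MILLISECONDS, 100);
```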
......
......@@ -390,11 +390,11 @@ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wiki-result
You can also check out the Flink dashboard which should be running at [http://localhost:8081](http://localhost:8081). You get an overview of your cluster resources and running jobs:
[![JobManager Overview](https://ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-example/jobmanager-overview.png)](//ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-example/jobmanager-overview.png)
[![JobManager Overview](../img/quickstart-example/jobmanager-overview.png)](//ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-example/jobmanager-overview.png)
If you click on your running job you will get a view where you can inspect individual operations and, for example, see the number of processed elements:
[![Example Job View](https://ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-example/jobmanager-job.png)](//ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-example/jobmanager-job.png)
[![Example Job View](../img/quickstart-example/jobmanager-job.png)](//ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-example/jobmanager-job.png)
This concludes our little tour of Flink. If you have any questions, please don’t hesitate to ask on our [Mailing Lists](http://flink.apache.org/community.html#mailing-lists).
......@@ -69,7 +69,7 @@ $ ./bin/start-cluster.sh # Start Flink
Check the **Dispatcher’s web frontend** at [http://localhost:8081](http://localhost:8081) and make sure everything is up and running. The web frontend should report a single available TaskManager instance.
[![Dispatcher: Overview](https://ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-setup/jobmanager-1.png)](//ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-setup/jobmanager-1.png)
[![Dispatcher: Overview](../img/quickstart-setup/jobmanager-1.png)](//ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-setup/jobmanager-1.png)
You can also verify that the system is running by checking the log files in the `logs` directory:
......@@ -237,7 +237,7 @@ Starting execution of program
The program connects to the socket and waits for input. You can check the web interface to verify that the job is running as expected:
[![Dispatcher: Overview (cont'd)](https://ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-setup/jobmanager-2.png)](//ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-setup/jobmanager-2.png)[![Dispatcher: Running Jobs](https://ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-setup/jobmanager-3.png)](//ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-setup/jobmanager-3.png)
[![Dispatcher: Overview (cont'd)](../img/quickstart-setup/jobmanager-2.png)](//ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-setup/jobmanager-2.png)[![Dispatcher: Running Jobs](../img/quickstart-setup/jobmanager-3.png)](//ci.apache.org/projects/flink/flink-docs-release-1.7/page/img/quickstart-setup/jobmanager-3.png)
* Words are counted in time windows of 5 seconds (processing time, tumbling windows) and are printed to `stdout`. Monitor the TaskManager’s output file and write some text in `nc` (input is sent to Flink line by line after hitting `<return>`):
......
......@@ -37,7 +37,7 @@ The following table provides an overview of both operators:
The **iterate operator** covers the _simple form of iterations_: in each iteration, the **step function** consumes the **entire input** (the _result of the previous iteration_, or the _initial data set_), and computes the **next version of the partial solution** (e.g. `map`, `reduce`, `join`, etc.).
![Iterate Operator](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/iterations_iterate_operator.png)
![Iterate Operator](../img/iterations_iterate_operator.png)
1. **Iteration Input**: Initial input for the _first iteration_ from a _data source_ or _previous operators_.
2. **Step Function**: The step function will be executed in each iteration. It is an arbitrary data flow consisting of operators like `map`, `reduce`, `join`, etc. and depends on your specific task at hand.
......@@ -71,7 +71,7 @@ See the **[Programming Guide](index.html)** for details and code examples.
In the following example, we **iteratively increment a set of numbers**:
![Iterate Operator Example](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/iterations_iterate_operator_example.png)
![Iterate Operator Example](../img/iterations_iterate_operator_example.png)
1. **Iteration Input**: The initial input is read from a data source and consists of five single-field records (integers `1` to `5`).
2. **Step function**: The step function is a single `map` operator, which increments the integer field from `i` to `i+1`. It will be applied to every record of the input.
......@@ -99,7 +99,7 @@ The **delta iterate operator** covers the case of **incremental iterations**. In
Where applicable, this leads to **more efficient algorithms**, because not every element in the solution set changes in each iteration. This allows the computation to **focus on the hot parts** of the solution and leave the **cold parts untouched**. Frequently, the majority of the solution cools down comparatively fast and the later iterations operate only on a small subset of the data.
![Delta Iterate Operator](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/iterations_delta_iterate_operator.png)
![Delta Iterate Operator](../img/iterations_delta_iterate_operator.png)
1. **Iteration Input**: The initial workset and solution set are read from _data sources_ or _previous operators_ as input to the first iteration.
2. **Step Function**: The step function will be executed in each iteration. It is an arbitrary data flow consisting of operators like `map`, `reduce`, `join`, etc. and depends on your specific task at hand.
......@@ -133,7 +133,7 @@ See the **[programming guide](index.html)** for details and code examples.
In the following example, every vertex has an **ID** and a **coloring**. Each vertex will propagate its vertex ID to neighboring vertices. The **goal** is to _assign the minimum ID to every vertex in a subgraph_. If a received ID is smaller than the current one, it changes to the color of the vertex with the received ID. One application of this can be found in _community analysis_ or _connected components_ computation.
![Delta Iterate Operator Example](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/iterations_delta_iterate_operator_example.png)
![Delta Iterate Operator Example](../img/iterations_delta_iterate_operator_example.png)
The **initial input** is set as **both workset and solution set.** In the above figure, the colors visualize the **evolution of the solution set**. With each iteration, the color of the minimum ID is spreading in the respective subgraph. At the same time, the amount of work (exchanged and compared vertex IDs) decreases with each iteration. This corresponds to the **decreasing size of the workset**, which goes from all seven vertices to zero after three iterations, at which time the iteration terminates. The **important observation** is that _the lower subgraph converges before the upper half_ does and the delta iteration is able to capture this with the workset abstraction.
......@@ -151,5 +151,5 @@ The iteration **terminates**, when the workset is empty after the **3rd iteratio
We referred to each execution of the step function of an iteration operator as _a single iteration_. In parallel setups, **multiple instances of the step function are evaluated in parallel** on different partitions of the iteration state. In many settings, one evaluation of the step function on all parallel instances forms a so called **superstep**, which is also the granularity of synchronization. Therefore, _all_ parallel tasks of an iteration need to complete the superstep, before a next superstep will be initialized. **Termination criteria** will also be evaluated at superstep barriers.
![Supersteps](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/iterations_supersteps.png)
![Supersteps](../img/iterations_supersteps.png)
......@@ -34,7 +34,7 @@ It is important to note that the result of a continuous query is always semantic
The following figure visualizes the relationship of streams, dynamic tables, and continuous queries:
<center>![Dynamic tables](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/table-streaming/stream-query-stream.png)</center>
<center>![Dynamic tables](../img/stream-query-stream.png)</center>
1. A stream is converted into a dynamic table.
2. A continuous query is evaluated on the dynamic table yielding a new dynamic table.
......@@ -62,7 +62,7 @@ In order to process a stream with a relational query, it has to be converted int
The following figure visualizes how the stream of click event (left-hand side) is converted into a table (right-hand side). The resulting table is continuously growing as more records of the click stream are inserted.
<center>![Append mode](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/table-streaming/append-mode.png)</center>
<center>![Append mode](../img/append-mode.png)</center>
**Note:** A table which is defined on a stream is internally not materialized.
......@@ -76,13 +76,13 @@ In the following we show two example queries on a `clicks` table that is defined
The first query is a simple `GROUP-BY COUNT` aggregation query. It groups the `clicks` table on the `user` field and counts the number of visited URLs. The following figure shows how the query is evaluated over time as the `clicks` table is updated with additional rows.
<center>![Continuous Non-Windowed Query](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/table-streaming/query-groupBy-cnt.png)</center>
<center>![Continuous Non-Windowed Query](../img/query-groupBy-cnt.png)</center>
When the query is started, the `clicks` table (left-hand side) is empty. The query starts to compute the result table, when the first row is inserted into the `clicks` table. After the first row `[Mary, ./home]` was inserted, the result table (right-hand side, top) consists of a single row `[Mary, 1]`. When the second row `[Bob, ./cart]` is inserted into the `clicks` table, the query updates the result table and inserts a new row `[Bob, 1]`. The third row `[Mary, ./prod?id=1]` yields an update of an already computed result row such that `[Mary, 1]` is updated to `[Mary, 2]`. Finally, the query inserts a third row `[Liz, 1]` into the result table, when the fourth row is appended to the `clicks` table.
The second query is similar to the first one, but groups the `clicks` table not only on the `user` attribute but also on an [hourly tumbling window](../sql.html#group-windows) before it counts the number of URLs (time-based computations such as windows are based on special [time attributes](time_attributes.html), which are discussed later). Again, the figure shows the input and output at different points in time to visualize the changing nature of dynamic tables.
<center>![Continuous Group-Window Query](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/table-streaming/query-groupBy-window-cnt.png)</center>
<center>![Continuous Group-Window Query](../img/query-groupBy-window-cnt.png)</center>
As before, the input table `clicks` is shown on the left. The query continuously computes results every hour and updates the result table. The clicks table contains four rows with timestamps (`cTime`) between `12:00:00` and `12:59:59`. The query computes two result rows from this input (one for each `user`) and appends them to the result table. For the next window between `13:00:00` and `13:59:59`, the `clicks` table contains three rows, which results in another two rows being appended to the result table. The result table is updated as more rows are appended to `clicks` over time.
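Expressed in SQL through the Java Table API, the two queries look roughly as follows; `tableEnv` and the registered `clicks` table are assumed, and the `user` column from the figures would in practice need quoting or renaming, since `USER` is a reserved SQL keyword:

```java
// non-windowed aggregation from the first figure
Table userCounts = tableEnv.sqlQuery(
    "SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user");

// hourly tumbling-window aggregation from the second figure (cTime is the table's time attribute)
Table hourlyCounts = tableEnv.sqlQuery(
    "SELECT user, TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT, COUNT(url) AS cnt " +
    "FROM clicks GROUP BY user, TUMBLE(cTime, INTERVAL '1' HOUR)");
```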
......@@ -139,11 +139,11 @@ When converting a dynamic table into a stream or writing it to an external syste
* **Retract stream:** A retract stream is a stream with two types of messages, _add messages_ and _retract messages_. A dynamic table is converted into a retract stream by encoding an `INSERT` change as an add message, a `DELETE` change as a retract message, and an `UPDATE` change as a retract message for the updated (previous) row and an add message for the updating (new) row. The following figure visualizes the conversion of a dynamic table into a retract stream.
<center>![Dynamic tables](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/table-streaming/undo-redo-mode.png)</center>
<center>![Dynamic tables](../img/undo-redo-mode.png)</center>
* **Upsert stream:** An upsert stream is a stream with two types of messages, _upsert messages_ and _delete messages_. A dynamic table that is converted into an upsert stream requires a (possibly composite) unique key. A dynamic table with unique key is converted into a stream by encoding `INSERT` and `UPDATE` changes as upsert messages and `DELETE` changes as delete messages. The stream consuming operator needs to be aware of the unique key attribute in order to apply messages correctly. The main difference to a retract stream is that `UPDATE` changes are encoded with a single message and hence more efficient. The following figure visualizes the conversion of a dynamic table into an upsert stream.
<center>![Dynamic tables](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/table-streaming/redo-mode.png)</center>
<center>![Dynamic tables](../img/redo-mode.png)</center>
The API to convert a dynamic table into a `DataStream` is discussed on the [Common Concepts](../common.html#convert-a-table-into-a-datastream) page. Please note that only append and retract streams are supported when converting a dynamic table into a `DataStream`. The `TableSink` interface to emit a dynamic table to an external system are discussed on the [TableSources and TableSinks](../sourceSinks.html#define-a-tablesink) page.
......@@ -235,7 +235,7 @@ class CustomTypeSplit extends TableFunction[Row] {
User-Defined Aggregate Functions (UDAGGs) aggregate a table (one or more rows with one or more attributes) to a scalar value.
<center>![UDAGG mechanism](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/udagg-mechanism.png)</center>
<center>![UDAGG mechanism](../img/udagg-mechanism.png)</center>
The above figure shows an example of an aggregation. Assume you have a table that contains data about beverages. The table consists of three columns, `id`, `name` and `price` and 5 rows. Imagine you need to find the highest price of all beverages in the table, i.e., perform a `max()` aggregation. You would need to check each of the 5 rows and the result would be a single numeric value.
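A minimal sketch of such a UDAGG in Java (accumulator and column types are assumptions; `accumulate` is discovered by name rather than declared in the base class):

```java
import org.apache.flink.table.functions.AggregateFunction;

// mutable accumulator holding the maximum price seen so far
public class MaxAccumulator {
    public Double max;
}

public class MaxPrice extends AggregateFunction<Double, MaxAccumulator> {

    @Override
    public MaxAccumulator createAccumulator() {
        return new MaxAccumulator();
    }

    // called once per input row
    public void accumulate(MaxAccumulator acc, Double price) {
        if (price != null && (acc.max == null || price > acc.max)) {
            acc.max = price;
        }
    }

    @Override
    public Double getValue(MaxAccumulator acc) {
        return acc.max;
    }
}

// registration (the name is arbitrary): tableEnv.registerFunction("maxPrice", new MaxPrice());
```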
......
......@@ -6,7 +6,7 @@ Flink’s Table & SQL API makes it possible to work with queries written in the
The _SQL Client_ aims to provide an easy way of writing, debugging, and submitting table programs to a Flink cluster without a single line of Java or Scala code. The _SQL Client CLI_ allows for retrieving and visualizing real-time results from the running distributed application on the command line.
[![Animated demo of the Flink SQL Client CLI running table programs on a cluster](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/sql_client_demo.gif)](//ci.apache.org/projects/flink/flink-docs-release-1.7/fig/sql_client_demo.gif)
[![Animated demo of the Flink SQL Client CLI running table programs on a cluster](../img/sql_client_demo.gif)](../img/sql_client_demo.gif)
Attention The SQL Client is in an early development phase. Even though the application is not production-ready yet, it can be a quite useful tool for prototyping and playing around with Flink SQL. In the future, the community plans to extend its functionality by providing a REST-based [SQL Client Gateway](sqlClient.html#limitations--future).
......
......@@ -42,7 +42,7 @@ To visualize the execution plan, do the following:
After these steps, a detailed execution plan will be visualized.
![A flink job execution graph.](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/plan_visualizer.png)
![A flink job execution graph.](../img/plan_visualizer.png)
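The JSON that the plan visualizer consumes can be obtained from the execution environment, e.g. (from code that may throw `Exception`):

```java
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// ... define the program: sources, transformations, sinks ...
System.out.println(env.getExecutionPlan());   // JSON dump of the execution plan
```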
**Web Interface**
......
......@@ -432,7 +432,7 @@ val graph: Graph[Long, Long, Long] = ...
![Filter Transformations](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/gelly-filter.png)
![Filter Transformations](../img/gelly-filter.png)
* **Join**: Gelly provides specialized methods for joining the vertex and edge datasets with other input datasets. `joinWithVertices` joins the vertices with a `Tuple2` input data set. The join is performed using the vertex ID and the first field of the `Tuple2` input as the join keys. The method returns a new `Graph` where the vertex values have been updated according to a provided user-defined transformation function. Similarly, an input dataset can be joined with the edges, using one of three methods. `joinWithEdges` expects an input `DataSet` of `Tuple3` and joins on the composite key of both source and target vertex IDs. `joinWithEdgesOnSource` expects a `DataSet` of `Tuple2` and joins on the source key of the edges and the first attribute of the input dataset and `joinWithEdgesOnTarget` expects a `DataSet` of `Tuple2` and joins on the target key of the edges and the first attribute of the input dataset. All three methods apply a transformation function on the edge and the input data set values. Note that if the input dataset contains a key multiple times, all Gelly join methods will only consider the first value encountered.
......@@ -472,7 +472,7 @@ val vertexOutDegrees: DataSet[(Long, LongValue)] = network.outDegrees
* **Union**: Gelly’s `union()` method performs a union operation on the vertex and edge sets of the specified graph and the current graph. Duplicate vertices are removed from the resulting `Graph`, while duplicate edges, if they exist, are preserved.
![Union Transformation](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/gelly-union.png)
![Union Transformation](../img/gelly-union.png)
* **Difference**: Gelly’s `difference()` method performs a difference on the vertex and edge sets of the current graph and the specified graph.
......@@ -582,7 +582,7 @@ Neighborhood methods allow vertices to perform an aggregation on their first-hop
For example, assume that you want to select the minimum weight of all out-edges for each vertex in the following graph:
![reduceOnEdges Example](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/gelly-example-graph.png)
![reduceOnEdges Example](../img/gelly-example-graph.png)
The following code will collect the out-edges for each vertex and apply the `SelectMinWeight()` user-defined function on each of the resulting neighborhoods:
......@@ -621,7 +621,7 @@ val minWeights = graph.reduceOnEdges(new SelectMinWeight, EdgeDirection.OUT)
![reduceOnEdges Example](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/gelly-reduceOnEdges.png)
![reduceOnEdges Example](../img/gelly-reduceOnEdges.png)
Similarly, assume that you would like to compute the sum of the values of all incoming neighbors for every vertex. The following code will collect the incoming neighbors for each vertex and apply the `SumValues()` user-defined function on each neighborhood:
......@@ -660,7 +660,7 @@ val verticesWithSum = graph.reduceOnNeighbors(new SumValues, EdgeDirection.IN)
![reduceOnNeighbors Example](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/gelly-reduceOnNeighbors.png)
![reduceOnNeighbors Example](../img/gelly-reduceOnNeighbors.png)
When the aggregation function is not associative and commutative or when it is desirable to return more than one value per vertex, one can use the more general `groupReduceOnEdges()` and `groupReduceOnNeighbors()` methods. These methods return zero, one or more values per vertex and provide access to the whole neighborhood.
......
......@@ -10,7 +10,7 @@ The vertex-centric model, also known as “think like a vertex” or “Pregel
The computational model is shown in the figure below. The dotted boxes correspond to parallelization units. In each superstep, all active vertices execute the same user-defined computation in parallel. Supersteps are executed synchronously, so that messages sent during one superstep are guaranteed to be delivered in the beginning of the next superstep.
![Vertex-Centric Computational Model](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/vertex-centric supersteps.png)
![Vertex-Centric Computational Model](../img/vertex-centric supersteps.png)
To use vertex-centric iterations in Gelly, the user only needs to define the vertex compute function, `ComputeFunction`. This function and the maximum number of iterations to run are given as parameters to Gelly’s `runVertexCentricIteration`. This method will execute the vertex-centric iteration on the input Graph and return a new Graph, with updated vertex values. An optional message combiner, `MessageCombiner`, can be defined to reduce communication costs.
......@@ -237,7 +237,7 @@ A scatter-gather iteration can be extended with information such as the total nu
Let us consider computing Single-Source-Shortest-Paths with scatter-gather iterations on the following graph and let vertex 1 be the source. In each superstep, each vertex sends a candidate distance message to all its neighbors. The message value is the sum of the current value of the vertex and the edge weight connecting this vertex with its neighbor. Upon receiving candidate distance messages, each vertex calculates the minimum distance and, if a shorter path has been discovered, it updates its value. If a vertex does not change its value during a superstep, then it does not produce messages for its neighbors for the next superstep. The algorithm converges when there are no value updates.
![Scatter-gather SSSP superstep 1](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/gelly-vc-sssp1.png)
![Scatter-gather SSSP superstep 1](../img/gelly-vc-sssp1.png)
......@@ -563,7 +563,7 @@ Like in the scatter-gather model, Gather-Sum-Apply also proceeds in synchronized
Let us consider computing Single-Source-Shortest-Paths with GSA on the following graph and let vertex 1 be the source. During the `Gather` phase, we calculate the new candidate distances, by adding each vertex value with the edge weight. In `Sum`, the candidate distances are grouped by vertex ID and the minimum distance is chosen. In `Apply`, the newly calculated distance is compared to the current vertex value and the minimum of the two is assigned as the new value of the vertex.
![GSA SSSP superstep 1](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/gelly-gsa-sssp1.png)
![GSA SSSP superstep 1](../img/gelly-gsa-sssp1.png)
Notice that, if a vertex does not change its value during a superstep, it will not calculate candidate distance during the next superstep. The algorithm converges when no vertex changes value.
......
......@@ -79,7 +79,7 @@ Graph<String, String, Long, Long, Double> graph = BipartiteGraph.fromDataSet(top
* **Projection**: Projection is a common operation for bipartite graphs that converts a bipartite graph into a regular graph. There are two types of projections: top and bottom projections. Top projection preserves only top nodes in the result graph and creates a link between them in the new graph only if there is an intermediate bottom node both top nodes connect to in the original graph. Bottom projection is the opposite of top projection, i.e., it preserves only bottom nodes and connects a pair of nodes if they are connected in the original graph.
![Bipartite Graph Projections](https://ci.apache.org/projects/flink/flink-docs-release-1.7/fig/bipartite_graph_projections.png)
![Bipartite Graph Projections](../img/bipartite_graph_projections.png)
Gelly supports two sub-types of projections: simple projections and full projections. The only difference between them is what data is associated with edges in the result graph.
......