Table API and SQL queries have the same semantics regardless whether their input is bounded batch input or unbounded stream input. In many cases, continuous queries on streaming input are capable of computing accurate results that are identical to offline computed results. However, this is not possible in general case because continuous queries have to restrict the size of the state they are maintaining in order to avoid to run out of storage and to be able to process unbounded streaming data over a long period of time. As a result, a continuous query might only be able to provide approximated results depending on the characteristics of the input data and the query itself.
Flink’s Table API and SQL interface provide parameters to tune the accuracy and resource consumption of continuous queries. The parameters are specified via a `QueryConfig` object. The `QueryConfig` can be obtained from the `TableEnvironment` and is passed back when a `Table` is translated, i.e., when it is [transformed into a DataStream](../common.html#convert-a-table-into-a-datastream-or-dataset) or [emitted via a TableSink](../common.html#emit-a-table).
@@ -70,20 +64,12 @@ val tableEnv = TableEnvironment.getTableEnvironment(env)
In the following we describe the parameters of the `QueryConfig` and how they affect the accuracy and resource consumption of a query.
在下文中,我们将描述 `QueryConfig` 的参数以及它们如何影响查询的准确性和资源消耗。
## Idle State Retention Time
## 空闲状态保留时间
Many queries aggregate or join records on one or more key attributes. When such a query is executed on a stream, the continuous query needs to collect records or maintain partial results per key. If the key domain of the input stream is evolving, i.e., the active key values are changing over time, the continuous query accumulates more and more state as more and more distinct keys are observed. However, often keys become inactive after some time and their corresponding state becomes stale and useless.
For example the following query computes the number of clicks per session.
例如,以下查询计算每个会话(session)的单击次数。
...
...
@@ -94,29 +80,17 @@ SELECT sessionId, COUNT(*) FROM clicks GROUP BY sessionId;
The `sessionId` attribute is used as a grouping key and the continuous query maintains a count for each `sessionId` it observes. The `sessionId` attribute is evolving over time and `sessionId` values are only active until the session ends, i.e., for a limited period of time. However, the continuous query cannot know about this property of `sessionId` and expects that every `sessionId` value can occur at any point of time. It maintains a count for each observed `sessionId` value. Consequently, the total state size of the query is continuously growing as more and more `sessionId` values are observed.
The _Idle State Retention Time_ parameters define for how long the state of a key is retained without being updated before it is removed. For the previous example query, the count of a `sessionId` would be removed as soon as it has not been updated for the configured period of time.
_空闲状态保留时间参数(Idle State Retention Time)_ 定义了在删除 key 之前保留 key 状态多长时间而不进行更新。对于前面的示例查询,只要在配置的时间段内没有更新 `sessionId`,就会删除它的计数。
By removing the state of a key, the continuous query completely forgets that it has seen this key before. If a record with a key, whose state has been removed before, is processed, the record will be treated as if it was the first record with the respective key. For the example above this means that the count of a `sessionId` would start again at `0`.
There are two parameters to configure the idle state retention time:
配置空闲状态保留时间有两个参数:
* The _minimum idle state retention time_ defines how long the state of an inactive key is at least kept before it is removed.
* The _maximum idle state retention time_ defines how long the state of an inactive key is at most kept before it is removed.
* _minimum idle state retention time_ 定义了非活动key的状态在被删除之前至少保持多长时间。
* _maximum idle state retention time_ 义了非活动key的状态在被删除之前最多保留多长时间。
The parameters are specified as follows:
参数规定如下:
...
...
@@ -140,6 +114,4 @@ val qConfig: StreamQueryConfig = ???
Cleaning up state requires additional bookkeeping which becomes less expensive for larger differences of `minTime` and `maxTime`. The difference between `minTime` and `maxTime` must be at least 5 minutes.