Commit 82974e5e, author: AlexZFX, committer: Ivan Blinkov

zhdocs: translate zh collapsingmergetree.md (#5168)

* translate zh collapsingmergetree.md and fix zh attention format

* fix the create table link
Parent c280907f
......@@ -110,9 +110,9 @@ When ClickHouse merges data parts, each group of consecutive rows with the same
For each resulting data part ClickHouse saves:
1. The first "cancel" and the last "state" rows, if the number of "state" and "cancel" rows matches.
1. The last "state" row, if there is one more "state" row than "cancel" rows.
1. The first "cancel" row, if there is one more "cancel" row than "state" rows.
1. None of the rows, in all other cases.
2. The last "state" row, if there is one more "state" row than "cancel" rows.
3. The first "cancel" row, if there is one more "cancel" row than "state" rows.
4. None of the rows, in all other cases.
The merge continues, but ClickHouse treats this situation as a logical error and records it in the server log. This error can occur if the same data were inserted more than once.
......
# CollapsingMergeTree {#table_engine-collapsingmergetree}
The engine inherits from [MergeTree](mergetree.md) and adds the logic of rows collapsing to data parts merge algorithm.
该引擎继承于 [MergeTree](mergetree.md),并在数据块合并算法中添加了折叠行的逻辑。
`CollapsingMergeTree` asynchronously deletes (collapses) pairs of rows if all of the fields in a row are equivalent excepting the particular field `Sign` which can have `1` and `-1` values. Rows without a pair are kept. For more details see the [Collapsing](#collapsing) section of the document.
`CollapsingMergeTree` 会异步的删除(折叠)这些除了特定列 `Sign` 有 `1` 和 `-1` 的值以外,其余所有字段的值都相等的成对的行。没有成对的行会被保留。更多的细节请看本文的[折叠](#table_engine-collapsingmergetree-collapsing)部分。
The engine may significantly reduce the volume of storage and increase efficiency of `SELECT` query as a consequence.
因此,该引擎可以显著的降低存储量并提高 `SELECT` 查询效率。
## Creating a Table
## 建表
```sql
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
......@@ -21,22 +21,22 @@ CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
[SETTINGS name=value, ...]
```
For a description of request parameters, see [request description](../../query_language/create.md).
请求参数的描述,参考[请求参数](../../query_language/create.md)
**CollapsingMergeTree Parameters**
**CollapsingMergeTree 参数**
- `sign` — Name of the column with the type of row: `1` is a "state" row, `-1` is a "cancel" row.
- `sign` — 类型列的名称: `1` 是“状态”行,`-1` 是“取消”行。
Column data type — `Int8`.
列数据类型 — `Int8`
**Query clauses**
**子句**
When creating a `CollapsingMergeTree` table, the same [clauses](mergetree.md) are required, as when creating a `MergeTree` table.
创建 `CollapsingMergeTree` 表时,需要与创建 `MergeTree` 表时相同的[子句](mergetree.md#table_engine-mergetree-creating-a-table)
<details markdown="1"><summary>Deprecated Method for Creating a Table</summary>
<details markdown="1"><summary>已弃用的建表方法</summary>
!!! attention
Do not use this method in new projects and, if possible, switch the old projects to the method described above.
!!! attention "注意"
不要在新项目中使用该方法,可能的话,请将旧项目切换到上述方法。
```sql
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
......@@ -47,23 +47,23 @@ CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
) ENGINE [=] CollapsingMergeTree(date-column [, sampling_expression], (primary, key), index_granularity, sign)
```
All of the parameters excepting `sign` have the same meaning as in `MergeTree`.
除了 `sign` 的所有参数都与 `MergeTree` 中的含义相同。
- `sign` — Name of the column with the type of row: `1` — "state" row, `-1` — "cancel" row.
- `sign` — 类型列的名称: `1` 是“状态”行,`-1` 是“取消”行。
Column Data Type — `Int8`.
列数据类型 — `Int8`。
</details>
## Collapsing
## 折叠 {#table_engine-collapsingmergetree-collapsing}
### Data
### 数据
Consider a situation where you need to save continually changing data for some object. It seems logical to keep one row per object and update it on every change, but an update operation is expensive and slow for a DBMS because it requires rewriting data in storage. If you need to write data quickly, updates are not acceptable, but you can write the changes of an object sequentially as follows.
考虑你需要为某个对象保存不断变化的数据的情景。似乎为一个对象保存一行记录并在其发生任何变化时更新记录是合乎逻辑的,但是更新操作对 DBMS 来说是昂贵且缓慢的,因为它需要重写存储中的数据。如果你需要快速的写入数据,则更新操作是不可接受的,但是你可以按下面的描述顺序地更新一个对象的变化。
Use the particular column `Sign` when writing row. If `Sign = 1` it means that the row is a state of an object, let's call it "state" row. If `Sign = -1` it means the cancellation of the state of an object with the same attributes, let's call it "cancel" row.
在写入行的时候使用特定的列 `Sign`。如果 `Sign = 1` 则表示这一行是对象的状态,我们称之为“状态”行。如果 `Sign = -1` 则表示是对具有相同属性的状态行的取消,我们称之为“取消”行。
For example, we want to calculate how many pages users visited on some site and how long they stayed there. At some moment we write the following row with the state of user activity:
例如,我们想要计算用户在某个站点访问的页面页面数以及他们在那里停留的时间。在某个时候,我们将用户的活动状态写入下面这样的行。
```
┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┐
......@@ -71,7 +71,7 @@ For example, we want to calculate how much pages users checked at some site and
└─────────────────────┴───────────┴──────────┴──────┘
```
At some moment later we register the change of user activity and write it with the following two rows.
一段时间后,我们写入下面的两行来记录用户活动的变化。
```
┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┐
......@@ -80,11 +80,11 @@ At some moment later we register the change of user activity and write it with t
└─────────────────────┴───────────┴──────────┴──────┘
```
The first row cancels the previous state of the object (user). It should copy all of the fields of the canceled state excepting `Sign`.
第一行取消了这个对象(用户)的状态。它需要复制被取消的状态行的所有除了 `Sign` 的属性。
The second row contains the current state.
第二行包含了当前的状态。
As we need only the last state of user activity, the rows
因为我们只需要用户活动的最后状态,这些行
```
┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┐
......@@ -93,43 +93,43 @@ As we need only the last state of user activity, the rows
└─────────────────────┴───────────┴──────────┴──────┘
```
can be deleted, collapsing the invalid (old) state of the object. `CollapsingMergeTree` does this while merging the data parts.
可以在折叠对象的失效(老的)状态的时候被删除。`CollapsingMergeTree` 会在合并数据片段的时候做这件事。
To see why two rows are needed for each change, read the "Algorithm" paragraph.
为什么我们每次改变需要 2 行可以阅读[算法](#table_engine-collapsingmergetree-collapsing-algorithm)段。
**Peculiar properties of such approach**
**这种方法的特殊属性**
1. The program that writes the data should remember the state of an object in order to cancel it. The "cancel" row should be a copy of the "state" row with the opposite `Sign`. This increases the initial size of storage but allows data to be written quickly.
2. Long, growing arrays in columns reduce the efficiency of the engine because of the write load. The simpler the data, the higher the efficiency.
3. `SELECT` results depend strongly on the consistency of the object change history. Be accurate when preparing data for insertion. Inconsistent data can produce unpredictable results, for example negative values for non-negative metrics such as session depth.
1. 写入的程序应该记住对象的状态从而可以取消它。“取消”字符串应该是“状态”字符串的复制,除了相反的 `Sign`。它增加了存储的初始数据的大小,但使得写入数据更快速。
2. 由于写入的负载,列中长的增长阵列会降低引擎的效率。数据越简单,效率越高。
3. `SELECT` 的结果很大程度取决于对象变更历史的一致性。在准备插入数据时要准确。在不一致的数据中会得到不可预料的结果,例如,像会话深度这种非负指标的负值。
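The first property above (the writer has to remember the previous state so that it can cancel it) can be sketched in Python. This is only an illustration under assumptions: `StateWriter` and `rows_for_update` are hypothetical names that do not belong to ClickHouse or to any client library, and the tuples simply mirror the `UserID, PageViews, Duration, Sign` columns used in the example.

```python
class StateWriter:
    """Hypothetical helper: remembers the last state per object so it can be cancelled."""

    def __init__(self):
        # Maps UserID -> (PageViews, Duration) of the last written "state" row.
        self._last_state = {}

    def rows_for_update(self, user_id, page_views, duration):
        """Return the rows to insert for a new state of `user_id`."""
        rows = []
        previous = self._last_state.get(user_id)
        if previous is not None:
            # "Cancel" row: a copy of the previous "state" row with Sign = -1.
            rows.append((user_id, previous[0], previous[1], -1))
        # "State" row: the current values with Sign = 1.
        rows.append((user_id, page_views, duration, 1))
        self._last_state[user_id] = (page_views, duration)
        return rows
```

The first call for an object yields a single "state" row. Every later call yields a "cancel" row for the remembered state followed by the new "state" row, which is the two-row pattern shown in the tables above.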
### Algorithm
### 算法 {#table_engine-collapsingmergetree-collapsing-algorithm}
When ClickHouse merges data parts, each group of consecutive rows with the same primary key is reduced to not more than two rows, one with `Sign = 1` ("state" row) and another with `Sign = -1` ("cancel" row). In other words, entries collapse.
当 ClickHouse 合并数据片段时,每组具有相同主键的连续行被减少到不超过两行,一行 `Sign = 1`(“状态”行),另一行 `Sign = -1` (“取消”行),换句话说,数据项被折叠了。
For each resulting data part ClickHouse saves:
对每个结果的数据部分 ClickHouse 保存:
1. The first "cancel" and the last "state" rows, if the number of "state" and "cancel" rows matches.
1. The last "state" row, if there is one more "state" row than "cancel" rows.
1. The first "cancel" row, if there is one more "cancel" row than "state" rows.
1. None of the rows, in all other cases.
1. 第一个“取消”和最后一个“状态”行,如果“状态”和“取消”行的数量匹配
2. 最后一个“状态”行,如果“状态”行比“取消”行多一个。
3. 第一个“取消”行,如果“取消”行比“状态”行多一个。
4. 没有行,在其他所有情况下。
The merge continues, but ClickHouse treats this situation as a logical error and records it in the server log. This error can occur if the same data were inserted more than once.
合并会继续,但是 ClickHouse 会把此情况视为逻辑错误并将其记录在服务日志中。这个错误会在相同的数据被插入超过一次时出现。
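The four rules above can be expressed as a small Python simulation. It is a sketch of the rules exactly as this text states them, not ClickHouse's actual merge code. The function names, the dict-based row format, and the `UserID` default key are assumptions made for illustration.

```python
import logging
from itertools import groupby

log = logging.getLogger("collapse_sketch")


def collapse_group(rows):
    """Collapse rows that share one primary key, following the four rules above.

    Each row is a dict whose "Sign" field is 1 ("state") or -1 ("cancel").
    """
    states = [r for r in rows if r["Sign"] == 1]
    cancels = [r for r in rows if r["Sign"] == -1]
    diff = len(states) - len(cancels)
    if diff == 0 and states:
        # Rule 1: equal counts, keep the first "cancel" and the last "state" row.
        return [cancels[0], states[-1]]
    if diff == 1:
        # Rule 2: one extra "state" row, keep the last "state" row.
        return [states[-1]]
    if diff == -1:
        # Rule 3: one extra "cancel" row, keep the first "cancel" row.
        return [cancels[0]]
    # Rule 4: any other case keeps nothing and is reported as a logical error.
    log.warning("logical error: %d state rows vs %d cancel rows",
                len(states), len(cancels))
    return []


def collapse_part(rows, key="UserID"):
    """Apply collapse_group to each run of consecutive rows with the same key."""
    result = []
    for _, group in groupby(rows, key=lambda r: r[key]):
        result.extend(collapse_group(list(group)))
    return result
```

Applied to the three example rows for one user (state, cancel, new state), `collapse_part` returns only the last "state" row, because there is exactly one more "state" row than "cancel" rows.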
Thus, collapsing should not change the results of calculating statistics.
Changes are gradually collapsed so that in the end only the last state of almost every object is left.
因此,折叠不应该改变统计数据的结果。
变化逐渐地被折叠,因此最终几乎每个对象都只剩下了最后的状态。
`Sign` is required because the merging algorithm doesn't guarantee that all of the rows with the same primary key will be in the same resulting data part or even on the same physical server. ClickHouse processes `SELECT` queries with multiple threads and cannot predict the order of rows in the result. Aggregation is required if you need to get completely "collapsed" data from a `CollapsingMergeTree` table.
`Sign` 是必须的因为合并算法不保证所有有相同主键的行都会在同一个结果数据片段中,甚至是在同一台物理服务器上。ClickHouse 用多线程来处理 `SELECT` 请求,所以它不能预测结果中行的顺序。如果要从 `CollapsingMergeTree` 表中获取完全“折叠”后的数据,则需要聚合。
To finalize collapsing write a query with `GROUP BY` clause and aggregate functions that account for the sign. For example, to calculate quantity, use `sum(Sign)` instead of `count()`. To calculate the sum of something, use `sum(Sign * x)` instead of `sum(x)`, and so on, and also add `HAVING sum(Sign) > 0`.
要完成折叠,请使用 `GROUP BY` 子句和用于处理符号的聚合函数编写请求。例如,要计算数量,使用 `sum(Sign)` 而不是 `count()`。要计算某物的总和,使用 `sum(Sign * x)` 而不是 `sum(x)`,并添加 `HAVING sum(Sign) > 0` 子句。
The aggregates `count`, `sum` and `avg` could be calculated this way. The aggregate `uniq` could be calculated if an object has at least one state not collapsed. The aggregates `min` and `max` could not be calculated because `CollapsingMergeTree` does not save values history of the collapsed states.
聚合体 `count`,`sum` 和 `avg` 可以用这种方式计算。如果一个对象至少有一个未被折叠的状态,则可以计算 `uniq` 聚合。`min` 和 `max` 聚合无法计算,因为 `CollapsingMergeTree` 不会保存折叠状态的值的历史记录。
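To see why sign-aware aggregates give the same answer whether or not the rows have been collapsed yet, the arithmetic can be replayed in Python. This is a sketch that mimics `sum(Sign)`, `sum(Sign * x)` and `HAVING sum(Sign) > 0` on plain tuples. The function name and the tuple layout `(UserID, PageViews, Duration, Sign)` are assumptions made for illustration.

```python
from collections import defaultdict


def sign_aware_aggregate(rows):
    """Aggregate (UserID, PageViews, Duration, Sign) tuples the sign-aware way."""
    totals = defaultdict(lambda: {"count": 0, "PageViews": 0, "Duration": 0})
    for user_id, page_views, duration, sign in rows:
        t = totals[user_id]
        t["count"] += sign                    # sum(Sign) instead of count()
        t["PageViews"] += sign * page_views   # sum(Sign * PageViews)
        t["Duration"] += sign * duration      # sum(Sign * Duration)
    # Equivalent of HAVING sum(Sign) > 0: drop objects whose states all cancelled out.
    return {uid: dict(t) for uid, t in totals.items() if t["count"] > 0}
```

A "state" row and its matching "cancel" row contribute `+x` and `-x`, so they add up to zero in every sum. An average can then be derived as the signed sum divided by the signed count.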
If you need to extract data without aggregation (for example, to check whether rows are present whose newest values match certain conditions), you can use the `FINAL` modifier for the `FROM` clause. This approach is significantly less efficient.
如果你需要在不进行聚合的情况下获取数据(例如,要检查是否存在最新值与特定条件匹配的行),你可以在 `FROM` 从句中使用 `FINAL` 修饰符。这种方法显然是更低效的。
## Example of use
## 示例
Example data:
示例数据:
```
┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┐
......@@ -139,7 +139,7 @@ Example data:
└─────────────────────┴───────────┴──────────┴──────┘
```
Creation of the table:
建表:
```sql
CREATE TABLE UAct
......@@ -153,7 +153,7 @@ ENGINE = CollapsingMergeTree(Sign)
ORDER BY UserID
```
Insertion of the data:
插入数据:
```sql
INSERT INTO UAct VALUES (4324182021466249494, 5, 146, 1)
......@@ -162,9 +162,9 @@ INSERT INTO UAct VALUES (4324182021466249494, 5, 146, 1)
INSERT INTO UAct VALUES (4324182021466249494, 5, 146, -1),(4324182021466249494, 6, 185, 1)
```
We use two `INSERT` queries to create two different data parts. If we insert the data with one query ClickHouse creates one data part and will not perform any merge ever.
我们使用两次 `INSERT` 请求来创建两个不同的数据片段。如果我们使用一个请求插入数据,ClickHouse 只会创建一个数据片段且不会执行任何合并操作。
Getting the data:
获取数据:
```
SELECT * FROM UAct
......@@ -180,11 +180,11 @@ SELECT * FROM UAct
└─────────────────────┴───────────┴──────────┴──────┘
```
What do we see and where is collapsing?
With two `INSERT` queries, we created 2 data parts. The `SELECT` query was performed in 2 threads, and we got a random order of rows.
Collapsing did not occur because the data parts had not been merged yet. ClickHouse merges data parts at an unknown moment that we cannot predict.
我们看到了什么,哪里有折叠?
Thus we need aggregation:
通过两个 `INSERT` 请求,我们创建了两个数据片段。`SELECT` 请求在两个线程中被执行,我们得到了随机顺序的行。没有发生折叠是因为还没有合并数据片段。ClickHouse 在一个我们无法预料的未知时刻合并数据片段。
因此我们需要聚合:
```sql
SELECT
......@@ -201,7 +201,7 @@ HAVING sum(Sign) > 0
└─────────────────────┴───────────┴──────────┘
```
If we do not need aggregation and want to force collapsing, we can use `FINAL` modifier for `FROM` clause.
如果我们不需要聚合并想要强制进行折叠,我们可以在 `FROM` 从句中使用 `FINAL` 修饰语。
```sql
SELECT * FROM UAct FINAL
......@@ -212,6 +212,6 @@ SELECT * FROM UAct FINAL
└─────────────────────┴───────────┴──────────┴──────┘
```
This way of selecting the data is very inefficient. Don't use it for big tables.
这种查询数据的方法是非常低效的。不要在大表中使用它。
[Original article](https://clickhouse.yandex/docs/en/operations/table_engines/collapsingmergetree/) <!--hide-->
[来源文章](https://clickhouse.yandex/docs/en/operations/table_engines/collapsingmergetree/) <!--hide-->
......@@ -30,7 +30,7 @@ ORDER BY (CounterID, StartDate, intHash32(UserID));
新数据插入到表中时,这些数据会存储为按主键排序的新片段(块)。插入后 10-15 分钟,同一分区的各个片段会合并为一整个片段。
!!! 注意
!!! attention "注意"
那些有相同分区表达式值的数据片段才会合并。这意味着 **你不应该用太精细的分区方案**(超过一千个分区)。否则,会因为文件系统中的文件数量和需要找开的文件描述符过多,导致 `SELECT` 查询效率不佳。
可以通过 [system.parts](../system_tables.md#system_tables-parts) 表查看表片段和分区信息。例如,假设我们有一个 `visits` 表,按月分区。对 `system.parts` 表执行 `SELECT`
......@@ -67,7 +67,7 @@ WHERE table = 'visits'
- `3` 是数据块的最大编号。
- `1` 是块级别(即在由块组成的合并树中,该块在树中的深度)。
!!! 注意
!!! attention "注意"
旧类型表的片段名称为:`20190117_20190123_2_2_0`(最小日期 - 最大日期 - 最小块编号 - 最大块编号 - 块级别)。
`active` 列为片段状态。`1` 激活状态;`0` 非激活状态。非激活片段是那些在合并到较大片段之后剩余的源数据片段。损坏的数据片段也表示为非活动状态。
......