# MergeTree {#table_engines-mergetree}

The `MergeTree` engine and other engines of this family (`*MergeTree`) are the most robust ClickHouse table engines.

Engines in the `MergeTree` family are designed for inserting a very large amount of data into a table. The data is quickly written to the table part by part, then rules are applied for merging the parts in the background. This method is much more efficient than continually rewriting the data in storage during insert.

Main features:

- Stores data sorted by primary key.

    This allows you to create a small sparse index that helps find data faster.

- Partitions can be used if the [partitioning key](custom_partitioning_key.md) is specified.

    ClickHouse supports certain operations with partitions that are more efficient than general operations on the same data with the same result. ClickHouse also automatically prunes partition data where the partitioning key is specified in the query, which also improves query performance.

- Data replication support.

    The family of `ReplicatedMergeTree` tables provides data replication. For more information, see [Data replication](replication.md).

- Data sampling support.

    If necessary, you can set the data sampling method in the table.

!!! info
    The [Merge](merge.md) engine does not belong to the `*MergeTree` family.


## Creating a Table {#table_engine-mergetree-creating-a-table}

```sql
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]
[SETTINGS name=value, ...]
```

For a description of parameters, see the [CREATE query description](../../query_language/create.md).

!!! note "Note"
    `INDEX` is an experimental feature, see [Data Skipping Indexes](#table_engine-mergetree-data_skipping-indexes).

### Query Clauses {#mergetree-query-clauses}

- `ENGINE` — Name and parameters of the engine. `ENGINE = MergeTree()`. The `MergeTree` engine does not have parameters.

- `PARTITION BY` — The [partitioning key](custom_partitioning_key.md).

    For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](../../data_types/date.md). The partition names here have the `"YYYYMM"` format.

- `ORDER BY` — The sorting key.

    A tuple of columns or arbitrary expressions. Example: `ORDER BY (CounterID, EventDate)`.

- `PRIMARY KEY` — The primary key if it [differs from the sorting key](mergetree.md).

    By default the primary key is the same as the sorting key (which is specified by the `ORDER BY` clause). Thus in most cases it is unnecessary to specify a separate `PRIMARY KEY` clause.

- `SAMPLE BY` — An expression for sampling.

    If a sampling expression is used, the primary key must contain it. Example: `SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID))`.

- `TTL` — A list of rules specifying storage duration of rows and defining logic of automatic parts movement [between disks and volumes](#table_engine-mergetree-multiple-volumes).

    The expression must evaluate to a `Date` or `DateTime` value. Example:
    `TTL date + INTERVAL 1 DAY`

    The type of the rule `DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'` specifies the action to be performed with the part when the expression is satisfied (reaches the current time): removal of expired rows, or moving the part (if the expression is satisfied for all its rows) to the specified disk (`TO DISK 'xxx'`) or volume (`TO VOLUME 'xxx'`). The default rule type is removal (`DELETE`). A list of multiple rules can be specified, but there should be no more than one `DELETE` rule.

    For more details, see [TTL for columns and tables](#table_engine-mergetree-ttl).

- `SETTINGS` — Additional parameters that control the behavior of the `MergeTree`:
    - `index_granularity` — Maximum number of data rows between the marks of an index. Default value: 8192. See [Data Storage](#mergetree-data-storage).
    - `index_granularity_bytes` — Maximum size of data granules in bytes. Default value: 10 MB. To restrict the granule size only by the number of rows, set it to 0 (not recommended). See [Data Storage](#mergetree-data-storage).
    - `enable_mixed_granularity_parts` — Enables or disables transitioning to control the granule size with the `index_granularity_bytes` setting. Before version 19.11, there was only the `index_granularity` setting for restricting granule size. The `index_granularity_bytes` setting improves ClickHouse performance when selecting data from tables with big rows (tens and hundreds of megabytes). If you have tables with big rows, you can enable this setting for the tables to improve the efficiency of `SELECT` queries.
    - `use_minimalistic_part_header_in_zookeeper` — Storage method of the data parts headers in ZooKeeper. If `use_minimalistic_part_header_in_zookeeper=1`, then ZooKeeper stores less data. For more information, see the [setting description](../server_settings/settings.md#server-settings-use_minimalistic_part_header_in_zookeeper) in "Server configuration parameters".
    - `min_merge_bytes_to_use_direct_io` — The minimum data volume for merge operation that is required for using direct I/O access to the storage disk. When merging data parts, ClickHouse calculates the total storage volume of all the data to be merged. If the volume exceeds `min_merge_bytes_to_use_direct_io` bytes, ClickHouse reads and writes the data to the storage disk using the direct I/O interface (`O_DIRECT` option). If `min_merge_bytes_to_use_direct_io = 0`, then direct I/O is disabled. Default value: `10 * 1024 * 1024 * 1024` bytes.
    <a name="mergetree_setting-merge_with_ttl_timeout"></a>
    - `merge_with_ttl_timeout` — Minimum delay in seconds before repeating a merge with TTL. Default value: 86400 (1 day).
    - `write_final_mark` — Enables or disables writing the final index mark at the end of data part (after the last byte). Default value: 1. Don't turn it off.
    - `merge_max_block_size` — Maximum number of rows in block for merge operations. Default value: 8192.
    - `storage_policy` — Storage policy. See [Using Multiple Block Devices for Data Storage](#table_engine-mergetree-multiple-volumes).

**Example of Sections Setting**

```sql
ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity=8192
```

In the example, we set partitioning by month.

We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If you define a [SAMPLE](../../query_language/select.md#select-sample-clause) clause when selecting the data, ClickHouse will return an evenly pseudorandom data sample for a subset of users.
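
For instance, a hypothetical query over such a table can read roughly a tenth of all users and scale the aggregates back (the `hits` table and its columns here are illustrative):

```sql
SELECT
    CounterID,
    count() * 10 AS estimated_visits -- scale the result by the inverse of the sample fraction
FROM hits
SAMPLE 1 / 10
WHERE EventDate = toDate('2014-03-17')
GROUP BY CounterID
```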

The `index_granularity` setting can be omitted because 8192 is the default value.
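
Putting the clauses together, a complete table definition might look like the following sketch (the `visits` table and its columns are hypothetical):

```sql
CREATE TABLE visits
(
    CounterID UInt32,
    EventDate Date,
    UserID UInt64,
    Duration UInt32
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
```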

<details markdown="1"><summary>Deprecated Method for Creating a Table</summary>

!!! attention
    Do not use this method in new projects. If possible, switch old projects to the method described above.

```sql
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE [=] MergeTree(date-column [, sampling_expression], (primary, key), index_granularity)
```

**MergeTree() Parameters**

- `date-column` — The name of a column of the [Date](../../data_types/date.md) type. ClickHouse automatically creates partitions by month based on this column. The partition names are in the `"YYYYMM"` format.
- `sampling_expression` — An expression for sampling.
- `(primary, key)` — Primary key. Type: [Tuple()](../../data_types/tuple.md).
- `index_granularity` — The granularity of an index. The number of data rows between the "marks" of an index. The value 8192 is appropriate for most tasks.

**Example**

```sql
MergeTree(EventDate, intHash32(UserID), (CounterID, EventDate, intHash32(UserID)), 8192)
```

The `MergeTree` engine is configured in the same way as in the example above for the main engine configuration method.
</details>

## Data Storage {#mergetree-data-storage}

A table consists of data parts sorted by primary key.

When data is inserted in a table, separate data parts are created and each of them is lexicographically sorted by primary key. For example, if the primary key is `(CounterID, Date)`, the data in the part is sorted by `CounterID`, and within each `CounterID`, it is ordered by `Date`.

Data belonging to different partitions are separated into different parts. In the background, ClickHouse merges data parts for more efficient storage. Parts belonging to different partitions are not merged. The merge mechanism does not guarantee that all rows with the same primary key will be in the same data part.

Each data part is logically divided into granules. A granule is the smallest indivisible data set that ClickHouse reads when selecting data. ClickHouse doesn't split rows or values, so each granule always contains an integer number of rows. The first row of a granule is marked with the value of the primary key for the row. For each data part, ClickHouse creates an index file that stores the marks. For each column, whether it's in the primary key or not, ClickHouse also stores the same marks. These marks let you find data directly in column files.
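
One way to inspect the resulting granules is to count the index marks per data part via the `system.parts` table (a sketch; `visits` is a placeholder table name):

```sql
SELECT partition, name, rows, marks
FROM system.parts
WHERE table = 'visits' AND active
```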

The granule size is restricted by the `index_granularity` and `index_granularity_bytes` settings of the table engine. The number of rows in a granule lies in the `[1, index_granularity]` range, depending on the size of the rows. The size of a granule can exceed `index_granularity_bytes` if the size of a single row is greater than the value of the setting. In this case, the size of the granule equals the size of the row.

## Primary Keys and Indexes in Queries {#primary-keys-and-indexes-in-queries}

Take the `(CounterID, Date)` primary key as an example. In this case, the sorting and index can be illustrated as follows:

```
Whole data:     [-------------------------------------------------------------------------]
CounterID:      [aaaaaaaaaaaaaaaaaabbbbcdeeeeeeeeeeeeefgggggggghhhhhhhhhiiiiiiiiikllllllll]
Date:           [1111111222222233331233211111222222333211111112122222223111112223311122333]
Marks:           |      |      |      |      |      |      |      |      |      |      |
                a,1    a,2    a,3    b,3    e,2    e,3    g,1    h,2    i,1    i,3    l,3
Marks numbers:   0      1      2      3      4      5      6      7      8      9      10
```

If the data query specifies:

- `CounterID in ('a', 'h')`, the server reads the data in the ranges of marks `[0, 3)` and `[6, 8)`.
- `CounterID IN ('a', 'h') AND Date = 3`, the server reads the data in the ranges of marks `[1, 3)` and `[7, 8)`.
- `Date = 3`, the server reads the data in the range of marks `[1, 10]`.

The examples above show that it is always more effective to use an index than a full scan.

A sparse index allows extra data to be read. When reading a single range of the primary key, up to `index_granularity * 2` extra rows in each data block can be read.

Sparse indexes allow you to work with a very large number of table rows, because in most cases, such indexes fit in the computer's RAM.

ClickHouse does not require a unique primary key. You can insert multiple rows with the same primary key.

### Selecting the Primary Key

The number of columns in the primary key is not explicitly limited. Depending on the data structure, you can include more or fewer columns in the primary key. This may:

- Improve the performance of an index.

    If the primary key is `(a, b)`, then adding another column `c` will improve the performance if the following conditions are met:

    - There are queries with a condition on column `c`.
    - Long data ranges (several times longer than the `index_granularity`) with identical values for `(a, b)` are common. In other words, when adding another column allows you to skip quite long data ranges.

- Improve data compression.

    ClickHouse sorts data by primary key, so the higher the consistency, the better the compression.

- Provide additional logic when merging data parts in the [CollapsingMergeTree](collapsingmergetree.md#table_engine-collapsingmergetree) and [SummingMergeTree](summingmergetree.md) engines.

    In this case it makes sense to specify the *sorting key* that is different from the primary key.

A long primary key will negatively affect the insert performance and memory consumption, but extra columns in the primary key do not affect ClickHouse performance during `SELECT` queries.


### Choosing a Primary Key that Differs from the Sorting Key

It is possible to specify a primary key (an expression with values that are written in the index file for each mark) that is different from the sorting key (an expression for sorting the rows in data parts). In this case the primary key expression tuple must be a prefix of the sorting key expression tuple.

This feature is helpful when using the [SummingMergeTree](summingmergetree.md) and
[AggregatingMergeTree](aggregatingmergetree.md) table engines. In a common case when using these engines, the table has two types of columns: *dimensions* and *measures*. Typical queries aggregate values of measure columns with arbitrary `GROUP BY` and filtering by dimensions. Because SummingMergeTree and AggregatingMergeTree aggregate rows with the same value of the sorting key, it is natural to add all dimensions to it. As a result, the key expression consists of a long list of columns and this list must be frequently updated with newly added dimensions.

In this case it makes sense to leave only a few columns in the primary key that will provide efficient range scans and add the remaining dimension columns to the sorting key tuple.
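
As an illustrative sketch (the table and columns are hypothetical), the primary key can be kept to a short prefix of a much longer sorting key:

```sql
CREATE TABLE agg_metrics
(
    CounterID UInt32,
    EventDate Date,
    Region String,
    Device String,
    Visits UInt64
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, Region, Device) -- all dimensions, so rows with equal keys are summed
PRIMARY KEY (CounterID, EventDate)              -- a short prefix keeps the index small
```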

[ALTER](../../query_language/alter.md) of the sorting key is a lightweight operation because when a new column is simultaneously added to the table and to the sorting key, existing data parts don't need to be changed. Since the old sorting key is a prefix of the new sorting key and there is no data in the newly added column, the data is sorted by both the old and new sorting keys at the moment of table modification.
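
For example, adding a new dimension column both to the hypothetical table above and to its sorting key in a single statement could look like this:

```sql
ALTER TABLE agg_metrics
    ADD COLUMN Browser String,
    MODIFY ORDER BY (CounterID, EventDate, Region, Device, Browser)
```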

### Use of Indexes and Partitions in Queries

For `SELECT` queries, ClickHouse analyzes whether an index can be used. An index can be used if the `WHERE/PREWHERE` clause has an expression (as one of the conjunction elements, or entirely) that represents an equality or inequality comparison operation, or if it has `IN` or `LIKE` with a fixed prefix on columns or expressions that are in the primary key or partitioning key, or on certain partially repetitive functions of these columns, or logical relationships of these expressions.

Thus, it is possible to quickly run queries on one or many ranges of the primary key. In this example, queries will be fast when run for a specific tracking tag, for a specific tag and date range, for a specific tag and date, for multiple tags with a date range, and so on.

Let's look at the engine configured as follows:

```sql
ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate) SETTINGS index_granularity=8192
```

In this case, in queries:

```sql
SELECT count() FROM table WHERE EventDate = toDate(now()) AND CounterID = 34
SELECT count() FROM table WHERE EventDate = toDate(now()) AND (CounterID = 34 OR CounterID = 42)
SELECT count() FROM table WHERE ((EventDate >= toDate('2014-01-01') AND EventDate <= toDate('2014-01-31')) OR EventDate = toDate('2014-05-01')) AND CounterID IN (101500, 731962, 160656) AND (CounterID = 101500 OR EventDate != toDate('2014-05-01'))
```

ClickHouse will use the primary key index to trim improper data and the monthly partitioning key to trim partitions that are in improper date ranges.

The queries above show that the index is used even for complex expressions. Reading from the table is organized so that using the index can't be slower than a full scan.

In the example below, the index can't be used.

```sql
SELECT count() FROM table WHERE CounterID = 34 OR URL LIKE '%upyachka%'
```

To check whether ClickHouse can use the index when running a query, use the settings [force_index_by_date](../settings/settings.md#settings-force_index_by_date) and [force_primary_key](../settings/settings.md).
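
With these settings enabled, a query that cannot use the index fails with an exception instead of silently falling back to a full scan, which is a quick way to verify index usage (a sketch):

```sql
SET force_primary_key = 1;
-- Throws an exception because the condition does not restrict the primary key:
SELECT count() FROM table WHERE URL LIKE '%upyachka%'
```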

The key for partitioning by month allows reading only those data blocks which contain dates from the proper range. In this case, the data block may contain data for many dates (up to an entire month). Within a block, data is sorted by primary key, which might not contain the date as the first column. Because of this, using a query with only a date condition that does not specify the primary key prefix will cause more data to be read than for a single date.

### Use of Index for Partially-Monotonic Primary Keys

Consider, for example, the days of the month. They form a [monotonic sequence](https://en.wikipedia.org/wiki/Monotonic_function) for one month, but are not monotonic over more extended periods. This is a partially-monotonic sequence. If a user creates a table with a partially-monotonic primary key, ClickHouse creates a sparse index as usual. When a user selects data from this kind of table, ClickHouse analyzes the query conditions. If the user wants to get data between two marks of the index and both these marks fall within one month, ClickHouse can use the index in this particular case because it can calculate the distance between the parameters of the query and the index marks.

ClickHouse cannot use an index if the values of the primary key in the query parameter range don't represent a monotonic sequence. In this case, ClickHouse uses the full scan method.

ClickHouse uses this logic not only for day-of-the-month sequences, but for any primary key that represents a partially-monotonic sequence.
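
For instance, consider a hypothetical table whose key contains the partially-monotonic expression `toDayOfMonth(EventTime)`:

```sql
CREATE TABLE events
(
    EventTime DateTime,
    UserID UInt64
) ENGINE = MergeTree()
ORDER BY toDayOfMonth(EventTime)
```

A condition such as `EventTime BETWEEN '2019-03-05 00:00:00' AND '2019-03-10 23:59:59'` stays within one month, so the key expression is monotonic over the range and the index can be used; a range spanning two months (say, March 25 to April 5) is not monotonic over the range, so ClickHouse falls back to a full scan.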

### Data Skipping Indexes (Experimental) {#table_engine-mergetree-data_skipping-indexes}

The index declaration is in the columns section of the `CREATE` query.
```sql
INDEX index_name expr TYPE type(...) GRANULARITY granularity_value
```

For tables from the `*MergeTree` family, data skipping indices can be specified.

These indices aggregate some information about the specified expression on blocks, which consist of `granularity_value` granules (the size of a granule is specified using the `index_granularity` setting in the table engine). Then these aggregates are used in `SELECT` queries for reducing the amount of data to read from the disk by skipping big blocks of data where the `WHERE` condition cannot be satisfied.

**Example**

```sql
CREATE TABLE table_name
(
    u64 UInt64,
    i32 Int32,
    s String,
    ...
    INDEX a (u64 * i32, s) TYPE minmax GRANULARITY 3,
    INDEX b (u64 * length(s)) TYPE set(1000) GRANULARITY 4
) ENGINE = MergeTree()
...
```

Indices from the example can be used by ClickHouse to reduce the amount of data to read from disk in the following queries:

```sql
SELECT count() FROM table WHERE s < 'z'
SELECT count() FROM table WHERE u64 * i32 == 10 AND u64 * length(s) >= 1234
```

#### Available Types of Indices

- `minmax`

    Stores extremes of the specified expression (if the expression is `tuple`, then it stores extremes for each element of `tuple`), uses stored info for skipping blocks of data like the primary key.

- `set(max_rows)`

    Stores unique values of the specified expression (no more than `max_rows` rows, `max_rows=0` means "no limits"). Uses the values to check if the `WHERE` expression is not satisfiable on a block of data.  

- `ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)`

    Stores a [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) that contains all ngrams from a block of data. Works only with strings. Can be used for optimization of `equals`, `like` and `in` expressions.

    - `n` — ngram size.
    - `size_of_bloom_filter_in_bytes` — Bloom filter size in bytes (you can use large values here, for example, 256 or 512, because it can be compressed well).
    - `number_of_hash_functions` — The number of hash functions used in the Bloom filter.
    - `random_seed` — The seed for Bloom filter hash functions.

- `tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)`

    The same as `ngrambf_v1`, but stores tokens instead of ngrams. Tokens are sequences separated by non-alphanumeric characters.

- `bloom_filter([false_positive])` — Stores a [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) for the specified columns.

    The optional `false_positive` parameter is the probability of receiving a false positive response from the filter. Possible values: (0, 1). Default value: 0.025.

    Supported data types: `Int*`, `UInt*`, `Float*`, `Enum`, `Date`, `DateTime`, `String`, `FixedString`, `Array`, `LowCardinality`, `Nullable`.

    The following functions can use it: [equals](../../query_language/functions/comparison_functions.md), [notEquals](../../query_language/functions/comparison_functions.md), [in](../../query_language/functions/in_functions.md), [notIn](../../query_language/functions/in_functions.md), [has](../../query_language/functions/array_functions.md).

```sql
INDEX sample_index (u64 * length(s)) TYPE minmax GRANULARITY 4
INDEX sample_index2 (u64 * length(str), i32 + f64 * 100, date, str) TYPE set(100) GRANULARITY 4
INDEX sample_index3 (lower(str), str) TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 4
```
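
Indices can also be added to or dropped from an existing table with `ALTER` (a sketch reusing the names above); note that an index added this way is built only for newly written parts:

```sql
ALTER TABLE table_name ADD INDEX b (u64 * length(s)) TYPE set(1000) GRANULARITY 4;
ALTER TABLE table_name DROP INDEX b;
```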

#### Functions Support

Conditions in the `WHERE` clause contain calls of the functions that operate with columns. If the column is a part of an index, ClickHouse tries to use this index when performing the functions. ClickHouse supports different subsets of functions for using indexes.

The `set` index can be used with all functions. Function subsets for other indexes are shown in the table below.

Function (operator) / Index | primary key | minmax | ngrambf_v1 | tokenbf_v1 | bloom_filter
----------------------------|-------------|--------|------------|------------|---------------
[equals (=, ==)](../../query_language/functions/comparison_functions.md#function-equals) | ✔ | ✔ | ✔ | ✔ | ✔ 
[notEquals(!=, <>)](../../query_language/functions/comparison_functions.md#function-notequals) | ✔ | ✔ | ✔ | ✔ | ✔
[like](../../query_language/functions/string_search_functions.md#function-like) | ✔ | ✔ | ✔ | ✗ |  ✗
333
[notLike](../../query_language/functions/string_search_functions.md#function-notlike) | ✔ | ✔ | ✔ | ✗ | ✗
334
[startsWith](../../query_language/functions/string_functions.md#function-startswith) | ✔ | ✔ | ✔ | ✔ | ✗
335
[endsWith](../../query_language/functions/string_functions.md#function-endswith) | ✗ | ✗ | ✔ | ✔ | ✗
336
[multiSearchAny](../../query_language/functions/string_search_functions.md#function-multisearchany) | ✗ | ✗ | ✔ | ✗ | ✗
337 338 339 340 341 342 343 344 345 346
[in](../../query_language/functions/in_functions.md#in-functions) | ✔ | ✔ | ✔ | ✔ | ✔ 
[notIn](../../query_language/functions/in_functions.md#in-functions) | ✔ | ✔ | ✔ | ✔ | ✔ 
[less (<)](../../query_language/functions/comparison_functions.md#function-less) | ✔ | ✔ | ✗ | ✗ | ✗
[greater (>)](../../query_language/functions/comparison_functions.md#function-greater) | ✔ | ✔ | ✗ | ✗ | ✗
[lessOrEquals (<=)](../../query_language/functions/comparison_functions.md#function-lessorequals) | ✔ | ✔ | ✗ | ✗ | ✗
[greaterOrEquals (>=)](../../query_language/functions/comparison_functions.md#function-greaterorequals) | ✔ | ✔ | ✗ | ✗ | ✗
[empty](../../query_language/functions/array_functions.md#function-empty)  | ✔ | ✔ | ✗ | ✗ | ✗
[notEmpty](../../query_language/functions/array_functions.md#function-notempty)  | ✔ | ✔ | ✗ | ✗ | ✗
hasToken  | ✗ | ✗ | ✗ | ✔ | ✗ 

Functions with a constant argument that is less than ngram size can't be used by `ngrambf_v1` for query optimization.

Bloom filters can have false positive matches, so the `ngrambf_v1`, `tokenbf_v1`, and `bloom_filter` indexes can't be used for optimizing queries where the result of a function is expected to be false, for example:

- Can be optimized:
    - `s LIKE '%test%'`
    - `NOT s NOT LIKE '%test%'`
    - `s = 1`
    - `NOT s != 1`
    - `startsWith(s, 'test')`
- Can't be optimized:
    - `NOT s LIKE '%test%'`
    - `s NOT LIKE '%test%'`
    - `NOT s = 1`
    - `s != 1`
    - `NOT startsWith(s, 'test')`

## Concurrent Data Access

For concurrent table access, we use multi-versioning. In other words, when a table is simultaneously read and updated, data is read from a set of parts that is current at the time of the query. There are no lengthy locks. Inserts do not get in the way of read operations.

Reading from a table is automatically parallelized.


## TTL for Columns and Tables {#table_engine-mergetree-ttl}

Determines the lifetime of values.

The `TTL` clause can be set for the whole table and for each individual column. Table-level TTL can also specify the logic of automatic data movement between disks and volumes.

Expressions must evaluate to [Date](../../data_types/date.md) or [DateTime](../../data_types/datetime.md) data type.

Example:

```sql
TTL time_column
TTL time_column + interval
```

To define `interval`, use [time interval](../../query_language/operators.md#operators-datetime) operators.

```sql
TTL date_time + INTERVAL 1 MONTH
TTL date_time + INTERVAL 15 HOUR
```

### Column TTL {#mergetree-column-ttl}

When the values in the column expire, ClickHouse replaces them with the default values for the column data type. If all the column values in the data part expire, ClickHouse deletes this column from the data part in the filesystem.

The `TTL` clause can't be used for key columns.

Examples:

Creating a table with TTL

```sql
CREATE TABLE example_table 
(
    d DateTime,
    a Int TTL d + INTERVAL 1 MONTH,
    b Int TTL d + INTERVAL 1 MONTH,
    c String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d;
```

Adding TTL to a column of an existing table

```sql
ALTER TABLE example_table
    MODIFY COLUMN
    c String TTL d + INTERVAL 1 DAY;
```

Altering TTL of the column

```sql
ALTER TABLE example_table
    MODIFY COLUMN
    c String TTL d + INTERVAL 1 MONTH;
```

### Table TTL {#mergetree-table-ttl}

A table can have an expression for the removal of expired rows and multiple expressions for the automatic movement of parts between [disks or volumes](#table_engine-mergetree-multiple-volumes). When rows in the table expire, ClickHouse deletes all the corresponding rows. For the parts-moving feature, all rows of a part must satisfy the movement expression criteria.

```sql
TTL expr [DELETE|TO DISK 'aaa'|TO VOLUME 'bbb'], ...
```

A type of TTL rule may follow each TTL expression. It determines the action to be performed once the expression is satisfied (reaches the current time):

- `DELETE` - delete expired rows (the default action);
- `TO DISK 'aaa'` - move the part to the disk `aaa`;
- `TO VOLUME 'bbb'` - move the part to the volume `bbb`.

Examples:

Creating a table with TTL

```sql
CREATE TABLE example_table 
(
    d DateTime,
    a Int
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d
TTL d + INTERVAL 1 MONTH [DELETE],
    d + INTERVAL 1 WEEK TO VOLUME 'aaa',
    d + INTERVAL 2 WEEK TO DISK 'bbb';
```

Altering TTL of the table

```sql
ALTER TABLE example_table
    MODIFY TTL d + INTERVAL 1 DAY;
```

**Removing Data**

Data with an expired TTL is removed when ClickHouse merges data parts.

When ClickHouse sees that data is expired, it performs an off-schedule merge. To control the frequency of such merges, you can set [merge_with_ttl_timeout](#mergetree_setting-merge_with_ttl_timeout). If the value is too low, ClickHouse will perform many off-schedule merges that may consume a lot of resources.

If you run a `SELECT` query between merges, you may get expired data. To avoid this, use the [OPTIMIZE](../../query_language/misc.md#misc_operations-optimize) query before `SELECT`.
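
For example, one possible sequence, reusing the table from the examples above:

```sql
-- Force an off-schedule merge so that expired rows are removed first:
OPTIMIZE TABLE example_table FINAL;

SELECT * FROM example_table;
```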

## Using Multiple Block Devices for Data Storage {#table_engine-mergetree-multiple-volumes}

### Introduction

`MergeTree` family table engines can store data on multiple block devices. For example, it can be useful when the data of a certain table is implicitly split into "hot" and "cold". The most recent data is regularly requested but requires only a small amount of space. On the contrary, the fat-tailed historical data is requested rarely. If several disks are available, the "hot" data may be located on fast disks (for example, NVMe SSDs or in memory), while the "cold" data may reside on relatively slow ones (for example, HDDs).

A data part is the minimum movable unit for `MergeTree`-engine tables. The data belonging to one part is stored on one disk. Data parts can be moved between disks in the background (according to user settings) as well as by means of [ALTER](../../query_language/alter.md#alter_move-partition) queries.

### Terms

- Disk — Block device mounted to the filesystem.
- Default disk — Disk that stores the path specified in the [path](../server_settings/settings.md#server_settings-path) server setting.
- Volume — Ordered set of equal disks (similar to [JBOD](https://en.wikipedia.org/wiki/Non-RAID_drive_architectures)).
- Storage policy — Set of volumes and the rules for moving data between them.

The names given to the described entities can be found in the system tables, [system.storage_policies](../system_tables.md#system_tables-storage_policies) and [system.disks](../system_tables.md#system_tables-disks). To apply one of the configured storage policies for a table, use the `storage_policy` setting of `MergeTree`-engine family tables.

### Configuration {#table_engine-mergetree-multiple-volumes_configure}

Disks, volumes and storage policies should be declared inside the `<storage_configuration>` tag either in the main file `config.xml` or in a distinct file in the `config.d` directory. 

Configuration structure:

```xml
<storage_configuration>
    <disks>
        <disk_name_1> <!-- disk name -->
            <path>/mnt/fast_ssd/clickhouse</path>
        </disk_name_1>
        <disk_name_2>
            <path>/mnt/hdd1/clickhouse</path>
            <keep_free_space_bytes>10485760</keep_free_space_bytes>
        </disk_name_2>
        <disk_name_3>
            <path>/mnt/hdd2/clickhouse</path>
            <keep_free_space_bytes>10485760</keep_free_space_bytes>
        </disk_name_3>

        ...
    </disks>
    
523
    ...
524
</storage_configuration>
525 526
```

527
Tags:
528

529 530 531
- `<disk_name_N>` — Disk name. Names must be different for all disks.
- `path` — path under which a server will store data (`data` and `shadow` folders), should be terminated with '/'.
- `keep_free_space_bytes` — the amount of free disk space to be reserved.
532 533 534

The order of the disk definition is not important.

535 536 537
Storage policies configuration markup:

```xml
538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561
<storage_configuration>
    ...
    <policies>
        <policy_name_1>
            <volumes>
                <volume_name_1>
                    <disk>disk_name_from_disks_configuration</disk>
                    <max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
                </volume_name_1>
                <volume_name_2>
                    <!-- configuration -->
                </volume_name_2>
                <!-- more volumes -->
            </volumes>
            <move_factor>0.2</move_factor>
        </policy_name_1>
        <policy_name_2>
            <!-- configuration -->
        </policy_name_2>

        <!-- more policies -->
    </policies>
    ...
</storage_configuration>
```

Tags:

- `policy_name_N` — Policy name. Policy names must be unique.
- `volume_name_N` — Volume name. Volume names must be unique.
- `disk` — A disk within a volume.
- `max_data_part_size_bytes` — The maximum size of a part that can be stored on any of the volume's disks.
- `move_factor` — When the amount of available space gets lower than this factor, data automatically starts to move to the next volume, if any (by default, 0.1).

Configuration examples:

```xml
<storage_configuration>
    ...
    <policies>
        <hdd_in_order> <!-- policy name -->
            <volumes>
                <single> <!-- volume name -->
                    <disk>disk1</disk>
                    <disk>disk2</disk>
                </single>
            </volumes>
        </hdd_in_order>

        <moving_from_ssd_to_hdd>
            <volumes>
                <hot>
                    <disk>fast_ssd</disk>
                    <max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
                </hot>
                <cold>
                    <disk>disk1</disk>
                </cold>            
            </volumes>
            <move_factor>0.2</move_factor>
        </moving_from_ssd_to_hdd>
    </policies>
    ...
</storage_configuration>
```

In the given example, the `hdd_in_order` policy implements the [round-robin](https://en.wikipedia.org/wiki/Round-robin_scheduling) approach. This policy defines only one volume (`single`), and the data parts are stored on all its disks in circular order. Such a policy can be quite useful if several similar disks are mounted to the system, but RAID is not configured. Keep in mind that each individual disk drive is not reliable, and you might want to compensate for this with a replication factor of 3 or more.

If there are different kinds of disks available in the system, the `moving_from_ssd_to_hdd` policy can be used instead. The volume `hot` consists of an SSD disk (`fast_ssd`), and the maximum size of a part that can be stored on this volume is 1GB. All parts larger than 1GB are stored directly on the `cold` volume, which contains the HDD disk `disk1`.
Also, once the disk `fast_ssd` gets filled by more than 80%, data will be transferred to `disk1` by a background process.

The order of volume enumeration within a storage policy is important. Once a volume is overfilled, data are moved to the next one. The order of disk enumeration is important as well because data are stored on them in turns. 

When creating a table, one can apply one of the configured storage policies to it: 

```sql
CREATE TABLE table_with_non_default_policy (
    EventDate Date,
    OrderID UInt64,
    BannerID UInt64,
    SearchPhrase String
) ENGINE = MergeTree
ORDER BY (OrderID, BannerID)
PARTITION BY toYYYYMM(EventDate)
SETTINGS storage_policy = 'moving_from_ssd_to_hdd'
```

The `default` storage policy implies using only one volume, which consists of only one disk given in `<path>`. Once a table is created, its storage policy cannot be changed.  

### Details

In the case of `MergeTree` tables, data gets to disk in different ways:

- As a result of an insert (`INSERT` query).
- During background merges and [mutations](../../query_language/alter.md#alter-mutations).
- When downloading from another replica.
- As a result of partition freezing [ALTER TABLE ... FREEZE PARTITION](../../query_language/alter.md#alter_freeze-partition).

In all these cases except for mutations and partition freezing, a part is stored on a volume and a disk according to the given storage policy:

1. The first volume (in the order of definition) that has enough disk space for storing a part (`unreserved_space > current_part_size`) and allows for storing parts of a given size (`max_data_part_size_bytes > current_part_size`) is chosen.
2. Within this volume, the disk that follows the one used for storing the previous chunk of data and that has more free space than the part size (`unreserved_space - keep_free_space_bytes > current_part_size`) is chosen.

Under the hood, mutations and partition freezing make use of [hard links](https://en.wikipedia.org/wiki/Hard_link). Hard links between different disks are not supported, therefore in such cases the resulting parts are stored on the same disks as the initial ones. 

In the background, parts are moved between volumes on the basis of the amount of free space (`move_factor` parameter) according to the order the volumes are declared in the configuration file.
Data is never transferred from the last volume or into the first one. One can use the system tables [system.part_log](../system_tables.md#system_tables-part-log) (field `type = MOVE_PART`) and [system.parts](../system_tables.md#system_tables-parts) (fields `path` and `disk`) to monitor background moves. Also, detailed information can be found in the server logs.
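
For example, a monitoring sketch over `system.parts` for the table defined above:

```sql
SELECT name, disk, path
FROM system.parts
WHERE table = 'table_with_non_default_policy' AND active
```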

A user can force moving a part or a partition from one volume to another using the query [ALTER TABLE ... MOVE PART|PARTITION ... TO VOLUME|DISK ...](../../query_language/alter.md#alter_move-partition); all the restrictions for background operations are taken into account. The query initiates a move on its own and does not wait for background operations to be completed. The user will get an error message if not enough free space is available or if any of the required conditions are not met.
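
For example, a partition of the table defined above can be moved to its `cold` volume manually (the partition value is illustrative):

```sql
ALTER TABLE table_with_non_default_policy MOVE PARTITION 201909 TO VOLUME 'cold'
```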

Moving data does not interfere with data replication. Therefore, different storage policies can be specified for the same table on different replicas.

After the completion of background merges and mutations, old parts are removed only after a certain amount of time (`old_parts_lifetime`).
During this time, they are not moved to other volumes or disks. Therefore, until the parts are finally removed, they are still taken into account for evaluation of the occupied disk space.

[Original article](https://clickhouse.tech/docs/en/operations/table_engines/mergetree/) <!--hide-->