Commit d6c43775 authored by Ivan Blinkov

revert some more harmful patches

Parent 702885eb
@@ -2,11 +2,11 @@
# External Dictionaries
You can add your own dictionaries from various data sources. The data source for a dictionary can be a local text or executable file, an HTTP(s) resource, or another DBMS. For more information, see "[Sources of external dictionaries](external_dicts_dict_sources.md#dicts-external_dicts_dict_sources)".
ClickHouse:
- Fully or partially stores dictionaries in RAM.
- Periodically updates dictionaries and dynamically loads missing values. In other words, dictionaries can be loaded dynamically.
The configuration of external dictionaries is located in one or more files. The path to the configuration is specified in the [dictionaries_config](../../operations/server_settings/settings.md#server_settings-dictionaries_config) parameter.
@@ -37,14 +37,7 @@ The dictionary config file has the following format:
You can [configure](external_dicts_dict.md#dicts-external_dicts_dict) any number of dictionaries in the same file. The file format is preserved even if there is only one dictionary (i.e. `<yandex><dictionary> <!-- configuration --> </dictionary></yandex>`).
See also:
- [Configuring an external dictionary](external_dicts_dict.md#dicts-external_dicts_dict)
- [Storing dictionaries in memory](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout)
- [Updating dictionaries](external_dicts_dict_lifetime.md#dicts-external_dicts_dict_lifetime)
- [Sources of external dictionaries](external_dicts_dict_sources.md#dicts-external_dicts_dict_sources)
- [Dictionary key and fields](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure)
- [Functions for working with external dictionaries](../functions/ext_dict_functions.md#ext_dict_functions)
!!! attention
    You can convert values for a small dictionary by describing it in a `SELECT` query (see the [transform](../functions/other_functions.md#other_functions-transform) function). This functionality is not related to external dictionaries. See also "[Functions for working with external dictionaries](../functions/ext_dict_functions.md#ext_dict_functions)".
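A minimal sketch of such an in-query mapping with `transform` (no external dictionary involved):

```sql
-- Maps 0 -> 'zero' and 1 -> 'one'; anything else falls back to 'many'.
SELECT transform(number, [0, 1], ['zero', 'one'], 'many') AS name
FROM system.numbers
LIMIT 3
```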
@@ -4,9 +4,9 @@
There are a [variety of ways](#dicts-external_dicts_dict_layout-manner) to store dictionaries in memory.
We recommend [flat](#dicts-external_dicts_dict_layout-flat), [hashed](#dicts-external_dicts_dict_layout-hashed), and [complex_key_hashed](#dicts-external_dicts_dict_layout-complex_key_hashed), which provide optimal processing speed.
Caching is not recommended because of potentially poor performance and difficulties in selecting optimal parameters. Read more in the section "[cache](#dicts-external_dicts_dict_layout-cache)".
There are several ways to improve dictionary performance:
@@ -54,7 +54,7 @@ The configuration looks like this:
The dictionary is completely stored in memory in the form of flat arrays. The amount of memory used is proportional to the size of the largest key (in space used).
The dictionary key has the `UInt64` type and the value is limited to 500,000. If a larger key is discovered when creating the dictionary, ClickHouse throws an exception and does not create the dictionary.
All types of sources are supported. When updating, data (from a file or from a table) is read in its entirety.
@@ -72,7 +72,7 @@ Configuration example:
### hashed
The dictionary is completely stored in memory in the form of a hash table. The dictionary can contain any number of elements with any identifiers. In practice, the number of keys can reach tens of millions of items.
All types of sources are supported. When updating, data (from a file or from a table) is read in its entirety.
@@ -88,7 +88,7 @@ Configuration example:
### complex_key_hashed
This type of storage is for use with composite [keys](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure). Similar to `hashed`.
Configuration example:
@@ -140,13 +140,15 @@ Example:
To work with these dictionaries, you need to pass an additional date argument to the `dictGetT` function:
```
dictGetT('dict_name', 'attr_name', id, date)
```
This function returns the value for the specified `id`s and the date range that includes the passed date.
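For example, a lookup against a hypothetical `range_hashed` dictionary named `discounts` with a `Float64` attribute `rate` might look like this (all of these names are assumptions for illustration):

```sql
SELECT dictGetFloat64('discounts', 'rate', toUInt64(42), toDate('2018-06-15')) AS rate
```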
Details of the algorithm:
- If the `id` is not found or a range is not found for the `id`, it returns the default value for the dictionary.
- If there are overlapping ranges, any of them may be used.
- If the range delimiter is `NULL` or an invalid date (such as 1900-01-01 or 2039-01-01), the range is left open. The range can be open on both sides.
@@ -191,11 +193,11 @@ The dictionary is stored in a cache that has a fixed number of cells. These cell
When searching for a dictionary, the cache is searched first. For each block of data, all keys that are not found in the cache or are outdated are requested from the source using `SELECT attrs... FROM db.table WHERE id IN (k1, k2, ...)`. The received data is then written to the cache.
For cache dictionaries, the expiration ([lifetime](external_dicts_dict_lifetime.md#dicts-external_dicts_dict_lifetime)) of data in the cache can be set. If more time than `lifetime` has passed since loading the data in a cell, the cell's value is not used, and it is re-requested the next time it needs to be used.
This is the least effective of all the ways to store dictionaries. The speed of the cache depends strongly on correct settings and the usage scenario. A cache type dictionary performs well only when the hit rates are high enough (recommended 99% and higher). You can view the average hit rate in the `system.dictionaries` table.
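A quick way to check this, assuming a cache dictionary named `my_cache_dict` (the name is hypothetical; `hit_rate` is a column of `system.dictionaries`):

```sql
SELECT name, hit_rate
FROM system.dictionaries
WHERE name = 'my_cache_dict'
```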
To improve cache performance, use a subquery with `LIMIT`, and call the function with the dictionary externally.
Supported [sources](external_dicts_dict_sources.md#dicts-external_dicts_dict_sources): MySQL, ClickHouse, executable, HTTP.
@@ -217,14 +219,14 @@ Set a large enough cache size. You need to experiment to select the number of ce
3. Assess memory consumption using the `system.dictionaries` table.
4. Increase or decrease the number of cells until the required memory consumption is reached.
!!! warning
    Do not use ClickHouse as a source, because it is slow to process queries with random reads.
<a name="dicts-external_dicts_dict_layout-complex_key_cache"></a> <a name="dicts-external_dicts_dict_layout-complex_key_cache"></a>
### complex_key_cache
This type of storage is for use with composite [keys](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure). Similar to `cache`.
<a name="dicts-external_dicts_dict_layout-ip_trie"></a> <a name="dicts-external_dicts_dict_layout-ip_trie"></a>
@@ -273,7 +275,7 @@ Example:
...
```
The key must have only one `String` type attribute that contains an allowed IP prefix. Other types are not supported yet.
For queries, you must use the same functions (`dictGetT` with a tuple) as for dictionaries with composite keys:
@@ -281,7 +283,7 @@ For queries, you must use the same functions (`dictGetT` with a tuple) as for di
dictGetT('dict_name', 'attr_name', tuple(ip))
```
The function accepts either `UInt32` for IPv4, or `FixedString(16)` for IPv6:
```
dictGetString('prefix', 'asn', tuple(IPv6StringToNum('2001:db8::1')))
@@ -289,5 +291,4 @@ dictGetString('prefix', 'asn', tuple(IPv6StringToNum('2001:db8::1')))
Other types are not supported yet. The function returns the attribute for the prefix that corresponds to this IP address. If there are overlapping prefixes, the most specific one is returned.
Data is stored in a `trie`. It must completely fit into RAM.
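An analogous IPv4 lookup, reusing the `prefix` dictionary and `asn` attribute from the example above (`IPv4StringToNum` produces the `UInt32` form the key expects):

```sql
SELECT dictGetString('prefix', 'asn', tuple(IPv4StringToNum('192.168.1.1')))
```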
@@ -39,7 +39,7 @@ ClickHouse supports the following types of keys:
A structure can contain either `<id>` or `<key>`.
!!! warning
    The key doesn't need to be defined separately in attributes.
### Numeric Key
@@ -112,7 +112,6 @@ Configuration fields:
- `type` – The column type. Sets the method for interpreting data in the source. For example, for MySQL, the field might be `TEXT`, `VARCHAR`, or `BLOB` in the source table, but it can be uploaded as `String`.
- `null_value` – The default value for a non-existing element. In the example, it is an empty string.
- `expression` – The attribute can be an expression. The tag is not required.
- `hierarchical` – Hierarchical support. Mirrored to the parent identifier. By default, `false`. A hierarchy lookup sketch follows this list.
- `injective` – Whether the `id -> attribute` image is injective. If `true`, then you can optimize the `GROUP BY` clause. By default, `false`.
- `is_object_id` – Whether the query is executed for a MongoDB document by `ObjectID`.
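As a sketch of how a `hierarchical` attribute is used, `dictGetHierarchy` returns the chain of parent identifiers (the dictionary name `regions` and the id are hypothetical):

```sql
SELECT dictGetHierarchy('regions', toUInt64(213)) AS chain
```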
<a name="higher_order_functions"></a>
# Higher-order functions
## `->` operator, lambda(params, expr) function
@@ -91,9 +89,9 @@ SELECT arrayCumSum([1, 1, 1, 1]) AS res
### arraySort(\[func,\] arr1, ...)
Returns the `arr1` array sorted in ascending order. If `func` is set, the sort order is determined by the result of the `func` function on the elements of the array or arrays.
To improve sorting efficiency, the [Schwartzian transform](https://en.wikipedia.org/wiki/Schwartzian_transform) is used.
Example:
@@ -109,5 +107,9 @@ SELECT arraySort((x, y) -> y, ['hello', 'world'], [2, 1]);
### arrayReverseSort(\[func,\] arr1, ...)
Returns the `arr1` array sorted in descending order. If `func` is set, the sort order is determined by the result of the `func` function on the elements of the array or arrays.
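A minimal sketch, mirroring the `arraySort` example above:

```sql
SELECT arrayReverseSort([2, 1, 3]) AS res;  -- res = [3, 2, 1]

-- Sorts by the second array in descending order:
SELECT arrayReverseSort((x, y) -> y, ['hello', 'world'], [2, 1]) AS res;  -- res = ['hello', 'world']
```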
<a name="ym_dict_functions"></a>
# Functions for working with Yandex.Metrica dictionaries
In order for the functions below to work, the server config must specify the paths and addresses for getting all the Yandex.Metrica dictionaries. The dictionaries are loaded at the first call of any of these functions. If the reference lists can't be loaded, an exception is thrown.
@@ -23,9 +21,9 @@ All functions for working with regions have an optional argument at the end –
Example:
```text
regionToCountry(RegionID) – Uses the default dictionary: /opt/geo/regions_hierarchy.txt
regionToCountry(RegionID, '') – Uses the default dictionary: /opt/geo/regions_hierarchy.txt
regionToCountry(RegionID, 'ua') – Uses the dictionary for the 'ua' key: /opt/geo/regions_hierarchy_ua.txt
```
### regionToCity(id[, geobase])
@@ -47,13 +45,13 @@ LIMIT 15
│                                     │
│ Moscow and Moscow region            │
│ St. Petersburg and Leningrad region │
│ Belgorod region                     │
│ Ivanovo region                      │
│ Kaluga region                       │
│ Kostroma region                     │
│ Kursk region                        │
│ Lipetsk region                      │
│ Oryol region                        │
│ Ryazan region                       │
│ Smolensk region                     │
│ Tambov region                       │
@@ -75,19 +73,19 @@ LIMIT 15
```text
┌─regionToName(regionToDistrict(toUInt32(number), \'ua\'))─┐
│                                                          │
│ Central Federal District                                 │
│ Northwest Federal District                               │
│ Southern Federal District                                │
│ North Caucasian Federal District                         │
│ Privolzhsky Federal District                             │
│ Ural Federal District                                    │
│ Siberian Federal District                                │
│ Far East Federal District                                │
│ Scotland                                                 │
│ Faroe Islands                                            │
│ Flemish region                                           │
│ Brussels capital region                                  │
│ Wallonia                                                 │
│ Federation of Bosnia and Herzegovina                     │
└──────────────────────────────────────────────────────────┘
```
@@ -6,10 +6,9 @@ This query is exactly the same as `CREATE`, but
- Instead of the word `CREATE` it uses the word `ATTACH`.
- The query doesn't create data on the disk, but assumes that data is already in the appropriate places, and just adds information about the table to the server.
After executing an `ATTACH` query, the server will know about the existence of the table.

If the table was previously detached (`DETACH`), meaning that its structure is known, you can use shorthand without defining the structure.
```sql
ATTACH TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
@@ -176,8 +175,8 @@ Supported only by `*MergeTree` engines, in which this query initializes a non-sc
If you specify a `PARTITION`, only the specified partition will be optimized.
If you specify `FINAL`, optimization will be performed even when all the data is already in one part.
!!! warning
    The `OPTIMIZE` query can't fix the cause of the "Too many parts" error.
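A minimal sketch (the table name and partition are hypothetical):

```sql
OPTIMIZE TABLE hits PARTITION 201806 FINAL
```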
## KILL QUERY
@@ -194,7 +193,7 @@ The queries to terminate are selected from the system.processes table using the
Examples:
```sql
-- Forcibly terminates all queries with the specified query_id:
KILL QUERY WHERE query_id='2-857d-4a57-9ee0-327da5d60a90'
-- Synchronously terminates all queries run by 'username':
# SELECT Queries Syntax
`SELECT` performs data retrieval.
```sql
SELECT [DISTINCT] expr_list
@@ -26,9 +26,7 @@ The clauses below are described in almost the same order as in the query executi
If the query omits the `DISTINCT`, `GROUP BY` and `ORDER BY` clauses and the `IN` and `JOIN` subqueries, the query will be completely stream processed, using O(1) amount of RAM.
Otherwise, the query might consume a lot of RAM if the appropriate restrictions are not specified: `max_memory_usage`, `max_rows_to_group_by`, `max_rows_to_sort`, `max_rows_in_distinct`, `max_bytes_in_distinct`, `max_rows_in_set`, `max_bytes_in_set`, `max_rows_in_join`, `max_bytes_in_join`, `max_bytes_before_external_sort`, `max_bytes_before_external_group_by`. For more information, see the section "Settings". It is possible to use external sorting (saving temporary tables to a disk) and external aggregation. The system does not have "merge join".
<a name="query_language-section-from"></a> ### FROM Clause
### FROM clause
If the FROM clause is omitted, data will be read from the `system.one` table.
The `system.one` table contains exactly one row (this table fulfills the same purpose as the DUAL table found in other DBMSs).
@@ -46,8 +44,6 @@ If a query does not list any columns (for example, SELECT count() FROM t), some
The FINAL modifier can be used only for a SELECT from a CollapsingMergeTree table. When you specify FINAL, data is selected fully "collapsed". Keep in mind that using FINAL leads to a selection that includes columns related to the primary key, in addition to the columns specified in the SELECT. Additionally, the query will be executed in a single stream, and data will be merged during query execution. This means that when using FINAL, the query is processed more slowly. In most cases, you should avoid using FINAL. For more information, see the section "CollapsingMergeTree engine".
<a name="select-section-sample"></a>
### SAMPLE Clause
The SAMPLE clause allows for approximated query processing. Approximated query processing is only supported by MergeTree\* type tables, and only if the sampling expression was specified during table creation (see the section "MergeTree engine").
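A sketch over a hypothetical `hits` table created with a sampling expression:

```sql
-- Process roughly 10% of the data:
SELECT count()
FROM hits
SAMPLE 0.1
WHERE CounterID = 34
```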
@@ -336,9 +332,7 @@ The query can only specify a single ARRAY JOIN clause.
The corresponding conversion can be performed before the WHERE/PREWHERE clause (if its result is needed in this clause), or after completing WHERE/PREWHERE (to reduce the volume of calculations).
<a name="query_language-join"></a> ### JOIN Clause
### JOIN clause
The normal JOIN, which is not related to ARRAY JOIN described above.
@@ -432,58 +426,29 @@ Among the various types of JOINs, the most efficient is ANY LEFT JOIN, then ANY
If you need a JOIN for joining with dimension tables (these are relatively small tables that contain dimension properties, such as names for advertising campaigns), a JOIN might not be very convenient due to the bulky syntax and the fact that the right table is re-accessed for every query. For such cases, there is an "external dictionaries" feature that you should use instead of JOIN. For more information, see the section "External dictionaries".
#### NULL processing
The JOIN behavior is affected by the [join_use_nulls](../operations/settings/settings.md#settings-join_use_nulls) setting. With `join_use_nulls = 1`, `JOIN` works like in standard SQL.
If the JOIN keys are [Nullable](../data_types/nullable.md#data_types-nullable) fields, the rows where at least one of the keys has the value [NULL](syntax.md#null-literal) are not joined.
<a name="query_language-queries-where"></a>
### WHERE Clause
Allows you to set an expression that ClickHouse uses to filter data before all other actions in the query, other than the expressions contained in the [PREWHERE](#query_language-queries-prewhere) clause. This is usually an expression with logical operators.
The result of the expression must be of type `UInt8`.
ClickHouse uses indexes in the expression if this is allowed by the [table engine](../operations/table_engines/index.md#table_engines).
If [NULL](syntax.md#null-literal) must be checked in the clause, then use the [IS NULL](operators.md#operator-is-null) and [IS NOT NULL](operators.md#operator-is-not-null) operators and the related `isNull` and `isNotNull` functions. Otherwise, the expression will always be considered as not executed.
Example of checking for `NULL`:
```bash
:) SELECT * FROM t_null WHERE y IS NULL
SELECT *
FROM t_null
WHERE isNull(y)
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
└───┴──────┘
1 rows in set. Elapsed: 0.002 sec.
```

<a name="query_language-queries-prewhere"></a>
### PREWHERE Clause
It has the same purpose as the [WHERE](#query_language-queries-where) clause. The difference is in which data is read from the table.
When using `PREWHERE`, first only the columns necessary for executing `PREWHERE` are read. Then the other columns are read that are needed for running the query, but only those blocks where the `PREWHERE` expression is true.
`PREWHERE` makes sense if there are filtration conditions that are not suitable for indexes that are used by a minority of the columns in the query, but that provide strong data filtration. This reduces the volume of data to read.
For example, it is useful to write `PREWHERE` for queries that extract a large number of columns, but that only have filtration for a few columns.
`PREWHERE` is only supported by tables from the `*MergeTree` family.
A query may simultaneously specify `PREWHERE` and `WHERE`. In this case, `PREWHERE` goes before `WHERE`.
Keep in mind that it does not make much sense for `PREWHERE` to only specify those columns that have an index, because when using an index, only the data blocks that match the index are read.
If the setting `optimize_move_to_prewhere` is set to `1` and `PREWHERE` is omitted, the system will automatically move parts of expressions from `WHERE` to `PREWHERE` according to heuristic analysis.
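A sketch of the intended usage, with a hypothetical `hits` table where only `CounterID` gives strong filtration:

```sql
-- Only the CounterID column is read for the PREWHERE pass;
-- the remaining columns are read just for blocks that survive it.
SELECT UserID, URL, Title, Referer
FROM hits
PREWHERE CounterID = 34
WHERE EventDate >= '2018-06-01'
```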
### GROUP BY Clause
@@ -525,39 +490,7 @@ GROUP BY is not supported for array columns.
A constant can't be specified as an argument for an aggregate function. Example: `sum(1)`. Instead of this, you can get rid of the constant. Example: `count()`.
#### NULL processing
For grouping, ClickHouse interprets [NULL](syntax.md#null-literal) as a value, and `NULL=NULL`.
Here's an example to show what this means.
Assume you have this table:
```
┌─x─┬────y─┐
│ 1 │ 2 │
│ 2 │ ᴺᵁᴸᴸ │
│ 3 │ 2 │
│ 3 │ 3 │
│ 3 │ ᴺᵁᴸᴸ │
└───┴──────┘
```
The query `SELECT sum(x), y FROM t_null_big GROUP BY y` results in:
```
┌─sum(x)─┬────y─┐
│ 4 │ 2 │
│ 3 │ 3 │
│ 5 │ ᴺᵁᴸᴸ │
└────────┴──────┘
```
You can see that `GROUP BY` for `y = NULL` summed up `x`, as if `NULL` were this value.
If you pass several keys to `GROUP BY`, the result will give you all the combinations of the selection, as if `NULL` were a specific value.
#### WITH TOTALS Modifier
If the WITH TOTALS modifier is specified, another row will be calculated. This row will have key columns containing default values (zeros or empty strings), and columns of aggregate functions with the values calculated across all the rows (the "total" values).
@@ -589,20 +522,20 @@ The `max_bytes_before_external_group_by` setting determines the threshold RAM co
When using `max_bytes_before_external_group_by`, we recommend that you set `max_memory_usage` about twice as high. This is necessary because there are two stages to aggregation: reading the data and forming intermediate data (1) and merging the intermediate data (2). Dumping data to the file system can only occur during stage 1. If the temporary data wasn't dumped, then stage 2 might require up to the same amount of memory as in stage 1.
For example, if `max_memory_usage` was set to 10000000000 and you want to use external aggregation, it makes sense to set `max_bytes_before_external_group_by` to 10000000000, and `max_memory_usage` to 20000000000. When external aggregation is triggered (if there was at least one dump of temporary data), maximum consumption of RAM is only slightly more than `max_bytes_before_external_group_by`.
With distributed query processing, external aggregation is performed on remote servers. In order for the requestor server to use only a small amount of RAM, set `distributed_aggregation_memory_efficient` to 1.
Merging data flushed to the disk, as well as merging results from remote servers when the `distributed_aggregation_memory_efficient` setting is enabled, consumes up to 1/256 \* the number of threads from the total amount of RAM.
When external aggregation is enabled, if there was less than `max_bytes_before_external_group_by` of data (i.e. data was not flushed), the query runs just as fast as without external aggregation. If any temporary data was flushed, the run time will be several times longer (approximately three times).
If you have an ORDER BY with a small LIMIT after GROUP BY, then the ORDER BY clause will not use significant amounts of RAM.
But if the ORDER BY doesn't have LIMIT, don't forget to enable external sorting (`max_bytes_before_external_sort`).
### LIMIT N BY Clause
`LIMIT N BY COLUMNS` selects the top `N` rows for each group of `COLUMNS`. `LIMIT N BY` is not related to `LIMIT`; they can both be used in the same query. The key for `LIMIT N BY` can contain any number of columns or expressions.
Example:
@@ -621,9 +554,7 @@ LIMIT 100
The query will select the top 5 referrers for each `domain, device_type` pair, but not more than 100 rows (`LIMIT n BY + LIMIT`).
`LIMIT n BY` works with [NULL](syntax.md#null-literal) as if it were a specific value. This means that as the result of the query, the user will get all the combinations of fields specified in `BY`.

### HAVING Clause
Allows filtering the result received after GROUP BY, similar to the WHERE clause.
WHERE and HAVING differ in that WHERE is performed before aggregation (GROUP BY), while HAVING is performed after it.
@@ -642,47 +573,7 @@ We only recommend using COLLATE for final sorting of a small number of rows, sin
Rows that have identical values for the list of sorting expressions are output in an arbitrary order, which can also be nondeterministic (different each time).
If the ORDER BY clause is omitted, the order of the rows is also undefined, and may be nondeterministic as well.
`NaN` and `NULL` sorting order:
- With the modifier `NULLS FIRST` — First `NULL`, then `NaN`, then other values.
- With the modifier `NULLS LAST` — First the values, then `NaN`, then `NULL`.
- Default — The same as with the `NULLS LAST` modifier.
Example:
For the table
```
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 2 │ 2 │
│ 1 │ nan │
│ 2 │ 2 │
│ 3 │ 4 │
│ 5 │ 6 │
│ 6 │ nan │
│ 7 │ ᴺᵁᴸᴸ │
│ 6 │ 7 │
│ 8 │ 9 │
└───┴──────┘
```
Run the query `SELECT * FROM t_null_nan ORDER BY y NULLS FIRST` to get:
```
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 7 │ ᴺᵁᴸᴸ │
│ 1 │ nan │
│ 6 │ nan │
│ 2 │ 2 │
│ 2 │ 2 │
│ 3 │ 4 │
│ 5 │ 6 │
│ 6 │ 7 │
│ 8 │ 9 │
└───┴──────┘
```
Less RAM is used if a small enough LIMIT is specified in addition to ORDER BY. Otherwise, the amount of memory spent is proportional to the volume of data for sorting. For distributed query processing, if GROUP BY is omitted, sorting is partially done on remote servers, and the results are merged on the requestor server. This means that for distributed sorting, the volume of data to sort can be greater than the amount of memory on a single server.
@@ -701,16 +592,14 @@ These expressions work as if they are applied to separate rows in the result.
### DISTINCT Clause
If `DISTINCT` is specified, only a single row will remain out of all the sets of fully matching rows in the result.
The result will be the same as if `GROUP BY` were specified across all the fields specified in `SELECT` without aggregate functions. But there are several differences from `GROUP BY`:
- `DISTINCT` can be applied together with `GROUP BY`.
- When `ORDER BY` is omitted and `LIMIT` is defined, the query stops running immediately after the required number of different rows has been read.
- Data blocks are output as they are processed, without waiting for the entire query to finish running.
`DISTINCT` is not supported if `SELECT` has at least one array column.
`DISTINCT` works with [NULL](syntax.md#null-literal) as if `NULL` were a specific value, and `NULL=NULL`. In other words, in the `DISTINCT` results, different combinations with `NULL` only occur once.
### LIMIT Clause
@@ -721,11 +610,9 @@ LIMIT n, m allows you to select the first 'm' rows from the result after skippin
If there isn't an ORDER BY clause that explicitly sorts results, the result may be arbitrary and nondeterministic.
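A minimal sketch using the built-in `system.numbers` table:

```sql
-- Skip the first 5 rows, then return the next 3 (numbers 5, 6, 7):
SELECT number
FROM system.numbers
LIMIT 5, 3
```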
### UNION ALL Clause
You can use `UNION ALL` to combine any number of queries. Example:
```sql
SELECT CounterID, 1 AS table, toInt64(count()) AS c
@@ -740,13 +627,13 @@ SELECT CounterID, 2 AS table, sum(Sign) AS c
HAVING c > 0
```
Only `UNION ALL` is supported. The normal `UNION` (`UNION DISTINCT`) is not supported. If you need `UNION DISTINCT`, you can write `SELECT DISTINCT` from a subquery containing `UNION ALL`.
Queries that are part of a `UNION ALL` can be run in parallel and their results might be mixed together when returned.
The structure of results (the number and type of columns) must match for the queries. But the column names can differ. In this case, the column names for the final result will be taken from the first query. Type casting is performed for unions. For example, if two queries being combined have the same field with non-`Nullable` and `Nullable` types from a compatible type, the resulting `UNION ALL` has a `Nullable` type field.
Queries that are parts of `UNION ALL` can't be enclosed in brackets. `ORDER BY` and `LIMIT` are applied to separate queries, not to the final result. If you need to apply a conversion to the final result, you can put all the queries with `UNION ALL` in a subquery in the `FROM` clause.
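For example, a sketch of wrapping `UNION ALL` in a subquery so that `ORDER BY` applies to the combined result (`test.hits` and `test.visits` are hypothetical tables, as in the example above):

```sql
SELECT table, sum(c) AS total
FROM
(
    SELECT 1 AS table, count() AS c FROM test.hits
    UNION ALL
    SELECT 2 AS table, count() AS c FROM test.visits
)
GROUP BY table
ORDER BY table ASC
```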
### INTO OUTFILE Clause
@@ -765,9 +652,7 @@ If the FORMAT clause is omitted, the default format is used, which depends on bo
When using the command-line client, data is passed to the client in an internal efficient format. The client independently interprets the FORMAT clause of the query and formats the data itself (thus relieving the network and the server from the load).
<a name="query_language-in_operators"></a> ### IN Operators
### IN operators
The `IN`, `NOT IN`, `GLOBAL IN`, and `GLOBAL NOT IN` operators are covered separately, since their functionality is quite rich.
@@ -831,47 +716,14 @@ ORDER BY EventDate ASC
For each day after March 17th, count the percentage of pageviews made by users who visited the site on March 17th.
A subquery in the IN clause is always run just one time on a single server. There are no dependent subqueries.
#### NULL processing
During query processing, the `IN` operator assumes that the result of an operation with [NULL](syntax.md#null-literal) is always equal to `0`, regardless of whether `NULL` is on the right or left side of the operator. `NULL` values are not included in any dataset, do not correspond to each other, and cannot be compared.
Here is an example with the `t_null` table:
```
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 2 │ 3 │
└───┴──────┘
```
Running the query `SELECT x FROM t_null WHERE y IN (NULL,3)` gives you the following result:
```
┌─x─┐
│ 2 │
└───┘
```
You can see that the row in which `y = NULL` is thrown out of the query results. This is because ClickHouse can't decide whether `NULL` is included in the `(NULL,3)` set, returns `0` as the result of the operation, and `SELECT` excludes this row from the final output.
```
SELECT y IN (NULL, 3)
FROM t_null
┌─in(y, tuple(NULL, 3))─┐
│ 0 │
│ 1 │
└───────────────────────┘
```
<a name="queries-distributed-subrequests"></a> <a name="queries-distributed-subrequests"></a>
#### Distributed Subqueries
There are two options for INs with subqueries (similar to JOINs): normal `IN` / `JOIN` and `GLOBAL IN` / `GLOBAL JOIN`. They differ in how they are run for distributed query processing.
!!! attention
    Remember that the algorithms described below may work differently depending on the [distributed_product_mode](../operations/settings/settings.md#settings-distributed_product_mode) setting.
When using the regular IN, the query is sent to remote servers, and each of them runs the subqueries in the `IN` or `JOIN` clause.