提交 fdf0edc4 编写于 作者: B BayoNet 提交者: Ivan Blinkov

DOCAPI-4160: The 'uniq*' and 'count' functions desctiption update. (#5594)

上级 b0188e6e
......@@ -11,7 +11,7 @@ Restrictions:
- Only applied for IN and JOIN subqueries.
- Only if the FROM section uses a distributed table containing more than one shard.
- If the subquery concerns a distributed table containing more than one shard,
- If the subquery concerns a distributed table containing more than one shard.
- Not used for a table-valued [remote](../../query_language/table_functions/remote.md) function.
Possible values:
......@@ -688,5 +688,17 @@ Possible values:
Default value: 1.
## count_distinct_implementation {#settings-count_distinct_implementation}
Specifies which of the `uniq*` functions should be used for performing the [COUNT(DISTINCT ...)](../../query_language/agg_functions/reference.md#agg_function-count) construction.
Possible values:
- [uniq](../../query_language/agg_functions/reference.md#agg_function-uniq)
- [uniqCombined](../../query_language/agg_functions/reference.md#agg_function-uniqcombined)
- [uniqHLL12](../../query_language/agg_functions/reference.md#agg_function-uniqhll12)
- [uniqExact](../../query_language/agg_functions/reference.md#agg_function-uniqexact)
Default value: `uniqExact`.
[Original article](https://clickhouse.yandex/docs/en/operations/settings/settings/) <!--hide-->
# Function reference
# Function Reference
## count() {#agg_function-count}
## count {#agg_function-count}
Counts the number of rows. Accepts zero arguments and returns UInt64.
The syntax `COUNT(DISTINCT x)` is not supported. The separate `uniq` aggregate function exists for this purpose.
Counts the number of rows or not-NULL values.
ClickHouse supports the following syntaxes for `count`:
- `count(expr)` or `COUNT(DISTINCT expr)`.
- `count()` or `COUNT(*)`. The `count()` syntax is a ClickHouse-specific implementation.
**Parameters**
The function can take:
- Zero parameters.
- One [expression](../syntax.md#syntax-expressions).
**Returned value**
- If the function is called without parameters it counts the number of rows.
- If the [expression](../syntax.md#syntax-expressions) is passed, then the function counts how many times this expression returned not null. If the expression returns a value of the [Nullable](../../data_types/nullable.md) data type, then the result of `count` stays not `Nullable`. The function returns 0 if the expression returned `NULL` for all the rows.
In both cases the type of the returned value is [UInt64](../../data_types/int_uint.md).
**Details**
ClickHouse supports the `COUNT(DISTINCT ...)` syntax. The behavior of this construction depends on the [count_distinct_implementation](../../operations/settings/settings.md#settings-count_distinct_implementation) setting. It defines which of the [uniq*](#agg_function-uniq) functions is used to perform the operation. By default the [uniqExact](#agg_function-uniqexact) function.
A `SELECT count() FROM table` query is not optimized, because the number of entries in the table is not stored separately. It chooses some small column from the table and count the number of values in it.
**Examples**
Example 1:
```sql
SELECT count() FROM t
```
```text
┌─count()─┐
│ 5 │
└─────────┘
```
Example 2:
```sql
SELECT name, value FROM system.settings WHERE name = 'count_distinct_implementation'
```
```text
┌─name──────────────────────────┬─value─────┐
│ count_distinct_implementation │ uniqExact │
└───────────────────────────────┴───────────┘
```
```sql
SELECT count(DISTINCT num) FROM t
```
```text
┌─uniqExact(num)─┐
│ 3 │
└────────────────┘
```
This example shows that `count(DISTINCT num)` is performed by the function `uniqExact` corresponding to the `count_distinct_implementation` setting value.
A `SELECT count() FROM table` query is not optimized, because the number of entries in the table is not stored separately. It will select some small column from the table and count the number of values in it.
## any(x) {#agg_function-any}
......@@ -454,45 +512,143 @@ Calculates the average.
Only works for numbers.
The result is always Float64.
## uniq(x) {#agg_function-uniq}
## uniq {#agg_function-uniq}
Calculates the approximate number of different values of the argument. Works for numbers, strings, dates, date-with-time, and for multiple arguments and tuple arguments.
Calculates the approximate number of different values of the argument.
Uses an adaptive sampling algorithm: for the calculation state, it uses a sample of element hash values with a size up to 65536.
This algorithm is also very accurate for data sets with low cardinality (up to 65536) and very efficient on CPU (when computing not too many of these functions, using `uniq` is almost as fast as using other aggregate functions).
```
uniq(x[, ...])
```
The result is determinate (it doesn't depend on the order of query processing).
**Parameters**
This function provides excellent accuracy even for data sets with extremely high cardinality (over 10 billion elements). It is recommended for default use.
Function takes the variable number of parameters. Parameters can be of types: `Tuple`, `Array`, `Date`, `DateTime`, `String`, numeric types.
## uniqCombined(HLL_precision)(x)
**Returned value**
Calculates the approximate number of different values of the argument. Works for numbers, strings, dates, date-with-time, and for multiple arguments and tuple arguments.
- The number of the [UInt64](../../data_types/int_uint.md) type.
A combination of three algorithms is used: array, hash table and [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) with an error correction table. For small number of distinct elements, the array is used. When the set size becomes larger the hash table is used, while it is smaller than HyperLogLog data structure. For larger number of elements, the HyperLogLog is used, and it will occupy fixed amount of memory.
**Implementation details**
The parameter "HLL_precision" is the base-2 logarithm of the number of cells in HyperLogLog. You can omit the parameter (omit first parens). The default value is 17, that is effectively 96 KiB of space (2^17 cells of 6 bits each). The memory consumption is several times smaller than for the `uniq` function, and the accuracy is several times higher. Performance is slightly lower than for the `uniq` function, but sometimes it can be even higher than it, such as with distributed queries that transmit a large number of aggregation states over the network.
Function:
The result is deterministic (it doesn't depend on the order of query processing).
- Calculates a hash for all parameters in the aggregate, then uses it in calculations.
- Uses an adaptive sampling algorithm. For the calculation state, the function uses a sample of element hash values with a size up to 65536.
The `uniqCombined` function is a good default choice for calculating the number of different values, but keep in mind that the estimation error for large sets (200 million elements and more) will become larger than theoretical value due to poor choice of hash function.
This algorithm is very accurate and very efficient on CPU. When query contains several of these functions, using `uniq` is almost as fast as using other aggregate functions.
## uniqHLL12(x)
- Provides the result deterministically (it doesn't depend on the order of query processing).
Uses the [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) algorithm to approximate the number of different values of the argument.
212 5-bit cells are used. The size of the state is slightly more than 2.5 KB. The result is not very accurate (up to ~10% error) for small data sets (<10K elements). However, the result is fairly accurate for high-cardinality data sets (10K-100M), with a maximum error of ~1.6%. Starting from 100M, the estimation error increases, and the function will return very inaccurate results for data sets with extremely high cardinality (1B+ elements).
We recommend to use this function in almost all scenarios.
The result is determinate (it doesn't depend on the order of query processing).
**See Also**
- [uniqCombined](#agg_function-uniqcombined)
- [uniqHLL12](#agg_function-uniqhll12)
- [uniqExact](#agg_function-uniqexact)
## uniqCombined {#agg_function-uniqcombined}
Calculates the approximate number of different values of the argument.
```
uniqCombined(HLL_precision)(x[, ...])
```
The `uniqCombined` function is a good choice for calculating the number of different values, but keep in mind that the estimation error for large sets (200 million elements and more) will become larger than theoretical value due to poor choice of hash function.
**Parameters**
Function takes the variable number of parameters. Parameters can be of types: `Tuple`, `Array`, `Date`, `DateTime`, `String`, numeric types.
`HLL_precision` is the base-2 logarithm of the number of cells in [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). Optional, you can use the function as `uniqCombined(x[, ...])`. The default value for `HLL_precision` is 17, that is effectively 96 KiB of space (2^17 cells of 6 bits each).
**Returned value**
- The number of the [UInt64](../../data_types/int_uint.md) type.
**Implementation details**
Function:
- Calculates a hash for all parameters in the aggregate, then uses it in calculations.
- Uses combination of three algorithms: array, hash table and HyperLogLog with an error correction table.
For small number of distinct elements, the array is used. When the set size becomes larger the hash table is used, while it is smaller than HyperLogLog data structure. For larger number of elements, the HyperLogLog is used, and it will occupy fixed amount of memory.
- Provides the result deterministically (it doesn't depend on the order of query processing).
In comparison with the [uniq](#agg_function-uniq) function the `uniqCombined`:
- Consumes several times less memory.
- Calculates with several times higher accuracy.
- Performs slightly lower usually. In some scenarios `uniqCombined` can perform better than `uniq`, for example, with distributed queries that transmit a large number of aggregation states over the network.
**See Also**
- [uniq](#agg_function-uniq)
- [uniqHLL12](#agg_function-uniqhll12)
- [uniqExact](#agg_function-uniqexact)
## uniqHLL12 {#agg_function-uniqhll12}
Calculates the approximate number of different values of the argument, using the [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) algortithm.
```
uniqHLL12(x[, ...])
```
**Parameters**
We don't recommend using this function. In most cases, use the `uniq` or `uniqCombined` function.
Function takes the variable number of parameters. Parameters can be of types: `Tuple`, `Array`, `Date`, `DateTime`, `String`, numeric types.
## uniqExact(x)
**Returned value**
- The number of the [UInt64](../../data_types/int_uint.md) type.
**Implementation details**
Function:
- Calculates a hash for all parameters in the aggregate, then uses it in calculations.
- Uses the HyperLogLog algorithm to approximate the number of different values of the argument.
212 5-bit cells are used. The size of the state is slightly more than 2.5 KB. The result is not very accurate (up to ~10% error) for small data sets (<10K elements). However, the result is fairly accurate for high-cardinality data sets (10K-100M), with a maximum error of ~1.6%. Starting from 100M, the estimation error increases, and the function will return very inaccurate results for data sets with extremely high cardinality (1B+ elements).
- Provides the determinate result (it doesn't depend on the order of query processing).
We don't recommend using this function. In most cases, use the [uniq](#agg_function-uniq) or [uniqCombined](#agg_function-uniqcombined) function.
**See Also**
- [uniq](#agg_function-uniq)
- [uniqCombined](#agg_function-uniqcombined)
- [uniqExact](#agg_function-uniqexact)
## uniqExact {#agg_function-uniqexact}
Calculates the exact number of different values of the argument.
```
uniqExact(x[, ...])
```
Use the `uniqExact` function if you definitely need an exact result. Otherwise use the [uniq](#agg_function-uniq) function.
The `uniqExact` function uses more memory than the `uniq`, because the size of the state has unbounded growth as the number of different values increases.
**Parameters**
Function takes the variable number of parameters. Parameters can be of types: `Tuple`, `Array`, `Date`, `DateTime`, `String`, numeric types.
**See Also**
Calculates the number of different values of the argument, exactly.
There is no reason to fear approximations. It's better to use the `uniq` function.
Use the `uniqExact` function if you definitely need an exact result.
- [uniq](#agg_function-uniq)
- [uniqCombined](#agg_function-uniqcombined)
- [uniqHLL12](#agg_function-uniqhll12)
The `uniqExact` function uses more memory than the `uniq` function, because the size of the state has unbounded growth as the number of different values increases.
## groupArray(x), groupArray(max_size)(x)
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册