select.md 63.5 KB
Newer Older
1
# SELECT Queries Syntax
2

I
Ivan Blinkov 已提交
3
`SELECT` performs data retrieval.
4

5
```sql
6
[WITH expr_list|(subquery)]
7
SELECT [DISTINCT] expr_list
8 9 10 11 12 13 14 15 16
[FROM [db.]table | (subquery) | table_function] [FINAL]
[SAMPLE sample_coeff]
[ARRAY JOIN ...]
[GLOBAL] [ANY|ALL] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER] JOIN (subquery)|table USING columns_list
[PREWHERE expr]
[WHERE expr]
[GROUP BY expr_list] [WITH TOTALS]
[HAVING expr]
[ORDER BY expr_list]
A
alexey-milovidov 已提交
17
[LIMIT [offset_value, ]n BY columns]
18 19 20 21
[LIMIT [n, ]m]
[UNION ALL ...]
[INTO OUTFILE filename]
[FORMAT format]
22 23 24 25 26 27 28 29
```

All the clauses are optional, except for the required list of expressions immediately after SELECT.
The clauses below are described in almost the same order as in the query execution conveyor.

If the query omits the `DISTINCT`, `GROUP BY` and `ORDER BY` clauses and the `IN` and `JOIN` subqueries, the query will be completely stream processed, using O(1) amount of RAM.
Otherwise, the query might consume a lot of RAM if the appropriate restrictions are not specified: `max_memory_usage`, `max_rows_to_group_by`, `max_rows_to_sort`, `max_rows_in_distinct`, `max_bytes_in_distinct`, `max_rows_in_set`, `max_bytes_in_set`, `max_rows_in_join`, `max_bytes_in_join`, `max_bytes_before_external_sort`, `max_bytes_before_external_group_by`. For more information, see the section "Settings". It is possible to use external sorting (saving temporary tables to a disk) and external aggregation. `The system does not have "merge join"`.

30 31 32 33 34 35 36 37
### WITH Clause
This section provides support for Common Table Expressions ([CTE](https://en.wikipedia.org/wiki/Hierarchical_and_recursive_queries_in_SQL)), with some limitations:
1. Recursive queries are not supported
2. When subquery is used inside WITH section, it's result should be scalar with exactly one row
3. Expression's results are not available in subqueries
Results of WITH clause expressions can be used inside SELECT clause.

Example 1: Using constant expression as "variable"
38
```sql
39 40 41 42 43 44 45 46 47
WITH '2019-08-01 15:23:00' as ts_upper_bound
SELECT *
FROM hits
WHERE
    EventDate = toDate(ts_upper_bound) AND
    EventTime <= ts_upper_bound
```

Example 2: Evicting sum(bytes) expression result from SELECT clause column list
48
```sql
49 50 51 52 53 54 55 56 57 58
WITH sum(bytes) as s
SELECT
    formatReadableSize(s),
    table
FROM system.parts
GROUP BY table
ORDER BY s
```

Example 3: Using results of scalar subquery
59
```sql
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
/* this example would return TOP 10 of most huge tables */
WITH
    (
        SELECT sum(bytes)
        FROM system.parts
        WHERE active
    ) AS total_disk_usage
SELECT
    (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
    table
FROM system.parts
GROUP BY table
ORDER BY table_disk_usage DESC
LIMIT 10
```

Example 4: Re-using expression in subquery
As a workaround for current limitation for expression usage in subqueries, you may duplicate it.
78
```sql
79 80 81 82 83 84 85 86 87
WITH ['hello'] AS hello
SELECT
    hello,
    *
FROM
(
    WITH ['hello'] AS hello
    SELECT hello
)
88 89
```
```text
90 91 92 93 94 95
┌─hello─────┬─hello─────┐
│ ['hello'] │ ['hello'] │
└───────────┴───────────┘
```


96
### FROM Clause {#select-from}
97 98

If the FROM clause is omitted, data will be read from the `system.one` table.
99
The `system.one` table contains exactly one row (this table fulfills the same purpose as the DUAL table found in other DBMSs).
100

101
The `FROM` clause specifies the source to read data from:
102

103 104 105
- Table
- Subquery
- [Table function](table_functions/index.md)
106

107 108 109
`ARRAY JOIN` and the regular `JOIN` may also be included (see below).

Instead of a table, the `SELECT` subquery may be specified in parenthesis.
110
In contrast to standard SQL, a synonym does not need to be specified after a subquery.
111 112

To execute a query, all the columns listed in the query are extracted from the appropriate table. Any columns not needed for the external query are thrown out of the subqueries.
113
If a query does not list any columns (for example, `SELECT count() FROM t`), some column is extracted from the table anyway (the smallest one is preferred), in order to calculate the number of rows.
114

115 116
#### FINAL Modifier {#select-from-final}

117
Applicable when selecting data from tables from the [MergeTree](../operations/table_engines/mergetree.md)-engine family other than `GraphiteMergeTree`. When `FINAL` is specified, ClickHouse fully merges the data before returning the result and thus performs all data transformations that happen during merges for the given table engine.
118 119 120

Also supported for:
- [Replicated](../operations/table_engines/replication.md) versions of `MergeTree` engines.
121
- [View](../operations/table_engines/view.md), [Buffer](../operations/table_engines/buffer.md), [Distributed](../operations/table_engines/distributed.md), and [MaterializedView](../operations/table_engines/materializedview.md) engines that operate over other engines, provided they were created over `MergeTree`-engine tables.
122

123
Queries that use `FINAL` are executed not as fast as similar queries that don't, because:
124

125 126
- Query is executed in a single thread and data is merged during query execution.
- Queries with `FINAL` read primary key columns in addition to the columns specified in the query.
127

128
In most cases, avoid using `FINAL`.
129

130
### SAMPLE Clause {#select-sample-clause}
131

132
The `SAMPLE` clause allows for approximated query processing.
133 134 135 136 137 138 139 140 141 142 143

When data sampling is enabled, the query is not performed on all the data, but only on a certain fraction of data (sample). For example, if you need to calculate statistics for all the visits, it is enough to execute the query on the 1/10 fraction of all the visits and then multiply the result by 10.

Approximated query processing can be useful in the following cases:

- When you have strict timing requirements (like <100ms) but you can't justify the cost of additional hardware resources to meet them.
- When your raw data is not accurate, so approximation doesn't noticeably degrade the quality.
- Business requirements target approximate results (for cost-effectiveness, or in order to market exact results to premium users).

!!! note
    You can only use sampling with the tables in the [MergeTree](../operations/table_engines/mergetree.md) family, and only if the sampling expression was specified during table creation (see [MergeTree engine](../operations/table_engines/mergetree.md#table_engine-mergetree-creating-a-table)).
144

O
ogorbacheva 已提交
145
The features of data sampling are listed below:
146

O
ogorbacheva 已提交
147
- Data sampling is a deterministic mechanism. The result of the same `SELECT .. SAMPLE` query is always the same.
148
- Sampling works consistently for different tables. For tables with a single sampling key, a sample with the same coefficient always selects the same subset of possible data. For example, a sample of user IDs takes rows with the same subset of all the possible user IDs from different tables. This means that you can use the sample in subqueries in the [IN](#select-in-operators) clause. Also, you can join samples using the [JOIN](#select-join) clause.
149
- Sampling allows reading less data from a disk. Note that you must specify the sampling key correctly. For more information, see [Creating a MergeTree Table](../operations/table_engines/mergetree.md#table_engine-mergetree-creating-a-table).
150

151
For the `SAMPLE` clause the following syntax is supported:
O
ogorbacheva 已提交
152

153 154
| SAMPLE&#160;Clause&#160;Syntax | Description |
| ---------------- | --------- |
155 156
| `SAMPLE k` | Here `k` is the number from 0 to 1.</br>The query is executed on `k` fraction of data. For example, `SAMPLE 0.1` runs the query on 10% of data. [Read more](#select-sample-k)|
| `SAMPLE n` | Here `n` is a sufficiently large integer.</br>The query is executed on a sample of at least `n` rows (but not significantly more than this). For example, `SAMPLE 10000000` runs the query on a minimum of 10,000,000 rows. [Read more](#select-sample-n) |
157
| `SAMPLE k OFFSET m` | Here `k` and `m` are the numbers from 0 to 1.</br>The query is executed on a sample of `k` fraction of the data. The data used for the sample is offset by `m` fraction. [Read more](#select-sample-offset) |
158

O
ogorbacheva 已提交
159

O
ogorbacheva 已提交
160
#### SAMPLE k {#select-sample-k}
O
ogorbacheva 已提交
161

162 163
Here `k` is the number from 0 to 1 (both fractional and decimal notations are supported). For example, `SAMPLE 1/2` or `SAMPLE 0.5`.

164
In a `SAMPLE k` clause, the sample is taken from the `k` fraction of data. The example is shown below:
165

166
```sql
167 168 169 170 171 172 173 174 175 176 177
SELECT
    Title,
    count() * 10 AS PageViews
FROM hits_distributed
SAMPLE 0.1
WHERE
    CounterID = 34
GROUP BY Title
ORDER BY PageViews DESC LIMIT 1000
```

178
In this example, the query is executed on a sample from 0.1 (10%) of data. Values of aggregate functions are not corrected automatically, so to get an approximate result, the value `count()` is manually multiplied by 10.
179

O
ogorbacheva 已提交
180
#### SAMPLE n {#select-sample-n}
O
ogorbacheva 已提交
181

182 183 184
Here `n` is a sufficiently large integer. For example, `SAMPLE 10000000`.

In this case, the query is executed on a sample of at least `n` rows (but not significantly more than this). For example, `SAMPLE 10000000` runs the query on a minimum of 10,000,000 rows.
O
ogorbacheva 已提交
185

O
ogorbacheva 已提交
186
Since the minimum unit for data reading is one granule (its size is set by the `index_granularity` setting), it makes sense to set a sample that is much larger than the size of the granule.
O
ogorbacheva 已提交
187

188 189 190
When using the `SAMPLE n` clause, you don't know which relative percent of data was processed. So you don't know the coefficient the aggregate functions should be multiplied by. Use the `_sample_factor` virtual column to get the approximate result.

The `_sample_factor` column contains relative coefficients that are calculated dynamically. This column is created automatically when you [create](../operations/table_engines/mergetree.md#table_engine-mergetree-creating-a-table) a table with the specified sampling key. The usage examples of the `_sample_factor` column are shown below.
O
ogorbacheva 已提交
191

192
Let's consider the table `visits`, which contains the statistics about site visits. The first example shows how to calculate the number of page views:
193

194
```sql
195
SELECT sum(PageViews * _sample_factor)
O
ogorbacheva 已提交
196 197
FROM visits
SAMPLE 10000000
198
```  
O
ogorbacheva 已提交
199

200
The next example shows how to calculate the total number of visits:
O
ogorbacheva 已提交
201

202
```sql
O
fixes  
ogorbacheva 已提交
203
SELECT sum(_sample_factor)
O
ogorbacheva 已提交
204 205
FROM visits
SAMPLE 10000000
206
``` 
O
ogorbacheva 已提交
207

208
The example below shows how to calculate the average session duration. Note that you don't need to use the relative coefficient to calculate the average values.
O
ogorbacheva 已提交
209

210
```sql
O
ogorbacheva 已提交
211 212 213
SELECT avg(Duration)
FROM visits
SAMPLE 10000000
214
``` 
215

O
ogorbacheva 已提交
216
#### SAMPLE k OFFSET m {#select-sample-offset}
217

218
Here `k` and `m` are numbers from 0 to 1. Examples are shown below.
219

220
**Example 1**
221

222
```sql
223 224 225
SAMPLE 1/10
```

226
In this example, the sample is 1/10th of all data:
227 228 229

`[++------------------]`

230
**Example 2**
231

232
```sql
233 234 235
SAMPLE 1/10 OFFSET 1/2
```

236
Here, a sample of 10% is taken from the second half of the data.
237 238

`[----------++--------]`
B
BayoNet 已提交
239

240
### ARRAY JOIN Clause {#select-array-join-clause}
241

242
Allows executing `JOIN` with an array or nested data structure. The intent is similar to the [arrayJoin](functions/array_join.md#functions_arrayjoin) function, but its functionality is broader.
243

244
```sql
245 246 247 248 249
SELECT <expr_list>
FROM <left_subquery>
[LEFT] ARRAY JOIN <array>
[WHERE|PREWHERE <expr>]
...
I
Ivan Blinkov 已提交
250
```
251

252
You can specify only a single `ARRAY JOIN` clause in a query.
253

254
The query execution order is optimized when running `ARRAY JOIN`. Although `ARRAY JOIN` must always be specified before the `WHERE/PREWHERE` clause, it can be performed either before `WHERE/PREWHERE` (if the result is needed in this clause), or after completing it (to reduce the volume of calculations). The processing order is controlled by the query optimizer.
255

256
Supported types of `ARRAY JOIN` are listed below:
257

258 259
- `ARRAY JOIN` - In this case, empty arrays are not included in the result of `JOIN`.
- `LEFT ARRAY JOIN` - The result of `JOIN` contains rows with empty arrays. The value for an empty array is set to the default value for the array element type (usually 0, empty string or NULL).
260

261
The examples below demonstrate the usage of the `ARRAY JOIN` and `LEFT ARRAY JOIN` clauses. Let's create a table with an [Array](../data_types/array.md) type column and insert values into it:
262

263
```sql
264 265 266 267 268
CREATE TABLE arrays_test
(
    s String,
    arr Array(UInt8)
) ENGINE = Memory;
269

270 271 272
INSERT INTO arrays_test
VALUES ('Hello', [1,2]), ('World', [3,4,5]), ('Goodbye', []);
```
273
```text
274 275 276 277 278 279
┌─s───────────┬─arr─────┐
│ Hello       │ [1,2]   │
│ World       │ [3,4,5] │
│ Goodbye     │ []      │
└─────────────┴─────────┘
```
280

281
The example below uses the `ARRAY JOIN` clause:
282

283
```sql
284 285
SELECT s, arr
FROM arrays_test
286 287
ARRAY JOIN arr;
```
288
```text
289 290 291 292 293 294 295
┌─s─────┬─arr─┐
│ Hello │   1 │
│ Hello │   2 │
│ World │   3 │
│ World │   4 │
│ World │   5 │
└───────┴─────┘
296
```
297

298
The next example uses the `LEFT ARRAY JOIN` clause:
299

300
```sql
301
SELECT s, arr
302
FROM arrays_test
303
LEFT ARRAY JOIN arr;
304
```
305
```text
306 307 308 309 310 311 312 313
┌─s───────────┬─arr─┐
│ Hello       │   1 │
│ Hello       │   2 │
│ World       │   3 │
│ World       │   4 │
│ World       │   5 │
│ Goodbye     │   0 │
└─────────────┴─────┘
314
```
315

316 317 318 319
#### Using Aliases

An alias can be specified for an array in the `ARRAY JOIN` clause. In this case, an array item can be accessed by this alias, but the array itself is accessed by the original name. Example:

320
```sql
321 322 323 324 325
SELECT s, arr, a
FROM arrays_test
ARRAY JOIN arr AS a;
```

326
```text
327 328 329 330 331 332 333 334 335 336
┌─s─────┬─arr─────┬─a─┐
│ Hello │ [1,2]   │ 1 │
│ Hello │ [1,2]   │ 2 │
│ World │ [3,4,5] │ 3 │
│ World │ [3,4,5] │ 4 │
│ World │ [3,4,5] │ 5 │
└───────┴─────────┴───┘
```

Using aliases, you can perform `ARRAY JOIN` with an external array. For example:
337

338
```sql
339
SELECT s, arr_external
340
FROM arrays_test
341 342
ARRAY JOIN [1, 2, 3] AS arr_external;
```
343

344
```text
345 346 347 348 349 350 351 352 353 354 355 356 357
┌─s───────────┬─arr_external─┐
│ Hello       │            1 │
│ Hello       │            2 │
│ Hello       │            3 │
│ World       │            1 │
│ World       │            2 │
│ World       │            3 │
│ Goodbye     │            1 │
│ Goodbye     │            2 │
│ Goodbye     │            3 │
└─────────────┴──────────────┘
```

358
Multiple arrays can be comma-separated in the `ARRAY JOIN` clause. In this case, `JOIN` is performed with them simultaneously (the direct sum, not the cartesian product). Note that all the arrays must have the same size. Example:
359

360
```sql
361 362
SELECT s, arr, a, num, mapped
FROM arrays_test
363 364
ARRAY JOIN arr AS a, arrayEnumerate(arr) AS num, arrayMap(x -> x + 1, arr) AS mapped;
```
365

366
```text
367 368 369 370 371 372 373
┌─s─────┬─arr─────┬─a─┬─num─┬─mapped─┐
│ Hello │ [1,2]   │ 1 │   1 │      2 │
│ Hello │ [1,2]   │ 2 │   2 │      3 │
│ World │ [3,4,5] │ 3 │   1 │      4 │
│ World │ [3,4,5] │ 4 │   2 │      5 │
│ World │ [3,4,5] │ 5 │   3 │      6 │
└───────┴─────────┴───┴─────┴────────┘
374
```
375

376 377
The example below uses the [arrayEnumerate](functions/array_functions.md#array_functions-arrayenumerate) function:

378
```sql
379 380
SELECT s, arr, a, num, arrayEnumerate(arr)
FROM arrays_test
381 382
ARRAY JOIN arr AS a, arrayEnumerate(arr) AS num;
```
383

384
```text
385 386 387 388 389 390 391 392 393
┌─s─────┬─arr─────┬─a─┬─num─┬─arrayEnumerate(arr)─┐
│ Hello │ [1,2]   │ 1 │   1 │ [1,2]               │
│ Hello │ [1,2]   │ 2 │   2 │ [1,2]               │
│ World │ [3,4,5] │ 3 │   1 │ [1,2,3]             │
│ World │ [3,4,5] │ 4 │   2 │ [1,2,3]             │
│ World │ [3,4,5] │ 5 │   3 │ [1,2,3]             │
└───────┴─────────┴───┴─────┴─────────────────────┘
```

394
#### ARRAY JOIN With Nested Data Structure
395

396
`ARRAY `JOIN`` also works with [nested data structures](../data_types/nested_data_structures/nested.md). Example:
397

398
```sql
399 400 401 402 403 404
CREATE TABLE nested_test
(
    s String,
    nest Nested(
    x UInt8,
    y UInt32)
405
) ENGINE = Memory;
406

407 408 409
INSERT INTO nested_test
VALUES ('Hello', [1,2], [10,20]), ('World', [3,4,5], [30,40,50]), ('Goodbye', [], []);
```
410

411
```text
412 413 414 415 416
┌─s───────┬─nest.x──┬─nest.y─────┐
│ Hello   │ [1,2]   │ [10,20]    │
│ World   │ [3,4,5] │ [30,40,50] │
│ Goodbye │ []      │ []         │
└─────────┴─────────┴────────────┘
417
```
418

419
```sql
420 421
SELECT s, `nest.x`, `nest.y`
FROM nested_test
422 423
ARRAY JOIN nest;
```
424

425
```text
426 427 428 429 430 431 432 433 434
┌─s─────┬─nest.x─┬─nest.y─┐
│ Hello │      1 │     10 │
│ Hello │      2 │     20 │
│ World │      3 │     30 │
│ World │      4 │     40 │
│ World │      5 │     50 │
└───────┴────────┴────────┘
```

435
When specifying names of nested data structures in `ARRAY JOIN`, the meaning is the same as `ARRAY JOIN` with all the array elements that it consists of. Examples are listed below:
436

437
```sql
438 439
SELECT s, `nest.x`, `nest.y`
FROM nested_test
440 441
ARRAY JOIN `nest.x`, `nest.y`;
```
442

443
```text
444 445 446 447 448 449 450 451 452 453 454
┌─s─────┬─nest.x─┬─nest.y─┐
│ Hello │      1 │     10 │
│ Hello │      2 │     20 │
│ World │      3 │     30 │
│ World │      4 │     40 │
│ World │      5 │     50 │
└───────┴────────┴────────┘
```

This variation also makes sense:

455
```sql
456 457
SELECT s, `nest.x`, `nest.y`
FROM nested_test
458 459
ARRAY JOIN `nest.x`;
```
460

461
```text
462 463 464 465 466 467 468 469 470
┌─s─────┬─nest.x─┬─nest.y─────┐
│ Hello │      1 │ [10,20]    │
│ Hello │      2 │ [10,20]    │
│ World │      3 │ [30,40,50] │
│ World │      4 │ [30,40,50] │
│ World │      5 │ [30,40,50] │
└───────┴────────┴────────────┘
```

471
An alias may be used for a nested data structure, in order to select either the `JOIN` result or the source array. Example:
472

473
```sql
474 475
SELECT s, `n.x`, `n.y`, `nest.x`, `nest.y`
FROM nested_test
476 477
ARRAY JOIN nest AS n;
```
478

479
```text
480 481 482 483 484 485 486 487 488
┌─s─────┬─n.x─┬─n.y─┬─nest.x──┬─nest.y─────┐
│ Hello │   1 │  10 │ [1,2]   │ [10,20]    │
│ Hello │   2 │  20 │ [1,2]   │ [10,20]    │
│ World │   3 │  30 │ [3,4,5] │ [30,40,50] │
│ World │   4 │  40 │ [3,4,5] │ [30,40,50] │
│ World │   5 │  50 │ [3,4,5] │ [30,40,50] │
└───────┴─────┴─────┴─────────┴────────────┘
```

489
Example of using the [arrayEnumerate](functions/array_functions.md#array_functions-arrayenumerate) function:
490

491
```sql
492 493
SELECT s, `n.x`, `n.y`, `nest.x`, `nest.y`, num
FROM nested_test
494 495
ARRAY JOIN nest AS n, arrayEnumerate(`nest.x`) AS num;
```
496

497
```text
498 499 500 501 502 503 504 505 506
┌─s─────┬─n.x─┬─n.y─┬─nest.x──┬─nest.y─────┬─num─┐
│ Hello │   1 │  10 │ [1,2]   │ [10,20]    │   1 │
│ Hello │   2 │  20 │ [1,2]   │ [10,20]    │   2 │
│ World │   3 │  30 │ [3,4,5] │ [30,40,50] │   1 │
│ World │   4 │  40 │ [3,4,5] │ [30,40,50] │   2 │
│ World │   5 │  50 │ [3,4,5] │ [30,40,50] │   3 │
└───────┴─────┴─────┴─────────┴────────────┴─────┘
```

B
BayoNet 已提交
507
### JOIN Clause {#select-join}
508

509
Joins the data in the normal [SQL JOIN](https://en.wikipedia.org/wiki/Join_(SQL)) sense.
B
BayoNet 已提交
510 511

!!! info "Note"
512
    Not related to [ARRAY JOIN](#select-array-join-clause).
B
BayoNet 已提交
513

514
```sql
B
BayoNet 已提交
515 516
SELECT <expr_list>
FROM <left_subquery>
517
[GLOBAL] [ANY|ALL] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER] JOIN <right_subquery>
B
BayoNet 已提交
518
(ON <expr_list>)|(USING <column_list>) ...
519 520
```

I
Ivan Blinkov 已提交
521
The table names can be specified instead of `<left_subquery>` and `<right_subquery>`. This is equivalent to the `SELECT * FROM table` subquery, except in a special case when the table has the [Join](../operations/table_engines/join.md) engine – an array prepared for joining.
522

523
#### Supported Types of `JOIN` {#select-join-types}
524

525 526 527 528 529
- `INNER JOIN` (or `JOIN`)
- `LEFT JOIN` (or `LEFT OUTER JOIN`)
- `RIGHT JOIN` (or `RIGHT OUTER JOIN`)
- `FULL JOIN` (or `FULL OUTER JOIN`)
- `CROSS JOIN` (or `,` )
530

531
See the standard [SQL JOIN](https://en.wikipedia.org/wiki/Join_(SQL)) description.
532

533 534
#### Multiple JOIN

535
Performing queries, ClickHouse rewrites multi-table joins into the sequence of two-table joins. For example, if there are four tables for join ClickHouse joins the first and the second, then joins the result with the third table, and at the last step, it joins the fourth one.
B
BayoNet 已提交
536

537
If a query contains the `WHERE` clause, ClickHouse tries to pushdown filters from this clause through the intermediate join. If it cannot apply the filter to each intermediate join, ClickHouse applies the filters after all joins are completed.
B
BayoNet 已提交
538

539
We recommend the `JOIN ON` or `JOIN USING` syntax for creating queries. For example:
540

541
```sql
542 543
SELECT * FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t1.a = t3.a
```
544

545
You can use comma-separated lists of tables in the `FROM` clause. For example:
546

547
```sql
548 549
SELECT * FROM t1, t2, t3 WHERE t1.a = t2.a AND t1.a = t3.a
```
550 551 552

Don't mix these syntaxes.

553
ClickHouse doesn't directly support syntax with commas, so we don't recommend using them. The algorithm tries to rewrite the query in terms of `CROSS JOIN` and `INNER JOIN` clauses and then proceeds to query processing. When rewriting the query, ClickHouse tries to optimize performance and memory consumption. By default, ClickHouse treats commas as an `INNER JOIN` clause and converts `INNER JOIN` to `CROSS JOIN` when the algorithm cannot guarantee that `INNER JOIN` returns the required data.
554

555
#### Strictness {#select-join-strictness}
556

557
- `ALL` — If the right table has several matching rows, ClickHouse creates a [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) from matching rows. This is the standard `JOIN` behavior in SQL.
558
- `ANY` — If the right table has several matching rows, only the first one found is joined. If the right table has only one matching row, the results of queries with `ANY` and `ALL` keywords are the same.
559
- `ASOF` — For joining sequences with a non-exact match. `ASOF JOIN` usage is described below.
560 561 562

**ASOF JOIN Usage**

563 564 565 566
`ASOF JOIN` is useful when you need to join records that have no exact match.

Tables for `ASOF JOIN` must have an ordered sequence column. This column cannot be alone in a table, and should be one of the data types: `UInt32`, `UInt64`, `Float32`, `Float64`, `Date`, and `DateTime`.

567
Syntax `ASOF JOIN ... ON`:
568

569 570 571 572 573 574
```sql
SELECT expressions_list
FROM table_1
ASOF LEFT JOIN table_2
ON equi_cond AND closest_match_cond
```
B
BayoNet 已提交
575

576
You can use any number of equality conditions and exactly one closest match condition. For example, `SELECT count() FROM table_1 ASOF LEFT JOIN table_2 ON table_1.a == table_2.b AND table_2.t <= table_1.t`. 
B
BayoNet 已提交
577

578
Conditions supported for the closest match: `>`, `>=`, `<`, `<=`.
579

580
Syntax `ASOF JOIN ... USING`:
B
BayoNet 已提交
581

582 583 584 585 586 587
```sql
SELECT expressions_list
FROM table_1
ASOF JOIN table_2
USING (equi_column1, ... equi_columnN, asof_column)
```
B
BayoNet 已提交
588

589
`ASOF JOIN` uses `equi_columnX` for joining on equality and `asof_column` for joining on the closest match with the `table_1.asof_column >= table_2.asof_column` condition. The `asof_column` column always the last one in the `USING` clause.
590 591

For example, consider the following tables:
592

593
```text
B
BayoNet 已提交
594 595
     table_1                           table_2

596 597 598 599 600 601 602 603 604
  event   | ev_time | user_id       event   | ev_time | user_id
----------|---------|----------   ----------|---------|----------             
              ...                               ...
event_1_1 |  12:00  |  42         event_2_1 |  11:59  |   42
              ...                 event_2_2 |  12:30  |   42
event_1_2 |  13:00  |  42         event_2_3 |  13:00  |   42
              ...                               ...
```

605
`ASOF JOIN` can take the timestamp of a user event from `table_1` and find an event in `table_2` where the timestamp is closest to the timestamp of the event from `table_1` corresponding to the closest match condition. Equal timestamp values are the closest if available. Here, the `user_id` column can be used for joining on equality and the `ev_time` column can be used for joining on the closest match. In our example, `event_1_1` can be joined with `event_2_1` and `event_1_2` can be joined with `event_2_3`, but `event_2_2` can't be joined.
606 607


608 609
!!! note "Note"
    `ASOF` join is **not** supported in the [Join](../operations/table_engines/join.md) table engine.
610

611
To set the default strictness value, use the session configuration parameter [join_default_strictness](../operations/settings/settings.md#settings-join_default_strictness).
612

613
#### GLOBAL JOIN
614

B
BayoNet 已提交
615
When using a normal `JOIN`, the query is sent to remote servers. Subqueries are run on each of them in order to make the right table, and the join is performed with this table. In other words, the right table is formed on each server separately.
616 617 618

When using `GLOBAL ... JOIN`, first the requestor server runs a subquery to calculate the right table. This temporary table is passed to each remote server, and queries are run on them using the temporary data that was transmitted.

619
Be careful when using `GLOBAL`. For more information, see the section [Distributed subqueries](#select-distributed-subqueries).
620

621
#### Usage Recommendations
622

B
BayoNet 已提交
623
When running a `JOIN`, there is no optimization of the order of execution in relation to other stages of the query. The join (a search in the right table) is run before filtering in `WHERE` and before aggregation. In order to explicitly set the processing order, we recommend running a `JOIN` subquery with a subquery.
624 625 626

Example:

627
```sql
628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650
SELECT
    CounterID,
    hits,
    visits
FROM
(
    SELECT
        CounterID,
        count() AS hits
    FROM test.hits
    GROUP BY CounterID
) ANY LEFT JOIN
(
    SELECT
        CounterID,
        sum(Sign) AS visits
    FROM test.visits
    GROUP BY CounterID
) USING CounterID
ORDER BY hits DESC
LIMIT 10
```

651
```text
652 653 654 655 656 657 658 659 660 661 662 663 664 665 666
┌─CounterID─┬───hits─┬─visits─┐
│   1143050 │ 523264 │  13665 │
│    731962 │ 475698 │ 102716 │
│    722545 │ 337212 │ 108187 │
│    722889 │ 252197 │  10547 │
│   2237260 │ 196036 │   9522 │
│  23057320 │ 147211 │   7689 │
│    722818 │  90109 │  17847 │
│     48221 │  85379 │   4652 │
│  19762435 │  77807 │   7026 │
│    722884 │  77492 │  11056 │
└───────────┴────────┴────────┘
```

Subqueries don't allow you to set names or use them for referencing a column from a specific subquery.
667
The columns specified in `USING` must have the same names in both subqueries, and the other columns must be named differently. You can use aliases to change the names of columns in subqueries (the example uses the aliases `hits` and `visits`).
668

B
BayoNet 已提交
669
The `USING` clause specifies one or more columns to join, which establishes the equality of these columns. The list of columns is set without brackets. More complex join conditions are not supported.
670

B
BayoNet 已提交
671
The right table (the subquery result) resides in RAM. If there isn't enough memory, you can't run a `JOIN`.
672

673
Each time a query is run with the same `JOIN`, the subquery is run again because the result is not cached. To avoid this, use the special [Join](../operations/table_engines/join.md) table engine, which is a prepared array for joining that is always in RAM.
674

B
BayoNet 已提交
675
In some cases, it is more efficient to use `IN` instead of `JOIN`.
676
Among the various types of `JOIN`, the most efficient is `ANY LEFT JOIN`, then `ANY INNER JOIN`. The least efficient are `ALL LEFT JOIN` and `ALL INNER JOIN`.
677

B
BayoNet 已提交
678
If you need a `JOIN` for joining with dimension tables (these are relatively small tables that contain dimension properties, such as names for advertising campaigns), a `JOIN` might not be very convenient due to the fact that the right table is re-accessed for every query. For such cases, there is an "external dictionaries" feature that you should use instead of `JOIN`. For more information, see the section [External dictionaries](dicts/external_dicts.md).
679

B
BayoNet 已提交
680
**Memory Limitations**
B
BayoNet 已提交
681

B
BayoNet 已提交
682
ClickHouse uses the [hash join](https://en.wikipedia.org/wiki/Hash_join) algorithm. ClickHouse takes the `<right_subquery>` and creates a hash table for it in RAM. If you need to restrict join operation memory consumption use the following settings:
B
BayoNet 已提交
683 684 685 686

- [max_rows_in_join](../operations/settings/query_complexity.md#settings-max_rows_in_join) — Limits number of rows in the hash table.
- [max_bytes_in_join](../operations/settings/query_complexity.md#settings-max_bytes_in_join) — Limits size of the hash table.

B
BayoNet 已提交
687
When any of these limits is reached, ClickHouse acts as the [join_overflow_mode](../operations/settings/query_complexity.md#settings-join_overflow_mode) setting instructs.
B
BayoNet 已提交
688

689
#### Processing of Empty or NULL Cells
690

S
Sergei Bocharov 已提交
691
While joining tables, the empty cells may appear. The setting [join_use_nulls](../operations/settings/settings.md#join_use_nulls) define how ClickHouse fills these cells.
692

693
If the `JOIN` keys are [Nullable](../data_types/nullable.md) fields, the rows where at least one of the keys has the value [NULL](syntax.md#null-literal) are not joined.
694

695 696
#### Syntax Limitations

697
For multiple `JOIN` clauses in a single `SELECT` query:
698 699 700 701

- Taking all the columns via `*` is available only if tables are joined, not subqueries.
- The `PREWHERE` clause is not available.

702 703 704
For `ON`, `WHERE`, and `GROUP BY` clauses:

- Arbitrary expressions cannot be used in `ON`, `WHERE`, and `GROUP BY` clauses, but you can define an expression in a `SELECT` clause and then use it in these clauses via an alias.
705

706

I
Ivan Blinkov 已提交
707
### WHERE Clause {#select-where}
708

I
Ivan Blinkov 已提交
709 710
If there is a WHERE clause, it must contain an expression with the UInt8 type. This is usually an expression with comparison and logical operators.
This expression will be used for filtering data before all other transformations.
711

I
Ivan Blinkov 已提交
712
If indexes are supported by the database table engine, the expression is evaluated on the ability to use indexes.
713

714

715
### PREWHERE Clause
716

I
Ivan Blinkov 已提交
717 718
This clause has the same meaning as the WHERE clause. The difference is in which data is read from the table.
When using PREWHERE, first only the columns necessary for executing PREWHERE are read. Then the other columns are read that are needed for running the query, but only those blocks where the PREWHERE expression is true.
719

720
It makes sense to use PREWHERE if there are filtration conditions that are used by a minority of the columns in the query, but that provide strong data filtration. This reduces the volume of data to read.
721

I
Ivan Blinkov 已提交
722
For example, it is useful to write PREWHERE for queries that extract a large number of columns, but that only have filtration for a few columns.
723

I
Ivan Blinkov 已提交
724
PREWHERE is only supported by tables from the `*MergeTree` family.
725

I
Ivan Blinkov 已提交
726
A query may simultaneously specify PREWHERE and WHERE. In this case, PREWHERE precedes WHERE.
727

I
Ivan Blinkov 已提交
728
If the 'optimize_move_to_prewhere' setting is set to 1 and PREWHERE is omitted, the system uses heuristics to automatically move parts of expressions from WHERE to PREWHERE.
729

B
BayoNet 已提交
730
### GROUP BY Clause {#select-group-by-clause}
731 732 733 734 735 736 737 738 739 740

This is one of the most important parts of a column-oriented DBMS.

If there is a GROUP BY clause, it must contain a list of expressions. Each expression will be referred to here as a "key".
All the expressions in the SELECT, HAVING, and ORDER BY clauses must be calculated from keys or from aggregate functions. In other words, each column selected from the table must be used either in keys or inside aggregate functions.

If a query contains only table columns inside aggregate functions, the GROUP BY clause can be omitted, and aggregation by an empty set of keys is assumed.

Example:

741
```sql
742 743 744 745 746 747 748 749 750 751 752 753 754
SELECT
    count(),
    median(FetchTiming > 60 ? 60 : FetchTiming),
    count() - sum(Refresh)
FROM hits
```

However, in contrast to standard SQL, if the table doesn't have any rows (either there aren't any at all, or there aren't any after using WHERE to filter), an empty result is returned, and not the result from one of the rows containing the initial values of aggregate functions.

As opposed to MySQL (and conforming to standard SQL), you can't get some value of some column that is not in a key or aggregate function (except constant expressions). To work around this, you can use the 'any' aggregate function (get the first encountered value) or 'min/max'.

Example:

755
```sql
756 757 758
SELECT
    domainWithoutWWW(URL) AS domain,
    count(),
B
BayoNet 已提交
759
    any(Title) AS title -- getting the first occurred page header for each domain.
760 761 762 763 764 765 766 767 768 769
FROM hits
GROUP BY domain
```

For every different key value encountered, GROUP BY calculates a set of aggregate function values.

GROUP BY is not supported for array columns.

A constant can't be specified as arguments for aggregate functions. Example: sum(1). Instead of this, you can get rid of the constant. Example: `count()`.

770 771
#### NULL processing

I
Ivan Blinkov 已提交
772
For grouping, ClickHouse interprets [NULL](syntax.md) as a value, and `NULL=NULL`.
773

B
BayoNet 已提交
774
Here's an example to show what this means.
775

B
BayoNet 已提交
776
Assume you have this table:
777

778
```text
B
BayoNet 已提交
779 780 781 782 783 784 785 786
┌─x─┬────y─┐
│ 1 │    2 │
│ 2 │ ᴺᵁᴸᴸ │
│ 3 │    2 │
│ 3 │    3 │
│ 3 │ ᴺᵁᴸᴸ │
└───┴──────┘
```
787

B
BayoNet 已提交
788
The query `SELECT sum(x), y FROM t_null_big GROUP BY y` results in:
789

790
```text
B
BayoNet 已提交
791 792 793 794 795 796
┌─sum(x)─┬────y─┐
│      4 │    2 │
│      3 │    3 │
│      5 │ ᴺᵁᴸᴸ │
└────────┴──────┘
```
797

V
Vivien Maisonneuve 已提交
798
You can see that `GROUP BY` for `y = NULL` summed up `x`, as if `NULL` is this value.
799

B
BayoNet 已提交
800
If you pass several keys to `GROUP BY`, the result will give you all the combinations of the selection, as if `NULL` were a specific value.
801

I
Ivan Blinkov 已提交
802
#### WITH TOTALS Modifier
803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826

If the WITH TOTALS modifier is specified, another row will be calculated. This row will have key columns containing default values (zeros or empty lines), and columns of aggregate functions with the values calculated across all the rows (the "total" values).

This extra row is output in JSON\*, TabSeparated\*, and Pretty\* formats, separately from the other rows. In the other formats, this row is not output.

In JSON\* formats, this row is output as a separate 'totals' field. In TabSeparated\* formats, the row comes after the main result, preceded by an empty row (after the other data). In Pretty\* formats, the row is output as a separate table after the main result.

`WITH TOTALS` can be run in different ways when HAVING is present. The behavior depends on the 'totals_mode' setting.
By default, `totals_mode = 'before_having'`. In this case, 'totals' is calculated across all rows, including the ones that don't pass through HAVING and 'max_rows_to_group_by'.

The other alternatives include only the rows that pass through HAVING in 'totals', and behave differently with the setting `max_rows_to_group_by` and `group_by_overflow_mode = 'any'`.

`after_having_exclusive` – Don't include rows that didn't pass through `max_rows_to_group_by`. In other words, 'totals' will have less than or the same number of rows as it would if `max_rows_to_group_by` were omitted.

`after_having_inclusive` – Include all the rows that didn't pass through 'max_rows_to_group_by' in 'totals'. In other words, 'totals' will have more than or the same number of rows as it would if `max_rows_to_group_by` were omitted.

`after_having_auto` – Count the number of rows that passed through HAVING. If it is more than a certain amount (by default, 50%), include all the rows that didn't pass through 'max_rows_to_group_by' in 'totals'. Otherwise, do not include them.

`totals_auto_threshold` – By default, 0.5. The coefficient for `after_having_auto`.

If `max_rows_to_group_by` and `group_by_overflow_mode = 'any'` are not used, all variations of `after_having` are the same, and you can use any of them (for example, `after_having_auto`).

You can use WITH TOTALS in subqueries, including subqueries in the JOIN clause (in this case, the respective total values are combined).

B
BayoNet 已提交
827
#### GROUP BY in External Memory {#select-group-by-in-external-memory}
828

829 830
You can enable dumping temporary data to the disk to restrict memory usage during `GROUP BY`.
The [max_bytes_before_external_group_by](../operations/settings/settings.md#settings-max_bytes_before_external_group_by) setting determines the threshold RAM consumption for dumping `GROUP BY` temporary data to the file system. If set to 0 (the default), it is disabled.
831

832
When using `max_bytes_before_external_group_by`, we recommend that you set `max_memory_usage` about twice as high. This is necessary because there are two stages to aggregation: reading the date and forming intermediate data (1) and merging the intermediate data (2). Dumping data to the file system can only occur during stage 1. If the temporary data wasn't dumped, then stage 2 might require up to the same amount of memory as in stage 1.
833

834
For example, if [max_memory_usage](../operations/settings/settings.md#settings_max_memory_usage) was set to 10000000000 and you want to use external aggregation, it makes sense to set `max_bytes_before_external_group_by` to 10000000000, and max_memory_usage to 20000000000. When external aggregation is triggered (if there was at least one dump of temporary data), maximum consumption of RAM is only slightly more than `max_bytes_before_external_group_by`.
835

836
With distributed query processing, external aggregation is performed on remote servers. In order for the requester server to use only a small amount of RAM, set `distributed_aggregation_memory_efficient` to 1.
837

838
When merging data flushed to the disk, as well as when merging results from remote servers when the `distributed_aggregation_memory_efficient` setting is enabled, consumes up to `1/256 * the_number_of_threads` from the total amount of RAM.
839

840
When external aggregation is enabled, if there was less than `max_bytes_before_external_group_by` of data (i.e. data was not flushed), the query runs just as fast as without external aggregation. If any temporary data was flushed, the run time will be several times longer (approximately three times).
841

842
If you have an `ORDER BY` with a `LIMIT` after `GROUP BY`, then the amount of used RAM depends on the amount of data in `LIMIT`, not in the whole table. But if the `ORDER BY` doesn't have `LIMIT`, don't forget to enable external sorting (`max_bytes_before_external_sort`).
843

844
### LIMIT BY Clause
845

846
A query with the `LIMIT n BY expressions` clause selects the first `n` rows for each distinct value of `expressions`. The key for `LIMIT BY` can contain any number of [expressions](syntax.md#syntax-expressions).
847

848 849
ClickHouse supports the following syntax:

B
BayoNet 已提交
850 851
- `LIMIT [offset_value, ]n BY expressions`
- `LIMIT n OFFSET offset_value BY expressions`
852

853
During query processing, ClickHouse selects data ordered by sorting key. The sorting key is set explicitly using an [ORDER BY](#select-order-by) clause or implicitly as a property of the table engine. Then ClickHouse applies `LIMIT n BY expressions` and returns the first `n` rows for each distinct combination of `expressions`. If `OFFSET` is specified, then for each data block that belongs to a distinct combination of `expressions`, ClickHouse skips `offset_value` number of rows from the beginning of the block and returns a maximum of `n` rows as a result. If `offset_value` is bigger than the number of rows in the data block, ClickHouse returns zero rows from the block.
B
BayoNet 已提交
854

855
`LIMIT BY` is not related to `LIMIT`. They can both be used in the same query.
856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891

**Examples**

Sample table:

```sql
CREATE TABLE limit_by(id Int, val Int) ENGINE = Memory;
INSERT INTO limit_by values(1, 10), (1, 11), (1, 12), (2, 20), (2, 21);
```

Queries:

```sql
SELECT * FROM limit_by ORDER BY id, val LIMIT 2 BY id
```
```text
┌─id─┬─val─┐
│  1 │  10 │
│  1 │  11 │
│  2 │  20 │
│  2 │  21 │
└────┴─────┘
```
```sql
SELECT * FROM limit_by ORDER BY id, val LIMIT 1, 2 BY id
```
```text
┌─id─┬─val─┐
│  1 │  11 │
│  1 │  12 │
│  2 │  21 │
└────┴─────┘
```

The `SELECT * FROM limit_by ORDER BY id, val LIMIT 2 OFFSET 1 BY id` query returns the same result.

892
The following query returns the top 5 referrers for each `domain, device_type` pair with a maximum of 100 rows in total (`LIMIT n BY + LIMIT`).
893

894
```sql
895 896 897 898 899 900 901 902 903 904 905 906
SELECT
    domainWithoutWWW(URL) AS domain,
    domainWithoutWWW(REFERRER_URL) AS referrer,
    device_type,
    count() cnt
FROM hits
GROUP BY domain, referrer, device_type
ORDER BY cnt DESC
LIMIT 5 BY domain, device_type
LIMIT 100
```

I
Ivan Blinkov 已提交
907
### HAVING Clause
908 909 910 911 912 913

Allows filtering the result received after GROUP BY, similar to the WHERE clause.
WHERE and HAVING differ in that WHERE is performed before aggregation (GROUP BY), while HAVING is performed after it.
If aggregation is not performed, HAVING can't be used.


B
BayoNet 已提交
914
### ORDER BY Clause {#select-order-by}
915 916 917 918 919 920 921 922 923 924

The ORDER BY clause contains a list of expressions, which can each be assigned DESC or ASC (the sorting direction). If the direction is not specified, ASC is assumed. ASC is sorted in ascending order, and DESC in descending order. The sorting direction applies to a single expression, not to the entire list. Example: `ORDER BY Visits DESC, SearchPhrase`

For sorting by String values, you can specify collation (comparison). Example: `ORDER BY SearchPhrase COLLATE 'tr'` - for sorting by keyword in ascending order, using the Turkish alphabet, case insensitive, assuming that strings are UTF-8 encoded. COLLATE can be specified or not for each expression in ORDER BY independently. If ASC or DESC is specified, COLLATE is specified after it. When using COLLATE, sorting is always case-insensitive.

We only recommend using COLLATE for final sorting of a small number of rows, since sorting with COLLATE is less efficient than normal sorting by bytes.

Rows that have identical values for the list of sorting expressions are output in an arbitrary order, which can also be nondeterministic (different each time).
If the ORDER BY clause is omitted, the order of the rows is also undefined, and may be nondeterministic as well.

925 926
`NaN` and `NULL` sorting order:

B
BayoNet 已提交
927 928 929 930 931 932 933 934
- With the modifier `NULLS FIRST` — First `NULL`, then `NaN`, then other values.
- With the modifier `NULLS LAST` — First the values, then `NaN`, then `NULL`.
- Default — The same as with the `NULLS LAST` modifier.

Example:

For the table

935
```text
B
BayoNet 已提交
936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 2 │    2 │
│ 1 │  nan │
│ 2 │    2 │
│ 3 │    4 │
│ 5 │    6 │
│ 6 │  nan │
│ 7 │ ᴺᵁᴸᴸ │
│ 6 │    7 │
│ 8 │    9 │
└───┴──────┘
```

Run the query `SELECT * FROM t_null_nan ORDER BY y NULLS FIRST` to get:

952
```text
B
BayoNet 已提交
953 954 955 956 957 958 959 960 961 962 963 964 965
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 7 │ ᴺᵁᴸᴸ │
│ 1 │  nan │
│ 6 │  nan │
│ 2 │    2 │
│ 2 │    2 │
│ 3 │    4 │
│ 5 │    6 │
│ 6 │    7 │
│ 8 │    9 │
└───┴──────┘
```
966

I
Ivan Blinkov 已提交
967
When floating point numbers are sorted, NaNs are separate from the other values. Regardless of the sorting order, NaNs come at the end. In other words, for ascending sorting they are placed as if they are larger than all the other numbers, while for descending sorting they are placed as if they are smaller than the rest.
968 969 970 971 972 973 974 975 976

Less RAM is used if a small enough LIMIT is specified in addition to ORDER BY. Otherwise, the amount of memory spent is proportional to the volume of data for sorting. For distributed query processing, if GROUP BY is omitted, sorting is partially done on remote servers, and the results are merged on the requestor server. This means that for distributed sorting, the volume of data to sort can be greater than the amount of memory on a single server.

If there is not enough RAM, it is possible to perform sorting in external memory (creating temporary files on a disk). Use the setting `max_bytes_before_external_sort` for this purpose. If it is set to 0 (the default), external sorting is disabled. If it is enabled, when the volume of data to sort reaches the specified number of bytes, the collected data is sorted and dumped into a temporary file. After all data is read, all the sorted files are merged and the results are output. Files are written to the /var/lib/clickhouse/tmp/ directory in the config (by default, but you can use the 'tmp_path' parameter to change this setting).

Running a query may use more memory than 'max_bytes_before_external_sort'. For this reason, this setting must have a value significantly smaller than 'max_memory_usage'. As an example, if your server has 128 GB of RAM and you need to run a single query, set 'max_memory_usage' to 100 GB, and 'max_bytes_before_external_sort' to 80 GB.

External sorting works much less effectively than sorting in RAM.

977 978
### SELECT Clause {#select-select}

979
[Expressions](syntax.md#syntax-expressions) specified in the `SELECT` clause are calculated after all the operations in the clauses described above are finished. These expressions work as if they apply to separate rows in the result. If expressions in the `SELECT` clause contain aggregate functions, then ClickHouse processes aggregate functions and expressions used as their arguments during the [GROUP BY](#select-group-by-clause) aggregation.
980

981
If you want to include all columns in the result, use the asterisk (`*`) symbol. For example, `SELECT * FROM ...`.
982

983
To match some columns in the result with a [re2](https://en.wikipedia.org/wiki/RE2_(software)) regular expression, you can use the `COLUMNS` expression.
984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005

```sql
COLUMNS('regexp')
```

For example, consider the table:

```sql
CREATE TABLE default.col_names (aa Int8, ab Int8, bc Int8) ENGINE = TinyLog
```

The following query selects data from all the columns containing the `a` symbol in their name.

```sql
SELECT COLUMNS('a') FROM col_names
```
```text
┌─aa─┬─ab─┐
│  1 │  1 │
└────┴────┘
```

1006 1007 1008
The selected columns are returned not in the alphabetical order.

You can use multiple `COLUMNS` expressions in a query and apply functions to them.
1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020

For example:

```sql
SELECT COLUMNS('a'), COLUMNS('c'), toTypeName(COLUMNS('c')) FROM col_names
```
```text
┌─aa─┬─ab─┬─bc─┬─toTypeName(bc)─┐
│  1 │  1 │  1 │ Int8           │
└────┴────┴────┴────────────────┘
```

1021
Each column returned by the `COLUMNS` expression is passed to the function as a separate argument. Also you can pass other arguments to the function if it supports them. Be careful when using functions. If a function doesn't support the number of arguments you have passed to it, ClickHouse throws an exception.
1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032

For example:

```sql
SELECT COLUMNS('a') + COLUMNS('c') FROM col_names
```
```text
Received exception from server (version 19.14.1):
Code: 42. DB::Exception: Received from localhost:9000. DB::Exception: Number of arguments for function plus doesn't match: passed 3, should be 2. 
```

1033
In this example, `COLUMNS('a')` returns two columns: `aa` and `ab`. `COLUMNS('c')` returns the `bc` column. The `+` operator can't apply to 3 arguments, so ClickHouse throws an exception with the relevant message.
1034

1035
Columns that matched the `COLUMNS` expression can have different data types. If `COLUMNS` doesn't match any columns and is the only expression in `SELECT`, ClickHouse throws an exception.
1036 1037


B
BayoNet 已提交
1038
### DISTINCT Clause {#select-distinct}
1039

I
Ivan Blinkov 已提交
1040 1041
If DISTINCT is specified, only a single row will remain out of all the sets of fully matching rows in the result.
The result will be the same as if GROUP BY were specified across all the fields specified in SELECT without aggregate functions. But there are several differences from GROUP BY:
1042

I
Ivan Blinkov 已提交
1043 1044
- DISTINCT can be applied together with GROUP BY.
- When ORDER BY is omitted and LIMIT is defined, the query stops running immediately after the required number of different rows has been read.
1045 1046
- Data blocks are output as they are processed, without waiting for the entire query to finish running.

I
Ivan Blinkov 已提交
1047
DISTINCT is not supported if SELECT has at least one array column.
1048

1049
`DISTINCT` works with [NULL](syntax.md) as if `NULL` were a specific value, and `NULL=NULL`. In other words, in the `DISTINCT` results, different combinations with `NULL` only occur once.
1050

1051 1052
ClickHouse supports using the `DISTINCT` and `ORDER BY` clauses for different columns in one query. The `DISTINCT` clause is executed before the `ORDER BY` clause.

1053
Example table:
1054 1055 1056 1057 1058 1059 1060 1061 1062 1063

```text
┌─a─┬─b─┐
│ 2 │ 1 │
│ 1 │ 2 │
│ 3 │ 3 │
│ 2 │ 4 │
└───┴───┘
```

1064
When selecting data with the `SELECT DISTINCT a FROM t1 ORDER BY b ASC` query, we get the following result:
1065 1066 1067 1068 1069 1070 1071 1072 1073

```text
┌─a─┐
│ 2 │
│ 1 │
│ 3 │
└───┘
```

1074
If we change the sorting direction `SELECT DISTINCT a FROM t1 ORDER BY b DESC`, we get the following result:
1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085

```text
┌─a─┐
│ 3 │
│ 1 │
│ 2 │
└───┘
```

Row `2, 4` was cut before sorting.

1086
Take this implementation specificity into account when programming queries.
1087

1088
### LIMIT Clause
1089

O
ogorbacheva 已提交
1090
`LIMIT m` allows you to select the first `m` rows from the result.
O
ogorbacheva 已提交
1091 1092

`LIMIT n, m` allows you to select the first `m` rows from the result after skipping the first `n` rows. The `LIMIT m OFFSET n` syntax is also supported.
1093

O
ogorbacheva 已提交
1094
`n` and `m` must be non-negative integers.
1095

O
ogorbacheva 已提交
1096
If there isn't an `ORDER BY` clause that explicitly sorts results, the result may be arbitrary and nondeterministic.
1097

I
Ivan Blinkov 已提交
1098
### UNION ALL Clause
1099

I
Ivan Blinkov 已提交
1100
You can use UNION ALL to combine any number of queries. Example:
1101

1102
```sql
1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114
SELECT CounterID, 1 AS table, toInt64(count()) AS c
    FROM test.hits
    GROUP BY CounterID

UNION ALL

SELECT CounterID, 2 AS table, sum(Sign) AS c
    FROM test.visits
    GROUP BY CounterID
    HAVING c > 0
```

I
Ivan Blinkov 已提交
1115
Only UNION ALL is supported. The regular UNION (UNION DISTINCT) is not supported. If you need UNION DISTINCT, you can write SELECT DISTINCT from a subquery containing UNION ALL.
1116

I
Ivan Blinkov 已提交
1117
Queries that are parts of UNION ALL can be run simultaneously, and their results can be mixed together.
1118

1119
The structure of results (the number and type of columns) must match for the queries. But the column names can differ. In this case, the column names for the final result will be taken from the first query. Type casting is performed for unions. For example, if two queries being combined have the same field with non-`Nullable` and `Nullable` types from a compatible type, the resulting `UNION ALL` has a `Nullable` type field.
1120

I
Ivan Blinkov 已提交
1121
Queries that are parts of UNION ALL can't be enclosed in brackets. ORDER BY and LIMIT are applied to separate queries, not to the final result. If you need to apply a conversion to the final result, you can put all the queries with UNION ALL in a subquery in the FROM clause.
1122

1123
### INTO OUTFILE Clause {#into-outfile-clause}
1124 1125 1126 1127 1128 1129 1130

Add the `INTO OUTFILE filename` clause (where filename is a string literal) to redirect query output to the specified file.
In contrast to MySQL, the file is created on the client side. The query will fail if a file with the same filename already exists.
This functionality is available in the command-line client and clickhouse-local (a query sent via HTTP interface will fail).

The default output format is TabSeparated (the same as in the command-line client batch mode).

1131
### FORMAT Clause {#format-clause}
1132 1133 1134 1135 1136 1137 1138 1139

Specify 'FORMAT format' to get data in any specified format.
You can use this for convenience, or for creating dumps.
For more information, see the section "Formats".
If the FORMAT clause is omitted, the default format is used, which depends on both the settings and the interface used for accessing the DB. For the HTTP interface and the command-line client in batch mode, the default format is TabSeparated. For the command-line client in interactive mode, the default format is PrettyCompact (it has attractive and compact tables).

When using the command-line client, data is passed to the client in an internal efficient format. The client independently interprets the FORMAT clause of the query and formats the data itself (thus relieving the network and the server from the load).

1140

1141
### IN Operators {#select-in-operators}
1142 1143 1144 1145 1146 1147 1148

The `IN`, `NOT IN`, `GLOBAL IN`, and `GLOBAL NOT IN` operators are covered separately, since their functionality is quite rich.

The left side of the operator is either a single column or a tuple.

Examples:

1149
```sql
1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166
SELECT UserID IN (123, 456) FROM ...
SELECT (CounterID, UserID) IN ((34, 123), (101500, 456)) FROM ...
```

If the left side is a single column that is in the index, and the right side is a set of constants, the system uses the index for processing the query.

Don't list too many values explicitly (i.e. millions). If a data set is large, put it in a temporary table (for example, see the section "External data for query processing"), then use a subquery.

The right side of the operator can be a set of constant expressions, a set of tuples with constant expressions (shown in the examples above), or the name of a database table or SELECT subquery in brackets.

If the right side of the operator is the name of a table (for example, `UserID IN users`), this is equivalent to the subquery `UserID IN (SELECT * FROM users)`. Use this when working with external data that is sent along with the query. For example, the query can be sent together with a set of user IDs loaded to the 'users' temporary table, which should be filtered.

If the right side of the operator is a table name that has the Set engine (a prepared data set that is always in RAM), the data set will not be created over again for each query.

The subquery may specify more than one column for filtering tuples.
Example:

1167
```sql
1168 1169 1170 1171 1172 1173 1174 1175
SELECT (CounterID, UserID) IN (SELECT CounterID, UserID FROM ...) FROM ...
```

The columns to the left and right of the IN operator should have the same type.

The IN operator and subquery may occur in any part of the query, including in aggregate functions and lambda functions.
Example:

1176
```sql
1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189
SELECT
    EventDate,
    avg(UserID IN
    (
        SELECT UserID
        FROM test.hits
        WHERE EventDate = toDate('2014-03-17')
    )) AS ratio
FROM test.hits
GROUP BY EventDate
ORDER BY EventDate ASC
```

1190
```text
1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204
┌──EventDate─┬────ratio─┐
│ 2014-03-17 │        1 │
│ 2014-03-18 │ 0.807696 │
│ 2014-03-19 │ 0.755406 │
│ 2014-03-20 │ 0.723218 │
│ 2014-03-21 │ 0.697021 │
│ 2014-03-22 │ 0.647851 │
│ 2014-03-23 │ 0.648416 │
└────────────┴──────────┘
```

For each day after March 17th, count the percentage of pageviews made by users who visited the site on March 17th.
A subquery in the IN clause is always run just one time on a single server. There are no dependent subqueries.

1205 1206
#### NULL processing

1207
During request processing, the IN operator assumes that the result of an operation with [NULL](syntax.md) is always equal to `0`, regardless of whether `NULL` is on the right or left side of the operator. `NULL` values are not included in any dataset, do not correspond to each other and cannot be compared.
1208

B
BayoNet 已提交
1209
Here is an example with the `t_null` table:
1210

1211
```text
B
BayoNet 已提交
1212 1213 1214 1215 1216
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 2 │    3 │
└───┴──────┘
```
1217

B
BayoNet 已提交
1218
Running the query `SELECT x FROM t_null WHERE y IN (NULL,3)` gives you the following result:
1219

1220
```text
B
BayoNet 已提交
1221 1222 1223 1224
┌─x─┐
│ 2 │
└───┘
```
1225

B
BayoNet 已提交
1226
You can see that the row in which `y = NULL` is thrown out of the query results. This is because ClickHouse can't decide whether `NULL` is included in the `(NULL,3)` set, returns `0` as the result of the operation, and `SELECT` excludes this row from the final output.
1227

1228
```sql
B
BayoNet 已提交
1229 1230
SELECT y IN (NULL, 3)
FROM t_null
1231 1232
```
```text
B
BayoNet 已提交
1233 1234 1235 1236 1237
┌─in(y, tuple(NULL, 3))─┐
│                     0 │
│                     1 │
└───────────────────────┘
```
1238

1239

1240
#### Distributed Subqueries {#select-distributed-subqueries}
1241

1242
There are two options for IN-s with subqueries (similar to JOINs): normal `IN` / `JOIN` and `GLOBAL IN` / `GLOBAL JOIN`. They differ in how they are run for distributed query processing.
1243

I
Ivan Blinkov 已提交
1244
!!! attention
I
Ivan Blinkov 已提交
1245
    Remember that the algorithms described below may work differently depending on the [settings](../operations/settings/settings.md) `distributed_product_mode` setting.
1246 1247 1248

When using the regular IN, the query is sent to remote servers, and each of them runs the subqueries in the `IN` or `JOIN` clause.

1249
When using `GLOBAL IN` / `GLOBAL JOINs`, first all the subqueries are run for `GLOBAL IN` / `GLOBAL JOINs`, and the results are collected in temporary tables. Then the temporary tables are sent to each remote server, where the queries are run using this temporary data.
1250 1251 1252

For a non-distributed query, use the regular `IN` / `JOIN`.

1253
Be careful when using subqueries in the `IN` / `JOIN` clauses for distributed query processing.
1254 1255 1256 1257 1258 1259 1260

Let's look at some examples. Assume that each server in the cluster has a normal **local_table**. Each server also has a **distributed_table** table with the **Distributed** type, which looks at all the servers in the cluster.

For a query to the **distributed_table**, the query will be sent to all the remote servers and run on them using the **local_table**.

For example, the query

1261
```sql
1262 1263 1264 1265 1266
SELECT uniq(UserID) FROM distributed_table
```

will be sent to all remote servers as

1267
```sql
1268 1269 1270 1271 1272 1273 1274
SELECT uniq(UserID) FROM local_table
```

and run on each of them in parallel, until it reaches the stage where intermediate results can be combined. Then the intermediate results will be returned to the requestor server and merged on it, and the final result will be sent to the client.

Now let's examine a query with IN:

1275
```sql
1276 1277 1278 1279 1280 1281 1282
SELECT uniq(UserID) FROM distributed_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM local_table WHERE CounterID = 34)
```

- Calculation of the intersection of audiences of two sites.

This query will be sent to all remote servers as

1283
```sql
1284 1285 1286 1287 1288 1289 1290 1291 1292
SELECT uniq(UserID) FROM local_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM local_table WHERE CounterID = 34)
```

In other words, the data set in the IN clause will be collected on each server independently, only across the data that is stored locally on each of the servers.

This will work correctly and optimally if you are prepared for this case and have spread data across the cluster servers such that the data for a single UserID resides entirely on a single server. In this case, all the necessary data will be available locally on each server. Otherwise, the result will be inaccurate. We refer to this variation of the query as "local IN".

To correct how the query works when data is spread randomly across the cluster servers, you could specify **distributed_table** inside a subquery. The query would look like this:

1293
```sql
1294 1295 1296 1297 1298
SELECT uniq(UserID) FROM distributed_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM distributed_table WHERE CounterID = 34)
```

This query will be sent to all remote servers as

1299
```sql
1300 1301 1302 1303 1304
SELECT uniq(UserID) FROM local_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM distributed_table WHERE CounterID = 34)
```

The subquery will begin running on each remote server. Since the subquery uses a distributed table, the subquery that is on each remote server will be resent to every remote server as

1305
```sql
1306 1307 1308 1309 1310 1311 1312
SELECT UserID FROM local_table WHERE CounterID = 34
```

For example, if you have a cluster of 100 servers, executing the entire query will require 10,000 elementary requests, which is generally considered unacceptable.

In such cases, you should always use GLOBAL IN instead of IN. Let's look at how it works for the query

1313
```sql
1314 1315 1316 1317 1318
SELECT uniq(UserID) FROM distributed_table WHERE CounterID = 101500 AND UserID GLOBAL IN (SELECT UserID FROM distributed_table WHERE CounterID = 34)
```

The requestor server will run the subquery

1319
```sql
1320 1321 1322 1323 1324
SELECT UserID FROM distributed_table WHERE CounterID = 34
```

and the result will be put in a temporary table in RAM. Then the request will be sent to each remote server as

1325
```sql
1326 1327 1328
SELECT uniq(UserID) FROM local_table WHERE CounterID = 101500 AND UserID GLOBAL IN _data1
```

B
BayoNet 已提交
1329
and the temporary table `_data1` will be sent to every remote server with the query (the name of the temporary table is implementation-defined).
1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340

This is more optimal than using the normal IN. However, keep the following points in mind:

1. When creating a temporary table, data is not made unique. To reduce the volume of data transmitted over the network, specify DISTINCT in the subquery. (You don't need to do this for a normal IN.)
2. The temporary table will be sent to all the remote servers. Transmission does not account for network topology. For example, if 10 remote servers reside in a datacenter that is very remote in relation to the requestor server, the data will be sent 10 times over the channel to the remote datacenter. Try to avoid large data sets when using GLOBAL IN.
3. When transmitting data to remote servers, restrictions on network bandwidth are not configurable. You might overload the network.
4. Try to distribute data across servers so that you don't need to use GLOBAL IN on a regular basis.
5. If you need to use GLOBAL IN often, plan the location of the ClickHouse cluster so that a single group of replicas resides in no more than one data center with a fast network between them, so that a query can be processed entirely within a single data center.

It also makes sense to specify a local table in the `GLOBAL IN` clause, in case this local table is only available on the requestor server and you want to use data from it on remote servers.

1341
### Extreme Values
1342 1343 1344

In addition to results, you can also get minimum and maximum values for the results columns. To do this, set the **extremes** setting to 1. Minimums and maximums are calculated for numeric types, dates, and dates with times. For other columns, the default values are output.

1345
An extra two rows are calculated – the minimums and maximums, respectively. These extra two rows are output in `JSON*`, `TabSeparated*`, and `Pretty*` [formats](../interfaces/formats.md), separate from the other rows. They are not output for other formats.
1346

1347
In `JSON*` formats, the extreme values are output in a separate 'extremes' field. In `TabSeparated*` formats, the row comes after the main result, and after 'totals' if present. It is preceded by an empty row (after the other data). In `Pretty*` formats, the row is output as a separate table after the main result, and after `totals` if present.
1348

B
BayoNet 已提交
1349
Extreme values are calculated for rows before `LIMIT`, but after `LIMIT BY`. However, when using `LIMIT offset, size`, the rows before `offset` are included in `extremes`. In stream requests, the result may also include a small number of rows that passed through `LIMIT`.
1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366

### Notes

The `GROUP BY` and `ORDER BY` clauses do not support positional arguments. This contradicts MySQL, but conforms to standard SQL.
For example, `GROUP BY 1, 2` will be interpreted as grouping by constants (i.e. aggregation of all rows into one).

You can use synonyms (`AS` aliases) in any part of a query.

You can put an asterisk in any part of a query instead of an expression. When the query is analyzed, the asterisk is expanded to a list of all table columns (excluding the `MATERIALIZED` and `ALIAS` columns). There are only a few cases when using an asterisk is justified:

- When creating a table dump.
- For tables containing just a few columns, such as system tables.
- For getting information about what columns are in a table. In this case, set `LIMIT 1`. But it is better to use the `DESC TABLE` query.
- When there is strong filtration on a small number of columns using `PREWHERE`.
- In subqueries (since columns that aren't needed for the external query are excluded from subqueries).

In all other cases, we don't recommend using the asterisk, since it only gives you the drawbacks of a columnar DBMS instead of the advantages. In other words using the asterisk is not recommended.
I
Ivan Blinkov 已提交
1367

I
Ivan Blinkov 已提交
1368
[Original article](https://clickhouse.tech/docs/en/query_language/select/) <!--hide-->