Unverified commit ce5d9264 authored by alexey-milovidov, committed by GitHub

Merge pull request #2117 from yandex/revert-2110-master

Revert "English translation is updated"
File mode changed from 100755 to 100644
......@@ -8,4 +8,3 @@ ClickHouse also supports:
- [Parametric aggregate functions](parametric_functions.md#aggregate_functions_parametric), which accept other parameters in addition to columns.
- [Combinators](combinators.md#aggregate_functions_combinators), which change the behavior of aggregate functions.
File mode changed from 100755 to 100644
......@@ -19,7 +19,7 @@ In some cases, you can rely on the order of execution. This applies to cases whe
When a `SELECT` query has the `GROUP BY` clause or at least one aggregate function, ClickHouse (in contrast to MySQL) requires that all expressions in the `SELECT`, `HAVING`, and `ORDER BY` clauses be calculated from keys or from aggregate functions. In other words, each column selected from the table must be used either in keys or inside aggregate functions. To get behavior like in MySQL, you can put the other columns in the `any` aggregate function.
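A minimal sketch of that `any` workaround (the `pages` table and its columns are hypothetical):

```sql
-- title is not a grouping key, so it is wrapped in an aggregate function.
SELECT
    domain,
    any(title) AS title,
    count() AS c
FROM pages
GROUP BY domain
```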
## anyHeavy
## anyHeavy(x)
Selects a frequently occurring value using the [heavy hitters](http://www.cs.umd.edu/~samir/498/karp.pdf) algorithm. If there is a value that occurs more than in half the cases in each of the query's execution threads, this value is returned. Normally, the result is nondeterministic.
......@@ -39,6 +39,7 @@ Take the [OnTime](../getting_started/example_datasets/ontime.md#example_datasets
SELECT anyHeavy(AirlineID) AS res
FROM ontime
```
```
┌───res─┐
│ 19690 │
......@@ -124,11 +125,11 @@ The result is always Float64.
Calculates the approximate number of different values of the argument. Works for numbers, strings, dates, date-with-time, and for multiple arguments and tuple arguments.
Uses an adaptive sampling algorithm: for the calculation state, it uses a sample of element hash values with a size up to 65536.
This algorithm is also very accurate for data sets with low cardinality (up to 65536) and very efficient on CPU (when computing not too many of these functions, using `uniq` is almost as fast as using other aggregate functions).
This algorithm is also very accurate for data sets with small cardinality (up to 65536) and very efficient on CPU (when computing not too many of these functions, using `uniq` is almost as fast as using other aggregate functions).
The result is determinate (it doesn't depend on the order of query processing).
This function provides excellent accuracy even for data sets with extremely high cardinality (over 10 billion elements). It is recommended for default use.
This function provides excellent accuracy even for data sets with huge cardinality (10B+ elements) and is recommended for use by default.
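As a quick illustration (reusing the `test.hits` table that appears in other examples in these docs; the `UserID` column is an assumption):

```sql
-- Approximate number of distinct users. Exact for up to ~65536 distinct values,
-- then estimated from the adaptive sample of hash values.
SELECT uniq(UserID) AS u
FROM test.hits
```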
## uniqCombined(x)
......@@ -138,16 +139,16 @@ A combination of three algorithms is used: array, hash table and [HyperLogLog](h
The result is determinate (it doesn't depend on the order of query processing).
The `uniqCombined` function is a good default choice for calculating the number of different values, but keep in mind that the estimation error will increase for high-cardinality data sets (200M+ elements), and the function will return very inaccurate results for data sets with extremely high cardinality (1B+ elements).
The `uniqCombined` function is a good default choice for calculating the number of different values, but the following should be considered: for data sets with large cardinality (200M+) the error of the estimate will only grow, and for data sets with huge cardinality (1B+ elements) it returns a result with high inaccuracy.
## uniqHLL12(x)
Uses the [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) algorithm to approximate the number of different values of the argument.
2^12 5-bit cells are used. The size of the state is slightly more than 2.5 KB. The result is not very accurate (up to ~10% error) for small data sets (<10K elements). However, the result is fairly accurate for high-cardinality data sets (10K-100M), with a maximum error of ~1.6%. Starting from 100M, the estimation error increases, and the function will return very inaccurate results for data sets with extremely high cardinality (1B+ elements).
2^12 5-bit cells are used. The size of the state is slightly more than 2.5 KB. The result is not very accurate (error up to ~10%) for data sets of small cardinality (<10K elements), but for data sets with large cardinality (10K-100M) the result is quite accurate (error up to ~1.6%); after that the error of the estimate only grows, and for data sets with huge cardinality (1B+ elements) it returns a result with high inaccuracy.
The result is determinate (it doesn't depend on the order of query processing).
We don't recommend using this function. In most cases, use the `uniq` or `uniqCombined` function.
This function is not recommended for use; in most cases, use the `uniq` or `uniqCombined` function.
## uniqExact(x)
......@@ -169,7 +170,7 @@ In some cases, you can still rely on the order of execution. This applies to cas
<a name="agg_functions_groupArrayInsertAt"></a>
## groupArrayInsertAt
## groupArrayInsertAt(x)
Inserts a value into the array in the specified position.
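A minimal sketch of the call shape (assuming the signature `groupArrayInsertAt(x, pos)` with a zero-based position; details are in the truncated hunk below):

```sql
-- Builds ['0','1','2','3','4']: each string is inserted at the position equal to its value.
SELECT groupArrayInsertAt(toString(number), number) AS res
FROM system.numbers
LIMIT 5
```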
......@@ -235,8 +236,8 @@ For its purpose (calculating quantiles of page loading times), using this functi
## quantileTimingWeighted(level)(x, weight)
Differs from the `quantileTiming` function in that it has a second argument, "weights". Weight is a non-negative integer.
The result is calculated as if the `x` value were passed `weight` number of times to the `quantileTiming` function.
Differs from the 'quantileTiming' function in that it has a second argument, "weights". Weight is a non-negative integer.
The result is calculated as if the 'x' value were passed 'weight' number of times to the 'quantileTiming' function.
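A hedged sketch of the call (the `page_views` table and its `load_time_ms` and `weight` columns are hypothetical):

```sql
-- 90th-percentile load time, where each row counts `weight` times.
SELECT quantileTimingWeighted(0.9)(load_time_ms, weight) AS p90
FROM page_views
```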
## quantileExact(level)(x)
......@@ -256,7 +257,7 @@ The performance of the function is lower than for ` quantile`, ` quantileTiming`
The result depends on the order of running the query, and is nondeterministic.
## median
## median(x)
All the quantile functions have corresponding median functions: `median`, `medianDeterministic`, `medianTiming`, `medianTimingWeighted`, `medianExact`, `medianExactWeighted`, `medianTDigest`. They are synonyms and their behavior is identical.
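For instance, the following two expressions compute the same value (the table and column are hypothetical):

```sql
SELECT
    median(load_time_ms) AS m1,
    quantile(0.5)(load_time_ms) AS m2
FROM page_views
```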
......@@ -274,7 +275,7 @@ Returns `Float64`. When `n <= 1`, returns `+∞`.
## varPop(x)
Calculates the amount `Σ((x - x̅)^2) / (n - 1)`, where `n` is the sample size and `x̅` is the average value of `x`.
Calculates the amount `Σ((x - x̅)^2) / n`, where `n` is the sample size and `x̅` is the average value of `x`.
In other words, dispersion for a set of values. Returns `Float64`.
......@@ -286,33 +287,30 @@ The result is equal to the square root of `varSamp(x)`.
The result is equal to the square root of `varPop(x)`.
## topK
## topK(N)(column)
Returns an array of the most frequent values in the specified column. The resulting array is sorted in descending order of frequency of values (not by the values themselves).
Implements the [Filtered Space-Saving](http://www.l2f.inesc-id.pt/~fmmb/wiki/uploads/Work/misnis.ref0a.pdf) algorithm for analyzing TopK, based on the reduce-and-combine algorithm from [Parallel Space Saving](https://arxiv.org/pdf/1401.0702.pdf).
```
topK(N)(column)
```
This function doesn't provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren't the most frequent values.
We recommend using the `N < 10` value; performance is reduced with large `N` values. Maximum value of `N = 65536`.
**Arguments**
- 'N' is the number of values.
- 'N' – The number of values.
- 'x' – The column.
**Example**
Take the [OnTime](../getting_started/example_datasets/ontime.md#example_datasets-ontime) data set and select the three most frequently occurring values in the `AirlineID` column.
Take the [OnTime](../getting_started/example_datasets/ontime.md#example_datasets-ontime) data set and select the three most frequently occurring values in the `AirlineID` column.
```sql
SELECT topK(3)(AirlineID) AS res
FROM ontime
```
```
┌─res─────────────────┐
│ [19393,19790,19805] │
......
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
# Date
A date. Stored in two bytes as the number of days since 1970-01-01 (unsigned). Allows storing values from just after the beginning of the Unix Epoch to the upper threshold defined by a constant at the compilation stage (currently, this is until the year 2106, but the final fully-supported year is 2105).
Date. Stored in two bytes as the number of days since 1970-01-01 (unsigned). Allows storing values from just after the beginning of the Unix Epoch to the upper threshold defined by a constant at the compilation stage (currently, this is until the year 2038, but it may be expanded to 2106).
The minimum value is output as 0000-00-00.
The date is stored without the time zone.
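A small sketch of that representation (assuming the usual Date-to-integer conversion, which exposes the stored day number):

```sql
-- 1970-01-02 is stored as day 1 since the epoch.
SELECT toDate('1970-01-02') AS d, toUInt16(toDate('1970-01-02')) AS day_number
```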
......
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
......@@ -4,8 +4,8 @@
Types are equivalent to types of C:
- `Float32` - `float`
- `Float64` - `double`
- `Float32` - `float`;
- `Float64` - `double`.
We recommend that you store data in integer form whenever possible. For example, convert fixed precision numbers to integer values, such as monetary amounts or page load times in milliseconds.
......@@ -24,7 +24,9 @@ SELECT 1 - 0.9
```
- The result of the calculation depends on the calculation method (the processor type and architecture of the computer system).
- Floating-point calculations might result in numbers such as infinity (`Inf`) and "not-a-number" (`NaN`). This should be taken into account when processing the results of calculations.
- When reading floating point numbers from a string, the result might not be the nearest machine-representable number.
## NaN and Inf
......@@ -42,7 +44,6 @@ SELECT 0.5 / 0
│ inf │
└────────────────┘
```
- `-Inf` – Negative infinity.
```sql
......@@ -54,7 +55,6 @@ SELECT -0.5 / 0
│ -inf │
└─────────────────┘
```
- `NaN` – Not a number.
```
......@@ -67,5 +67,5 @@ SELECT 0 / 0
└──────────────┘
```
See the rules for `NaN` sorting in the section [ORDER BY clause](../query_language/queries.md#query_language-queries-order_by).
See the rules for `NaN` sorting in the section [ORDER BY clause](../query_language/queries.md#query_language-queries-order_by).
......@@ -2,7 +2,6 @@
# Data types
ClickHouse can store various types of data in table cells.
This section describes the supported data types and special considerations when using and/or implementing them, if any.
ClickHouse table fields can contain data of different types.
This topic describes the supported data types and the specifics of using and/or implementing them, if any exist.
\ No newline at end of file
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
......@@ -4,7 +4,7 @@
1. The following are recommendations, not requirements.
2. If you are editing code, it makes sense to follow the formatting of the existing code.
3. Code style is needed for consistency. Consistency makes it easier to read the code, and it also makes it easier to search the code.
3. Code style is needed for consistency. Consistency makes it easier to read the code, and it also makes it easier to search the code.
4. Many of the rules do not have logical reasons; they are dictated by established practices.
## Formatting
......@@ -93,25 +93,25 @@
14. In classes and structures, public, private, and protected are written on the same level as the class/struct, but all other internal elements should be deeper.
```cpp
template <typename T>
class MultiVersion
{
public:
    /// Version of object for usage. shared_ptr manage lifetime of version.
    using Version = std::shared_ptr<const T>;
    ...
}
template <typename T, typename Ptr = std::shared_ptr<T>>
class MultiVersion
{
public:
    /// The specific version of the object to use.
    using Version = Ptr;
    ...
}
```
15. If the same namespace is used for the entire file, and there isn't anything else significant, an offset is not necessary inside namespace.
16. If the block for if, for, while... expressions consists of a single statement, you don't need to use curly brackets. Place the statement on a separate line, instead. The same is true for a nested if, for, while... statement. But if the inner statement contains curly brackets or else, the external block should be written in curly brackets.
```cpp
/// Finish write.
for (auto & stream : streams)
stream.second->finalize();
```
```cpp
/// Finish write.
for (auto & stream : streams)
stream.second->finalize();
```
17. There should not be any spaces at the ends of lines.
......@@ -218,11 +218,11 @@ for (auto & stream : streams)
*/
void executeQuery(
ReadBuffer & istr, /// Where to read the query from (and data for INSERT, if applicable)
WriteBuffer & ostr, /// Where to write the result
Context & context, /// DB, tables, data types, engines, functions, aggregate functions...
BlockInputStreamPtr & query_plan, /// A description of query processing can be included here
QueryProcessingStage::Enum stage = QueryProcessingStage::Complete /// The last stage to process the SELECT query to
)
WriteBuffer & ostr, /// Where to write the result
Context & context, /// DB, tables, data types, engines, functions, aggregate functions...
BlockInputStreamPtr & query_plan, /// A description of query processing can be included here
QueryProcessingStage::Enum stage = QueryProcessingStage::Complete /// The last stage to process the SELECT query to
)
```
4. Comments should be written in English only.
......@@ -252,7 +252,7 @@ for (auto & stream : streams)
*/
```
(the example is borrowed from the resource [http://home.tamk.fi/~jaalto/course/coding-style/doc/unmaintainable-code/](http://home.tamk.fi/~jaalto/course/coding-style/doc/unmaintainable-code/))
(Example taken from: [http://home.tamk.fi/~jaalto/course/coding-style/doc/unmaintainable-code/](http://home.tamk.fi/~jaalto/course/coding-style/doc/unmaintainable-code/))
7. Do not write garbage comments (author, creation date ..) at the beginning of each file.
......@@ -497,15 +497,7 @@ This is not recommended, but it is allowed.
You can create a separate code block inside a single function in order to make certain variables local, so that the destructors are called when exiting the block.
```cpp
Block block = data.in->read();
{
std::lock_guard<std::mutex> lock(mutex);
data.ready = true;
data.block = block;
}
ready_any.set();
Block block = data.in->read();
{
    std::lock_guard<std::mutex> lock(mutex);
    data.ready = true;
    data.block = block;
}
ready_any.set();
```
7. Multithreading.
......@@ -568,12 +560,13 @@ This is not recommended, but it is allowed.
```cpp
using AggregateFunctionPtr = std::shared_ptr<IAggregateFunction>;
/** Creates an aggregate function by name. */
/** Creates an aggregate function by name.
*/
class AggregateFunctionFactory
{
public:
AggregateFunctionFactory();
AggregateFunctionFactory();
AggregateFunctionPtr get(const String & name, const DataTypes & argument_types) const;
```
......@@ -598,10 +591,10 @@ This is not recommended, but it is allowed.
If later you’ll need to delay initialization, you can add a default constructor that will create an invalid object. Or, for a small number of objects, you can use shared_ptr/unique_ptr.
```cpp
Loader(DB::Connection * connection_, const std::string & query, size_t max_block_size_);
/// For delayed initialization
Loader() {}
Loader(DB::Connection * connection_, const std::string & query, size_t max_block_size_);
/// For delayed initialization
Loader() {}
```
17. Virtual functions.
......
......@@ -21,11 +21,12 @@ The dictionary config file has the following format:
<!--Optional element. File name with substitutions-->
<include_from>/etc/metrika.xml</include_from>
<dictionary>
<!-- Dictionary configuration -->
<!-- Dictionary configuration -->
</dictionary>
...
<dictionary>
......
......@@ -27,8 +27,7 @@ The dictionary configuration has the following structure:
```
- name – The identifier that can be used to access the dictionary. Use the characters `[a-zA-Z0-9_\-]`.
- [source](external_dicts_dict_sources.html/#dicts-external_dicts_dict_sources) — Source of the dictionary.
- [layout](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout) — Dictionary layout in memory.
- [source](external_dicts_dict_sources.html/#dicts-external_dicts_dict_sources) — Structure of the dictionary. A key and attributes that can be retrieved by this key.
- [lifetime](external_dicts_dict_lifetime.md#dicts-external_dicts_dict_lifetime) — Frequency of dictionary updates.
- [source](external_dicts_dict_sources.md#dicts-external_dicts_dict_sources) – Source of the dictionary.
- [layout](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout) – Location of the dictionary in memory.
- [structure](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure) – Structure of the dictionary. A key and attributes that can be retrieved by this key.
- [lifetime](external_dicts_dict_lifetime.md#dicts-external_dicts_dict_lifetime) – How frequently to update dictionaries.
......@@ -2,11 +2,11 @@
# Storing dictionaries in memory
There are a [variety of ways](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout-manner) to store dictionaries in memory.
There are [many different ways](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout-manner) to store dictionaries in memory.
We recommend [flat](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout-flat), [hashed](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout-hashed) and [complex_key_hashed](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout-complex_key_hashed), which provide optimal processing speed.
We recommend [flat](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout-flat), [hashed](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout-hashed), and [complex_key_hashed](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout-complex_key_hashed), which provide optimal processing speed.
Caching is not recommended because of potentially poor performance and difficulties in selecting optimal parameters. Read more in the section " [cache](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout-cache)".
Caching is not recommended because of potentially poor performance and difficulties in selecting optimal parameters. Read more about this in the "[cache](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout-cache)" section.
There are several ways to improve dictionary performance:
......@@ -46,6 +46,7 @@ The configuration looks like this:
- [range_hashed](#dicts-external_dicts_dict_layout-range_hashed)
- [complex_key_hashed](#dicts-external_dicts_dict_layout-complex_key_hashed)
- [complex_key_cache](#dicts-external_dicts_dict_layout-complex_key_cache)
- [ip_trie](#dicts-external_dicts_dict_layout-ip_trie)
<a name="dicts-external_dicts_dict_layout-flat"></a>
......@@ -87,7 +88,7 @@ Configuration example:
### complex_key_hashed
This type is for use with composite [keys](external_dicts_dict_structure.md/#dicts-external_dicts_dict_structure). Similar to `hashed`.
This type of storage is designed for use with compound [keys](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure). It is similar to `hashed`.
Configuration example:
......@@ -108,18 +109,18 @@ This storage method works the same way as hashed and allows using date/time rang
Example: The table contains discounts for each advertiser in the format:
```
+---------------+---------------------+-------------------+--------+
| advertiser id | discount start date | discount end date | amount |
+===============+=====================+===================+========+
| 123 | 2015-01-01 | 2015-01-15 | 0.15 |
+---------------+---------------------+-------------------+--------+
| 123 | 2015-01-16 | 2015-01-31 | 0.25 |
+---------------+---------------------+-------------------+--------+
| 456 | 2015-01-01 | 2015-01-15 | 0.05 |
+---------------+---------------------+-------------------+--------+
+---------------+---------------------+-------------------+--------+
| advertiser id | discount start date | discount end date | amount |
+===============+=====================+===================+========+
| 123 | 2015-01-01 | 2015-01-15 | 0.15 |
+---------------+---------------------+-------------------+--------+
| 123 | 2015-01-16 | 2015-01-31 | 0.25 |
+---------------+---------------------+-------------------+--------+
| 456 | 2015-01-01 | 2015-01-15 | 0.05 |
+---------------+---------------------+-------------------+--------+
```
To use a sample for date ranges, define the `range_min` and `range_max` elements in the [structure](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure).
To use a sample for date ranges, define `range_min` and `range_max` in [structure](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure).
Example:
......@@ -196,15 +197,15 @@ This is the least effective of all the ways to store dictionaries. The speed of
To improve cache performance, use a subquery with `LIMIT`, and call the function with the dictionary externally.
Supported [sources](external_dicts_dict_sources.md#dicts-external_dicts_dict_sources): MySQL, ClickHouse, executable, HTTP.
Supported [sources](external_dicts_dict_sources.md#dicts-external_dicts_dict_sources): MySQL, ClickHouse, executable, HTTP.
Example of settings:
```xml
<layout>
<cache>
<!-- The size of the cache, in number of cells. Rounded up to a power of two. -->
<size_in_cells>1000000000</size_in_cells>
<!-- The size of the cache, in number of cells. Rounded up to a power of two. -->
<size_in_cells>1000000000</size_in_cells>
</cache>
</layout>
```
......@@ -226,4 +227,66 @@ Do not use ClickHouse as a source, because it is slow to process queries with ra
### complex_key_cache
This type of storage is for use with composite [keys](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure). Similar to `cache`.
This type of storage is designed for use with compound [keys](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure). Similar to `cache`.
<a name="dicts-external_dicts_dict_layout-ip_trie"></a>
### ip_trie
The table stores IP prefixes as keys, which makes it possible to map IP addresses to metadata such as an ASN or threat score.
Example: the table contains prefixes matched to AS numbers and country codes:
```
+-----------------+-------+--------+
| prefix | asn | cca2 |
+=================+=======+========+
| 202.79.32.0/20 | 17501 | NP |
+-----------------+-------+--------+
| 2620:0:870::/48 | 3856 | US |
+-----------------+-------+--------+
| 2a02:6b8:1::/48 | 13238 | RU |
+-----------------+-------+--------+
| 2001:db8::/32 | 65536 | ZZ |
+-----------------+-------+--------+
```
When using such a layout, the structure should have the "key" element.
Example:
```xml
<structure>
<key>
<attribute>
<name>prefix</name>
<type>String</type>
</attribute>
</key>
<attribute>
<name>asn</name>
<type>UInt32</type>
<null_value />
</attribute>
<attribute>
<name>cca2</name>
<type>String</type>
<null_value>??</null_value>
</attribute>
...
```
The key must have only one attribute of type `String`, containing a valid IP prefix. Other types are not yet supported.
For querying, the same functions (`dictGetT` with a tuple) as for dictionaries with composite keys must be used:

    dictGetT('dict_name', 'attr_name', tuple(ip))

The function accepts either UInt32 for an IPv4 address or FixedString(16) for an IPv6 address in wire format:

    dictGetString('prefix', 'asn', tuple(IPv6StringToNum('2001:db8::1')))

No other types are supported. The function returns the attribute for the prefix that matches the given IP address. If there are overlapping prefixes, the most specific one is returned.
The data is currently stored in a bitwise trie and must fit entirely in memory.
......@@ -36,13 +36,13 @@ Example of settings:
When upgrading the dictionaries, the ClickHouse server applies different logic depending on the type of [ source](external_dicts_dict_sources.md#dicts-external_dicts_dict_sources):
> - For a text file, it checks the time of modification. If the time differs from the previously recorded time, the dictionary is updated.
- For MyISAM tables, the time of modification is checked using a `SHOW TABLE STATUS` query.
- Dictionaries from other sources are updated every time by default.
> - For MyISAM tables, the time of modification is checked using a `SHOW TABLE STATUS` query.
> - Dictionaries from other sources are updated every time by default.
For MySQL (InnoDB) and ODBC sources, you can set up a query that will update the dictionaries only if they really changed, rather than each time. To do this, follow these steps:
> - The dictionary table must have a field that always changes when the source data is updated.
- The settings of the source must specify a query that retrieves the changing field. The ClickHouse server interprets the query result as a row, and if this row has changed relative to its previous state, the dictionary is updated. Specify the query in the `<invalidate_query>` field in the settings for the [source](external_dicts_dict_sources.md#dicts-external_dicts_dict_sources).
> - The settings of the source must specify a query that retrieves the changing field. The ClickHouse server interprets the query result as a row, and if this row has changed relative to its previous state, the dictionary is updated. The query must be specified in the `<invalidate_query>` field in the [ source](external_dicts_dict_sources.md#dicts-external_dicts_dict_sources) settings.
Example of settings:
......
......@@ -80,7 +80,7 @@ Setting fields:
## HTTP(s)
Working with an HTTP(s) server depends on [how the dictionary is stored in memory](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout). If the dictionary is stored using `cache` and `complex_key_cache`, ClickHouse requests the necessary keys by sending a request via the `POST` method.
Working with executable files depends on [how the dictionary is stored in memory](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout). If the dictionary is stored using `cache` and `complex_key_cache`, ClickHouse requests the necessary keys by sending a request via the `POST` method.
Example of settings:
......@@ -135,9 +135,9 @@ Installing unixODBC and the ODBC driver for PostgreSQL:
Configuring `/etc/odbc.ini` (or `~/.odbc.ini`):
```
[DEFAULT]
[DEFAULT]
Driver = myconnection
[myconnection]
Description = PostgreSQL connection to my_db
Driver = PostgreSQL Unicode
......
......@@ -25,8 +25,8 @@ Overall structure:
Columns are described in the structure:
- `<id>` - [key column](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure-key).
- `<attribute>` - [data column](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure-attributes). There can be a large number of columns.
- `<id>` [Key column](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure-key).
- `<attribute>` [Data column](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure-attributes). There can be a large number of columns.
<a name="dicts-external_dicts_dict_structure-key"></a>
......@@ -63,10 +63,12 @@ Configuration fields:
### Composite key
The key can be a `tuple` from any types of fields. The [layout](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout) in this case must be `complex_key_hashed` or `complex_key_cache`.
The key can be a `tuple` from any types of fields. The [ layout](external_dicts_dict_layout.md#dicts-external_dicts_dict_layout) in this case must be `complex_key_hashed` or `complex_key_cache`.
<div class="admonition tip">
A composite key can consist of a single element. This makes it possible to use a string as the key, for instance.
A composite key can also consist of a single element, which makes it possible to use a string as the key, for instance.
</div>
The key structure is set in the element `<key>`. Key fields are specified in the same format as the dictionary [attributes](external_dicts_dict_structure.md#dicts-external_dicts_dict_structure-attributes). Example:
......@@ -117,6 +119,6 @@ Configuration fields:
- `null_value` – The default value for a non-existing element. In the example, it is an empty string.
- `expression` – The attribute can be an expression. The tag is not required.
- `hierarchical` – Hierarchical support. Mirrored to the parent identifier. By default, ` false`.
- `injective` Whether the `id -> attribute` image is injective. If ` true`, then you can optimize the ` GROUP BY` clause. By default, `false`.
- `is_object_id` – Whether the query is executed for a MongoDB document by `ObjectID`.
- `injective` Whether the `id -> attribute` image is injective. If ` true`, then you can optimize the ` GROUP BY` clause. By default, `false`.
- `is_object_id` - Used for querying MongoDB documents by `ObjectId`.
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
<a name="format_capnproto"></a>
# CapnProto
Cap'n Proto is a binary message format similar to Protocol Buffers and Thrift, but not like JSON or MessagePack.
Cap'n Proto messages are strictly typed and not self-describing, meaning they need an external schema description. The schema is applied on the fly and cached for each query.
```sql
SELECT SearchPhrase, count() AS c FROM test.hits
GROUP BY SearchPhrase FORMAT CapnProto SETTINGS schema = 'schema:Message'
```
Where `schema.capnp` looks like this:
```
struct Message {
SearchPhrase @0 :Text;
c @1 :Uint64;
}
```
Schema files are located in the directory specified in [format_schema_path](../operations/server_settings/settings.md#server_settings-format_schema_path) in the server configuration.
Deserialization is efficient and usually doesn't increase the system load.
<a name="format_capnproto"></a>
# CapnProto
Cap'n Proto is a binary message format similar to Protocol Buffers and Thrift, but not like JSON or MessagePack.
Cap'n Proto messages are strictly typed and not self-describing, meaning they need an external schema description. The schema is applied on the fly and cached for each query.
```sql
SELECT SearchPhrase, count() AS c FROM test.hits
GROUP BY SearchPhrase FORMAT CapnProto SETTINGS schema = 'schema:Message'
```
Where `schema.capnp` looks like this:
```
struct Message {
SearchPhrase @0 :Text;
c @1 :Uint64;
}
```
Schema files are in the file that is located in the directory specified in [ format_schema_path](../operations/server_settings/settings.md#server_settings-format_schema_path) in the server configuration.
Deserialization is effective and usually doesn't increase the system load.
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
......@@ -3,4 +3,3 @@
# Formats
The format determines how data is returned to you after SELECTs (how it is written and formatted by the server), and how it is accepted for INSERTs (how it is read and parsed by the server).
File mode changed from 100755 to 100644
......@@ -24,7 +24,7 @@ Example:
["bathroom interior design", "2166"],
["yandex", "1655"],
["spring 2014 fashion", "1549"],
["freeform photo", "1480"]
["freeform photos", "1480"]
],
"totals": ["","8873898"],
......
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
......@@ -9,5 +9,5 @@ Date is represented as a UInt16 object that contains the number of days since 19
String is represented as a varint length (unsigned [LEB128](https://en.wikipedia.org/wiki/LEB128)), followed by the bytes of the string.
FixedString is represented simply as a sequence of bytes.
Array is represented as a varint length (unsigned [LEB128](https://en.wikipedia.org/wiki/LEB128)), followed by successive elements of the array.
Arrays are represented as a varint length (unsigned [LEB128](https://en.wikipedia.org/wiki/LEB128)), followed by the array elements in order.
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
......@@ -4,5 +4,5 @@ Prints every row in brackets. Rows are separated by commas. There is no comma af
The minimum set of characters that you need to escape when passing data in Values format: single quotes and backslashes.
This is the format that is used in `INSERT INTO t VALUES ...`, but you can also use it for formatting query results.
This is the format that is used in `INSERT INTO t VALUES ...` but you can also use it for query result.
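A minimal illustration of that escaping rule (the table `t` with a single String column is hypothetical):

```sql
-- Single quotes and backslashes are the only characters that must be escaped.
INSERT INTO t VALUES ('it\'s a backslash: \\')
```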
File mode changed from 100755 to 100644
......@@ -35,7 +35,7 @@ XML format is suitable only for output, not for parsing. Example:
<field>1549</field>
</row>
<row>
<SearchPhrase>freeform photo</SearchPhrase>
<SearchPhrase>freeform photos</SearchPhrase>
<field>1480</field>
</row>
<row>
......
File mode changed from 100755 to 100644
......@@ -39,7 +39,7 @@ Accepts an empty array and returns a one-element array that is equal to the defa
Returns an array of numbers from 0 to N-1.
Just in case, an exception is thrown if arrays with a total length of more than 100,000,000 elements are created in a data block.
## array(x1, ...), operator \[x1, ...\]
## array(x1, ...), operator \[x1, ...\]
Creates an array from the function arguments.
The arguments must be constants and have types that have the smallest common type. At least one argument must be passed, because otherwise it isn't clear which type of array to create. That is, you can't use this function to create an empty array (to do that, use the 'emptyArray\*' function described above).
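For example, the function form and the operator form are interchangeable:

```sql
SELECT array(1, 2, 3) AS a, [1, 2, 3] AS b
```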
......@@ -62,6 +62,7 @@ arrayConcat(arrays)
```sql
SELECT arrayConcat([1, 2], [3, 4], [5, 6]) AS res
```
```
┌─res───────────┐
│ [1,2,3,4,5,6] │
......@@ -202,6 +203,7 @@ arrayPopBack(array)
```sql
SELECT arrayPopBack([1, 2, 3]) AS res
```
```
┌─res───┐
│ [1,2] │
......@@ -243,13 +245,14 @@ arrayPushBack(array, single_value)
**Arguments**
- `array` – Array.
- `single_value` – A single value. Only numbers can be added to an array with numbers, and only strings can be added to an array of strings. When adding numbers, ClickHouse automatically sets the `single_value` type for the data type of the array. For more information about ClickHouse data types, read the section "[Data types](../data_types/index.md#data_types)".
- `single_value` – A single value. Only numbers can be added to an array with numbers, and only strings can be added to an array of strings. When adding numbers, ClickHouse automatically sets the `single_value` type for the data type of the array. For more information about the types of data in ClickHouse, see "[Data types](../data_types/index.md#data_types)".
**Example**
```sql
SELECT arrayPushBack(['a'], 'b') AS res
```
```
┌─res───────┐
│ ['a','b'] │
......@@ -267,7 +270,7 @@ arrayPushFront(array, single_value)
**Arguments**
- `array` – Array.
- `single_value` – A single value. Only numbers can be added to an array with numbers, and only strings can be added to an array of strings. When adding numbers, ClickHouse automatically sets the `single_value` type for the data type of the array. For more information about ClickHouse data types, read the section "[Data types](../data_types/index.md#data_types)".
- `single_value` – A single value. Only numbers can be added to an array with numbers, and only strings can be added to an array of strings. When adding numbers, ClickHouse automatically sets the `single_value` type for the data type of the array. For more information about the types of data in ClickHouse, see "[Data types](../data_types/index.md#data_types)".
**Example**
......@@ -292,7 +295,7 @@ arraySlice(array, offset[, length])
**Arguments**
- `array` – Array of data.
- `offset`Indent from the edge of the array. A positive value indicates an offset on the left, and a negative value is an indent on the right. Numbering of the array items begins with 1.
- `offset`Offset from the edge of the array. A positive value indicates an offset on the left, and a negative value is an indent on the right. Numbering of the array items begins with 1.
- `length` - The length of the required slice. If you specify a negative value, the function returns an open slice `[offset, array_length - length)`. If you omit the value, the function returns the slice `[offset, the_end_of_array]`.
**Example**
......@@ -300,6 +303,7 @@ arraySlice(array, offset[, length])
```sql
SELECT arraySlice([1, 2, 3, 4, 5], 2, 3) AS res
```
```
┌─res─────┐
│ [2,3,4] │
......
......@@ -28,3 +28,4 @@ SELECT arrayJoin([1, 2, 3] AS src) AS dst, 'Hello', src
│ 3 │ Hello │ [1,2,3] │
└─────┴───────────┴─────────┘
```
......@@ -15,3 +15,4 @@ The result type is an integer with bits equal to the maximum bits of its argumen
## bitShiftLeft(a, b)
## bitShiftRight(a, b)
......@@ -15,7 +15,7 @@ For example, you can't compare a date with a string. You have to use a function
Strings are compared by bytes. A shorter string is smaller than all strings that start with it and that contain at least one more character.
Note. Up until version 1.1.54134, signed and unsigned numbers were compared the same way as in C++. In other words, you could get an incorrect result in cases like SELECT 9223372036854775807 &gt; -1. This behavior changed in version 1.1.54134 and is now mathematically correct.
Note: Up until version 1.1.54134, signed and unsigned numbers were compared the same way as in C++. In other words, you could get an incorrect result in cases like SELECT 9223372036854775807 &gt; -1. This behavior changed in version 1.1.54134 and is now mathematically correct.
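A sketch of the case described in the note (on versions 1.1.54134 and later):

```sql
-- Now mathematically correct: returns 1 rather than comparing as in C++.
SELECT 9223372036854775807 > -1
```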
## equals, a = b and a == b operator
......
File mode changed from 100755 to 100644
......@@ -79,10 +79,6 @@ Rounds down a date with time to the start of the minute.
Rounds down a date with time to the start of the hour.
## toStartOfFifteenMinutes
Rounds down the date with time to the start of the fifteen-minute interval.
Note: If you need to round a date with time to any other number of seconds, minutes, or hours, you can convert it into a number by using the toUInt32 function, then round the number using intDiv and multiplication, and convert it back using the toDateTime function.
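A sketch of the workaround from the note above, rounding a date with time down to ten-minute intervals (600 seconds):

```sql
SELECT toDateTime(intDiv(toUInt32(now()), 600) * 600) AS start_of_ten_minutes
```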
## toStartOfHour
......
File mode changed from 100755 to 100644
......@@ -18,20 +18,18 @@ For information on connecting and configuring external dictionaries, see "[Exter
`dictGetT('dict_name', 'attr_name', id)`
- Get the value of the attr_name attribute from the dict_name dictionary using the 'id' key.
`dict_name` and `attr_name` are constant strings.
`id` must be UInt64.
- Get the value of the attr_name attribute from the dict_name dictionary using the 'id' key.
`dict_name` and `attr_name` are constant strings.
`id` must be UInt64.
If there is no `id` key in the dictionary, it returns the default value specified in the dictionary description.
## dictGetTOrDefault
`dictGetT('dict_name', 'attr_name', id, default)`
The same as the `dictGetT` functions, but the default value is taken from the function's last argument.
Similar to the functions dictGetT, but the default value is taken from the last argument of the function.
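A hedged sketch of both calls (the dictionary `users_dict` and its `name` attribute are hypothetical; `T` is replaced by the attribute type, e.g. `String`):

```sql
SELECT
    dictGetString('users_dict', 'name', toUInt64(42)) AS name,
    dictGetStringOrDefault('users_dict', 'name', toUInt64(42), 'unknown') AS name_or_default
```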
## dictIsIn
`dictIsIn('dict_name', child_id, ancestor_id)`
`dictIsIn ('dict_name', child_id, ancestor_id)`
- For the 'dict_name' hierarchical dictionary, finds out whether the 'child_id' key is located inside 'ancestor_id' (or matches 'ancestor_id'). Returns UInt8.
......
File mode changed from 100755 to 100644
......@@ -73,7 +73,7 @@ Returns the index of the first element in the 'arr1' array for which 'func' retu
### arrayCumSum(\[func,\] arr1, ...)
Returns an array of partial sums of elements in the source array (a running sum). If the `func` function is specified, then the values of the array elements are converted by this function before summing.
Returns an array of the cumulative sums obtained by applying the 'func' function to each element in the 'arr' array.
Example:
......@@ -86,3 +86,4 @@ SELECT arrayCumSum([1, 1, 1, 1]) AS res
│ [1, 2, 3, 4] │
└──────────────┘
```
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
# Functions for working with JSON
# Functions for working with JSON.
In Yandex.Metrica, JSON is transmitted by users as session parameters. There are some special functions for working with this JSON. (Although in most cases, the JSONs are additionally pre-processed, and the resulting values are put in separate columns in their processed format.) All these functions are based on strong assumptions about what the JSON can be, but they try to do as little as possible to get the job done.
The following assumptions are made:
1. The field name (function argument) must be a constant.
2. The field name is somehow canonically encoded in JSON. For example: `visitParamHas('{"abc":"def"}', 'abc') = 1`, but `visitParamHas('{"\\u0061\\u0062\\u0063":"def"}', 'abc') = 0`
2. The field name is somehow canonically encoded in JSON. For example: `visitParamHas('{"abc":"def"}', 'abc') = 1`, but `visitParamHas('{"\\u0061\\u0062\\u0063":"def"}', 'abc') = 0`
3. Fields are searched for on any nesting level, indiscriminately. If there are multiple matching fields, the first occurrence is used.
4. The JSON doesn't have space characters outside of string literals.
......@@ -47,10 +47,7 @@ Parses the string in double quotes. The value is unescaped. If unescaping failed
Examples:
```text
visitParamExtractString('{"abc":"\\n\\u0000"}', 'abc') = '\n\0'
visitParamExtractString('{"abc":"\\u263a"}', 'abc') = '☺'
visitParamExtractString('{"abc":"\\u263"}', 'abc') = ''
visitParamExtractString('{"abc":"hello}', 'abc') = ''
visitParamExtractString('{"abc":"\\n\\u0000"}', 'abc') = '\n\0'visitParamExtractString('{"abc":"\\u263a"}', 'abc') = '☺'visitParamExtractString('{"abc":"\\u263"}', 'abc') = ''visitParamExtractString('{"abc":"hello}', 'abc') = ''
```
There is currently no support for code points in the format `\uXXXX\uYYYY` that are not from the basic multilingual plane (they are converted to CESU-8 instead of UTF-8).
......
......@@ -11,3 +11,4 @@ Zero as an argument is considered "false," while any non-zero value is considere
## not, NOT operator
## xor
......@@ -97,3 +97,4 @@ The arc tangent.
## pow(x, y)
xy.
......@@ -59,8 +59,7 @@ For elements in a nested data structure, the function checks for the existence o
Allows building a unicode-art diagram.
`bar (x, min, max, width)` – Draws a band with a width proportional to (x - min) and equal to 'width' characters when x == max.
`min, max` – Integer constants. The value must fit in Int64.`width` – Constant, positive number, may be a fraction.
`bar (x, min, max, width)` – Draws a band with a width proportional to (x - min) and equal to 'width' characters when x == max.
`min, max` – Integer constants. The value must fit in Int64.
`width` – Constant, positive number, may be a fraction.
The band is drawn with accuracy to one eighth of a symbol.
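A hedged sketch (reusing the `test.hits` table and `EventTime` column from other examples here; the maximum of 600000 is an arbitrary assumption):

```sql
-- One band per hour, up to 80 characters wide at x == max.
SELECT
    toHour(EventTime) AS h,
    count() AS c,
    bar(c, 0, 600000, 80)
FROM test.hits
GROUP BY h
ORDER BY h
```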
......@@ -138,7 +137,7 @@ Example:
```sql
SELECT
transform(SearchEngineID, [2, 3], ['Yandex', 'Google'], 'Other') AS title,
transform(SearchEngineID, [2, 3], ['Yandex', 'Google'], 'Other') AS title,
count() AS c
FROM test.hits
WHERE SearchEngineID != 0
......
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
......@@ -76,3 +76,4 @@ SELECT replaceRegexpAll('Hello, World!', '^', 'here: ') AS res
│ here: Hello, World! │
└─────────────────────┘
```
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
......@@ -21,9 +21,7 @@ All functions for working with regions have an optional argument at the end –
Example:
```text
regionToCountry(RegionID) – Uses the default dictionary: /opt/geo/regions_hierarchy.txt
regionToCountry(RegionID, '') – Uses the default dictionary: /opt/geo/regions_hierarchy.txt
regionToCountry(RegionID, 'ua') – Uses the dictionary for the 'ua' key: /opt/geo/regions_hierarchy_ua.txt
regionToCountry(RegionID) – Uses the default dictionary: /opt/geo/regions_hierarchy.txt
regionToCountry(RegionID, '') – Uses the default dictionary: /opt/geo/regions_hierarchy.txt
regionToCountry(RegionID, 'ua') – Uses the dictionary for the 'ua' key: /opt/geo/regions_hierarchy_ua.txt
```
### regionToCity(id[, geobase])
......@@ -35,9 +33,7 @@ Accepts a UInt32 number – the region ID from the Yandex geobase. If this regio
Converts a region to an area (type 5 in the geobase). In every other way, this function is the same as 'regionToCity'.
```sql
SELECT DISTINCT regionToName(regionToArea(toUInt32(number), 'ua'))
FROM system.numbers
LIMIT 15
SELECT DISTINCT regionToName(regionToArea(toUInt32(number), 'ua'))
FROM system.numbers
LIMIT 15
```
```text
......@@ -65,9 +61,7 @@ LIMIT 15
Converts a region to a federal district (type 4 in the geobase). In every other way, this function is the same as 'regionToCity'.
```sql
SELECT DISTINCT regionToName(regionToDistrict(toUInt32(number), 'ua'))
FROM system.numbers
LIMIT 15
SELECT DISTINCT regionToName(regionToDistrict(toUInt32(number), 'ua'))
FROM system.numbers
LIMIT 15
```
```text
......
File mode changed from 100755 to 100644
......@@ -66,8 +66,6 @@ CREATE TABLE criteo
Transform data from the raw log and put it in the second table:
```sql
INSERT INTO criteo SELECT date, clicked, int1, int2, int3, int4, int5, int6, int7, int8, int9, int10, int11, int12, int13, reinterpretAsUInt32(unhex(cat1)) AS icat1, reinterpretAsUInt32(unhex(cat2)) AS icat2, reinterpretAsUInt32(unhex(cat3)) AS icat3, reinterpretAsUInt32(unhex(cat4)) AS icat4, reinterpretAsUInt32(unhex(cat5)) AS icat5, reinterpretAsUInt32(unhex(cat6)) AS icat6, reinterpretAsUInt32(unhex(cat7)) AS icat7, reinterpretAsUInt32(unhex(cat8)) AS icat8, reinterpretAsUInt32(unhex(cat9)) AS icat9, reinterpretAsUInt32(unhex(cat10)) AS icat10, reinterpretAsUInt32(unhex(cat11)) AS icat11, reinterpretAsUInt32(unhex(cat12)) AS icat12, reinterpretAsUInt32(unhex(cat13)) AS icat13, reinterpretAsUInt32(unhex(cat14)) AS icat14, reinterpretAsUInt32(unhex(cat15)) AS icat15, reinterpretAsUInt32(unhex(cat16)) AS icat16, reinterpretAsUInt32(unhex(cat17)) AS icat17, reinterpretAsUInt32(unhex(cat18)) AS icat18, reinterpretAsUInt32(unhex(cat19)) AS icat19, reinterpretAsUInt32(unhex(cat20)) AS icat20, reinterpretAsUInt32(unhex(cat21)) AS icat21, reinterpretAsUInt32(unhex(cat22)) AS icat22, reinterpretAsUInt32(unhex(cat23)) AS icat23, reinterpretAsUInt32(unhex(cat24)) AS icat24, reinterpretAsUInt32(unhex(cat25)) AS icat25, reinterpretAsUInt32(unhex(cat26)) AS icat26 FROM criteo_log;
DROP TABLE criteo_log;
INSERT INTO criteo SELECT date, clicked, int1, int2, int3, int4, int5, int6, int7, int8, int9, int10, int11, int12, int13, reinterpretAsUInt32(unhex(cat1)) AS icat1, reinterpretAsUInt32(unhex(cat2)) AS icat2, reinterpretAsUInt32(unhex(cat3)) AS icat3, reinterpretAsUInt32(unhex(cat4)) AS icat4, reinterpretAsUInt32(unhex(cat5)) AS icat5, reinterpretAsUInt32(unhex(cat6)) AS icat6, reinterpretAsUInt32(unhex(cat7)) AS icat7, reinterpretAsUInt32(unhex(cat8)) AS icat8, reinterpretAsUInt32(unhex(cat9)) AS icat9, reinterpretAsUInt32(unhex(cat10)) AS icat10, reinterpretAsUInt32(unhex(cat11)) AS icat11, reinterpretAsUInt32(unhex(cat12)) AS icat12, reinterpretAsUInt32(unhex(cat13)) AS icat13, reinterpretAsUInt32(unhex(cat14)) AS icat14, reinterpretAsUInt32(unhex(cat15)) AS icat15, reinterpretAsUInt32(unhex(cat16)) AS icat16, reinterpretAsUInt32(unhex(cat17)) AS icat17, reinterpretAsUInt32(unhex(cat18)) AS icat18, reinterpretAsUInt32(unhex(cat19)) AS icat19, reinterpretAsUInt32(unhex(cat20)) AS icat20, reinterpretAsUInt32(unhex(cat21)) AS icat21, reinterpretAsUInt32(unhex(cat22)) AS icat22, reinterpretAsUInt32(unhex(cat23)) AS icat23, reinterpretAsUInt32(unhex(cat24)) AS icat24, reinterpretAsUInt32(unhex(cat25)) AS icat25, reinterpretAsUInt32(unhex(cat26)) AS icat26 FROM criteo_log;
DROP TABLE criteo_log;
```
# New York Taxi data
# Data about New York taxis
## How to import the raw data
## How to import raw data
See <https://github.com/toddwschneider/nyc-taxi-data> and <http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html> for the description of the dataset and instructions for downloading.
See <https://github.com/toddwschneider/nyc-taxi-data> and <http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html> for description of the dataset and loading instructions.
Downloading will result in about 227 GB of uncompressed data in CSV files. The download takes about an hour over a 1 Gbit connection (parallel downloading from s3.amazonaws.com recovers at least half of a 1 Gbit channel).
Some of the files might not download fully. Check the file sizes and re-download any that seem doubtful.
......@@ -301,14 +301,19 @@ SELECT passenger_count, toYear(pickup_date) AS year, count(*) FROM trips_mergetr
Q4:
```sql
SELECT passenger_count, toYear(pickup_date) AS year, round(trip_distance) AS distance, count(*)
FROM trips_mergetree
GROUP BY passenger_count, year, distance
ORDER BY year, count(*) DESC
SELECT passenger_count, toYear(pickup_date) AS year, round(trip_distance) AS distance, count(*)
FROM trips_mergetree
GROUP BY passenger_count, year, distance
ORDER BY year, count(*) DESC
```
3.593 seconds.
The following server was used:
Two Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 16 physical cores total, 128 GiB RAM, 8x6 TB HD on hardware RAID-5
Two Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 16 physical cores total,
128 GiB RAM,
8x6 TB HD on hardware RAID-5
Execution time is the best of three runs. But starting from the second run, queries read data from the file system cache. No further caching occurs: the data is read out and processed in each run.
......
File mode changed from 100755 to 100644
# Star Schema Benchmark
# Star scheme
Compiling dbgen: <https://github.com/vadimtk/ssb-dbgen>
......@@ -82,4 +82,3 @@ Downloading data (change 'customer' to 'customerd' in the distributed version):
cat customer.tbl | sed 's/$/2000-01-01/' | clickhouse-client --query "INSERT INTO customer FORMAT CSV"
cat lineorder.tbl | clickhouse-client --query "INSERT INTO lineorder FORMAT CSV"
```
......@@ -20,7 +20,7 @@ CREATE TABLE wikistat
Loading data:
```bash
for i in {2007..2016}; do for j in {01..12}; do echo $i-$j >&2; curl -sSL "http://dumps.wikimedia.org/other/pagecounts-raw/$i/$i-$j/" | grep -oE 'pagecounts-[0-9]+-[0-9]+\.gz'; done; done | sort | uniq | tee links.txt
for i in {2007..2016}; do for j in {01..12}; do echo $i-$j >&2; curl -sS "http://dumps.wikimedia.org/other/pagecounts-raw/$i/$i-$j/" | grep -oE 'pagecounts-[0-9]+-[0-9]+\.gz'; done; done | sort | uniq | tee links.txt
cat links.txt | while read link; do wget http://dumps.wikimedia.org/other/pagecounts-raw/$(echo $link | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})[0-9]{2}-[0-9]+\.gz/\1/')/$(echo $link | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})[0-9]{2}-[0-9]+\.gz/\1-\2/')/$link; done
ls -1 /opt/wikistat/ | grep gz | while read i; do echo $i; gzip -cd /opt/wikistat/$i | ./wikistat-loader --time="$(echo -n $i | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})([0-9]{2})-([0-9]{2})([0-9]{2})([0-9]{2})\.gz/\1-\2-\3 \4-00-00/')" | clickhouse-client --query="INSERT INTO wikistat FORMAT TabSeparated"; done
```
......
......@@ -16,15 +16,14 @@ The terminal must use UTF-8 encoding (the default in Ubuntu).
For testing and development, the system can be installed on a single server or on a desktop computer.
### Installing from packages
### Installing from packages Debian/Ubuntu
In `/etc/apt/sources.list` (or in a separate `/etc/apt/sources.list.d/clickhouse.list` file), add the repository:
```text
deb http://repo.yandex.ru/clickhouse/trusty stable main
deb http://repo.yandex.ru/clickhouse/deb/stable/ main/
```
On other versions of Ubuntu, replace `trusty` with `xenial` or `precise`.
If you want to use the most recent test version, replace 'stable' with 'testing'.
Then run:
......@@ -35,10 +34,7 @@ sudo apt-get update
sudo apt-get install clickhouse-client clickhouse-server-common
```
You can also download and install packages manually from here:
<http://repo.yandex.ru/clickhouse/trusty/pool/main/c/clickhouse/>
<http://repo.yandex.ru/clickhouse/xenial/pool/main/c/clickhouse/>
<http://repo.yandex.ru/clickhouse/precise/pool/main/c/clickhouse/>
You can also download and install packages manually from here: <https://repo.yandex.ru/clickhouse/deb/stable/main/>
ClickHouse contains access restriction settings. They are located in the 'users.xml' file (next to 'config.xml').
By default, access is allowed from anywhere for the 'default' user, without a password. See 'user/default/networks'.
......@@ -104,8 +100,7 @@ clickhouse-client
```
The default parameters indicate connecting with localhost:9000 on behalf of the user 'default' without a password.
The client can be used for connecting to a remote server.
Example:
The client can be used for connecting to a remote server. Example:
```bash
clickhouse-client --host=example.com
......@@ -137,4 +132,3 @@ SELECT 1
**Congratulations, the system works!**
To continue experimenting, you can try to download from the test data sets.
......@@ -39,7 +39,7 @@ We'll say that the following is true for the OLAP (online analytical processing)
- Data is updated in fairly large batches (> 1000 rows), not by single rows; or it is not updated at all.
- Data is added to the DB but is not modified.
- For reads, quite a large number of rows are extracted from the DB, but only a small subset of columns.
- Tables are "wide," meaning they contain a large number of columns.
- Tables are "wide", meaning they contain a large number of columns.
- Queries are relatively rare (usually hundreds of queries per server or less per second).
- For simple queries, latencies around 50 ms are allowed.
- Column values are fairly small: numbers and short strings (for example, 60 bytes per URL).
......
......@@ -6,9 +6,7 @@ To work from the command line, you can use ` clickhouse-client`:
$ clickhouse-client
ClickHouse client version 0.0.26176.
Connecting to localhost:9000.
Connected to ClickHouse server version 0.0.26176.
:)
Connected to ClickHouse server version 0.0.26176.
:)
```
The client supports command-line options and configuration files. For more information, see "[Configuring](#interfaces_cli_configuration)".
......@@ -31,6 +29,7 @@ _EOF
cat file.csv | clickhouse-client --database=test --query="INSERT INTO test FORMAT CSV";
```
In batch mode, the default data format is TabSeparated. You can set the format in the FORMAT clause of the query.
By default, you can only process a single query in batch mode. To make multiple queries from a "script," use the --multiquery parameter. This works for all queries except INSERT. Query results are output consecutively without additional separators.
Similarly, to process a large number of queries, you can run 'clickhouse-client' for each query. Note that it may take tens of milliseconds to launch the 'clickhouse-client' program.
......@@ -65,7 +64,7 @@ The command-line client allows passing external data (external temporary tables)
<a name="interfaces_cli_configuration"></a>
## Configuring
## Configure
You can pass parameters to `clickhouse-client` (all parameters have a default value) using:
......
......@@ -37,8 +37,7 @@ Date: Fri, 16 Nov 2012 19:21:50 GMT
1
```
As you can see, curl is somewhat inconvenient in that spaces must be URL escaped.
Although wget escapes everything itself, we don't recommend using it because it doesn't work well over HTTP 1.1 when using keep-alive and Transfer-Encoding: chunked.
As you can see, curl is somewhat inconvenient in that spaces must be URL escaped. Although wget escapes everything itself, we don't recommend using it because it doesn't work well over HTTP 1.1 when using keep-alive and Transfer-Encoding: chunked.
```bash
$ echo 'SELECT 1' | curl 'http://localhost:8123/' --data-binary @-
......@@ -131,11 +130,15 @@ POST 'http://localhost:8123/?query=DROP TABLE t'
For successful requests that don't return a data table, an empty response body is returned.
You can use compression when transmitting data. The compressed data has a non-standard format, and you will need to use the special compressor program to work with it (sudo apt-get install compressor-metrika-yandex).
You can use compression when transmitting data.
To use the internal ClickHouse compression format, you will need the special clickhouse-compressor program to work with it (installed as part of the clickhouse-client package).
If you specified 'compress=1' in the URL, the server will compress the data it sends you.
If you specified 'decompress=1' in the URL, the server will decompress the same data that you pass in the POST method.
Standard gzip-based HTTP compression can also be used. To send gzip-compressed POST data, just add `Content-Encoding: gzip` to the request headers and gzip the POST body.
To get a compressed response, add `Accept-Encoding: gzip` to the request headers and turn on the ClickHouse setting `enable_http_compression`.
You can use this to reduce network traffic when transmitting a large amount of data, or for creating dumps that are immediately compressed.
You can use the 'database' URL parameter to specify the default database.
......@@ -191,7 +194,11 @@ $ echo 'SELECT number FROM system.numbers LIMIT 10' | curl 'http://localhost:812
For information about other parameters, see the section "SET".
In contrast to the native interface, the HTTP interface has limited support for sessions and session settings, allows aborting a query in only a few cases, and may use the network less effectively, since parsing and data formatting are performed on the server side.
You can use ClickHouse sessions in the HTTP protocol. To do this, specify the `session_id` GET parameter in the HTTP request. You can use any alphanumeric string as the session_id. By default, a session times out after 60 seconds of inactivity. You can change this timeout by setting `default_session_timeout` in the server config file, or by adding the `session_timeout` GET parameter. You can also check the status of a session with the `session_check=1` GET parameter. When using sessions, you can't run two queries with the same session_id simultaneously.
You can get the progress of query execution in `X-ClickHouse-Progress` headers by enabling the `send_progress_in_http_headers` setting.
Running queries are not aborted automatically when the HTTP connection is closed.
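A sketch of session usage (the session_id value is an arbitrary string; temporary tables live only within the session that created them):

```bash
# Create a temporary table in the session, fill it, and read it back
# through the same session_id.
curl -sS 'http://localhost:8123/?session_id=abc123' \
     --data-binary 'CREATE TEMPORARY TABLE t (x UInt8)'
curl -sS 'http://localhost:8123/?session_id=abc123' \
     --data-binary 'INSERT INTO t VALUES (1)'
curl -sS 'http://localhost:8123/?session_id=abc123' \
     --data-binary 'SELECT * FROM t'
```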
The optional `query_id` parameter can be passed as the query ID (any string). For more information, see the section "Settings, replace_running_query".
The optional `quota_key` parameter can be passed as the quota key (any string). For more information, see the section "Quotas".
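For instance (the identifier values here are arbitrary):

```bash
# Tag the query with an ID and a quota key; both are plain strings.
echo 'SELECT count() FROM system.numbers LIMIT 1000000' | \
    curl -sS 'http://localhost:8123/?query_id=my-query-1&quota_key=user42' --data-binary @-
```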
......@@ -213,4 +220,3 @@ curl -sS 'http://localhost:8123/?max_result_bytes=4000000&buffer_size=3000000&wa
```
Use buffering to avoid situations where a query processing error occurs after the response code and HTTP headers have been sent to the client. In this situation, an error message is written at the end of the response body, and on the client side, the error can only be detected at the parsing stage.
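A sketch of buffered output (assumes a server on localhost:8123):

```bash
# wait_end_of_query=1 makes the server buffer the whole result (up to
# buffer_size bytes) before replying, so an error cannot arrive mid-response.
echo 'SELECT number FROM system.numbers LIMIT 10' | \
    curl -sS 'http://localhost:8123/?buffer_size=1048576&wait_end_of_query=1' --data-binary @-
```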
......@@ -2,5 +2,4 @@
# Interfaces
To explore the system's capabilities, download data to tables, or make manual queries, use the clickhouse-client program.
\ No newline at end of file
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
......@@ -2,7 +2,7 @@
There are libraries for working with ClickHouse for:
- Python:
- [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm)
- [sqlalchemy-clickhouse](https://github.com/cloudflare/sqlalchemy-clickhouse)
- [clickhouse-driver](https://github.com/mymarilyn/clickhouse-driver)
......
File mode changed from 100755 to 100644
# Distinctive features of ClickHouse
## True column-oriented DBMS
In a true column-oriented DBMS, there isn't any "garbage" stored with the values. Among other things, this means that constant-length values must be supported, to avoid storing their length "number" next to the values. As an example, a billion UInt8-type values should consume around 1 GB uncompressed; otherwise this would strongly affect CPU use. It is very important to store data compactly (without any "garbage") even when uncompressed, since the speed of decompression (CPU usage) depends mainly on the volume of uncompressed data.
This is worth noting because there are systems that can store values of separate columns separately, but that can't effectively process analytical queries due to their optimization for other scenarios. Examples are HBase, BigTable, Cassandra, and HyperTable. In these systems, you will get throughput around a hundred thousand rows per second, but not hundreds of millions of rows per second.
Also note that ClickHouse is a DBMS, not a single database. ClickHouse allows creating tables and databases at runtime, loading data, and running queries without reconfiguring and restarting the server.
......@@ -12,15 +12,15 @@ Also note that ClickHouse is a DBMS, not a single database. ClickHouse allows cr
Some column-oriented DBMSs (InfiniDB CE and MonetDB) do not use data compression. However, data compression really improves performance.
## Disk storage of data
Many column-oriented DBMSs (such as SAP HANA and Google PowerDrill) can only work in RAM. But even on thousands of servers, the RAM is too small for storing all the pageviews and sessions in Yandex.Metrica.
## Parallel processing on multiple cores
Large queries are parallelized in a natural way.
## Distributed processing on multiple servers
Almost none of the columnar DBMSs listed above have support for distributed processing.
In ClickHouse, data can reside on different shards. Each shard can be a group of replicas that are used for fault tolerance. The query is processed on all the shards in parallel. This is transparent for the user.
......@@ -30,12 +30,12 @@ In ClickHouse, data can reside on different shards. Each shard can be a group of
If you are familiar with standard SQL, we can't really talk about SQL support: all the functions have different names.
However, this is a declarative query language based on SQL, and in many cases it is indistinguishable from SQL.
JOINs are supported. Subqueries are supported in FROM, IN, and JOIN clauses, as well as scalar subqueries.
Dependent subqueries are not supported.
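An illustrative sketch (the `hits` table and its columns are hypothetical):

```bash
# A subquery in an IN clause: count today's hits from users also seen yesterday.
clickhouse-client --query="
    SELECT count()
    FROM hits
    WHERE EventDate = today()
      AND UserID IN (SELECT UserID FROM hits WHERE EventDate = yesterday())"
```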
## Vector engine
Data is not only stored by columns, but is processed by vectors (parts of columns). This allows us to achieve high CPU performance.
## Real-time data updates
......@@ -43,13 +43,13 @@ ClickHouse supports primary key tables. In order to quickly perform queries on t
## Indexes
Having a primary key makes it possible to extract data for specific clients (for instance, Yandex.Metrica tracking tags) for a specific time range with low latency: less than a few dozen milliseconds.
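For example (table and column names are illustrative; assume the primary key is `(CounterID, EventDate)`):

```bash
# The primary key lets the server read only the ranges matching this client
# and date interval instead of scanning the whole table.
clickhouse-client --query="
    SELECT count()
    FROM hits
    WHERE CounterID = 34 AND EventDate BETWEEN '2018-03-01' AND '2018-03-31'"
```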
## Suitable for online queries
This lets us use the system as the back-end for a web interface. Low latency means queries can be processed without delay, while the Yandex.Metrica interface page is loading; in other words, the system works in online mode.
## Support for approximated calculations
1. The system contains aggregate functions for approximate calculation of the number of distinct values, medians, and quantiles.
2. Supports running a query based on a part (sample) of data and getting an approximate result (see the sketch after this list). In this case, proportionally less data is retrieved from the disk.
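A sketch combining both points (the `hits` table and its columns are hypothetical, and the table must have been created with a sampling expression for SAMPLE to work):

```bash
# uniq() gives an approximate distinct count; SAMPLE 0.1 reads about 10% of the data.
clickhouse-client --query="
    SELECT uniq(UserID), quantile(0.5)(RequestTime)
    FROM hits
    SAMPLE 0.1"
```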
......
文件模式从 100755 更改为 100644
文件模式从 100755 更改为 100644
# Questions you were afraid to ask
## Why not use something like MapReduce?
We can refer to systems like MapReduce as distributed computing systems in which the reduce operation is based on distributed sorting. In this sense, they include Hadoop and YT (a proprietary technology developed at Yandex for internal use).
These systems aren't appropriate for online queries due to their high latency. In other words, they can't be used as the back-end for a web interface.
These types of systems aren't useful for real-time data updates.
......
The diffs for the remaining files are collapsed; for a number of them, the file mode changed from 100755 to 100644.