external_dicts.rst 17.7 KB
Newer Older
A
Andrey Dudin 已提交
1
External dictionaries
2
=====================
A
Andrey Dudin 已提交
3

A
Andrey Dudin 已提交
4 5
It is possible to add your own dictionaries from various data sources. The data source for a dictionary can be a file in the local file system, the ClickHouse server, or a MySQL server.
A dictionary can be stored completely in RAM and updated regularly, or it can be partially cached in RAM and dynamically load missing values.
A
Andrey Dudin 已提交
6

7
The configuration of external dictionaries is in a separate file or files specified in the ``dictionaries_config`` configuration parameter.
A
Andrey Dudin 已提交
8
This parameter contains the absolute or relative path to the file with the dictionary configuration. A relative path is relative to the directory with the server config file. The path can contain wildcards * and ?, in which case all matching files are found. Example: dictionaries/*.xml.
A
Andrey Dudin 已提交
9

A
Andrey Dudin 已提交
10
The dictionary configuration, as well as the set of files with the configuration, can be updated without restarting the server. The server checks updates every 5 seconds. This means that dictionaries can be enabled dynamically.
A
Andrey Dudin 已提交
11

12
Dictionaries can be created when starting the server, or at first use. This is defined by the ``dictionaries_lazy_load`` parameter in the main server config file. This parameter is optional, 'true' by default. If set to 'true', each dictionary is created at first use. If dictionary creation failed, the function that was using the dictionary throws an exception. If 'false', all dictionaries are created when the server starts, and if there is an error, the server shuts down.
A
Andrey Dudin 已提交
13

A
Andrey Dudin 已提交
14
The dictionary config file has the following format:
A
Andrey Dudin 已提交
15 16 17 18

.. code-block:: xml

  <dictionaries>
A
Andrey Dudin 已提交
19 20 21
      <comment>Optional element with any content; completely ignored.</comment>

      <!--You can set any number of different dictionaries. -->
A
Andrey Dudin 已提交
22
      <dictionary>
A
Andrey Dudin 已提交
23
          <!-- Dictionary name. The dictionary will be accessed for use by this name. -->
A
Andrey Dudin 已提交
24 25
          <name>os</name>
  
A
Andrey Dudin 已提交
26
          <!-- Data source. -->
A
Andrey Dudin 已提交
27
          <source>
A
Andrey Dudin 已提交
28
              <!-- Source is a file in the local file system. -->
A
Andrey Dudin 已提交
29
              <file>
A
Andrey Dudin 已提交
30
                  <!-- Path on the local file system. -->
A
Andrey Dudin 已提交
31
                  <path>/opt/dictionaries/os.tsv</path>
A
Andrey Dudin 已提交
32
                  <!-- Which format to use for reading the file. -->
A
Andrey Dudin 已提交
33 34
                  <format>TabSeparated</format>
              </file>
A
Andrey Dudin 已提交
35 36

              <!-- or the source is a table on a MySQL server.
A
Andrey Dudin 已提交
37
              <mysql>
A
Andrey Dudin 已提交
38
                  <!- - These parameters can be specified outside (common for all replicas) or inside a specific replica - ->
A
Andrey Dudin 已提交
39 40 41
                  <port>3306</port>
                  <user>clickhouse</user>
                  <password>qwerty</password>
A
Andrey Dudin 已提交
42
                  <!- - Specify from one to any number of replicas for fault tolerance. - ->
A
Andrey Dudin 已提交
43 44
                  <replica>
                      <host>example01-1</host>
A
Andrey Dudin 已提交
45
                      <priority>1</priority> <!- - The lower the value, the higher the priority. - ->
A
Andrey Dudin 已提交
46 47 48 49 50 51 52 53 54
                  </replica>
                  <replica>
                      <host>example01-2</host>
                      <priority>1</priority>
                  </replica>
                  <db>conv_main</db>
                  <table>counters</table>
              </mysql>
              -->
A
Andrey Dudin 已提交
55 56

              <!-- or the source is a table on the ClickHouse server.
A
Andrey Dudin 已提交
57 58 59 60 61 62 63 64
              <clickhouse>
                  <host>example01-01-1</host>
                  <port>9000</port>
                  <user>default</user>
                  <password></password>
                  <db>default</db>
                  <table>counters</table>
              </clickhouse>
A
Andrey Dudin 已提交
65
              <!- - If the address is similar to localhost, the request is made without network interaction. For fault tolerance, you can create a Distributed table on localhost and enter it. - ->
A
Andrey Dudin 已提交
66
              -->
A
Andrey Dudin 已提交
67 68

              <!-- or the source is a executable. If layout.complex_key_cache - list of needed keys will be written in STDIN of program -->
A
Andrey Dudin 已提交
69
              <executable>
A
Andrey Dudin 已提交
70
                  <!-- Path on the local file system or name located in one of env PATH dirs. -->
A
Andrey Dudin 已提交
71
                  <command>cat /opt/dictionaries/os.tsv</command>
A
Andrey Dudin 已提交
72
                  <!-- Which format to use for reading/writing stream. -->
A
Andrey Dudin 已提交
73 74
                  <format>TabSeparated</format>
              </executable>
A
Andrey Dudin 已提交
75 76

              <!-- or the source is a http server. If layout.complex_key_cache - list of needed keys will be sent as POST  -->
A
Andrey Dudin 已提交
77
              <http>
A
Andrey Dudin 已提交
78
                  <!-- Host. -->
A
Andrey Dudin 已提交
79
                  <url>http://[::1]/os.tsv</url>
A
Andrey Dudin 已提交
80
                  <!-- Which format to use for reading answer and making POST. -->
A
Andrey Dudin 已提交
81 82
                  <format>TabSeparated</format>
              </http>
A
Andrey Dudin 已提交
83

A
Andrey Dudin 已提交
84
          </source>
A
Andrey Dudin 已提交
85 86

          <!-- Update interval for fully loaded dictionaries. 0 - never update. -->
A
Andrey Dudin 已提交
87 88 89
          <lifetime>
              <min>300</min>
              <max>360</max>
A
Andrey Dudin 已提交
90
              <!-- The update interval is selected uniformly randomly between min and max, in order to spread out the load when updating dictionaries on a large number of servers. -->
A
Andrey Dudin 已提交
91
          </lifetime>
A
Andrey Dudin 已提交
92 93

          <!-- or <!- - The update interval for fully loaded dictionaries or invalidation time for cached dictionaries. 0 - never update. - ->
A
Andrey Dudin 已提交
94 95
          <lifetime>300</lifetime>
          -->
A
Andrey Dudin 已提交
96 97

          <layout> <!-- Method for storing in memory. -->
A
Andrey Dudin 已提交
98
              <flat />
A
Andrey Dudin 已提交
99 100
              <!-- or <hashed />
              or
A
Andrey Dudin 已提交
101
              <cache>
A
Andrey Dudin 已提交
102
                  <!- - Cache size in number of cells; rounded up to a degree of two. - ->
A
Andrey Dudin 已提交
103
                  <size_in_cells>1000000000</size_in_cells>
104 105 106 107
              </cache>
              or
              <ip_trie />
              -->
A
Andrey Dudin 已提交
108
          </layout>
A
Andrey Dudin 已提交
109 110

          <!-- Structure. -->
A
Andrey Dudin 已提交
111
          <structure>
A
Andrey Dudin 已提交
112
              <!-- Description of the column that serves as the dictionary identifier (key). -->
A
Andrey Dudin 已提交
113
              <id>
A
Andrey Dudin 已提交
114
                  <!-- Column name with ID. -->
A
Andrey Dudin 已提交
115 116
                  <name>Id</name>
              </id>
A
Andrey Dudin 已提交
117 118 119

              <attribute>
                  <!-- Column name. -->
A
Andrey Dudin 已提交
120
                  <name>Name</name>
A
Andrey Dudin 已提交
121 122 123
                  <!-- Column type. (How the column is understood when loading. For MySQL, a table can have TEXT, VARCHAR, and BLOB, but these are all loaded as String) -->
                   <type>String</type>
                  <!-- Value to use for a non-existing element. In the example, an empty string. -->
A
Andrey Dudin 已提交
124 125
                  <null_value></null_value>
              </attribute>
A
Andrey Dudin 已提交
126
              <!-- Any number of attributes can be specified. -->
A
Andrey Dudin 已提交
127 128 129 130
              <attribute>
                  <name>ParentID</name>
                  <type>UInt64</type>
                  <null_value>0</null_value>
A
Andrey Dudin 已提交
131
                  <!-- Whether it defines a hierarchy - mapping to the parent ID (by default, false). -->
A
Andrey Dudin 已提交
132
                  <hierarchical>true</hierarchical>
A
Andrey Dudin 已提交
133
                  <!-- The mapping id -> attribute can be considered injective, in order to optimize GROUP BY. (by default, false) -->
A
Andrey Dudin 已提交
134 135 136 137 138 139
                  <injective>true</injective>
              </attribute>
          </structure>
      </dictionary>
  </dictionaries>

A
Andrey Dudin 已提交
140
The dictionary identifier (key attribute) should be a number that fits into UInt64. Also, you can use arbitrary tuples as keys (see section "Dictionaries with complex keys"). Note: you can use complex keys consisting of just one element. This allows using e.g. Strings as dictionary keys.
A
Andrey Dudin 已提交
141

A
Andrey Dudin 已提交
142
There are six ways to store dictionaries in memory.
A
Andrey Dudin 已提交
143 144

flat
145
----
A
Andrey Dudin 已提交
146
This is the most effective method. It works if all keys are smaller than ``500,000``.  If a larger key is discovered when creating the dictionary, an exception is thrown and the dictionary is not created. The dictionary is loaded to RAM in its entirety. The dictionary uses the amount of memory proportional to maximum key value. With the limit of 500,000, memory consumption is not likely to be high. All types of sources are supported. When updating, data (from a file or from a table) is read in its entirety.
A
Andrey Dudin 已提交
147 148

hashed
149
------
A
Andrey Dudin 已提交
150 151
This method is slightly less effective than the first one. The dictionary is also loaded to RAM in its entirety, and can contain any number of items with any identifiers. In practice, it makes sense to use up to tens of millions of items, while there is enough RAM.
All types of sources are supported. When updating, data (from a file or from a table) is read in its entirety.
A
Andrey Dudin 已提交
152 153

cache
154
-----
A
Andrey Dudin 已提交
155 156
This is the least effective method. It is appropriate if the dictionary doesn't fit in RAM. It is a cache of a fixed number of cells, where frequently-used data can be located. MySQL, ClickHouse, executable, http sources are supported, but file sources are not supported. 
When searching a dictionary, the cache is searched first. For each data block, all keys not found in the cache (or expired keys) are collected in a package, which is sent to the source with the query ``SELECT attrs... FROM db.table WHERE id IN (k1, k2, ...)``. The received data is then written to the cache.
A
Andrey Dudin 已提交
157 158

range_hashed
159
------------
A
Andrey Dudin 已提交
160
The table lists some data for date ranges, for each key. To give the possibility to extract this data for a given key, for a given date.
A
Andrey Dudin 已提交
161

A
Andrey Dudin 已提交
162
Example: in the table there are discounts for each advertiser in the form:
163 164 165

.. code-block:: text

A
Andrey Dudin 已提交
166
  advertiser id    discount start date    end date    value
A
Andrey Dudin 已提交
167 168 169 170
  123                 2015-01-01                     2015-01-15    0.15
  123                 2015-01-16                     2015-01-31    0.25
  456                 2015-01-01                     2015-01-15    0.05

A
Andrey Dudin 已提交
171 172
Adding layout = range_hashed.
When using such a layout, the structure should have the elements range_min, range_max.
A
Andrey Dudin 已提交
173

A
Andrey Dudin 已提交
174
Example:
A
Andrey Dudin 已提交
175 176 177 178 179 180 181 182 183 184 185 186 187 188 189

.. code-block:: xml

  <structure>
      <id>
          <name>Id</name>
      </id>
      <range_min>
          <name>first</name>
      </range_min>
      <range_max>
          <name>last</name>
      </range_max>
      ...
      
A
Andrey Dudin 已提交
190 191
These columns must be of type Date. Other types are not yet supported.
The columns indicate a closed date range.
A
Andrey Dudin 已提交
192

A
Andrey Dudin 已提交
193
To work with such dictionaries, dictGetT functions must take one more argument - the date:
A
Andrey Dudin 已提交
194 195 196

``dictGetT('dict_name', 'attr_name', id, date)``

A
Andrey Dudin 已提交
197
The function takes out the value for this id and for the date range, which includes the transmitted date. If no id is found or the range found is not found for the found id, the default value for the dictionary is returned.
A
Andrey Dudin 已提交
198

A
Andrey Dudin 已提交
199
If there are overlapping ranges, then any suitable one can be used.
A
Andrey Dudin 已提交
200

A
Andrey Dudin 已提交
201
If the range boundary is NULL or is an incorrect date (1900-01-01, 2039-01-01), then the range should be considered open. The range can be open on both sides.
A
Andrey Dudin 已提交
202

A
Andrey Dudin 已提交
203
In the RAM, the data is presented as a hash table with a value in the form of an ordered array of ranges and their corresponding values.
A
Andrey Dudin 已提交
204

A
Andrey Dudin 已提交
205
Example of a dictionary by ranges:
A
Andrey Dudin 已提交
206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250

.. code-block:: xml

  <dictionaries>
          <dictionary>
                  <name>xxx</name>
                  <source>
                          <mysql>
                                  <password>xxx</password>
                                  <port>3306</port>
                                  <user>xxx</user>
                                  <replica>
                                          <host>xxx</host>
                                          <priority>1</priority>
                                  </replica>
                                  <db>dicts</db>
                                  <table>xxx</table>
                          </mysql>
                  </source>
                  <lifetime>
                          <min>300</min>
                          <max>360</max>
                  </lifetime>
                  <layout>
                          <range_hashed />
                  </layout>
                  <structure>
                          <id>
                                  <name>Abcdef</name>
                          </id>
                          <range_min>
                                  <name>StartDate</name>
                          </range_min>
                          <range_max>
                                  <name>EndDate</name>
                          </range_max>
                          <attribute>
                                  <name>XXXType</name>
                                  <type>String</type>
                                  <null_value />
                          </attribute>
                  </structure>
          </dictionary>
  </dictionaries>

251 252 253 254 255
ip_trie
-------
The table stores IP prefixes for each key (IP address), which makes it possible to map IP addresses to metadata such as ASN or threat score.

Example: in the table there are prefixes matches to AS number and country:
256

257 258
.. code-block:: text

259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
  prefix            asn       cca2
  202.79.32.0/20    17501     NP
  2620:0:870::/48   3856      US
  2a02:6b8:1::/48   13238     RU
  2001:db8::/32     65536     ZZ


When using such a layout, the structure should have the "key" element.

Example:

.. code-block:: xml

  <structure>
      <key>
          <attribute>
              <name>prefix</name>
              <type>String</type>
          </attribute>
      </key>
      <attribute>
              <name>asn</name>
              <type>UInt32</type>
              <null_value />
      </attribute>
      <attribute>
              <name>cca2</name>
              <type>String</type>
              <null_value>??</null_value>
      </attribute>
      ...
      
These key must have only one attribute of type String, containing a valid IP prefix. Other types are not yet supported.

For querying, same functions (dictGetT with tuple) as for complex key dictionaries have to be used:

``dictGetT('dict_name', 'attr_name', tuple(ip))``

The function accepts either UInt32 for IPv4 address or FixedString(16) for IPv6 address in wire format:

``dictGetString('prefix', 'asn', tuple(IPv6StringToNum('2001:db8::1')))``

No other type is supported. The function returns attribute for a prefix matching the given IP address. If there are overlapping prefixes, the most specific one is returned.

The data is stored currently in a bitwise trie, it has to fit in memory.

A
Andrey Dudin 已提交
305
complex_key_hashed
306
------------------
A
Andrey Dudin 已提交
307

A
Andrey Dudin 已提交
308
The same as ``hashed``, but for complex keys.
A
Andrey Dudin 已提交
309 310

complex_key_cache
311
-----------------
A
Andrey Dudin 已提交
312

A
Andrey Dudin 已提交
313
The same as ``cache``, but for complex keys.
A
Andrey Dudin 已提交
314

A
Andrey Dudin 已提交
315
Notes
316
-----
A
Andrey Dudin 已提交
317

A
Andrey Dudin 已提交
318
We recommend using the ``flat`` method when possible, or ``hashed``. The speed of the dictionaries is impeccable with this type of memory storage.
A
Andrey Dudin 已提交
319

A
Andrey Dudin 已提交
320
Use the cache method only in cases when it is unavoidable. The speed of the cache depends strongly on correct settings and the usage scenario. A cache type dictionary only works normally for high enough hit rates (recommended 99% and higher). You can view the average hit rate in the system.dictionaries table. Set a large enough cache size. You will need to experiment to find the right number of cells - select a value, use a query to get the cache completely full, look at the memory consumption (this information is in the system.dictionaries table), then proportionally increase the number of cells so that a reasonable amount of memory is consumed. We recommend MySQL as the source for the cache, because ClickHouse doesn't handle requests with random reads very well.
A
Andrey Dudin 已提交
321

A
Andrey Dudin 已提交
322
In all cases, performance is better if you call the function for working with a dictionary after ``GROUP BY``, and if the attribute being fetched is marked as injective. For a dictionary cache, performance improves if you call the function after LIMIT. To do this, you can use a subquery with LIMIT, and call the function with the dictionary from the outside.
A
Andrey Dudin 已提交
323

A
Andrey Dudin 已提交
324
An attribute is called injective if different attribute values correspond to different keys. So when ``GROUP BY`` uses a function that fetches an attribute value by the key, this function is automatically taken out of ``GROUP BY``.
A
Andrey Dudin 已提交
325

A
Andrey Dudin 已提交
326 327
When updating dictionaries from a file, first the file modification time is checked, and it is loaded only if the file has changed.
When updating from MySQL, for flat and hashed dictionaries, first a ``SHOW TABLE STATUS`` query is made, and the table update time is checked. If it is not NULL, it is compared to the stored time. This works for MyISAM tables, but for InnoDB tables the update time is unknown, so loading from InnoDB is performed on each update.
A
Andrey Dudin 已提交
328

A
Andrey Dudin 已提交
329
For cache dictionaries, the expiration (lifetime) of data in the cache can be set. If more time than 'lifetime' has passed since loading the data in a cell, the cell's value is not used, and it is re-requested the next time it needs to be used.
A
Andrey Dudin 已提交
330

A
Andrey Dudin 已提交
331 332 333
If a dictionary couldn't be loaded even once, an attempt to use it throws an exception.
If an error occurred during a request to a cached source, an exception is thrown.
Dictionary updates (other than loading for first use) do not block queries. During updates, the old version of a dictionary is used. If an error occurs during an update, the error is written to the server log, and queries continue using the old version of dictionaries.
A
Andrey Dudin 已提交
334

A
Andrey Dudin 已提交
335
You can view the list of external dictionaries and their status in the system.dictionaries table.
A
Andrey Dudin 已提交
336

A
Andrey Dudin 已提交
337
To use external dictionaries, see the section "Functions for working with external dictionaries".
A
Andrey Dudin 已提交
338

A
Andrey Dudin 已提交
339
Note that you can convert values for a small dictionary by specifying all the contents of the dictionary directly in a ``SELECT`` query (see the section "transform function"). This functionality is not related to external dictionaries.
A
Andrey Dudin 已提交
340

A
Andrey Dudin 已提交
341
Dictionaries with complex keys
342
------------------------------
A
Andrey Dudin 已提交
343

A
Andrey Dudin 已提交
344
You can use tuples consisting of fields of arbitrary types as keys. Configure your dictionary with ``complex_key_hashed`` or ``complex_key_cache`` layout in this case.
A
Andrey Dudin 已提交
345

A
Andrey Dudin 已提交
346
Key structure is configured not in the ``<id>`` element but in the ``<key>`` element. Fields of the key tuple are configured analogously to dictionary attributes. Example:
A
Andrey Dudin 已提交
347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364

.. code-block:: xml

  <structure>
      <key>
          <attribute>
              <name>field1</name>
              <type>String</type>
          </attribute>
          <attribute>
              <name>field2</name>
              <type>UInt32</type>
          </attribute>
          ...
      </key>
  ...


A
Andrey Dudin 已提交
365
When using such dictionary, use a Tuple of field values as a key in dictGet* functions. Example: ``dictGetString('dict_name', 'attr_name', tuple('field1_value', 123))``.