Unverified · Commit 8011f5c3 · Author: gyuton · Committer: GitHub

DOCSUP-5910: Documented SimHash, MinHash, bitHammingDistance and tupleHammingDistance functions (#22131)
Co-authored-by: olgarev <56617294+olgarev@users.noreply.github.com>
Co-authored-by: George <gyuton@yandex-team.ru>
Co-authored-by: Vladimir <vdimir@yandex-team.ru>
Parent 6fb70cfd
......@@ -250,3 +250,53 @@ Result:
└───────────────┘
```
## bitHammingDistance {#bithammingdistance}
Returns the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between the bit representations of two integer values. Can be used with [SimHash](../../sql-reference/functions/hash-functions.md#ngramsimhash) functions to detect near-duplicate strings. The smaller the distance, the more likely the strings are the same.
**Syntax**
``` sql
bitHammingDistance(int1, int2)
```
**Arguments**
- `int1` — First integer value. [Int64](../../sql-reference/data-types/int-uint.md).
- `int2` — Second integer value. [Int64](../../sql-reference/data-types/int-uint.md).
**Returned value**
- The Hamming distance.
Type: [UInt8](../../sql-reference/data-types/int-uint.md).
**Examples**
Query:
``` sql
SELECT bitHammingDistance(111, 121);
```
Result:
``` text
┌─bitHammingDistance(111, 121)─┐
│                            3 │
└──────────────────────────────┘
```
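To make the semantics concrete, here is a minimal Python sketch of what the function computes: XOR the two values, then count the set bits. It is an illustrative model, not ClickHouse's implementation. The name `bit_hamming_distance` is hypothetical, and the 64-bit two's-complement handling of negative inputs is an assumption based on the `Int64` argument type above.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # assumption: values are treated as 64-bit words


def bit_hamming_distance(int1: int, int2: int) -> int:
    """Count the bit positions in which two integers differ."""
    # XOR leaves a 1 in every position where the inputs disagree.
    # Masking maps negative Python ints to their 64-bit two's-complement form.
    diff = (int1 ^ int2) & MASK64
    # The number of set bits in the XOR is the Hamming distance.
    return bin(diff).count("1")


# 111 = 0b1101111, 121 = 0b1111001, XOR = 0b0010110 (three set bits)
print(bit_hamming_distance(111, 121))  # 3
```

The printed value matches the query result above.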
With [SimHash](../../sql-reference/functions/hash-functions.md#ngramsimhash):
``` sql
SELECT bitHammingDistance(ngramSimHash('cat ate rat'), ngramSimHash('rat ate cat'));
```
Result:
``` text
┌─bitHammingDistance(ngramSimHash('cat ate rat'), ngramSimHash('rat ate cat'))─┐
│                                                                            5 │
└──────────────────────────────────────────────────────────────────────────────┘
```
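Why a small distance suggests similar strings can be illustrated with a toy SimHash in Python. This is a sketch under stated assumptions: `toy_simhash`, its character 3-gram tokenization, and the BLAKE2b hash are hypothetical choices made for illustration, so it will not reproduce the values returned by `ngramSimHash`.

```python
import hashlib


def toy_simhash(text: str, n: int = 3, bits: int = 64) -> int:
    """Toy SimHash over character n-grams (not ClickHouse's algorithm)."""
    # Split the text into overlapping character n-grams.
    grams = [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]
    votes = [0] * bits
    for gram in grams:
        # Hash each n-gram to a `bits`-bit integer.
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Every n-gram votes +1 or -1 on each bit position.
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps the bits whose vote total is positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


a = toy_simhash("cat ate rat")
b = toy_simhash("rat ate cat")
print(hamming(a, b))  # somewhere between 0 and 64
```

Strings that share most of their n-grams cast mostly the same votes, so their fingerprints tend to differ in few bits. Identical strings always yield a distance of 0.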
......@@ -111,4 +111,55 @@ Result:
- [Tuple](../../sql-reference/data-types/tuple.md)
[Original article](https://clickhouse.tech/docs/en/sql-reference/functions/tuple-functions/) <!--hide-->
## tupleHammingDistance {#tuplehammingdistance}
Returns the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between two tuples of the same size.
**Syntax**
``` sql
tupleHammingDistance(tuple1, tuple2)
```
**Arguments**
- `tuple1` — First tuple. [Tuple](../../sql-reference/data-types/tuple.md).
- `tuple2` — Second tuple. [Tuple](../../sql-reference/data-types/tuple.md).
Tuples must have the same number of elements, and corresponding elements must have the same type.
**Returned value**
- The Hamming distance.
Type: [UInt8](../../sql-reference/data-types/int-uint.md).
**Examples**
Query:
``` sql
SELECT tupleHammingDistance((1, 2, 3), (3, 2, 1)) AS HammingDistance;
```
Result:
``` text
┌─HammingDistance─┐
│               2 │
└─────────────────┘
```
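For tuples, the distance is the number of positions at which the elements differ. The following Python sketch models that definition; `tuple_hamming_distance` is a hypothetical helper, and raising an error on a size mismatch is an assumption about how unequal sizes should be treated, not a description of ClickHouse's behavior.

```python
from typing import Sequence


def tuple_hamming_distance(tuple1: Sequence, tuple2: Sequence) -> int:
    """Count the positions at which two equal-sized tuples differ."""
    if len(tuple1) != len(tuple2):
        # Assumption: the distance is undefined for tuples of different sizes.
        raise ValueError("tuples must have the same size")
    # Compare elements pairwise and count the mismatches.
    return sum(1 for a, b in zip(tuple1, tuple2) if a != b)


# Positions 0 and 2 differ, position 1 matches.
print(tuple_hamming_distance((1, 2, 3), (3, 2, 1)))  # 2
```

The printed value matches the query result above.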
Can be used with [MinHash](../../sql-reference/functions/hash-functions.md#ngramminhash) functions to detect near-duplicate strings:
``` sql
SELECT tupleHammingDistance(wordShingleMinHash(string), wordShingleMinHashCaseInsensitive(string)) AS HammingDistance
FROM (SELECT 'Clickhouse is a column-oriented database management system for online analytical processing of queries.' AS string);
```
Result:
``` text
┌─HammingDistance─┐
│               2 │
└─────────────────┘
```
......@@ -240,3 +240,53 @@ SELECT bitCount(333);
└───────────────┘
```
## bitHammingDistance {#bithammingdistance}
Returns the [Hamming distance](https://ru.wikipedia.org/wiki/%D0%A0%D0%B0%D1%81%D1%81%D1%82%D0%BE%D1%8F%D0%BD%D0%B8%D0%B5_%D0%A5%D1%8D%D0%BC%D0%BC%D0%B8%D0%BD%D0%B3%D0%B0) between the bit representations of two integers. Can be used with [SimHash](../../sql-reference/functions/hash-functions.md#ngramsimhash) functions to check two strings for similarity. The smaller the distance, the more likely the strings are the same.
**Syntax**
``` sql
bitHammingDistance(int1, int2)
```
**Arguments**
- `int1` — First integer. [Int64](../../sql-reference/data-types/int-uint.md).
- `int2` — Second integer. [Int64](../../sql-reference/data-types/int-uint.md).
**Returned value**
- The Hamming distance.
Type: [UInt8](../../sql-reference/data-types/int-uint.md).
**Examples**
Query:
``` sql
SELECT bitHammingDistance(111, 121);
```
Result:
``` text
┌─bitHammingDistance(111, 121)─┐
│                            3 │
└──────────────────────────────┘
```
With [SimHash](../../sql-reference/functions/hash-functions.md#ngramsimhash):
``` sql
SELECT bitHammingDistance(ngramSimHash('cat ate rat'), ngramSimHash('rat ate cat'));
```
Result:
``` text
┌─bitHammingDistance(ngramSimHash('cat ate rat'), ngramSimHash('rat ate cat'))─┐
│                                                                            5 │
└──────────────────────────────────────────────────────────────────────────────┘
```
......@@ -111,3 +111,55 @@ SELECT untuple((* EXCEPT (v2, v3),)) FROM kv;
- [Tuple](../../sql-reference/data-types/tuple.md)
## tupleHammingDistance {#tuplehammingdistance}
Returns the [Hamming distance](https://ru.wikipedia.org/wiki/%D0%A0%D0%B0%D1%81%D1%81%D1%82%D0%BE%D1%8F%D0%BD%D0%B8%D0%B5_%D0%A5%D1%8D%D0%BC%D0%BC%D0%B8%D0%BD%D0%B3%D0%B0) between two tuples of the same size.
**Syntax**
``` sql
tupleHammingDistance(tuple1, tuple2)
```
**Arguments**
- `tuple1` — First tuple. [Tuple](../../sql-reference/data-types/tuple.md).
- `tuple2` — Second tuple. [Tuple](../../sql-reference/data-types/tuple.md).
Tuples must have the same size and the same element types.
**Returned value**
- The Hamming distance.
Type: [UInt8](../../sql-reference/data-types/int-uint.md).
**Examples**
Query:
``` sql
SELECT tupleHammingDistance((1, 2, 3), (3, 2, 1)) AS HammingDistance;
```
Result:
``` text
┌─HammingDistance─┐
│               2 │
└─────────────────┘
```
Can be used with [MinHash](../../sql-reference/functions/hash-functions.md#ngramminhash) functions to check strings for a match:
``` sql
SELECT tupleHammingDistance(wordShingleMinHash(string), wordShingleMinHashCaseInsensitive(string)) AS HammingDistance
FROM (SELECT 'Clickhouse is a column-oriented database management system for online analytical processing of queries.' AS string);
```
Result:
``` text
┌─HammingDistance─┐
│               2 │
└─────────────────┘
```