# 12.3. Controlling Text Search

12.3.1. Parsing Documents

12.3.2. Parsing Queries

12.3.3. Ranking Search Results

12.3.4. Highlighting Results

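To implement full text searching there must be a function to create a `tsvector` from a document and a `tsquery` from a user query. Also, results need to be returned in a useful order, so there must be a function that compares documents with respect to their relevance to the query. It is also important to be able to display the results nicely. PostgreSQL provides support for all of these functions.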

### 12.3.1. Parsing Documents

PostgreSQL provides the function `to_tsvector` for converting a document to the `tsvector` data type.

to_tsvector([ config regconfig, ] document text) returns tsvector

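`to_tsvector` parses a textual document into tokens, reduces the tokens to lexemes, and returns a `tsvector` which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration. Here is a simple example: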

SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
### 12.3.2. Parsing Queries

PostgreSQL provides the functions `to_tsquery`, `plainto_tsquery`, `phraseto_tsquery` and `websearch_to_tsquery` for converting a query to the `tsquery` data type. `to_tsquery` offers access to more features than either `plainto_tsquery` or `phraseto_tsquery`, but it is less forgiving about its input. `websearch_to_tsquery` is a simplified version of `to_tsquery` with an alternative syntax, similar to the one used by web search engines.


to_tsquery([ config regconfig, ] querytext text) returns tsquery

`to_tsquery` creates a `tsquery` value from *`querytext`*, which must consist of single tokens separated by the `tsquery` operators `&` (AND), `|` (OR), `!` (NOT), and `<->` (FOLLOWED BY), possibly grouped using parentheses. In other words, the input to `to_tsquery` must already follow the general rules for `tsquery` input, as described in [Section 8.11.2](datatype-textsearch.html#DATATYPE-TSQUERY). The difference is that while basic `tsquery` input takes the tokens at face value, `to_tsquery` normalizes each token into a lexeme using the specified or default configuration, and discards any tokens that are stop words according to the configuration. For example:

SELECT to_tsquery('english', 'The & Fat & Rats');
  to_tsquery

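### 12.3.3. Ranking Search Results

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document where they occur is. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.

The two ranking functions currently available are: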

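ts_rank([ *weights* float4[], ] *vector* tsvector, *query* tsquery [, *normalization* integer ]) returns float4

Ranks vectors based on the frequency of their matching lexemes.

ts_rank_cd([ *weights* float4[], ] *vector* tsvector, *query* tsquery [, *normalization* integer ]) returns float4

This function computes the *cover density* ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries" in the journal "Information Processing and Management", 1999. Cover density is similar to `ts_rank` ranking except that the proximity of matching lexemes to each other is taken into consideration.

This function requires lexeme positional information to perform its calculation. Therefore, it ignores any “stripped” lexemes in the `tsvector`. If there are no unstripped lexemes in the input, the result will be zero. (See Section 12.4.1 for more information about the `strip` function and positional information in `tsvector`s.)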

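For both these functions, the optional *`weights`* argument offers the ability to weigh word instances more or less heavily depending on how they are labeled. The weight arrays specify how heavily to weigh each category of word, in the order: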

{D-weight, C-weight, B-weight, A-weight}

If no *`weights`* are provided, then these defaults are used:

{0.1, 0.2, 0.4, 1.0}

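Typically weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.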
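To make the array order concrete, the following Python sketch maps the four labels to their positions in the weights array. The names `LABEL_ORDER`, `DEFAULT_WEIGHTS`, and `weight_for_label` are hypothetical helpers introduced for illustration; the sketch shows only the documented ordering and defaults, not how PostgreSQL combines weights into a rank.

```python
# Documented order of the weights array: {D-weight, C-weight, B-weight, A-weight}
LABEL_ORDER = ("D", "C", "B", "A")
# Documented defaults: {0.1, 0.2, 0.4, 1.0}
DEFAULT_WEIGHTS = (0.1, 0.2, 0.4, 1.0)


def weight_for_label(label, weights=DEFAULT_WEIGHTS):
    """Return the weight that applies to a lexeme label (hypothetical helper)."""
    if len(weights) != 4:
        raise ValueError("weights must have exactly four entries")
    # The position in the array is determined by the label, D first and A last.
    return weights[LABEL_ORDER.index(label.upper())]


if __name__ == "__main__":
    for label in "ABCD":
        print(label, weight_for_label(label))
```

Note that the array runs from the least important label (D) to the most important (A), so a custom array that boosts only title words labeled A would change the last entry, not the first.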

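Since a longer document has a greater chance of containing a query term, it is reasonable to take document size into account; e.g., a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances. Both ranking functions take an integer *`normalization`* option that specifies whether and how a document's length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using `|` (for example, `2|4`).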

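  • 0 (the default) ignores the document length

  • 1 divides the rank by 1 + the logarithm of the document length

  • 2 divides the rank by the document length

  • 4 divides the rank by the mean harmonic distance between extents (this is implemented only by `ts_rank_cd`)

  • 8 divides the rank by the number of unique words in the document

  • 16 divides the rank by 1 + the logarithm of the number of unique words in the document

  • 32 divides the rank by itself + 1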

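If more than one flag bit is specified, the transformations are applied in the order listed.

It is important to note that the ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1% or 100% as sometimes desired. Normalization option 32 (`rank/(rank+1)`) can be applied to scale all ranks into the range zero to one, but of course this is just a cosmetic change; it will not affect the ordering of the search results.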
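The bit-mask behavior can be sketched in Python as follows. This is an illustration of the arithmetic described above, not PostgreSQL's implementation: the function name `normalize_rank` and its parameters are hypothetical, the use of the natural logarithm is an assumption (the text does not state the base), and the guards against empty documents are added only to keep the sketch safe to run.

```python
import math


def normalize_rank(rank, flags, doc_length, unique_words,
                   mean_harmonic_distance=None):
    """Apply the listed normalization flags to a raw rank, in listed order.

    Hypothetical helper. Assumes natural logarithms; the real functions
    may use a different base or formula.
    """
    if flags & 1 and doc_length > 0:
        rank /= 1 + math.log(doc_length)
    if flags & 2 and doc_length > 0:
        rank /= doc_length
    if flags & 4 and mean_harmonic_distance:
        # Only meaningful for cover-density ranking (ts_rank_cd).
        rank /= mean_harmonic_distance
    if flags & 8 and unique_words > 0:
        rank /= unique_words
    if flags & 16 and unique_words > 0:
        rank /= 1 + math.log(unique_words)
    if flags & 32:
        # rank / (rank + 1): maps non-negative ranks into [0, 1).
        rank /= rank + 1
    return rank


if __name__ == "__main__":
    # Combine "divide by document length" and "divide by unique words".
    print(normalize_rank(1.0, 2 | 8, doc_length=10, unique_words=5))
    # Option 32 rescales but keeps the relative order of two ranks.
    print(normalize_rank(1.0, 32, 10, 5) < normalize_rank(2.0, 32, 10, 5))
```

Because `rank/(rank+1)` is increasing for non-negative ranks, two documents keep their relative order after option 32 is applied, which is why the text calls it a cosmetic change.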

Here is an example that selects only the ten highest-ranked matches:

SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
### 12.3.4. Highlighting Results

 To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL provides a function `ts_headline` that implements this functionality.


ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text

`ts_headline` accepts a document along with a query, and returns an excerpt from the document in which terms from the query are highlighted. The configuration to be used to parse the document can be specified by *`config`*; if *`config`* is omitted, the `default_text_search_config` configuration is used.

 If an *`options`* string is specified it must consist of a comma-separated list of one or more *`option`*`=`*`value`* pairs. The available options are:

* `MaxWords`, `MinWords` (integers): these numbers determine the longest and shortest headlines to output. The default values are 35 and 15.

* `ShortWord` (integer): words of this length or less will be dropped at the start and end of a headline, unless they are query terms. The default value of three eliminates common English articles.

* `HighlightAll` (boolean): if `true` the whole document will be used as the headline, ignoring the preceding three parameters. The default is `false`.

* `MaxFragments` (integer): maximum number of text fragments to display. The default value of zero selects a non-fragment-based headline generation method. A value greater than zero selects fragment-based headline generation (see below).

* `StartSel`, `StopSel` (strings): the strings with which to delimit query words appearing in the document, to distinguish them from other excerpted words. The default values are “`<b>`” and “`</b>`”, which can be suitable for HTML output.

* `FragmentDelimiter` (string): When more than one fragment is displayed, the fragments will be separated by this string. The default is “` ... `”.

 These option names are recognized case-insensitively. You must double-quote string values if they contain spaces or commas.
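When the options string is assembled by application code, the quoting rule above is easy to get wrong. The following Python sketch builds such a string; `build_headline_options` is a hypothetical helper, not part of PostgreSQL or any driver, and it assumes that values never contain double-quote characters themselves.

```python
def build_headline_options(**options):
    """Build a comma-separated option=value string (hypothetical helper).

    Assumes values contain no double-quote characters.
    """
    parts = []
    for name, value in options.items():
        if isinstance(value, bool):
            # Boolean options such as HighlightAll take true/false.
            text = "true" if value else "false"
        else:
            text = str(value)
        # Values containing spaces or commas must be double-quoted.
        if " " in text or "," in text:
            text = '"' + text + '"'
        parts.append(f"{name}={text}")
    return ", ".join(parts)


if __name__ == "__main__":
    print(build_headline_options(MaxWords=20, MinWords=5,
                                 FragmentDelimiter=" ... "))
```

The resulting string would be passed as the fourth argument of `ts_headline`, ideally as a bound query parameter rather than by string concatenation.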

 In non-fragment-based headline generation, `ts_headline` locates matches for the given *`query`* and chooses a single one to display, preferring matches that have more query words within the allowed headline length. In fragment-based headline generation, `ts_headline` locates the query matches and splits each match into “fragments” of no more than `MaxWords` words each, preferring fragments with more query words, and when possible “stretching” fragments to include surrounding words. The fragment-based mode is thus more useful when the query matches span large sections of the document, or when it's desirable to display multiple matches. In either mode, if no query matches can be identified, then a single fragment of the first `MinWords` words in the document will be displayed.

 For example:

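SELECT ts_headline('english',
  'The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query.',
  to_tsquery('english', 'query & similarity'));
                        ts_headline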