提交 6ffa7140 编写于 作者: H Heikki Linnakangas

Use single-byte Boyer-Moore-Horspool search even with multibyte encodings.

This is a backport of upstream commit 9556aa01, and Tom Lane's follow
up commit 6119060d. Cherry-picked it now, to avoid the 256 MB limit on
strings. We used to have an old workaround for that issue in GPDB, but lost
it as part of the 9.1 merge.

Upstream commit:

commit 9556aa01
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Fri Jan 25 16:25:05 2019 +0200

    Use single-byte Boyer-Moore-Horspool search even with multibyte encodings.

    The old implementation first converted the input strings to arrays of
    wchars, and performed the conversion on those. However, the conversion is
    expensive, and for a large input string, consumes a lot of memory.
    Allocating the large arrays also meant that these functions could not be
    used on strings larger 1 GB / pg_encoding_max_length() (256 MB for UTF-8).

    Avoid the conversion, and instead use the single-byte algorithm even with
    multibyte encodings. That can get fooled, if there is a matching byte
    sequence in the middle of a multi-byte character, so to eliminate false
    positives like that, we verify any matches by walking the string character
    by character with pg_mblen(). Also, if the caller needs the position of
    the match, as a character-offset, we also need to walk the string to count
    the characters.

    Performance testing shows that walking the whole string with pg_mblen() is
    somewhat slower than converting the whole string to wchars. It's still
    often a win, though, because we don't need to do it if there is no match,
    and even when there is, we only need to walk up to the point where the
    match is, not the whole string. Even in the worst case, there would be
    room for optimization: Much of the CPU time in the current loop with
    pg_mblen() is function call overhead, and could be improved by inlining
    pg_mblen() and/or the encoding-specific mblen() functions. But I didn't
    attempt to do that as part of this patch.

    Most of the callers of text_position_setup/next functions were actually
    not interested in the position of the match, counted in characters. To
    cater for them, refactor the text_position_next() interface into two
    parts: searching for the next match (text_position_next()), and returning
    the current match's position as a pointer (text_position_get_match_ptr())
    or as a character offset (text_position_get_match_pos()). Getting the
    pointer to the match is a more convenient API for many callers, and with
    UTF-8, it allows skipping the character-walking step altogether, because
    UTF-8 can't have false matches even when treated like raw byte strings.

    Reviewed-by: John Naylor
    Discussion: https://www.postgresql.org/message-id/3173d989-bc1c-fc8a-3b69-f24246f73876%40iki.fi
上级 1265f9c4
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册