speed up text_position() for utf-8
Hi,
Commit 9556aa01c69 (Use single-byte Boyer-Moore-Horspool search even
with multibyte encodings) was a speed improvement for the majority of
cases, but when the match was toward the end of the string, the
slowdown in text_position_get_match_pos() was noticeable. It was found
that there was a lot of overhead in pg_mblen(). [1]
The attached exploratory PoC improves this for utf-8. It applies on
top of v25 of my utf-8 verification patch in [2], since one approach
relies on the DFA from it. The other three approaches are:
- a version of pg_utf_mblen() that uses a lookup table [3] (sketched below)
- an inlined copy of pg_utf_mblen()
- an ascii fast path with a fallback to the inlined copy of pg_utf_mblen()
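For reference, the core of the lookup-table technique from [3], as it
appears in the attached PoC, is a single load indexed by the top five
bits of the lead byte. In this condensed sketch, indices 0-15 cover
ascii, 16-23 continuation bytes, 24-27 two-byte leads, 28-29
three-byte leads, 30 four-byte leads, and 31 invalid bytes, which map
to length 1 per pg_utf_mblen()'s convention:

  static const char lengths[32] = {
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 1
  };

  len = lengths[s[0] >> 3];   /* s points to the lead byte */
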
The test is attached and the test function is part of the patch. It's
based on the test used in the commit above. The test searches for a
string that's at the end of a ~1 million byte string. This is on gcc
11 with 2-3 runs to ensure repeatability, but I didn't bother with
statistics because the differences are pretty big:
patch | no match | ascii | multibyte
-----------------------------------------+----------+-------+-----------
PG11 | 1120 | 1100 | 900
master | 381 | 2350 | 1900
DFA | 386 | 1640 | 1640
branchless utf mblen | 387 | 4100 | 2600
inline pg_utf_mblen() | 380 | 1080 | 920
inline pg_utf_mblen() + ascii fast path | 382 | 470 | 918
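(For context, the benchmark_textpos() function in the attached patch
just calls textpos() in a loop via DirectFunctionCall2Coll(). A
hypothetical invocation, after registering the function with CREATE
FUNCTION against the regress module, would be along the lines of
SELECT benchmark_textpos(<haystack>, <needle>, 100).)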
Neither of the branchless approaches worked well. The DFA can't work
as well here as in verification, since it must track the exact state
in order to count characters rather than merely validate them.
Inlining pg_utf_mblen() restores worst-case performance to PG11
levels. The ascii fast path is a nice improvement on top of that. A
similar approach could work for pg_mbstrlen() as well, but I haven't
looked into that yet. There are other callers of pg_mblen(), but I
haven't looked into whether they are performance-sensitive. A more
general application would be preferable to a targeted one.
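To make the pg_mbstrlen() idea above concrete, a rough sketch
(hypothetical, not part of the attached patch; pg_utf_mblen_inline()
stands in for an inlinable copy of pg_utf_mblen(), and STRIDE_LENGTH
is 16 as in the PoC) could look like:

  static int
  utf8_strlen_with_len(const char *mbstr, int limit)
  {
      int         num = 0;

      /*
       * Skip over whole strides of ascii. is_valid_ascii() also rejects
       * zero bytes, so this cannot run past a terminator.
       */
      while (limit >= STRIDE_LENGTH &&
             is_valid_ascii((const unsigned char *) mbstr, STRIDE_LENGTH))
      {
          mbstr += STRIDE_LENGTH;
          limit -= STRIDE_LENGTH;
          num += STRIDE_LENGTH;
      }

      /* byte-at-a-time fallback, as in pg_mbstrlen_with_len() */
      while (limit > 0 && *mbstr)
      {
          int         l = pg_utf_mblen_inline((const unsigned char *) mbstr);

          mbstr += l;
          limit -= l;
          num++;
      }
      return num;
  }
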
[1]: /messages/by-id/b65df3d8-1f59-3bd7-ebbe-68b81d5a76a4@iki.fi
[2]: /messages/by-id/CAFBsxsHG=g6W8Mie+_NO8dV6O0pO2stxrnS=me5ZmGqk--fd5g@mail.gmail.com
[3]: https://github.com/skeeto/branchless-utf8/blob/master/utf8.h
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
prototype-textpos-utf8.patch (application/x-patch)
src/backend/utils/adt/varlena.c | 112 ++++++++++++++++++++++++++++++++++++++++
src/common/wchar.c | 90 ++------------------------------
src/include/mb/pg_wchar.h | 53 -------------------
src/test/regress/regress.c | 18 +++++++
4 files changed, 133 insertions(+), 140 deletions(-)
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index bd3091bbfb..cc93818007 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -26,6 +26,7 @@
#include "common/unicode_norm.h"
#include "lib/hyperloglog.h"
#include "libpq/pqformat.h"
+#include "mb/pg_utf8.h"
#include "miscadmin.h"
#include "nodes/execnodes.h"
#include "parser/scansup.h"
@@ -1468,6 +1469,116 @@ text_position_get_match_pos(TextPositionState *state)
{
if (!state->is_multibyte)
return state->last_match - state->str1 + 1;
+ else if (GetDatabaseEncoding() == PG_UTF8)
+ {
+#if 0 /* dfa */
+
+ int utf8_state_count[UTF8_LAST_STATE + 1] = {0};
+ uint32 dfa_state = BGN;
+
+ /*
+ * See pg_utf8.h for an explanation of the state machine. Unlike in
+ * pg_wchar.c, we must calculate the exact state so we can use it as
+ * an array index.
+ */
+ while (state->refpoint < state->last_match)
+ {
+ const unsigned char *s = (const unsigned char *) state->refpoint++;
+
+ dfa_state = (Utf8Transition[*s] >> dfa_state) & 31;
+ utf8_state_count[dfa_state]++;
+ }
+ Assert(state->refpoint == state->last_match);
+ state->refpos += utf8_state_count[END];
+ return state->refpos + 1;
+
+#endif
+#if 0 /* like pg_utf_mblen(), but different algorithm: */
+
+ static const char lengths[] = {
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+ 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 1
+ };
+
+ while (state->refpoint < state->last_match)
+ {
+ const unsigned char *s = (const unsigned char *) state->refpoint;
+
+ state->refpoint += lengths[s[0] >> 3];
+ state->refpos++;
+ }
+
+ Assert(state->refpoint == state->last_match);
+ return state->refpos + 1;
+#endif
+#if 0 /* inline pg_utf_mblen() */
+
+ int len = 0;
+
+ while (state->refpoint < state->last_match)
+ {
+ const unsigned char *s = (const unsigned char *) state->refpoint;
+
+ if ((*s & 0x80) == 0)
+ len = 1;
+ else if ((*s & 0xe0) == 0xc0)
+ len = 2;
+ else if ((*s & 0xf0) == 0xe0)
+ len = 3;
+ else if ((*s & 0xf8) == 0xf0)
+ len = 4;
+ else
+ len = 1;
+
+ state->refpoint += len;
+ state->refpos++;
+ }
+
+ Assert(state->refpoint == state->last_match);
+ return state->refpos + 1;
+#endif
+#if 1 /* try ASCII fast path, then fall back to inlined pg_utf_mblen() */
+
+ int len = 0;
+
+#define STRIDE_LENGTH 16
+
+ /* fast path for ascii text. */
+ while (state->refpoint + STRIDE_LENGTH <= state->last_match)
+ {
+ if (is_valid_ascii((const unsigned char *) state->refpoint, STRIDE_LENGTH))
+ {
+ state->refpoint += STRIDE_LENGTH;
+ state->refpos += STRIDE_LENGTH;
+ }
+ else
+ break;
+ }
+
+ while (state->refpoint < state->last_match)
+ {
+ const unsigned char *s = (const unsigned char *) state->refpoint;
+
+ if ((*s & 0x80) == 0)
+ len = 1;
+ else if ((*s & 0xe0) == 0xc0)
+ len = 2;
+ else if ((*s & 0xf0) == 0xe0)
+ len = 3;
+ else if ((*s & 0xf8) == 0xf0)
+ len = 4;
+ else
+ len = 1;
+
+ state->refpoint += len;
+ state->refpos++;
+ }
+
+ Assert(state->refpoint == state->last_match);
+ return state->refpos + 1;
+#endif
+ }
+
else
{
/* Convert the byte position to char position. */
@@ -1481,6 +1592,7 @@ text_position_get_match_pos(TextPositionState *state)
}
}
+
/*
* Reset search state to the initial state installed by text_position_setup.
*
diff --git a/src/common/wchar.c b/src/common/wchar.c
index be931c5e92..2c51c3b6f9 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -12,6 +12,7 @@
*/
#include "c.h"
+#include "mb/pg_utf8.h"
#include "mb/pg_wchar.h"
@@ -1750,94 +1751,9 @@ pg_utf8_verifychar(const unsigned char *s, int len)
return l;
}
-/*
- * The fast path of the UTF-8 verifier uses a deterministic finite automaton
- * (DFA) for multibyte characters. In a traditional table-driven DFA, the
- * input byte and current state are used to compute an index into an array of
- * state transitions. Since the address of the next transition is dependent
- * on this computation, there is latency in executing the load instruction,
- * and the CPU is not kept busy.
- *
- * Instead, we use a "shift-based" DFA as described by Per Vognsen:
- *
- * https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725
- *
- * In a shift-based DFA, the input byte is an index into array of integers
- * whose bit pattern encodes the state transitions. To compute the next
- * state, we simply right-shift the integer by the current state and apply a
- * mask. In this scheme, the address of the transition only depends on the
- * input byte, so there is better pipelining.
- *
- * The naming convention for states and transitions was adopted from a UTF-8
- * to UTF-16/32 transcoder, whose table is reproduced below:
- *
- * https://github.com/BobSteagall/utf_utils/blob/6b7a465265de2f5fa6133d653df0c9bdd73bbcf8/src/utf_utils.cpp
- *
- * ILL ASC CR1 CR2 CR3 L2A L3A L3B L3C L4A L4B L4C CLASS / STATE
- * ==========================================================================
- * err, END, err, err, err, CS1, P3A, CS2, P3B, P4A, CS3, P4B, | BGN/END
- * err, err, err, err, err, err, err, err, err, err, err, err, | ERR
- * |
- * err, err, END, END, END, err, err, err, err, err, err, err, | CS1
- * err, err, CS1, CS1, CS1, err, err, err, err, err, err, err, | CS2
- * err, err, CS2, CS2, CS2, err, err, err, err, err, err, err, | CS3
- * |
- * err, err, err, err, CS1, err, err, err, err, err, err, err, | P3A
- * err, err, CS1, CS1, err, err, err, err, err, err, err, err, | P3B
- * |
- * err, err, err, CS2, CS2, err, err, err, err, err, err, err, | P4A
- * err, err, CS2, err, err, err, err, err, err, err, err, err, | P4B
- *
- * In the most straightforward implementation, a shift-based DFA for UTF-8
- * requires 64-bit integers to encode the transitions, but with an SMT solver
- * it's possible to find state numbers such that the transitions fit within
- * 32-bit integers, as Dougall Johnson demonstrated:
- *
- * https://gist.github.com/dougallj/166e326de6ad4cf2c94be97a204c025f
- *
- * This packed representation is the reason for the seemingly odd choice of
- * state values below.
- */
-/* Error */
-#define ERR 0
-/* Begin */
-#define BGN 11
-/* Continuation states, expect 1/2/3 continuation bytes */
-#define CS1 16
-#define CS2 1
-#define CS3 5
-/* Leading byte was E0/ED, expect 1 more continuation byte */
-#define P3A 6
-#define P3B 20
-/* Leading byte was F0/F4, expect 2 more continuation bytes */
-#define P4A 25
-#define P4B 30
-/* Begin and End are the same state */
-#define END BGN
-
-/* the encoded state transitions for the lookup table */
-
-/* ASCII */
-#define ASC (END << BGN)
-/* 2-byte lead */
-#define L2A (CS1 << BGN)
-/* 3-byte lead */
-#define L3A (P3A << BGN)
-#define L3B (CS2 << BGN)
-#define L3C (P3B << BGN)
-/* 4-byte lead */
-#define L4A (P4A << BGN)
-#define L4B (CS3 << BGN)
-#define L4C (P4B << BGN)
-/* continuation byte */
-#define CR1 (END << CS1) | (CS1 << CS2) | (CS2 << CS3) | (CS1 << P3B) | (CS2 << P4B)
-#define CR2 (END << CS1) | (CS1 << CS2) | (CS2 << CS3) | (CS1 << P3B) | (CS2 << P4A)
-#define CR3 (END << CS1) | (CS1 << CS2) | (CS2 << CS3) | (CS1 << P3A) | (CS2 << P4A)
-/* invalid byte */
-#define ILL ERR
-
-static const uint32 Utf8Transition[256] =
+/* transition table for the UTF-8 DFA -- see pg_utf8.h for details */
+const uint32 Utf8Transition[256] =
{
/* ASCII */
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 6bd996b3d0..d93ccac263 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -699,57 +699,4 @@ extern int mic2latin_with_table(const unsigned char *mic, unsigned char *p,
extern WCHAR *pgwin32_message_to_UTF16(const char *str, int len, int *utf16len);
#endif
-
-/*
- * Verify a chunk of bytes for valid ASCII.
- *
- * Returns false if the input contains any zero bytes or bytes with the
- * high-bit set. Input len must be a multiple of 8.
- */
-static inline bool
-is_valid_ascii(const unsigned char *s, int len)
-{
- uint64 chunk,
- highbit_cum = UINT64CONST(0),
- zero_cum = UINT64CONST(0x8080808080808080);
-
- Assert(len % sizeof(chunk) == 0);
-
- while (len > 0)
- {
- memcpy(&chunk, s, sizeof(chunk));
-
- /*
- * Capture any zero bytes in this chunk.
- *
- * First, add 0x7f to each byte. This sets the high bit in each byte,
- * unless it was a zero. If any resulting high bits are zero, the
- * corresponding high bits in the zero accumulator will be cleared.
- *
- * If none of the bytes in the chunk had the high bit set, the max
- * value each byte can have after the addition is 0x7f + 0x7f = 0xfe,
- * and we don't need to worry about carrying over to the next byte. If
- * any input bytes did have the high bit set, it doesn't matter
- * because we check for those separately.
- */
- zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f));
-
- /* Capture any set bits in this chunk. */
- highbit_cum |= chunk;
-
- s += sizeof(chunk);
- len -= sizeof(chunk);
- }
-
- /* Check if any high bits in the high bit accumulator got set. */
- if (highbit_cum & UINT64CONST(0x8080808080808080))
- return false;
-
- /* Check if any high bits in the zero accumulator got cleared. */
- if (zero_cum != UINT64CONST(0x8080808080808080))
- return false;
-
- return true;
-}
-
#endif /* PG_WCHAR_H */
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index 351d79e1f0..7f23cffa21 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -1206,3 +1206,21 @@ binary_coercible(PG_FUNCTION_ARGS)
PG_RETURN_BOOL(IsBinaryCoercible(srctype, targettype));
}
+
+
+PG_FUNCTION_INFO_V1(benchmark_textpos);
+Datum
+benchmark_textpos(PG_FUNCTION_ARGS)
+{
+ text *t1 = PG_GETARG_TEXT_PP(0);
+ text *t2 = PG_GETARG_TEXT_PP(1);
+ int loops = PG_GETARG_INT32(2);
+ int i;
+ Datum result = Int32GetDatum(0);
+
+ for (i = 0; i < loops; i++)
+ {
+ result = DirectFunctionCall2Coll(textpos, PG_GET_COLLATION(), PointerGetDatum(t1), PointerGetDatum(t2));
+ }
+ return result;
+}
I wrote:
The test is attached and the test function is part of the patch. It's
based on the test used in the commit above. The test searches for a
string that's at the end of a ~1 million byte string. This is on gcc
11 with 2-3 runs to ensure repeatability, but I didn't bother with
statistics because the differences are pretty big:
patch | no match | ascii | multibyte
-----------------------------------------+----------+-------+-----------
PG11 | 1120 | 1100 | 900
master | 381 | 2350 | 1900
DFA | 386 | 1640 | 1640
branchless utf mblen | 387 | 4100 | 2600
inline pg_utf_mblen() | 380 | 1080 | 920
inline pg_utf_mblen() + ascii fast path | 382 | 470 | 918
I failed to mention that the above numbers are milliseconds, so
smaller is better.
--
John Naylor
EDB: http://www.enterprisedb.com
Attached is a short patch series to develop some ideas for inlining
pg_utf_mblen().
0001 puts the main implementation of pg_utf_mblen() into an inline
function and uses this in pg_mblen(). This is somewhat faster in the
strpos tests, so that gives some measure of the speedup expected for
other callers. Text search seems to call this a lot, so this might
have noticeable benefit.
0002 refactors text_position_get_match_pos() to use
pg_mbstrlen_with_len(). This itself is significantly faster when
combined with 0001, likely because the latter can inline the call to
pg_mblen(). The intention is to speed up more than just text_position.
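(On master, pg_mblen() dispatches through a function pointer in
pg_wchar_table, which the compiler can't see through; 0001 gives UTF-8
a direct, inlinable path, which is what 0002 benefits from here.)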
0003 explicitly specializes pg_mbstrlen_with_len() for the inline
version of pg_utf_mblen(), but it turns out to be almost as slow as
master for ascii. It doesn't help if I undo the previous change in
pg_mblen(), and I haven't investigated why yet.
0002 looks good now, but the experience with 0003 makes me hesitant to
propose this seriously until I can figure out what's going on there.
The test is as earlier, a worst-case substring search, times in milliseconds.
patch | no match | ascii | multibyte
--------+----------+-------+-----------
PG11 | 1220 | 1220 | 1150
master | 385 | 2420 | 1980
0001 | 390 | 2180 | 1670
0002 | 389 | 1330 | 1100
0003 | 391 | 2100 | 1360
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v2-0002-Refactor-text_position_get_match_pos-to-use-pg_mb.patch (application/octet-stream)
From 575e7e488aa8e168d7cf99c7bcc069753d6199ca Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 17 Dec 2021 12:27:21 -0400
Subject: [PATCH v2 2/3] Refactor text_position_get_match_pos() to use
pg_mbstrlen_with_len()
---
src/backend/utils/adt/varlena.c | 28 +++++-----------------------
1 file changed, 5 insertions(+), 23 deletions(-)
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index bd3091bbfb..449fcfc4bf 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -51,7 +51,6 @@ typedef struct varlena VarString;
*/
typedef struct
{
- bool is_multibyte; /* T if multibyte encoding */
bool is_multibyte_char_in_char; /* need to check char boundaries? */
char *str1; /* haystack string */
@@ -1221,20 +1220,11 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
* and continue the search if it was a false match.
*/
if (pg_database_encoding_max_length() == 1)
- {
- state->is_multibyte = false;
state->is_multibyte_char_in_char = false;
- }
else if (GetDatabaseEncoding() == PG_UTF8)
- {
- state->is_multibyte = true;
state->is_multibyte_char_in_char = false;
- }
else
- {
- state->is_multibyte = true;
state->is_multibyte_char_in_char = true;
- }
state->str1 = VARDATA_ANY(t1);
state->str2 = VARDATA_ANY(t2);
@@ -1466,19 +1456,11 @@ text_position_get_match_ptr(TextPositionState *state)
static int
text_position_get_match_pos(TextPositionState *state)
{
- if (!state->is_multibyte)
- return state->last_match - state->str1 + 1;
- else
- {
- /* Convert the byte position to char position. */
- while (state->refpoint < state->last_match)
- {
- state->refpoint += pg_mblen(state->refpoint);
- state->refpos++;
- }
- Assert(state->refpoint == state->last_match);
- return state->refpos + 1;
- }
+ /* Convert the byte position to char position. */
+ state->refpos += pg_mbstrlen_with_len(state->refpoint,
+ state->last_match - state->refpoint);
+ state->refpoint = state->last_match;
+ return state->refpos + 1;
}
/*
--
2.31.1
v2-0003-Specialize-pg_mbstrlen_with_len-for-UTF-8.patch (application/octet-stream)
From 6eb47127225ba2e97f3b5f2e9befef5c35e4f065 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 17 Dec 2021 12:40:01 -0400
Subject: [PATCH v2 3/3] Specialize pg_mbstrlen_with_len() for UTF-8
---
src/backend/utils/mb/mbutils.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 91eea625b9..c03eadaf61 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -1011,7 +1011,13 @@ pg_mbstrlen_with_len(const char *mbstr, int limit)
while (limit > 0 && *mbstr)
{
- int l = pg_mblen(mbstr);
+ int l;
+
+ /* avoid the overhead of a function call for UTF-8 */
+ if (GetDatabaseEncoding() == PG_UTF8)
+ l = pg_utf_mblen_fast((const unsigned char *) mbstr);
+ else
+ l = pg_mblen(mbstr);
limit -= l;
mbstr += l;
--
2.31.1
v2-0001-Move-the-implementation-of-pg_utf_mblen-to-an-inl.patch (application/octet-stream)
From 039ddd985d24a2efccbf9fc34bc5ef36a3cfd9bb Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 17 Dec 2021 12:55:34 -0400
Subject: [PATCH v2 1/3] Move the implementation of pg_utf_mblen() to an inline
function
Use that to specialize pg_mblen() for UTF-8. This provides a modest
speedup for code that calls pg_mblen() in a loop.
This has a side effect of removing the unnecessary check for zero bytes
in pg_utf8_verifychar().
WIP: Maybe "fast" in the name is misleading -- the point is to be inlinable.
---
src/backend/utils/mb/mbutils.c | 6 ++++-
src/common/wchar.c | 45 ++--------------------------------
src/include/mb/pg_wchar.h | 33 +++++++++++++++++++++++++
3 files changed, 40 insertions(+), 44 deletions(-)
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index a13c398f4a..91eea625b9 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -965,7 +965,11 @@ pg_encoding_wchar2mb_with_len(int encoding,
int
pg_mblen(const char *mbstr)
{
- return pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr);
+ /* avoid the overhead of a function call for UTF-8 */
+ if (GetDatabaseEncoding() == PG_UTF8)
+ return pg_utf_mblen_fast((const unsigned char *) mbstr);
+ else
+ return pg_wchar_table[DatabaseEncoding->encoding].mblen((const unsigned char *) mbstr);
}
/* returns the display length of a multibyte character */
diff --git a/src/common/wchar.c b/src/common/wchar.c
index a6bffd0642..5fd682829c 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -536,37 +536,11 @@ pg_wchar2utf_with_len(const pg_wchar *from, unsigned char *to, int len)
/*
* Return the byte length of a UTF8 character pointed to by s
- *
- * Note: in the current implementation we do not support UTF8 sequences
- * of more than 4 bytes; hence do NOT return a value larger than 4.
- * We return "1" for any leading byte that is either flat-out illegal or
- * indicates a length larger than we support.
- *
- * pg_utf2wchar_with_len(), utf8_to_unicode(), pg_utf8_islegal(), and perhaps
- * other places would need to be fixed to change this.
*/
int
pg_utf_mblen(const unsigned char *s)
{
- int len;
-
- if ((*s & 0x80) == 0)
- len = 1;
- else if ((*s & 0xe0) == 0xc0)
- len = 2;
- else if ((*s & 0xf0) == 0xe0)
- len = 3;
- else if ((*s & 0xf8) == 0xf0)
- len = 4;
-#ifdef NOT_USED
- else if ((*s & 0xfc) == 0xf8)
- len = 5;
- else if ((*s & 0xfe) == 0xfc)
- len = 6;
-#endif
- else
- len = 1;
- return len;
+ return pg_utf_mblen_fast(s);
}
/*
@@ -1724,22 +1698,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
static int
pg_utf8_verifychar(const unsigned char *s, int len)
{
- int l;
-
- if ((*s & 0x80) == 0)
- {
- if (*s == '\0')
- return -1;
- return 1;
- }
- else if ((*s & 0xe0) == 0xc0)
- l = 2;
- else if ((*s & 0xf0) == 0xe0)
- l = 3;
- else if ((*s & 0xf8) == 0xf0)
- l = 4;
- else
- l = 1;
+ int l = pg_utf_mblen_fast(s);
if (l > len)
return -1;
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index d93ccac263..a8d67c1214 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -590,6 +590,39 @@ extern bool pg_utf8_islegal(const unsigned char *source, int length);
extern int pg_utf_mblen(const unsigned char *s);
extern int pg_mule_mblen(const unsigned char *s);
+/*
+ * Return the byte length of a UTF8 character pointed to by s
+ * Workhorse for pg_utf_mblen().
+ *
+ * Declared as inline for callers of pg_mblen() that are performance critical
+ * enough to justify specializing for UTF-8.
+ *
+ * Note: in the current implementation we do not support UTF8 sequences
+ * of more than 4 bytes; hence do NOT return a value larger than 4.
+ * We return "1" for any leading byte that is either flat-out illegal or
+ * indicates a length larger than we support.
+ *
+ * pg_utf2wchar_with_len(), utf8_to_unicode(), pg_utf8_islegal(), and perhaps
+ * other places would need to be fixed to change this.
+ */
+static inline int
+pg_utf_mblen_fast(const unsigned char *s)
+{
+ int len;
+
+ if ((*s & 0x80) == 0)
+ len = 1;
+ else if ((*s & 0xe0) == 0xc0)
+ len = 2;
+ else if ((*s & 0xf0) == 0xe0)
+ len = 3;
+ else if ((*s & 0xf8) == 0xf0)
+ len = 4;
+ else
+ len = 1;
+ return len;
+}
+
/*
* The remaining functions are backend-only.
*/
--
2.31.1
I wrote:
0001 puts the main implementation of pg_utf_mblen() into an inline
function and uses this in pg_mblen(). This is somewhat faster in the
strpos tests, so that gives some measure of the speedup expected for
other callers. Text search seems to call this a lot, so this might
have noticeable benefit.
0002 refactors text_position_get_match_pos() to use
pg_mbstrlen_with_len(). This itself is significantly faster when
combined with 0001, likely because the latter can inline the call to
pg_mblen(). The intention is to speed up more than just text_position.
0003 explicitly specializes pg_mbstrlen_with_len() for the inline
version of pg_utf_mblen(), but it turns out to be almost as slow as
master for ascii. It doesn't help if I undo the previous change in
pg_mblen(), and I haven't investigated why yet.
0002 looks good now, but the experience with 0003 makes me hesitant to
propose this seriously until I can figure out what's going on there.
The test is as earlier, a worst-case substring search, times in milliseconds.
patch | no match | ascii | multibyte
--------+----------+-------+-----------
PG11 | 1220 | 1220 | 1150
master | 385 | 2420 | 1980
0001 | 390 | 2180 | 1670
0002 | 389 | 1330 | 1100
0003 | 391 | 2100 | 1360
I tried this test on a newer CPU, and 0003 had no regression. Both
systems used gcc 11.2. Rather than try to figure out why an experiment
had unexpected behavior, I plan to test 0001 and 0002 on a couple
different compilers/architectures and call it a day. It's also worth
noting that 0002 by itself seemed to be decently faster on the newer
machine, but not as fast as 0001 and 0002 together.
Looking at the assembly, pg_mblen() is inlined into
pg_mbstrlen[_with_len]() and pg_mbcliplen(), so the specialization for
utf-8 in 0001 would be inlined in the other 3 as well. That's only a
few bytes, so I think it's fine.
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
I tried this test on a newer CPU, and 0003 had no regression. Both
systems used gcc 11.2. Rather than try to figure out why an experiment
had unexpected behavior, I plan to test 0001 and 0002 on a couple
different compilers/architectures and call it a day. It's also worth
noting that 0002 by itself seemed to be decently faster on the newer
machine, but not as fast as 0001 and 0002 together.
I tested four machines with various combinations of patches, and it
seems the only thing they all agree on is that 0002 is a decent
improvement (full results attached). The others can be faster or
slower. 0002 also simplifies things, so it has that going for it. I
plan to commit that this week unless there are objections.
--
John Naylor
EDB: http://www.enterprisedb.com