Speeding up text_position_next with multibyte encodings

Started by Heikki Linnakangas · over 7 years ago · 14 messages · pgsql-hackers
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

Attached is a patch to speed up text_position_setup/next(), in some
common cases with multibyte encodings.

text_position_next() uses the Boyer-Moore-Horspool search algorithm,
with a skip table. Currently, with a multi-byte encoding, we first
convert both input strings to arrays of wchars, and run the algorithm on
those arrays.

This patch removes the mb->wchar conversion, and instead runs the
single-byte version of the algorithm directly on the inputs, even with
multi-byte encodings. That avoids the mb->wchar conversion altogether,
when there is no match. Even when there is a match, we don't need to
convert the whole input string to wchars. It's enough to count the
characters up to the match, using pg_mblen(). And when
text_position_setup/next() are used as part of split_part() or replace()
functions, we're not actually even interested in the character position,
so we can skip that too.
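The core idea can be sketched in standalone C (a simplified illustration with made-up names, not the patch's actual text_position_next() code, which also caps the skip table size and keeps state between calls):

```c
#include <stddef.h>
#include <string.h>

/*
 * Simplified Boyer-Moore-Horspool search over raw bytes, run directly
 * on a multibyte-encoded string.  Returns a pointer to the first
 * byte-level match, or NULL if there is none.  Illustrative sketch
 * only; the real code must additionally verify that a match starts on
 * a character boundary for non-self-synchronizing encodings.
 */
static const unsigned char *
bmh_search(const unsigned char *haystack, size_t hlen,
           const unsigned char *needle, size_t nlen)
{
    size_t skip[256];
    size_t i;

    if (nlen == 0 || hlen < nlen)
        return NULL;

    /* A byte not in the needle lets us skip the whole needle length. */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    /* Bytes in the needle (except its last byte) allow shorter skips. */
    for (i = 0; i < nlen - 1; i++)
        skip[needle[i]] = nlen - 1 - i;

    /* i is the haystack index aligned with the end of the needle. */
    i = nlen - 1;
    while (i < hlen)
    {
        if (memcmp(haystack + i - (nlen - 1), needle, nlen) == 0)
            return haystack + i - (nlen - 1);
        i += skip[haystack[i]];
    }
    return NULL;
}
```

If a caller only needs a yes/no answer or a byte offset, as replace() and split_part() effectively do, no character counting is required at all.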

Avoiding the large allocation is nice, too. That was actually why I
started to look into this: a customer complained that text_position fails
with strings larger than 256 MB.

Using the byte-oriented search on multibyte strings might return false
positives, though. To make that work, after finding a match, we verify
that the match doesn't fall in the middle of a multi-byte character.
However, as an important special case, that cannot happen with UTF-8,
because it has the property that the byte sequence of one character
never appears as part of another character. I think some other encodings
might have the same property, but I wasn't sure.
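For encodings without that property, the boundary check after a byte-level match might look roughly like this (a self-contained sketch, not the patch's code; char_len() here stands in for pg_mblen()):

```c
#include <stddef.h>

/*
 * Stand-in for pg_mblen(): byte length of the character starting at s,
 * assuming valid UTF-8.  (For UTF-8 the boundary check below is not
 * actually needed -- it is used here only to keep the sketch
 * self-contained and testable.)
 */
static int
char_len(const unsigned char *s)
{
    if (*s < 0x80)
        return 1;
    if ((*s & 0xE0) == 0xC0)
        return 2;
    if ((*s & 0xF0) == 0xE0)
        return 3;
    return 4;
}

/*
 * After a byte-level match at 'offset', verify it begins on a
 * character boundary by walking from the start of the string one
 * character at a time.
 */
static int
is_char_boundary(const unsigned char *str, size_t offset)
{
    size_t pos = 0;

    while (pos < offset)
        pos += (size_t) char_len(str + pos);
    return pos == offset;
}
```

A match that fails this check is a false positive and the search simply continues past it.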

This is a win in most cases. One case is slower: calling position() with
a large haystack input, where the match is near the end of the string.
Counting the characters up to that point is slower than the mb->wchar
conversion. I think we could avoid that regression too, if we had a more
optimized function to count characters. Currently, the code calls
pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(), the similar
loop through all the characters is a tight loop within the function.
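The kind of tight counting loop that would help might look like this for UTF-8 (a sketch under the assumption of valid input, not PostgreSQL code):

```c
#include <stddef.h>

/*
 * Count the characters in the first 'nbytes' bytes of a valid UTF-8
 * string.  The length decoding is inlined in the loop, avoiding the
 * per-character function call overhead of invoking pg_mblen()
 * repeatedly.
 */
static size_t
utf8_count_chars(const unsigned char *s, size_t nbytes)
{
    size_t nchars = 0;
    size_t pos = 0;

    while (pos < nbytes)
    {
        unsigned char b = s[pos];

        if (b < 0x80)
            pos += 1;               /* ASCII */
        else if ((b & 0xE0) == 0xC0)
            pos += 2;               /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0)
            pos += 3;               /* 3-byte sequence */
        else
            pos += 4;               /* 4-byte sequence */
        nchars++;
    }
    return nchars;
}
```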

Thoughts?

- Heikki

Attachments:

0001-Use-single-byte-Boyer-Moore-Horspool-search-even-wit.patch (text/x-patch, +253 -223)
#2 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Speeding up text_position_next with multibyte encodings

Hello.

At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in <3173d989-bc1c-fc8a-3b69-f24246f73876@iki.fi>

Attached is a patch to speed up text_position_setup/next(), in some
common cases with multibyte encodings.

text_position_next() uses the Boyer-Moore-Horspool search algorithm,
with a skip table. Currently, with a multi-byte encoding, we first
convert both input strings to arrays of wchars, and run the algorithm
on those arrays.

This patch removes the mb->wchar conversion, and instead runs the
single-byte version of the algorithm directly on the inputs, even with
multi-byte encodings. That avoids the mb->wchar conversion altogether,
when there is no match. Even when there is a match, we don't need to
convert the whole input string to wchars. It's enough to count the
characters up to the match, using pg_mblen(). And when
text_position_setup/next() are used as part of split_part() or
replace() functions, we're not actually even interested in the
character position, so we can skip that too.

Sounds reasonable. A partially matching character is simply ignored, so
the conversion is indeed unnecessary.

Avoiding the large allocation is nice, too. That was actually why I
started to look into this: a customer complained that text_position
fails with strings larger than 256 MB.

Using the byte-oriented search on multibyte strings might return false
positives, though. To make that work, after finding a match, we verify
that the match doesn't fall in the middle of a multi-byte
character. However, as an important special case, that cannot happen
with UTF-8, because it has the property that the byte sequence of one
character never appears as part of another character. I think some
other encodings might have the same property, but I wasn't sure.

Yes. That happens at least for EUC_JP, where the same byte can appear
at arbitrary positions. Maybe we can hard-code the restriction to
UTF-8 for the first step. (I'm not sure a second step will ever
happen.)

This is a win in most cases. One case is slower: calling position()
with a large haystack input, where the match is near the end of the
string. Counting the characters up to that point is slower than the
mb->wchar conversion. I think we could avoid that regression too, if
we had a more optimized function to count characters. Currently, the
code calls pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(),
the similar loop through all the characters is a tight loop within the
function.

Thoughts?

Couldn't we count the characters while searching for a match?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: Speeding up text_position_next with multibyte encodings

At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote
Attached is a patch to speed up text_position_setup/next(), in some
common cases with multibyte encodings.

Hi,

Unfortunately, the patch doesn't compile anymore due to:

varlena.c: In function ‘text_position_next_internal’:
varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this function)
Assert(start_ptr >= haystack && start_ptr <= haystack_end);

Could you please send an updated version? For now I'm moving it to the next CF.

#4 John Naylor
john.naylor@enterprisedb.com
In reply to: Heikki Linnakangas (#1)
Re: Speeding up text_position_next with multibyte encodings

On 11/30/18, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Unfortunately, the patch doesn't compile anymore due to:

varlena.c: In function ‘text_position_next_internal’:
varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this
function)
Assert(start_ptr >= haystack && start_ptr <= haystack_end);

Could you please send an updated version? For now I'm moving it to the next
CF.

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This is a win in most cases. One case is slower: calling position() with
a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script,
which runs position() in three scenarios:
short: 1 character needle at the end of a ~1000 character haystack,
repeated 1 million times
medium: 50 character needle at the end of a ~1 million character
haystack, repeated 1 million times
long: 250 character needle at the end of an ~80 million character
haystack (~230MB, comfortably below the 256MB limit Heikki reported),
done just one time

I took the average of 5 runs using Korean characters in both UTF-8
(3-byte) and EUC-KR (2-byte) encodings.

UTF-8
length master patch
short 2.26s 2.51s
med 2.28s 2.54s
long 3.07s 3.11s

EUC-KR
length master patch
short 2.29s 2.37s
med 2.29s 2.36s
long 1.75s 1.71s

With UTF-8, the patch is 11-12% slower on short and medium strings,
and about the same on long strings. With EUC-KR, the patch is about 3%
slower on short and medium strings, and 2-3% faster on long strings.
It seems the worst case is not that bad, and could be optimized, as
Heikki said.

-John Naylor

Attachments:

v2-0001-Use-single-byte-Boyer-Moore-Horspool-search-even-.patch (text/x-patch, +253 -223)
single-byte-bmh.sql (application/sql)
#5 John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#4)
Re: Speeding up text_position_next with multibyte encodings

On 12/14/18, John Naylor <jcnaylor@gmail.com> wrote:

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

I just noticed that the contrib/citext test fails. I've set the status
to waiting on author.

-John Naylor

#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: John Naylor (#4)
Re: Speeding up text_position_next with multibyte encodings

On 14/12/2018 20:20, John Naylor wrote:

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

Thanks for fixing that!

On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This is a win in most cases. One case is slower: calling position() with
a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script, ...

I'm afraid that script doesn't work as a performance test. The
position() function is immutable, so the result gets cached in the plan
cache. All you're measuring is the speed to get the constant from the
plan cache :-(.

I rewrote the same tests with a little C helper function, attached, to
fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the
loop counts so that the runtimes are reasonable, now that it actually
does some work.

UTF-8

length master patch
short 2.7s 10.9s
med 2.6s 11.4s
long 3.9s 9.9s

EUC_KR

length master patch
short 2.2s 7.5s
med 2.2s 3.5s
long 3.4s 3.6s

This doesn't look very flattering for the patch. Although, this is
deliberately testing the worst case.

You chose interesting characters for the UTF-8 test. The haystack is a
repeating pattern of byte sequence EC 99 84, and the needle is a
repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
skip table are '2', '1' and '250'. But the search bounces between the
'2' and '1' cases, and never hits the byte that would allow it to skip
250 bytes. Interesting case, I had not realized that that can happen.
But I don't think we need to put much weight on that, you could come up
with other scenarios where the current code has skip table collisions, too.

So overall, I think it's still a worthwhile tradeoff, given that that is
a worst case scenario. If you choose less pathological UTF-8 codepoints,
or there is no match or the match is closer to the beginning of the
string, the patch wins.
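The bouncing behaviour is easy to reproduce with a standalone sketch (a textbook Horspool table over these byte patterns; PostgreSQL's actual capped skip table has different default values):

```c
#include <stddef.h>
#include <string.h>

/*
 * Reproduce the skip-table "bounce": a haystack of repeating EC 99 84
 * searched for a needle of repeating EC 84 B1.  The examined
 * alignments alternate between bytes EC (skip 2) and 84 (skip 1) and
 * never land on 99, the byte that would allow the longest skip.
 * Returns 1 if any alignment lands on 0x99, -1 on an unexpected
 * match, 0 otherwise.
 */
static int
landed_on_0x99(const unsigned char *hay, size_t hlen,
               const unsigned char *needle, size_t nlen)
{
    size_t skip[256];
    size_t i;

    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    for (i = 0; i < nlen - 1; i++)
        skip[needle[i]] = nlen - 1 - i;

    for (i = nlen - 1; i < hlen; i += skip[hay[i]])
    {
        if (hay[i] == 0x99)
            return 1;           /* hit the "big skip" byte */
        if (memcmp(hay + i - (nlen - 1), needle, nlen) == 0)
            return -1;          /* unexpected match */
    }
    return 0;
}
```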

- Heikki

Attachments:

single-byte-bmh-benchmark.patch (text/x-patch, +18 -0)
single-byte-bmh-benchmark.sql (application/sql)
#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: John Naylor (#5)
Re: Speeding up text_position_next with multibyte encodings

On 14/12/2018 23:40, John Naylor wrote:

I just noticed that the contrib/citext test fails. I've set the status
to waiting on author.

Hmm, it works for me. What failure did you see?

- Heikki

#8 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#7)
Re: Speeding up text_position_next with multibyte encodings

On 23/12/2018 02:28, Heikki Linnakangas wrote:

On 14/12/2018 23:40, John Naylor wrote:

I just noticed that the contrib/citext test fails. I've set the status
to waiting on author.

Hmm, it works for me. What failure did you see?

Never mind, I'm seeing it now, with assertions enabled. Thanks, I'll
investigate!

- Heikki

#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#8)
Re: Speeding up text_position_next with multibyte encodings

On 23/12/2018 02:32, Heikki Linnakangas wrote:

On 23/12/2018 02:28, Heikki Linnakangas wrote:

On 14/12/2018 23:40, John Naylor wrote:

I just noticed that the contrib/citext test fails. I've set the status
to waiting on author.

Hmm, it works for me. What failure did you see?

Never mind, I'm seeing it now, with assertions enabled. Thanks, I'll
investigate!

The bug was in handling empty inputs. text_position_setup assumed and
asserted that neither the needle nor haystack are empty, expecting the
callers to have handled those special cases already, but not all callers
did. Here is a fixed version.

- Heikki

Attachments:

0001-Use-single-byte-Boyer-Moore-Horspool-search-even-wit-2.patch (text/x-patch, +258 -226)
#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Heikki Linnakangas (#6)
Re: Speeding up text_position_next with multibyte encodings

On 12/23/18 1:26 AM, Heikki Linnakangas wrote:

On 14/12/2018 20:20, John Naylor wrote:

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

Thanks for fixing that!

On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This is a win in most cases. One case is slower: calling position() with
a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script, ...

I'm afraid that script doesn't work as a performance test. The
position() function is immutable, so the result gets cached in the plan
cache. All you're measuring is the speed to get the constant from the
plan cache :-(.

I rewrote the same tests with a little C helper function, attached, to
fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the
loop counts so that the runtimes are reasonable, now that it actually
does some work.

UTF-8

length  master  patch
short   2.7s    10.9s
med     2.6s    11.4s
long    3.9s    9.9s

EUC_KR

length  master  patch
short   2.2s    7.5s
med     2.2s    3.5s
long    3.4s    3.6s

This doesn't look very flattering for the patch. Although, this is
deliberately testing the worst case.

You chose interesting characters for the UTF-8 test. The haystack is a
repeating pattern of byte sequence EC 99 84, and the needle is a
repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
skip table are '2', '1' and '250'. But the search bounces between the
'2' and '1' cases, and never hits the byte that would allow it to skip
250 bytes. Interesting case, I had not realized that that can happen.
But I don't think we need to put much weight on that, you could come up
with other scenarios where the current code has skip table collisions, too.

So overall, I think it's still a worthwhile tradeoff, given that that is
a worst case scenario. If you choose less pathological UTF-8 codepoints,
or there is no match or the match is closer to the beginning of the
string, the patch wins.

So, what is the expected speedup in a "good/average" case? Do we have
some reasonable real-world workload mixing these cases that could be
used as a realistic benchmark?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11 John Naylor
john.naylor@enterprisedb.com
In reply to: Heikki Linnakangas (#6)
Re: Speeding up text_position_next with multibyte encodings

On 12/22/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm afraid that script doesn't work as a performance test. The
position() function is immutable, so the result gets cached in the plan
cache. All you're measuring is the speed to get the constant from the
plan cache :-(.

That makes perfect sense now. I should have been more skeptical about
the small and medium sizes having similar times. :/

I rewrote the same tests with a little C helper function, attached, to
fix that, and to eliminate any PL/pgSQL overhead.

Thanks for that, I'll probably have occasion to do something like this
for other tests.

You chose interesting characters for the UTF-8 test. The haystack is a
repeating pattern of byte sequence EC 99 84, and the needle is a
repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
skip table are '2', '1' and '250'. But the search bounces between the
'2' and '1' cases, and never hits the byte that would allow it to skip
250 bytes. Interesting case, I had not realized that that can happen.

Me neither, that was unintentional.

But I don't think we need to put much weight on that, you could come up
with other scenarios where the current code has skip table collisions, too.

Okay.

So overall, I think it's still a worthwhile tradeoff, given that that is
a worst case scenario. If you choose less pathological UTF-8 codepoints,
or there is no match or the match is closer to the beginning of the
string, the patch wins.

On 12/23/18, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

So, what is the expected speedup in a "good/average" case? Do we have
some reasonable real-world workload mixing these cases that could be
used as a realistic benchmark?

I'll investigate some "better" cases.

-John Naylor

#12 John Naylor
john.naylor@enterprisedb.com
In reply to: Tomas Vondra (#10)
Re: Speeding up text_position_next with multibyte encodings

On Sun, Dec 23, 2018 at 9:33 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

So, what is the expected speedup in a "good/average" case? Do we have
some reasonable real-world workload mixing these cases that could be
used as a realistic benchmark?

Not sure about a realistic mix, but I investigated the tradeoffs.
First, using the test function Heikki posted upthread [1], I tried a
2-character needle with different haystack sizes, and looked for the
position where master and patch were roughly equal (average of 3,
within 1%). Beyond this point, the patch regresses. To keep it simple
I used UTF-8 only.

haystack  position
<=23      (patch always faster)
30        23
100       58
1000      ~550
1000000   ~550000

For very short haystacks, the patch is always faster or regresses only
when the needle is close to the end. For longer haystacks, the patch
will be faster if the position searched for is less than ~55% of the
way to the end, slower if after. Next, I tested finding the position
of a 2 character needle in a couple positions towards the front of a
1000 character haystack, plus not present and at the end for
comparison. As seen earlier [1], the worst-case regression for this
haystack length was 4.4x slower for UTF-8, but that had a skip table
collision, and this time I used characters with no bytes in common.
Average of 3, with a loop count of 1,000,000:

UTF-8
pos. master patch diff
* 3880ms 144ms -96%
100 4440ms 1410ms -68%
333 5700ms 4280ms -25%
999 9370ms 12500ms 34%

EUC-KR
pos. master patch diff
* 3900ms 112ms -97%
100 4410ms 1289ms -71%
333 5680ms 3970ms -30%
999 9370ms 11600ms 24%

The patch is much faster than master when the needle is near the front
of a large haystack or not there at all.

The majority of cases are measurably faster, and the best case is at
least 20x faster. On the whole I'd say this patch is a performance win
even without further optimization. I'm marking it ready for committer.

[1]: /messages/by-id/05d95c5c-1fb7-29ea-1c5d-e7e941a0a14d@iki.fi

--

-John Naylor

#13 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: John Naylor (#12)
Re: Speeding up text_position_next with multibyte encodings

On 15/01/2019 02:52, John Naylor wrote:

The majority of cases are measurably faster, and the best case is at
least 20x faster. On the whole I'd say this patch is a performance win
even without further optimization. I'm marking it ready for committer.

I read through the patch one more time, tweaked the comments a little
bit, and committed. Thanks for the review!

I did a little profiling of the worst case, where this is slower than
the old approach. There's a lot of function call overhead coming from
walking the string with pg_mblen(). That could be improved. If we
inlined pg_mblen() into the loop, it becomes much faster, and I think
this code would be faster even in the worst case. (Except for the very
worst cases, where the skip table with the new code happens to have a
collision at a different point than before, but that doesn't seem like
a fair comparison.)

I think this is good enough as it is, but if I have the time, I'm going
to try optimizing the pg_mblen() loop, as well as similar loops e.g. in
pg_mbstrlen(). Or if someone else wants to give that a go, feel free.

- Heikki

#14 Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#13)
Re: Speeding up text_position_next with multibyte encodings

On Fri, Jan 25, 2019 at 04:33:54PM +0200, Heikki Linnakangas wrote:

On 15/01/2019 02:52, John Naylor wrote:

The majority of cases are measurably faster, and the best case is at
least 20x faster. On the whole I'd say this patch is a performance win
even without further optimization. I'm marking it ready for committer.

I read through the patch one more time, tweaked the comments a little bit,
and committed. Thanks for the review!

I did a little profiling of the worst case, where this is slower than the
old approach. There's a lot of function call overhead coming from walking
the string with pg_mblen(). That could be improved. If we inlined pg_mblen()
into the loop, it becomes much faster, and I think this code would be faster
even in the worst case. (Except for the very worst cases, where the skip
table with the new code happens to have a collision at a different point
than before, but that doesn't seem like a fair comparison.)

I think this is good enough as it is, but if I have the time, I'm going to
try optimizing the pg_mblen() loop, as well as similar loops e.g. in
pg_mbstrlen(). Or if someone else wants to give that a go, feel free.

It might be valuable to just inline the UTF8 case.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +