Speeding up text_position_next with multibyte encodings

Started by Heikki Linnakangas · over 7 years ago · 14 messages · pgsql-hackers
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

Attached is a patch to speed up text_position_setup/next(), in some
common cases with multibyte encodings.

text_position_next() uses the Boyer-Moore-Horspool search algorithm,
with a skip table. Currently, with a multi-byte encoding, we first
convert both input strings to arrays of wchars, and run the algorithm on
those arrays.

This patch removes the mb->wchar conversion, and instead runs the
single-byte version of the algorithm directly on the inputs, even with
multi-byte encodings. That avoids the mb->wchar conversion altogether,
when there is no match. Even when there is a match, we don't need to
convert the whole input string to wchars. It's enough to count the
characters up to the match, using pg_mblen(). And when
text_position_setup/next() are used as part of split_part() or replace()
functions, we're not actually even interested in the character position,
so we can skip that too.
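The core idea can be sketched in standalone C (a simplified illustration with made-up names, not the patch's actual text_position_next() code, which also caps the skip table size and keeps state between calls):

```c
#include <stddef.h>
#include <string.h>

/*
 * Simplified Boyer-Moore-Horspool search over raw bytes, run directly
 * on a multibyte-encoded string.  Returns a pointer to the first
 * byte-level match, or NULL if there is none.  Illustrative sketch
 * only; the real code must additionally verify that a match starts on
 * a character boundary for non-self-synchronizing encodings.
 */
static const unsigned char *
bmh_search(const unsigned char *haystack, size_t hlen,
           const unsigned char *needle, size_t nlen)
{
    size_t skip[256];
    size_t i;

    if (nlen == 0 || hlen < nlen)
        return NULL;

    /* A byte not in the needle lets us skip the whole needle length. */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    /* Bytes in the needle (except its last byte) allow shorter skips. */
    for (i = 0; i < nlen - 1; i++)
        skip[needle[i]] = nlen - 1 - i;

    /* i is the haystack index aligned with the end of the needle. */
    i = nlen - 1;
    while (i < hlen)
    {
        if (memcmp(haystack + i - (nlen - 1), needle, nlen) == 0)
            return haystack + i - (nlen - 1);
        i += skip[haystack[i]];
    }
    return NULL;
}
```

If a caller only needs a yes/no answer or a byte offset, as replace() and split_part() effectively do, no character counting is required at all.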

Avoiding the large allocation is nice, too. That was actually why I
started to look into this: a customer complained that text_position fails
with strings larger than 256 MB.

Using the byte-oriented search on multibyte strings might return false
positives, though. To make that work, after finding a match, we verify
that the match doesn't fall in the middle of a multi-byte character.
However, as an important special case, that cannot happen with UTF-8,
because it has the property that the byte sequence of one character
never appears as part of another character. I think some other encodings
might have the same property, but I wasn't sure.
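For encodings without that property, the boundary check after a byte-level match might look roughly like this (a self-contained sketch, not the patch's code; char_len() here stands in for pg_mblen()):

```c
#include <stddef.h>

/*
 * Stand-in for pg_mblen(): byte length of the character starting at s,
 * assuming valid UTF-8.  (For UTF-8 the boundary check below is not
 * actually needed -- it is used here only to keep the sketch
 * self-contained and testable.)
 */
static int
char_len(const unsigned char *s)
{
    if (*s < 0x80)
        return 1;
    if ((*s & 0xE0) == 0xC0)
        return 2;
    if ((*s & 0xF0) == 0xE0)
        return 3;
    return 4;
}

/*
 * After a byte-level match at 'offset', verify it begins on a
 * character boundary by walking from the start of the string one
 * character at a time.
 */
static int
is_char_boundary(const unsigned char *str, size_t offset)
{
    size_t pos = 0;

    while (pos < offset)
        pos += (size_t) char_len(str + pos);
    return pos == offset;
}
```

A match that fails this check is a false positive and the search simply continues past it.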

This is a win in most cases. One case is slower: calling position() with
a large haystack input, where the match is near the end of the string.
Counting the characters up to that point is slower than the mb->wchar
conversion. I think we could avoid that regression too, if we had a more
optimized function to count characters. Currently, the code calls
pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(), the similar
loop through all the characters is a tight loop within the function.
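The kind of tight counting loop that would help might look like this for UTF-8 (a sketch under the assumption of valid input, not PostgreSQL code):

```c
#include <stddef.h>

/*
 * Count the characters in the first 'nbytes' bytes of a valid UTF-8
 * string.  The length decoding is inlined in the loop, avoiding the
 * per-character function call overhead of invoking pg_mblen()
 * repeatedly.
 */
static size_t
utf8_count_chars(const unsigned char *s, size_t nbytes)
{
    size_t nchars = 0;
    size_t pos = 0;

    while (pos < nbytes)
    {
        unsigned char b = s[pos];

        if (b < 0x80)
            pos += 1;               /* ASCII */
        else if ((b & 0xE0) == 0xC0)
            pos += 2;               /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0)
            pos += 3;               /* 3-byte sequence */
        else
            pos += 4;               /* 4-byte sequence */
        nchars++;
    }
    return nchars;
}
```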

Thoughts?

- Heikki

Attachments:

0001-Use-single-byte-Boyer-Moore-Horspool-search-even-wit.patch (text/x-patch, +253 -223)
#2 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Speeding up text_position_next with multibyte encodings

Hello.

At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in <3173d989-bc1c-fc8a-3b69-f24246f73876@iki.fi>

Attached is a patch to speed up text_position_setup/next(), in some
common cases with multibyte encodings.

text_position_next() uses the Boyer-Moore-Horspool search algorithm,
with a skip table. Currently, with a multi-byte encoding, we first
convert both input strings to arrays of wchars, and run the algorithm
on those arrays.

This patch removes the mb->wchar conversion, and instead runs the
single-byte version of the algorithm directly on the inputs, even with
multi-byte encodings. That avoids the mb->wchar conversion altogether,
when there is no match. Even when there is a match, we don't need to
convert the whole input string to wchars. It's enough to count the
characters up to the match, using pg_mblen(). And when
text_position_setup/next() are used as part of split_part() or
replace() functions, we're not actually even interested in the
character position, so we can skip that too.

Sounds reasonable. A partially matching character is simply ignored, so
the conversion is indeed unnecessary.

Avoiding the large allocation is nice, too. That was actually why I
started to look into this: a customer complained that text_position
fails with strings larger than 256 MB.

Using the byte-oriented search on multibyte strings might return false
positives, though. To make that work, after finding a match, we verify
that the match doesn't fall in the middle of a multi-byte
character. However, as an important special case, that cannot happen
with UTF-8, because it has the property that the byte sequence of one
character never appears as part of another character. I think some
other encodings might have the same property, but I wasn't sure.

Yes. That happens at least for EUC_JP, where the same byte can appear
at arbitrary positions. Maybe we can hard-code the restriction to
UTF-8 for the first step. (I'm not sure a second step will ever
happen.)

This is a win in most cases. One case is slower: calling position()
with a large haystack input, where the match is near the end of the
string. Counting the characters up to that point is slower than the
mb->wchar conversion. I think we could avoid that regression too, if
we had a more optimized function to count characters. Currently, the
code calls pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(),
the similar loop through all the characters is a tight loop within the
function.

Thoughts?

Couldn't we count the characters while searching for a match?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: Speeding up text_position_next with multibyte encodings

At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote
Attached is a patch to speed up text_position_setup/next(), in some
common cases with multibyte encodings.

Hi,

Unfortunately, the patch doesn't compile anymore due to:

varlena.c: In function ‘text_position_next_internal’:
varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this function)
Assert(start_ptr >= haystack && start_ptr <= haystack_end);

Could you please send an updated version? For now I'm moving it to the next CF.

#4 John Naylor
john.naylor@enterprisedb.com
In reply to: Heikki Linnakangas (#1)
Re: Speeding up text_position_next with multibyte encodings

On 11/30/18, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Unfortunately, the patch doesn't compile anymore due to:

varlena.c: In function ‘text_position_next_internal’:
varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this
function)
Assert(start_ptr >= haystack && start_ptr <= haystack_end);

Could you please send an updated version? For now I'm moving it to the next
CF.

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This is a win in most cases. One case is slower: calling position() with
a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script,
which runs position() in three scenarios:
short: 1 character needle at the end of a ~1000 character haystack,
repeated 1 million times
medium: 50 character needle at the end of a ~1 million character
haystack, repeated 1 million times
long: 250 character needle at the end of an ~80 million character
haystack (~230MB, comfortably below the 256MB limit Heikki reported),
done just one time

I took the average of 5 runs using Korean characters in both UTF-8
(3-byte) and EUC-KR (2-byte) encodings.

UTF-8
length master patch
short 2.26s 2.51s
med 2.28s 2.54s
long 3.07s 3.11s

EUC-KR
length master patch
short 2.29s 2.37s
med 2.29s 2.36s
long 1.75s 1.71s

With UTF-8, the patch is 11-12% slower on short and medium strings,
and about the same on long strings. With EUC-KR, the patch is about 3%
slower on short and medium strings, and 2-3% faster on long strings.
It seems the worst case is not that bad, and could be optimized, as
Heikki said.

-John Naylor

Attachments:

v2-0001-Use-single-byte-Boyer-Moore-Horspool-search-even-.patch (text/x-patch, +253 -223)
single-byte-bmh.sql (application/sql)
#5 John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#4)
Re: Speeding up text_position_next with multibyte encodings

On 12/14/18, John Naylor <jcnaylor@gmail.com> wrote:

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

I just noticed that the contrib/citext test fails. I've set the status
to waiting on author.

-John Naylor

#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: John Naylor (#4)
Re: Speeding up text_position_next with multibyte encodings

On 14/12/2018 20:20, John Naylor wrote:

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

Thanks for fixing that!

On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This is a win in most cases. One case is slower: calling position() with
a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script, ...

I'm afraid that script doesn't work as a performance test. The
position() function is immutable, so the result gets cached in the plan
cache. All you're measuring is the speed to get the constant from the
plan cache :-(.

I rewrote the same tests with a little C helper function, attached, to
fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the
loop counts so that the runtimes are reasonable, now that it actually
does some work.

UTF-8

length master patch
short 2.7s 10.9s
med 2.6s 11.4s
long 3.9s 9.9s

EUC_KR

length master patch
short 2.2s 7.5s
med 2.2s 3.5s
long 3.4s 3.6s

This doesn't look very flattering for the patch. Although, this is
deliberately testing the worst case.

You chose interesting characters for the UTF-8 test. The haystack is a
repeating pattern of byte sequence EC 99 84, and the needle is a
repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
skip table are '2', '1' and '250'. But the search bounces between the
'2' and '1' cases, and never hits the byte that would allow it to skip
250 bytes. Interesting case, I had not realized that that can happen.
But I don't think we need to put much weight on that, you could come up
with other scenarios where the current code has skip table collisions, too.

So overall, I think it's still a worthwhile tradeoff, given that that is
a worst case scenario. If you choose less pathological UTF-8 codepoints,
or there is no match or the match is closer to the beginning of the
string, the patch wins.
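The bouncing behaviour is easy to reproduce with a standalone sketch (a textbook Horspool table over these byte patterns; PostgreSQL's actual capped skip table has different default values):

```c
#include <stddef.h>
#include <string.h>

/*
 * Reproduce the skip-table "bounce": a haystack of repeating EC 99 84
 * searched for a needle of repeating EC 84 B1.  The examined
 * alignments alternate between bytes EC (skip 2) and 84 (skip 1) and
 * never land on 99, the byte that would allow the longest skip.
 * Returns 1 if any alignment lands on 0x99, -1 on an unexpected
 * match, 0 otherwise.
 */
static int
landed_on_0x99(const unsigned char *hay, size_t hlen,
               const unsigned char *needle, size_t nlen)
{
    size_t skip[256];
    size_t i;

    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    for (i = 0; i < nlen - 1; i++)
        skip[needle[i]] = nlen - 1 - i;

    for (i = nlen - 1; i < hlen; i += skip[hay[i]])
    {
        if (hay[i] == 0x99)
            return 1;           /* hit the "big skip" byte */
        if (memcmp(hay + i - (nlen - 1), needle, nlen) == 0)
            return -1;          /* unexpected match */
    }
    return 0;
}
```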

- Heikki

Attachments:

single-byte-bmh-benchmark.patch (text/x-patch, +18 -0)
single-byte-bmh-benchmark.sql (application/sql)
#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: John Naylor (#5)
Re: Speeding up text_position_next with multibyte encodings

On 14/12/2018 23:40, John Naylor wrote:

I just noticed that the contrib/citext test fails. I've set the status
to waiting on author.

Hmm, it works for me. What failure did you see?

- Heikki

#8 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#7)
Re: Speeding up text_position_next with multibyte encodings

On 23/12/2018 02:28, Heikki Linnakangas wrote:

On 14/12/2018 23:40, John Naylor wrote:

I just noticed that the contrib/citext test fails. I've set the status
to waiting on author.

Hmm, it works for me. What failure did you see?

Never mind, I'm seeing it now, with assertions enabled. Thanks, I'll
investigate!

- Heikki

#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#8)
Re: Speeding up text_position_next with multibyte encodings

On 23/12/2018 02:32, Heikki Linnakangas wrote:

On 23/12/2018 02:28, Heikki Linnakangas wrote:

On 14/12/2018 23:40, John Naylor wrote:

I just noticed that the contrib/citext test fails. I've set the status
to waiting on author.

Hmm, it works for me. What failure did you see?

Never mind, I'm seeing it now, with assertions enabled. Thanks, I'll
investigate!

The bug was in handling empty inputs. text_position_setup assumed and
asserted that neither the needle nor haystack are empty, expecting the
callers to have handled those special cases already, but not all callers
did. Here is a fixed version.

- Heikki

Attachments:

0001-Use-single-byte-Boyer-Moore-Horspool-search-even-wit-2.patch (text/x-patch, +258 -226)
#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Heikki Linnakangas (#6)
Re: Speeding up text_position_next with multibyte encodings

On 12/23/18 1:26 AM, Heikki Linnakangas wrote:

On 14/12/2018 20:20, John Naylor wrote:

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

Thanks for fixing that!

On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This is a win in most cases. One case is slower: calling position() with
a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script, ...

I'm afraid that script doesn't work as a performance test. The
position() function is immutable, so the result gets cached in the plan
cache. All you're measuring is the speed to get the constant from the
plan cache :-(.

I rewrote the same tests with a little C helper function, attached, to
fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the
loop counts so that the runtimes are reasonable, now that it actually
does some work.

UTF-8

length  master  patch
short   2.7s    10.9s
med     2.6s    11.4s
long    3.9s    9.9s

EUC_KR

length  master  patch
short   2.2s    7.5s
med     2.2s    3.5s
long    3.4s    3.6s

This doesn't look very flattering for the patch. Although, this is
deliberately testing the worst case.

You chose interesting characters for the UTF-8 test. The haystack is a
repeating pattern of byte sequence EC 99 84, and the needle is a
repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
skip table are '2', '1' and '250'. But the search bounces between the
'2' and '1' cases, and never hits the byte that would allow it to skip
250 bytes. Interesting case, I had not realized that that can happen.
But I don't think we need to put much weight on that, you could come up
with other scenarios where the current code has skip table collisions, too.

So overall, I think it's still a worthwhile tradeoff, given that that is
a worst case scenario. If you choose less pathological UTF-8 codepoints,
or there is no match or the match is closer to the beginning of the
string, the patch wins.

So, what is the expected speedup in a "good/average" case? Do we have
some reasonable real-world workload mixing these cases that could be
used as a realistic benchmark?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11 John Naylor
john.naylor@enterprisedb.com
In reply to: Heikki Linnakangas (#6)
Re: Speeding up text_position_next with multibyte encodings

On 12/22/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm afraid that script doesn't work as a performance test. The
position() function is immutable, so the result gets cached in the plan
cache. All you're measuring is the speed to get the constant from the
plan cache :-(.

That makes perfect sense now. I should have been more skeptical about
the small and medium sizes having similar times. :/

I rewrote the same tests with a little C helper function, attached, to
fix that, and to eliminate any PL/pgSQL overhead.

Thanks for that, I'll probably have occasion to do something like this
for other tests.

You chose interesting characters for the UTF-8 test. The haystack is a
repeating pattern of byte sequence EC 99 84, and the needle is a
repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
skip table are '2', '1' and '250'. But the search bounces between the
'2' and '1' cases, and never hits the byte that would allow it to skip
250 bytes. Interesting case, I had not realized that that can happen.

Me neither, that was unintentional.

But I don't think we need to put much weight on that, you could come up
with other scenarios where the current code has skip table collisions, too.

Okay.

So overall, I think it's still a worthwhile tradeoff, given that that is
a worst case scenario. If you choose less pathological UTF-8 codepoints,
or there is no match or the match is closer to the beginning of the
string, the patch wins.

On 12/23/18, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

So, what is the expected speedup in a "good/average" case? Do we have
some reasonable real-world workload mixing these cases that could be
used as a realistic benchmark?

I'll investigate some "better" cases.

-John Naylor

#12 John Naylor
john.naylor@enterprisedb.com
In reply to: Tomas Vondra (#10)
Re: Speeding up text_position_next with multibyte encodings

On Sun, Dec 23, 2018 at 9:33 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

So, what is the expected speedup in a "good/average" case? Do we have
some reasonable real-world workload mixing these cases that could be
used as a realistic benchmark?

Not sure about a realistic mix, but I investigated the tradeoffs.
First, using the test function Heikki posted upthread [1], I tried a
2-character needle with different haystack sizes, and looked for the
position where master and patch were roughly equal (average of 3,
within 1%). Beyond this point, the patch regresses. To keep it simple
I used UTF-8 only.

haystack  position
<=23      (patch always faster)
30        23
100       58
1000      ~550
1000000   ~550000

For very short haystacks, the patch is always faster or regresses only
when the needle is close to the end. For longer haystacks, the patch
will be faster if the position searched for is less than ~55% of the
way to the end, slower if after. Next, I tested finding the position
of a 2 character needle in a couple positions towards the front of a
1000 character haystack, plus not present and at the end for
comparison. As seen earlier [1], the worst-case regression for this
haystack length was 4.4x slower for UTF-8, but that had a skip table
collision, and this time I used characters with no bytes in common.
Average of 3, with a loop count of 1,000,000:

UTF-8
pos. master patch diff
* 3880ms 144ms -96%
100 4440ms 1410ms -68%
333 5700ms 4280ms -25%
999 9370ms 12500ms 34%

EUC-KR
pos. master patch diff
* 3900ms 112ms -97%
100 4410ms 1289ms -71%
333 5680ms 3970ms -30%
999 9370ms 11600ms 24%

The patch is much faster than master when the needle is near the front
of a large haystack or not there at all.

The majority of cases are measurably faster, and the best case is at
least 20x faster. On the whole I'd say this patch is a performance win
even without further optimization. I'm marking it ready for committer.

[1]: /messages/by-id/05d95c5c-1fb7-29ea-1c5d-e7e941a0a14d@iki.fi

--

-John Naylor

#13 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: John Naylor (#12)
Re: Speeding up text_position_next with multibyte encodings

On 15/01/2019 02:52, John Naylor wrote:

The majority of cases are measurably faster, and the best case is at
least 20x faster. On the whole I'd say this patch is a performance win
even without further optimization. I'm marking it ready for committer.

I read through the patch one more time, tweaked the comments a little
bit, and committed. Thanks for the review!

I did a little profiling of the worst case, where this is slower than
the old approach. There's a lot of function call overhead coming from
walking the string with pg_mblen(). That could be improved. If we
inlined pg_mblen() into the loop, it becomes much faster, and I think
this code would be faster even in the worst case. (Except for the very
worst cases, where the skip table with the new code happens to have a
collision at a different point than before, but that doesn't seem like
a fair comparison.)

I think this is good enough as it is, but if I have the time, I'm going
to try optimizing the pg_mblen() loop, as well as similar loops e.g. in
pg_mbstrlen(). Or if someone else wants to give that a go, feel free.

- Heikki

#14 Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#13)
Re: Speeding up text_position_next with multibyte encodings

On Fri, Jan 25, 2019 at 04:33:54PM +0200, Heikki Linnakangas wrote:

On 15/01/2019 02:52, John Naylor wrote:

The majority of cases are measurably faster, and the best case is at
least 20x faster. On the whole I'd say this patch is a performance win
even without further optimization. I'm marking it ready for committer.

I read through the patch one more time, tweaked the comments a little bit,
and committed. Thanks for the review!

I did a little profiling of the worst case, where this is slower than the
old approach. There's a lot of function call overhead coming from walking
the string with pg_mblen(). That could be improved. If we inlined pg_mblen()
into the loop, it becomes much faster, and I think this code would be faster
even in the worst case. (Except for the very worst cases, where the skip
table with the new code happens to have a collision at a different point
than before, but that doesn't seem like a fair comparison.)

I think this is good enough as it is, but if I have the time, I'm going to
try optimizing the pg_mblen() loop, as well as similar loops e.g. in
pg_mbstrlen(). Or if someone else wants to give that a go, feel free.

It might be valuable to just inline the UTF8 case.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +