Speeding up GIST index creation for tsvectors
Hi,
In hemdistsign() of tsgistidx.c, if we process the bits in 64-bit
chunks rather than byte-by-byte, we get an overall speed-up in GiST
index creation for tsvector types. With the default siglen (124), the
speed-up is 12-20%. With siglen=700, it is 30-50%. So longer
signature lengths yield a higher percentage speed-up. The attached
patch 0001 has the required changes.
In patch 0001, rather than applying the xor operator to individual
char values, it is applied to 64-bit chunks, and since the chunks are
64-bit, popcount64() is used on each chunk. I have checked that the
two bitvector pointer arguments of hemdistsign() are not always 64-bit
aligned, so the leading mis-aligned bytes and the trailing remainder
bytes are processed char-by-char, leaving the middle 64-bit chunks for
popcount64() usage.
We might extend this to the hemdistsign() definitions at other places
in the code. But for now, we can start with gist. I haven't tried
other places.
-------------
While working on this, I observed that on platforms other than x86_64,
we still declare pg_popcount64() as a function pointer, even though we
don't use the runtime selection of the right function using
__get_cpuid() as is done on x86.
The other patch, 0002, is a general optimization that avoids this
function pointer for pg_popcount32/64() calls. The patch arranges for
a direct function call, so as to get rid of the function-pointer
dereference each time pg_popcount32/64() is called.
To do this, whenever USE_POPCNT_ASM is not defined, pg_popcount64 is
defined as another function name (pg_popcount64_nonasm) rather than a
function pointer, and pg_popcount64_nonasm() is made a static inline
function, so that a call to pg_popcount64() translates directly into a
__builtin_popcount() call. For platforms not supporting
__builtin_popcount(), the slow version continues to be used, as is
the current behaviour.
I tested this 0002 patch on ARM64, with patch 0001 already applied, and
GiST index creation for tsvectors *further* speeds up by 6% for the
default siglen (124), and by 12% with siglen=700.
-------------
Schema:
CREATE TABLE test_tsvector(t text, a tsvector);
-- Attached tsearch.data (a bigger version of
-- src/test/regress/data/tsearch.data)
\COPY test_tsvector FROM 'tsearch.data';
Test case that shows improvement:
CREATE INDEX wowidx6 ON test_tsvector USING gist (a);
Time taken by the above create-index command, in seconds, along with %
speed-up w.r.t. HEAD:

A) siglen=124 (default)
        head    0001.patch    0001+0002.patch
x86     .827    .737 (12%)    .....
arm     1.098   .912 (20%)    .861 (28%)

B) siglen=700 (... USING gist (a tsvector_ops(siglen=700)))
        head    0001.patch    0001+0002.patch
x86     1.121   .847 (32%)    .....
arm     1.751   1.191 (47%)   1.062 (65%)
--
Thanks,
-Amit Khandekar
Huawei Technologies
Attachments:
- tsearch.data.gz
- 0001-Speed-up-xor-ing-of-two-gist-index-signatures-for-ts.patch
- 0002-Avoid-function-pointer-dereferencing-for-pg_popcount.patch
Hi, Amit!
It's really cool to hear about another GiST improvement proposal. I'd like
to connect the recently committed GiST ordered build discussion here [1] and
the further improvement proposed [2].
I've tested feature [1] and got a 2.5-3x speed improvement, which is much
better I believe. There is an ongoing activity [2] to build support for
different data types for GiST. Maybe you will find it interesting to
join.
BTW you may have heard about the GIN and RUM [3] indexes, which suit text
search much, much better (and faster) than GiST. The idea of processing
data in bigger chunks is good. Still, optimizing the index structure,
minimizing disc page accesses, etc. seems better in many cases.
Thank you for your proposal!
[1]: https://commitfest.postgresql.org/29/2276/
[2]: https://commitfest.postgresql.org/31/2824/
[3]: https://github.com/postgrespro/rum
--
Best regards,
Pavel Borisov
Postgres Professional: http://postgrespro.com
On Thu, 10 Dec 2020 at 20:43, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Amit!
It's really cool to hear about another GiST improvement proposal. I'd like to connect recently committed GiST ordered build discussion here [1] and further improvement proposed [2]. I've tested feature [1] and got 2.5-3 times speed improvement which is much better I believe.
Yeah, I am completely new to the GIST stuff, but I had taken a quick
look at the sortsupport feature for GIST, and found it very
interesting. I believe it's an additional option for making the gist
index builds much faster. But then I thought that my small patch would
still be worthwhile because for tsvector types the non-sort method for
index build would continue to be used by users, and in general we can
extend this small optimization for other gist types also.
There is an ongoing activity [2] to build support for different data types for GiST. Maybe you will consider it interesting to join.
BTW you may have heard about Gin and Rum [3] indexes which suit text search much, much better (and faster) than GiST. The idea to process data in bigger chunks is good. Still optimize index structure, minimizing disc pages access, etc. seems better in many cases.
Sure. Thanks for the pointers.
--
Thanks,
-Amit Khandekar
Huawei Technologies
On 13 Dec 2020, at 17:46, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Thu, 10 Dec 2020 at 20:43, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Amit!
It's really cool to hear about another GiST improvement proposal. I'd like to connect recently committed GiST ordered build discussion here [1] and further improvement proposed [2]. I've tested feature [1] and got 2.5-3 times speed improvement which is much better I believe.
Yeah, I am completely new to the GIST stuff, but I had taken a quick
look at the sortsupport feature for GIST, and found it very
interesting. I believe it's an additional option for making the gist
index builds much faster.
+1
This will also make all INSERTs and UPDATEs faster for tsvector GiSTs.
Also I really like idea of taking advantage of hardware capabilities like __builtin_* etc wherever possible.
Meanwhile there are at least 4 incarnations of hemdistsign() functions that are quite similar. I'd propose to refactor them somehow...
Thanks!
Best regards, Andrey Borodin.
On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
+1
This will also make all INSERTs and UPDATEs faster for tsvector GiSTs.
Oh, I didn't realize that this code is getting used in GIST index
insertion and creation too. Will check there.
Also I really like idea of taking advantage of hardware capabilities like __builtin_* etc wherever possible.
Yes. Also, __builtin_popcount() uses SIMD vectorization (on arm64:
"cnt v0.8b, v0.8b"), hence there's all the more reason to use it.
Over and above that, I had thought that if we could auto-vectorize the
byte-by-byte xor operation and the popcount() call using compiler
optimizations, we would benefit from this, but I didn't see any further
improvement. I hoped for a benefit because that would have allowed
us to process 128-bit or 256-bit chunks, since the vector
registers are at least that long. Maybe gcc is not smart enough to
translate __builtin_popcount() to a 128/256-bit vectorized instruction.
But for the XOR operator, it does translate to 128-bit vectorized
instructions (on arm64: "eor v2.16b, v2.16b, v18.16b").
Meanwhile there are at least 4 incarnations of hemdistsign() functions that are quite similar. I'd propose to refactor them somehow...
Yes, I hope we get the benefit there also. Before that, I thought I
should post the first use-case to get some early comments. Thanks for
your encouraging comments :)
On Tue, 15 Dec 2020 at 20:34, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
+1
This will also make all INSERTs and UPDATEs faster for tsvector GiSTs.
Oh, I didn't realize that this code is getting used in GIST index
insertion and creation too. Will check there.
I ran some insert and update tests; they show only marginal
improvement. So looks like the patch is mainly improving index builds.
Meanwhile there are at least 4 incarnations of hemdistsign() functions that are quite similar. I'd propose to refactor them somehow...
Yes, I hope we get the benefit there also. Before that, I thought I
should post the first use-case to get some early comments. Thanks for
your encouraging comments :)
The attached v2 version of 0001 patch extends the hemdistsign()
changes to the other use cases like intarray, ltree and hstore. I see
the same index build improvement for all these types.
Since for the gist index creation of some of these types the default
value for siglen is small (8-20), I tested with small siglens. For
siglens <= 20, particularly for values that are not multiples of 8
(e.g. 10, 13, etc.), I see a 1-7% reduction in index creation speed.
It's probably because of the extra function call for pg_xorcount(),
and might also be due to the extra logic in pg_xorcount(), which
becomes prominent for shorter traversals. So for siglen less than 32,
I kept the existing method using byte-by-byte traversal.
--
Thanks,
-Amit Khandekar
Huawei Technologies
Attachments:
- 0002-Avoid-function-pointer-dereferencing-for-pg_popcount.patch
- 0001-Speed-up-xor-ing-of-two-gist-index-signatures-for-ts-v2.patch
I have added this one in the March commitfest.
https://commitfest.postgresql.org/32/3023/
On Wed, Jan 27, 2021 at 11:07 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Hi Amit,
Your performance numbers look like this is a fruitful area to improve. I
have not yet tested performance, but I will do so at a later date. I did
some microbenchmarking of our popcount implementation, since I wasn't quite
sure it's optimal, and indeed, there is room for improvement there [1]. I'd
be curious to hear your thoughts on those findings. I think it'd be worth
it to test a version of this patch using those idioms here as well, so at
some point I plan to post something.
Now for the patches:
0001:
+ /*
+ * We can process 64-bit chunks only if both are mis-aligned by the same
+ * number of bytes.
+ */
+ if (b_aligned - b == a_aligned - a)
The obvious question here is: how often are they identically misaligned?
You don't indicate that your measurements differ in a bimodal fashion, so
does that mean they happen to be always (mis)aligned? Is that an accident
of the gist coding and could change unexpectedly? And how hard would it be
to allocate those values upstream so that the pointers are always aligned
on 8-byte boundaries? (I imagine pretty hard, since there are multiple
callers, and such tight coupling is not good style)
+ /* For smaller lengths, do simple byte-by-byte traversal */
+ if (bytes <= 32)
You noted upthread:
Since for the gist index creation of some of these types the default
value for siglen is small (8-20), I tested with small siglens. For
siglens <= 20, particularly for values that are not multiples of 8
(e.g. 10, 13, etc), I see a 1-7 % reduction in speed of index
creation. It's probably because of
an extra function call for pg_xorcount(); and also might be due to the
extra logic in pg_xorcount() which becomes prominent for shorter
traversals. So for siglen less than 32, I kept the existing method
using byte-by-byte traversal.
I wonder if that can be addressed directly, while cleaning up the loops and
alignment checks in pg_xorcount_long() a little. For example, have a look
at pg_crc32c_armv8.c -- it might be worth testing a similar approach.
Also, pardon my ignorance, but where can I find the default siglen for
various types?
+/* Count the number of 1-bits in the result of xor operation */
+extern uint64 pg_xorcount_long(const unsigned char *a, const unsigned char *b,
+                               int bytes);
+static inline uint64 pg_xorcount(const unsigned char *a, const unsigned char *b,
+                                 int bytes)
I don't think it makes sense to have a static inline function call a global
function.
-static int
+static inline int
hemdistsign(BITVECP a, BITVECP b, int siglen)
Not sure what the reason is for inlining all these callers. Come to think
of it, I don't see a reason to have hemdistsign() at all anymore. All it
does is call pg_xorcount(). I suspect that's what Andrey Borodin meant when
he said upthread:
Meanwhile there are at least 4 incarnations of hemdistsign() functions
that are quite similar. I'd propose to refactor them somehow...
0002:
I'm not really happy with this patch. I like the idea of keeping indirect
calls out of non-x86 platforms, but I believe it could be done more simply.
For one, I don't see a need to invent a third category of retail function.
Second, there's no reason to put "nonasm" in the header as a static inline,
and then call from there to the new now-global function "slow". Especially
since the supposed static inline is still needed as a possible value as a
function pointer on x86, so the compiler is not going to inline it on x86
anyway. That just confuses things. (I did make sure to remove indirect
calls from the retail functions in [1], in case we want to go that route).
[1]: /messages/by-id/CAFBsxsFCWys_yfPe4PoF3=2_oxU5fFR2H+mtM6njUA8nBiCYug@mail.gmail.com
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, 3 Mar 2021 at 23:32, John Naylor <john.naylor@enterprisedb.com> wrote:
Your performance numbers look like this is a fruitful area to improve. I have not yet tested performance, but I will do so at a later date.
Thanks for reviewing the patch !
I did some
microbenchmarking of our popcount implementation, since I wasn't quite sure
it's optimal, and indeed, there is room for improvement there [1]. I'd be
curious to hear your thoughts on those findings. I think it'd be worth it to
test a version of this patch using those idioms here as well, so at some
point I plan to post something.
I am not yet clear about the implications of that work on this patch
set here, but I am still going over it, and will reply to that.
Now for the patches:
0001:
+ /*
+  * We can process 64-bit chunks only if both are mis-aligned by the same
+  * number of bytes.
+  */
+ if (b_aligned - b == a_aligned - a)
The obvious question here is: how often are they identically misaligned? You
don't indicate that your measurements differ in a bimodal fashion, so does
that mean they happen to be always (mis)aligned?
I ran CREATE INDEX on tsvector columns using the tsearch.data that I
had attached upthread, with some instrumentation; here are the
proportions:
1. In 15% of the cases, only one among a and b was aligned. The other
was offset from the 8-byte boundary by 4 bytes.
2. 6% of the cases, both were offset by 4 bytes, i.e. identically misaligned.
3. Rest of the cases, both were aligned.
With other types, and with different sets of data, I believe I can get
totally different proportions.
Is that an accident of the gist coding and could change unexpectedly?
And how hard would it be to
allocate those values upstream so that the pointers are always aligned on
8-byte boundaries? (I imagine pretty hard, since there are multiple callers,
and such tight coupling is not good style)
That I am not sure of, though; I haven't clearly understood the structure
of gist indexes yet. I believe it would depend on the individual gist
implementation, and we can't make assumptions about that?
+ /* For smaller lengths, do simple byte-by-byte traversal */
+ if (bytes <= 32)
You noted upthread:
Since for the gist index creation of some of these types the default
value for siglen is small (8-20), I tested with small siglens. For
siglens <= 20, particularly for values that are not multiples of 8
(e.g. 10, 13, etc), I see a 1-7 % reduction in speed of index
creation. It's probably because of
an extra function call for pg_xorcount(); and also might be due to the
extra logic in pg_xorcount() which becomes prominent for shorter
traversals. So for siglen less than 32, I kept the existing method
using byte-by-byte traversal.
I wonder if that can be addressed directly, while cleaning up the loops and
alignment checks in pg_xorcount_long() a little. For example, have a look at
pg_crc32c_armv8.c -- it might be worth testing a similar approach.
Yeah, we could put the bytes <= 32 condition inside pg_xorcount_long().
I avoided that so as not to hamper the <= 32 scenarios. Details are
explained below under "why inline pg_xorcount is calling a global function".
Also, pardon my ignorance, but where can I find the default siglen for various types?
Check SIGLEN_DEFAULT.
+/* Count the number of 1-bits in the result of xor operation */
+extern uint64 pg_xorcount_long(const unsigned char *a, const unsigned char *b,
+                               int bytes);
+static inline uint64 pg_xorcount(const unsigned char *a, const unsigned char *b,
+                                 int bytes)
I don't think it makes sense to have a static inline function call a global function.
As you may note, the global function will be called only in the subset
of cases where siglen > 32. Yeah, we could put the bytes <= 32
condition inside pg_xorcount_long(), but I avoided that so as not to
hamper the <= 32 scenarios: it would add a function call for these
small-siglen scenarios. The idea was to use the function call only
where we know that the function-call overhead will be masked by
the popcount() optimization.
-static int
+static inline int
 hemdistsign(BITVECP a, BITVECP b, int siglen)
Not sure what the reason is for inlining all these callers.
Come to think of it, I don't see a reason to have hemdistsign()
at all anymore. All it does is call pg_xorcount(). I suspect that's
what Andrey Borodin meant when he said upthread:
Meanwhile there are at least 4 incarnations of hemdistsign()
functions that are quite similar. I'd propose to refactor them somehow...
I had something in mind when I finally decided to not remove
hemdistsign(). Now I think you are right, we can remove hemdistsign()
altogether. Let me check again.
--------------------
0002:
I'm not really happy with this patch. I like the idea of keeping indirect
calls out of non-x86 platforms, but I believe it could be done more simply.
I am open to other approaches that would make this patch simpler.
For one, I don't see a need to invent a third category of retail function.
So currently we have pg_popcount64_choose, pg_popcount64_slow and
pg_popcount64_asm.
With the patch, we have those three, plus pg_popcount64_nonasm.
I had earlier considered #defining pg_popcount64 as pg_popcount64_slow
in the .h (inside USE_POPCNT_ASM, of course) and leaving
pg_popcount64_slow() untouched. But that would still involve an extra
level of function call for each call to pg_popcount64(), since
pg_popcount64_slow() needs to be an exported function used in multiple
places outside pg_bitutils.c; and our purpose is to avoid indirect
calls for this function precisely because it is called so repeatedly.
So then I thought: why not move the current pg_popcount64_slow()
definition to pg_bitutils.h and make it inline? We could do that; that
way the function would look similar to other existing functions like
pg_leftmost_one_pos64(). But I didn't like it, for two reasons:
1) I found the function name confusing anyway. The only slow part of
that function is the last part, where it does byte-by-byte traversal.
That's why I renamed pg_popcount64_slow() to pg_popcount64_nonasm()
and kept the slow logic in pg_popcount64_slow(). It's ok to keep the
slow part in a non-inline function, because that part is slow anyway
and is fallback code for non-supporting platforms.
2) This also keeps the inline pg_popcount64_nonasm() code smaller.
The way I look at the final functions is:
pg_popcount64_choose() chooses between the asm and non-asm functions.
pg_popcount64_asm() is the one that runs direct assembly code.
pg_popcount64_nonasm() is used on platforms where we don't want to
call assembly code; it either calls the hardware intrinsics, or calls
the slow version if the intrinsics are not available.
If I look at the functions this way, it sounds simpler to me. But I
understand if it may still seem confusing; that's why I mentioned
that I am open to simplifying the patch. Also, the current
platform-specific popcount stuff is already confusing, but I feel what
the patch is trying to do is worth it, because I am getting an extra
7-8% improvement on my ARM machine.
Second, there's no reason to put "nonasm" in the header as a static inline,
and then call from there to the new now-global function "slow".
Explained above, the reason why I shifted the nonasm code in the
header and made it inline.
Especially since the supposed static inline is still needed as a possible
value as a function pointer on x86, so the compiler is not going to inline
it on x86 anyway. That just confuses things.
Yeah, the inline is anyway just a request to the compiler, right? On
x86, pg_bitutils.c will have it as a non-inline function, and all
the other files will have it as an inline function that will never
be used.
Like I mentioned, it is important to define it as inline in order to
avoid a function call when one calls pg_popcount64(); pg_popcount64()
should be translated to the built-in intrinsic.
(I did make sure to remove indirect calls from the retail functions
in [1], in case we want to go that route).
Yeah, I quickly had a look at that. I am still going over that thread.
Thanks for the exhaustive analysis there.
On Mon, Mar 8, 2021 at 8:43 AM Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:
On Wed, 3 Mar 2021 at 23:32, John Naylor <john.naylor@enterprisedb.com>
wrote:
0001:
+ /*
+  * We can process 64-bit chunks only if both are mis-aligned by the same
+  * number of bytes.
+  */
+ if (b_aligned - b == a_aligned - a)
The obvious question here is: how often are they identically misaligned?
You don't indicate that your measurements differ in a bimodal fashion, so
does that mean they happen to be always (mis)aligned?
I ran CREATE INDEX on tsvector columns using the tsearch.data that I
had attached upthread, with some instrumentation; here are the
proportions :
1. In 15% of the cases, only one among a and b was aligned. The other
was offset from the 8-byte boundary by 4 bytes.
2. 6% of the cases, both were offset by 4 bytes, i.e. identically
misaligned.
3. Rest of the cases, both were aligned.
That's somewhat strange. I would have assumed always aligned, or usually
not. It sounds like we could require them both to be aligned and not bother
with the byte-by-byte prologue. I also wonder if it's worth it to memcpy to
a new buffer if the passed pointer is not aligned.
+ /* For smaller lengths, do simple byte-by-byte traversal */
+ if (bytes <= 32)
You noted upthread:
Since for the gist index creation of some of these types the default
value for siglen is small (8-20), I tested with small siglens. For
siglens <= 20, particularly for values that are not multiples of 8
(e.g. 10, 13, etc), I see a 1-7 % reduction in speed of index
creation. It's probably because of
an extra function call for pg_xorcount(); and also might be due to the
extra logic in pg_xorcount() which becomes prominent for shorter
traversals. So for siglen less than 32, I kept the existing method
using byte-by-byte traversal.
I wonder if that can be addressed directly, while cleaning up the loops
and
alignment checks in pg_xorcount_long() a little. For example, have a
look at
pg_crc32c_armv8.c -- it might be worth testing a similar approach.
Yeah, we can put the bytes <= 32 condition inside pg_xorcount_long().
I avoided that to not hamper the <= 32 scenarios. Details explained
below for "why inline pg_xorcount is calling global function"
Also, pardon my ignorance, but where can I find the default siglen for
various types?
Check SIGLEN_DEFAULT.
Okay, so it's hard-coded in various functions in contrib modules. In that
case, let's just keep the existing coding for those. In fact, the comments
that got removed by your patch specifically say:
/* Using the popcount functions here isn't likely to win */
...which your testing confirmed. The new function should be used only for
Gist and possibly Intarray, whose default is 124. That means we can't get
rid of hemdistsign(), but that's fine. Alternatively, we could say that a
small regression is justifiable reason to refactor all call sites, but I'm
not proposing that.
(I did make sure to remove indirect calls from the retail functions
in [1], in case we want to go that route).
Yeah, I quickly had a look at that. I am still going over that thread.
Thanks for the exhaustive analysis there.
I'll post a patch soon that builds on that, so you can see what I mean.
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
I'll post a patch soon that builds on that, so you can see what I mean.
I've attached where I was imagining this heading, as a text file to avoid
distracting the cfbot. Here are the numbers I got with your test on the
attached, as well as your 0001, on x86-64 Clang 10, default siglen:
master:        739ms
v3-0001:       692ms
attached POC:  665ms
The small additional speed up is not worth it, given the code churn and
complexity, so I don't want to go this route after all. I think the way to
go is a simplified version of your 0001 (not 0002), with only a single
function, for gist and intarray only, and a style that better matches the
surrounding code. If you look at my xor functions in the attached text
file, you'll get an idea of what it should look like. Note that it got the
above performance without ever trying to massage the pointer alignment. I'm
a bit uncomfortable with the fact that we can't rely on alignment, but
maybe there's a simple fix somewhere in the gist code.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
- popcount-xor-try-indirection-at-buffer-level.txt
On Thu, 11 Mar 2021 at 04:25, John Naylor <john.naylor@enterprisedb.com> wrote:
Okay, so it's hard-coded in various functions in contrib modules. In that
case, let's just keep the existing coding for those. In fact, the comments
that got removed by your patch specifically say: /* Using the popcount
functions here isn't likely to win */ ...which your testing confirmed. The
new function should be used only for Gist and possibly Intarray, whose
default is 124. That means we can't get rid of hemdistsign(), but that's
fine.
The comment is there for all types. Since I get better performance
for all the types, I have kept the pg_xorcount() call for all these
contrib modules. I understand that, since for some types the default
siglen is small, we won't get a benefit there. But I think we should
call pg_xorcount() for the benefit of the non-default siglen case.
Have replaced hemdistsign() by pg_xorcount() everywhere; but with
that, the call looks a bit clumsy because of having to type-cast the
parameters to const unsigned char *. I noticed that the cast to
"unsigned char *" is required only when we use the value in the
pg_number_of_ones array. Elsewhere we can safely use 'a' and 'b' as
"char *". So I changed the pg_xorcount() parameters from unsigned char
* to char *.
I think the way to go is a simplified version of your 0001 (not 0002), with
only a single function, for gist and intarray only, and a style that better
matches the surrounding code. If you look at my xor functions in the attached
text file, you'll get an idea of what it should look like. Note that it got
the above performance without ever trying to massage the pointer alignment.
I'm a bit uncomfortable with the fact that we can't rely on alignment, but
maybe there's a simple fix somewhere in the gist code.
In the attached 0001-v3 patch, I have updated the code to match the
surrounding code; specifically, using *a++ rather than a[i].
Regarding the alignment changes... I have removed the code that
handled the leading identically unaligned bytes, for lack of evidence
that percentage of such cases is significant. Like I noted earlier,
for the tsearch data I used, identically unaligned cases were only 6%.
If I find scenarios where these cases can be significant after all and
if we cannot do anything in the gist index code, then we might have to
bring back the unaligned byte handling. I didn't get a chance to dig
deeper into the gist index implementation to see why they are not
always 8-byte aligned.
I have kept the 0002 patch as-is. Due to significant *additional*
speedup, over and above the 0001 improvement, I think the code
re-arrangement done is worth it for non-x86 platforms.
-Amit Khandekar
Attachments:
- 0001-Speed-up-xor-ing-of-two-gist-index-signatures-for-ts-v3.patch
- 0002-Avoid-function-pointer-dereferencing-for-pg_popcount.patch
On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:
On Thu, 11 Mar 2021 at 04:25, John Naylor <john.naylor@enterprisedb.com>
wrote:
Okay, so it's hard-coded in various functions in contrib modules. In
that
case, let's just keep the existing coding for those. In fact, the
comments
that got removed by your patch specifically say: /* Using the popcount
functions here isn't likely to win */ ...which your testing confirmed.
The
new function should be used only for Gist and possibly Intarray, whose
default is 124. That means we can't get rid of hemdistsign(), but that's
fine.
The comment is there for all types. Since I get the performance better
on all the types, I have kept the pg_xorcount() call for all these
contrib modules. I understand that since for some types default
siglen is small, we won't get benefit. But I think we should call
pg_xorcount() for the benefit of non-default siglen case.
The 1-7% degradation measured was from an earlier version, when
pg_xorcount_long had a lot of finicky branching and computation. Is it
still true in v3? We should answer that first. I'm interested in what
happens if you now use pg_xorcount_long in the call sites, at least in the
worst case 7% test.
Have replaced hemdistsign() by pg_xorcount() everywhere; but with
that, the call looks a bit clumsy because of having to type-cast the
parameters to const unsigned char *. I noticed that the cast to
"unsigned char *" is required only when we use the value in the
pg_number_of_ones array. Elsewhere we can safely use 'a' and 'b' as
"char *". So I changed the pg_xorcount() parameters from unsigned char
* to char *.
That matches the style of that file, so +1.
I think the way to go is a simplified version of your 0001 (not 0002),
with only a single function, for gist and intarray only, and a style
that better matches the surrounding code. If you look at my xor
functions in the attached text file, you'll get an idea of what it
should look like. Note that it got the above performance without ever
trying to massage the pointer alignment.

I'm a bit uncomfortable with the fact that we can't rely on alignment,
but maybe there's a simple fix somewhere in the gist code.
In the attached 0001-v3 patch, I have updated the code to match it
with the surrounding code; specifically, the usage of *a++ rather than
a[i].

Regarding the alignment changes... I have removed the code that
handled the leading identically unaligned bytes, for lack of evidence
that the percentage of such cases is significant. Like I noted earlier,
for the tsearch data I used, identically unaligned cases were only 6%.
If I find scenarios where these cases can be significant after all,
and if we cannot do anything in the gist index code, then we might
have to bring back the unaligned byte handling. I didn't get a chance
to dig deeper into the gist index implementation to see why they are
not always 8-byte aligned.
I find it stranger that something equivalent to char* is not randomly
misaligned, but rather only seems to land on 4-byte boundaries.
[thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
FWIW, I anticipate some push back from the community because of the fact
that the optimization relies on statistical phenomena.
I have kept the 0002 patch as-is. Due to the significant *additional*
speedup over and above the 0001 improvement, I think the code
rearrangement is worth it for non-x86 platforms.
For the amount of code uglification involved, we should be getting full asm
popcount support on Arm, not an attempt to kluge around the current
implementation. I'd be happy to review such an effort for PG15, by the way.
Readability counts, and as it stands I don't feel comfortable asking a
committer to read 0002.
--
John Naylor
EDB: http://www.enterprisedb.com
On Sat, 20 Mar 2021 at 02:19, John Naylor <john.naylor@enterprisedb.com> wrote:

On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Regarding the alignment changes... I have removed the code that
handled the leading identically unaligned bytes, for lack of evidence
that the percentage of such cases is significant. Like I noted earlier,
for the tsearch data I used, identically unaligned cases were only 6%.
If I find scenarios where these cases can be significant after all,
and if we cannot do anything in the gist index code, then we might
have to bring back the unaligned byte handling. I didn't get a chance
to dig deeper into the gist index implementation to see why they are
not always 8-byte aligned.

I find it stranger that something equivalent to char* is not randomly
misaligned, but rather only seems to land on 4-byte boundaries.

[thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.

FWIW, I anticipate some push back from the community because of the
fact that the optimization relies on statistical phenomena.
I dug into this issue for the tsvector type. Found out that it's the
way in which the sign array elements are arranged that is causing the
pointers to be misaligned:

Datum
gtsvector_picksplit(PG_FUNCTION_ARGS)
{
	......
	cache = (CACHESIGN *) palloc(sizeof(CACHESIGN) * (maxoff + 2));
	cache_sign = palloc(siglen * (maxoff + 2));

	for (j = 0; j < maxoff + 2; j++)
		cache[j].sign = &cache_sign[siglen * j];
	....
}
If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
cases point to non-8-byte-aligned addresses, as you can see in the
above code snippet.
Replacing siglen with MAXALIGN64(siglen) in the above snippet gets rid
of the misalignment. This change, applied over the 0001-v3 patch,
gives an additional ~15% benefit. MAXALIGN64(siglen) will use a bit
more space, but for not-so-small siglens this looks worth doing. I
haven't yet checked types other than tsvector.
Will get back with your other review comments. I thought, meanwhile, I
can post the above update first.
On Sun, Aug 1, 2021 at 11:41 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
FWIW, I anticipate some push back from the community because of the
fact that the optimization relies on statistical phenomena.
I dug into this issue for tsvector type. Found out that it's the way
in which the sign array elements are arranged that is causing the
pointers to
be misaligned:
[...]
If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
cases point to non-8-byte-aligned addresses, as you can see in the
above code snippet.

Replacing siglen with MAXALIGN64(siglen) in the above snippet gets rid
of the misalignment. This change, applied over the 0001-v3 patch,
gives an additional ~15% benefit. MAXALIGN64(siglen) will use a bit
more space, but for not-so-small siglens this looks worth doing. I
haven't yet checked types other than tsvector.
Sounds good.
Will get back with your other review comments. I thought, meanwhile, I
can post the above update first.
Thinking some more, my discomfort with inline functions that call a global
function doesn't make logical sense, so feel free to do it that way if you
like.
--
John Naylor
EDB: http://www.enterprisedb.com