Reduce build times of pg_trgm GIN indexes
Hi hackers,
Attached is a series of patches which gradually reduce the time it takes
to create GIN indexes. Most of the gains come from optimizing the
trigram extraction code in pg_trgm. A few small optimizations apply to
any GIN index operator class.
The changes are motivated by the time it takes to create GIN indexes on
large production tables, especially on columns with long strings. Even
with multiple parallel maintenance workers I've seen this taking hours.
For testing purposes I've used two different data sets:
1. The l_comment column of the TPC-H SF 10 lineitem table. l_comment
contains relatively short strings with a minimum of 10, a maximum of 43
and an average of ~27 characters.
2. The plots from a collection of movies from Wikipedia. The plots are
much longer than l_comment, with a minimum of 15, a maximum of 36,773
and an average of ~2,165 characters. The CSV file can be downloaded here
[1]: https://github.com/kiq005/movie-recommendation/raw/refs/heads/master/src/dataset/wiki_movie_plots_deduped.csv
Testing both cases is important because a big part of the trigram
extraction is spent on removing duplicates. The longer the string, the
more duplicates are usually encountered.
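To illustrate the point, here is a toy, ASCII-only sketch that counts how many of a string's trigrams are repeats (the real pg_trgm extraction additionally pads words with blanks and handles multibyte characters, so this is only an illustration):

```c
#include <string.h>

/* Toy illustration (ASCII-only, no word padding): count how many
 * trigrams of a string have already occurred earlier in the string. */
static int
count_duplicate_trigrams(const char *s)
{
    int len = (int) strlen(s);
    int dups = 0;

    for (int i = 0; i + 3 <= len; i++)
        for (int j = 0; j < i; j++)
            if (memcmp(s + i, s + j, 3) == 0)
            {
                dups++;
                break;
            }
    return dups;
}
```

Repetitive long strings such as "aaaaa" yield mostly duplicate trigrams, while short varied strings yield few or none.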
The script I used for testing is attached. I ran CREATE INDEX three
times and took the fastest run. I'm getting the following results on my
i9-13950HX dev laptop in release build:
Data set | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|--------------|----------
movies(plot) | 3,409 | 10,311 | 3.02x
lineitem(l_comment) | 161,569 | 256,986 | 1.59x
The attached patches do the following:
- v1-0001-Inline-ginCompareAttEntries.patch: Inline
ginCompareAttEntries() which is very frequently called by the GIN code.
- v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
instead of FunctionCall2Coll(). This saves a bunch of per-comparison
setup code, such as calling InitFunctionCallInfoData().
- v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of
qsort(), to inline calls to the sort comparator. This is an interim step
that is further improved on by patch 0006.
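For readers unfamiliar with sort_template.h: the point is that a type-specialized sort lets the compiler inline the comparator, whereas qsort() must call it through a function pointer. A toy macro version of the same idea (insertion sort for brevity; the real template generates a full quicksort):

```c
#include <stddef.h>

/* Generate a type-specialized sort so the comparator can be inlined.
 * Insertion sort keeps the illustration short. */
#define DEFINE_SORT(name, type, cmp_lt)              \
static void                                          \
name(type *a, size_t n)                              \
{                                                    \
    for (size_t i = 1; i < n; i++)                   \
    {                                                \
        type v = a[i];                               \
        size_t j = i;                                \
        while (j > 0 && cmp_lt(v, a[j - 1]))         \
        {                                            \
            a[j] = a[j - 1];                         \
            j--;                                     \
        }                                            \
        a[j] = v;                                    \
    }                                                \
}

#define INT_LT(x, y) ((x) < (y))
DEFINE_SORT(sort_int32, int, INT_LT)
```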
- v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch:
ginExtractEntries() deduplicates and sorts the entries returned from the
extract value function. In case of pg_trgm, that is completely redundant
because the trigrams are already deduplicated and sorted. The current
version of this patch is just to demonstrate the potential. We need to
think about what we want here. Ideally, we would require the extraction
function to provide the entries deduplicated and sorted. Alternatively,
we could indicate to ginExtractEntries() if the entries are already
deduplicated and sorted. If we don't want to alter the signature of the
extract value function, we could e.g. use the MSB of the nentries argument.
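To make the MSB idea concrete, here is a hypothetical sketch (all names made up; the v2 patch set later went with a separate uniqueAndSorted argument instead). Since nentries is otherwise non-negative, its sign bit is free to carry the flag:

```c
#include <stdint.h>

/* Hypothetical encoding: use the sign bit of the otherwise
 * non-negative nentries value as an "already deduplicated and
 * sorted" flag, leaving 31 bits for the count. */
#define GIN_ENTRIES_PRESORTED INT32_MIN   /* just the sign bit set */

static inline int32_t
encode_nentries(int32_t nentries, int presorted)
{
    return presorted ? (nentries | GIN_ENTRIES_PRESORTED) : nentries;
}

static inline int
entries_presorted(int32_t encoded)
{
    return (encoded & GIN_ENTRIES_PRESORTED) != 0;
}

static inline int32_t
decode_nentries(int32_t encoded)
{
    return encoded & INT32_MAX;   /* clear the flag bit */
}
```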
- v1-0005-Make-btint4cmp-branchless.patch: Removes branches from
btint4cmp(), which is heavily called from the GIN code. This might also
benefit other parts of the code base.
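The pattern is presumably the classic branchless three-way compare (a sketch, not necessarily the patch's exact code). Note that the naive `a - b` would overflow for distant values, which is why the comparison-based form is used:

```c
#include <stdint.h>

/* Branchless three-way comparison of two int32 values.
 * (a > b) and (a < b) each evaluate to 1 or 0, so the subtraction
 * yields -1, 0, or 1 without any conditional jumps. */
static inline int
int32_cmp_branchless(int32_t a, int32_t b)
{
    return (a > b) - (a < b);
}
```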
- v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort()
with radix sort. For the purpose of demonstrating the possible gains,
I've only replaced the signed variant for now. I've also tried using
simplehash.h for deduplicating followed by a sort_template.h-based sort.
But that was slower.
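For reference, a standalone sketch of an LSD radix sort over signed int32 keys (illustrative only, not the patch). Flipping the sign bit maps signed order onto unsigned order, so four byte-wise counting passes fully sort the array:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative LSD radix sort for int32 keys. */
static void
radix_sort_int32(int32_t *keys, size_t n)
{
    uint32_t *src = malloc(n * sizeof(uint32_t));
    uint32_t *dst = malloc(n * sizeof(uint32_t));
    size_t    count[256];

    for (size_t i = 0; i < n; i++)
        src[i] = (uint32_t) keys[i] ^ 0x80000000u;   /* flip sign bit */

    for (int shift = 0; shift < 32; shift += 8)
    {
        memset(count, 0, sizeof(count));
        for (size_t i = 0; i < n; i++)
            count[(src[i] >> shift) & 0xFF]++;
        size_t pos = 0;
        for (int b = 0; b < 256; b++)       /* prefix sums -> offsets */
        {
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }
        for (size_t i = 0; i < n; i++)
            dst[count[(src[i] >> shift) & 0xFF]++] = src[i];
        uint32_t *tmp = src; src = dst; dst = tmp;
    }

    /* After an even number of passes the result is back in src. */
    for (size_t i = 0; i < n; i++)
        keys[i] = (int32_t) (src[i] ^ 0x80000000u);  /* restore sign bit */
    free(src);
    free(dst);
}
```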
- v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
full sort comparator (-1 = less, 0 = equal, 1 = greater) but only an
equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
optimization can be done in plenty of other places in our code base.
Likely, in most of them the gains are insignificant.
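A standalone sketch of a qunique()-style loop driven by an equality-only predicate (hypothetical names; the predicate returns nonzero for equal, in the spirit discussed later in this thread):

```c
#include <stddef.h>
#include <string.h>

/* Deduplicate a sorted array in place using only an equality test,
 * sparing the callback the cost of computing a full three-way
 * ordering. Returns the number of unique elements. */
static size_t
unique_eq(void *array, size_t nelems, size_t width,
          int (*equal)(const void *, const void *))
{
    char   *bytes = array;
    size_t  out = 1;

    if (nelems <= 1)
        return nelems;
    for (size_t i = 1; i < nelems; i++)
        if (!equal(bytes + i * width, bytes + (out - 1) * width))
        {
            memmove(bytes + out * width, bytes + i * width, width);
            out++;
        }
    return out;
}

static int
int_eq(const void *a, const void *b)
{
    return *(const int *) a == *(const int *) b;
}
```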
- v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
of text is actually ASCII. Hence, we provide a fast path for this case
which is exercised if the MSB of the current character is unset.
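The fast-path condition and the ASCII case folding can be sketched as follows (illustrative helpers, not the patch itself; the caveats about locale-sensitive folding come up later in the thread):

```c
/* Bytes with the high bit unset are single-byte ASCII characters in
 * every server encoding PostgreSQL supports, so they can be processed
 * without any multibyte handling. */
static inline int
is_ascii_byte(unsigned char c)
{
    return (c & 0x80) == 0;
}

/* ASCII-only case folding, akin to what pg_ascii_tolower() does. */
static inline unsigned char
ascii_fold_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char) (c - 'A' + 'a') : c;
}
```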
With the above changes, the majority of the runtime is now spent on
inserting the trigrams into the GIN index via ginInsertBAEntry(). The
code in master uses a red-black tree for further deduplication and sorting.
Traversing the red-black tree and updating it is pretty slow. I haven't
looked through all the code yet, but it seems to me that we would be
better off replacing the red-black tree with a sort and/or a hash map.
But I'll leave this as future work for now.
[1]: https://github.com/kiq005/movie-recommendation/raw/refs/heads/master/src/dataset/wiki_movie_plots_deduped.csv
--
David Geier
Attachments:
v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
v1-0007-Faster-qunique-comparator.patch
v1-0006-Use-radix-sort.patch
v1-0005-Make-btint4cmp-branchless.patch
v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
v1-0003-Use-sort_template.h.patch
v1-0002-Optimized-comparison-functions.patch
v1-0001-Inline-ginCompareAttEntries.patch
On 05/01/2026 17:01, David Geier wrote:
The script I used for testing is attached. I ran CREATE INDEX three
times and took the fastest run. I'm getting the following results on my
i9-13950HX dev laptop in release build:

Data set | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|--------------|----------
movies(plot) | 3,409 | 10,311 | 3.02x
lineitem(l_comment) | 161,569 | 256,986 | 1.59x
Impressive speedup!
The attached patches do the following:
- v1-0001-Inline-ginCompareAttEntries.patch: Inline
ginCompareAttEntries() which is very frequently called by the GIN code.
Looks good.
- v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
instead of FunctionCall2Coll(). This saves a bunch of per-comparison
setup code, such as calling InitFunctionCallInfoData().
You lose the check for a NULL result with this. That's probably still
worth checking.
- v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of
qsort(), to inline calls to the sort comparator. This is an interim step
that is further improved on by patch 0006.
ok
- v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
ginExtractEntries() deduplicates and sorts the entries returned from the
extract value function. In case of pg_trgm, that is completely redundant
because the trigrams are already deduplicated and sorted. The current
version of this patch is just to demonstrate the potential. We need to
think about what we want here. Ideally, we would require the extraction
function to provide the entries deduplicated and sorted. Alternatively,
we could indicate to ginExtractEntries() if the entries are already
deduplicated and sorted. If we don't want to alter the signature of the
extract value function, we could e.g. use the MSB of the nentries argument.
Yeah, this seems wrong as it is. You're assuming that if the extract
function returns nullFlags == NULL, the array is already sorted and deduped.
- v1-0005-Make-btint4cmp-branchless.patch: Removes branches from
btint4cmp(), which is heavily called from the GIN code. This might as
well have benefit in other parts of the code base.
Seems reasonable.
v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort()
with radix sort. For the purpose of demonstrating the possible gains,
I've only replaced the signed variant for now. I've also tried using
simplehash.h for deduplicating followed by a sort_template.h-based sort.
But that was slower.
Ok.
v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
optimization can be done in plenty of other places in our code base.
Likely, in most of them the gains are insignificant.
Makes sense. I'm a little disappointed the compiler won't do that
optimization for us.
Perhaps we should introduce a new qunique_eq() function with a different
callback signature:
/* like qunique(), but the callback function returns true/false rather
than int */
static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
bool (*equal) (const void *, const void *))
v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
of text is actually ASCII. Hence, we provide a fast path for this case
which is exercised if the MSB of the current character is unset.
This uses pg_ascii_tolower() for ASCII characters when built with
IGNORECASE. I don't think that's correct if the proper collation
would do something more complicated than what pg_ascii_tolower() does.
Did you measure how big the impact of each individual patch is?
Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
they make any difference on their own.
- Heikki
On Tue, 6 Jan 2026 at 22:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Hi!
Patches 0003, 0004 & 0008 apply with whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am v1-0003-Use-sort_template.h.patch
Applying: Use sort_template.h
.git/rebase-apply/patch:66: trailing whitespace.
warning: 1 line adds whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am
v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
Applying: Avoid dedup and sort in ginExtractEntries
.git/rebase-apply/patch:30: trailing whitespace.
{
warning: 1 line adds whitespace errors.
reshke@yezzey-cbdb-bench:~/pgpure$ git am
v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
Applying: Add ASCII fastpath to generate_trgm_only()
.git/rebase-apply/patch:101: trailing whitespace.
else
warning: 1 line adds whitespace errors.
I ran benchmarks on my VM using your data. v1-0001 alone improves
runtime by 4-5%. v1-0002 does not show a runtime improvement for me.
With parallel GIN build, performance gains are 2-3% for 2 workers and
not noticeable beyond that.
I will try to run some more benchmarks on this.
--
Best regards,
Kirill Reshke
Hi Heikki!
Thanks for looking at the patch set.
On 06.01.2026 18:00, Heikki Linnakangas wrote:
On 05/01/2026 17:01, David Geier wrote:
- v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
instead of FunctionCall2Coll(). This saves a bunch of per-comparison
setup code, such as calling InitFunctionCallInfoData().

You lose the check for a NULL result with this. That's probably still
worth checking.
It seems like existing code where all args are not null has that safety
check. I've added it for consistency.
- v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
ginExtractEntries() deduplicates and sorts the entries returned from the
extract value function. In case of pg_trgm, that is completely redundant
because the trigrams are already deduplicated and sorted. The current
version of this patch is just to demonstrate the potential. We need to
think about what we want here. Ideally, we would require the extraction
function to provide the entries deduplicated and sorted. Alternatively,
we could indicate to ginExtractEntries() if the entries are already
deduplicated and sorted. If we don't want to alter the signature of the
extract value function, we could e.g. use the MSB of the nentries
argument.

Yeah, this seems wrong as it is. You're assuming that if the extract
function returns nullFlags == NULL, the array is already sorted and
deduped.
As said, that was just for demonstration purposes of the possible gains.
I've changed the code now such that the extractValue function of the GIN
index can indicate via the third argument uniqueAndSorted whether the
returned keys are already unique and sorted.
Unfortunately, it seems like this patch regresses performance. See
measurements below. I haven't had the time to investigate why that is.
It's pretty counterintuitive, given that this patch effectively only
removes code. Maybe you could re-test patch 0004 and share your runtimes?
v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
full sort comparator (-1 = less, 0 = equal, 1 = greater) but only a
equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
optimization can be done in plenty of other places in our code base.
Likely, in most of them the gains are insignificant.

Makes sense. I'm a little disappointed the compiler won't do that
optimization for us.
I thought the same.
Perhaps we should introduce a new qunique_eq() function with a different
callback signature:

/* like qunique(), but the callback function returns true/false rather
than int */
static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
bool (*equal) (const void *, const void *))
I would prefer to change qunique() instead. That would enforce using an
adequate comparison function from the get-go. There are only ~15 calls
to qunique(). So refactoring this should also be a fairly small patch. I
can do that if there's agreement for that approach.
v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
of text is actually ASCII. Hence, we provide a fast path for this case
which is exercised if the MSB of the current character is unset.

This uses pg_ascii_tolower() for ASCII characters when built with
IGNORECASE. I don't think that's correct if the proper collation
would do something more complicated than what pg_ascii_tolower() does.
Oh, that's evil. I had tested that specifically. But it only worked
because the code in master uses str_tolower() with
DEFAULT_COLLATION_OID. So using a different locale like in the following
example does something different than when creating a database with the
same locale.
postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı
postgres=# select show_trgm('III' COLLATE "tr_TR");
show_trgm
-------------------------
{" i"," ii","ii ",iii}
(1 row)
But when using tr_TR as default locale of the database the following
happens:
postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı
postgres=# select show_trgm('III');
show_trgm
---------------------------------------
{0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
I'm wondering if that's intentional to begin with. Shouldn't the code
instead pass PG_GET_COLLATION() to str_tolower()? Might require some
research to see how other index types handle locales.
Coming back to the original problem: the lengthy comment at the top of
pg_locale_libc.c suggests that in some cases ASCII characters are
handled the pg_ascii_tolower() way for the default locale. See for
example tolower_libc_mb(). So a character by character conversion using
that function will yield a different result than strlower_libc_mb(). I'm
wondering why that is.
Anyways, we could limit the optimization to only kick in when the used
locale follows the same rules as pg_ascii_tolower(). We could test that
when creating the locale and store that info in pg_locale_struct.
Thoughts?
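A minimal standalone sketch of such a check (the real version would compare against the locale's actual lowercasing routine rather than libc tolower(), and would stash the result in pg_locale_struct at locale creation time):

```c
#include <ctype.h>

/* Return 1 if the current locale lowercases every ASCII byte exactly
 * like plain ASCII folding would, 0 otherwise. In the default "C"
 * locale this holds. */
static int
locale_ascii_folds_like_c(void)
{
    for (int c = 0; c < 128; c++)
    {
        int expected = (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;

        if (tolower(c) != expected)
            return 0;
    }
    return 1;
}
```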
Did you measure how big is the impact from each individual patch?
Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
they make any difference on their own.
Here is the impact of each patch. I again ran CREATE INDEX three times
and took the fastest run. The run of each patch includes all previous
patches as well. For example, the timings for patch 0003 were measured
with a binary that also had patch 0002 and 0001 applied. To get the
impact of each patch in isolation, the delta to the previous run was taken.
Code                                | movies | delta | lineitem | delta
------------------------------------|--------|-------|----------|-------
master                              | 10,311 |     0 | 256,986  |      0
v1-0001-Inline-ginCompareAttEntries |  9,694 |   617 | 239,778  | 17,208
v1-0002-Optimized-comparison-func   |  9,510 |   184 | 238,094  |  1,684
v1-0003-Use-sort_template.h         |  8,661 |   849 | 231,190  |  6,904
v1-0004-Avoid-dedup-and-sort-in     |  9,305 |  -644 | 232,472  | -1,282
v1-0005-Make-btint4cmp-branchless   |  8,240 | 1,065 | 228,387  |  4,085
v1-0006-Use-radix-sort              |  6,976 | 1,264 | 207,687  | 20,700
v1-0007-Faster-qunique-comparator   |  5,911 | 1,065 | 203,744  |  3,943
v1-0008-Add-ASCII-fastpath          |  3,409 | 2,502 | 161,469  | 42,275
Attached is v2 of the patch set with the aforementioned changes. I've
also fixed the whitespace errors in 0003, 0004 and 0008, as reported by
Kirill.
--
David Geier
Attachments:
v2-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch
v2-0007-Faster-qunique-comparator.patch
v2-0006-Use-radix-sort.patch
v2-0005-Make-btint4cmp-branchless.patch
v2-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
v2-0003-Use-sort_template.h.patch
v2-0002-Optimized-comparison-functions.patch
v2-0001-Inline-ginCompareAttEntries.patch
Hi!
Patches 0003, 0004 & 0008 apply with whitespace errors.
I've fixed the whitespace errors. See v2 of the patch set in [1].
I ran benchmarks on my VM using your data. v1-0001 alone improves
runtime by 4-5%. v1-0002 does not show a runtime improvement for me.
With parallel GIN build, performance gains are 2-3% for 2 workers and
not noticeable beyond that.

I will try to run some more benchmarks on this.
Thanks. I've also included the delta for each patch in [1]. I would be
curious what you measure, especially for patch 0004, where I
surprisingly measured a regression.
[1]: /messages/by-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d@gmail.com
--
David Geier
On 09/01/2026 14:06, David Geier wrote:
On 06.01.2026 18:00, Heikki Linnakangas wrote:
On 05/01/2026 17:01, David Geier wrote:
- v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
instead of FunctionCall2Coll(). This saves a bunch of per-comparison
setup code, such as calling InitFunctionCallInfoData().

You lose the check for a NULL result with this. That's probably still
worth checking.

It seems like existing code where all args are not null has that safety
check. I've added it for consistency.

- v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch
ginExtractEntries() deduplicates and sorts the entries returned from the
extract value function. In case of pg_trgm, that is completely redundant
because the trigrams are already deduplicated and sorted. The current
version of this patch is just to demonstrate the potential. We need to
think about what we want here. Ideally, we would require the extraction
function to provide the entries deduplicated and sorted. Alternatively,
we could indicate to ginExtractEntries() if the entries are already
deduplicated and sorted. If we don't want to alter the signature of the
extract value function, we could e.g. use the MSB of the nentries
argument.

Yeah, this seems wrong as it is. You're assuming that if the extract
function returns nullFlags == NULL, the array is already sorted and
deduped.

As said, that was just for demonstration purposes of the possible gains.
I've changed the code now such that the extractValue function of the GIN
index can indicate via the third argument uniqueAndSorted whether the
returned keys are already unique and sorted.

Unfortunately, it seems like this patch regresses performance. See
measurements below. I haven't had the time to investigate why that is.
It's pretty counterintuitive, given that this patch effectively only
removes code. Maybe you could re-test patch 0004 and share your runtimes?
Looking at 0001 and 0004 patches and ginExtractEntries(), the way
ginExtractEntries() handles NULLs looks a little inefficient. It treats
NULLs as proper entries, passing them through qsort() for deduplication
and comparing them with cmpEntries(). But the end result is always that
if the opclass's extractValueFn() function returned any NULLs, then
there's exactly one GIN_CAT_NULL_KEY as the last entry of the result
array. Surely we could be smarter about how we accomplish that. Your
0004 patch eliminates the deduplication overhead altogether, which is
great of course, but the point remains for when we still need the
deduplication.
Attached is an attempt at that. It avoids the null-checks in
cmpEntries(), saving a few cycles. That's drowned in noise with your
test cases, but with the attached test case with int arrays, I'm seeing
a 1-2 % gain. That's not much, but I think it's still worth doing
because it makes the code a little simpler too, IMHO. (I didn't test it
together with the rest of your patches.)
Perhaps we should introduce a new qunique_eq() function with a different
callback signature:

/* like qunique(), but the callback function returns true/false rather
than int */
static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
bool (*equal) (const void *, const void *))

I would prefer to change qunique() instead. That would enforce using an
adequate comparison function from the get-go. There are only ~15 calls
to qunique(). So refactoring this should also be a fairly small patch. I
can do that if there's agreement for that approach.
Works for me.
At quick glance, most if not all of the qunique() calls call qsort()
just before qunique(). I wonder if we should have a single "sort and
deduplicate" function, instead. It could perhaps do some deduplication
"on the go", or other optimizations.
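Such a combined helper could look roughly like this (hypothetical name sort_unique; a smarter version could deduplicate during the sort itself rather than in a second pass):

```c
#include <stdlib.h>
#include <string.h>

/* Sort an array and remove adjacent duplicates in one call,
 * replacing the common qsort()-then-qunique() pair.
 * Returns the number of unique elements. */
static size_t
sort_unique(void *array, size_t nelems, size_t width,
            int (*cmp)(const void *, const void *))
{
    char   *bytes = array;
    size_t  out = 1;

    if (nelems <= 1)
        return nelems;
    qsort(array, nelems, width, cmp);
    for (size_t i = 1; i < nelems; i++)
        if (cmp(bytes + i * width, bytes + (out - 1) * width) != 0)
        {
            memmove(bytes + out * width, bytes + i * width, width);
            out++;
        }
    return out;
}

static int
int_cmp(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}
```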
Attached is v2 of the patch set with the aforementioned changes. I've
also fixed the whitespace errors in 0003, 0004 and 0008, as reported by
Kirill.
Thanks, I pushed patch 0001 now, that's a simple and clear win.
- Heikki
On 09.01.2026 19:36, Heikki Linnakangas wrote:
Looking at 0001 and 0004 patches and ginExtractEntries(), the way
ginExtractEntries() handles NULLs looks a little inefficient. It treats
NULLs as proper entries, passing them through qsort() for deduplication
and comparing them with cmpEntries(). But the end result is always that
if the opclass's extractValueFn() function returned any NULLs, then
there's exactly one GIN_CAT_NULL_KEY as the last entry of the result
array. Surely we could be smarter about how we accomplish that. Your
0004 patch eliminates the deduplication overhead altogether, which is
great of course, but the point remains for when we still need the
deduplication.
Good observation. I like this idea. I've focused on pg_trgm but making
this code faster is certainly useful.
Given that doing the sort on pre-sorted input apparently doesn't add
measurable overhead, according to my benchmark results, we can apply
your patch and leave mine out for the time being.
That is, by the way, also the reason why 0002 doesn't show much gain: when
the data is pre-sorted, cmpEntries() is not called as much.
Attached is an attempt at that. It avoids the null-checks in
cmpEntries(), saving a few cycles. That's drowned in noise with your
test cases, but with the attached test case with int arrays, I'm seeing
a 1-2 % gain. That's not much, but I think it's still worth doing
because it makes the code a little simpler too, IMHO. (I didn't test it
together with the rest of your patches.)
Performance is death by a thousand cuts and that's definitely the case
for the GIN index code. I'm all for putting in these small improvements
because they'll add up, and what's now 2% can be 10% once other
optimizations are in.
I took a look at your patch. Overall looks good to me. Just a few comments:
1) You should be able to create the categories array without the need
for the subsequent for loop as follows:
StaticAssertStmt(GIN_CAT_NORM_KEY == 0, "Assuming GIN_CAT_NORM_KEY=0");
*categories = palloc0_array(GinNullCategory, nentries + (hasNull ? 1 : 0));
2) Your test case seems sub-optimal to show the gains. The arrays don't
contain any NULL values and are also already sorted. Or I'm missing
something.
3) Could you try your patch in conjunction with 0002 on data that is not
pre-sorted and then check if 0002 has more impact? That way cmpEntries()
should be called much more often.
4) Have you checked if there are regression tests that exercise this
code? If not, how about adding some?
Perhaps we should introduce a new qunique_eq() function with a different
callback signature:

/* like qunique(), but the callback function returns true/false rather
than int */
static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
bool (*equal) (const void *, const void *))

I would prefer to change qunique() instead. That would enforce using an
adequate comparison function from the get-go. There are only ~15 calls
to qunique(). So refactoring this should also be a fairly small patch. I
can do that if there's agreement for that approach.

Works for me.
At quick glance, most if not all of the qunique() calls call qsort()
just before qunique(). I wonder if we should have a single "sort and
deduplicate" function, instead. It could perhaps do some deduplication
"on the go", or other optimizations.
If it's just for deduplication purposes and the data doesn't have to end
up sorted, something based on a hash map should be even faster. How
about we start with changing the qunique() comparator signature and as a
2nd step take a closer look at how to go about providing a function that
does it in one go?
If you agree, I'll share a patch here next week.
Did you measure how big the impact of each individual patch is?
Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
they make any difference on their own.

Here is the impact of each patch. I ran CREATE INDEX again three times
and took the fastest run. The run of each patch includes all previous
patches as well. For example, the timings for patch 0003 were measured
with a binary that also had patch 0002 and 0001 applied. To get the
impact of each patch in isolation, the delta to the previous run was
taken.

Code                                | movies | delta | lineitem | delta
------------------------------------|--------|-------|----------|--------
master                              | 10,311 |     0 |  256,986 |      0
v1-0001-Inline-ginCompareAttEntries |  9,694 |   617 |  239,778 | 17,208
v1-0002-Optimized-comparison-func   |  9,510 |   184 |  238,094 |  1,684
v1-0003-Use-sort_template.h         |  8,661 |   849 |  231,190 |  6,904
v1-0004-Avoid-dedup-and-sort-in     |  9,305 |  -644 |  232,472 | -1,282
v1-0005-Make-btint4cmp-branchless   |  8,240 | 1,065 |  228,387 |  4,085
v1-0006-Use-radix-sort              |  6,976 | 1,264 |  207,687 | 20,700
v1-0007-Faster-qunique-comparator   |  5,911 | 1,065 |  203,744 |  3,943
v1-0008-Add-ASCII-fastpath          |  3,409 | 2,502 |  161,469 | 42,275

Attached is v2 of the patch set with the aforementioned changes. I've
also fixed the whitespace errors in 0003, 0004 and 0008, as reported by
Kirill.

Thanks, I pushed patch 0001 now; that's a simple and clear win.
Great. Thanks.
What about the other patches? 0003 and 0007 are also pretty simple and
IMHO uncontroversial while giving decent savings.
For 0006 I would make the code also work on platforms where char is
unsigned. That's still missing.
Any thoughts about 0008 and my findings regarding the lower-case
conversion for ASCII? Adding a flag to pg_locale_struct indicating
whether it's safe to use ASCII-style tolower should be straightforward,
and then the code should be correct.
--
David Geier
On 09/01/2026 14:06, David Geier wrote:
On 06.01.2026 18:00, Heikki Linnakangas wrote:
On 05/01/2026 17:01, David Geier wrote:
v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
of text is actually ASCII. Hence, we provide a fast path for this case
which is exercised if the MSB of the current character is unset.

This uses pg_ascii_tolower() for ASCII characters when built with
IGNORECASE. I don't think that's correct if the proper collation would
do something more complicated than what pg_ascii_tolower() does.

Oh, that's evil. I had tested that specifically. But it only worked
because the code in master uses str_tolower() with
DEFAULT_COLLATION_OID. So using a different locale like in the following
example does something different than when creating a database with the
same locale.postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı

postgres=# select show_trgm('III' COLLATE "tr_TR");
show_trgm
-------------------------
{" i"," ii","ii ",iii}
(1 row)

But when using tr_TR as default locale of the database the following
happens:

postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı

postgres=# select show_trgm('III');
show_trgm
---------------------------------------
{0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}

I'm wondering if that's intentional to begin with. Shouldn't the code
instead pass PG_GET_COLLATION() to str_tolower()? Might require some
research to see how other index types handle locales.

Coming back to the original problem: the lengthy comment at the top of
pg_locale_libc.c, suggests that in some cases ASCII characters are
handled the pg_ascii_tolower() way for the default locale. See for
example tolower_libc_mb(). So a character by character conversion using
that function will yield a different result than strlower_libc_mb(). I'm
wondering why that is.
Hmm, yeah, that feels funny. The trigram code predates per-column
collation support, so I guess we never really thought through how it
should interact with COLLATE clauses.
Anyways, we could limit the optimization to only kick in when the used
locale follows the same rules as pg_ascii_tolower(). We could test that
when creating the locale and store that info in pg_locale_struct.
I think that's only possible for libc locales, which operate one
character at a time. In ICU locales, lower-casing a character can depend
on the surrounding characters, so you cannot just test the conversion of
every ascii character individually. It would make sense for libc locales
though, and I hope the ICU functions are a little faster anyway.
Although, we probably should be using case-folding rather than
lower-casing with ICU locales anyway. Case-folding is designed for
string matching. It'd be a backwards-compatibility breaking change, though.
- Heikki
Hi Heikki!
On 09.01.2026 22:02, David Geier wrote:
Given that doing the sort on pre-sorted input apparently doesn't add
measurable overhead, according to my benchmark results, we can apply
your patch and leave mine out for the time being.
I've removed my patch from the patch set in favor of your patch. I've
adapted your patch slightly as follows:
- I replaced the custom for-loop with qunique().
- I switched to sort_template.h, which gives a nice speedup as it can do
some things more efficiently now that the entries are simple Datums.
- I use palloc0_array() to initialize the array to GIN_CAT_NORM_KEY in
one go.
Your patch together with my changes gives a 20% speedup on a table with
arrays of 1000 elements and 10% NULLs. See attached test script.
That's, by the way, also the reason why 0002 doesn't show much gain: when
the data is pre-sorted, cmpEntries() is not called as much.
This turned out to be not the case. I tested 0002 with the attached
script, but that didn't show any significant improvement either. It's still
curious to me because cmpEntries() is called hundreds of millions of
times and the disassembly shows that the optimized function indeed
directly calls the operator rather than having the indirection via
FunctionCall2Coll().
Anyways, I've removed the patch from the patch set for the time being.
I would prefer to change qunique() instead. That would enforce using an
adequate comparison function from the get go. There are only ~15 calls
to qunique(). So refactoring this should also be a fairly small patch. I
can do that if there's agreement for that approach.

Works for me.
At quick glance, most if not all of the qunique() calls call qsort()
just before qunique(). I wonder if we should have a single "sort and
deduplicate" function, instead. It could perhaps do some deduplication
"on the go", or other optimizations.

If it's just for deduplication purposes and the data doesn't have to end
up sorted, something based on a hash map should be even faster. How
about we start with changing the qunique() comparator signature and as a
2nd step take a closer look at how to go about providing a function that
does it in one go?
Thinking about this some more: ideally we have two functions: something
like deduplicateArray() and sortAndDeduplicateArray(). We could
initially implement deduplicateArray() on top of
sortAndDeduplicateArray(). If we ever find a case that needs
optimization, and doesn't require the data to actually end up sorted, we
can implement deduplicateArray() e.g. on top of simplehash.h.
I'll draft a patch and submit it in a separate thread.
What about the other patches? 0003 and 0007 are also pretty simple and
IMHO uncontroversial while giving decent savings.
I've reordered the patches such that the ones that I think are
uncontroversial, small and ready to be committed are at the beginning
(patches 0001 - 0004). The radix sort and ASCII fast-path patches come
last (0005 and 0006). I would like to first concentrate on getting 0001
- 0004 in and then get back to 0005 and 0006.
I remeasured the savings of 0001 - 0004, which come on top of the
already committed patch that inlined the comparison function; that
patch by itself gave another ~5%:
Data set | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|--------------|----------
movies(plot) | 8,058 | 10,311 | 1.27x
lineitem(l_comment) | 223,233 | 256,986 | 1.19x
--
David Geier
Attachments:
v3-0004-Faster-qunique-comparator-in-generate_trgm.patch (+12/-13)
v3-0006-Add-ASCII-fastpath-to-generate_trgm_only.patch (+65/-60)
v3-0005-Optimize-generate_trgm-with-radix-sort.patch (+54/-11)
v3-0003-Make-btint4cmp-branchless.patch (+2/-7)
v3-0002-Optimize-generate_trgm-with-sort_template.h.patch (+37/-11)
v3-0001-Optimize-sort-and-deduplication-in-ginExtractEntr.patch (+66/-89)
Oh, that's evil. I had tested that specifically. But it only worked
because the code in master uses str_tolower() with
DEFAULT_COLLATION_OID. So using a different locale like in the following
example does something different than when creating a database with the
same locale.postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı

postgres=# select show_trgm('III' COLLATE "tr_TR");
show_trgm
-------------------------
{" i"," ii","ii ",iii}
(1 row)

But when using tr_TR as default locale of the database the following
happens:

postgres=# select lower('III' COLLATE "tr_TR");
lower
-------
ııı

postgres=# select show_trgm('III');
show_trgm
---------------------------------------
{0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}

I'm wondering if that's intentional to begin with. Shouldn't the code
instead pass PG_GET_COLLATION() to str_tolower()? Might require some
research to see how other index types handle locales.

Coming back to the original problem: the lengthy comment at the top of
pg_locale_libc.c, suggests that in some cases ASCII characters are
handled the pg_ascii_tolower() way for the default locale. See for
example tolower_libc_mb(). So a character by character conversion using
that function will yield a different result than strlower_libc_mb(). I'm
wondering why that is.

Hmm, yeah, that feels funny. The trigram code predates per-column
collation support, so I guess we never really thought through how it
should interact with COLLATE clauses.
I've written a patch to fix that. See [1].
Anyways, we could limit the optimization to only kick in when the used
locale follows the same rules as pg_ascii_tolower(). We could test that
when creating the locale and store that info in pg_locale_struct.

I think that's only possible for libc locales, which operate one
character at a time. In ICU locales, lower-casing a character can depend
on the surrounding characters, so you cannot just test the conversion of
every ascii character individually. It would make sense for libc locales
though, and I hope the ICU functions are a little faster anyway.

Although, we probably should be using case-folding rather than
lower-casing with ICU locales anyway. Case-folding is designed for string
matching. It'd be a backwards-compatibility breaking change, though.
Oh, I wasn't aware of that. Doing it only for libc locales still seems
useful.
Good point with the casefolding. I'll look into that.
How do we usually go about such backwards-compatibility breaking
changes? Could we have pg_upgrade reindex all GIN indexes? Would that be
acceptable?
[1]: /messages/by-id/db087c3e-230e-4119-8a03-8b5d74956bc2@gmail.com
--
David Geier
On Wed, 21 Jan 2026 at 16:45, David Geier <geidav.pg@gmail.com> wrote:
How do we usually go about such backwards-compatibility breaking
changes?
When it concerns a bug, we mention the change in the release notes
with a warning to reindex affected indexes to be sure no known
corruption remains. See e.g. the final entry in the PG18 release
notes' migration section here:
https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION.
Could we have pg_upgrade reindex all GIN indexes? Would that be
acceptable?
No. We'd handle this like any other collation/opclass fixes; we ask
users to reindex their indexes in their own time after they've
upgraded their cluster. Note that in this case it concerns an issue
with just one GIN opclass, not all GIN indexes; so even if we were to
address this in pg_upgrade it wouldn't be a correct choice to reindex
every GIN index, as only a subset of those would be affected by this
issue.
Generally speaking, pg_upgrade doesn't concern itself with the
validity of the data structures that are described by the catalogs
that it upgrades, it only concerns itself with that it correctly
transcribes the catalogs from one version to another, and that the
data files of the old cluster are transferred correctly without
changes.
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
Hi Matthias,
On 21.01.2026 21:50, Matthias van de Meent wrote:
On Wed, 21 Jan 2026 at 16:45, David Geier <geidav.pg@gmail.com> wrote:
How do we usually go about such backwards-compatibility breaking
changes?

When it concerns a bug, we mention the change in the release notes
with a warning to reindex affected indexes to be sure no known
corruption remains. See e.g. the final entry in the PG18 release
notes' migration section here:
https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION.

Could we have pg_upgrade reindex all GIN indexes? Would that be
acceptable?

No. We'd handle this like any other collation/opclass fixes; we ask
users to reindex their indexes in their own time after they've
upgraded their cluster. Note that in this case it concerns an issue
with just one GIN opclass, not all GIN indexes; so even if we were to
address this in pg_upgrade it wouldn't be a correct choice to reindex
every GIN index, as only a subset of those would be affected by this
issue.

Generally speaking, pg_upgrade doesn't concern itself with the
validity of the data structures that are described by the catalogs
that it upgrades, it only concerns itself with that it correctly
transcribes the catalogs from one version to another, and that the
data files of the old cluster are transferred correctly without
changes.
Thanks for the clarifications and the link to the release notes. That's
very helpful. Then I know how to move on and will update the patch
accordingly.
--
David Geier
On 23.01.2026 11:18, David Geier wrote:
Hi Matthias,
On 21.01.2026 21:50, Matthias van de Meent wrote:
On Wed, 21 Jan 2026 at 16:45, David Geier <geidav.pg@gmail.com> wrote:
How do we usually go about such backwards-compatibility breaking
changes?

When it concerns a bug, we mention the change in the release notes
with a warning to reindex affected indexes to be sure no known
corruption remains. See e.g. the final entry in the PG18 release
notes' migration section here:
https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MIGRATION.

Could we have pg_upgrade reindex all GIN indexes? Would that be
acceptable?

No. We'd handle this like any other collation/opclass fixes; we ask
users to reindex their indexes in their own time after they've
upgraded their cluster. Note that in this case it concerns an issue
with just one GIN opclass, not all GIN indexes; so even if we were to
address this in pg_upgrade it wouldn't be a correct choice to reindex
every GIN index, as only a subset of those would be affected by this
issue.

Generally speaking, pg_upgrade doesn't concern itself with the
validity of the data structures that are described by the catalogs
that it upgrades, it only concerns itself with that it correctly
transcribes the catalogs from one version to another, and that the
data files of the old cluster are transferred correctly without
changes.

Thanks for the clarifications and the link to the release notes. That's
very helpful. Then I know how to move on and will update the patch
accordingly.
Attached are the patches rebased on latest master.
I've removed the ASCII fast-path patch 0006 as it turned out to be more
complicated to make work than expected.
I kept the radix sort patch because it gives a decent speedup but I
would like to focus for now on getting patches 0001 - 0004 merged.
They're all simple and, the way I see it, uncontroversial.
I remeasured the savings of 0001 - 0004, which come on top of the
already committed patch that inlined the comparison function; that
patch by itself gave another ~5%:
Data set | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|--------------|----------
movies(plot) | 8,058 | 10,311 | 1.27x
lineitem(l_comment) | 223,233 | 256,986 | 1.19x
I've also registered the change at the commit fest, see
https://commitfest.postgresql.org/patch/6418/.
--
David Geier
Attachments:
v4-0005-Optimize-generate_trgm-with-radix-sort.patch (+54/-11)
v4-0004-Faster-qunique-comparator-in-generate_trgm.patch (+12/-13)
v4-0003-Make-btint4cmp-branchless.patch (+2/-7)
v4-0002-Optimize-generate_trgm-with-sort_template.h.patch (+37/-11)
v4-0001-Optimize-sort-and-deduplication-in-ginExtractEntr.patch (+66/-89)
Attached are the patches rebased on latest master.
I've removed the ASCII fast-path patch 0006 as it turned out to be more
complicated to make work than expected.

I kept the radix sort patch because it gives a decent speedup but I
would like to focus for now on getting patches 0001 - 0004 merged.
They're all simple and, the way I see it, uncontroversial.

I remeasured the savings of 0001 - 0004, which comes on top of the
already committed patch that inlined the comparison function, which gave
another ~5%:

Data set            | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|--------------|----------
movies(plot) | 8,058 | 10,311 | 1.27x
lineitem(l_comment) | 223,233 | 256,986 | 1.19x

I've also registered the change at the commit fest, see
https://commitfest.postgresql.org/patch/6418/.
Attached is v5 that removes an incorrect assertion from the radix sort code.
--
David Geier
Attachments:
v5-0005-Optimize-generate_trgm-with-radix-sort.patch (+53/-11)
v5-0004-Faster-qunique-comparator-in-generate_trgm.patch (+12/-13)
v5-0003-Make-btint4cmp-branchless.patch (+2/-7)
v5-0002-Optimize-generate_trgm-with-sort_template.h.patch (+37/-11)
v5-0001-Optimize-sort-and-deduplication-in-ginExtractEntr.patch (+66/-89)
On 03/03/2026 19:31, David Geier wrote:
Attached are the patches rebased on latest master.
I've removed the ASCII fast-path patch 0006 as it turned out to be more
complicated to make work than expected.

I kept the radix sort patch because it gives a decent speedup but I
would like to focus for now on getting patches 0001 - 0004 merged.
They're all simple and, the way I see it, uncontroversial.

I remeasured the savings of 0001 - 0004, which comes on top of the
already committed patch that inlined the comparison function, which gave
another ~5%:

Data set            | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|--------------|----------
movies(plot) | 8,058 | 10,311 | 1.27x
lineitem(l_comment) | 223,233 | 256,986 | 1.19x

I've also registered the change at the commit fest, see
https://commitfest.postgresql.org/patch/6418/.

Attached is v5 that removes an incorrect assertion from the radix sort code.
v5-0001-Optimize-sort-and-deduplication-in-ginExtractEntr.patch
v5-0002-Optimize-generate_trgm-with-sort_template.h.patch
v5-0003-Make-btint4cmp-branchless.patch
v5-0004-Faster-qunique-comparator-in-generate_trgm.patch
v5-0005-Optimize-generate_trgm-with-radix-sort.patch
Pushed 0001 as commit 6f5ad00ab7.
I squashed 0002 and 0004 into one commit, and did some more refactoring:
I created a trigram_qsort() helper function that calls the signed or
unsigned variant, so that that logic doesn't need to be duplicated in
the callers. For symmetry, I also added a trigram_qunique() helper
function which just calls qunique() with the new, faster CMPTRGM_EQ
comparator. Pushed these as commit 9f3755ea07.
Patch 0003 gives me pause. It's a tiny patch:
@@ -203,12 +204,7 @@ btint4cmp(PG_FUNCTION_ARGS)
 	int32		a = PG_GETARG_INT32(0);
 	int32		b = PG_GETARG_INT32(1);
 
-	if (a > b)
-		PG_RETURN_INT32(A_GREATER_THAN_B);
-	else if (a == b)
-		PG_RETURN_INT32(0);
-	else
-		PG_RETURN_INT32(A_LESS_THAN_B);
+	PG_RETURN_INT32(pg_cmp_s32(a, b));
 }
But the comments on the pg_cmp functions say:
* NB: If the comparator function is inlined, some compilers may produce
* worse code with these helper functions than with code with the
* following form:
*
* if (a < b)
* return -1;
* if (a > b)
* return 1;
* return 0;
*
So, uh, is that really a universal improvement? Is that comment about
producing worse code outdated?
- Heikki
On Tue, Apr 7, 2026 at 6:27 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
But the comments on the pg_cmp functions say:
* NB: If the comparator function is inlined, some compilers may produce
* worse code with these helper functions than with code with the
* following form:
*
* if (a < b)
* return -1;
* if (a > b)
* return 1;
* return 0;
*

So, uh, is that really a universal improvement? Is that comment about
producing worse code outdated?
No, it's quite recent:
/messages/by-id/20240212230423.GA3519@nathanxps13
--
John Naylor
Amazon Web Services
Hi,
On Tue, Apr 07, 2026 at 02:27:40PM +0300, Heikki Linnakangas wrote:
On 03/03/2026 19:31, David Geier wrote:
Attached are the patches rebased on latest master.
I've removed the ASCII fast-path patch 0006 as it turned out to be more
complicated to make work than expected.

I kept the radix sort patch because it gives a decent speedup but I
would like to focus for now on getting patches 0001 - 0004 merged.
They're all simple and, the way I see it, uncontroversial.

I remeasured the savings of 0001 - 0004, which comes on top of the
already committed patch that inlined the comparison function, which gave
another ~5%:

Data set            | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|--------------|----------
movies(plot) | 8,058 | 10,311 | 1.27x
lineitem(l_comment) | 223,233 | 256,986 | 1.19x

I've also registered the change at the commit fest, see
https://commitfest.postgresql.org/patch/6418/.

Attached is v5 that removes an incorrect assertion from the radix sort code.
v5-0001-Optimize-sort-and-deduplication-in-ginExtractEntr.patch
v5-0002-Optimize-generate_trgm-with-sort_template.h.patch
v5-0003-Make-btint4cmp-branchless.patch
v5-0004-Faster-qunique-comparator-in-generate_trgm.patch
v5-0005-Optimize-generate_trgm-with-radix-sort.patch

Pushed 0001 as commit 6f5ad00ab7.
This commit makes use of StaticAssertStmt(), which has been deprecated in
d50c86e74375. The attached patch fixes it.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-gin-change-remaining-StaticAssertStmt-to-StaticAs.patch (+2/-2)
Heikki Linnakangas <hlinnaka@iki.fi> writes:
Pushed 0001 as commit 6f5ad00ab7.
This commit has caused Coverity to start complaining that
most of ginExtractEntries() is unreachable:
*** CID 1691468: Control flow issues (DEADCODE)
/srv/coverity/git/pgsql-git/postgresql/src/backend/access/gin/ginutil.c: 495 in ginExtractEntries()
489 /*
490 * Scan the items for any NULLs. All NULLs are considered equal, so we
491 * just need to check and remember if there are any. We remove them from
492 * the array here, and after deduplication, put back one NULL entry to
493 * represent them all.
494 */
CID 1691468: Control flow issues (DEADCODE)
Execution cannot reach this statement: "hasNull = false;".
495 hasNull = false;
496 if (nullFlags)
497 {
498 int32 numNonNulls = 0;
499
500 for (int32 i = 0; i < nentries; i++)
Evidently, it does not realize that the extractValueFn() can change
nentries from its initial value of zero. I wouldn't be too surprised
if that's related to our casting of the pointer to uintptr_t --- that
may cause it to not see the passed pointer as a potential reference
mechanism.
I would just write that off as Coverity not being smart enough, except
that I'm worried that some compiler might make a similar deduction and
break the function completely. Was the switch to a local variable
for nentries really a useful win performance-wise?
regards, tom lane
On 09.04.26 13:28, Bertrand Drouvot wrote:
Hi,
On Tue, Apr 07, 2026 at 02:27:40PM +0300, Heikki Linnakangas wrote:
On 03/03/2026 19:31, David Geier wrote:
Attached are the patches rebased on latest master.
I've removed the ASCII fast-path patch 0006 as it turned out to be more
complicated to make work than expected.

I kept the radix sort patch because it gives a decent speedup but I
would like to focus for now on getting patches 0001 - 0004 merged.
They're all simple and, the way I see it, uncontroversial.

I remeasured the savings of 0001 - 0004, which comes on top of the
already committed patch that inlined the comparison function, which gave
another ~5%:

Data set            | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|--------------|----------
movies(plot) | 8,058 | 10,311 | 1.27x
lineitem(l_comment) | 223,233 | 256,986 | 1.19x

I've also registered the change at the commit fest, see
https://commitfest.postgresql.org/patch/6418/.

Attached is v5 that removes an incorrect assertion from the radix sort code.
v5-0001-Optimize-sort-and-deduplication-in-ginExtractEntr.patch
v5-0002-Optimize-generate_trgm-with-sort_template.h.patch
v5-0003-Make-btint4cmp-branchless.patch
v5-0004-Faster-qunique-comparator-in-generate_trgm.patch
v5-0005-Optimize-generate_trgm-with-radix-sort.patch

Pushed 0001 as commit 6f5ad00ab7.
This commit makes use of StaticAssertStmt() that has been deprecated in
d50c86e74375. The attached patch fixes it.
I think the position of the static assertion is correct, because it
refers to the palloc0_array() that follows. Maybe the comment could be
a bit clearer, like "using palloc0_array requires GIN_CAT_NORM_KEY==0"?
Hi,
On Mon, Apr 13, 2026 at 11:41:02AM +0200, Peter Eisentraut wrote:
On 09.04.26 13:28, Bertrand Drouvot wrote:
This commit makes use of StaticAssertStmt(), which has been deprecated in
d50c86e74375. The attached patch fixes it.

I think the position of the static assertion is correct, because it refers
to the palloc0_array() that follows. Maybe the comment could be a bit
clearer, like "using palloc0_array requires GIN_CAT_NORM_KEY==0"?
Yeah, that looks better, so as not to lose the connection with
palloc0_array() here. Done that way in the attached, also adding new
braces to avoid a warning from -Wdeclaration-after-statement.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com