[PATCH] ltree hash functions

Started by Tommy Pavlicekalmost 3 years ago14 messageshackers
Jump to latest
#1Tommy Pavlicek
tommypav122@gmail.com

Hi All,

I've written a patch to add hash functions for the ltree extension. It adds
support for hash indexes and hash aggregation. I've reused the existing
logic that's used to hash arrays and added tests that mirror elsewhere
(i.e. hstore and hash_func regression tests).

The patch doesn't currently support hash joins as the ltree = operator was
created without support for it. The ALTER OPERATOR command doesn't support
changing the hash join support, so I'm not sure what the best strategy to
change it is. Is it ok to update the operator's row in the pg_operator
system catalog or is there a better way to change this that someone could
recommend?

Any comments on the overall approach or other feedback would be appreciated.

Thanks,
Tommy

Attachments:

ltree-hash-v1.patchapplication/octet-stream; name=ltree-hash-v1.patchDownload+166-2
#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tommy Pavlicek (#1)
Re: [PATCH] ltree hash functions

Hi,

I've created a CF entry for the patch:

https://commitfest.postgresql.org/43/4375/

I only briefly skimmed the code, so a couple comments.

On 6/17/23 17:45, Tommy Pavlicek wrote:

Hi All,

I've written a patch to add hash functions for the ltree extension. It
adds support for hash indexes and hash aggregation. I've reused the
existing logic that's used to hash arrays and added tests that mirror
elsewhere (i.e. hstore and hash_func regression tests).

Reusing code/logic is the right approach, IMHO.

The patch doesn't currently support hash joins as the ltree = operator
was created without support for it. The ALTER OPERATOR command doesn't
support changing the hash join support, so I'm not sure what the best
strategy to change it is. Is it ok to update the operator's row in the
pg_operator system catalog or is there a better way to change this that
someone could recommend?

I guess the "correct" solution would be to extend ALTER OPERATOR. I
wonder why it's not supported - it's clearly an intentional decision
(per comment in AlterOperator). So what might break if this changes for
an existing operator?

FWIW the CREATE OPERATOR documentation only talks about hash joins for
HASHES, maybe it should be updated to also mention hash aggregates?

Any comments on the overall approach or other feedback would be appreciated.

I wonder what's the use case for this. I wonder how often people join on
ltree, for example. Did you just notice ltree can't hash and decided to
fix that, or do you have a practical use case / need for this feature?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#2)
Re: [PATCH] ltree hash functions

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

I guess the "correct" solution would be to extend ALTER OPERATOR. I
wonder why it's not supported - it's clearly an intentional decision
(per comment in AlterOperator). So what might break if this changes for
an existing operator?

This code was added by commit 321eed5f0. The thread leading up to
that commit is here:

/messages/by-id/3348985.V7xMLFDaJO@dinodell

There are some nontrivial concerns in there about breaking the
semantics of existing exclusion constraints, for instance. I think
we mostly rejected the concern about invalidation of cached plans
as already-covered, but that wasn't the only problem.

However, I think we could largely ignore the issues if we restricted
ALTER OPERATOR to only add commutator, negator, hashes, or merges
properties to operators that lacked them before --- which'd be the
primary if not only use-case anyway. That direction can't break
anything.

regards, tom lane

#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: [PATCH] ltree hash functions

On 6/17/23 20:19, Tom Lane wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

I guess the "correct" solution would be to extend ALTER OPERATOR. I
wonder why it's not supported - it's clearly an intentional decision
(per comment in AlterOperator). So what might break if this changes for
an existing operator?

This code was added by commit 321eed5f0. The thread leading up to
that commit is here:

/messages/by-id/3348985.V7xMLFDaJO@dinodell

There are some nontrivial concerns in there about breaking the
semantics of existing exclusion constraints, for instance. I think
we mostly rejected the concern about invalidation of cached plans
as already-covered, but that wasn't the only problem.

However, I think we could largely ignore the issues if we restricted
ALTER OPERATOR to only add commutator, negator, hashes, or merges
properties to operators that lacked them before --- which'd be the
primary if not only use-case anyway. That direction can't break
anything.

Sound reasonable.

Tommy, are you interested in extending ALTER OPERATOR to allow this,
which would also allow fixing the ltree operator?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Tommy Pavlicek
tommypav122@gmail.com
In reply to: Tomas Vondra (#4)
Re: [PATCH] ltree hash functions

FWIW the CREATE OPERATOR documentation only talks about hash joins for

HASHES, maybe it should be updated to also mention hash aggregates?

I think I might have been a bit unclear here, the hash aggregate does work
without altering the operator so it's just the join that's blocked. Sorry
about the confusion.

I wonder what's the use case for this. I wonder how often people join on

ltree, for example. Did you just notice ltree can't hash and decided to
fix that, or do you have a practical use case / need for this feature?

I mostly want to add hash indexes. Beyond selecting specific values, you
can use them to get ancestors (trim the path and do an exact select) and
descendents (using a functional index calculating the parent path for each
row). For example, I've found it can be faster to calculate the path of
every ancestor and use select ltree path = ANY([ancestor paths]) compared
to using a gist index. It's not ideal, but unfortunately I've found that
with enough rows, gist indexes get very large and slow. Btree indexes are
better, but for ltree they can still be up to around 10x bigger than a hash
index. I've also seen ltree hash indexes outperform btree indexes in very
large tables, but I suspect in most cases they'll be similar.

Tommy, are you interested in extending ALTER OPERATOR to allow this,

which would also allow fixing the ltree operator?

Yes, I can do that. I took a look over the code and email thread and it
seems like it should be relatively straight forward. I'll put a patch
together for that and then update this patch to alter the operator.

On Sat, Jun 17, 2023 at 9:57 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

Show quoted text

On 6/17/23 20:19, Tom Lane wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

I guess the "correct" solution would be to extend ALTER OPERATOR. I
wonder why it's not supported - it's clearly an intentional decision
(per comment in AlterOperator). So what might break if this changes for
an existing operator?

This code was added by commit 321eed5f0. The thread leading up to
that commit is here:

/messages/by-id/3348985.V7xMLFDaJO@dinodell

There are some nontrivial concerns in there about breaking the
semantics of existing exclusion constraints, for instance. I think
we mostly rejected the concern about invalidation of cached plans
as already-covered, but that wasn't the only problem.

However, I think we could largely ignore the issues if we restricted
ALTER OPERATOR to only add commutator, negator, hashes, or merges
properties to operators that lacked them before --- which'd be the
primary if not only use-case anyway. That direction can't break
anything.

Sound reasonable.

Tommy, are you interested in extending ALTER OPERATOR to allow this,
which would also allow fixing the ltree operator?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Daniel Gustafsson
daniel@yesql.se
In reply to: Tommy Pavlicek (#5)
Re: [PATCH] ltree hash functions

On 19 Jun 2023, at 11:18, Tommy Pavlicek <tommypav122@gmail.com> wrote:

Tommy, are you interested in extending ALTER OPERATOR to allow this,
which would also allow fixing the ltree operator?

Yes, I can do that. I took a look over the code and email thread and it seems like it should be relatively straight forward. I'll put a patch together for that and then update this patch to alter the operator.

Did you have a chance to look at this for an updated patch for this commitfest?

--
Daniel Gustafsson

#7Tommy Pavlicek
tommypav122@gmail.com
In reply to: Daniel Gustafsson (#6)
Re: [PATCH] ltree hash functions

On Thu, Jul 6, 2023 at 2:18 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 19 Jun 2023, at 11:18, Tommy Pavlicek <tommypav122@gmail.com> wrote:

Tommy, are you interested in extending ALTER OPERATOR to allow this,
which would also allow fixing the ltree operator?

Yes, I can do that. I took a look over the code and email thread and it seems like it should be relatively straight forward. I'll put a patch together for that and then update this patch to alter the operator.

Did you have a chance to look at this for an updated patch for this commitfest?

I finally had a chance to look at this and I've updated the patch to
alter the = operator to enable hash joins.

This is ready to be looked at now.

Is there anything I need to do to move this forward?

Cheers,
Tommy

Attachments:

0001-ltree-hash-v2.patchapplication/octet-stream; name=0001-ltree-hash-v2.patchDownload+168-2
#8jian he
jian.universality@gmail.com
In reply to: Tommy Pavlicek (#7)
Re: [PATCH] ltree hash functions

On Wed, Nov 29, 2023 at 6:09 AM Tommy Pavlicek <tommypav122@gmail.com> wrote:

On Thu, Jul 6, 2023 at 2:18 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 19 Jun 2023, at 11:18, Tommy Pavlicek <tommypav122@gmail.com> wrote:

Tommy, are you interested in extending ALTER OPERATOR to allow this,
which would also allow fixing the ltree operator?

Yes, I can do that. I took a look over the code and email thread and it seems like it should be relatively straight forward. I'll put a patch together for that and then update this patch to alter the operator.

Did you have a chance to look at this for an updated patch for this commitfest?

I finally had a chance to look at this and I've updated the patch to
alter the = operator to enable hash joins.

This is ready to be looked at now.

Is there anything I need to do to move this forward?

you only change Makefile, you also need to change contrib/ltree/meson.build?

+drop index tstidx;
+create index tstidx on ltreetest using hash (t);
+set enable_seqscan=off;
+
+SELECT * FROM ltreetest WHERE t =  '12.3' order by t asc;

Do you need to use EXPLAIN to demo the index usage?

#9Tommy Pavlicek
tommypav122@gmail.com
In reply to: jian he (#8)
Re: [PATCH] ltree hash functions

On Tue, Nov 28, 2023 at 7:38 PM jian he <jian.universality@gmail.com> wrote:

you only change Makefile, you also need to change contrib/ltree/meson.build?
Do you need to use EXPLAIN to demo the index usage?

Thanks! Yes, I missed the Meson build file. I added additional
commands with EXPLAIN (COSTS OFF) as I found in other places.

Patch updated for those comments (and a touch of cleanup in the tests) attached.

Attachments:

0001-ltree-hash-v3.patchapplication/octet-stream; name=0001-ltree-hash-v3.patchDownload+211-2
#10jian he
jian.universality@gmail.com
In reply to: Tommy Pavlicek (#9)
Re: [PATCH] ltree hash functions

On Fri, Dec 1, 2023 at 8:44 AM Tommy Pavlicek <tommypav122@gmail.com> wrote:

Patch updated for those comments (and a touch of cleanup in the tests) attached.

it would be a better name as hash_ltree than ltree_hash, similar logic
applies to ltree_hash_extended.
that would be the convention. see: https://stackoverflow.com/a/69650940/15603477

Other than that, it looks good.

#11Tommy Pavlicek
tommypav122@gmail.com
In reply to: jian he (#10)
Re: [PATCH] ltree hash functions

Thanks.

I've attached the latest version that updates the naming in line with
the convention.

Show quoted text

On Mon, Dec 4, 2023 at 12:46 AM jian he <jian.universality@gmail.com> wrote:

On Fri, Dec 1, 2023 at 8:44 AM Tommy Pavlicek <tommypav122@gmail.com> wrote:

Patch updated for those comments (and a touch of cleanup in the tests) attached.

it would be a better name as hash_ltree than ltree_hash, similar logic
applies to ltree_hash_extended.
that would be the convention. see: https://stackoverflow.com/a/69650940/15603477

Other than that, it looks good.

Attachments:

0001-ltree-hash-v4.patchapplication/octet-stream; name=0001-ltree-hash-v4.patchDownload+211-2
#12vignesh C
vignesh21@gmail.com
In reply to: Tommy Pavlicek (#11)
Re: [PATCH] ltree hash functions

On Wed, 6 Dec 2023 at 04:08, Tommy Pavlicek <tommypav122@gmail.com> wrote:

Thanks.

I've attached the latest version that updates the naming in line with
the convention.

On Mon, Dec 4, 2023 at 12:46 AM jian he <jian.universality@gmail.com> wrote:

On Fri, Dec 1, 2023 at 8:44 AM Tommy Pavlicek <tommypav122@gmail.com> wrote:

Patch updated for those comments (and a touch of cleanup in the tests) attached.

it would be a better name as hash_ltree than ltree_hash, similar logic
applies to ltree_hash_extended.
that would be the convention. see: https://stackoverflow.com/a/69650940/15603477

Other than that, it looks good.

CFBot shows one of the test is failing as in [1]:
diff -U3 /tmp/cirrus-ci-build/contrib/ltree/expected/ltree.out
/tmp/cirrus-ci-build/build-32/testrun/ltree/regress/results/ltree.out
--- /tmp/cirrus-ci-build/contrib/ltree/expected/ltree.out 2024-01-31
15:18:42.893039599 +0000
+++ /tmp/cirrus-ci-build/build-32/testrun/ltree/regress/results/ltree.out
2024-01-31 15:23:25.309028749 +0000
@@ -1442,9 +1442,14 @@
        ('0.1.2'::ltree), ('0'::ltree), ('0_asd.1_ASD'::ltree)) x(v)
 WHERE  hash_ltree(v)::bit(32) != hash_ltree_extended(v, 0)::bit(32)
        OR hash_ltree(v)::bit(32) = hash_ltree_extended(v, 1)::bit(32);
- value | standard | extended0 | extended1
--------+----------+-----------+-----------
-(0 rows)
+    value    |             standard             |
extended0             |            extended1
+-------------+----------------------------------+----------------------------------+----------------------------------
+ 0           | 10001010100010010000000000001011 |
01011001111001000100011001011011 | 01011001111001000100011010011111
+ 0.1         | 10100000111110001010110001001110 |
00111100100010001100110111010101 | 00111100100010001101100011010101
+ 0.1.2       | 01111000011100000101111101110100 |
10101110011101011000000011010111 | 10101110011101110010001111000011
+ 0           | 10001010100010010000000000001011 |
01011001111001000100011001011011 | 01011001111001000100011010011111
+ 0_asd.1_ASD | 01000010001010000000101001001101 |
00111100100010001100110111010101 | 00111100100010001101100011010101
+(5 rows)

Please post an updated version for the same.

[1]: https://api.cirrus-ci.com/v1/artifact/task/5572544858685440/testrun/build-32/testrun/ltree/regress/regression.diffs

Regards,
Vignesh

#13jian he
jian.universality@gmail.com
In reply to: vignesh C (#12)
Re: [PATCH] ltree hash functions

On Thu, Feb 1, 2024 at 11:11 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 6 Dec 2023 at 04:08, Tommy Pavlicek <tommypav122@gmail.com> wrote:

Thanks.

I've attached the latest version that updates the naming in line with
the convention.

On Mon, Dec 4, 2023 at 12:46 AM jian he <jian.universality@gmail.com> wrote:

On Fri, Dec 1, 2023 at 8:44 AM Tommy Pavlicek <tommypav122@gmail.com> wrote:

Patch updated for those comments (and a touch of cleanup in the tests) attached.

it would be a better name as hash_ltree than ltree_hash, similar logic
applies to ltree_hash_extended.
that would be the convention. see: https://stackoverflow.com/a/69650940/15603477

Other than that, it looks good.

CFBot shows one of the test is failing as in [1]:
diff -U3 /tmp/cirrus-ci-build/contrib/ltree/expected/ltree.out
/tmp/cirrus-ci-build/build-32/testrun/ltree/regress/results/ltree.out
--- /tmp/cirrus-ci-build/contrib/ltree/expected/ltree.out 2024-01-31
15:18:42.893039599 +0000
+++ /tmp/cirrus-ci-build/build-32/testrun/ltree/regress/results/ltree.out
2024-01-31 15:23:25.309028749 +0000
@@ -1442,9 +1442,14 @@
('0.1.2'::ltree), ('0'::ltree), ('0_asd.1_ASD'::ltree)) x(v)
WHERE  hash_ltree(v)::bit(32) != hash_ltree_extended(v, 0)::bit(32)
OR hash_ltree(v)::bit(32) = hash_ltree_extended(v, 1)::bit(32);
- value | standard | extended0 | extended1
--------+----------+-----------+-----------
-(0 rows)
+    value    |             standard             |
extended0             |            extended1
+-------------+----------------------------------+----------------------------------+----------------------------------
+ 0           | 10001010100010010000000000001011 |
01011001111001000100011001011011 | 01011001111001000100011010011111
+ 0.1         | 10100000111110001010110001001110 |
00111100100010001100110111010101 | 00111100100010001101100011010101
+ 0.1.2       | 01111000011100000101111101110100 |
10101110011101011000000011010111 | 10101110011101110010001111000011
+ 0           | 10001010100010010000000000001011 |
01011001111001000100011001011011 | 01011001111001000100011010011111
+ 0_asd.1_ASD | 01000010001010000000101001001101 |
00111100100010001100110111010101 | 00111100100010001101100011010101
+(5 rows)

Please post an updated version for the same.

[1] - https://api.cirrus-ci.com/v1/artifact/task/5572544858685440/testrun/build-32/testrun/ltree/regress/regression.diffs

It only fails on Linux - Debian Bullseye - Meson.
I fixed the white space, named it v5.
I also made the following changes:
from

uint64 levelHash = hash_any_extended((unsigned char *) al->name, al->len, seed);
uint32 levelHash = hash_any((unsigned char *) al->name, al->len);

to
uint64 levelHash = DatumGetUInt64(hash_any_extended((unsigned char *)
al->name, al->len, seed));
uint32 levelHash = DatumGetUInt32(hash_any((unsigned char *) al->name,
al->len));

(these two line live in different functions)

I have some problems testing it locally, so I post the patch.

Attachments:

v5-0001-add-hash-functions-for-the-ltree-extension.patchapplication/x-patch; name=v5-0001-add-hash-functions-for-the-ltree-extension.patchDownload+211-3
#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: jian he (#13)
Re: [PATCH] ltree hash functions

jian he <jian.universality@gmail.com> writes:

I also made the following changes:
from

uint64 levelHash = hash_any_extended((unsigned char *) al->name, al->len, seed);
uint32 levelHash = hash_any((unsigned char *) al->name, al->len);

to
uint64 levelHash = DatumGetUInt64(hash_any_extended((unsigned char *)
al->name, al->len, seed));
uint32 levelHash = DatumGetUInt32(hash_any((unsigned char *) al->name,
al->len));

Yeah, that'd fail on 32-bit machines.

Pushed v5 with some minor cosmetic tweaking.

regards, tom lane