Do we want a hashset type?
Hi,
I've been working with a social network start-up that uses PostgreSQL as their
only database. Recently, they became interested in graph databases, largely
because of an article [1]https://neo4j.com/news/how-much-faster-is-a-graph-database-really/ suggesting that a SQL database "just chokes" when it
encounters a depth-five friends-of-friends query (for a million users having
50 friends each).
The article didn't provide the complete SQL queries, so I had to buy the
referenced book to get the full picture. It turns out, the query was a simple
self-join, which, of course, isn't very efficient. When we rewrote the query
using a modern RECURSIVE CTE, it worked but still took quite some time.
Of course, there will always be a need for specific databases, and some queries
will run faster on them. But, if PostgreSQL could handle graph queries with a
Big-O runtime similar to graph databases, scalability wouldn't be such a big
worry.
Just like the addition of the JSON type to PostgreSQL helped reduce the hype
around NoSQL, maybe there's something simple that's missing in PostgreSQL that
could help us achieve the same Big-O class performance as graph databases for
some of these type of graph queries?
Looking into the key differences between PostgreSQL and graph databases,
it seems that one is how they store adjacent nodes. In SQL, a graph can be
represented as one table for the Nodes and another table for the Edges.
For a friends-of-friends query, we would need to query Edges to find adjacent
nodes, recursively.
Graph databases, on the other hand, keep adjacent nodes immediately accessible
by storing them with the node itself. This looks like a major difference in
terms of how the data is stored.
Could a hashset type help bridge this gap?
The idea would be to store adjacent nodes as a hashset column in a Nodes table.
Apache AGE is an option for users who really need a new graph query language.
But I believe there are more users who occasionally run into a graph problem and
would be glad if there was an efficient way to solve it in SQL without having
to bring in a new database.
/Joel
[1]: https://neo4j.com/news/how-much-faster-is-a-graph-database-really/
On 5/31/23 16:09, Joel Jacobson wrote:
Hi,
I've been working with a social network start-up that uses PostgreSQL as
their
only database. Recently, they became interested in graph databases, largely
because of an article [1] suggesting that a SQL database "just chokes"
when it
encounters a depth-five friends-of-friends query (for a million users having
50 friends each).The article didn't provide the complete SQL queries, so I had to buy the
referenced book to get the full picture. It turns out, the query was a
simple
self-join, which, of course, isn't very efficient. When we rewrote the query
using a modern RECURSIVE CTE, it worked but still took quite some time.Of course, there will always be a need for specific databases, and some
queries
will run faster on them. But, if PostgreSQL could handle graph queries
with a
Big-O runtime similar to graph databases, scalability wouldn't be such a big
worry.Just like the addition of the JSON type to PostgreSQL helped reduce the hype
around NoSQL, maybe there's something simple that's missing in
PostgreSQL that
could help us achieve the same Big-O class performance as graph
databases for
some of these type of graph queries?Looking into the key differences between PostgreSQL and graph databases,
it seems that one is how they store adjacent nodes. In SQL, a graph can be
represented as one table for the Nodes and another table for the Edges.
For a friends-of-friends query, we would need to query Edges to find
adjacent
nodes, recursively.Graph databases, on the other hand, keep adjacent nodes immediately
accessible
by storing them with the node itself. This looks like a major difference in
terms of how the data is stored.Could a hashset type help bridge this gap?
The idea would be to store adjacent nodes as a hashset column in a Nodes
table.
I think this needs a better explanation - what exactly is a hashset in
this context? Something like an array with a hash for faster lookup of
unique elements, or what?
Presumably it'd store whole adjacent nodes, not just some sort of node
ID. So what if a node is adjacent to many other nodes? What if a node is
added/deleted/modified?
AFAICS the main problem is the lookups of adjacent nodes, generating
lot of random I/O etc. Presumably it's not that hard to keep the
"relational" schema with table for vertices/edges, and then an auxiliary
table with adjacent nodes grouped by node, possibly maintained by a
couple triggers. A bit like an "aggregated index" except the queries
would have to use it explicitly.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote:
I think this needs a better explanation - what exactly is a hashset in
this context? Something like an array with a hash for faster lookup of
unique elements, or what?
In this context, by "hashset" I am indeed referring to a data structure similar
to an array, where each element would be unique, and lookups would be faster
than arrays for larger number of elements due to hash-based lookups.
This data structure would store identifiers (IDs) of the nodes, not the complete
nodes themselves.
Presumably it'd store whole adjacent nodes, not just some sort of node
ID. So what if a node is adjacent to many other nodes? What if a node is
added/deleted/modified?
That would require updating the hashset, which should be close to O(1) in
practical applications.
AFAICS the main problem is the lookups of adjacent nodes, generating
lot of random I/O etc. Presumably it's not that hard to keep the
"relational" schema with table for vertices/edges, and then an auxiliary
table with adjacent nodes grouped by node, possibly maintained by a
couple triggers. A bit like an "aggregated index" except the queries
would have to use it explicitly.
Yes, auxiliary table would be good, since we don't want to duplicate all
node-related data, and only store the IDs in the adjacent nodes hashset.
/Joel
On 5/31/23 17:40, Joel Jacobson wrote:
On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote:
I think this needs a better explanation - what exactly is a hashset in
this context? Something like an array with a hash for faster lookup of
unique elements, or what?In this context, by "hashset" I am indeed referring to a data structure similar
to an array, where each element would be unique, and lookups would be faster
than arrays for larger number of elements due to hash-based lookups.
Why would you want hash-based lookups? It should be fairly trivial to
implement as a user-defined data type, but in what cases are you going
to ask "does the hashset contain X"?
This data structure would store identifiers (IDs) of the nodes, not the complete
nodes themselves.
How does storing just the IDs solves anything? Isn't the main challenge
the random I/O when fetching the adjacent nodes? This does not really
improve that, no?
Presumably it'd store whole adjacent nodes, not just some sort of node
ID. So what if a node is adjacent to many other nodes? What if a node is
added/deleted/modified?That would require updating the hashset, which should be close to O(1) in
practical applications.
But you need to modify hashsets for all the adjacent nodes. Also, O(1)
doesn't say it's cheap. I wonder how expensive it'd be in practice.
AFAICS the main problem is the lookups of adjacent nodes, generating
lot of random I/O etc. Presumably it's not that hard to keep the
"relational" schema with table for vertices/edges, and then an auxiliary
table with adjacent nodes grouped by node, possibly maintained by a
couple triggers. A bit like an "aggregated index" except the queries
would have to use it explicitly.Yes, auxiliary table would be good, since we don't want to duplicate all
node-related data, and only store the IDs in the adjacent nodes hashset.
I may be missing something, but as mentioned, I don't quite see how this
would help. What exactly would this save us? If you create an index on
the edge node IDs, you should get the adjacent nodes pretty cheap from
an IOS. Granted, a pre-built hashset is going to be faster, but if the
next step is fetching the node info, that's going to do a lot of random
I/O, dwarfing all of this.
It's entirely possible I'm missing some important aspect. It'd be very
useful to have a couple example queries illustrating the issue, that we
could use to actually test different approaches.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 31, 2023, at 18:59, Tomas Vondra wrote:
How does storing just the IDs solves anything? Isn't the main challenge
the random I/O when fetching the adjacent nodes? This does not really
improve that, no?
I'm thinking of a recursive query where a lot of time is just spent following
all friends-of-friends, where the graph traversal is the heavy part,
where the final set of nodes are only fetched at the end.
It's entirely possible I'm missing some important aspect. It'd be very
useful to have a couple example queries illustrating the issue, that we
could use to actually test different approaches.
Here is an example using a real anonymised social network.
wget https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz
gunzip soc-pokec-relationships.txt.gz
CREATE TABLE edges (from_node INT, to_node INT);
\copy edges from soc-pokec-relationships.txt;
ALTER TABLE edges ADD PRIMARY KEY (from_node, to_node);
SET work_mem TO '1GB';
CREATE VIEW friends_of_friends AS
WITH RECURSIVE friends_of_friends AS (
SELECT
ARRAY[5867::bigint] AS current,
ARRAY[5867::bigint] AS found,
0 AS depth
UNION ALL
SELECT
new_current,
found || new_current,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
array_agg(DISTINCT edges.to_node) AS new_current
FROM
edges
WHERE
from_node = ANY(friends_of_friends.current)
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
depth,
coalesce(array_length(current, 1), 0)
FROM
friends_of_friends
WHERE
depth = 3;
;
SELECT COUNT(*) FROM edges;
count
----------
30622564
(1 row)
SELECT COUNT(DISTINCT from_node) FROM edges;
count
---------
1432693
(1 row)
-- Most connected user (worst-case) is user 5867 with 8763 friends:
SELECT from_node, COUNT(*) FROM edges GROUP BY from_node ORDER BY COUNT(*) DESC LIMIT 1;
from_node | count
-----------+-------
5867 | 8763
(1 row)
-- Friend-of-friends query exactly at depth three:
EXPLAIN ANALYZE
SELECT * FROM friends_of_friends;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on friends_of_friends (cost=6017516.90..6017517.60 rows=1 width=8) (actual time=2585.881..2586.334 rows=1 loops=1)
Filter: (depth = 3)
Rows Removed by Filter: 3
CTE friends_of_friends
-> Recursive Union (cost=0.00..6017516.90 rows=31 width=68) (actual time=0.005..2581.664 rows=4 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=68) (actual time=0.002..0.002 rows=1 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=200583.71..601751.66 rows=3 width=68) (actual time=645.036..645.157 rows=1 loops=4)
-> Nested Loop (cost=200583.71..601751.54 rows=3 width=68) (actual time=641.880..641.972 rows=1 loops=4)
-> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=68) (actual time=0.001..0.002 rows=1 loops=4)
Filter: (depth < 3)
Rows Removed by Filter: 0
-> Aggregate (cost=200583.71..200583.72 rows=1 width=32) (actual time=850.997..850.998 rows=1 loops=3)
-> Bitmap Heap Scan on edges (cost=27656.38..196840.88 rows=1497133 width=4) (actual time=203.239..423.534 rows=3486910 loops=3)
Recheck Cond: (from_node = ANY (friends_of_friends_1.current))
Heap Blocks: exact=117876
-> Bitmap Index Scan on edges_pkey (cost=0.00..27282.10 rows=1497133 width=0) (actual time=198.047..198.047 rows=3486910 loops=3)
Index Cond: (from_node = ANY (friends_of_friends_1.current))
Planning Time: 1.414 ms
Execution Time: 2588.288 ms
(19 rows)
I tested on PostgreSQL 15.2. For some reason I got a different slower query on HEAD:
SELECT * FROM friends_of_friends;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on friends_of_friends (cost=6576.67..6577.37 rows=1 width=8) (actual time=6412.693..6413.335 rows=1 loops=1)
Filter: (depth = 3)
Rows Removed by Filter: 3
CTE friends_of_friends
-> Recursive Union (cost=0.00..6576.67 rows=31 width=68) (actual time=0.008..6407.134 rows=4 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=68) (actual time=0.005..0.005 rows=1 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=219.05..657.64 rows=3 width=68) (actual time=1600.747..1600.934 rows=1 loops=4)
-> Nested Loop (cost=219.05..657.52 rows=3 width=68) (actual time=1594.906..1595.035 rows=1 loops=4)
-> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=68) (actual time=0.001..0.002 rows=1 loops=4)
Filter: (depth < 3)
Rows Removed by Filter: 0
-> Aggregate (cost=219.05..219.06 rows=1 width=32) (actual time=2118.105..2118.105 rows=1 loops=3)
-> Sort (cost=207.94..213.49 rows=2221 width=4) (actual time=1780.770..1925.853 rows=3486910 loops=3)
Sort Key: edges.to_node
Sort Method: quicksort Memory: 393217kB
-> Index Only Scan using edges_pkey on edges (cost=0.56..84.48 rows=2221 width=4) (actual time=0.077..762.408 rows=3486910 loops=3)
Index Cond: (from_node = ANY (friends_of_friends_1.current))
Heap Fetches: 0
Planning Time: 8.229 ms
Execution Time: 6446.421 ms
(20 rows)
On Thu, Jun 1, 2023, at 09:02, Joel Jacobson wrote:
Here is an example using a real anonymised social network.
I realised the "found" column is not necessary in this particular example,
since we only care about the friends at the exact depth level. Simplified query:
CREATE OR REPLACE VIEW friends_of_friends AS
WITH RECURSIVE friends_of_friends AS (
SELECT
ARRAY[5867::bigint] AS current,
0 AS depth
UNION ALL
SELECT
new_current,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
array_agg(DISTINCT edges.to_node) AS new_current
FROM
edges
WHERE
from_node = ANY(friends_of_friends.current)
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
depth,
coalesce(array_length(current, 1), 0)
FROM
friends_of_friends
WHERE
depth = 3;
;
-- PostgreSQL 15.2:
EXPLAIN ANALYZE SELECT * FROM friends_of_friends;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on friends_of_friends (cost=2687.88..2688.58 rows=1 width=8) (actual time=2076.362..2076.454 rows=1 loops=1)
Filter: (depth = 3)
Rows Removed by Filter: 3
CTE friends_of_friends
-> Recursive Union (cost=0.00..2687.88 rows=31 width=36) (actual time=0.008..2075.073 rows=4 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=36) (actual time=0.002..0.002 rows=1 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=89.44..268.75 rows=3 width=36) (actual time=518.613..518.622 rows=1 loops=4)
-> Nested Loop (cost=89.44..268.64 rows=3 width=36) (actual time=515.523..515.523 rows=1 loops=4)
-> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=36) (actual time=0.001..0.001 rows=1 loops=4)
Filter: (depth < 3)
Rows Removed by Filter: 0
-> Aggregate (cost=89.44..89.45 rows=1 width=32) (actual time=687.356..687.356 rows=1 loops=3)
-> Index Only Scan using edges_pkey on edges (cost=0.56..83.96 rows=2191 width=4) (actual time=0.139..290.996 rows=3486910 loops=3)
Index Cond: (from_node = ANY (friends_of_friends_1.current))
Heap Fetches: 0
Planning Time: 0.557 ms
Execution Time: 2076.990 ms
(17 rows)
On 2023-05-31 We 11:40, Joel Jacobson wrote:
On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote:
I think this needs a better explanation - what exactly is a hashset in
this context? Something like an array with a hash for faster lookup of
unique elements, or what?In this context, by "hashset" I am indeed referring to a data structure similar
to an array, where each element would be unique, and lookups would be faster
than arrays for larger number of elements due to hash-based lookups.
Yeah, a fast lookup set type has long been on my "blue sky" wish list.
So +1 for pursuing the idea.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On Wed, 31 May 2023 at 18:40, Joel Jacobson <joel@compiler.org> wrote:
On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote:
I think this needs a better explanation - what exactly is a hashset in
this context? Something like an array with a hash for faster lookup of
unique elements, or what?In this context, by "hashset" I am indeed referring to a data structure similar
to an array, where each element would be unique, and lookups would be faster
than arrays for larger number of elements due to hash-based lookups.This data structure would store identifiers (IDs) of the nodes, not the complete
nodes themselves.
Have you looked at roaring bitmaps? There is a pg_roaringbitmap
extension [1]https://github.com/ChenHuajun/pg_roaringbitmap already available that offers very fast unions,
intersections and membership tests over integer sets. I used it to get
some pretty impressive performance results for faceting search on
large document sets. [2]https://github.com/cybertec-postgresql/pgfaceting -- Ants Aasma www.cybertec-postgresql.com
Depending on the graph fan-outs and operations it might make sense in
the graph use case. For small sets it's probably not too different
from the intarray extension in contrib. But for finding intersections
over large sets (i.e. a join) it's very-very fast. If the workload is
traversal heavy it might make sense to even cache materialized
transitive closures up to some depth (a friend-of-a-friend list).
Roaring bitmaps only support int4 right now, but that is easily
fixable. And they need a relatively dense ID space to get the
performance boost, which seems essential to the approach. The latter
issue means that it can't be easily dropped into GIN or B-tree indexes
for ctid storage.
[1]: https://github.com/ChenHuajun/pg_roaringbitmap
[2]: https://github.com/cybertec-postgresql/pgfaceting -- Ants Aasma www.cybertec-postgresql.com
--
Ants Aasma
www.cybertec-postgresql.com
On 6/1/23 09:14, Joel Jacobson wrote:
On Thu, Jun 1, 2023, at 09:02, Joel Jacobson wrote:
Here is an example using a real anonymised social network.
I realised the "found" column is not necessary in this particular example,
since we only care about the friends at the exact depth level. Simplified query:CREATE OR REPLACE VIEW friends_of_friends AS
WITH RECURSIVE friends_of_friends AS (
SELECT
ARRAY[5867::bigint] AS current,
0 AS depth
UNION ALL
SELECT
new_current,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
array_agg(DISTINCT edges.to_node) AS new_current
FROM
edges
WHERE
from_node = ANY(friends_of_friends.current)
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
depth,
coalesce(array_length(current, 1), 0)
FROM
friends_of_friends
WHERE
depth = 3;
;-- PostgreSQL 15.2:
EXPLAIN ANALYZE SELECT * FROM friends_of_friends;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on friends_of_friends (cost=2687.88..2688.58 rows=1 width=8) (actual time=2076.362..2076.454 rows=1 loops=1)
Filter: (depth = 3)
Rows Removed by Filter: 3
CTE friends_of_friends
-> Recursive Union (cost=0.00..2687.88 rows=31 width=36) (actual time=0.008..2075.073 rows=4 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=36) (actual time=0.002..0.002 rows=1 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=89.44..268.75 rows=3 width=36) (actual time=518.613..518.622 rows=1 loops=4)
-> Nested Loop (cost=89.44..268.64 rows=3 width=36) (actual time=515.523..515.523 rows=1 loops=4)
-> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=36) (actual time=0.001..0.001 rows=1 loops=4)
Filter: (depth < 3)
Rows Removed by Filter: 0
-> Aggregate (cost=89.44..89.45 rows=1 width=32) (actual time=687.356..687.356 rows=1 loops=3)
-> Index Only Scan using edges_pkey on edges (cost=0.56..83.96 rows=2191 width=4) (actual time=0.139..290.996 rows=3486910 loops=3)
Index Cond: (from_node = ANY (friends_of_friends_1.current))
Heap Fetches: 0
Planning Time: 0.557 ms
Execution Time: 2076.990 ms
(17 rows)
I've been thinking about this a bit on the way back from pgcon. Per CPU
profile it seems most of the job is actually spent on qsort, calculating
the array_agg(distinct) bit. I don't think we build
We could replace this part by a hash-based set aggregate, which would be
faster, but I doubt that may yield a massive improvement that'd change
the fundamental cost.
I forgot I did something like that a couple years back, implementing a
count_distinct() aggregate that was meant as a faster alternative to
count(distinct). And then I mostly abandoned it because the built-in
sort-based approach improved significantly - it was still slower in
cases, but the gap got small enough.
Anyway, I hacked together a trivial set backed by an open addressing
hash table:
https://github.com/tvondra/hashset
It's super-basic, providing just some bare minimum of functions, but
hopefully good enough for experiments.
- hashset data type - hash table in varlena
- hashset_init - create new hashset
- hashset_add / hashset_contains - add value / check membership
- hashset_merge - merge two hashsets
- hashset_count - count elements
- hashset_to_array - return
- hashset(int) aggregate
This allows rewriting the recursive query like this:
WITH RECURSIVE friends_of_friends AS (
SELECT
ARRAY[5867::bigint] AS current,
0 AS depth
UNION ALL
SELECT
new_current,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
hashset_to_array(hashset(edges.to_node)) AS new_current
FROM
edges
WHERE
from_node = ANY(friends_of_friends.current)
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
depth,
coalesce(array_length(current, 1), 0)
FROM
friends_of_friends
WHERE
depth = 3;
On my laptop cuts the timing roughly in half - which is nice, but as I
said I don't think it's not a fundamental speedup. The aggregate can be
also parallelized, but I don't think that changes much.
Furthermore, this has a number of drawbacks too - e.g. it can't spill
data to disk, which might be an issue on more complex queries / larger
data sets.
FWIW I wonder how representative this query is, considering it returns
~1M node IDs, i.e. about 10% of the whole set of node IDs. Seems a bit
on the high side.
I've also experimented with doing stuff from plpgsql procedure (that's
what the non-aggregate functions are about). I saw this mostly as a way
to do stuff that'd be hard to do in the recursive CTE, but it has a lot
of additional execution overhead due to plpgsql. Maybe it we had some
smart trick to calculate adjacent nodes we could have a SRF written in C
to get rid of the overhead.
Anyway, this leads me to the question what graph databases are doing for
these queries, if they're faster in answering such queries (by how
much?). I'm not very familiar with that stuff, but surely they do
something smart - precalculating the data, some special indexing,
duplicating the data in multiple places?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 2, 2023, at 10:01, Ants Aasma wrote:
Have you looked at roaring bitmaps? There is a pg_roaringbitmap
extension [1] already available that offers very fast unions,
intersections and membership tests over integer sets. I used it to get
some pretty impressive performance results for faceting search on
large document sets. [2]
Many thanks for the tip!
New benchmark:
We already had since before:
wget https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz
gunzip soc-pokec-relationships.txt.gz
CREATE TABLE edges (from_node INT, to_node INT);
\copy edges from soc-pokec-relationships.txt;
ALTER TABLE edges ADD PRIMARY KEY (from_node, to_node);
I've created a new `users` table from the `edges` table,
with a new `friends` roaringbitmap column:
CREATE TABLE users AS
SELECT from_node AS id, rb_build_agg(to_node) AS friends FROM edges GROUP BY 1;
ALTER TABLE users ADD PRIMARY KEY (id);
Old query from before:
CREATE OR REPLACE VIEW friends_of_friends AS
WITH RECURSIVE friends_of_friends AS (
SELECT
ARRAY[5867::bigint] AS current,
0 AS depth
UNION ALL
SELECT
new_current,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
array_agg(DISTINCT edges.to_node) AS new_current
FROM
edges
WHERE
from_node = ANY(friends_of_friends.current)
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
coalesce(array_length(current, 1), 0) AS count_friends_at_depth_3
FROM
friends_of_friends
WHERE
depth = 3;
;
New roaringbitmap-based query using users instead:
CREATE OR REPLACE VIEW friends_of_friends_roaringbitmap AS
WITH RECURSIVE friends_of_friends AS
(
SELECT
friends,
1 AS depth
FROM users WHERE id = 5867
UNION ALL
SELECT
new_friends,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
rb_or_agg(users.friends) AS new_friends
FROM
users
WHERE
users.id = ANY(rb_to_array(friends_of_friends.friends))
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
rb_cardinality(friends) AS count_friends_at_depth_3
FROM
friends_of_friends
WHERE
depth = 3
;
Note, depth is 1 at first level since we already have user 5867's friends in the users column.
Maybe there is a better way to make use of the btree index on users.id,
than to convert the roaringbitmap to an array in order to use = ANY(...),
that is, this part: `users.id = ANY(rb_to_array(friends_of_friends.friends))`?
Or maybe there is some entirely different but equivalent way of writing the query
in a more efficient way?
SELECT * FROM friends_of_friends;
count_friends_at_depth_3
--------------------------
1035293
(1 row)
SELECT * FROM friends_of_friends_roaringbitmap;
count_friends_at_depth_3
--------------------------
1035293
(1 row)
EXPLAIN ANALYZE SELECT * FROM friends_of_friends;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on friends_of_friends (cost=5722.03..5722.73 rows=1 width=4) (actual time=2232.896..2233.289 rows=1 loops=1)
Filter: (depth = 3)
Rows Removed by Filter: 3
CTE friends_of_friends
-> Recursive Union (cost=0.00..5722.03 rows=31 width=36) (actual time=0.003..2228.707 rows=4 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=36) (actual time=0.001..0.001 rows=1 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=190.59..572.17 rows=3 width=36) (actual time=556.806..556.837 rows=1 loops=4)
-> Nested Loop (cost=190.59..572.06 rows=3 width=36) (actual time=553.748..553.748 rows=1 loops=4)
-> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=36) (actual time=0.000..0.001 rows=1 loops=4)
Filter: (depth < 3)
Rows Removed by Filter: 0
-> Aggregate (cost=190.59..190.60 rows=1 width=32) (actual time=737.427..737.427 rows=1 loops=3)
-> Sort (cost=179.45..185.02 rows=2227 width=4) (actual time=577.192..649.812 rows=3486910 loops=3)
Sort Key: edges.to_node
Sort Method: quicksort Memory: 393217kB
-> Index Only Scan using edges_pkey on edges (cost=0.56..55.62 rows=2227 width=4) (actual time=0.027..225.609 rows=3486910 loops=3)
Index Cond: (from_node = ANY (friends_of_friends_1.current))
Heap Fetches: 0
Planning Time: 0.294 ms
Execution Time: 2240.446 ms
EXPLAIN ANALYZE SELECT * FROM friends_of_friends_roaringbitmap;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on friends_of_friends (cost=799.30..800.00 rows=1 width=8) (actual time=492.925..492.930 rows=1 loops=1)
Filter: (depth = 3)
Rows Removed by Filter: 2
CTE friends_of_friends
-> Recursive Union (cost=0.43..799.30 rows=31 width=118) (actual time=0.061..492.842 rows=3 loops=1)
-> Index Scan using users_pkey on users (cost=0.43..2.65 rows=1 width=118) (actual time=0.060..0.062 rows=1 loops=1)
Index Cond: (id = 5867)
-> Nested Loop (cost=26.45..79.63 rows=3 width=36) (actual time=164.244..164.244 rows=1 loops=3)
-> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=36) (actual time=0.000..0.001 rows=1 loops=3)
Filter: (depth < 3)
Rows Removed by Filter: 0
-> Aggregate (cost=26.45..26.46 rows=1 width=32) (actual time=246.359..246.359 rows=1 loops=2)
-> Index Scan using users_pkey on users users_1 (cost=0.43..26.42 rows=10 width=114) (actual time=0.074..132.318 rows=116336 loops=2)
Index Cond: (id = ANY (rb_to_array(friends_of_friends_1.friends)))
Planning Time: 0.257 ms
Execution Time: 493.134 ms
On Mon, Jun 5, 2023, at 01:44, Tomas Vondra wrote:
Anyway, I hacked together a trivial set backed by an open addressing
hash table:https://github.com/tvondra/hashset
It's super-basic, providing just some bare minimum of functions, but
hopefully good enough for experiments.- hashset data type - hash table in varlena
- hashset_init - create new hashset
- hashset_add / hashset_contains - add value / check membership
- hashset_merge - merge two hashsets
- hashset_count - count elements
- hashset_to_array - return
- hashset(int) aggregateThis allows rewriting the recursive query like this:
WITH RECURSIVE friends_of_friends AS (
...
Nice! I get similar results, 737 ms (hashset) vs 1809 ms (array_agg).
I think if you just add one more hashset function, it will be a win against roaringbitmap, which is 400 ms.
The missing function is an agg that takes hashset as input and returns hashset, similar to roaringbitmap's rb_or_agg().
With such a function, we could add an adjacent nodes hashset column to the `nodes` table, which would eliminate the need to scan the edges table for graph traversal:
We could then benchmark roaringbitmap against hashset querying the same table:
CREATE TABLE users AS
SELECT
from_node AS id,
rb_build_agg(to_node) AS friends_roaringbitmap,
hashset(to_node) AS friends_hashset
FROM edges
GROUP BY 1;
/Joel
On 6/5/23 21:52, Joel Jacobson wrote:
On Mon, Jun 5, 2023, at 01:44, Tomas Vondra wrote:
Anyway, I hacked together a trivial set backed by an open addressing
hash table:https://github.com/tvondra/hashset
It's super-basic, providing just some bare minimum of functions, but
hopefully good enough for experiments.- hashset data type - hash table in varlena
- hashset_init - create new hashset
- hashset_add / hashset_contains - add value / check membership
- hashset_merge - merge two hashsets
- hashset_count - count elements
- hashset_to_array - return
- hashset(int) aggregateThis allows rewriting the recursive query like this:
WITH RECURSIVE friends_of_friends AS (
...
Nice! I get similar results, 737 ms (hashset) vs 1809 ms (array_agg).
I think if you just add one more hashset function, it will be a win against roaringbitmap, which is 400 ms.
The missing function is an agg that takes hashset as input and returns hashset, similar to roaringbitmap's rb_or_agg().
With such a function, we could add an adjacent nodes hashset column to the `nodes` table, which would eliminate the need to scan the edges table for graph traversal:
I added a trivial version of such aggregate hashset(hashset), and if I
rewrite the CTE like this:
WITH RECURSIVE friends_of_friends AS (
SELECT
(select hashset(v) from values (5867) as s(v)) AS current,
0 AS depth
UNION ALL
SELECT
new_current,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
hashset(edges.to_node) AS new_current
FROM
edges
WHERE
from_node =
ANY(hashset_to_array(friends_of_friends.current))
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
depth,
hashset_count(current)
FROM
friends_of_friends
WHERE
depth = 3;
it cuts the timing to about 50% on my laptop, so maybe it'll be ~300ms
on your system. There's a bunch of opportunities for more improvements,
as the hash table implementation is pretty naive/silly, the on-disk
format is wasteful and so on.
But before spending more time on that, it'd be interesting to know what
would be a competitive timing. I mean, what would be "good enough"? What
timings are achievable with graph databases?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 6, 2023, at 13:20, Tomas Vondra wrote:
it cuts the timing to about 50% on my laptop, so maybe it'll be ~300ms
on your system. There's a bunch of opportunities for more improvements,
as the hash table implementation is pretty naive/silly, the on-disk
format is wasteful and so on.But before spending more time on that, it'd be interesting to know what
would be a competitive timing. I mean, what would be "good enough"? What
timings are achievable with graph databases?
Your hashset is now almost exactly as fast as the corresponding roaringbitmap query, +/- 1 ms on my machine.
I tested Neo4j and the results are surprising; it appears to be significantly *slower*.
However, I've probably misunderstood something, maybe I need to add some index or something.
Even so, it's interesting it's apparently not fast "by default".
The query I tested:
MATCH (user:User {id: '5867'})-[:FRIENDS_WITH*3..3]->(fof)
RETURN COUNT(DISTINCT fof)
Here is how I loaded the data into it:
% pwd
/Users/joel/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-3837aa22-c830-4dcf-8668-ef8e302263c7
% head import/*
==> import/friendships.csv <==
1,13,FRIENDS_WITH
1,11,FRIENDS_WITH
1,6,FRIENDS_WITH
1,3,FRIENDS_WITH
1,4,FRIENDS_WITH
1,5,FRIENDS_WITH
1,15,FRIENDS_WITH
1,14,FRIENDS_WITH
1,7,FRIENDS_WITH
1,8,FRIENDS_WITH
==> import/friendships_header.csv <==
:START_ID(User),:END_ID(User),:TYPE
==> import/users.csv <==
1,User
2,User
3,User
4,User
5,User
6,User
7,User
8,User
9,User
10,User
==> import/users_header.csv <==
id:ID(User),:LABEL
% ./bin/neo4j-admin database import full --overwrite-destination --nodes=User=import/users_header.csv,import/users.csv --relationships=FRIENDS_WIDTH=import/friendships_header.csv,import/friendships.csv neo4j
/Joel
On 6/7/23 16:21, Joel Jacobson wrote:
On Tue, Jun 6, 2023, at 13:20, Tomas Vondra wrote:
it cuts the timing to about 50% on my laptop, so maybe it'll be ~300ms
on your system. There's a bunch of opportunities for more improvements,
as the hash table implementation is pretty naive/silly, the on-disk
format is wasteful and so on.But before spending more time on that, it'd be interesting to know what
would be a competitive timing. I mean, what would be "good enough"? What
timings are achievable with graph databases?Your hashset is now almost exactly as fast as the corresponding roaringbitmap query, +/- 1 ms on my machine.
Interesting, considering how dumb the the hash table implementation is.
I tested Neo4j and the results are surprising; it appears to be significantly *slower*.
However, I've probably misunderstood something, maybe I need to add some index or something.
Even so, it's interesting it's apparently not fast "by default".
No idea how to fix that, but it's rather suspicious.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 7, 2023, at 19:37, Tomas Vondra wrote:
Interesting, considering how dumb the the hash table implementation is.
That's promising.
I tested Neo4j and the results are surprising; it appears to be significantly *slower*.
However, I've probably misunderstood something, maybe I need to add some index or something.
Even so, it's interesting it's apparently not fast "by default".No idea how to fix that, but it's rather suspicious.
I've had a graph-db expert review my benchmark, and he suggested adding an index:
CREATE INDEX FOR (n:User) ON (n.id)
This did improve the execution time for Neo4j a bit, down from 819 ms to 528 ms, but PostgreSQL 299 ms is still a win.
Benchmark here: https://github.com/joelonsql/graph-query-benchmarks
Note, in this benchmark, I only test the naive RECURSIVE CTE approach using array_agg(DISTINCT ...).
And I couldn't even test the most connected user with Neo4j, the query never finish for some reason,
so I had to test with a less connected user.
The graph expert also said that other more realistic graph use-cases might be "multi-relational",
and pointed me to a link: https://github.com/totogo/awesome-knowledge-graph#knowledge-graph-dataset
No idea how such multi-relational datasets would affect the benchmarks.
I think we have already strong indicators that PostgreSQL with a hashset type will from a relative
performance perspective, do just fine processing basic graph queries, even with large datasets.
Then there will always be the case when users primarily write very different graph queries all day long,
who might prefer a graph query *language*, like SQL/PGQ in SQL:2023, Cypher or Gremlin.
/Joel
On 6/8/23 11:41, Joel Jacobson wrote:
On Wed, Jun 7, 2023, at 19:37, Tomas Vondra wrote:
Interesting, considering how dumb the the hash table implementation is.
That's promising.
Yeah, not bad for sleep-deprived on-plane hacking ...
There's a bunch of stuff that needs to be improved to make this properly
usable, like:
1) better hash table implementation
2) input/output functions
3) support for other types (now it only works with int32)
4) I wonder if this might be done as an array-like polymorphic type.
5) more efficient storage format, with versioning etc.
6) regression tests
Would you be interested in helping with / working on some of that? I
don't have immediate need for this stuff, so it's not very high on my
TODO list.
I tested Neo4j and the results are surprising; it appears to be significantly *slower*.
However, I've probably misunderstood something, maybe I need to add some index or something.
Even so, it's interesting it's apparently not fast "by default".No idea how to fix that, but it's rather suspicious.
I've had a graph-db expert review my benchmark, and he suggested adding an index:
CREATE INDEX FOR (n:User) ON (n.id)
This did improve the execution time for Neo4j a bit, down from 819 ms to 528 ms, but PostgreSQL 299 ms is still a win.
Benchmark here: https://github.com/joelonsql/graph-query-benchmarks
Note, in this benchmark, I only test the naive RECURSIVE CTE approach using array_agg(DISTINCT ...).
And I couldn't even test the most connected user with Neo4j, the query never finish for some reason,
so I had to test with a less connected user.
Interesting. I'd have expected the graph db to be much faster.
The graph expert also said that other more realistic graph use-cases might be "multi-relational",
and pointed me to a link: https://github.com/totogo/awesome-knowledge-graph#knowledge-graph-dataset
No idea how such multi-relational datasets would affect the benchmarks.
Not sure either, but I don't have ambition to improve everything at
once. If the hashset improves one practical use case, fine with me.
I think we have already strong indicators that PostgreSQL with a hashset type will from a relative
performance perspective, do just fine processing basic graph queries, even with large datasets.Then there will always be the case when users primarily write very different graph queries all day long,
who might prefer a graph query *language*, like SQL/PGQ in SQL:2023, Cypher or Gremlin.
Right. IMHO the query language is a separate thing, you still need to
evaluate the query somehow - which is where hashset applies.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote:
Would you be interested in helping with / working on some of that? I
don't have immediate need for this stuff, so it's not very high on my
TODO list.
Sure, I'm willing to help!
I've attached a patch that works on some of the items on your list,
including some additions to the README.md.
There were a bunch of places where `maxelements / 8` caused bugs,
that had to be changed to do proper integer ceiling division:
- values = (int32 *) (set->data + set->maxelements / 8);
+ values = (int32 *) (set->data + (set->maxelements + 7) / 8);
Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV macros
to the PostgreSQL source code in general, since it's easy to make this mistake,
and quite verbose/error-prone to write it out manually everywhere.
Such macros could simplify code in e.g. numeric.c.
There's a bunch of stuff that needs to be improved to make this properly
usable, like:1) better hash table implementation
TODO
2) input/output functions
I've attempted to implement these.
I thought comma separated values wrapped around curly braces felt as the most natural format,
example:
SELECT '{1,2,3}'::hashset;
3) support for other types (now it only works with int32)
TODO
4) I wonder if this might be done as an array-like polymorphic type.
That would be nice!
I guess the work-around would be to store the actual value of non-int type
in a lookup table, and then hash the int-based primary key in such table.
Do you think later implementing polymorphic type support would
mean a more or less complete rewrite, or can we carry on with int32-support
and add it later on?
5) more efficient storage format, with versioning etc.
TODO
6) regression tests
I've added some regression tests.
Right. IMHO the query language is a separate thing, you still need to
evaluate the query somehow - which is where hashset applies.
Good point, I fully agree.
/Joel
Attachments:
hashset-1.0.0-joel-0001.patchapplication/octet-stream; name=hashset-1.0.0-joel-0001.patchDownload
diff --git a/Makefile b/Makefile
index 6d1ccab..53d363e 100644
--- a/Makefile
+++ b/Makefile
@@ -7,7 +7,8 @@ MODULES = hashset
CFLAGS=`pg_config --includedir-server`
-REGRESS = basic cast conversions incremental parallel_query value_count_api trimmed_aggregates
+REGRESS = prelude basic random table
+
REGRESS_OPTS = --inputdir=test
PG_CONFIG = pg_config
diff --git a/README.md b/README.md
index 5805613..ad7f12b 100644
--- a/README.md
+++ b/README.md
@@ -6,22 +6,92 @@ providing a collection of integer items with fast lookup.
## Usage
-FIXME
+After installing the extension, you can use the `hashset` data type and
+associated functions within your PostgreSQL queries.
+
+To demonstrate the usage, let's consider a hypothetical table `users` which has
+a `user_id` and a `user_likes` of type `hashset`.
+
+Firstly, let's create the table:
+
+```sql
+CREATE TABLE users(
+ user_id int PRIMARY KEY,
+ user_likes hashset DEFAULT hashset_init(2)
+);
+```
+In the above statement, the `hashset_init(2)` initializes a hashset with initial
+capacity for 2 elements. The hashset will automatically resize itself when more
+elements are added beyond this initial capacity.
+
+Now, we can perform operations on this table. Here are some examples:
+
+```sql
+-- Insert a new user with id 1. The user_likes will automatically be initialized
+-- as an empty hashset
+INSERT INTO users (user_id) VALUES (1);
+
+-- Add elements (likes) for a user
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+
+-- Check if a user likes a particular item
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1; -- true
+
+-- Count the number of likes a user has
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1; -- 2
+```
+
+You can also use the aggregate functions to perform operations on multiple rows.
+For instance, you can add an integer to a `hashset`.
## Data types
-FIXME
+- **hashset**: This data type represents a set of integers. Internally, it uses
+a combination of a bitmap and a value array to store the elements in a set. It's
+a variable-length type.
-## Operators
+## Functions
-FIXME
+The extension provides the following functions:
+### hashset_add(hashset, int) -> hashset
+Adds an integer to a `hashset`.
-#### Casts
+### hashset_contains(hashset, int) -> boolean
+Checks if an integer is contained in a `hashset`.
-FIXME
+### hashset_count(hashset) -> bigint
+Returns the number of elements in a `hashset`.
+
+### hashset_merge(hashset, hashset) -> hashset
+Merges two `hashset`s into a single `hashset`.
+
+### hashset_to_array(hashset) -> integer[]
+Converts a `hashset` to an integer array.
+
+### hashset_init(int) -> hashset
+Initializes an empty `hashset` with a specified initial capacity for maximum
+elements. The argument determines the maximum number of elements the `hashset`
+can hold before it needs to resize.
+
+## Aggregate Functions
+
+### hashset(integer) -> hashset
+Generates a `hashset` from a series of integers, keeping only the unique ones.
+
+### hashset(hashset) -> hashset
+Merges multiple `hashset`s into a single `hashset`, preserving unique elements.
+
+
+## Installation
+
+To install the extension, run `make install` in the project root. Then, in your
+PostgreSQL connection, execute `CREATE EXTENSION hashset;`.
+
+This extension requires PostgreSQL version ?.? or later.
## License
diff --git a/hashset.c b/hashset.c
index 33b2133..d60ccc8 100644
--- a/hashset.c
+++ b/hashset.c
@@ -73,7 +73,6 @@ Datum hashset_agg_combine(PG_FUNCTION_ARGS);
Datum hashset_to_array(PG_FUNCTION_ARGS);
-/* allocate hashset with enough space for a requested number of centroids */
static hashset_t *
hashset_allocate(int maxelements)
{
@@ -85,7 +84,6 @@ hashset_allocate(int maxelements)
len += (maxelements + 7) / 8;
len += maxelements * sizeof(int32);
- /* we pre-allocate the array for all centroids and also the buffer for incoming data */
ptr = palloc0(len);
SET_VARSIZE(ptr, len);
@@ -95,7 +93,6 @@ hashset_allocate(int maxelements)
set->maxelements = maxelements;
set->nelements = 0;
- /* new tdigest are automatically storing mean */
set->flags |= 0;
return set;
@@ -104,30 +101,109 @@ hashset_allocate(int maxelements)
Datum
hashset_in(PG_FUNCTION_ARGS)
{
-// int i, r;
-// char *str = PG_GETARG_CSTRING(0);
-// hashset_t *set = NULL;
+ char *str = PG_GETARG_CSTRING(0);
+ char *endptr;
+ int32 len = strlen(str);
+ hashset_t *set;
- PG_RETURN_NULL();
+ /* Check the opening and closing braces */
+ if (str[0] != '{' || str[len - 1] != '}')
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for hashset: \"%s\"", str),
+ errdetail("Hashset representation must start with \"{\" and end with \"}\".")));
+ }
+
+ /* Start parsing from the first number (after the opening brace) */
+ str++;
+
+ /* Initial size based on input length (arbitrary, could be optimized) */
+ set = hashset_allocate(len/2);
+
+ while (true)
+ {
+ int32 value = strtol(str, &endptr, 10);
+
+ /* Add the value to the hashset, resize if needed */
+ if (set->nelements >= set->maxelements)
+ {
+ set = hashset_resize(set);
+ }
+ set = hashset_add_element(set, value);
+
+ /* Error handling for strtol */
+ if (endptr == str)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for integer: \"%s\"", str)));
+ }
+ else if (*endptr == ',')
+ {
+ str = endptr + 1; /* Move to the next number */
+ }
+ else if (*endptr == '}')
+ {
+ break; /* End of the hashset */
+ }
+ }
+
+ PG_RETURN_POINTER(set);
}
Datum
hashset_out(PG_FUNCTION_ARGS)
{
- //int i;
- //tdigest_t *digest = (tdigest_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
- StringInfoData str;
+ hashset_t *set = (hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
+ char *bitmap;
+ int32 *values;
+ int i;
+ StringInfoData str;
+ /* Calculate the pointer to the bitmap and values array */
+ bitmap = set->data;
+ values = (int32 *) (set->data + (set->maxelements + 7) / 8);
+
+ /* Initialize the StringInfo buffer */
initStringInfo(&str);
+ /* Append the opening brace for the output hashset string */
+ appendStringInfoChar(&str, '{');
+
+ /* Loop through the elements and append them to the string */
+ for (i = 0; i < set->maxelements; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ /* Check if the bit in the bitmap is set */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Append the value */
+ if (str.len > 1)
+ appendStringInfoChar(&str, ',');
+ appendStringInfo(&str, "%d", values[i]);
+ }
+ }
+
+ /* Append the closing brace for the output hashset string */
+ appendStringInfoChar(&str, '}');
+
+ /* Return the resulting string */
PG_RETURN_CSTRING(str.data);
}
Datum
hashset_recv(PG_FUNCTION_ARGS)
{
- //StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
- hashset_t *set= NULL;
+ StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
+ hashset_t *set;
+
+ set = (hashset_t *) palloc0(sizeof(hashset_t));
+ set->flags = pq_getmsgint(buf, 4);
+ set->maxelements = pq_getmsgint64(buf);
+ set->nelements = pq_getmsgint(buf, 4);
PG_RETURN_POINTER(set);
}
@@ -148,22 +224,6 @@ hashset_send(PG_FUNCTION_ARGS)
}
-/*
- * tdigest_to_array
- * Transform the tdigest into an array of double values.
- *
- * The whole digest is stored in a single "double precision" array, which
- * may be a bit confusing and perhaps fragile if more fields need to be
- * added in the future. The initial elements are flags, count (number of
- * items added to the digest), compression (determines the limit on number
- * of centroids) and current number of centroids. Follows stream of values
- * encoding the centroids in pairs of (mean, count).
- *
- * We make sure to always print mean, even for tdigests in the older format
- * storing sum for centroids. Otherwise the "mean" key would be confusing.
- * But we don't call tdigest_update_format, and instead we simply update the
- * flags and convert the sum/mean values.
- */
Datum
hashset_to_array(PG_FUNCTION_ARGS)
{
@@ -182,7 +242,7 @@ hashset_to_array(PG_FUNCTION_ARGS)
set = (hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
sbitmap = set->data;
- svalues = (int32 *) (set->data + set->maxelements / 8);
+ svalues = (int32 *) (set->data + (set->maxelements + 7) / 8);
/* number of values to store in the array */
nvalues = set->nelements;
@@ -231,13 +291,14 @@ int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len)
static hashset_t *
hashset_resize(hashset_t * set)
{
- int i;
- hashset_t *new = hashset_allocate(set->maxelements * 2);
- char *bitmap;
- int32 *values;
+ int i;
+ hashset_t *new = hashset_allocate(set->maxelements * 2);
+ char *bitmap;
+ int32 *values;
+ /* Calculate the pointer to the bitmap and values array */
bitmap = set->data;
- values = (int32 *) (set->data + set->maxelements / 8);
+ values = (int32 *) (set->data + (set->maxelements + 7) / 8);
for (i = 0; i < set->maxelements; i++)
{
@@ -266,7 +327,7 @@ hashset_add_element(hashset_t *set, int32 value)
hash = ((uint32) value * 7691 + 4201) % set->maxelements;
bitmap = set->data;
- values = (int32 *) (set->data + set->maxelements / 8);
+ values = (int32 *) (set->data + (set->maxelements + 7) / 8);
while (true)
{
@@ -308,7 +369,7 @@ hashset_contains_element(hashset_t *set, int32 value)
hash = ((uint32) value * 7691 + 4201) % set->maxelements;
bitmap = set->data;
- values = (int32 *) (set->data + set->maxelements / 8);
+ values = (int32 *) (set->data + (set->maxelements + 7) / 8);
while (true)
{
@@ -347,7 +408,10 @@ hashset_add(PG_FUNCTION_ARGS)
if (PG_ARGISNULL(0))
set = hashset_allocate(64);
else
- set = (hashset_t *) PG_GETARG_POINTER(0);
+ {
+ /* make sure we are working with a non-toasted and non-shared copy of the input */
+ set = (hashset_t *) PG_DETOAST_DATUM_COPY(PG_GETARG_DATUM(0));
+ }
set = hashset_add_element(set, PG_GETARG_INT32(1));
@@ -377,7 +441,7 @@ hashset_merge(PG_FUNCTION_ARGS)
setb = PG_GETARG_HASHSET(1);
bitmap = setb->data;
- values = (int32 *) (setb->data + setb->maxelements / 8);
+ values = (int32 *) (setb->data + (setb->maxelements + 7) / 8);
for (i = 0; i < setb->maxelements; i++)
{
@@ -439,7 +503,7 @@ hashset_agg_add(PG_FUNCTION_ARGS)
/*
* We want to skip NULL values altogether - we return either the existing
- * t-digest (if it already exists) or NULL.
+ * hashset (if it already exists) or NULL.
*/
if (PG_ARGISNULL(1))
{
@@ -450,7 +514,7 @@ hashset_agg_add(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(PG_GETARG_DATUM(0));
}
- /* if there's no digest allocated, create it now */
+ /* if there's no hashset allocated, create it now */
if (PG_ARGISNULL(0))
{
oldcontext = MemoryContextSwitchTo(aggcontext);
@@ -481,7 +545,7 @@ hashset_agg_add_set(PG_FUNCTION_ARGS)
/*
* We want to skip NULL values altogether - we return either the existing
- * t-digest (if it already exists) or NULL.
+ * hashset (if it already exists) or NULL.
*/
if (PG_ARGISNULL(1))
{
@@ -492,7 +556,7 @@ hashset_agg_add_set(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(PG_GETARG_DATUM(0));
}
- /* if there's no digest allocated, create it now */
+ /* if there's no hashset allocated, create it now */
if (PG_ARGISNULL(0))
{
oldcontext = MemoryContextSwitchTo(aggcontext);
@@ -513,7 +577,7 @@ hashset_agg_add_set(PG_FUNCTION_ARGS)
value = PG_GETARG_HASHSET(1);
bitmap = value->data;
- values = (int32 *) (value->data + value->maxelements / 8);
+ values = (int32 *) (value->data + (value->maxelements + 7) / 8);
for (i = 0; i < value->maxelements; i++)
{
@@ -558,7 +622,7 @@ hashset_agg_combine(PG_FUNCTION_ARGS)
int32 *values;
if (!AggCheckCallContext(fcinfo, &aggcontext))
- elog(ERROR, "tdigest_combine called in non-aggregate context");
+ elog(ERROR, "hashset_agg_combine called in non-aggregate context");
/* if no "merged" state yet, try creating it */
if (PG_ARGISNULL(0))
@@ -570,7 +634,7 @@ hashset_agg_combine(PG_FUNCTION_ARGS)
/* the second argument is not NULL, so copy it */
src = (hashset_t *) PG_GETARG_POINTER(1);
- /* copy the digest into the right long-lived memory context */
+ /* copy the hashset into the right long-lived memory context */
oldcontext = MemoryContextSwitchTo(aggcontext);
src = hashset_copy(src);
MemoryContextSwitchTo(oldcontext);
@@ -590,7 +654,7 @@ hashset_agg_combine(PG_FUNCTION_ARGS)
dst = (hashset_t *) PG_GETARG_POINTER(0);
bitmap = src->data;
- values = (int32 *) (src->data + src->maxelements / 8);
+ values = (int32 *) (src->data + (src->maxelements + 7) / 8);
for (i = 0; i < src->maxelements; i++)
{
diff --git a/hashset.control b/hashset.control
index 3d0b9a4..89bb1ff 100644
--- a/hashset.control
+++ b/hashset.control
@@ -1,3 +1,3 @@
-comment = 'Provides tdigest aggregate function.'
+comment = 'Provides hashset type.'
default_version = '1.0.0'
relocatable = true
diff --git a/test/expected/basic.out b/test/expected/basic.out
new file mode 100644
index 0000000..5be2501
--- /dev/null
+++ b/test/expected/basic.out
@@ -0,0 +1,96 @@
+SELECT hashset_sorted('{1}'::hashset);
+ hashset_sorted
+----------------
+ {1}
+(1 row)
+
+SELECT hashset_sorted('{1,2}'::hashset);
+ hashset_sorted
+----------------
+ {1,2}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3}'::hashset);
+ hashset_sorted
+----------------
+ {1,2,3}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4}'::hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5}'::hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6}'::hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::hashset);
+ hashset_sorted
+-----------------
+ {1,2,3,4,5,6,7}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::hashset);
+ hashset_sorted
+-------------------
+ {1,2,3,4,5,6,7,8}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::hashset);
+ hashset_sorted
+---------------------
+ {1,2,3,4,5,6,7,8,9}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::hashset);
+ hashset_sorted
+------------------------
+ {1,2,3,4,5,6,7,8,9,10}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::hashset);
+ hashset_sorted
+---------------------------
+ {1,2,3,4,5,6,7,8,9,10,11}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::hashset);
+ hashset_sorted
+------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::hashset);
+ hashset_sorted
+---------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::hashset);
+ hashset_sorted
+------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::hashset);
+ hashset_sorted
+---------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::hashset);
+ hashset_sorted
+------------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
+(1 row)
+
diff --git a/test/expected/prelude.out b/test/expected/prelude.out
new file mode 100644
index 0000000..b094033
--- /dev/null
+++ b/test/expected/prelude.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION hashset;
+CREATE OR REPLACE FUNCTION hashset_sorted(hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/expected/random.out b/test/expected/random.out
new file mode 100644
index 0000000..889f5ca
--- /dev/null
+++ b/test/expected/random.out
@@ -0,0 +1,38 @@
+SELECT setseed(0.12345);
+ setseed
+---------
+
+(1 row)
+
+\set MAX_INT 2147483647
+CREATE TABLE hashset_random_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
diff --git a/test/expected/table.out b/test/expected/table.out
new file mode 100644
index 0000000..8d4fbe2
--- /dev/null
+++ b/test/expected/table.out
@@ -0,0 +1,25 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes hashset DEFAULT hashset_init(2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+ hashset_contains
+------------------
+ t
+(1 row)
+
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1;
+ hashset_count
+---------------
+ 2
+(1 row)
+
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
+ hashset_sorted
+----------------
+ {101,202}
+(1 row)
+
diff --git a/test/sql/basic.sql b/test/sql/basic.sql
new file mode 100644
index 0000000..662e65a
--- /dev/null
+++ b/test/sql/basic.sql
@@ -0,0 +1,16 @@
+SELECT hashset_sorted('{1}'::hashset);
+SELECT hashset_sorted('{1,2}'::hashset);
+SELECT hashset_sorted('{1,2,3}'::hashset);
+SELECT hashset_sorted('{1,2,3,4}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::hashset);
diff --git a/test/sql/prelude.sql b/test/sql/prelude.sql
new file mode 100644
index 0000000..ccc0595
--- /dev/null
+++ b/test/sql/prelude.sql
@@ -0,0 +1,8 @@
+CREATE EXTENSION hashset;
+
+CREATE OR REPLACE FUNCTION hashset_sorted(hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/sql/random.sql b/test/sql/random.sql
new file mode 100644
index 0000000..16c9084
--- /dev/null
+++ b/test/sql/random.sql
@@ -0,0 +1,27 @@
+SELECT setseed(0.12345);
+
+\set MAX_INT 2147483647
+
+CREATE TABLE hashset_random_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_numbers
+) q;
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_numbers
+) q;
diff --git a/test/sql/table.sql b/test/sql/table.sql
new file mode 100644
index 0000000..d848207
--- /dev/null
+++ b/test/sql/table.sql
@@ -0,0 +1,10 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes hashset DEFAULT hashset_init(2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1;
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
On Fri, Jun 9, 2023 at 6:58 PM Joel Jacobson <joel@compiler.org> wrote:
On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote:
Would you be interested in helping with / working on some of that? I
don't have immediate need for this stuff, so it's not very high on my
TODO list.Sure, I'm willing to help!
I've attached a patch that works on some of the items on your list,
including some additions to the README.md.There were a bunch of places where `maxelements / 8` caused bugs,
that had to be changed to do proper integer ceiling division:- values = (int32 *) (set->data + set->maxelements / 8); + values = (int32 *) (set->data + (set->maxelements + 7) / 8);Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV
macros
to the PostgreSQL source code in general, since it's easy to make this
mistake,
and quite verbose/error-prone to write it out manually everywhere.
Such macros could simplify code in e.g. numeric.c.There's a bunch of stuff that needs to be improved to make this properly
usable, like:1) better hash table implementation
TODO
2) input/output functions
I've attempted to implement these.
I thought comma separated values wrapped around curly braces felt as the
most natural format,
example:
SELECT '{1,2,3}'::hashset;3) support for other types (now it only works with int32)
TODO
4) I wonder if this might be done as an array-like polymorphic type.
That would be nice!
I guess the work-around would be to store the actual value of non-int type
in a lookup table, and then hash the int-based primary key in such table.Do you think later implementing polymorphic type support would
mean a more or less complete rewrite, or can we carry on with int32-support
and add it later on?5) more efficient storage format, with versioning etc.
TODO
6) regression tests
I've added some regression tests.
Right. IMHO the query language is a separate thing, you still need to
evaluate the query somehow - which is where hashset applies.Good point, I fully agree.
/Joel
Hi, I am quite new about C.....
The following function I have 3 questions.
1. 7691,4201, I assume they are just random prime ints?
2. I don't get the last return set, even the return type should be bool.
3. I don't understand 13 in hash = (hash + 13) % set->maxelements;
static bool
hashset_contains_element(hashset_t *set, int32 value)
{
int byte;
int bit;
uint32 hash;
char *bitmap;
int32 *values;
hash = ((uint32) value * 7691 + 4201) % set->maxelements;
bitmap = set->data;
values = (int32 *) (set->data + set->maxelements / 8);
while (true)
{
byte = (hash / 8);
bit = (hash % 8);
/* found an empty slot, value is not there */
if ((bitmap[byte] & (0x01 << bit)) == 0)
return false;
/* is it the same value? */
if (values[hash] == value)
return true;
/* move to the next element */
hash = (hash + 13) % set->maxelements;
}
return set;
}
On Fri, Jun 9, 2023, at 13:33, jian he wrote:
Hi, I am quite new about C.....
The following function I have 3 questions.
1. 7691,4201, I assume they are just random prime ints?
Yes, 7691 and 4201 are likely chosen as random prime numbers.
In hash functions, prime numbers are often used to help in evenly distributing
the hash values across the range and reduce the chance of collisions.
2. I don't get the last return set, even the return type should be bool.
Thanks, you found a mistake!
The line
return set;
is actually unreachable and should be removed.
The function will always return either true or false within the while loop and
never reach the final return statement.
I've attached a new incremental patch with this fix.
3. I don't understand 13 in hash = (hash + 13) % set->maxelements;
The value 13 is used for linear probing [1]https://en.wikipedia.org/wiki/Linear_probing in handling hash collisions.
Linear probing sequentially checks the next slot in the array when a collision
occurs. 13, being a small prime number not near a power of 2, helps in uniformly
distributing data and ensuring that all slots are probed, as it's relatively prime
to the hash table size.
Hm, I realise we actually don't ensure the hash table size and step size (13)
are coprime. I've fixed that in the attached patch as well.
[1]: https://en.wikipedia.org/wiki/Linear_probing
/Joel
Attachments:
hashset-1.0.0-joel-0002.patchapplication/octet-stream; name=hashset-1.0.0-joel-0002.patchDownload
diff --git a/hashset.c b/hashset.c
index d60ccc8..93c8d8c 100644
--- a/hashset.c
+++ b/hashset.c
@@ -80,6 +80,15 @@ hashset_allocate(int maxelements)
hashset_t *set;
char *ptr;
+ /*
+ * Ensure that maxelements is not divisible by 13;
+ * i.e. the step size used in hashset_add_element()
+ * and hashset_contains_element().
+ */
+ while (maxelements % 13 == 0) {
+ maxelements++;
+ }
+
len = offsetof(hashset_t, data);
len += (maxelements + 7) / 8;
len += maxelements * sizeof(int32);
@@ -387,8 +396,6 @@ hashset_contains_element(hashset_t *set, int32 value)
/* move to the next element */
hash = (hash + 13) % set->maxelements;
}
-
- return set;
}
Datum
On 2023-06-09 Fr 07:56, Joel Jacobson wrote:
On Fri, Jun 9, 2023, at 13:33, jian he wrote:
Hi, I am quite new about C.....
The following function I have 3 questions.
1. 7691,4201, I assume they are just random prime ints?Yes, 7691 and 4201 are likely chosen as random prime numbers.
In hash functions, prime numbers are often used to help in evenly
distributing
the hash values across the range and reduce the chance of collisions.2. I don't get the last return set, even the return type should be bool.
Thanks, you found a mistake!
The line
return set;
is actually unreachable and should be removed.
The function will always return either true or false within the while
loop and
never reach the final return statement.I've attached a new incremental patch with this fix.
3. I don't understand 13 in hash = (hash + 13) % set->maxelements;
The value 13 is used for linear probing [1] in handling hash collisions.
Linear probing sequentially checks the next slot in the array when a
collision
occurs. 13, being a small prime number not near a power of 2, helps in
uniformly
distributing data and ensuring that all slots are probed, as it's
relatively prime
to the hash table size.Hm, I realise we actually don't ensure the hash table size and step
size (13)
are coprime. I've fixed that in the attached patch as well.
Maybe you can post a full patch as well as incremental?
Stylistically I think you should reduce reliance on magic numbers (like
13). Probably need some #define's?
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
in funcion. hashset_in
int32 value = strtol(str, &endptr, 10);
there is no int32 value range check?
imitate src/backend/utils/adt/int.c. the following way is what I came up
with.
int64 value = strtol(str, &endptr, 10);
if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
ereturn(fcinfo->context, (Datum) 0,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
errmsg("value \"%s\" is out of range for type %s", str,
"integer")));
set = hashset_add_element(set, (int32)value);
also it will infinity loop in hashset_in if supply the wrong value....
example select '{1,2s}'::hashset;
I need kill -9 to kill the process.
On 6/9/23 12:58, Joel Jacobson wrote:
On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote:
Would you be interested in helping with / working on some of that? I
don't have immediate need for this stuff, so it's not very high on my
TODO list.Sure, I'm willing to help!
I've attached a patch that works on some of the items on your list,
including some additions to the README.md.There were a bunch of places where `maxelements / 8` caused bugs,
that had to be changed to do proper integer ceiling division:- values = (int32 *) (set->data + set->maxelements / 8); + values = (int32 *) (set->data + (set->maxelements + 7) / 8);Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV macros
to the PostgreSQL source code in general, since it's easy to make this mistake,
and quite verbose/error-prone to write it out manually everywhere.
Such macros could simplify code in e.g. numeric.c.
Yeah, it'd be good to have macros to calculate the sizes. We'll need
that in many places.
There's a bunch of stuff that needs to be improved to make this properly
usable, like:1) better hash table implementation
TODO
2) input/output functions
I've attempted to implement these.
I thought comma separated values wrapped around curly braces felt as the most natural format,
example:
SELECT '{1,2,3}'::hashset;
+1 to that. I'd mimic the array in/out functions as much as possible.
3) support for other types (now it only works with int32)
TODO
I think we should decide what types we want/need to support, and add one
or two types early. Otherwise we'll have code / on-disk format making
various assumptions about the type length etc.
I have no idea what types people use as node IDs - is it likely we'll
need to support types passed by reference / varlena types? Or can we
just assume it's int/bigint?
4) I wonder if this might be done as an array-like polymorphic type.
That would be nice!
I guess the work-around would be to store the actual value of non-int type
in a lookup table, and then hash the int-based primary key in such table.Do you think later implementing polymorphic type support would
mean a more or less complete rewrite, or can we carry on with int32-support
and add it later on?
I don't know. From the storage perspective it doesn't matter much, I
think, we would not need to change that. But I think adding a
polymorphic type (array-like) would require changes to grammar, and
that's not possible for an extension. If there's a way, I'm not aware of
it and I don't recall an extension doing that.
5) more efficient storage format, with versioning etc.
TODO
I think the main question is whether to serialize the hash table as is,
or compact it in some way. The current code just uses the same thing for
both cases - on-disk format and in-memory representation (aggstate).
That's simple, but it also means the on-disk value is likely not well
compressible (because it's ~50% random data.
I've thought about serializing just the values (as a simple array), but
that defeats the whole purpose of fast membership checks. I have two ideas:
a) sort the data, and use binary search for this compact variant (and
then expand it into "full" hash table if needed)
b) use a more compact hash table (with load factor much closer to 1.0)
Not sure which if this option is the right one, each has cost for
converting between formats (but large on-disk value is not free either).
That's roughly what I did for the tdigest extension.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/10/23 17:46, Andrew Dunstan wrote:
On 2023-06-09 Fr 07:56, Joel Jacobson wrote:
On Fri, Jun 9, 2023, at 13:33, jian he wrote:
Hi, I am quite new about C.....
The following function I have 3 questions.
1. 7691,4201, I assume they are just random prime ints?Yes, 7691 and 4201 are likely chosen as random prime numbers.
In hash functions, prime numbers are often used to help in evenly
distributing
the hash values across the range and reduce the chance of collisions.2. I don't get the last return set, even the return type should be bool.
Thanks, you found a mistake!
The line
return set;
is actually unreachable and should be removed.
The function will always return either true or false within the while
loop and
never reach the final return statement.I've attached a new incremental patch with this fix.
3. I don't understand 13 in hash = (hash + 13) % set->maxelements;
The value 13 is used for linear probing [1] in handling hash collisions.
Linear probing sequentially checks the next slot in the array when a
collision
occurs. 13, being a small prime number not near a power of 2, helps in
uniformly
distributing data and ensuring that all slots are probed, as it's
relatively prime
to the hash table size.Hm, I realise we actually don't ensure the hash table size and step
size (13)
are coprime. I've fixed that in the attached patch as well.Maybe you can post a full patch as well as incremental?
I wonder if we should keep discussing this extension here, considering
it's going to be out of core (at least for now). Not sure how many
pgsql-hackers are interested in this, so maybe we should just move it to
github PRs or something ...
Stylistically I think you should reduce reliance on magic numbers (like
13). Probably need some #define's?
Yeah, absolutely. This was just pure laziness.
regard
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jun 10, 2023, at 22:26, Tomas Vondra wrote:
On 6/10/23 17:46, Andrew Dunstan wrote:
Maybe you can post a full patch as well as incremental?
I wonder if we should keep discussing this extension here, considering
it's going to be out of core (at least for now). Not sure how many
pgsql-hackers are interested in this, so maybe we should just move it to
github PRs or something ...
I think there are some good arguments that speaks in favour of including it in core:
1. It's a fundamental data structure. Perhaps "set" would have been a better name,
since the use of hash functions from an end-user perspective is implementation
details, but we cannot use that word since it's a reserved keyword, hence "hashset".
2. The addition of SQL/PGQ in SQL:2023 is evidence of a general perceived need
among SQL users to evaluate graph queries. Even if a future implementation of SQL/PGQ
would mean users wouldn't need to deal with the hashset type directly, the same
type could hopefully be used, in part or in whole, under the hood by the future
SQL/PGQ implementation. If low-level functionality is useful on its own, I think it's
a benefit of exposing it to users, since it allows system testing of the component
in isolation, even if it's primarily gonna be used as a smaller part of a larger more
high-level component.
3. I think there is a general need for hashset, experienced by myself, Andrew and
I would guess lots of others users. The general pattern that will be improved is
when you currently would do array_agg(DISTINCT ...)
probably there are other situations too, since it's a fundamental data structure.
On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
3) support for other types (now it only works with int32)
I think we should decide what types we want/need to support, and add one
or two types early. Otherwise we'll have code / on-disk format making
various assumptions about the type length etc.I have no idea what types people use as node IDs - is it likely we'll
need to support types passed by reference / varlena types? Or can we
just assume it's int/bigint?
I think we should just support data types that would be sensible
to use as a PRIMARY KEY in a fully normalised data model,
which I believe would only include "int", "bigint" and "uuid".
/Joel
On 2023-06-11 Su 06:26, Joel Jacobson wrote:
On Sat, Jun 10, 2023, at 22:26, Tomas Vondra wrote:
On 6/10/23 17:46, Andrew Dunstan wrote:
Maybe you can post a full patch as well as incremental?
I wonder if we should keep discussing this extension here, considering
it's going to be out of core (at least for now). Not sure how many
pgsql-hackers are interested in this, so maybe we should just move it to
github PRs or something ...I think there are some good arguments that speaks in favour of including it in core:
1. It's a fundamental data structure.
That's reason enough IMNSHO.
Perhaps "set" would have been a better name,
since the use of hash functions from an end-user perspective is implementation
details, but we cannot use that word since it's a reserved keyword, hence "hashset".
Rather than use "hashset", which as you say is based on an
implementation detail, I would prefer something like "integer_set" -
what it's a set of.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On 6/11/23 12:26, Joel Jacobson wrote:
On Sat, Jun 10, 2023, at 22:26, Tomas Vondra wrote:
On 6/10/23 17:46, Andrew Dunstan wrote:
Maybe you can post a full patch as well as incremental?
I wonder if we should keep discussing this extension here, considering
it's going to be out of core (at least for now). Not sure how many
pgsql-hackers are interested in this, so maybe we should just move it to
github PRs or something ...I think there are some good arguments that speaks in favour of including it in core:
1. It's a fundamental data structure. Perhaps "set" would have been a better name,
since the use of hash functions from an end-user perspective is implementation
details, but we cannot use that word since it's a reserved keyword, hence "hashset".2. The addition of SQL/PGQ in SQL:2023 is evidence of a general perceived need
among SQL users to evaluate graph queries. Even if a future implementation of SQL/PGQ
would mean users wouldn't need to deal with the hashset type directly, the same
type could hopefully be used, in part or in whole, under the hood by the future
SQL/PGQ implementation. If low-level functionality is useful on its own, I think it's
a benefit of exposing it to users, since it allows system testing of the component
in isolation, even if it's primarily gonna be used as a smaller part of a larger more
high-level component.3. I think there is a general need for hashset, experienced by myself, Andrew and
I would guess lots of others users. The general pattern that will be improved is
when you currently would do array_agg(DISTINCT ...)
probably there are other situations too, since it's a fundamental data structure.
I agree with all of that, but ...
It's just past feature freeze, so the earliest release this could appear
in is 17, about 15 months away.
Once stuff gets added to core, it's tied to the release cycle, so no new
features in between.
Presumably people would like to use the extension in the release they
already use, without backporting.
Postgres is extensible for a reason, exactly so that we don't need to
have everything in core.
On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
3) support for other types (now it only works with int32)
I think we should decide what types we want/need to support, and add one
or two types early. Otherwise we'll have code / on-disk format making
various assumptions about the type length etc.I have no idea what types people use as node IDs - is it likely we'll
need to support types passed by reference / varlena types? Or can we
just assume it's int/bigint?I think we should just support data types that would be sensible
to use as a PRIMARY KEY in a fully normalised data model,
which I believe would only include "int", "bigint" and "uuid".
OK, makes sense.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jun 11, 2023, at 16:58, Andrew Dunstan wrote:
On 2023-06-11 Su 06:26, Joel Jacobson wrote:
Perhaps "set" would have been a better name, since the use of hash functions from an end-user perspective is >>implementation details, but we cannot use that word since it's a reserved keyword, hence "hashset".Rather than use "hashset", which as you say is based on an implementation detail, I would prefer something like
"integer_set" - what it's a set of.
Apologies for the confusion previously.
Upon further reflection, I recognize that the term "hash" in "hashset"
extends beyond mere representation of implementation.
It imparts key information about performance characteristics as well as functionality inherent to the set.
In hindsight, "hashset" does emerge as the most suitable terminology.
/Joel
On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:
On 6/11/23 12:26, Joel Jacobson wrote:
I think there are some good arguments that speaks in favour of including it in core:
...
I agree with all of that, but ...
It's just past feature freeze, so the earliest release this could appear
in is 17, about 15 months away.Once stuff gets added to core, it's tied to the release cycle, so no new
features in between.Presumably people would like to use the extension in the release they
already use, without backporting.Postgres is extensible for a reason, exactly so that we don't need to
have everything in core.
Interesting, I've never thought about this one before:
What if something is deemed to be fundamental and therefore qualify for core inclusion,
and at the same time is suitable to be made an extension.
Would it be possible to ship such extension as pre-installed?
What was the json/jsonb story, was it ever an extension before
being included in core?
/Joel
On 2023-06-11 Su 16:15, Joel Jacobson wrote:
On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:
On 6/11/23 12:26, Joel Jacobson wrote:
I think there are some good arguments that speaks in favour of including it in core:
...
I agree with all of that, but ...
It's just past feature freeze, so the earliest release this could appear
in is 17, about 15 months away.Once stuff gets added to core, it's tied to the release cycle, so no new
features in between.Presumably people would like to use the extension in the release they
already use, without backporting.Postgres is extensible for a reason, exactly so that we don't need to
have everything in core.Interesting, I've never thought about this one before:
What if something is deemed to be fundamental and therefore qualify for core inclusion,
and at the same time is suitable to be made an extension.
Would it be possible to ship such extension as pre-installed?What was the json/jsonb story, was it ever an extension before
being included in core?
No, and the difficulty is that an in-core type and associated functions
will have different oids, so migrating from one to the other would be at
best painful.
So it's a kind of now or never decision. I think extensions are
excellent for specialized types. But I don't regard a set type in that
light.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On 6/11/23 22:15, Joel Jacobson wrote:
On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:
On 6/11/23 12:26, Joel Jacobson wrote:
I think there are some good arguments that speaks in favour of including it in core:
...
I agree with all of that, but ...
It's just past feature freeze, so the earliest release this could appear
in is 17, about 15 months away.Once stuff gets added to core, it's tied to the release cycle, so no new
features in between.Presumably people would like to use the extension in the release they
already use, without backporting.Postgres is extensible for a reason, exactly so that we don't need to
have everything in core.Interesting, I've never thought about this one before:
What if something is deemed to be fundamental and therefore qualify for core inclusion,
and at the same time is suitable to be made an extension.
Would it be possible to ship such extension as pre-installed?
I think it's always a matter of judgment - I don't think there's some
clear set into a stone. If something is "fundamental" and can be done in
an extension, there's always the option to have it in contrib (with all
the limitations I mentioned).
FWIW I'm not strictly against adding it to contrib, if there's agreement
it's worth it. But if we consider it to be a fundamental data structure
and widely useful, maybe we should consider making it a built-in data
type, with fixed OID etc. That'd allow support at the SQL grammar level,
and perhaps also from the proposed SQL/PGQ (and GQL?). AFAIK moving data
types from extension (even if from contrib) to core is not painless.
Either way it might be a nice / useful first patch, I guess.
What was the json/jsonb story, was it ever an extension before
being included in core?
I don't recall what the exact story was, but I guess the "json" type was
added to core very long ago (before we started to push back a bit), and
we added some SQL grammar stuff too, which can't be done from extension.
So when we added jsonb (much later than json), there wasn't much point
in not having it in core.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/12/23 14:46, Andrew Dunstan wrote:
On 2023-06-11 Su 16:15, Joel Jacobson wrote:
On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:
On 6/11/23 12:26, Joel Jacobson wrote:
I think there are some good arguments that speaks in favour of including it in core:
...
I agree with all of that, but ...
It's just past feature freeze, so the earliest release this could appear
in is 17, about 15 months away.Once stuff gets added to core, it's tied to the release cycle, so no new
features in between.Presumably people would like to use the extension in the release they
already use, without backporting.Postgres is extensible for a reason, exactly so that we don't need to
have everything in core.Interesting, I've never thought about this one before:
What if something is deemed to be fundamental and therefore qualify for core inclusion,
and at the same time is suitable to be made an extension.
Would it be possible to ship such extension as pre-installed?What was the json/jsonb story, was it ever an extension before
being included in core?No, and the difficulty is that an in-core type and associated functions
will have different oids, so migrating from one to the other would be at
best painful.So it's a kind of now or never decision. I think extensions are
excellent for specialized types. But I don't regard a set type in that
light.
Perhaps. So you're proposing to have this as a regular built-in type?
It's hard for me to judge how popular this feature would be, but I guess
people often use arrays while they actually want set semantics ...
If we do that, I wonder if we could do something similar to arrays, with
the polymorphism and SQL grammar support. Does SQL have any concept of
sets (or arrays, for that matter) as data types?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2023-06-12 Mo 09:00, Tomas Vondra wrote:
What was the json/jsonb story, was it ever an extension before
being included in core?I don't recall what the exact story was, but I guess the "json" type was
added to core very long ago (before we started to push back a bit), and
we added some SQL grammar stuff too, which can't be done from extension.
So when we added jsonb (much later than json), there wasn't much point
in not having it in core.
Not quite.
The json type as added in 9.2 (Sept 2012) and jsonb in 9.4 (Dec 2014). I
wouldn't call those far apart or very long ago. Neither included any
grammar changes AFAIR.
But if they had been added as extensions we'd probably be in a whole lot
more trouble now implementing SQL/JSON, so whether that was foresight or
laziness I think the we landed on our feet there.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On Sat, Jun 10, 2023, at 17:46, Andrew Dunstan wrote:
Maybe you can post a full patch as well as incremental?
Attached patch is based on tvondra's last commit 375b072.
Stylistically I think you should reduce reliance on magic numbers (like 13). Probably need some #define's?
Great idea, fixed, I've added a HASHSET_STEP definition (set to the value 13).
On Sat, Jun 10, 2023, at 17:51, jian he wrote:
int32 value = strtol(str, &endptr, 10);
there is no int32 value range check?
imitate src/backend/utils/adt/int.c. the following way is what I came up with.int64 value = strtol(str, &endptr, 10);
if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
Thanks, fixed like suggested, except I used PG_INT32_MIN and PG_INT32_MAX,
which explicitly represent the maximum value for a 32-bit integer,
regardless of the platform or C implementation.
also it will infinity loop in hashset_in if supply the wrong value....
example select '{1,2s}'::hashset;
I need kill -9 to kill the process.
Thanks. I've added a new test, `sql/invalid.sql` with that example query.
Here is a summary of all other changes:
* README.md: Added sections Usage, Data types, Functions and Aggregate Functions
* Added test/ directory with some tests.
* Added "not production-ready" notice at top of README, warning for breaking
changes and no migration scripts, until our first release.
* Change version from 1.0.0 to 0.0.1 to indicate current status.
* Added CEIL_DIV macro
* Implemented hashset_in(), hashset_out()
The syntax for the serialized format is comma separated integer values
wrapped around curly braces, e.g '{1,2,3}'
* Implemented hashset_recv() to match the existing hashset_send()
* Removed/rewrote some tdigest related comments
/Joel
Attachments:
hashset-0.0.1-a8a282a.patchapplication/octet-stream; name=hashset-0.0.1-a8a282a.patchDownload
commit a8a282a76e5f617648eac1bba3bffd7a9302ab00
Author: Joel Jakobsson <joel@compiler.org>
Date: Mon Jun 12 17:33:07 2023 +0200
Enhanced README, added tests, updated version, and improved hashset serialization
* README.md: Added sections Usage, Data types, Functions and Aggregate Functions
* Added test/ directory with some tests.
* Added "not production-ready" notice at top of README, warning for breaking
changes and no migration scripts, until our first release.
* Change version from 1.0.0 to 0.0.1 to indicate current status.
* Added CEIL_DIV macro and HASHSET_STEP def
* Implemented hashset_in(), hashset_out()
The syntax for the serialized format is comma separated integer values
wrapped around curly braces, e.g '{1,2,3}'
* Implemented hashset_recv() to match the existing hashset_send()
* Removed/rewrote some tdigest related comments
diff --git a/Makefile b/Makefile
index 6d1ccab..ce88b4b 100644
--- a/Makefile
+++ b/Makefile
@@ -2,12 +2,13 @@ MODULE_big = hashset
OBJS = hashset.o
EXTENSION = hashset
-DATA = hashset--1.0.0.sql
+DATA = hashset--0.0.1.sql
MODULES = hashset
CFLAGS=`pg_config --includedir-server`
-REGRESS = basic cast conversions incremental parallel_query value_count_api trimmed_aggregates
+REGRESS = prelude basic random table invalid
+
REGRESS_OPTS = --inputdir=test
PG_CONFIG = pg_config
diff --git a/README.md b/README.md
index 5805613..bd386ad 100644
--- a/README.md
+++ b/README.md
@@ -3,25 +3,99 @@
This PostgreSQL extension implements hashset, a data structure (type)
providing a collection of integer items with fast lookup.
+🚧 **NOTICE** 🚧 This repository is currently under active development and the hashset
+PostgreSQL extension is **not production-ready**. As the codebase is evolving
+with possible breaking changes, we are not providing any migration scripts
+until we reach our first release.
## Usage
-FIXME
+After installing the extension, you can use the `hashset` data type and
+associated functions within your PostgreSQL queries.
+
+To demonstrate the usage, let's consider a hypothetical table `users` which has
+a `user_id` and a `user_likes` of type `hashset`.
+
+Firstly, let's create the table:
+
+```sql
+CREATE TABLE users(
+ user_id int PRIMARY KEY,
+ user_likes hashset DEFAULT hashset_init(2)
+);
+```
+In the above statement, the `hashset_init(2)` initializes a hashset with initial
+capacity for 2 elements. The hashset will automatically resize itself when more
+elements are added beyond this initial capacity.
+
+Now, we can perform operations on this table. Here are some examples:
+
+```sql
+-- Insert a new user with id 1. The user_likes will automatically be initialized
+-- as an empty hashset
+INSERT INTO users (user_id) VALUES (1);
+
+-- Add elements (likes) for a user
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+
+-- Check if a user likes a particular item
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1; -- true
+
+-- Count the number of likes a user has
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1; -- 2
+```
+
+You can also use the aggregate functions to perform operations on multiple rows.
+For instance, you can add an integer to a `hashset`.
## Data types
-FIXME
+- **hashset**: This data type represents a set of integers. Internally, it uses
+a combination of a bitmap and a value array to store the elements in a set. It's
+a variable-length type.
-## Operators
+## Functions
-FIXME
+The extension provides the following functions:
+### hashset_add(hashset, int) -> hashset
+Adds an integer to a `hashset`.
-#### Casts
+### hashset_contains(hashset, int) -> boolean
+Checks if an integer is contained in a `hashset`.
-FIXME
+### hashset_count(hashset) -> bigint
+Returns the number of elements in a `hashset`.
+
+### hashset_merge(hashset, hashset) -> hashset
+Merges two `hashset`s into a single `hashset`.
+
+### hashset_to_array(hashset) -> integer[]
+Converts a `hashset` to an integer array.
+
+### hashset_init(int) -> hashset
+Initializes an empty `hashset` with a specified initial capacity for maximum
+elements. The argument determines the maximum number of elements the `hashset`
+can hold before it needs to resize.
+
+## Aggregate Functions
+
+### hashset(integer) -> hashset
+Generates a `hashset` from a series of integers, keeping only the unique ones.
+
+### hashset(hashset) -> hashset
+Merges multiple `hashset`s into a single `hashset`, preserving unique elements.
+
+
+## Installation
+
+To install the extension, run `make install` in the project root. Then, in your
+PostgreSQL connection, execute `CREATE EXTENSION hashset;`.
+
+This extension requires PostgreSQL version ?.? or later.
## License
diff --git a/hashset--1.0.0.sql b/hashset--0.0.1.sql
similarity index 100%
rename from hashset--1.0.0.sql
rename to hashset--0.0.1.sql
diff --git a/hashset.c b/hashset.c
index 33b2133..7278e68 100644
--- a/hashset.c
+++ b/hashset.c
@@ -40,6 +40,8 @@ static bool hashset_contains_element(hashset_t *set, int32 value);
static Datum int32_to_array(FunctionCallInfo fcinfo, int32 * d, int len);
#define PG_GETARG_HASHSET(x) (hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(x))
+#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
+#define HASHSET_STEP 13
PG_FUNCTION_INFO_V1(hashset_in);
PG_FUNCTION_INFO_V1(hashset_out);
@@ -73,7 +75,6 @@ Datum hashset_agg_combine(PG_FUNCTION_ARGS);
Datum hashset_to_array(PG_FUNCTION_ARGS);
-/* allocate hashset with enough space for a requested number of centroids */
static hashset_t *
hashset_allocate(int maxelements)
{
@@ -81,11 +82,19 @@ hashset_allocate(int maxelements)
hashset_t *set;
char *ptr;
+ /*
+ * Ensure that maxelements is not divisible by HASHSET_STEP;
+ * i.e. the step size used in hashset_add_element()
+ * and hashset_contains_element().
+ */
+ while (maxelements % HASHSET_STEP == 0) {
+ maxelements++;
+ }
+
len = offsetof(hashset_t, data);
- len += (maxelements + 7) / 8;
+ len += CEIL_DIV(maxelements, 8);
len += maxelements * sizeof(int32);
- /* we pre-allocate the array for all centroids and also the buffer for incoming data */
ptr = palloc0(len);
SET_VARSIZE(ptr, len);
@@ -95,7 +104,6 @@ hashset_allocate(int maxelements)
set->maxelements = maxelements;
set->nelements = 0;
- /* new tdigest are automatically storing mean */
set->flags |= 0;
return set;
@@ -104,30 +112,124 @@ hashset_allocate(int maxelements)
Datum
hashset_in(PG_FUNCTION_ARGS)
{
-// int i, r;
-// char *str = PG_GETARG_CSTRING(0);
-// hashset_t *set = NULL;
+ char *str = PG_GETARG_CSTRING(0);
+ char *endptr;
+ int32 len = strlen(str);
+ hashset_t *set;
- PG_RETURN_NULL();
+ /* Check the opening and closing braces */
+ if (str[0] != '{' || str[len - 1] != '}')
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for hashset: \"%s\"", str),
+ errdetail("Hashset representation must start with \"{\" and end with \"}\".")));
+ }
+
+ /* Start parsing from the first number (after the opening brace) */
+ str++;
+
+ /* Initial size based on input length (arbitrary, could be optimized) */
+ set = hashset_allocate(len/2);
+
+ while (true)
+ {
+ int64 value = strtol(str, &endptr, 10);
+
+ if (errno == ERANGE || value < PG_INT32_MIN || value > PG_INT32_MAX)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value \"%s\" is out of range for type %s", str,
+ "integer")));
+ }
+
+ /* Add the value to the hashset, resize if needed */
+ if (set->nelements >= set->maxelements)
+ {
+ set = hashset_resize(set);
+ }
+ set = hashset_add_element(set, (int32)value);
+
+ /* Error handling for strtol */
+ if (endptr == str)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for integer: \"%s\"", str)));
+ }
+ else if (*endptr == ',')
+ {
+ str = endptr + 1; /* Move to the next number */
+ }
+ else if (*endptr == '}')
+ {
+ break; /* End of the hashset */
+ }
+ else
+ {
+ /* Unexpected character */
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("unexpected character \"%c\" in hashset input", *endptr)));
+ }
+ }
+
+ PG_RETURN_POINTER(set);
}
Datum
hashset_out(PG_FUNCTION_ARGS)
{
- //int i;
- //tdigest_t *digest = (tdigest_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
- StringInfoData str;
+ hashset_t *set = (hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
+ char *bitmap;
+ int32 *values;
+ int i;
+ StringInfoData str;
+ /* Calculate the pointer to the bitmap and values array */
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
+
+ /* Initialize the StringInfo buffer */
initStringInfo(&str);
+ /* Append the opening brace for the output hashset string */
+ appendStringInfoChar(&str, '{');
+
+ /* Loop through the elements and append them to the string */
+ for (i = 0; i < set->maxelements; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ /* Check if the bit in the bitmap is set */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Append the value */
+ if (str.len > 1)
+ appendStringInfoChar(&str, ',');
+ appendStringInfo(&str, "%d", values[i]);
+ }
+ }
+
+ /* Append the closing brace for the output hashset string */
+ appendStringInfoChar(&str, '}');
+
+ /* Return the resulting string */
PG_RETURN_CSTRING(str.data);
}
Datum
hashset_recv(PG_FUNCTION_ARGS)
{
- //StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
- hashset_t *set= NULL;
+ StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
+ hashset_t *set;
+
+ set = (hashset_t *) palloc0(sizeof(hashset_t));
+ set->flags = pq_getmsgint(buf, 4);
+ set->maxelements = pq_getmsgint64(buf);
+ set->nelements = pq_getmsgint(buf, 4);
PG_RETURN_POINTER(set);
}
@@ -148,22 +250,6 @@ hashset_send(PG_FUNCTION_ARGS)
}
-/*
- * tdigest_to_array
- * Transform the tdigest into an array of double values.
- *
- * The whole digest is stored in a single "double precision" array, which
- * may be a bit confusing and perhaps fragile if more fields need to be
- * added in the future. The initial elements are flags, count (number of
- * items added to the digest), compression (determines the limit on number
- * of centroids) and current number of centroids. Follows stream of values
- * encoding the centroids in pairs of (mean, count).
- *
- * We make sure to always print mean, even for tdigests in the older format
- * storing sum for centroids. Otherwise the "mean" key would be confusing.
- * But we don't call tdigest_update_format, and instead we simply update the
- * flags and convert the sum/mean values.
- */
Datum
hashset_to_array(PG_FUNCTION_ARGS)
{
@@ -182,7 +268,7 @@ hashset_to_array(PG_FUNCTION_ARGS)
set = (hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
sbitmap = set->data;
- svalues = (int32 *) (set->data + set->maxelements / 8);
+ svalues = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
/* number of values to store in the array */
nvalues = set->nelements;
@@ -231,13 +317,14 @@ int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len)
static hashset_t *
hashset_resize(hashset_t * set)
{
- int i;
- hashset_t *new = hashset_allocate(set->maxelements * 2);
- char *bitmap;
- int32 *values;
+ int i;
+ hashset_t *new = hashset_allocate(set->maxelements * 2);
+ char *bitmap;
+ int32 *values;
+ /* Calculate the pointer to the bitmap and values array */
bitmap = set->data;
- values = (int32 *) (set->data + set->maxelements / 8);
+ values = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
for (i = 0; i < set->maxelements; i++)
{
@@ -266,7 +353,7 @@ hashset_add_element(hashset_t *set, int32 value)
hash = ((uint32) value * 7691 + 4201) % set->maxelements;
bitmap = set->data;
- values = (int32 *) (set->data + set->maxelements / 8);
+ values = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
while (true)
{
@@ -280,7 +367,7 @@ hashset_add_element(hashset_t *set, int32 value)
if (values[hash] == value)
break;
- hash = (hash + 13) % set->maxelements;
+ hash = (hash + HASHSET_STEP) % set->maxelements;
continue;
}
@@ -308,7 +395,7 @@ hashset_contains_element(hashset_t *set, int32 value)
hash = ((uint32) value * 7691 + 4201) % set->maxelements;
bitmap = set->data;
- values = (int32 *) (set->data + set->maxelements / 8);
+ values = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
while (true)
{
@@ -324,10 +411,8 @@ hashset_contains_element(hashset_t *set, int32 value)
return true;
/* move to the next element */
- hash = (hash + 13) % set->maxelements;
+ hash = (hash + HASHSET_STEP) % set->maxelements;
}
-
- return set;
}
Datum
@@ -347,7 +432,10 @@ hashset_add(PG_FUNCTION_ARGS)
if (PG_ARGISNULL(0))
set = hashset_allocate(64);
else
- set = (hashset_t *) PG_GETARG_POINTER(0);
+ {
+ /* make sure we are working with a non-toasted and non-shared copy of the input */
+ set = (hashset_t *) PG_DETOAST_DATUM_COPY(PG_GETARG_DATUM(0));
+ }
set = hashset_add_element(set, PG_GETARG_INT32(1));
@@ -377,7 +465,7 @@ hashset_merge(PG_FUNCTION_ARGS)
setb = PG_GETARG_HASHSET(1);
bitmap = setb->data;
- values = (int32 *) (setb->data + setb->maxelements / 8);
+ values = (int32 *) (setb->data + CEIL_DIV(setb->maxelements, 8));
for (i = 0; i < setb->maxelements; i++)
{
@@ -439,7 +527,7 @@ hashset_agg_add(PG_FUNCTION_ARGS)
/*
* We want to skip NULL values altogether - we return either the existing
- * t-digest (if it already exists) or NULL.
+ * hashset (if it already exists) or NULL.
*/
if (PG_ARGISNULL(1))
{
@@ -450,7 +538,7 @@ hashset_agg_add(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(PG_GETARG_DATUM(0));
}
- /* if there's no digest allocated, create it now */
+ /* if there's no hashset allocated, create it now */
if (PG_ARGISNULL(0))
{
oldcontext = MemoryContextSwitchTo(aggcontext);
@@ -481,7 +569,7 @@ hashset_agg_add_set(PG_FUNCTION_ARGS)
/*
* We want to skip NULL values altogether - we return either the existing
- * t-digest (if it already exists) or NULL.
+ * hashset (if it already exists) or NULL.
*/
if (PG_ARGISNULL(1))
{
@@ -492,7 +580,7 @@ hashset_agg_add_set(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(PG_GETARG_DATUM(0));
}
- /* if there's no digest allocated, create it now */
+ /* if there's no hashset allocated, create it now */
if (PG_ARGISNULL(0))
{
oldcontext = MemoryContextSwitchTo(aggcontext);
@@ -513,7 +601,7 @@ hashset_agg_add_set(PG_FUNCTION_ARGS)
value = PG_GETARG_HASHSET(1);
bitmap = value->data;
- values = (int32 *) (value->data + value->maxelements / 8);
+ values = (int32 *) (value->data + CEIL_DIV(value->maxelements, 8));
for (i = 0; i < value->maxelements; i++)
{
@@ -558,7 +646,7 @@ hashset_agg_combine(PG_FUNCTION_ARGS)
int32 *values;
if (!AggCheckCallContext(fcinfo, &aggcontext))
- elog(ERROR, "tdigest_combine called in non-aggregate context");
+ elog(ERROR, "hashset_agg_combine called in non-aggregate context");
/* if no "merged" state yet, try creating it */
if (PG_ARGISNULL(0))
@@ -570,7 +658,7 @@ hashset_agg_combine(PG_FUNCTION_ARGS)
/* the second argument is not NULL, so copy it */
src = (hashset_t *) PG_GETARG_POINTER(1);
- /* copy the digest into the right long-lived memory context */
+ /* copy the hashset into the right long-lived memory context */
oldcontext = MemoryContextSwitchTo(aggcontext);
src = hashset_copy(src);
MemoryContextSwitchTo(oldcontext);
@@ -590,7 +678,7 @@ hashset_agg_combine(PG_FUNCTION_ARGS)
dst = (hashset_t *) PG_GETARG_POINTER(0);
bitmap = src->data;
- values = (int32 *) (src->data + src->maxelements / 8);
+ values = (int32 *) (src->data + CEIL_DIV(src->maxelements, 8));
for (i = 0; i < src->maxelements; i++)
{
diff --git a/hashset.control b/hashset.control
index 3d0b9a4..0743003 100644
--- a/hashset.control
+++ b/hashset.control
@@ -1,3 +1,3 @@
-comment = 'Provides tdigest aggregate function.'
-default_version = '1.0.0'
+comment = 'Provides hashset type.'
+default_version = '0.0.1'
relocatable = true
diff --git a/test/expected/basic.out b/test/expected/basic.out
new file mode 100644
index 0000000..5be2501
--- /dev/null
+++ b/test/expected/basic.out
@@ -0,0 +1,96 @@
+SELECT hashset_sorted('{1}'::hashset);
+ hashset_sorted
+----------------
+ {1}
+(1 row)
+
+SELECT hashset_sorted('{1,2}'::hashset);
+ hashset_sorted
+----------------
+ {1,2}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3}'::hashset);
+ hashset_sorted
+----------------
+ {1,2,3}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4}'::hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5}'::hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6}'::hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::hashset);
+ hashset_sorted
+-----------------
+ {1,2,3,4,5,6,7}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::hashset);
+ hashset_sorted
+-------------------
+ {1,2,3,4,5,6,7,8}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::hashset);
+ hashset_sorted
+---------------------
+ {1,2,3,4,5,6,7,8,9}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::hashset);
+ hashset_sorted
+------------------------
+ {1,2,3,4,5,6,7,8,9,10}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::hashset);
+ hashset_sorted
+---------------------------
+ {1,2,3,4,5,6,7,8,9,10,11}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::hashset);
+ hashset_sorted
+------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::hashset);
+ hashset_sorted
+---------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::hashset);
+ hashset_sorted
+------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::hashset);
+ hashset_sorted
+---------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::hashset);
+ hashset_sorted
+------------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
+(1 row)
+
diff --git a/test/expected/invalid.out b/test/expected/invalid.out
new file mode 100644
index 0000000..0e08925
--- /dev/null
+++ b/test/expected/invalid.out
@@ -0,0 +1,4 @@
+SELECT '{1,2s}'::hashset;
+ERROR: unexpected character "s" in hashset input
+LINE 1: SELECT '{1,2s}'::hashset;
+ ^
diff --git a/test/expected/prelude.out b/test/expected/prelude.out
new file mode 100644
index 0000000..b094033
--- /dev/null
+++ b/test/expected/prelude.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION hashset;
+CREATE OR REPLACE FUNCTION hashset_sorted(hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/expected/random.out b/test/expected/random.out
new file mode 100644
index 0000000..889f5ca
--- /dev/null
+++ b/test/expected/random.out
@@ -0,0 +1,38 @@
+SELECT setseed(0.12345);
+ setseed
+---------
+
+(1 row)
+
+\set MAX_INT 2147483647
+CREATE TABLE hashset_random_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
diff --git a/test/expected/table.out b/test/expected/table.out
new file mode 100644
index 0000000..8d4fbe2
--- /dev/null
+++ b/test/expected/table.out
@@ -0,0 +1,25 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes hashset DEFAULT hashset_init(2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+ hashset_contains
+------------------
+ t
+(1 row)
+
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1;
+ hashset_count
+---------------
+ 2
+(1 row)
+
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
+ hashset_sorted
+----------------
+ {101,202}
+(1 row)
+
diff --git a/test/sql/basic.sql b/test/sql/basic.sql
new file mode 100644
index 0000000..662e65a
--- /dev/null
+++ b/test/sql/basic.sql
@@ -0,0 +1,16 @@
+SELECT hashset_sorted('{1}'::hashset);
+SELECT hashset_sorted('{1,2}'::hashset);
+SELECT hashset_sorted('{1,2,3}'::hashset);
+SELECT hashset_sorted('{1,2,3,4}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::hashset);
diff --git a/test/sql/invalid.sql b/test/sql/invalid.sql
new file mode 100644
index 0000000..f1a9488
--- /dev/null
+++ b/test/sql/invalid.sql
@@ -0,0 +1 @@
+SELECT '{1,2s}'::hashset;
diff --git a/test/sql/prelude.sql b/test/sql/prelude.sql
new file mode 100644
index 0000000..ccc0595
--- /dev/null
+++ b/test/sql/prelude.sql
@@ -0,0 +1,8 @@
+CREATE EXTENSION hashset;
+
+CREATE OR REPLACE FUNCTION hashset_sorted(hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/sql/random.sql b/test/sql/random.sql
new file mode 100644
index 0000000..16c9084
--- /dev/null
+++ b/test/sql/random.sql
@@ -0,0 +1,27 @@
+SELECT setseed(0.12345);
+
+\set MAX_INT 2147483647
+
+CREATE TABLE hashset_random_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_numbers
+) q;
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_numbers
+) q;
diff --git a/test/sql/table.sql b/test/sql/table.sql
new file mode 100644
index 0000000..d848207
--- /dev/null
+++ b/test/sql/table.sql
@@ -0,0 +1,10 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes hashset DEFAULT hashset_init(2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1;
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
1) better hash table implementation
I noticed src/include/common/hashfn.h provides implementation
of the Jenkins/lookup3 hash function, and thought maybe
we could simply use it in hashset?
However, I noticed that according to SMHasher [1] https://github.com/rurban/smhasher,
Jenkins/lookup3 has some quality problems:
"UB, 28% bias, collisions, 30% distr, BIC"
Not sure if that's true or maybe not a problem in the PostgreSQL implementation?
According to SHHasher, the two fastest 32/64-bit hash functions
for non-cryptographic purposes without any quality problems
that are also portable seems to be these two:
wyhash v4.1 (64-bit) [2]https://github.com/wangyi-fudan/wyhash
MiB/sec: 22513.04
cycl./hash: 29.01
size: 474
xxh3low (xxHash v3, 64-bit, low 32-bits part) [3] https://github.com/Cyan4973/xxHash
MiB/sec: 20722.94
cycl./hash: 30.26
size: 756
[1]: https://github.com/rurban/smhasher
[2]: https://github.com/wangyi-fudan/wyhash
[3]: https://github.com/Cyan4973/xxHash
5) more efficient storage format, with versioning etc.
I think the main question is whether to serialize the hash table as is,
or compact it in some way. The current code just uses the same thing for
both cases - on-disk format and in-memory representation (aggstate).
That's simple, but it also means the on-disk value is likely not well
compressible (because it's ~50% random data.I've thought about serializing just the values (as a simple array), but
that defeats the whole purpose of fast membership checks. I have two ideas:a) sort the data, and use binary search for this compact variant (and
then expand it into "full" hash table if needed)b) use a more compact hash table (with load factor much closer to 1.0)
Not sure which if this option is the right one, each has cost for
converting between formats (but large on-disk value is not free either).That's roughly what I did for the tdigest extension.
Is the choice of hash function (and it's in-memory representation)
independent of the on-disk format question, i.e. could we work
on experimenting and evaluating different hash functions first,
to optimise for speed and quality, and then when done, proceed
and optimise for space, or are the two intertwined somehow?
/Joel
On 6/12/23 19:34, Joel Jacobson wrote:
On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
1) better hash table implementation
I noticed src/include/common/hashfn.h provides implementation
of the Jenkins/lookup3 hash function, and thought maybe
we could simply use it in hashset?However, I noticed that according to SMHasher [1],
Jenkins/lookup3 has some quality problems:
"UB, 28% bias, collisions, 30% distr, BIC"Not sure if that's true or maybe not a problem in the PostgreSQL implementation?
According to SHHasher, the two fastest 32/64-bit hash functions
for non-cryptographic purposes without any quality problems
that are also portable seems to be these two:wyhash v4.1 (64-bit) [2]
MiB/sec: 22513.04
cycl./hash: 29.01
size: 474xxh3low (xxHash v3, 64-bit, low 32-bits part) [3]
MiB/sec: 20722.94
cycl./hash: 30.26
size: 756[1] https://github.com/rurban/smhasher
[2] https://github.com/wangyi-fudan/wyhash
[3] https://github.com/Cyan4973/xxHash
But those are numbers for large keys - if you restrict the input to
4B-16B (which is what we planned to do by focusing on int, bigint and
uuid), there's no significant difference:
lookup3:
Small key speed test - 4-byte keys - 30.17 cycles/hash
Small key speed test - 8-byte keys - 31.00 cycles/hash
Small key speed test - 16-byte keys - 49.00 cycles/hash
xxh3low:
Small key speed test - 4-byte keys - 29.00 cycles/hash
Small key speed test - 8-byte keys - 29.58 cycles/hash
Small key speed test - 16-byte keys - 37.00 cycles/hash
But you can try doing some measurements, of course. Or just do profiling
to see how much time we spend in the hash function - I'd bet it's pretty
tiny fraction of the total time.
As for the "quality" issues - it's the same algorithm in Postgres, so it
has the same issues. I don't if that has measurable impact, though. I'd
guess it does not, particularly for "reasonably small" sets.
5) more efficient storage format, with versioning etc.
I think the main question is whether to serialize the hash table as is,
or compact it in some way. The current code just uses the same thing for
both cases - on-disk format and in-memory representation (aggstate).
That's simple, but it also means the on-disk value is likely not well
compressible (because it's ~50% random data.I've thought about serializing just the values (as a simple array), but
that defeats the whole purpose of fast membership checks. I have two ideas:a) sort the data, and use binary search for this compact variant (and
then expand it into "full" hash table if needed)b) use a more compact hash table (with load factor much closer to 1.0)
Not sure which if this option is the right one, each has cost for
converting between formats (but large on-disk value is not free either).That's roughly what I did for the tdigest extension.
Is the choice of hash function (and it's in-memory representation)
independent of the on-disk format question, i.e. could we work
on experimenting and evaluating different hash functions first,
to optimise for speed and quality, and then when done, proceed
and optimise for space, or are the two intertwined somehow?
Not sure what you mean by "optimizing for space" - I imagined the
on-disk format would just use the same hash table with tiny amount of
free space (say 10% and not ~%50).
My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
(through hash_bytes or something), and at the same time make it possible
to switch to a different function in the future. I'd store and ID of the
hash function in the set, so that we can support a different algorithm
in the future, if we choose to.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 12, 2023, at 21:58, Tomas Vondra wrote:
But those are numbers for large keys - if you restrict the input to
4B-16B (which is what we planned to do by focusing on int, bigint and
uuid), there's no significant difference:
Oh, sorry, I completely failed to read the meaning of the Columns.
lookup3:
Small key speed test - 4-byte keys - 30.17 cycles/hash
Small key speed test - 8-byte keys - 31.00 cycles/hash
Small key speed test - 16-byte keys - 49.00 cycles/hashxxh3low:
Small key speed test - 4-byte keys - 29.00 cycles/hash
Small key speed test - 8-byte keys - 29.58 cycles/hash
Small key speed test - 16-byte keys - 37.00 cycles/hash
The winner of the "Small key speed test" competition seems to be:
ahash64 "ahash 64bit":
Small key speed test - 4-byte keys - 24.00 cycles/hash
Small key speed test - 8-byte keys - 24.00 cycles/hash
Small key speed test - 16-byte keys - 26.98 cycles/hash
Looks like it's a popular one, e.g. it's used by Rust in their std::collections::HashSet.
Another notable property of ahash64 is that it's "DOS resistant",
but it isn't crucial for our use case, since we exclusively target
auto-generated primary keys which are not influenced by end-users.
Not sure what you mean by "optimizing for space" - I imagined the
on-disk format would just use the same hash table with tiny amount of
free space (say 10% and not ~%50).
With "optimizing for space" I meant trying to find some alternative or
intermediate data structure that is more likely to be compressible,
like your idea of sorting the data.
What I wondered was if the on-disk format would be affected by
the choice of hash function. I guess it wouldn't, if the hashset
is created by adding the elements one-by-one by iterating
through the elements by reading the on-disk format.
But I thought maybe something more advanced could be
done, where conversion between the on-disk format
and the in-memory format could be done without naively
iterating through all elements, i.e. something less complex
than O(n).
No idea what that mechanism would be though.
My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
(through hash_bytes or something), and at the same time make it possible
to switch to a different function in the future. I'd store and ID of the
hash function in the set, so that we can support a different algorithm
in the future, if we choose to.
Sounds good to me.
Smart idea to include the hash function algorithm ID in the set,
to allow implementing a different one in the future!
/Joel
On Mon, Jun 12, 2023, at 22:36, Joel Jacobson wrote:
On Mon, Jun 12, 2023, at 21:58, Tomas Vondra wrote:
My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
(through hash_bytes or something), and at the same time make it possible
to switch to a different function in the future. I'd store and ID of the
hash function in the set, so that we can support a different algorithm
in the future, if we choose to.
hashset is now using hash_bytes_uint32() from hashfn.h
Other changes in the same commit:
* Introduce hashfn_id field to specify hash function ID
* Implement hashset_send and hashset_recv and add C-test using libpq
* Add operators and operator classes for hashset comparison, sorting
and distinct queries
Looks good? If so, I wonder what's best to focus on next?
Perhaps adding support for bigint? Other ideas?
/Joel
Attachments:
hashset-0.0.1-da3b024.patchapplication/octet-stream; name=hashset-0.0.1-da3b024.patchDownload
commit da3b0242f3a11fc351e5eeb53883cdffe15f55d9
Author: Joel Jakobsson <joel@compiler.org>
Date: Tue Jun 13 20:12:15 2023 +0200
Switch to using hashfn.h, implement operators and send/recv
* Introduce hashfn_id field in hashset to specify hash function ID
* Replace custom hash function with Jenkins/lookup3 from hashfn.h
* Implement hashset_send and hashset_recv and add C-test using libpq
* Add operators and operator classes for hashset comparison, sorting
and distinct queries
diff --git a/.gitignore b/.gitignore
index 0767074..91f216e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,5 @@ results/
**/*.so
regression.diffs
regression.out
+.vscode
+test/c_tests/test_send_recv
diff --git a/Makefile b/Makefile
index ce88b4b..928e211 100644
--- a/Makefile
+++ b/Makefile
@@ -5,12 +5,29 @@ EXTENSION = hashset
DATA = hashset--0.0.1.sql
MODULES = hashset
-CFLAGS=`pg_config --includedir-server`
-
-REGRESS = prelude basic random table invalid
+# Keep the CFLAGS separate
+SERVER_INCLUDES=-I$(shell pg_config --includedir-server)
+CLIENT_INCLUDES=-I$(shell pg_config --includedir)
+LIBRARY_PATH = -L$(shell pg_config --libdir)
+REGRESS = prelude basic random table invalid order
REGRESS_OPTS = --inputdir=test
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
+
+C_TESTS_DIR = test/c_tests
+
+EXTRA_CLEAN = $(C_TESTS_DIR)/test_send_recv
+
+c_tests: $(C_TESTS_DIR)/test_send_recv
+
+$(C_TESTS_DIR)/test_send_recv: $(C_TESTS_DIR)/test_send_recv.c
+ $(CC) $(SERVER_INCLUDES) $(CLIENT_INCLUDES) -o $@ $< $(LIBRARY_PATH) -lpq
+
+run_c_tests: c_tests
+ cd $(C_TESTS_DIR) && ./test_send_recv.sh
+
+check: all $(REGRESS_PREP) run_c_tests
+
include $(PGXS)
diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
index a01ef48..7b79b73 100644
--- a/hashset--0.0.1.sql
+++ b/hashset--0.0.1.sql
@@ -107,3 +107,118 @@ CREATE AGGREGATE hashset(hashset) (
COMBINEFUNC = hashset_agg_combine,
PARALLEL = SAFE
);
+
+
+CREATE OR REPLACE FUNCTION hashset_equals(hashset, hashset)
+ RETURNS bool
+ AS 'hashset', 'hashset_equals'
+ LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR = (
+ LEFTARG = hashset,
+ RIGHTARG = hashset,
+ PROCEDURE = hashset_equals,
+ COMMUTATOR = =,
+ HASHES
+);
+
+CREATE OR REPLACE FUNCTION hashset_neq(hashset, hashset)
+ RETURNS bool
+ AS 'hashset', 'hashset_neq'
+ LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR <> (
+ LEFTARG = hashset,
+ RIGHTARG = hashset,
+ PROCEDURE = hashset_neq,
+ COMMUTATOR = '<>',
+ NEGATOR = '=',
+ RESTRICT = neqsel,
+ JOIN = neqjoinsel,
+ HASHES
+);
+
+
+CREATE OR REPLACE FUNCTION hashset_hash(hashset)
+ RETURNS integer
+ AS 'hashset', 'hashset_hash'
+ LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR CLASS hashset_hash_ops
+ DEFAULT FOR TYPE hashset USING hash AS
+ OPERATOR 1 = (hashset, hashset),
+ FUNCTION 1 hashset_hash(hashset);
+
+CREATE OR REPLACE FUNCTION hashset_lt(hashset, hashset)
+ RETURNS bool
+ AS 'hashset', 'hashset_lt'
+ LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_le(hashset, hashset)
+ RETURNS boolean
+ AS 'hashset', 'hashset_le'
+ LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_gt(hashset, hashset)
+ RETURNS boolean
+ AS 'hashset', 'hashset_gt'
+ LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_ge(hashset, hashset)
+ RETURNS boolean
+ AS 'hashset', 'hashset_ge'
+ LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_cmp(hashset, hashset)
+ RETURNS integer
+ AS 'hashset', 'hashset_cmp'
+ LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR < (
+ LEFTARG = hashset,
+ RIGHTARG = hashset,
+ PROCEDURE = hashset_lt,
+ COMMUTATOR = >,
+ NEGATOR = >=,
+ RESTRICT = scalarltsel,
+ JOIN = scalarltjoinsel
+);
+
+CREATE OPERATOR <= (
+ PROCEDURE = hashset_le,
+ LEFTARG = hashset,
+ RIGHTARG = hashset,
+ COMMUTATOR = '>=',
+ NEGATOR = '>',
+ RESTRICT = scalarltsel,
+ JOIN = scalarltjoinsel
+);
+
+CREATE OPERATOR > (
+ PROCEDURE = hashset_gt,
+ LEFTARG = hashset,
+ RIGHTARG = hashset,
+ COMMUTATOR = '<',
+ NEGATOR = '<=',
+ RESTRICT = scalargtsel,
+ JOIN = scalargtjoinsel
+);
+
+CREATE OPERATOR >= (
+ PROCEDURE = hashset_ge,
+ LEFTARG = hashset,
+ RIGHTARG = hashset,
+ COMMUTATOR = '<=',
+ NEGATOR = '<',
+ RESTRICT = scalargtsel,
+ JOIN = scalargtjoinsel
+);
+
+CREATE OPERATOR CLASS hashset_btree_ops
+ DEFAULT FOR TYPE hashset USING btree AS
+ OPERATOR 1 < (hashset, hashset),
+ OPERATOR 2 <= (hashset, hashset),
+ OPERATOR 3 = (hashset, hashset),
+ OPERATOR 4 >= (hashset, hashset),
+ OPERATOR 5 > (hashset, hashset),
+ FUNCTION 1 hashset_cmp(hashset, hashset);
diff --git a/hashset.c b/hashset.c
index 7278e68..ec3ed44 100644
--- a/hashset.c
+++ b/hashset.c
@@ -19,6 +19,7 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "catalog/pg_type.h"
+#include "common/hashfn.h"
PG_MODULE_MAGIC;
@@ -29,7 +30,8 @@ typedef struct hashset_t {
int32 vl_len_; /* varlena header (do not touch directly!) */
int32 flags; /* reserved for future use (versioning, ...) */
int32 maxelements; /* max number of element we have space for */
- int32 nelements; /* number of items added to the hashset */
+ int32 nelements; /* number of items added to the hashset */
+ int32 hashfn_id; /* ID of the hash function used */
char data[FLEXIBLE_ARRAY_MEMBER];
} hashset_t;
@@ -42,6 +44,7 @@ static Datum int32_to_array(FunctionCallInfo fcinfo, int32 * d, int len);
#define PG_GETARG_HASHSET(x) (hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(x))
#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
#define HASHSET_STEP 13
+#define JENKINS_LOOKUP3_HASHFN_ID 1
PG_FUNCTION_INFO_V1(hashset_in);
PG_FUNCTION_INFO_V1(hashset_out);
@@ -56,8 +59,15 @@ PG_FUNCTION_INFO_V1(hashset_agg_add_set);
PG_FUNCTION_INFO_V1(hashset_agg_add);
PG_FUNCTION_INFO_V1(hashset_agg_final);
PG_FUNCTION_INFO_V1(hashset_agg_combine);
-
PG_FUNCTION_INFO_V1(hashset_to_array);
+PG_FUNCTION_INFO_V1(hashset_equals);
+PG_FUNCTION_INFO_V1(hashset_neq);
+PG_FUNCTION_INFO_V1(hashset_hash);
+PG_FUNCTION_INFO_V1(hashset_lt);
+PG_FUNCTION_INFO_V1(hashset_le);
+PG_FUNCTION_INFO_V1(hashset_gt);
+PG_FUNCTION_INFO_V1(hashset_ge);
+PG_FUNCTION_INFO_V1(hashset_cmp);
Datum hashset_in(PG_FUNCTION_ARGS);
Datum hashset_out(PG_FUNCTION_ARGS);
@@ -72,8 +82,15 @@ Datum hashset_agg_add(PG_FUNCTION_ARGS);
Datum hashset_agg_add_set(PG_FUNCTION_ARGS);
Datum hashset_agg_final(PG_FUNCTION_ARGS);
Datum hashset_agg_combine(PG_FUNCTION_ARGS);
-
Datum hashset_to_array(PG_FUNCTION_ARGS);
+Datum hashset_equals(PG_FUNCTION_ARGS);
+Datum hashset_neq(PG_FUNCTION_ARGS);
+Datum hashset_hash(PG_FUNCTION_ARGS);
+Datum hashset_lt(PG_FUNCTION_ARGS);
+Datum hashset_le(PG_FUNCTION_ARGS);
+Datum hashset_gt(PG_FUNCTION_ARGS);
+Datum hashset_ge(PG_FUNCTION_ARGS);
+Datum hashset_cmp(PG_FUNCTION_ARGS);
static hashset_t *
hashset_allocate(int maxelements)
@@ -87,9 +104,8 @@ hashset_allocate(int maxelements)
* i.e. the step size used in hashset_add_element()
* and hashset_contains_element().
*/
- while (maxelements % HASHSET_STEP == 0) {
+ while (maxelements % HASHSET_STEP == 0)
maxelements++;
- }
len = offsetof(hashset_t, data);
len += CEIL_DIV(maxelements, 8);
@@ -103,6 +119,7 @@ hashset_allocate(int maxelements)
set->flags = 0;
set->maxelements = maxelements;
set->nelements = 0;
+ set->hashfn_id = JENKINS_LOOKUP3_HASHFN_ID;
set->flags |= 0;
@@ -220,36 +237,72 @@ hashset_out(PG_FUNCTION_ARGS)
PG_RETURN_CSTRING(str.data);
}
-Datum
-hashset_recv(PG_FUNCTION_ARGS)
-{
- StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
- hashset_t *set;
-
- set = (hashset_t *) palloc0(sizeof(hashset_t));
- set->flags = pq_getmsgint(buf, 4);
- set->maxelements = pq_getmsgint64(buf);
- set->nelements = pq_getmsgint(buf, 4);
-
- PG_RETURN_POINTER(set);
-}
Datum
hashset_send(PG_FUNCTION_ARGS)
{
hashset_t *set = (hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
StringInfoData buf;
+ int32 data_size;
+ /* Begin constructing the message */
pq_begintypsend(&buf);
+ /* Send the non-data fields */
pq_sendint(&buf, set->flags, 4);
- pq_sendint64(&buf, set->maxelements);
+ pq_sendint(&buf, set->maxelements, 4);
pq_sendint(&buf, set->nelements, 4);
+ pq_sendint(&buf, set->hashfn_id, 4);
+
+ /* Compute and send the size of the data field */
+ data_size = VARSIZE(set) - offsetof(hashset_t, data);
+ pq_sendbytes(&buf, set->data, data_size);
PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}
+Datum
+hashset_recv(PG_FUNCTION_ARGS)
+{
+ StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
+ hashset_t *set;
+ int32 data_size;
+ Size total_size;
+ const char *binary_data;
+
+ /* Read fields from buffer */
+ int32 flags = pq_getmsgint(buf, 4);
+ int32 maxelements = pq_getmsgint(buf, 4);
+ int32 nelements = pq_getmsgint(buf, 4);
+ int32 hashfn_id = pq_getmsgint(buf, 4);
+
+ /* Compute the size of the data field */
+ data_size = buf->len - buf->cursor;
+
+ /* Read the binary data */
+ binary_data = pq_getmsgbytes(buf, data_size);
+
+ /* Compute total size of hashset_t */
+ total_size = offsetof(hashset_t, data) + data_size;
+
+ /* Allocate memory for hashset including the data field */
+ set = (hashset_t *) palloc0(total_size);
+
+ /* Set the size of the variable-length data structure */
+ SET_VARSIZE(set, total_size);
+
+ /* Populate the structure */
+ set->flags = flags;
+ set->maxelements = maxelements;
+ set->nelements = nelements;
+ set->hashfn_id = hashfn_id;
+ memcpy(set->data, binary_data, data_size);
+
+ PG_RETURN_POINTER(set);
+}
+
+
Datum
hashset_to_array(PG_FUNCTION_ARGS)
{
@@ -317,10 +370,10 @@ int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len)
static hashset_t *
hashset_resize(hashset_t * set)
{
- int i;
+ int i;
hashset_t *new = hashset_allocate(set->maxelements * 2);
- char *bitmap;
- int32 *values;
+ char *bitmap;
+ int32 *values;
/* Calculate the pointer to the bitmap and values array */
bitmap = set->data;
@@ -350,7 +403,16 @@ hashset_add_element(hashset_t *set, int32 value)
if (set->nelements > set->maxelements * 0.75)
set = hashset_resize(set);
- hash = ((uint32) value * 7691 + 4201) % set->maxelements;
+ if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
+ {
+ hash = hash_bytes_uint32((uint32) value) % set->maxelements;
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid hash function ID: \"%d\"", set->hashfn_id)));
+ }
bitmap = set->data;
values = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
@@ -386,13 +448,23 @@ hashset_add_element(hashset_t *set, int32 value)
static bool
hashset_contains_element(hashset_t *set, int32 value)
{
- int byte;
- int bit;
- uint32 hash;
+ int byte;
+ int bit;
+ uint32 hash;
char *bitmap;
int32 *values;
+ int num_probes = 0; /* Add a counter for the number of probes */
- hash = ((uint32) value * 7691 + 4201) % set->maxelements;
+ if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
+ {
+ hash = hash_bytes_uint32((uint32) value) % set->maxelements;
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid hash function ID: \"%d\"", set->hashfn_id)));
+ }
bitmap = set->data;
values = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
@@ -412,6 +484,12 @@ hashset_contains_element(hashset_t *set, int32 value)
/* move to the next element */
hash = (hash + HASHSET_STEP) % set->maxelements;
+
+ num_probes++; /* Increment the number of probes */
+
+ /* Check if we have probed all slots */
+ if (num_probes >= set->maxelements)
+ return false; /* Avoid infinite loop */
}
}
@@ -692,3 +770,248 @@ hashset_agg_combine(PG_FUNCTION_ARGS)
PG_RETURN_POINTER(dst);
}
+
+
+Datum
+hashset_equals(PG_FUNCTION_ARGS)
+{
+ hashset_t *a = PG_GETARG_HASHSET(0);
+ hashset_t *b = PG_GETARG_HASHSET(1);
+
+ char *bitmap_a;
+ int32 *values_a;
+ int i;
+
+ /*
+ * Check if the number of elements is the same
+ */
+ if (a->nelements != b->nelements)
+ PG_RETURN_BOOL(false);
+
+ bitmap_a = a->data;
+ values_a = (int32 *)(a->data + CEIL_DIV(a->maxelements, 8));
+
+ /*
+ * Check if every element in a is also in b
+ */
+ for (i = 0; i < a->maxelements; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap_a[byte] & (0x01 << bit))
+ {
+ int32 value = values_a[i];
+
+ if (!hashset_contains_element(b, value))
+ PG_RETURN_BOOL(false);
+ }
+ }
+
+ /*
+ * All elements in a are in b and the number of elements is the same,
+ * so the sets must be equal.
+ */
+ PG_RETURN_BOOL(true);
+}
+
+
+Datum
+hashset_neq(PG_FUNCTION_ARGS)
+{
+ hashset_t *a = PG_GETARG_HASHSET(0);
+ hashset_t *b = PG_GETARG_HASHSET(1);
+
+ /* If a is not equal to b, then they are not equal */
+ if (!DatumGetBool(DirectFunctionCall2(hashset_equals, PointerGetDatum(a), PointerGetDatum(b))))
+ PG_RETURN_BOOL(true);
+
+ PG_RETURN_BOOL(false);
+}
+
+
+Datum hashset_hash(PG_FUNCTION_ARGS)
+{
+ hashset_t *set = PG_GETARG_HASHSET(0);
+
+ /* Initial hash value */
+ uint32 hash = 0;
+
+ /* Access the data array */
+ char *bitmap = set->data;
+ int32 *values = (int32 *)(set->data + CEIL_DIV(set->maxelements, 8));
+
+ /* Iterate through all elements */
+ for (int32 i = 0; i < set->maxelements; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ /* Check if the current position is occupied */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Combine the hash value of the current element with the total hash */
+ hash = hash_combine(hash, hash_uint32(values[i]));
+ }
+ }
+
+ /* Return the final hash value */
+ PG_RETURN_INT32(hash);
+}
+
+
+Datum
+hashset_lt(PG_FUNCTION_ARGS)
+{
+ hashset_t *a = PG_GETARG_HASHSET(0);
+ hashset_t *b = PG_GETARG_HASHSET(1);
+
+ char *bitmap_a, *bitmap_b;
+ int32 *values_a, *values_b;
+ int i;
+
+ bitmap_a = a->data;
+ values_a = (int32 *)(a->data + CEIL_DIV(a->maxelements, 8));
+
+ bitmap_b = b->data;
+ values_b = (int32 *)(b->data + CEIL_DIV(b->maxelements, 8));
+
+ /* Compare elements in a lexicographic manner */
+ for (i = 0; i < Min(a->maxelements, b->maxelements); i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ bool has_elem_a = bitmap_a[byte] & (0x01 << bit);
+ bool has_elem_b = bitmap_b[byte] & (0x01 << bit);
+
+ if (has_elem_a && has_elem_b)
+ {
+ int32 value_a = values_a[i];
+ int32 value_b = values_b[i];
+
+ if (value_a < value_b)
+ PG_RETURN_BOOL(true);
+ else if (value_a > value_b)
+ PG_RETURN_BOOL(false);
+
+ }
+ else if (has_elem_a)
+ PG_RETURN_BOOL(false);
+ else if (has_elem_b)
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * If all elements are equal up to the shorter hashset length,
+ * then the hashset with fewer elements is considered "less than"
+ */
+ if (a->maxelements < b->maxelements)
+ PG_RETURN_BOOL(true);
+ else
+ PG_RETURN_BOOL(false);
+}
+
+
+Datum
+hashset_le(PG_FUNCTION_ARGS)
+{
+ hashset_t *a = PG_GETARG_HASHSET(0);
+ hashset_t *b = PG_GETARG_HASHSET(1);
+
+ /* If a equals b, or a is less than b, then a is less than or equal to b */
+ if (DatumGetBool(DirectFunctionCall2(hashset_equals, PointerGetDatum(a), PointerGetDatum(b))) ||
+ DatumGetBool(DirectFunctionCall2(hashset_lt, PointerGetDatum(a), PointerGetDatum(b))))
+ PG_RETURN_BOOL(true);
+
+ PG_RETURN_BOOL(false);
+}
+
+
+Datum
+hashset_gt(PG_FUNCTION_ARGS)
+{
+ hashset_t *a = PG_GETARG_HASHSET(0);
+ hashset_t *b = PG_GETARG_HASHSET(1);
+
+ /* If a is not less than or equal to b, then a is greater than b */
+ if (!DatumGetBool(DirectFunctionCall2(hashset_le, PointerGetDatum(a), PointerGetDatum(b))))
+ PG_RETURN_BOOL(true);
+
+ PG_RETURN_BOOL(false);
+}
+
+
+Datum
+hashset_ge(PG_FUNCTION_ARGS)
+{
+ hashset_t *a = PG_GETARG_HASHSET(0);
+ hashset_t *b = PG_GETARG_HASHSET(1);
+
+ /* If a equals b, or a is not less than b, then a is greater than or equal to b */
+ if (DatumGetBool(DirectFunctionCall2(hashset_equals, PointerGetDatum(a), PointerGetDatum(b))) ||
+ !DatumGetBool(DirectFunctionCall2(hashset_lt, PointerGetDatum(a), PointerGetDatum(b))))
+ PG_RETURN_BOOL(true);
+
+ PG_RETURN_BOOL(false);
+}
+
+
+Datum
+hashset_cmp(PG_FUNCTION_ARGS)
+{
+ hashset_t *a = PG_GETARG_HASHSET(0);
+ hashset_t *b = PG_GETARG_HASHSET(1);
+
+ char *bitmap_a, *bitmap_b;
+ int32 *values_a, *values_b;
+ int i;
+
+ bitmap_a = a->data;
+ values_a = (int32 *)(a->data + CEIL_DIV(a->maxelements, 8));
+
+ bitmap_b = b->data;
+ values_b = (int32 *)(b->data + CEIL_DIV(b->maxelements, 8));
+
+ /*
+ * Iterate through the elements
+ */
+ for (i = 0; i < Min(a->maxelements, b->maxelements); i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ bool a_contains = bitmap_a[byte] & (0x01 << bit);
+ bool b_contains = bitmap_b[byte] & (0x01 << bit);
+
+ if (a_contains && b_contains)
+ {
+ int32 value_a = values_a[i];
+ int32 value_b = values_b[i];
+
+ if (value_a < value_b)
+ PG_RETURN_INT32(-1);
+ else if (value_a > value_b)
+ PG_RETURN_INT32(1);
+ }
+ else if (a_contains)
+ {
+ PG_RETURN_INT32(1);
+ }
+ else if (b_contains)
+ {
+ PG_RETURN_INT32(-1);
+ }
+ }
+
+ /*
+ * If we got here, the elements in the overlap are equal.
+ * We need to check the number of elements to determine the order.
+ */
+ if (a->nelements < b->nelements)
+ PG_RETURN_INT32(-1);
+ else if (a->nelements > b->nelements)
+ PG_RETURN_INT32(1);
+ else
+ PG_RETURN_INT32(0);
+}
diff --git a/test/c_tests/test_send_recv.c b/test/c_tests/test_send_recv.c
new file mode 100644
index 0000000..5655b8b
--- /dev/null
+++ b/test/c_tests/test_send_recv.c
@@ -0,0 +1,84 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <libpq-fe.h>
+
+void exit_nicely(PGconn *conn) {
+ PQfinish(conn);
+ exit(1);
+}
+
+int main() {
+ /* Connect to database specified by the PGDATABASE environment variable */
+ const char *conninfo = "host=localhost port=5432";
+ PGconn *conn = PQconnectdb(conninfo);
+ if (PQstatus(conn) != CONNECTION_OK) {
+ fprintf(stderr, "Connection to database failed: %s", PQerrorMessage(conn));
+ exit_nicely(conn);
+ }
+
+ /* Create extension */
+ PQexec(conn, "CREATE EXTENSION IF NOT EXISTS hashset");
+
+ /* Create temporary table */
+ PQexec(conn, "CREATE TABLE IF NOT EXISTS test_hashset_send_recv (hashset_col hashset)");
+
+ /* Enable binary output */
+ PQexec(conn, "SET bytea_output = 'escape'");
+
+ /* Insert dummy data */
+ const char *insert_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ('{1,2,3}'::hashset)";
+ PGresult *res = PQexec(conn, insert_command);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK) {
+ fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+ PQclear(res);
+
+ /* Fetch the data in binary format */
+ const char *select_command = "SELECT hashset_col FROM test_hashset_send_recv";
+ int resultFormat = 1; /* 0 = text, 1 = binary */
+ res = PQexecParams(conn, select_command, 0, NULL, NULL, NULL, NULL, resultFormat);
+ if (PQresultStatus(res) != PGRES_TUPLES_OK) {
+ fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+
+ /* Store binary data for later use */
+ const char *binary_data = PQgetvalue(res, 0, 0);
+ int binary_data_length = PQgetlength(res, 0, 0);
+ PQclear(res);
+
+ /* Re-insert the binary data */
+ const char *insert_binary_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ($1)";
+ const char *paramValues[1] = {binary_data};
+ int paramLengths[1] = {binary_data_length};
+ int paramFormats[1] = {1}; /* binary format */
+ res = PQexecParams(conn, insert_binary_command, 1, NULL, paramValues, paramLengths, paramFormats, 0);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK) {
+ fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+ PQclear(res);
+
+ /* Check the data */
+ const char *check_command = "SELECT COUNT(DISTINCT hashset_col::text) AS unique_count, COUNT(*) FROM test_hashset_send_recv";
+ res = PQexec(conn, check_command);
+ if (PQresultStatus(res) != PGRES_TUPLES_OK) {
+ fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+
+ /* Print the results */
+ printf("unique_count: %s\n", PQgetvalue(res, 0, 0));
+ printf("count: %s\n", PQgetvalue(res, 0, 1));
+ PQclear(res);
+
+ /* Disconnect */
+ PQfinish(conn);
+
+ return 0;
+}
diff --git a/test/c_tests/test_send_recv.sh b/test/c_tests/test_send_recv.sh
new file mode 100755
index 0000000..ab308b3
--- /dev/null
+++ b/test/c_tests/test_send_recv.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+# Get the directory of this script
+SCRIPT_DIR="$(dirname "$(realpath "$0")")"
+
+# Set up database
+export PGDATABASE=test_hashset_send_recv
+dropdb --if-exists "$PGDATABASE"
+createdb
+
+# Define directories
+EXPECTED_DIR="$SCRIPT_DIR/../expected"
+RESULTS_DIR="$SCRIPT_DIR/../results"
+
+# Create the results directory if it doesn't exist
+mkdir -p "$RESULTS_DIR"
+
+# Run the C test and save its output to the results directory
+"$SCRIPT_DIR/test_send_recv" > "$RESULTS_DIR/test_send_recv.out"
+
+printf "test test_send_recv ... "
+
+# Compare the actual output with the expected output
+if diff -q "$RESULTS_DIR/test_send_recv.out" "$EXPECTED_DIR/test_send_recv.out" > /dev/null 2>&1; then
+ echo "ok"
+ # Clean up by removing the results directory if the test passed
+ rm -r "$RESULTS_DIR"
+else
+ echo "failed"
+ git diff --no-index --color "$EXPECTED_DIR/test_send_recv.out" "$RESULTS_DIR/test_send_recv.out"
+fi
diff --git a/test/expected/order.out b/test/expected/order.out
new file mode 100644
index 0000000..8d3ff61
--- /dev/null
+++ b/test/expected/order.out
@@ -0,0 +1,118 @@
+CREATE TABLE IF NOT EXISTS test_hashset_order (hashset_col hashset);
+INSERT INTO test_hashset_order (hashset_col) VALUES ('{1,2,3}'::hashset);
+INSERT INTO test_hashset_order (hashset_col) VALUES ('{3,2,1}'::hashset);
+INSERT INTO test_hashset_order (hashset_col) VALUES ('{4,5,6}'::hashset);
+SELECT COUNT(DISTINCT hashset_col) FROM test_hashset_order;
+ count
+-------
+ 2
+(1 row)
+
+SELECT '{2}'::hashset < '{1}'::hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::hashset < '{2}'::hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::hashset < '{3}'::hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::hashset <= '{1}'::hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::hashset <= '{2}'::hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::hashset <= '{3}'::hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::hashset > '{1}'::hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::hashset > '{2}'::hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::hashset > '{3}'::hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::hashset >= '{1}'::hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::hashset >= '{2}'::hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::hashset >= '{3}'::hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::hashset = '{1}'::hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::hashset = '{2}'::hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::hashset = '{3}'::hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::hashset <> '{1}'::hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::hashset <> '{2}'::hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::hashset <> '{3}'::hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
diff --git a/test/expected/test_send_recv.out b/test/expected/test_send_recv.out
new file mode 100644
index 0000000..12382d5
--- /dev/null
+++ b/test/expected/test_send_recv.out
@@ -0,0 +1,2 @@
+unique_count: 1
+count: 2
diff --git a/test/sql/order.sql b/test/sql/order.sql
new file mode 100644
index 0000000..e6c9323
--- /dev/null
+++ b/test/sql/order.sql
@@ -0,0 +1,29 @@
+CREATE TABLE IF NOT EXISTS test_hashset_order (hashset_col hashset);
+INSERT INTO test_hashset_order (hashset_col) VALUES ('{1,2,3}'::hashset);
+INSERT INTO test_hashset_order (hashset_col) VALUES ('{3,2,1}'::hashset);
+INSERT INTO test_hashset_order (hashset_col) VALUES ('{4,5,6}'::hashset);
+SELECT COUNT(DISTINCT hashset_col) FROM test_hashset_order;
+
+SELECT '{2}'::hashset < '{1}'::hashset; -- false
+SELECT '{2}'::hashset < '{2}'::hashset; -- false
+SELECT '{2}'::hashset < '{3}'::hashset; -- true
+
+SELECT '{2}'::hashset <= '{1}'::hashset; -- false
+SELECT '{2}'::hashset <= '{2}'::hashset; -- true
+SELECT '{2}'::hashset <= '{3}'::hashset; -- true
+
+SELECT '{2}'::hashset > '{1}'::hashset; -- true
+SELECT '{2}'::hashset > '{2}'::hashset; -- false
+SELECT '{2}'::hashset > '{3}'::hashset; -- false
+
+SELECT '{2}'::hashset >= '{1}'::hashset; -- true
+SELECT '{2}'::hashset >= '{2}'::hashset; -- true
+SELECT '{2}'::hashset >= '{3}'::hashset; -- false
+
+SELECT '{2}'::hashset = '{1}'::hashset; -- false
+SELECT '{2}'::hashset = '{2}'::hashset; -- true
+SELECT '{2}'::hashset = '{3}'::hashset; -- false
+
+SELECT '{2}'::hashset <> '{1}'::hashset; -- true
+SELECT '{2}'::hashset <> '{2}'::hashset; -- false
+SELECT '{2}'::hashset <> '{3}'::hashset; -- true
On Tue, Jun 13, 2023, at 20:50, Joel Jacobson wrote:
hashset is now using hash_bytes_uint32() from hashfn.h
I spotted a problem in the ordering logic of the comparison functions.
The issue was with handling hashsets containing empty positions,
causing non-lexicographic ordering.
The updated implementation now correctly iterates over the hashsets,
skipping any empty positions, which results in proper comparison
and ordering of elements present in the hashset.
New patch attached.
Attachments:
hashset-4e60615.patchapplication/octet-stream; name=hashset-4e60615.patchDownload
commit 4e606150f4e5cb105b83c2b6b2e77d8e55a35a2c
Author: Joel Jakobsson <joel@compiler.org>
Date: Tue Jun 13 22:13:06 2023 +0200
Refactor hashset comparison functions and improve ordering logic
This commit introduces a significant refactor of the comparison functions
for the hashset type, specifically hashset_cmp(), hashset_lt(), hashset_le(),
hashset_gt(), and hashset_ge().
We addressed an issue with the previous implementation where the comparison
functions did not correctly handle hashsets with empty positions. This
resulted in incorrect ordering of hashsets when using these comparison functions.
Now, the comparison functions correctly iterate over the elements in the
hashsets, advancing the iterator for either hashset only when a valid element
is found. This effectively skips over any empty positions, resulting in the
comparison and ordering of elements that are actually present in the hashset.
Additionally, the four functions hashset_lt(), hashset_le(), hashset_gt(),
and hashset_ge() have been simplified by using the hashset_cmp() function,
which reduces redundancy in the codebase.
Tests have been updated to reflect these changes and ensure the correct
functionality of the revised comparison functions.
diff --git a/hashset.c b/hashset.c
index ec3ed44..2e6a51f 100644
--- a/hashset.c
+++ b/hashset.c
@@ -863,53 +863,15 @@ Datum hashset_hash(PG_FUNCTION_ARGS)
Datum
hashset_lt(PG_FUNCTION_ARGS)
{
- hashset_t *a = PG_GETARG_HASHSET(0);
- hashset_t *b = PG_GETARG_HASHSET(1);
+ hashset_t *a = PG_GETARG_HASHSET(0);
+ hashset_t *b = PG_GETARG_HASHSET(1);
+ int32 cmp;
- char *bitmap_a, *bitmap_b;
- int32 *values_a, *values_b;
- int i;
+ cmp = DatumGetInt32(DirectFunctionCall2(hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
- bitmap_a = a->data;
- values_a = (int32 *)(a->data + CEIL_DIV(a->maxelements, 8));
-
- bitmap_b = b->data;
- values_b = (int32 *)(b->data + CEIL_DIV(b->maxelements, 8));
-
- /* Compare elements in a lexicographic manner */
- for (i = 0; i < Min(a->maxelements, b->maxelements); i++)
- {
- int byte = (i / 8);
- int bit = (i % 8);
-
- bool has_elem_a = bitmap_a[byte] & (0x01 << bit);
- bool has_elem_b = bitmap_b[byte] & (0x01 << bit);
-
- if (has_elem_a && has_elem_b)
- {
- int32 value_a = values_a[i];
- int32 value_b = values_b[i];
-
- if (value_a < value_b)
- PG_RETURN_BOOL(true);
- else if (value_a > value_b)
- PG_RETURN_BOOL(false);
-
- }
- else if (has_elem_a)
- PG_RETURN_BOOL(false);
- else if (has_elem_b)
- PG_RETURN_BOOL(true);
- }
-
- /*
- * If all elements are equal up to the shorter hashset length,
- * then the hashset with fewer elements is considered "less than"
- */
- if (a->maxelements < b->maxelements)
- PG_RETURN_BOOL(true);
- else
- PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(cmp < 0);
}
@@ -918,13 +880,13 @@ hashset_le(PG_FUNCTION_ARGS)
{
hashset_t *a = PG_GETARG_HASHSET(0);
hashset_t *b = PG_GETARG_HASHSET(1);
+ int32 cmp;
- /* If a equals b, or a is less than b, then a is less than or equal to b */
- if (DatumGetBool(DirectFunctionCall2(hashset_equals, PointerGetDatum(a), PointerGetDatum(b))) ||
- DatumGetBool(DirectFunctionCall2(hashset_lt, PointerGetDatum(a), PointerGetDatum(b))))
- PG_RETURN_BOOL(true);
+ cmp = DatumGetInt32(DirectFunctionCall2(hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
- PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(cmp <= 0);
}
@@ -933,12 +895,13 @@ hashset_gt(PG_FUNCTION_ARGS)
{
hashset_t *a = PG_GETARG_HASHSET(0);
hashset_t *b = PG_GETARG_HASHSET(1);
+ int32 cmp;
- /* If a is not less than or equal to b, then a is greater than b */
- if (!DatumGetBool(DirectFunctionCall2(hashset_le, PointerGetDatum(a), PointerGetDatum(b))))
- PG_RETURN_BOOL(true);
+ cmp = DatumGetInt32(DirectFunctionCall2(hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
- PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(cmp > 0);
}
@@ -947,13 +910,13 @@ hashset_ge(PG_FUNCTION_ARGS)
{
hashset_t *a = PG_GETARG_HASHSET(0);
hashset_t *b = PG_GETARG_HASHSET(1);
+ int32 cmp;
- /* If a equals b, or a is not less than b, then a is greater than or equal to b */
- if (DatumGetBool(DirectFunctionCall2(hashset_equals, PointerGetDatum(a), PointerGetDatum(b))) ||
- !DatumGetBool(DirectFunctionCall2(hashset_lt, PointerGetDatum(a), PointerGetDatum(b))))
- PG_RETURN_BOOL(true);
+ cmp = DatumGetInt32(DirectFunctionCall2(hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
- PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(cmp >= 0);
}
@@ -965,7 +928,7 @@ hashset_cmp(PG_FUNCTION_ARGS)
char *bitmap_a, *bitmap_b;
int32 *values_a, *values_b;
- int i;
+ int i = 0, j = 0;
bitmap_a = a->data;
values_a = (int32 *)(a->data + CEIL_DIV(a->maxelements, 8));
@@ -973,45 +936,52 @@ hashset_cmp(PG_FUNCTION_ARGS)
bitmap_b = b->data;
values_b = (int32 *)(b->data + CEIL_DIV(b->maxelements, 8));
- /*
- * Iterate through the elements
- */
- for (i = 0; i < Min(a->maxelements, b->maxelements); i++)
+ /* Iterate over the elements in each hashset independently */
+ while(i < a->maxelements && j < b->maxelements)
{
- int byte = (i / 8);
- int bit = (i % 8);
+ int byte_a = (i / 8);
+ int bit_a = (i % 8);
- bool a_contains = bitmap_a[byte] & (0x01 << bit);
- bool b_contains = bitmap_b[byte] & (0x01 << bit);
+ int byte_b = (j / 8);
+ int bit_b = (j % 8);
- if (a_contains && b_contains)
- {
- int32 value_a = values_a[i];
- int32 value_b = values_b[i];
+ bool has_elem_a = bitmap_a[byte_a] & (0x01 << bit_a);
+ bool has_elem_b = bitmap_b[byte_b] & (0x01 << bit_b);
- if (value_a < value_b)
- PG_RETURN_INT32(-1);
- else if (value_a > value_b)
- PG_RETURN_INT32(1);
- }
- else if (a_contains)
+ int32 value_a;
+ int32 value_b;
+
+ /* Skip if position is empty in either bitmap */
+ if (!has_elem_a)
{
- PG_RETURN_INT32(1);
+ i++;
+ continue;
}
- else if (b_contains)
+
+ if (!has_elem_b)
{
+ j++;
+ continue;
+ }
+
+ /* Both hashsets have an element at the current position */
+ value_a = values_a[i++];
+ value_b = values_b[j++];
+
+ if (value_a < value_b)
PG_RETURN_INT32(-1);
- }
+ else if (value_a > value_b)
+ PG_RETURN_INT32(1);
}
/*
- * If we got here, the elements in the overlap are equal.
- * We need to check the number of elements to determine the order.
+ * If all compared elements are equal,
+ * then compare the remaining elements in the larger hashset
*/
- if (a->nelements < b->nelements)
- PG_RETURN_INT32(-1);
- else if (a->nelements > b->nelements)
+ if (i < a->maxelements)
PG_RETURN_INT32(1);
+ else if (j < b->maxelements)
+ PG_RETURN_INT32(-1);
else
PG_RETURN_INT32(0);
}
diff --git a/test/expected/order.out b/test/expected/order.out
index 8d3ff61..f8f8d5b 100644
--- a/test/expected/order.out
+++ b/test/expected/order.out
@@ -116,3 +116,59 @@ SELECT '{2}'::hashset <> '{3}'::hashset; -- true
t
(1 row)
+CREATE OR REPLACE FUNCTION generate_random_hashset(num_elements INT)
+RETURNS hashset AS $$
+DECLARE
+ element INT;
+ random_set hashset;
+BEGIN
+ random_set := hashset_init(num_elements);
+
+ FOR i IN 1..num_elements LOOP
+ element := floor(random() * 1000)::INT;
+ random_set := hashset_add(random_set, element);
+ END LOOP;
+
+ RETURN random_set;
+END;
+$$ LANGUAGE plpgsql;
+SELECT setseed(0.123465);
+ setseed
+---------
+
+(1 row)
+
+CREATE TABLE hashset_order_test AS
+SELECT generate_random_hashset(3) AS hashset_col
+FROM generate_series(1,1000)
+UNION
+SELECT generate_random_hashset(2)
+FROM generate_series(1,1000);
+SELECT hashset_col
+FROM hashset_order_test
+ORDER BY hashset_col
+LIMIT 20;
+ hashset_col
+-------------
+ {2,857}
+ {3,85,507}
+ {3,569,891}
+ {3,867,610}
+ {5,207,283}
+ {5,283,972}
+ {5,550,991}
+ {5,606,148}
+ {5,734}
+ {5,862}
+ {5,872}
+ {6,431}
+ {6,444,929}
+ {6,521}
+ {6,592}
+ {7,878,229}
+ {8,14,859}
+ {8,605}
+ {8,654}
+ {8,698}
+(20 rows)
+
diff --git a/test/sql/order.sql b/test/sql/order.sql
index e6c9323..1780c0b 100644
--- a/test/sql/order.sql
+++ b/test/sql/order.sql
@@ -27,3 +27,34 @@ SELECT '{2}'::hashset = '{3}'::hashset; -- false
SELECT '{2}'::hashset <> '{1}'::hashset; -- true
SELECT '{2}'::hashset <> '{2}'::hashset; -- false
SELECT '{2}'::hashset <> '{3}'::hashset; -- true
+
+CREATE OR REPLACE FUNCTION generate_random_hashset(num_elements INT)
+RETURNS hashset AS $$
+DECLARE
+ element INT;
+ random_set hashset;
+BEGIN
+ random_set := hashset_init(num_elements);
+
+ FOR i IN 1..num_elements LOOP
+ element := floor(random() * 1000)::INT;
+ random_set := hashset_add(random_set, element);
+ END LOOP;
+
+ RETURN random_set;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT setseed(0.123465);
+
+CREATE TABLE hashset_order_test AS
+SELECT generate_random_hashset(3) AS hashset_col
+FROM generate_series(1,1000)
+UNION
+SELECT generate_random_hashset(2)
+FROM generate_series(1,1000);
+
+SELECT hashset_col
+FROM hashset_order_test
+ORDER BY hashset_col
+LIMIT 20;
On Mon, 12 Jun 2023 at 22:37, Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
Perhaps. So you're proposing to have this as a regular built-in type?
It's hard for me to judge how popular this feature would be, but I guess
people often use arrays while they actually want set semantics ...
Perspective from a potential user: I'm currently working on something where
an array-like structure with fast membership test performance would be very
useful. The main type of query is doing an =ANY(the set) filter, where the
set could contain anywhere from very few to thousands of entries (ints in
our case). So we'd want the same index usage as =ANY(array) but would like
faster row checking than we get with an array when other indexes are used.
Our app runs connecting to either an embedded postgres database that we
control or an external one controlled by customers - this is typically RDS
or some other cloud vendor's DB. Having such a type as a separate extension
would make it unusable for us until all relevant cloud vendors decided that
it was popular enough to include - something that may never happen, or even
if it did, now any time soon.
Cheers
Tom
On Wed, Jun 14, 2023, at 06:31, Tom Dunstan wrote:
On Mon, 12 Jun 2023 at 22:37, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:Perhaps. So you're proposing to have this as a regular built-in type?
It's hard for me to judge how popular this feature would be, but I guess
people often use arrays while they actually want set semantics ...Perspective from a potential user: I'm currently working on something
where an array-like structure with fast membership test performance
would be very useful. The main type of query is doing an =ANY(the set)
filter, where the set could contain anywhere from very few to thousands
of entries (ints in our case). So we'd want the same index usage as
=ANY(array) but would like faster row checking than we get with an
array when other indexes are used.
Thanks for providing an interesting use-case.
If you would like to help, one thing that would be helpful,
would be a complete runnable sql script,
that demonstrates exactly the various array-based queries
you currently use, with random data that resembles
reality as closely as possible, i.e. the same number of rows
in the tables, and similar distribution of values etc.
This would be helpful in terms of documentation,
as I think it would be good to provide Usage examples
that are based on real-life scenarios.
It would also be helpful to create realistic benchmarks when
evaluating and optimising the performance.
Our app runs connecting to either an embedded postgres database that we
control or an external one controlled by customers - this is typically
RDS or some other cloud vendor's DB. Having such a type as a separate
extension would make it unusable for us until all relevant cloud
vendors decided that it was popular enough to include - something that
may never happen, or even if it did, now any time soon.
Good point.
On 6/14/23 06:31, Tom Dunstan wrote:
On Mon, 12 Jun 2023 at 22:37, Tomas Vondra
<tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
wrote:Perhaps. So you're proposing to have this as a regular built-in type?
It's hard for me to judge how popular this feature would be, but I guess
people often use arrays while they actually want set semantics ...Perspective from a potential user: I'm currently working on something
where an array-like structure with fast membership test performance
would be very useful. The main type of query is doing an =ANY(the set)
filter, where the set could contain anywhere from very few to thousands
of entries (ints in our case). So we'd want the same index usage as
=ANY(array) but would like faster row checking than we get with an array
when other indexes are used.
We kinda already do this since PG14 (commit 50e17ad281), actually. If
the list is long enough (9 values or more), we'll build a hash table
during query execution. So pretty much exactly what you're asking for.
Our app runs connecting to either an embedded postgres database that we
control or an external one controlled by customers - this is typically
RDS or some other cloud vendor's DB. Having such a type as a separate
extension would make it unusable for us until all relevant cloud vendors
decided that it was popular enough to include - something that may never
happen, or even if it did, now any time soon.
Understood, but that's really a problem / choice of the cloud vendors.
The thing is, adding stuff to core is not free - it means the community
becomes responsible for maintenance, testing, fixing issues, etc. It's
not feasible (or desirable) to have all extensions in core, and cloud
vendors generally do have ways to support some pre-vetted extensions
that they deem useful enough. Granted, it means vetting/maintenance for
them, but that's kinda the point of managed services. And it'd not be
free for us either.
Anyway, that's mostly irrelevant, as PG14 already does the hash table
for this kind of queries. And I'm not strictly against adding some of
this into core, if it ends up being useful enough.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 14, 2023, at 11:44, Tomas Vondra wrote:
Perspective from a potential user: I'm currently working on something
where an array-like structure with fast membership test performance
would be very useful. The main type of query is doing an =ANY(the set)
filter, where the set could contain anywhere from very few to thousands
of entries (ints in our case). So we'd want the same index usage as
=ANY(array) but would like faster row checking than we get with an array
when other indexes are used.We kinda already do this since PG14 (commit 50e17ad281), actually. If
the list is long enough (9 values or more), we'll build a hash table
during query execution. So pretty much exactly what you're asking for.
Would it be feasible to teach the planner to utilize the internal hash table of
hashset directly? In the case of arrays, the hash table construction is an
ad hoc operation, whereas with hashset, the hash table already exists, which
could potentially lead to a faster execution.
Essentially, the aim would be to support:
=ANY(hashset)
Instead of the current:
=ANY(hashset_to_array(hashset))
Thoughts?
/Joel
On 6/14/23 14:57, Joel Jacobson wrote:
On Wed, Jun 14, 2023, at 11:44, Tomas Vondra wrote:
Perspective from a potential user: I'm currently working on something
where an array-like structure with fast membership test performance
would be very useful. The main type of query is doing an =ANY(the set)
filter, where the set could contain anywhere from very few to thousands
of entries (ints in our case). So we'd want the same index usage as
=ANY(array) but would like faster row checking than we get with an array
when other indexes are used.We kinda already do this since PG14 (commit 50e17ad281), actually. If
the list is long enough (9 values or more), we'll build a hash table
during query execution. So pretty much exactly what you're asking for.Would it be feasible to teach the planner to utilize the internal hash table of
hashset directly? In the case of arrays, the hash table construction is an
ad hoc operation, whereas with hashset, the hash table already exists, which
could potentially lead to a faster execution.Essentially, the aim would be to support:
=ANY(hashset)
Instead of the current:
=ANY(hashset_to_array(hashset))
Thoughts?
That should be possible, but probably only when hashset is a built-in
data type (maybe polymorphic).
I don't know if it'd be worth it, the general idea is that building the
hash table is way cheaper than repeated lookups in an array. Yeah, it
might save something, but likely only a tiny fraction of the runtime.
It's definitely something I'd leave out of v0, personally.
=ANY(set) should probably work with an implicit ARRAY cast, I believe.
It'll do the ad hoc build, ofc.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2023-06-14 We 05:44, Tomas Vondra wrote:
The thing is, adding stuff to core is not free - it means the community
becomes responsible for maintenance, testing, fixing issues, etc. It's
not feasible (or desirable) to have all extensions in core, and cloud
vendors generally do have ways to support some pre-vetted extensions
that they deem useful enough. Granted, it means vetting/maintenance for
them, but that's kinda the point of managed services. And it'd not be
free for us either.
I agree it's a judgement call. But the code burden here seems pretty
small, far smaller than, say, the SQL/JSON patches. And I think the
range of applications that could benefit is quite significant.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On Wed, Jun 14, 2023, at 15:16, Tomas Vondra wrote:
On 6/14/23 14:57, Joel Jacobson wrote:
Would it be feasible to teach the planner to utilize the internal hash table of
hashset directly? In the case of arrays, the hash table construction is an
...
It's definitely something I'd leave out of v0, personally.
OK, thanks for guidance, I'll stay away from it.
I've been doing some preparatory work on this todo item:
3) support for other types (now it only works with int32)
I've renamed the type from "hashset" to "int4hashset",
and the SQL-functions are now prefixed with "int4"
when necessary. The overloaded functions with
int4hashset as input parameters don't need to be prefixed,
e.g. hashset_add(int4hashset, int).
Other changes since last update (4e60615):
* Support creation of empty hashset using '{}'::hashset
* Introduced a new function hashset_capacity() to return the current capacity
of a hashset.
* Refactored hashset initialization:
- Replaced hashset_init(int) with int4hashset() to initialize an empty hashset
with zero capacity.
- Added int4hashset_with_capacity(int) to initialize a hashset with
a specified capacity.
* Improved README.md and testing
As a next step, I'm planning on adding int8 support.
Looks and sounds good?
/Joel
Attachments:
hashset-0.0.1-48e29ebapplication/octet-stream; name=hashset-0.0.1-48e29ebDownload
diff --git a/Makefile b/Makefile
index 928e211..68c29a2 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ SERVER_INCLUDES=-I$(shell pg_config --includedir-server)
CLIENT_INCLUDES=-I$(shell pg_config --includedir)
LIBRARY_PATH = -L$(shell pg_config --libdir)
-REGRESS = prelude basic random table invalid order
+REGRESS = prelude basic io_varying_lengths random table invalid order
REGRESS_OPTS = --inputdir=test
PG_CONFIG = pg_config
diff --git a/README.md b/README.md
index bd386ad..3ff5757 100644
--- a/README.md
+++ b/README.md
@@ -3,30 +3,36 @@
This PostgreSQL extension implements hashset, a data structure (type)
providing a collection of integer items with fast lookup.
+
+## Version
+
+0.0.1
+
🚧 **NOTICE** 🚧 This repository is currently under active development and the hashset
PostgreSQL extension is **not production-ready**. As the codebase is evolving
with possible breaking changes, we are not providing any migration scripts
until we reach our first release.
+
## Usage
-After installing the extension, you can use the `hashset` data type and
+After installing the extension, you can use the `int4hashset` data type and
associated functions within your PostgreSQL queries.
To demonstrate the usage, let's consider a hypothetical table `users` which has
-a `user_id` and a `user_likes` of type `hashset`.
+a `user_id` and a `user_likes` of type `int4hashset`.
Firstly, let's create the table:
```sql
CREATE TABLE users(
user_id int PRIMARY KEY,
- user_likes hashset DEFAULT hashset_init(2)
+ user_likes int4hashset DEFAULT int4hashset()
);
```
-In the above statement, the `hashset_init(2)` initializes a hashset with initial
-capacity for 2 elements. The hashset will automatically resize itself when more
-elements are added beyond this initial capacity.
+In the above statement, the `int4hashset()` initializes an empty hashset
+with zero capacity. The hashset will automatically resize itself when more
+elements are added.
Now, we can perform operations on this table. Here are some examples:
@@ -47,47 +53,52 @@ SELECT hashset_count(user_likes) FROM users WHERE user_id = 1; -- 2
```
You can also use the aggregate functions to perform operations on multiple rows.
-For instance, you can add an integer to a `hashset`.
## Data types
-- **hashset**: This data type represents a set of integers. Internally, it uses
+- **int4hashset**: This data type represents a set of integers. Internally, it uses
a combination of a bitmap and a value array to store the elements in a set. It's
a variable-length type.
## Functions
-The extension provides the following functions:
+- `int4hashset() -> int4hashset`: Initialize an empty int4hashset with no capacity.
+- `int4hashset_with_capacity(int) -> int4hashset`: Initialize an empty int4hashset with given capacity.
+- `hashset_add(int4hashset, int) -> int4hashset`: Adds an integer to an int4hashset.
+- `hashset_contains(int4hashset, int) -> boolean`: Checks if an int4hashset contains a given integer.
+- `hashset_merge(int4hashset, int4hashset) -> int4hashset`: Merges two int4hashsets into a new int4hashset.
+- `hashset_to_array(int4hashset) -> int[]`: Converts an int4hashset to an array of integers.
+- `hashset_count(int4hashset) -> bigint`: Returns the number of elements in an int4hashset.
+- `hashset_capacity(int4hashset) -> bigint`: Returns the current capacity of an int4hashset.
-### hashset_add(hashset, int) -> hashset
-Adds an integer to a `hashset`.
+## Aggregation Functions
-### hashset_contains(hashset, int) -> boolean
-Checks if an integer is contained in a `hashset`.
+- `hashset(int) -> int4hashset`: Aggregate integers into a hashset.
+- `hashset(int4hashset) -> int4hashset`: Aggregate hashsets into a hashset.
-### hashset_count(hashset) -> bigint
-Returns the number of elements in a `hashset`.
-### hashset_merge(hashset, hashset) -> hashset
-Merges two `hashset`s into a single `hashset`.
+## Operators
-### hashset_to_array(hashset) -> integer[]
-Converts a `hashset` to an integer array.
+- Equality (`=`): Checks if two hashsets are equal.
+- Inequality (`<>`): Checks if two hashsets are not equal.
-### hashset_init(int) -> hashset
-Initializes an empty `hashset` with a specified initial capacity for maximum
-elements. The argument determines the maximum number of elements the `hashset`
-can hold before it needs to resize.
-## Aggregate Functions
+## Hashset Hash Operators
-### hashset(integer) -> hashset
-Generates a `hashset` from a series of integers, keeping only the unique ones.
+- `hashset_hash(int4hashset) -> integer`: Returns the hash value of an int4hashset.
-### hashset(hashset) -> hashset
-Merges multiple `hashset`s into a single `hashset`, preserving unique elements.
+
+## Hashset Btree Operators
+
+- `<`, `<=`, `>`, `>=`: Comparison operators for hashsets.
+
+
+## Limitations
+
+- The `int4hashset` data type currently supports integers within the range of int4
+(-2147483648 to 2147483647).
## Installation
diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
index 7b79b73..ea559ca 100644
--- a/hashset--0.0.1.sql
+++ b/hashset--0.0.1.sql
@@ -1,135 +1,158 @@
-CREATE TYPE hashset;
+/*
+ * Hashset Type Definition
+ */
-CREATE OR REPLACE FUNCTION hashset_in(cstring)
- RETURNS hashset
- AS 'hashset', 'hashset_in'
+CREATE TYPE int4hashset;
+
+CREATE OR REPLACE FUNCTION int4hashset_in(cstring)
+ RETURNS int4hashset
+ AS 'hashset', 'int4hashset_in'
LANGUAGE C IMMUTABLE STRICT;
-CREATE OR REPLACE FUNCTION hashset_out(hashset)
+CREATE OR REPLACE FUNCTION int4hashset_out(int4hashset)
RETURNS cstring
- AS 'hashset', 'hashset_out'
+ AS 'hashset', 'int4hashset_out'
LANGUAGE C IMMUTABLE STRICT;
-CREATE OR REPLACE FUNCTION hashset_send(hashset)
+CREATE OR REPLACE FUNCTION int4hashset_send(int4hashset)
RETURNS bytea
- AS 'hashset', 'hashset_send'
+ AS 'hashset', 'int4hashset_send'
LANGUAGE C IMMUTABLE STRICT;
-CREATE OR REPLACE FUNCTION hashset_recv(internal)
- RETURNS hashset
- AS 'hashset', 'hashset_recv'
+CREATE OR REPLACE FUNCTION int4hashset_recv(internal)
+ RETURNS int4hashset
+ AS 'hashset', 'int4hashset_recv'
LANGUAGE C IMMUTABLE STRICT;
-CREATE TYPE hashset (
- INPUT = hashset_in,
- OUTPUT = hashset_out,
- RECEIVE = hashset_recv,
- SEND = hashset_send,
+CREATE TYPE int4hashset (
+ INPUT = int4hashset_in,
+ OUTPUT = int4hashset_out,
+ RECEIVE = int4hashset_recv,
+ SEND = int4hashset_send,
INTERNALLENGTH = variable,
STORAGE = external
);
+/*
+ * Hashset Functions
+ */
-CREATE OR REPLACE FUNCTION hashset_add(hashset, int)
- RETURNS hashset
- AS 'hashset', 'hashset_add'
+CREATE OR REPLACE FUNCTION int4hashset()
+ RETURNS int4hashset
+ AS 'hashset', 'int4hashset_init'
LANGUAGE C IMMUTABLE;
-CREATE OR REPLACE FUNCTION hashset_contains(hashset, int)
+CREATE OR REPLACE FUNCTION int4hashset_with_capacity(int)
+ RETURNS int4hashset
+ AS 'hashset', 'int4hashset_init'
+ LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_add(int4hashset, int)
+ RETURNS int4hashset
+ AS 'hashset', 'int4hashset_add'
+ LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int)
RETURNS bool
- AS 'hashset', 'hashset_contains'
+ AS 'hashset', 'int4hashset_contains'
LANGUAGE C IMMUTABLE;
-CREATE OR REPLACE FUNCTION hashset_count(hashset)
- RETURNS bigint
- AS 'hashset', 'hashset_count'
+CREATE OR REPLACE FUNCTION hashset_merge(int4hashset, int4hashset)
+ RETURNS int4hashset
+ AS 'hashset', 'int4hashset_merge'
LANGUAGE C IMMUTABLE;
-CREATE OR REPLACE FUNCTION hashset_merge(hashset, hashset)
- RETURNS hashset
- AS 'hashset', 'hashset_merge'
- LANGUAGE C IMMUTABLE;
-
-CREATE OR REPLACE FUNCTION hashset_to_array(hashset)
+CREATE OR REPLACE FUNCTION hashset_to_array(int4hashset)
RETURNS int[]
- AS 'hashset', 'hashset_to_array'
+ AS 'hashset', 'int4hashset_to_array'
LANGUAGE C IMMUTABLE;
-CREATE OR REPLACE FUNCTION hashset_init(int)
- RETURNS hashset
- AS 'hashset', 'hashset_init'
+CREATE OR REPLACE FUNCTION hashset_count(int4hashset)
+ RETURNS bigint
+ AS 'hashset', 'int4hashset_count'
LANGUAGE C IMMUTABLE;
+CREATE OR REPLACE FUNCTION hashset_capacity(int4hashset)
+ RETURNS bigint
+ AS 'hashset', 'int4hashset_capacity'
+ LANGUAGE C IMMUTABLE;
+
+
+/*
+ * Aggregation Functions
+ */
-CREATE OR REPLACE FUNCTION hashset_agg_add(p_pointer internal, p_value int)
+CREATE OR REPLACE FUNCTION int4hashset_agg_add(p_pointer internal, p_value int)
RETURNS internal
- AS 'hashset', 'hashset_agg_add'
+ AS 'hashset', 'int4hashset_agg_add'
LANGUAGE C IMMUTABLE;
-CREATE OR REPLACE FUNCTION hashset_agg_final(p_pointer internal)
- RETURNS hashset
- AS 'hashset', 'hashset_agg_final'
+CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
+ RETURNS int4hashset
+ AS 'hashset', 'int4hashset_agg_final'
LANGUAGE C IMMUTABLE;
-CREATE OR REPLACE FUNCTION hashset_agg_combine(p_pointer internal, p_pointer2 internal)
+CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
RETURNS internal
- AS 'hashset', 'hashset_agg_combine'
+ AS 'hashset', 'int4hashset_agg_combine'
LANGUAGE C IMMUTABLE;
CREATE AGGREGATE hashset(int) (
- SFUNC = hashset_agg_add,
+ SFUNC = int4hashset_agg_add,
STYPE = internal,
- FINALFUNC = hashset_agg_final,
- COMBINEFUNC = hashset_agg_combine,
+ FINALFUNC = int4hashset_agg_final,
+ COMBINEFUNC = int4hashset_agg_combine,
PARALLEL = SAFE
);
-
-CREATE OR REPLACE FUNCTION hashset_agg_add_set(p_pointer internal, p_value hashset)
+CREATE OR REPLACE FUNCTION int4hashset_agg_add_set(p_pointer internal, p_value int4hashset)
RETURNS internal
- AS 'hashset', 'hashset_agg_add_set'
+ AS 'hashset', 'int4hashset_agg_add_set'
LANGUAGE C IMMUTABLE;
-CREATE OR REPLACE FUNCTION hashset_agg_final(p_pointer internal)
- RETURNS hashset
- AS 'hashset', 'hashset_agg_final'
+CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
+ RETURNS int4hashset
+ AS 'hashset', 'int4hashset_agg_final'
LANGUAGE C IMMUTABLE;
-
-CREATE OR REPLACE FUNCTION hashset_agg_combine(p_pointer internal, p_pointer2 internal)
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
RETURNS internal
- AS 'hashset', 'hashset_agg_combine'
+ AS 'hashset', 'int4hashset_agg_combine'
LANGUAGE C IMMUTABLE;
-CREATE AGGREGATE hashset(hashset) (
- SFUNC = hashset_agg_add_set,
+CREATE AGGREGATE hashset(int4hashset) (
+ SFUNC = int4hashset_agg_add_set,
STYPE = internal,
- FINALFUNC = hashset_agg_final,
- COMBINEFUNC = hashset_agg_combine,
+ FINALFUNC = int4hashset_agg_final,
+ COMBINEFUNC = int4hashset_agg_combine,
PARALLEL = SAFE
);
+/*
+ * Operator Definitions
+ */
-CREATE OR REPLACE FUNCTION hashset_equals(hashset, hashset)
+CREATE OR REPLACE FUNCTION hashset_equals(int4hashset, int4hashset)
RETURNS bool
- AS 'hashset', 'hashset_equals'
+ AS 'hashset', 'int4hashset_equals'
LANGUAGE C IMMUTABLE STRICT;
CREATE OPERATOR = (
- LEFTARG = hashset,
- RIGHTARG = hashset,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
PROCEDURE = hashset_equals,
COMMUTATOR = =,
HASHES
);
-CREATE OR REPLACE FUNCTION hashset_neq(hashset, hashset)
+CREATE OR REPLACE FUNCTION hashset_neq(int4hashset, int4hashset)
RETURNS bool
- AS 'hashset', 'hashset_neq'
+ AS 'hashset', 'int4hashset_neq'
LANGUAGE C IMMUTABLE STRICT;
CREATE OPERATOR <> (
- LEFTARG = hashset,
- RIGHTARG = hashset,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
PROCEDURE = hashset_neq,
COMMUTATOR = '<>',
NEGATOR = '=',
@@ -138,46 +161,53 @@ CREATE OPERATOR <> (
HASHES
);
+/*
+ * Hashset Hash Operators
+ */
-CREATE OR REPLACE FUNCTION hashset_hash(hashset)
+CREATE OR REPLACE FUNCTION hashset_hash(int4hashset)
RETURNS integer
- AS 'hashset', 'hashset_hash'
+ AS 'hashset', 'int4hashset_hash'
LANGUAGE C IMMUTABLE STRICT;
-CREATE OPERATOR CLASS hashset_hash_ops
- DEFAULT FOR TYPE hashset USING hash AS
- OPERATOR 1 = (hashset, hashset),
- FUNCTION 1 hashset_hash(hashset);
+CREATE OPERATOR CLASS int4hashset_hash_ops
+ DEFAULT FOR TYPE int4hashset USING hash AS
+ OPERATOR 1 = (int4hashset, int4hashset),
+ FUNCTION 1 hashset_hash(int4hashset);
-CREATE OR REPLACE FUNCTION hashset_lt(hashset, hashset)
+/*
+ * Hashset Btree Operators
+ */
+
+CREATE OR REPLACE FUNCTION hashset_lt(int4hashset, int4hashset)
RETURNS bool
- AS 'hashset', 'hashset_lt'
+ AS 'hashset', 'int4hashset_lt'
LANGUAGE C IMMUTABLE STRICT;
-CREATE OR REPLACE FUNCTION hashset_le(hashset, hashset)
+CREATE OR REPLACE FUNCTION hashset_le(int4hashset, int4hashset)
RETURNS boolean
- AS 'hashset', 'hashset_le'
+ AS 'hashset', 'int4hashset_le'
LANGUAGE C IMMUTABLE STRICT;
-CREATE OR REPLACE FUNCTION hashset_gt(hashset, hashset)
+CREATE OR REPLACE FUNCTION hashset_gt(int4hashset, int4hashset)
RETURNS boolean
- AS 'hashset', 'hashset_gt'
+ AS 'hashset', 'int4hashset_gt'
LANGUAGE C IMMUTABLE STRICT;
-CREATE OR REPLACE FUNCTION hashset_ge(hashset, hashset)
+CREATE OR REPLACE FUNCTION hashset_ge(int4hashset, int4hashset)
RETURNS boolean
- AS 'hashset', 'hashset_ge'
+ AS 'hashset', 'int4hashset_ge'
LANGUAGE C IMMUTABLE STRICT;
-CREATE OR REPLACE FUNCTION hashset_cmp(hashset, hashset)
+CREATE OR REPLACE FUNCTION hashset_cmp(int4hashset, int4hashset)
RETURNS integer
- AS 'hashset', 'hashset_cmp'
+ AS 'hashset', 'int4hashset_cmp'
LANGUAGE C IMMUTABLE STRICT;
CREATE OPERATOR < (
- LEFTARG = hashset,
- RIGHTARG = hashset,
PROCEDURE = hashset_lt,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
COMMUTATOR = >,
NEGATOR = >=,
RESTRICT = scalarltsel,
@@ -186,8 +216,8 @@ CREATE OPERATOR < (
CREATE OPERATOR <= (
PROCEDURE = hashset_le,
- LEFTARG = hashset,
- RIGHTARG = hashset,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
COMMUTATOR = '>=',
NEGATOR = '>',
RESTRICT = scalarltsel,
@@ -196,8 +226,8 @@ CREATE OPERATOR <= (
CREATE OPERATOR > (
PROCEDURE = hashset_gt,
- LEFTARG = hashset,
- RIGHTARG = hashset,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
COMMUTATOR = '<',
NEGATOR = '<=',
RESTRICT = scalargtsel,
@@ -206,19 +236,19 @@ CREATE OPERATOR > (
CREATE OPERATOR >= (
PROCEDURE = hashset_ge,
- LEFTARG = hashset,
- RIGHTARG = hashset,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
COMMUTATOR = '<=',
NEGATOR = '<',
RESTRICT = scalargtsel,
JOIN = scalargtjoinsel
);
-CREATE OPERATOR CLASS hashset_btree_ops
- DEFAULT FOR TYPE hashset USING btree AS
- OPERATOR 1 < (hashset, hashset),
- OPERATOR 2 <= (hashset, hashset),
- OPERATOR 3 = (hashset, hashset),
- OPERATOR 4 >= (hashset, hashset),
- OPERATOR 5 > (hashset, hashset),
- FUNCTION 1 hashset_cmp(hashset, hashset);
+CREATE OPERATOR CLASS int4hashset_btree_ops
+ DEFAULT FOR TYPE int4hashset USING btree AS
+ OPERATOR 1 < (int4hashset, int4hashset),
+ OPERATOR 2 <= (int4hashset, int4hashset),
+ OPERATOR 3 = (int4hashset, int4hashset),
+ OPERATOR 4 >= (int4hashset, int4hashset),
+ OPERATOR 5 > (int4hashset, int4hashset),
+ FUNCTION 1 hashset_cmp(int4hashset, int4hashset);
diff --git a/hashset.c b/hashset.c
index 2e6a51f..3bc7fdc 100644
--- a/hashset.c
+++ b/hashset.c
@@ -26,78 +26,80 @@ PG_MODULE_MAGIC;
/*
* hashset
*/
-typedef struct hashset_t {
+typedef struct int4hashset_t {
int32 vl_len_; /* varlena header (do not touch directly!) */
int32 flags; /* reserved for future use (versioning, ...) */
int32 maxelements; /* max number of element we have space for */
int32 nelements; /* number of items added to the hashset */
int32 hashfn_id; /* ID of the hash function used */
char data[FLEXIBLE_ARRAY_MEMBER];
-} hashset_t;
+} int4hashset_t;
-static hashset_t *hashset_resize(hashset_t * set);
-static hashset_t *hashset_add_element(hashset_t *set, int32 value);
-static bool hashset_contains_element(hashset_t *set, int32 value);
+static int4hashset_t *int4hashset_resize(int4hashset_t * set);
+static int4hashset_t *int4hashset_add_element(int4hashset_t *set, int32 value);
+static bool int4hashset_contains_element(int4hashset_t *set, int32 value);
static Datum int32_to_array(FunctionCallInfo fcinfo, int32 * d, int len);
-#define PG_GETARG_HASHSET(x) (hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(x))
+#define PG_GETARG_INT4HASHSET(x) (int4hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(x))
#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
#define HASHSET_STEP 13
#define JENKINS_LOOKUP3_HASHFN_ID 1
-PG_FUNCTION_INFO_V1(hashset_in);
-PG_FUNCTION_INFO_V1(hashset_out);
-PG_FUNCTION_INFO_V1(hashset_send);
-PG_FUNCTION_INFO_V1(hashset_recv);
-PG_FUNCTION_INFO_V1(hashset_add);
-PG_FUNCTION_INFO_V1(hashset_contains);
-PG_FUNCTION_INFO_V1(hashset_count);
-PG_FUNCTION_INFO_V1(hashset_merge);
-PG_FUNCTION_INFO_V1(hashset_init);
-PG_FUNCTION_INFO_V1(hashset_agg_add_set);
-PG_FUNCTION_INFO_V1(hashset_agg_add);
-PG_FUNCTION_INFO_V1(hashset_agg_final);
-PG_FUNCTION_INFO_V1(hashset_agg_combine);
-PG_FUNCTION_INFO_V1(hashset_to_array);
-PG_FUNCTION_INFO_V1(hashset_equals);
-PG_FUNCTION_INFO_V1(hashset_neq);
-PG_FUNCTION_INFO_V1(hashset_hash);
-PG_FUNCTION_INFO_V1(hashset_lt);
-PG_FUNCTION_INFO_V1(hashset_le);
-PG_FUNCTION_INFO_V1(hashset_gt);
-PG_FUNCTION_INFO_V1(hashset_ge);
-PG_FUNCTION_INFO_V1(hashset_cmp);
+PG_FUNCTION_INFO_V1(int4hashset_in);
+PG_FUNCTION_INFO_V1(int4hashset_out);
+PG_FUNCTION_INFO_V1(int4hashset_send);
+PG_FUNCTION_INFO_V1(int4hashset_recv);
+PG_FUNCTION_INFO_V1(int4hashset_add);
+PG_FUNCTION_INFO_V1(int4hashset_contains);
+PG_FUNCTION_INFO_V1(int4hashset_count);
+PG_FUNCTION_INFO_V1(int4hashset_merge);
+PG_FUNCTION_INFO_V1(int4hashset_init);
+PG_FUNCTION_INFO_V1(int4hashset_capacity);
+PG_FUNCTION_INFO_V1(int4hashset_agg_add_set);
+PG_FUNCTION_INFO_V1(int4hashset_agg_add);
+PG_FUNCTION_INFO_V1(int4hashset_agg_final);
+PG_FUNCTION_INFO_V1(int4hashset_agg_combine);
+PG_FUNCTION_INFO_V1(int4hashset_to_array);
+PG_FUNCTION_INFO_V1(int4hashset_equals);
+PG_FUNCTION_INFO_V1(int4hashset_neq);
+PG_FUNCTION_INFO_V1(int4hashset_hash);
+PG_FUNCTION_INFO_V1(int4hashset_lt);
+PG_FUNCTION_INFO_V1(int4hashset_le);
+PG_FUNCTION_INFO_V1(int4hashset_gt);
+PG_FUNCTION_INFO_V1(int4hashset_ge);
+PG_FUNCTION_INFO_V1(int4hashset_cmp);
-Datum hashset_in(PG_FUNCTION_ARGS);
-Datum hashset_out(PG_FUNCTION_ARGS);
-Datum hashset_send(PG_FUNCTION_ARGS);
-Datum hashset_recv(PG_FUNCTION_ARGS);
-Datum hashset_add(PG_FUNCTION_ARGS);
-Datum hashset_contains(PG_FUNCTION_ARGS);
-Datum hashset_count(PG_FUNCTION_ARGS);
-Datum hashset_merge(PG_FUNCTION_ARGS);
-Datum hashset_init(PG_FUNCTION_ARGS);
-Datum hashset_agg_add(PG_FUNCTION_ARGS);
-Datum hashset_agg_add_set(PG_FUNCTION_ARGS);
-Datum hashset_agg_final(PG_FUNCTION_ARGS);
-Datum hashset_agg_combine(PG_FUNCTION_ARGS);
-Datum hashset_to_array(PG_FUNCTION_ARGS);
-Datum hashset_equals(PG_FUNCTION_ARGS);
-Datum hashset_neq(PG_FUNCTION_ARGS);
-Datum hashset_hash(PG_FUNCTION_ARGS);
-Datum hashset_lt(PG_FUNCTION_ARGS);
-Datum hashset_le(PG_FUNCTION_ARGS);
-Datum hashset_gt(PG_FUNCTION_ARGS);
-Datum hashset_ge(PG_FUNCTION_ARGS);
-Datum hashset_cmp(PG_FUNCTION_ARGS);
+Datum int4hashset_in(PG_FUNCTION_ARGS);
+Datum int4hashset_out(PG_FUNCTION_ARGS);
+Datum int4hashset_send(PG_FUNCTION_ARGS);
+Datum int4hashset_recv(PG_FUNCTION_ARGS);
+Datum int4hashset_add(PG_FUNCTION_ARGS);
+Datum int4hashset_contains(PG_FUNCTION_ARGS);
+Datum int4hashset_count(PG_FUNCTION_ARGS);
+Datum int4hashset_merge(PG_FUNCTION_ARGS);
+Datum int4hashset_init(PG_FUNCTION_ARGS);
+Datum int4hashset_capacity(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_add(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_add_set(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_final(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_combine(PG_FUNCTION_ARGS);
+Datum int4hashset_to_array(PG_FUNCTION_ARGS);
+Datum int4hashset_equals(PG_FUNCTION_ARGS);
+Datum int4hashset_neq(PG_FUNCTION_ARGS);
+Datum int4hashset_hash(PG_FUNCTION_ARGS);
+Datum int4hashset_lt(PG_FUNCTION_ARGS);
+Datum int4hashset_le(PG_FUNCTION_ARGS);
+Datum int4hashset_gt(PG_FUNCTION_ARGS);
+Datum int4hashset_ge(PG_FUNCTION_ARGS);
+Datum int4hashset_cmp(PG_FUNCTION_ARGS);
-static hashset_t *
-hashset_allocate(int maxelements)
+static int4hashset_t *
+int4hashset_allocate(int maxelements)
{
- Size len;
- hashset_t *set;
- char *ptr;
+ Size len;
+ int4hashset_t *set;
+ char *ptr;
/*
* Ensure that maxelements is not divisible by HASHSET_STEP;
@@ -107,14 +109,14 @@ hashset_allocate(int maxelements)
while (maxelements % HASHSET_STEP == 0)
maxelements++;
- len = offsetof(hashset_t, data);
+ len = offsetof(int4hashset_t, data);
len += CEIL_DIV(maxelements, 8);
len += maxelements * sizeof(int32);
ptr = palloc0(len);
SET_VARSIZE(ptr, len);
- set = (hashset_t *) ptr;
+ set = (int4hashset_t *) ptr;
set->flags = 0;
set->maxelements = maxelements;
@@ -127,12 +129,12 @@ hashset_allocate(int maxelements)
}
Datum
-hashset_in(PG_FUNCTION_ARGS)
+int4hashset_in(PG_FUNCTION_ARGS)
{
char *str = PG_GETARG_CSTRING(0);
char *endptr;
int32 len = strlen(str);
- hashset_t *set;
+ int4hashset_t *set;
/* Check the opening and closing braces */
if (str[0] != '{' || str[len - 1] != '}')
@@ -147,9 +149,15 @@ hashset_in(PG_FUNCTION_ARGS)
str++;
/* Initial size based on input length (arbitrary, could be optimized) */
- set = hashset_allocate(len/2);
+ set = int4hashset_allocate(len/2);
- while (true)
+ /* Check for empty set */
+ if (*str == '}')
+ {
+ PG_RETURN_POINTER(set);
+ }
+
+ while (*str != '}')
{
int64 value = strtol(str, &endptr, 10);
@@ -164,9 +172,9 @@ hashset_in(PG_FUNCTION_ARGS)
/* Add the value to the hashset, resize if needed */
if (set->nelements >= set->maxelements)
{
- set = hashset_resize(set);
+ set = int4hashset_resize(set);
}
- set = hashset_add_element(set, (int32)value);
+ set = int4hashset_add_element(set, (int32)value);
/* Error handling for strtol */
if (endptr == str)
@@ -179,26 +187,26 @@ hashset_in(PG_FUNCTION_ARGS)
{
str = endptr + 1; /* Move to the next number */
}
- else if (*endptr == '}')
- {
- break; /* End of the hashset */
- }
- else
+ else if (*endptr != '}')
{
/* Unexpected character */
ereport(ERROR,
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("unexpected character \"%c\" in hashset input", *endptr)));
}
+ else /* *endptr is '}', move to next iteration */
+ {
+ str = endptr;
+ }
}
PG_RETURN_POINTER(set);
}
Datum
-hashset_out(PG_FUNCTION_ARGS)
+int4hashset_out(PG_FUNCTION_ARGS)
{
- hashset_t *set = (hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
+ int4hashset_t *set = (int4hashset_t *) PG_GETARG_INT4HASHSET(0);
char *bitmap;
int32 *values;
int i;
@@ -239,9 +247,9 @@ hashset_out(PG_FUNCTION_ARGS)
Datum
-hashset_send(PG_FUNCTION_ARGS)
+int4hashset_send(PG_FUNCTION_ARGS)
{
- hashset_t *set = (hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
+ int4hashset_t *set = (int4hashset_t *) PG_GETARG_INT4HASHSET(0);
StringInfoData buf;
int32 data_size;
@@ -255,7 +263,7 @@ hashset_send(PG_FUNCTION_ARGS)
pq_sendint(&buf, set->hashfn_id, 4);
/* Compute and send the size of the data field */
- data_size = VARSIZE(set) - offsetof(hashset_t, data);
+ data_size = VARSIZE(set) - offsetof(int4hashset_t, data);
pq_sendbytes(&buf, set->data, data_size);
PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
@@ -263,10 +271,10 @@ hashset_send(PG_FUNCTION_ARGS)
Datum
-hashset_recv(PG_FUNCTION_ARGS)
+int4hashset_recv(PG_FUNCTION_ARGS)
{
StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
- hashset_t *set;
+ int4hashset_t *set;
int32 data_size;
Size total_size;
const char *binary_data;
@@ -284,10 +292,10 @@ hashset_recv(PG_FUNCTION_ARGS)
binary_data = pq_getmsgbytes(buf, data_size);
/* Compute total size of hashset_t */
- total_size = offsetof(hashset_t, data) + data_size;
+ total_size = offsetof(int4hashset_t, data) + data_size;
/* Allocate memory for hashset including the data field */
- set = (hashset_t *) palloc0(total_size);
+ set = (int4hashset_t *) palloc0(total_size);
/* Set the size of the variable-length data structure */
SET_VARSIZE(set, total_size);
@@ -304,21 +312,21 @@ hashset_recv(PG_FUNCTION_ARGS)
Datum
-hashset_to_array(PG_FUNCTION_ARGS)
+int4hashset_to_array(PG_FUNCTION_ARGS)
{
- int i,
- idx;
- hashset_t *set;
- int32 *values;
- int nvalues;
+ int i,
+ idx;
+ int4hashset_t *set;
+ int32 *values;
+ int nvalues;
- char *sbitmap;
- int32 *svalues;
+ char *sbitmap;
+ int32 *svalues;
if (PG_ARGISNULL(0))
PG_RETURN_NULL();
- set = (hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
+ set = (int4hashset_t *) PG_GETARG_INT4HASHSET(0);
sbitmap = set->data;
svalues = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
@@ -367,13 +375,13 @@ int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len)
}
-static hashset_t *
-hashset_resize(hashset_t * set)
+static int4hashset_t *
+int4hashset_resize(int4hashset_t * set)
{
- int i;
- hashset_t *new = hashset_allocate(set->maxelements * 2);
- char *bitmap;
- int32 *values;
+ int i;
+ int4hashset_t *new = int4hashset_allocate(set->maxelements * 2);
+ char *bitmap;
+ int32 *values;
/* Calculate the pointer to the bitmap and values array */
bitmap = set->data;
@@ -385,14 +393,14 @@ hashset_resize(hashset_t * set)
int bit = (i % 8);
if (bitmap[byte] & (0x01 << bit))
- hashset_add_element(new, values[i]);
+ int4hashset_add_element(new, values[i]);
}
return new;
}
-static hashset_t *
-hashset_add_element(hashset_t *set, int32 value)
+static int4hashset_t *
+int4hashset_add_element(int4hashset_t *set, int32 value)
{
int byte;
int bit;
@@ -401,7 +409,7 @@ hashset_add_element(hashset_t *set, int32 value)
int32 *values;
if (set->nelements > set->maxelements * 0.75)
- set = hashset_resize(set);
+ set = int4hashset_resize(set);
if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
{
@@ -446,7 +454,7 @@ hashset_add_element(hashset_t *set, int32 value)
}
static bool
-hashset_contains_element(hashset_t *set, int32 value)
+int4hashset_contains_element(int4hashset_t *set, int32 value)
{
int byte;
int bit;
@@ -494,9 +502,9 @@ hashset_contains_element(hashset_t *set, int32 value)
}
Datum
-hashset_add(PG_FUNCTION_ARGS)
+int4hashset_add(PG_FUNCTION_ARGS)
{
- hashset_t *set;
+ int4hashset_t *set;
if (PG_ARGISNULL(1))
{
@@ -508,39 +516,38 @@ hashset_add(PG_FUNCTION_ARGS)
/* if there's no hashset allocated, create it now */
if (PG_ARGISNULL(0))
- set = hashset_allocate(64);
+ set = int4hashset_allocate(64);
else
{
/* make sure we are working with a non-toasted and non-shared copy of the input */
- set = (hashset_t *) PG_DETOAST_DATUM_COPY(PG_GETARG_DATUM(0));
+ set = (int4hashset_t *) PG_GETARG_INT4HASHSET(0);
}
- set = hashset_add_element(set, PG_GETARG_INT32(1));
+ set = int4hashset_add_element(set, PG_GETARG_INT32(1));
PG_RETURN_POINTER(set);
}
Datum
-hashset_merge(PG_FUNCTION_ARGS)
+int4hashset_merge(PG_FUNCTION_ARGS)
{
- int i;
+ int i;
- hashset_t *seta;
- hashset_t *setb;
+ int4hashset_t *seta;
+ int4hashset_t *setb;
- char *bitmap;
- int32 *values;
+ char *bitmap;
+ int32_t *values;
- /* */
if (PG_ARGISNULL(0) && PG_ARGISNULL(1))
PG_RETURN_NULL();
else if (PG_ARGISNULL(1))
- PG_RETURN_POINTER(PG_GETARG_HASHSET(0));
+ PG_RETURN_POINTER(PG_GETARG_INT4HASHSET(0));
else if (PG_ARGISNULL(0))
- PG_RETURN_POINTER(PG_GETARG_HASHSET(1));
+ PG_RETURN_POINTER(PG_GETARG_INT4HASHSET(1));
- seta = PG_GETARG_HASHSET(0);
- setb = PG_GETARG_HASHSET(1);
+ seta = PG_GETARG_INT4HASHSET(0);
+ setb = PG_GETARG_INT4HASHSET(1);
bitmap = setb->data;
values = (int32 *) (setb->data + CEIL_DIV(setb->maxelements, 8));
@@ -551,53 +558,78 @@ hashset_merge(PG_FUNCTION_ARGS)
int bit = (i % 8);
if (bitmap[byte] & (0x01 << bit))
- seta = hashset_add_element(seta, values[i]);
+ seta = int4hashset_add_element(seta, values[i]);
}
PG_RETURN_POINTER(seta);
}
Datum
-hashset_init(PG_FUNCTION_ARGS)
+int4hashset_init(PG_FUNCTION_ARGS)
{
- PG_RETURN_POINTER(hashset_allocate(PG_GETARG_INT32(0)));
+ if (PG_NARGS() == 0) {
+ /*
+ * No initial capacity argument was passed,
+ * allocate hashset with zero capacity
+ */
+ PG_RETURN_POINTER(int4hashset_allocate(0));
+ } else {
+ /*
+ * Initial capacity argument was passed,
+ * allocate hashset with the specified capacity
+ */
+ PG_RETURN_POINTER(int4hashset_allocate(PG_GETARG_INT32(0)));
+ }
}
Datum
-hashset_contains(PG_FUNCTION_ARGS)
+int4hashset_contains(PG_FUNCTION_ARGS)
{
- hashset_t *set;
- int32 value;
+ int4hashset_t *set;
+ int32 value;
if (PG_ARGISNULL(1) || PG_ARGISNULL(0))
PG_RETURN_BOOL(false);
- set = PG_GETARG_HASHSET(0);
+ set = PG_GETARG_INT4HASHSET(0);
value = PG_GETARG_INT32(1);
- PG_RETURN_BOOL(hashset_contains_element(set, value));
+ PG_RETURN_BOOL(int4hashset_contains_element(set, value));
}
Datum
-hashset_count(PG_FUNCTION_ARGS)
+int4hashset_count(PG_FUNCTION_ARGS)
{
- hashset_t *set;
+ int4hashset_t *set;
if (PG_ARGISNULL(0))
PG_RETURN_NULL();
- set = PG_GETARG_HASHSET(0);
+ set = PG_GETARG_INT4HASHSET(0);
PG_RETURN_INT64(set->nelements);
}
Datum
-hashset_agg_add(PG_FUNCTION_ARGS)
+int4hashset_capacity(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ PG_RETURN_INT64(set->maxelements);
+}
+
+Datum
+int4hashset_agg_add(PG_FUNCTION_ARGS)
{
MemoryContext oldcontext;
- hashset_t *state;
+ int4hashset_t *state;
- MemoryContext aggcontext;
+ MemoryContext aggcontext;
/* cannot be called directly because of internal-type argument */
if (!AggCheckCallContext(fcinfo, &aggcontext))
@@ -620,26 +652,26 @@ hashset_agg_add(PG_FUNCTION_ARGS)
if (PG_ARGISNULL(0))
{
oldcontext = MemoryContextSwitchTo(aggcontext);
- state = hashset_allocate(64);
+ state = int4hashset_allocate(64);
MemoryContextSwitchTo(oldcontext);
}
else
- state = (hashset_t *) PG_GETARG_POINTER(0);
+ state = (int4hashset_t *) PG_GETARG_POINTER(0);
oldcontext = MemoryContextSwitchTo(aggcontext);
- state = hashset_add_element(state, PG_GETARG_INT32(1));
+ state = int4hashset_add_element(state, PG_GETARG_INT32(1));
MemoryContextSwitchTo(oldcontext);
PG_RETURN_POINTER(state);
}
Datum
-hashset_agg_add_set(PG_FUNCTION_ARGS)
+int4hashset_agg_add_set(PG_FUNCTION_ARGS)
{
MemoryContext oldcontext;
- hashset_t *state;
+ int4hashset_t *state;
- MemoryContext aggcontext;
+ MemoryContext aggcontext;
/* cannot be called directly because of internal-type argument */
if (!AggCheckCallContext(fcinfo, &aggcontext))
@@ -662,21 +694,21 @@ hashset_agg_add_set(PG_FUNCTION_ARGS)
if (PG_ARGISNULL(0))
{
oldcontext = MemoryContextSwitchTo(aggcontext);
- state = hashset_allocate(64);
+ state = int4hashset_allocate(64);
MemoryContextSwitchTo(oldcontext);
}
else
- state = (hashset_t *) PG_GETARG_POINTER(0);
+ state = (int4hashset_t *) PG_GETARG_POINTER(0);
oldcontext = MemoryContextSwitchTo(aggcontext);
{
- int i;
- char *bitmap;
- int32 *values;
- hashset_t *value;
+ int i;
+ char *bitmap;
+ int32 *values;
+ int4hashset_t *value;
- value = PG_GETARG_HASHSET(1);
+ value = PG_GETARG_INT4HASHSET(1);
bitmap = value->data;
values = (int32 *) (value->data + CEIL_DIV(value->maxelements, 8));
@@ -687,7 +719,7 @@ hashset_agg_add_set(PG_FUNCTION_ARGS)
int bit = (i % 8);
if (bitmap[byte] & (0x01 << bit))
- state = hashset_add_element(state, values[i]);
+ state = int4hashset_add_element(state, values[i]);
}
}
@@ -697,7 +729,7 @@ hashset_agg_add_set(PG_FUNCTION_ARGS)
}
Datum
-hashset_agg_final(PG_FUNCTION_ARGS)
+int4hashset_agg_final(PG_FUNCTION_ARGS)
{
if (PG_ARGISNULL(0))
PG_RETURN_NULL();
@@ -705,23 +737,23 @@ hashset_agg_final(PG_FUNCTION_ARGS)
PG_RETURN_POINTER(PG_GETARG_POINTER(0));
}
-static hashset_t *
-hashset_copy(hashset_t *src)
+static int4hashset_t *
+int4hashset_copy(int4hashset_t *src)
{
return src;
}
Datum
-hashset_agg_combine(PG_FUNCTION_ARGS)
+int4hashset_agg_combine(PG_FUNCTION_ARGS)
{
- int i;
- hashset_t *src;
- hashset_t *dst;
- MemoryContext aggcontext;
- MemoryContext oldcontext;
+ int i;
+ int4hashset_t *src;
+ int4hashset_t *dst;
+ MemoryContext aggcontext;
+ MemoryContext oldcontext;
- char *bitmap;
- int32 *values;
+ char *bitmap;
+ int32 *values;
if (!AggCheckCallContext(fcinfo, &aggcontext))
elog(ERROR, "hashset_agg_combine called in non-aggregate context");
@@ -734,11 +766,11 @@ hashset_agg_combine(PG_FUNCTION_ARGS)
PG_RETURN_NULL();
/* the second argument is not NULL, so copy it */
- src = (hashset_t *) PG_GETARG_POINTER(1);
+ src = (int4hashset_t *) PG_GETARG_POINTER(1);
/* copy the hashset into the right long-lived memory context */
oldcontext = MemoryContextSwitchTo(aggcontext);
- src = hashset_copy(src);
+ src = int4hashset_copy(src);
MemoryContextSwitchTo(oldcontext);
PG_RETURN_POINTER(src);
@@ -752,8 +784,8 @@ hashset_agg_combine(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(PG_GETARG_DATUM(0));
/* Now we know neither argument is NULL, so merge them. */
- src = (hashset_t *) PG_GETARG_POINTER(1);
- dst = (hashset_t *) PG_GETARG_POINTER(0);
+ src = (int4hashset_t *) PG_GETARG_POINTER(1);
+ dst = (int4hashset_t *) PG_GETARG_POINTER(0);
bitmap = src->data;
values = (int32 *) (src->data + CEIL_DIV(src->maxelements, 8));
@@ -764,7 +796,7 @@ hashset_agg_combine(PG_FUNCTION_ARGS)
int bit = (i % 8);
if (bitmap[byte] & (0x01 << bit))
- dst = hashset_add_element(dst, values[i]);
+ dst = int4hashset_add_element(dst, values[i]);
}
@@ -773,10 +805,10 @@ hashset_agg_combine(PG_FUNCTION_ARGS)
Datum
-hashset_equals(PG_FUNCTION_ARGS)
+int4hashset_equals(PG_FUNCTION_ARGS)
{
- hashset_t *a = PG_GETARG_HASHSET(0);
- hashset_t *b = PG_GETARG_HASHSET(1);
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
char *bitmap_a;
int32 *values_a;
@@ -803,7 +835,7 @@ hashset_equals(PG_FUNCTION_ARGS)
{
int32 value = values_a[i];
- if (!hashset_contains_element(b, value))
+ if (!int4hashset_contains_element(b, value))
PG_RETURN_BOOL(false);
}
}
@@ -817,22 +849,22 @@ hashset_equals(PG_FUNCTION_ARGS)
Datum
-hashset_neq(PG_FUNCTION_ARGS)
+int4hashset_neq(PG_FUNCTION_ARGS)
{
- hashset_t *a = PG_GETARG_HASHSET(0);
- hashset_t *b = PG_GETARG_HASHSET(1);
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
/* If a is not equal to b, then they are not equal */
- if (!DatumGetBool(DirectFunctionCall2(hashset_equals, PointerGetDatum(a), PointerGetDatum(b))))
+ if (!DatumGetBool(DirectFunctionCall2(int4hashset_equals, PointerGetDatum(a), PointerGetDatum(b))))
PG_RETURN_BOOL(true);
PG_RETURN_BOOL(false);
}
-Datum hashset_hash(PG_FUNCTION_ARGS)
+Datum int4hashset_hash(PG_FUNCTION_ARGS)
{
- hashset_t *set = PG_GETARG_HASHSET(0);
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
/* Initial hash value */
uint32 hash = 0;
@@ -861,13 +893,13 @@ Datum hashset_hash(PG_FUNCTION_ARGS)
Datum
-hashset_lt(PG_FUNCTION_ARGS)
+int4hashset_lt(PG_FUNCTION_ARGS)
{
- hashset_t *a = PG_GETARG_HASHSET(0);
- hashset_t *b = PG_GETARG_HASHSET(1);
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
int32 cmp;
- cmp = DatumGetInt32(DirectFunctionCall2(hashset_cmp,
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
PointerGetDatum(a),
PointerGetDatum(b)));
@@ -876,13 +908,13 @@ hashset_lt(PG_FUNCTION_ARGS)
Datum
-hashset_le(PG_FUNCTION_ARGS)
+int4hashset_le(PG_FUNCTION_ARGS)
{
- hashset_t *a = PG_GETARG_HASHSET(0);
- hashset_t *b = PG_GETARG_HASHSET(1);
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
int32 cmp;
- cmp = DatumGetInt32(DirectFunctionCall2(hashset_cmp,
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
PointerGetDatum(a),
PointerGetDatum(b)));
@@ -891,13 +923,13 @@ hashset_le(PG_FUNCTION_ARGS)
Datum
-hashset_gt(PG_FUNCTION_ARGS)
+int4hashset_gt(PG_FUNCTION_ARGS)
{
- hashset_t *a = PG_GETARG_HASHSET(0);
- hashset_t *b = PG_GETARG_HASHSET(1);
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
int32 cmp;
- cmp = DatumGetInt32(DirectFunctionCall2(hashset_cmp,
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
PointerGetDatum(a),
PointerGetDatum(b)));
@@ -906,13 +938,13 @@ hashset_gt(PG_FUNCTION_ARGS)
Datum
-hashset_ge(PG_FUNCTION_ARGS)
+int4hashset_ge(PG_FUNCTION_ARGS)
{
- hashset_t *a = PG_GETARG_HASHSET(0);
- hashset_t *b = PG_GETARG_HASHSET(1);
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
int32 cmp;
- cmp = DatumGetInt32(DirectFunctionCall2(hashset_cmp,
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
PointerGetDatum(a),
PointerGetDatum(b)));
@@ -921,10 +953,10 @@ hashset_ge(PG_FUNCTION_ARGS)
Datum
-hashset_cmp(PG_FUNCTION_ARGS)
+int4hashset_cmp(PG_FUNCTION_ARGS)
{
- hashset_t *a = PG_GETARG_HASHSET(0);
- hashset_t *b = PG_GETARG_HASHSET(1);
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
char *bitmap_a, *bitmap_b;
int32 *values_a, *values_b;
diff --git a/test/expected/basic.out b/test/expected/basic.out
index 5be2501..a793ef2 100644
--- a/test/expected/basic.out
+++ b/test/expected/basic.out
@@ -1,96 +1,335 @@
-SELECT hashset_sorted('{1}'::hashset);
- hashset_sorted
-----------------
- {1}
-(1 row)
-
-SELECT hashset_sorted('{1,2}'::hashset);
- hashset_sorted
-----------------
- {1,2}
-(1 row)
-
-SELECT hashset_sorted('{1,2,3}'::hashset);
- hashset_sorted
-----------------
- {1,2,3}
-(1 row)
-
-SELECT hashset_sorted('{1,2,3,4}'::hashset);
- hashset_sorted
-----------------
- {1,2,3,4}
-(1 row)
-
-SELECT hashset_sorted('{1,2,3,4,5}'::hashset);
- hashset_sorted
-----------------
- {1,2,3,4,5}
-(1 row)
-
-SELECT hashset_sorted('{1,2,3,4,5,6}'::hashset);
- hashset_sorted
-----------------
- {1,2,3,4,5,6}
-(1 row)
-
-SELECT hashset_sorted('{1,2,3,4,5,6,7}'::hashset);
- hashset_sorted
------------------
- {1,2,3,4,5,6,7}
-(1 row)
-
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::hashset);
- hashset_sorted
--------------------
- {1,2,3,4,5,6,7,8}
-(1 row)
-
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::hashset);
- hashset_sorted
----------------------
- {1,2,3,4,5,6,7,8,9}
-(1 row)
-
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::hashset);
- hashset_sorted
-------------------------
- {1,2,3,4,5,6,7,8,9,10}
-(1 row)
-
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::hashset);
- hashset_sorted
+/*
+ * Hashset Type
+ */
+SELECT '{}'::int4hashset; -- empty int4hashset
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset;
+ int4hashset
+-------------
+ {3,2,1}
+(1 row)
+
+SELECT '{-2147483648,0,2147483647}'::int4hashset;
+ int4hashset
+----------------------------
+ {0,2147483647,-2147483648}
+(1 row)
+
+SELECT '{-2147483649}'::int4hashset; -- out of range
+ERROR: value "-2147483649}" is out of range for type integer
+LINE 1: SELECT '{-2147483649}'::int4hashset;
+ ^
+SELECT '{2147483648}'::int4hashset; -- out of range
+ERROR: value "2147483648}" is out of range for type integer
+LINE 1: SELECT '{2147483648}'::int4hashset;
+ ^
+/*
+ * Hashset Functions
+ */
+SELECT int4hashset(); -- init empty int4hashset with no capacity
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT int4hashset_with_capacity(10); -- init empty int4hashset with specified capacity
+ int4hashset_with_capacity
---------------------------
- {1,2,3,4,5,6,7,8,9,10,11}
+ {}
(1 row)
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::hashset);
- hashset_sorted
-------------------------------
- {1,2,3,4,5,6,7,8,9,10,11,12}
+SELECT hashset_add(int4hashset(), 123);
+ hashset_add
+-------------
+ {123}
(1 row)
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::hashset);
- hashset_sorted
----------------------------------
- {1,2,3,4,5,6,7,8,9,10,11,12,13}
+SELECT hashset_add(NULL::int4hashset, 123);
+ hashset_add
+-------------
+ {123}
(1 row)
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::hashset);
- hashset_sorted
-------------------------------------
- {1,2,3,4,5,6,7,8,9,10,11,12,13,14}
+SELECT hashset_add('{123}'::int4hashset, 456);
+ hashset_add
+-------------
+ {456,123}
(1 row)
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::hashset);
- hashset_sorted
----------------------------------------
- {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
+SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
+ hashset_contains
+------------------
+ t
(1 row)
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::hashset);
- hashset_sorted
-------------------------------------------
- {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
+SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
+ hashset_contains
+------------------
+ f
+(1 row)
+
+SELECT hashset_merge('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+ hashset_merge
+---------------
+ {3,1,2}
+(1 row)
+
+SELECT hashset_to_array('{1,2,3}'::int4hashset);
+ hashset_to_array
+------------------
+ {3,2,1}
+(1 row)
+
+SELECT hashset_count('{1,2,3}'::int4hashset); -- 3
+ hashset_count
+---------------
+ 3
+(1 row)
+
+SELECT hashset_capacity(int4hashset_with_capacity(10)); -- 10
+ hashset_capacity
+------------------
+ 10
+(1 row)
+
+/*
+ * Aggregation Functions
+ */
+SELECT hashset(i) FROM generate_series(1,10) AS i;
+ hashset
+------------------------
+ {8,1,10,3,9,4,6,2,5,7}
+(1 row)
+
+SELECT hashset(h) FROM
+(
+ SELECT hashset(i) AS h FROM generate_series(1,5) AS i
+ UNION ALL
+ SELECT hashset(j) AS h FROM generate_series(6,10) AS j
+) q;
+ hashset
+------------------------
+ {8,1,10,3,9,4,6,2,5,7}
+(1 row)
+
+/*
+ * Operator Definitions
+ */
+SELECT '{2}'::int4hashset = '{1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset = '{2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset = '{3}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{2,3,1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{4,5,6}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3,4}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{2,3,1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{4,5,6}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+/*
+ * Hashset Hash Operators
+ */
+SELECT hashset_hash('{1,2,3}'::int4hashset);
+ hashset_hash
+--------------
+ -1778803072
+(1 row)
+
+SELECT hashset_hash('{3,2,1}'::int4hashset);
+ hashset_hash
+--------------
+ -1778803072
+(1 row)
+
+SELECT COUNT(*), COUNT(DISTINCT h)
+FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3,2,1}'::int4hashset AS h
+) q;
+ count | count
+-------+-------
+ 2 | 1
+(1 row)
+
+/*
+ * Hashset Btree Operators
+ */
+SELECT h FROM
+(
+ SELECT '{2}'::int4hashset AS h
+ UNION ALL
+ SELECT '{1}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3}'::int4hashset AS h
+) q
+ORDER BY h;
+ h
+-----
+ {1}
+ {2}
+ {3}
+(3 rows)
+
+SELECT '{2}'::int4hashset < '{1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset < '{2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset < '{3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset <= '{1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset <= '{2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset <= '{3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset > '{1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset > '{2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset > '{3}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset >= '{1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset >= '{2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset >= '{3}'::int4hashset; -- false
+ ?column?
+----------
+ f
(1 row)
diff --git a/test/expected/invalid.out b/test/expected/invalid.out
index 0e08925..bd44199 100644
--- a/test/expected/invalid.out
+++ b/test/expected/invalid.out
@@ -1,4 +1,4 @@
-SELECT '{1,2s}'::hashset;
+SELECT '{1,2s}'::int4hashset;
ERROR: unexpected character "s" in hashset input
-LINE 1: SELECT '{1,2s}'::hashset;
+LINE 1: SELECT '{1,2s}'::int4hashset;
^
diff --git a/test/expected/io_varying_lengths.out b/test/expected/io_varying_lengths.out
new file mode 100644
index 0000000..45e9fb1
--- /dev/null
+++ b/test/expected/io_varying_lengths.out
@@ -0,0 +1,100 @@
+/*
+ * This test verifies the hashset input/output functions for varying
+ * initial capacities, ensuring functionality across different sizes.
+ */
+SELECT hashset_sorted('{1}'::int4hashset);
+ hashset_sorted
+----------------
+ {1}
+(1 row)
+
+SELECT hashset_sorted('{1,2}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::int4hashset);
+ hashset_sorted
+-----------------
+ {1,2,3,4,5,6,7}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::int4hashset);
+ hashset_sorted
+-------------------
+ {1,2,3,4,5,6,7,8}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::int4hashset);
+ hashset_sorted
+---------------------
+ {1,2,3,4,5,6,7,8,9}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::int4hashset);
+ hashset_sorted
+------------------------
+ {1,2,3,4,5,6,7,8,9,10}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::int4hashset);
+ hashset_sorted
+---------------------------
+ {1,2,3,4,5,6,7,8,9,10,11}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::int4hashset);
+ hashset_sorted
+------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::int4hashset);
+ hashset_sorted
+---------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::int4hashset);
+ hashset_sorted
+------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::int4hashset);
+ hashset_sorted
+---------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::int4hashset);
+ hashset_sorted
+------------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
+(1 row)
+
diff --git a/test/expected/order.out b/test/expected/order.out
index f8f8d5b..089bd15 100644
--- a/test/expected/order.out
+++ b/test/expected/order.out
@@ -1,128 +1,20 @@
-CREATE TABLE IF NOT EXISTS test_hashset_order (hashset_col hashset);
-INSERT INTO test_hashset_order (hashset_col) VALUES ('{1,2,3}'::hashset);
-INSERT INTO test_hashset_order (hashset_col) VALUES ('{3,2,1}'::hashset);
-INSERT INTO test_hashset_order (hashset_col) VALUES ('{4,5,6}'::hashset);
-SELECT COUNT(DISTINCT hashset_col) FROM test_hashset_order;
+CREATE TABLE IF NOT EXISTS test_int4hashset_order (int4hashset_col int4hashset);
+INSERT INTO test_int4hashset_order (int4hashset_col) VALUES ('{1,2,3}'::int4hashset);
+INSERT INTO test_int4hashset_order (int4hashset_col) VALUES ('{3,2,1}'::int4hashset);
+INSERT INTO test_int4hashset_order (int4hashset_col) VALUES ('{4,5,6}'::int4hashset);
+SELECT COUNT(DISTINCT int4hashset_col) FROM test_int4hashset_order;
count
-------
2
(1 row)
-SELECT '{2}'::hashset < '{1}'::hashset; -- false
- ?column?
-----------
- f
-(1 row)
-
-SELECT '{2}'::hashset < '{2}'::hashset; -- false
- ?column?
-----------
- f
-(1 row)
-
-SELECT '{2}'::hashset < '{3}'::hashset; -- true
- ?column?
-----------
- t
-(1 row)
-
-SELECT '{2}'::hashset <= '{1}'::hashset; -- false
- ?column?
-----------
- f
-(1 row)
-
-SELECT '{2}'::hashset <= '{2}'::hashset; -- true
- ?column?
-----------
- t
-(1 row)
-
-SELECT '{2}'::hashset <= '{3}'::hashset; -- true
- ?column?
-----------
- t
-(1 row)
-
-SELECT '{2}'::hashset > '{1}'::hashset; -- true
- ?column?
-----------
- t
-(1 row)
-
-SELECT '{2}'::hashset > '{2}'::hashset; -- false
- ?column?
-----------
- f
-(1 row)
-
-SELECT '{2}'::hashset > '{3}'::hashset; -- false
- ?column?
-----------
- f
-(1 row)
-
-SELECT '{2}'::hashset >= '{1}'::hashset; -- true
- ?column?
-----------
- t
-(1 row)
-
-SELECT '{2}'::hashset >= '{2}'::hashset; -- true
- ?column?
-----------
- t
-(1 row)
-
-SELECT '{2}'::hashset >= '{3}'::hashset; -- false
- ?column?
-----------
- f
-(1 row)
-
-SELECT '{2}'::hashset = '{1}'::hashset; -- false
- ?column?
-----------
- f
-(1 row)
-
-SELECT '{2}'::hashset = '{2}'::hashset; -- true
- ?column?
-----------
- t
-(1 row)
-
-SELECT '{2}'::hashset = '{3}'::hashset; -- false
- ?column?
-----------
- f
-(1 row)
-
-SELECT '{2}'::hashset <> '{1}'::hashset; -- true
- ?column?
-----------
- t
-(1 row)
-
-SELECT '{2}'::hashset <> '{2}'::hashset; -- false
- ?column?
-----------
- f
-(1 row)
-
-SELECT '{2}'::hashset <> '{3}'::hashset; -- true
- ?column?
-----------
- t
-(1 row)
-
-CREATE OR REPLACE FUNCTION generate_random_hashset(num_elements INT)
-RETURNS hashset AS $$
+CREATE OR REPLACE FUNCTION generate_random_int4hashset(num_elements INT)
+RETURNS int4hashset AS $$
DECLARE
element INT;
- random_set hashset;
+ random_set int4hashset;
BEGIN
- random_set := hashset_init(num_elements);
+ random_set := int4hashset_with_capacity(num_elements);
FOR i IN 1..num_elements LOOP
element := floor(random() * 1000)::INT;
@@ -138,14 +30,14 @@ SELECT setseed(0.123465);
(1 row)
-CREATE TABLE hashset_order_test AS
-SELECT generate_random_hashset(3) AS hashset_col
+CREATE TABLE int4hashset_order_test AS
+SELECT generate_random_int4hashset(3) AS hashset_col
FROM generate_series(1,1000)
UNION
-SELECT generate_random_hashset(2)
+SELECT generate_random_int4hashset(2)
FROM generate_series(1,1000);
SELECT hashset_col
-FROM hashset_order_test
+FROM int4hashset_order_test
ORDER BY hashset_col
LIMIT 20;
hashset_col
diff --git a/test/expected/prelude.out b/test/expected/prelude.out
index b094033..f34e190 100644
--- a/test/expected/prelude.out
+++ b/test/expected/prelude.out
@@ -1,5 +1,5 @@
CREATE EXTENSION hashset;
-CREATE OR REPLACE FUNCTION hashset_sorted(hashset)
+CREATE OR REPLACE FUNCTION hashset_sorted(int4hashset)
RETURNS TEXT AS
$$
SELECT array_agg(i ORDER BY i::int)::text
diff --git a/test/expected/random.out b/test/expected/random.out
index 889f5ca..9d9026b 100644
--- a/test/expected/random.out
+++ b/test/expected/random.out
@@ -5,7 +5,7 @@ SELECT setseed(0.12345);
(1 row)
\set MAX_INT 2147483647
-CREATE TABLE hashset_random_numbers AS
+CREATE TABLE hashset_random_int4_numbers AS
SELECT
(random()*:MAX_INT)::int AS i
FROM generate_series(1,(random()*10000)::int)
@@ -15,8 +15,8 @@ SELECT
FROM
(
SELECT
- hashset_sorted(hashset(format('{%s}',string_agg(i::text,','))))
- FROM hashset_random_numbers
+ hashset_sorted(int4hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_int4_numbers
) q;
md5
----------------------------------
@@ -29,7 +29,7 @@ FROM
(
SELECT
format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
- FROM hashset_random_numbers
+ FROM hashset_random_int4_numbers
) q;
md5
----------------------------------
diff --git a/test/expected/table.out b/test/expected/table.out
index 8d4fbe2..3c020b6 100644
--- a/test/expected/table.out
+++ b/test/expected/table.out
@@ -1,6 +1,6 @@
CREATE TABLE users (
user_id int PRIMARY KEY,
- user_likes hashset DEFAULT hashset_init(2)
+ user_likes int4hashset DEFAULT int4hashset_with_capacity(2)
);
INSERT INTO users (user_id) VALUES (1);
UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
diff --git a/test/sql/basic.sql b/test/sql/basic.sql
index 662e65a..563c626 100644
--- a/test/sql/basic.sql
+++ b/test/sql/basic.sql
@@ -1,16 +1,107 @@
-SELECT hashset_sorted('{1}'::hashset);
-SELECT hashset_sorted('{1,2}'::hashset);
-SELECT hashset_sorted('{1,2,3}'::hashset);
-SELECT hashset_sorted('{1,2,3,4}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5,6}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5,6,7}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::hashset);
-SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::hashset);
+/*
+ * Hashset Type
+ */
+
+SELECT '{}'::int4hashset; -- empty int4hashset
+SELECT '{1,2,3}'::int4hashset;
+SELECT '{-2147483648,0,2147483647}'::int4hashset;
+SELECT '{-2147483649}'::int4hashset; -- out of range
+SELECT '{2147483648}'::int4hashset; -- out of range
+
+/*
+ * Hashset Functions
+ */
+
+SELECT int4hashset(); -- init empty int4hashset with no capacity
+SELECT int4hashset_with_capacity(10); -- init empty int4hashset with specified capacity
+SELECT hashset_add(int4hashset(), 123);
+SELECT hashset_add(NULL::int4hashset, 123);
+SELECT hashset_add('{123}'::int4hashset, 456);
+SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
+SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
+SELECT hashset_merge('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+SELECT hashset_to_array('{1,2,3}'::int4hashset);
+SELECT hashset_count('{1,2,3}'::int4hashset); -- 3
+SELECT hashset_capacity(int4hashset_with_capacity(10)); -- 10
+
+/*
+ * Aggregation Functions
+ */
+
+SELECT hashset(i) FROM generate_series(1,10) AS i;
+
+SELECT hashset(h) FROM
+(
+ SELECT hashset(i) AS h FROM generate_series(1,5) AS i
+ UNION ALL
+ SELECT hashset(j) AS h FROM generate_series(6,10) AS j
+) q;
+
+/*
+ * Operator Definitions
+ */
+
+SELECT '{2}'::int4hashset = '{1}'::int4hashset; -- false
+SELECT '{2}'::int4hashset = '{2}'::int4hashset; -- true
+SELECT '{2}'::int4hashset = '{3}'::int4hashset; -- false
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset = '{2,3,1}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset = '{4,5,6}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset = '{1,2}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset = '{1,2,3,4}'::int4hashset; -- false
+
+SELECT '{2}'::int4hashset <> '{1}'::int4hashset; -- true
+SELECT '{2}'::int4hashset <> '{2}'::int4hashset; -- false
+SELECT '{2}'::int4hashset <> '{3}'::int4hashset; -- true
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset <> '{2,3,1}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset <> '{4,5,6}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset <> '{1,2}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
+
+/*
+ * Hashset Hash Operators
+ */
+
+SELECT hashset_hash('{1,2,3}'::int4hashset);
+SELECT hashset_hash('{3,2,1}'::int4hashset);
+
+SELECT COUNT(*), COUNT(DISTINCT h)
+FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3,2,1}'::int4hashset AS h
+) q;
+
+/*
+ * Hashset Btree Operators
+ */
+
+SELECT h FROM
+(
+ SELECT '{2}'::int4hashset AS h
+ UNION ALL
+ SELECT '{1}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3}'::int4hashset AS h
+) q
+ORDER BY h;
+
+SELECT '{2}'::int4hashset < '{1}'::int4hashset; -- false
+SELECT '{2}'::int4hashset < '{2}'::int4hashset; -- false
+SELECT '{2}'::int4hashset < '{3}'::int4hashset; -- true
+
+SELECT '{2}'::int4hashset <= '{1}'::int4hashset; -- false
+SELECT '{2}'::int4hashset <= '{2}'::int4hashset; -- true
+SELECT '{2}'::int4hashset <= '{3}'::int4hashset; -- true
+
+SELECT '{2}'::int4hashset > '{1}'::int4hashset; -- true
+SELECT '{2}'::int4hashset > '{2}'::int4hashset; -- false
+SELECT '{2}'::int4hashset > '{3}'::int4hashset; -- false
+
+SELECT '{2}'::int4hashset >= '{1}'::int4hashset; -- true
+SELECT '{2}'::int4hashset >= '{2}'::int4hashset; -- true
+SELECT '{2}'::int4hashset >= '{3}'::int4hashset; -- false
diff --git a/test/sql/invalid.sql b/test/sql/invalid.sql
index f1a9488..43689ab 100644
--- a/test/sql/invalid.sql
+++ b/test/sql/invalid.sql
@@ -1 +1 @@
-SELECT '{1,2s}'::hashset;
+SELECT '{1,2s}'::int4hashset;
diff --git a/test/sql/io_varying_lengths.sql b/test/sql/io_varying_lengths.sql
new file mode 100644
index 0000000..8acb6b8
--- /dev/null
+++ b/test/sql/io_varying_lengths.sql
@@ -0,0 +1,21 @@
+/*
+ * This test verifies the hashset input/output functions for varying
+ * initial capacities, ensuring functionality across different sizes.
+ */
+
+SELECT hashset_sorted('{1}'::int4hashset);
+SELECT hashset_sorted('{1,2}'::int4hashset);
+SELECT hashset_sorted('{1,2,3}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::int4hashset);
diff --git a/test/sql/order.sql b/test/sql/order.sql
index 1780c0b..2dcdb39 100644
--- a/test/sql/order.sql
+++ b/test/sql/order.sql
@@ -1,40 +1,16 @@
-CREATE TABLE IF NOT EXISTS test_hashset_order (hashset_col hashset);
-INSERT INTO test_hashset_order (hashset_col) VALUES ('{1,2,3}'::hashset);
-INSERT INTO test_hashset_order (hashset_col) VALUES ('{3,2,1}'::hashset);
-INSERT INTO test_hashset_order (hashset_col) VALUES ('{4,5,6}'::hashset);
-SELECT COUNT(DISTINCT hashset_col) FROM test_hashset_order;
-
-SELECT '{2}'::hashset < '{1}'::hashset; -- false
-SELECT '{2}'::hashset < '{2}'::hashset; -- false
-SELECT '{2}'::hashset < '{3}'::hashset; -- true
-
-SELECT '{2}'::hashset <= '{1}'::hashset; -- false
-SELECT '{2}'::hashset <= '{2}'::hashset; -- true
-SELECT '{2}'::hashset <= '{3}'::hashset; -- true
-
-SELECT '{2}'::hashset > '{1}'::hashset; -- true
-SELECT '{2}'::hashset > '{2}'::hashset; -- false
-SELECT '{2}'::hashset > '{3}'::hashset; -- false
-
-SELECT '{2}'::hashset >= '{1}'::hashset; -- true
-SELECT '{2}'::hashset >= '{2}'::hashset; -- true
-SELECT '{2}'::hashset >= '{3}'::hashset; -- false
-
-SELECT '{2}'::hashset = '{1}'::hashset; -- false
-SELECT '{2}'::hashset = '{2}'::hashset; -- true
-SELECT '{2}'::hashset = '{3}'::hashset; -- false
-
-SELECT '{2}'::hashset <> '{1}'::hashset; -- true
-SELECT '{2}'::hashset <> '{2}'::hashset; -- false
-SELECT '{2}'::hashset <> '{3}'::hashset; -- true
-
-CREATE OR REPLACE FUNCTION generate_random_hashset(num_elements INT)
-RETURNS hashset AS $$
+CREATE TABLE IF NOT EXISTS test_int4hashset_order (int4hashset_col int4hashset);
+INSERT INTO test_int4hashset_order (int4hashset_col) VALUES ('{1,2,3}'::int4hashset);
+INSERT INTO test_int4hashset_order (int4hashset_col) VALUES ('{3,2,1}'::int4hashset);
+INSERT INTO test_int4hashset_order (int4hashset_col) VALUES ('{4,5,6}'::int4hashset);
+SELECT COUNT(DISTINCT int4hashset_col) FROM test_int4hashset_order;
+
+CREATE OR REPLACE FUNCTION generate_random_int4hashset(num_elements INT)
+RETURNS int4hashset AS $$
DECLARE
element INT;
- random_set hashset;
+ random_set int4hashset;
BEGIN
- random_set := hashset_init(num_elements);
+ random_set := int4hashset_with_capacity(num_elements);
FOR i IN 1..num_elements LOOP
element := floor(random() * 1000)::INT;
@@ -47,14 +23,14 @@ $$ LANGUAGE plpgsql;
SELECT setseed(0.123465);
-CREATE TABLE hashset_order_test AS
-SELECT generate_random_hashset(3) AS hashset_col
+CREATE TABLE int4hashset_order_test AS
+SELECT generate_random_int4hashset(3) AS hashset_col
FROM generate_series(1,1000)
UNION
-SELECT generate_random_hashset(2)
+SELECT generate_random_int4hashset(2)
FROM generate_series(1,1000);
SELECT hashset_col
-FROM hashset_order_test
+FROM int4hashset_order_test
ORDER BY hashset_col
LIMIT 20;
diff --git a/test/sql/prelude.sql b/test/sql/prelude.sql
index ccc0595..2fee0fc 100644
--- a/test/sql/prelude.sql
+++ b/test/sql/prelude.sql
@@ -1,6 +1,6 @@
CREATE EXTENSION hashset;
-CREATE OR REPLACE FUNCTION hashset_sorted(hashset)
+CREATE OR REPLACE FUNCTION hashset_sorted(int4hashset)
RETURNS TEXT AS
$$
SELECT array_agg(i ORDER BY i::int)::text
diff --git a/test/sql/random.sql b/test/sql/random.sql
index 16c9084..7cc8f87 100644
--- a/test/sql/random.sql
+++ b/test/sql/random.sql
@@ -2,7 +2,7 @@ SELECT setseed(0.12345);
\set MAX_INT 2147483647
-CREATE TABLE hashset_random_numbers AS
+CREATE TABLE hashset_random_int4_numbers AS
SELECT
(random()*:MAX_INT)::int AS i
FROM generate_series(1,(random()*10000)::int)
@@ -13,8 +13,8 @@ SELECT
FROM
(
SELECT
- hashset_sorted(hashset(format('{%s}',string_agg(i::text,','))))
- FROM hashset_random_numbers
+ hashset_sorted(int4hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_int4_numbers
) q;
SELECT
@@ -23,5 +23,5 @@ FROM
(
SELECT
format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
- FROM hashset_random_numbers
+ FROM hashset_random_int4_numbers
) q;
diff --git a/test/sql/table.sql b/test/sql/table.sql
index d848207..a63253f 100644
--- a/test/sql/table.sql
+++ b/test/sql/table.sql
@@ -1,6 +1,6 @@
CREATE TABLE users (
user_id int PRIMARY KEY,
- user_likes hashset DEFAULT hashset_init(2)
+ user_likes int4hashset DEFAULT int4hashset_with_capacity(2)
);
INSERT INTO users (user_id) VALUES (1);
UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
On Wed, 14 Jun 2023 at 19:14, Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
...So we'd want the same index usage as
=ANY(array) but would like faster row checking than we get with an array
when other indexes are used.We kinda already do this since PG14 (commit 50e17ad281), actually. If
the list is long enough (9 values or more), we'll build a hash table
during query execution. So pretty much exactly what you're asking for.
Ha! That is great. Unfortunately we can't rely on it as we have customers
using versions back to 12. But good to know that it's available when we
bump the required versions.
Thanks
Tom
On Thu, Jun 15, 2023 at 5:04 AM Joel Jacobson <joel@compiler.org> wrote:
On Wed, Jun 14, 2023, at 15:16, Tomas Vondra wrote:
On 6/14/23 14:57, Joel Jacobson wrote:
Would it be feasible to teach the planner to utilize the internal hash
table of
hashset directly? In the case of arrays, the hash table construction is
an
...It's definitely something I'd leave out of v0, personally.
OK, thanks for guidance, I'll stay away from it.
I've been doing some preparatory work on this todo item:
3) support for other types (now it only works with int32)
I've renamed the type from "hashset" to "int4hashset",
and the SQL-functions are now prefixed with "int4"
when necessary. The overloaded functions with
int4hashset as input parameters don't need to be prefixed,
e.g. hashset_add(int4hashset, int).Other changes since last update (4e60615):
* Support creation of empty hashset using '{}'::hashset
* Introduced a new function hashset_capacity() to return the current
capacity
of a hashset.
* Refactored hashset initialization:
- Replaced hashset_init(int) with int4hashset() to initialize an empty
hashset
with zero capacity.
- Added int4hashset_with_capacity(int) to initialize a hashset with
a specified capacity.
* Improved README.md and testingAs a next step, I'm planning on adding int8 support.
Looks and sounds good?
/Joel
still playing around with hashset-0.0.1-a8a282a.patch.
I think "postgres.h" should be on the top, (someone have said it on another
email thread, I forgot who said that)
In my
local /home/jian/postgres/pg16/include/postgresql/server/libpq/pqformat.h:
/*
* Append a binary integer to a StringInfo buffer
*
* This function is deprecated; prefer use of the functions above.
*/
static inline void
pq_sendint(StringInfo buf, uint32 i, int b)
So I changed to pq_sendint32.
ending and beginning, and in between white space should be stripped. The
following c example seems ok for now. but I am not sure, I don't know how
to glue it in hashset_in.
forgive me the patch name....
/*
gcc /home/jian/Desktop/regress_pgsql/strip_white_space.c && ./a.out
*/
#include<stdio.h>
#include<stdint.h>
#include<string.h>
#include<stdbool.h>
#include <ctype.h>
#include<stdlib.h>
/*
* array_isspace() --- a non-locale-dependent isspace()
*
* We used to use isspace() for parsing array values, but that has
* undesirable results: an array value might be silently interpreted
* differently depending on the locale setting. Now we just hard-wire
* the traditional ASCII definition of isspace().
*/
static bool
array_isspace(char ch)
{
if (ch == ' ' ||
ch == '\t' ||
ch == '\n' ||
ch == '\r' ||
ch == '\v' ||
ch == '\f')
return true;
return false;
}
int main(void)
{
long *temp = malloc(10 * sizeof(long));
memset(temp,0,10);
char source[5][50] = {{0}};
snprintf(source[0],sizeof(source[0]),"%s"," { 1 , 20 }");
snprintf(source[1],sizeof(source[0]),"%s"," { 1 ,20 , 30 ");
snprintf(source[2],sizeof(source[0]),"%s"," {1 ,20 , 30 ");
snprintf(source[3],sizeof(source[0]),"%s"," {1 , 20 , 30 }");
snprintf(source[4],sizeof(source[0]),"%s"," {1 , 20 , 30 }
");
/* Make a modifiable copy of the input */
char *p;
char string_save[50];
for(int j = 0; j < 5; j++)
{
snprintf(string_save,sizeof(string_save),"%s",source[j]);
p = string_save;
int i = 0;
while (array_isspace(*p))
p++;
if (*p != '{')
{
printf("line: %d should be {\n",__LINE__);
exit(EXIT_FAILURE);
}
for (;;)
{
char *q;
if (*p == '{')
p++;
temp[i] = strtol(p, &q,10);
printf("temp[j=%d] [%d]=%ld\n",j,i,temp[i]);
if (*q == '}' && (*(q+1) == '\0'))
{
printf("all works ok now exit\n");
break;
}
if( !array_isspace(*q) && *q != ',')
{
printf("wrong format. program will exit\n");
exit(EXIT_FAILURE);
}
while(array_isspace(*q))
q++;
if(*q != ',')
break;
else
p = q+1;
i++;
}
}
}
Attachments:
temp.patchtext/x-patch; charset=US-ASCII; name=temp.patchDownload
diff --git a/hashset.c b/hashset.c
index d62e0491..e71764c5 100644
--- a/hashset.c
+++ b/hashset.c
@@ -3,15 +3,13 @@
*
* Copyright (C) Tomas Vondra, 2019
*/
+#include "postgres.h"
-#include <stdio.h>
#include <math.h>
-#include <string.h>
#include <sys/time.h>
#include <unistd.h>
#include <limits.h>
-#include "postgres.h"
#include "libpq/pqformat.h"
#include "nodes/memnodes.h"
#include "utils/array.h"
@@ -255,10 +253,10 @@ hashset_send(PG_FUNCTION_ARGS)
pq_begintypsend(&buf);
/* Send the non-data fields */
- pq_sendint(&buf, set->flags, 4);
- pq_sendint(&buf, set->maxelements, 4);
- pq_sendint(&buf, set->nelements, 4);
- pq_sendint(&buf, set->hashfn_id, 4);
+ pq_sendint32(&buf,set->flags);
+ pq_sendint32(&buf,set->maxelements);
+ pq_sendint32(&buf,set->nelements);
+ pq_sendint32(&buf,set->hashfn_id);
/* Compute and send the size of the data field */
data_size = VARSIZE(set) - offsetof(hashset_t, data);
On Thu, Jun 15, 2023 at 5:04 AM Joel Jacobson <joel@compiler.org> wrote:
On Wed, Jun 14, 2023, at 15:16, Tomas Vondra wrote:
On 6/14/23 14:57, Joel Jacobson wrote:
Would it be feasible to teach the planner to utilize the internal hash
table of
hashset directly? In the case of arrays, the hash table construction is
an
...It's definitely something I'd leave out of v0, personally.
OK, thanks for guidance, I'll stay away from it.
I've been doing some preparatory work on this todo item:
3) support for other types (now it only works with int32)
I've renamed the type from "hashset" to "int4hashset",
and the SQL-functions are now prefixed with "int4"
when necessary. The overloaded functions with
int4hashset as input parameters don't need to be prefixed,
e.g. hashset_add(int4hashset, int).Other changes since last update (4e60615):
* Support creation of empty hashset using '{}'::hashset
* Introduced a new function hashset_capacity() to return the current
capacity
of a hashset.
* Refactored hashset initialization:
- Replaced hashset_init(int) with int4hashset() to initialize an empty
hashset
with zero capacity.
- Added int4hashset_with_capacity(int) to initialize a hashset with
a specified capacity.
* Improved README.md and testingAs a next step, I'm planning on adding int8 support.
Looks and sounds good?
/Joel
I am not sure the following results are correct.
with cte as (
select hashset(x) as x
,hashset_capacity(hashset(x))
,hashset_count(hashset(x))
from generate_series(1,10) g(x))
select *
,'|' as delim
, hashset_add(x,11111::int)
,hashset_capacity(hashset_add(x,11111::int))
,hashset_count(hashset_add(x,11111::int))
from cte \gx
results:
-[ RECORD 1 ]----+-----------------------------
x | {8,1,10,3,9,4,6,2,11111,5,7}
hashset_capacity | 64
hashset_count | 10
delim | |
hashset_add | {8,1,10,3,9,4,6,2,11111,5,7}
hashset_capacity | 64
hashset_count | 11
but:
with cte as(select '{1,2}'::int4hashset as x) select
x,hashset_add(x,3::int) from cte;
returns
x | hashset_add
-------+-------------
{1,2} | {3,1,2}
(1 row)
last simple query seems more sensible to me.
On Thu, Jun 15, 2023, at 04:22, jian he wrote:
Attachments:
* temp.patch
Thanks for good suggestions.
New patch attached:
Enhance parsing and reorder headers in hashset module
Allow whitespaces in hashset input and reorder the inclusion of
header files, placing PostgreSQL headers first. Additionally, update
deprecated pq_sendint calls to pq_sendint32. Add tests for improved
parsing functionality.
/Joel
Attachments:
hashset-0.0.1-1fd790fapplication/octet-stream; name=hashset-0.0.1-1fd790fDownload
commit 1fd790fa375383e70a641ac4b2d62c2932f1715c
Author: Joel Jakobsson <joel@compiler.org>
Date: Thu Jun 15 08:23:25 2023 +0200
Enhance parsing and reorder headers in hashset module
Allow whitespaces in hashset input and reorder the inclusion of
header files, placing PostgreSQL headers first. Additionally, update
deprecated pq_sendint calls to pq_sendint32. Add tests for improved
parsing functionality.
diff --git a/Makefile b/Makefile
index 68c29a2..b09a50f 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ SERVER_INCLUDES=-I$(shell pg_config --includedir-server)
CLIENT_INCLUDES=-I$(shell pg_config --includedir)
LIBRARY_PATH = -L$(shell pg_config --libdir)
-REGRESS = prelude basic io_varying_lengths random table invalid order
+REGRESS = prelude basic io_varying_lengths random table invalid order parsing
REGRESS_OPTS = --inputdir=test
PG_CONFIG = pg_config
diff --git a/hashset.c b/hashset.c
index 3bc7fdc..b9025b4 100644
--- a/hashset.c
+++ b/hashset.c
@@ -4,13 +4,6 @@
* Copyright (C) Tomas Vondra, 2019
*/
-#include <stdio.h>
-#include <math.h>
-#include <string.h>
-#include <sys/time.h>
-#include <unistd.h>
-#include <limits.h>
-
#include "postgres.h"
#include "libpq/pqformat.h"
#include "nodes/memnodes.h"
@@ -21,6 +14,13 @@
#include "catalog/pg_type.h"
#include "common/hashfn.h"
+#include <stdio.h>
+#include <math.h>
+#include <string.h>
+#include <sys/time.h>
+#include <unistd.h>
+#include <limits.h>
+
PG_MODULE_MAGIC;
/*
@@ -94,6 +94,28 @@ Datum int4hashset_gt(PG_FUNCTION_ARGS);
Datum int4hashset_ge(PG_FUNCTION_ARGS);
Datum int4hashset_cmp(PG_FUNCTION_ARGS);
+/*
+ * hashset_isspace() --- a non-locale-dependent isspace()
+ *
+ * Identical to array_isspace() in src/backend/utils/adt/arrayfuncs.c.
+ * We used to use isspace() for parsing hashset values, but that has
+ * undesirable results: a hashset value might be silently interpreted
+ * differently depending on the locale setting. So here, we hard-wire
+ * the traditional ASCII definition of isspace().
+ */
+static bool
+hashset_isspace(char ch)
+{
+ if (ch == ' ' ||
+ ch == '\t' ||
+ ch == '\n' ||
+ ch == '\r' ||
+ ch == '\v' ||
+ ch == '\f')
+ return true;
+ return false;
+}
+
static int4hashset_t *
int4hashset_allocate(int maxelements)
{
@@ -135,14 +157,18 @@ int4hashset_in(PG_FUNCTION_ARGS)
char *endptr;
int32 len = strlen(str);
int4hashset_t *set;
+ int64 value;
- /* Check the opening and closing braces */
- if (str[0] != '{' || str[len - 1] != '}')
+ /* Skip initial spaces */
+ while (hashset_isspace(*str)) str++;
+
+ /* Check the opening brace */
+ if (*str != '{')
{
ereport(ERROR,
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("invalid input syntax for hashset: \"%s\"", str),
- errdetail("Hashset representation must start with \"{\" and end with \"}\".")));
+ errdetail("Hashset representation must start with \"{\".")));
}
/* Start parsing from the first number (after the opening brace) */
@@ -151,22 +177,27 @@ int4hashset_in(PG_FUNCTION_ARGS)
/* Initial size based on input length (arbitrary, could be optimized) */
set = int4hashset_allocate(len/2);
- /* Check for empty set */
- if (*str == '}')
+ while (true)
{
- PG_RETURN_POINTER(set);
- }
+ /* Skip spaces before number */
+ while (hashset_isspace(*str)) str++;
- while (*str != '}')
- {
- int64 value = strtol(str, &endptr, 10);
+ /* Check for closing brace, handling the case for an empty set */
+ if (*str == '}')
+ {
+ str++; /* Move past the closing brace */
+ break;
+ }
+
+ /* Parse the number */
+ value = strtol(str, &endptr, 10);
if (errno == ERANGE || value < PG_INT32_MIN || value > PG_INT32_MAX)
{
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
errmsg("value \"%s\" is out of range for type %s", str,
- "integer")));
+ "integer")));
}
/* Add the value to the hashset, resize if needed */
@@ -183,21 +214,36 @@ int4hashset_in(PG_FUNCTION_ARGS)
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("invalid input syntax for integer: \"%s\"", str)));
}
- else if (*endptr == ',')
+
+ str = endptr; /* Move to next potential number or closing brace */
+
+ /* Skip spaces before the next number or closing brace */
+ while (hashset_isspace(*str)) str++;
+
+ if (*str == ',')
{
- str = endptr + 1; /* Move to the next number */
+ str++; /* Skip comma before next loop iteration */
}
- else if (*endptr != '}')
+ else if (*str != '}')
{
/* Unexpected character */
ereport(ERROR,
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
- errmsg("unexpected character \"%c\" in hashset input", *endptr)));
+ errmsg("unexpected character \"%c\" in hashset input", *str)));
}
- else /* *endptr is '}', move to next iteration */
+ }
+
+ /* Only whitespace is allowed after the closing brace */
+ while (*str)
+ {
+ if (!hashset_isspace(*str))
{
- str = endptr;
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed hashset literal: \"%s\"", str),
+ errdetail("Junk after closing right brace.")));
}
+ str++;
}
PG_RETURN_POINTER(set);
@@ -257,10 +303,10 @@ int4hashset_send(PG_FUNCTION_ARGS)
pq_begintypsend(&buf);
/* Send the non-data fields */
- pq_sendint(&buf, set->flags, 4);
- pq_sendint(&buf, set->maxelements, 4);
- pq_sendint(&buf, set->nelements, 4);
- pq_sendint(&buf, set->hashfn_id, 4);
+ pq_sendint32(&buf, set->flags);
+ pq_sendint32(&buf, set->maxelements);
+ pq_sendint32(&buf, set->nelements);
+ pq_sendint32(&buf, set->hashfn_id);
/* Compute and send the size of the data field */
data_size = VARSIZE(set) - offsetof(int4hashset_t, data);
diff --git a/test/expected/parsing.out b/test/expected/parsing.out
new file mode 100644
index 0000000..263797e
--- /dev/null
+++ b/test/expected/parsing.out
@@ -0,0 +1,71 @@
+/* Valid */
+SELECT '{1,23,-456}'::int4hashset;
+ int4hashset
+-------------
+ {1,-456,23}
+(1 row)
+
+SELECT ' { 1 , 23 , -456 } '::int4hashset;
+ int4hashset
+-------------
+ {1,-456,23}
+(1 row)
+
+/* Only whitespace is allowed after the closing brace */
+SELECT ' { 1 , 23 , -456 } 1'::int4hashset; -- error
+ERROR: malformed hashset literal: "1"
+LINE 2: SELECT ' { 1 , 23 , -456 } 1'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } ,'::int4hashset; -- error
+ERROR: malformed hashset literal: ","
+LINE 1: SELECT ' { 1 , 23 , -456 } ,'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } {'::int4hashset; -- error
+ERROR: malformed hashset literal: "{"
+LINE 1: SELECT ' { 1 , 23 , -456 } {'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } }'::int4hashset; -- error
+ERROR: malformed hashset literal: "}"
+LINE 1: SELECT ' { 1 , 23 , -456 } }'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } x'::int4hashset; -- error
+ERROR: malformed hashset literal: "x"
+LINE 1: SELECT ' { 1 , 23 , -456 } x'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+/* Unexpected character when expecting closing brace */
+SELECT ' { 1 , 23 , -456 1'::int4hashset; -- error
+ERROR: unexpected character "1" in hashset input
+LINE 2: SELECT ' { 1 , 23 , -456 1'::int4hashset;
+ ^
+SELECT ' { 1 , 23 , -456 {'::int4hashset; -- error
+ERROR: unexpected character "{" in hashset input
+LINE 1: SELECT ' { 1 , 23 , -456 {'::int4hashset;
+ ^
+SELECT ' { 1 , 23 , -456 x'::int4hashset; -- error
+ERROR: unexpected character "x" in hashset input
+LINE 1: SELECT ' { 1 , 23 , -456 x'::int4hashset;
+ ^
+/* Error handling for strtol */
+SELECT ' { , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for integer: ", 23 , -456 } "
+LINE 2: SELECT ' { , 23 , -456 } '::int4hashset;
+ ^
+SELECT ' { 1 , 23 , '::int4hashset; -- error
+ERROR: invalid input syntax for integer: ""
+LINE 1: SELECT ' { 1 , 23 , '::int4hashset;
+ ^
+SELECT ' { s , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for integer: "s , 23 , -456 } "
+LINE 1: SELECT ' { s , 23 , -456 } '::int4hashset;
+ ^
+/* Missing opening brace */
+SELECT ' 1 , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for hashset: "1 , 23 , -456 } "
+LINE 2: SELECT ' 1 , 23 , -456 } '::int4hashset;
+ ^
+DETAIL: Hashset representation must start with "{".
diff --git a/test/sql/parsing.sql b/test/sql/parsing.sql
new file mode 100644
index 0000000..1e56bbe
--- /dev/null
+++ b/test/sql/parsing.sql
@@ -0,0 +1,23 @@
+/* Valid */
+SELECT '{1,23,-456}'::int4hashset;
+SELECT ' { 1 , 23 , -456 } '::int4hashset;
+
+/* Only whitespace is allowed after the closing brace */
+SELECT ' { 1 , 23 , -456 } 1'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } ,'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } {'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } }'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } x'::int4hashset; -- error
+
+/* Unexpected character when expecting closing brace */
+SELECT ' { 1 , 23 , -456 1'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 {'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 x'::int4hashset; -- error
+
+/* Error handling for strtol */
+SELECT ' { , 23 , -456 } '::int4hashset; -- error
+SELECT ' { 1 , 23 , '::int4hashset; -- error
+SELECT ' { s , 23 , -456 } '::int4hashset; -- error
+
+/* Missing opening brace */
+SELECT ' 1 , 23 , -456 } '::int4hashset; -- error
In hashset/test/sql/order.sql, can we add the following to test
whether the optimizer
will use our index.
CREATE INDEX ON test_int4hashset_order (int4hashset_col
int4hashset_btree_ops);
-- to make sure that this work with just two rows
SET enable_seqscan TO off;
explain(costs off) SELECT * FROM test_int4hashset_order WHERE
int4hashset_col = '{1,2}'::int4hashset;
reset enable_seqscan;
Since most contrib modules, one module, only one test file, maybe we need
to consolidate all the test sql files to one sql file (int4hashset.sql)?
--------------
I didn't install the extension directly. I copied the hashset--0.0.1.sql to
another place, using gcc to compile these functions.
gcc -I/home/jian/postgres/2023_05_25_beta5421/include/server -fPIC -c
/home/jian/hashset/hashset.c
gcc -shared -o /home/jian/hashset/hashset.so /home/jian/hashset/hashset.o
then modify hashset--0.0.1.sql then in psql \i fullsqlfilename to create
these functions, types.
Because even make
PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config still has
an error.
fatal error: libpq-fe.h: No such file or directory
3 | #include <libpq-fe.h>
Is there any way to put test_send_recv.c to sql test file?
Attached is a patch slightly modified README.md. feel free to change, since
i am not native english speaker...
Attachments:
0001-add-instruction-using-PG_CONFIG-to-install-extension.patchtext/x-patch; charset=US-ASCII; name=0001-add-instruction-using-PG_CONFIG-to-install-extension.patchDownload
From 2bb310affed4a06c6fa38fb0e1b1ff39f330d88d Mon Sep 17 00:00:00 2001
From: pgaddict <jian.universality@gmail.com>
Date: Thu, 15 Jun 2023 15:50:00 +0800
Subject: [PATCH] add instruction using PG_CONFIG to install extension, also
how to run installcheck
---
README.md | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index 3ff57576..30c66667 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
# hashset
This PostgreSQL extension implements hashset, a data structure (type)
-providing a collection of integer items with fast lookup.
+providing a collection of unique, not null integer items with fast lookup.
## Version
@@ -103,11 +103,19 @@ a variable-length type.
## Installation
-To install the extension, run `make install` in the project root. Then, in your
+To install the extension, run `make install` in the project root. To use a different PostgreSQL installation, point configure to a different `pg_config`, using following command:
+
+ make PG_CONFIG=/else/where/pg_config
+ make install PG_CONFIG=/else/where/pg_config
+
+Then, in your
PostgreSQL connection, execute `CREATE EXTENSION hashset;`.
This extension requires PostgreSQL version ?.? or later.
+## Test
+To run regression test, execute
+`make installcheck`
## License
--
2.34.1
On Thu, Jun 15, 2023, at 11:44, jian he wrote:
I didn't install the extension directly. I copied the
hashset--0.0.1.sql to another place, using gcc to compile these
functions.
..
Because even make
PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config still
has an error.
fatal error: libpq-fe.h: No such file or directory
3 | #include <libpq-fe.h>
What platform are you on?
You seem to be missing the postgresql dev package.
For instance, here is how to compile and install the extension on Ubuntu 22.04.1 LTS:
sudo apt install postgresql-15 postgresql-server-dev-15 postgresql-client-15
git clone https://github.com/tvondra/hashset.git
cd hashset
make
sudo make install
make installcheck
Is there any way to put test_send_recv.c to sql test file?
Unfortunately, there doesn't seem to be a way to test *_recv() functions from SQL,
since they take `internal` as input. The only way I could figure out to test them
was to write a C-program using libpq's binary mode.
I also note that the test_send_recv test was broken; I had forgot to change
the type from "hashset" to "int4hashset". Fixed in attached commit.
On Ubuntu, you can now run the test by specifying to connect via the UNIX socket:
PGHOST=/var/run/postgresql make run_c_tests
cd test/c_tests && ./test_send_recv.sh
test test_send_recv ... ok
Attached is a patch slightly modified README.md. feel free to change,
since i am not native english speaker...
Attachments:
* 0001-add-instruction-using-PG_CONFIG-to-install-extension.patch
Thanks, will have a look later.
/Joel
Attachments:
hashset-0.0.1-bb0ece8.patchapplication/octet-stream; name=hashset-0.0.1-bb0ece8.patchDownload
diff --git a/README.md b/README.md
index 3ff5757..3c1242e 100644
--- a/README.md
+++ b/README.md
@@ -103,11 +103,34 @@ a variable-length type.
## Installation
-To install the extension, run `make install` in the project root. Then, in your
-PostgreSQL connection, execute `CREATE EXTENSION hashset;`.
+To install the extension on any platform, follow these general steps:
+
+1. Ensure you have PostgreSQL installed on your system, including the development files.
+2. Clone the repository.
+3. Navigate to the cloned repository directory.
+4. Compile the extension using `make`.
+5. Install the extension using `sudo make install`.
+6. Run the tests using `make installcheck` (optional).
+
+In your PostgreSQL connection, enable the hashset extension using the following SQL command:
+```sql
+CREATE EXTENSION hashset;
+```
This extension requires PostgreSQL version ?.? or later.
+For Ubuntu 22.04.1 LTS, you would run the following commands:
+
+```sh
+sudo apt install postgresql-15 postgresql-server-dev-15 postgresql-client-15
+git clone https://github.com/tvondra/hashset.git
+cd hashset
+make
+sudo make install
+make installcheck
+```
+
+Please note that this project is currently under active development and is not yet considered production-ready.
## License
diff --git a/test/c_tests/test_send_recv.c b/test/c_tests/test_send_recv.c
index 5655b8b..cc7c48a 100644
--- a/test/c_tests/test_send_recv.c
+++ b/test/c_tests/test_send_recv.c
@@ -9,8 +9,16 @@ void exit_nicely(PGconn *conn) {
int main() {
/* Connect to database specified by the PGDATABASE environment variable */
- const char *conninfo = "host=localhost port=5432";
- PGconn *conn = PQconnectdb(conninfo);
+ const char *hostname = getenv("PGHOST");
+ char conninfo[1024];
+ PGconn *conn;
+
+ if (hostname == NULL)
+ hostname = "localhost";
+
+ /* Connect to database specified by the PGDATABASE environment variable */
+ snprintf(conninfo, sizeof(conninfo), "host=%s port=5432", hostname);
+ conn = PQconnectdb(conninfo);
if (PQstatus(conn) != CONNECTION_OK) {
fprintf(stderr, "Connection to database failed: %s", PQerrorMessage(conn));
exit_nicely(conn);
@@ -20,13 +28,13 @@ int main() {
PQexec(conn, "CREATE EXTENSION IF NOT EXISTS hashset");
/* Create temporary table */
- PQexec(conn, "CREATE TABLE IF NOT EXISTS test_hashset_send_recv (hashset_col hashset)");
+ PQexec(conn, "CREATE TABLE IF NOT EXISTS test_hashset_send_recv (hashset_col int4hashset)");
/* Enable binary output */
PQexec(conn, "SET bytea_output = 'escape'");
/* Insert dummy data */
- const char *insert_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ('{1,2,3}'::hashset)";
+ const char *insert_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ('{1,2,3}'::int4hashset)";
PGresult *res = PQexec(conn, insert_command);
if (PQresultStatus(res) != PGRES_COMMAND_OK) {
fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
@@ -64,7 +72,7 @@ int main() {
PQclear(res);
/* Check the data */
- const char *check_command = "SELECT COUNT(DISTINCT hashset_col::text) AS unique_count, COUNT(*) FROM test_hashset_send_recv";
+ const char *check_command = "SELECT COUNT(DISTINCT hashset_col) AS unique_count, COUNT(*) FROM test_hashset_send_recv";
res = PQexec(conn, check_command);
if (PQresultStatus(res) != PGRES_TUPLES_OK) {
fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
On Thu, Jun 15, 2023, at 06:29, jian he wrote:
I am not sure the following results are correct.
with cte as (
select hashset(x) as x
,hashset_capacity(hashset(x))
,hashset_count(hashset(x))
from generate_series(1,10) g(x))
select *
,'|' as delim
, hashset_add(x,11111::int)
,hashset_capacity(hashset_add(x,11111::int))
,hashset_count(hashset_add(x,11111::int))
from cte \gxresults:
-[ RECORD 1 ]----+-----------------------------
x | {8,1,10,3,9,4,6,2,11111,5,7}
hashset_capacity | 64
hashset_count | 10
delim | |
hashset_add | {8,1,10,3,9,4,6,2,11111,5,7}
hashset_capacity | 64
hashset_count | 11
Nice catch, you found a bug!
Fixed in attached patch:
---
Ensure hashset_add and hashset_merge operate on copied data
Previously, the hashset_add() and hashset_merge() functions were
modifying the original hashset in-place. This was leading to unexpected
results because the original data in the hashset was being altered.
This commit introduces the macro PG_GETARG_INT4HASHSET_COPY(), ensuring
a copy of the hashset is created and modified, leaving the original
hashset untouched.
This adjustment ensures hashset_add() and hashset_merge() operate
correctly on the copied hashset and prevent modification of the
original data.
A new regression test file `reported_bugs.sql` has been added to
validate the proper functionality of these changes. Future reported
bugs and their corresponding tests will also be added to this file.
---
I wonder if this function:
static int4hashset_t *
int4hashset_copy(int4hashset_t *src)
{
return src;
}
...that was previously named hashset_copy(),
should be implemented to actually copy the struct,
instead of just returning the input?
It is being used by int4hashset_agg_combine() like this:
/* copy the hashset into the right long-lived memory context */
oldcontext = MemoryContextSwitchTo(aggcontext);
src = int4hashset_copy(src);
MemoryContextSwitchTo(oldcontext);
/Joel
Attachments:
hashset-0.0.1-da84659.patchapplication/octet-stream; name=hashset-0.0.1-da84659.patchDownload
commit da84659aacf0b72769c01783ae8b5ee595da3f77
Author: Joel Jakobsson <joel@compiler.org>
Date: Thu Jun 15 22:54:05 2023 +0200
Ensure hashset_add and hashset_merge operate on copied data
Previously, the hashset_add() and hashset_merge() functions were
modifying the original hashset in-place. This was leading to unexpected
results because the original data in the hashset was being altered.
This commit introduces the macro PG_GETARG_INT4HASHSET_COPY(), ensuring
a copy of the hashset is created and modified, leaving the original
hashset untouched.
This adjustment ensures hashset_add() and hashset_merge() operate
correctly on the copied hashset and prevent modification of the
original data.
A new regression test file `reported_bugs.sql` has been added to
validate the proper functionality of these changes. Future reported
bugs and their corresponding tests will also be added to this file.
diff --git a/Makefile b/Makefile
index b09a50f..59669ef 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ SERVER_INCLUDES=-I$(shell pg_config --includedir-server)
CLIENT_INCLUDES=-I$(shell pg_config --includedir)
LIBRARY_PATH = -L$(shell pg_config --libdir)
-REGRESS = prelude basic io_varying_lengths random table invalid order parsing
+REGRESS = prelude basic io_varying_lengths random table invalid order parsing reported_bugs
REGRESS_OPTS = --inputdir=test
PG_CONFIG = pg_config
diff --git a/hashset.c b/hashset.c
index b9025b4..569ae91 100644
--- a/hashset.c
+++ b/hashset.c
@@ -42,6 +42,7 @@ static bool int4hashset_contains_element(int4hashset_t *set, int32 value);
static Datum int32_to_array(FunctionCallInfo fcinfo, int32 * d, int len);
#define PG_GETARG_INT4HASHSET(x) (int4hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(x))
+#define PG_GETARG_INT4HASHSET_COPY(x) (int4hashset_t *) PG_DETOAST_DATUM_COPY(PG_GETARG_DATUM(x))
#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
#define HASHSET_STEP 13
#define JENKINS_LOOKUP3_HASHFN_ID 1
@@ -566,7 +567,7 @@ int4hashset_add(PG_FUNCTION_ARGS)
else
{
/* make sure we are working with a non-toasted and non-shared copy of the input */
- set = (int4hashset_t *) PG_GETARG_INT4HASHSET(0);
+ set = PG_GETARG_INT4HASHSET_COPY(0);
}
set = int4hashset_add_element(set, PG_GETARG_INT32(1));
@@ -592,7 +593,7 @@ int4hashset_merge(PG_FUNCTION_ARGS)
else if (PG_ARGISNULL(0))
PG_RETURN_POINTER(PG_GETARG_INT4HASHSET(1));
- seta = PG_GETARG_INT4HASHSET(0);
+ seta = PG_GETARG_INT4HASHSET_COPY(0);
setb = PG_GETARG_INT4HASHSET(1);
bitmap = setb->data;
diff --git a/test/expected/reported_bugs.out b/test/expected/reported_bugs.out
new file mode 100644
index 0000000..226e81c
--- /dev/null
+++ b/test/expected/reported_bugs.out
@@ -0,0 +1,27 @@
+/*
+ * In the original implementation of the query, the hashset_add() and
+ * hashset_merge() functions were modifying the original hashset in-place.
+ * This issue was leading to unexpected results because the functions
+ * were altering the original data in the hashset.
+ *
+ * The problem was fixed by introducing a macro function
+ * PG_GETARG_INT4HASHSET_COPY() in the C code. This function ensures that
+ * a copy of the hashset is created and modified, leaving the original
+ * hashset untouched. This fix resulted in the correct execution of the
+ * query, with hashset_add() and hashset_merge() working on the copied
+ * hashset, thereby preventing alteration of the original data.
+ */
+SELECT
+ q.hashset,
+ hashset_add(hashset,4)
+FROM
+(
+ SELECT
+ hashset(generate_series)
+ FROM generate_series(1,3)
+) q;
+ hashset | hashset_add
+---------+-------------
+ {1,3,2} | {1,3,4,2}
+(1 row)
+
diff --git a/test/sql/reported_bugs.sql b/test/sql/reported_bugs.sql
new file mode 100644
index 0000000..fcd0b9d
--- /dev/null
+++ b/test/sql/reported_bugs.sql
@@ -0,0 +1,22 @@
+/*
+ * In the original implementation of the query, the hashset_add() and
+ * hashset_merge() functions were modifying the original hashset in-place.
+ * This issue was leading to unexpected results because the functions
+ * were altering the original data in the hashset.
+ *
+ * The problem was fixed by introducing a macro function
+ * PG_GETARG_INT4HASHSET_COPY() in the C code. This function ensures that
+ * a copy of the hashset is created and modified, leaving the original
+ * hashset untouched. This fix resulted in the correct execution of the
+ * query, with hashset_add() and hashset_merge() working on the copied
+ * hashset, thereby preventing alteration of the original data.
+ */
+SELECT
+ q.hashset,
+ hashset_add(hashset,4)
+FROM
+(
+ SELECT
+ hashset(generate_series)
+ FROM generate_series(1,3)
+) q;
On Thu, Jun 15, 2023, at 11:44, jian he wrote:
In hashset/test/sql/order.sql, can we add the following to test whether
the optimizer will use our index.CREATE INDEX ON test_int4hashset_order (int4hashset_col
int4hashset_btree_ops);-- to make sure that this work with just two rows
SET enable_seqscan TO off;explain(costs off) SELECT * FROM test_int4hashset_order WHERE
int4hashset_col = '{1,2}'::int4hashset;
reset enable_seqscan;
Not sure I can see the value of that test,
since we've already tested the comparison functions,
which are used by the int4hashset_btree_ops operator class.
I think a test that verifies the btree index is actually used,
would more be a test of the query planner than hashset.
I might be missing something here, please tell me if so.
Since most contrib modules, one module, only one test file, maybe we
need to consolidate all the test sql files to one sql file
(int4hashset.sql)?
I've also made the same observation; I wonder if it's by design
or by coincidence? I think multiple test files improves modularity,
isolation and overall organisation of the testing.
As long as we are developing in the pre-release phase,
I think it's beneficial and affordable with rigorous testing.
However, if hashset would ever be considered
for core inclusion, then we should consolidate all tests into
one file and retain only essential tests, thereby minimizing
impact on PostgreSQL's overall test suite runtime
where every millisecond matters.
Attached is a patch slightly modified README.md. feel free to change,
since i am not native english speaker...Attachments:
* 0001-add-instruction-using-PG_CONFIG-to-install-extension.patch
Thanks, improvements incorporated with some minor changes.
/Joel
Attachments:
hashset-0.0.1-61d572a.patchapplication/octet-stream; name=hashset-0.0.1-61d572a.patchDownload
commit 61d572a2420203925bcdbeb7e9a7532f7c0e7388
Author: Joel Jakobsson <joel@compiler.org>
Date: Fri Jun 16 02:17:40 2023 +0200
Enhance README.md with more precise hashset description and installation instructions
diff --git a/README.md b/README.md
index 3c1242e..99237df 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
# hashset
This PostgreSQL extension implements hashset, a data structure (type)
-providing a collection of integer items with fast lookup.
+providing a collection of unique, not null integer items with fast lookup.
## Version
@@ -112,6 +112,12 @@ To install the extension on any platform, follow these general steps:
5. Install the extension using `sudo make install`.
6. Run the tests using `make installcheck` (optional).
+To use a different PostgreSQL installation, point configure to a different `pg_config`, using following command:
+```sh
+make PG_CONFIG=/else/where/pg_config
+sudo make install PG_CONFIG=/else/where/pg_config
+```
+
In your PostgreSQL connection, enable the hashset extension using the following SQL command:
```sql
CREATE EXTENSION hashset;
New patch attached:
Add customizable params to int4hashset() and collision count function
This commit enhances int4hashset() by introducing adjustable capacity,
load, and growth factors, providing flexibility for performance optimization.
Also added is a new function, hashset_collisions(), to report collision
counts, aiding in performance tuning.
Aggregate functions are renamed to hashset_agg() for consistency with
array_agg() and range_agg().
A new test file, test/sql/benchmark.sql, is added for evaluating the
performance of hash functions. It's not run automatically by
make installcheck.
The adjustable parameters and the naive hash function are useful for testing
and performance comparison. However, to keep things simple and streamlined
for users, these features are likely to be removed in the final release,
emphasizing the use of well-optimized default settings.
SQL-function indentation is also adjusted to align with the PostgreSQL
source repo, improving readability.
In the benchmark results below, it was a bit surprising the naive hash
function had no collisions, but that only held true when the input
elements were sequential integers. When tested with random integers,
all three hash functions caused collisions.
Timing results not statistical significant, the purpose is just to
give an idea of the execution times.
*** Elements in sequence 1..100000
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:23: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:23: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:23: NOTICE: hashset_collisions: 31195
DO
Time: 1342.564 ms (00:01.343)
- Testing Murmurhash32
psql:test/sql/benchmark.sql:40: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:40: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:40: NOTICE: hashset_collisions: 30879
DO
Time: 1297.823 ms (00:01.298)
- Testing naive hash function
psql:test/sql/benchmark.sql:57: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:57: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:57: NOTICE: hashset_collisions: 0
DO
Time: 1400.936 ms (00:01.401)
*** Testing 100000 random ints
setseed
---------
(1 row)
Time: 3.591 ms
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:77: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:77: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:77: NOTICE: hashset_collisions: 30919
DO
Time: 1415.497 ms (00:01.415)
setseed
---------
(1 row)
Time: 1.282 ms
- Testing Murmurhash32
psql:test/sql/benchmark.sql:95: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:95: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:95: NOTICE: hashset_collisions: 30812
DO
Time: 2079.202 ms (00:02.079)
setseed
---------
(1 row)
Time: 0.122 ms
- Testing naive hash function
psql:test/sql/benchmark.sql:113: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:113: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:113: NOTICE: hashset_collisions: 30822
DO
Time: 1613.965 ms (00:01.614)
/Joel
Attachments:
hashset-0.0.1-184a18a.patchapplication/octet-stream; name=hashset-0.0.1-184a18a.patchDownload
commit 184a18a36774c268dd63e2b3c1e970de86eedbbc
Author: Joel Jakobsson <joel@compiler.org>
Date: Fri Jun 16 08:52:09 2023 +0200
Add customizable params to int4hashset() and collision count function
This commit enhances int4hashset() by introducing adjustable capacity,
load, and growth factors, providing flexibility for performance optimization.
Also added is a new function, hashset_collisions(), to report collision
counts, aiding in performance tuning.
Aggregate functions are renamed to hashset_agg() for consistency with
array_agg() and range_agg().
A new test file, test/sql/benchmark.sql, is added for evaluating the
performance of hash functions. It's not run automatically by
make installcheck.
The adjustable parameters and the naive hash function are useful for testing
and performance comparison. However, to keep things simple and streamlined
for users, these features are likely to be removed in the final release,
emphasizing the use of well-optimized default settings.
SQL-function indentation is also adjusted to align with the PostgreSQL
source repo, improving readability.
In the benchmark results below, it was a bit surprising the naive hash
function had no collisions, but that only held true when the input
elements were sequential integers. When tested with random integers,
all three hash functions caused collisions.
Timing results not statistical significant, the purpose is just to
give an idea of the execution times.
*** Elements in sequence 1..100000
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:23: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:23: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:23: NOTICE: hashset_collisions: 31195
DO
Time: 1342.564 ms (00:01.343)
- Testing Murmurhash32
psql:test/sql/benchmark.sql:40: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:40: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:40: NOTICE: hashset_collisions: 30879
DO
Time: 1297.823 ms (00:01.298)
- Testing naive hash function
psql:test/sql/benchmark.sql:57: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:57: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:57: NOTICE: hashset_collisions: 0
DO
Time: 1400.936 ms (00:01.401)
*** Testing 100000 random ints
setseed
---------
(1 row)
Time: 3.591 ms
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:77: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:77: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:77: NOTICE: hashset_collisions: 30919
DO
Time: 1415.497 ms (00:01.415)
setseed
---------
(1 row)
Time: 1.282 ms
- Testing Murmurhash32
psql:test/sql/benchmark.sql:95: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:95: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:95: NOTICE: hashset_collisions: 30812
DO
Time: 2079.202 ms (00:02.079)
setseed
---------
(1 row)
Time: 0.122 ms
- Testing naive hash function
psql:test/sql/benchmark.sql:113: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:113: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:113: NOTICE: hashset_collisions: 30822
DO
Time: 1613.965 ms (00:01.614)
diff --git a/README.md b/README.md
index 99237df..4a5e5a7 100644
--- a/README.md
+++ b/README.md
@@ -64,19 +64,28 @@ a variable-length type.
## Functions
-- `int4hashset() -> int4hashset`: Initialize an empty int4hashset with no capacity.
-- `int4hashset_with_capacity(int) -> int4hashset`: Initialize an empty int4hashset with given capacity.
+- `int4hashset([capacity int, load_factor float4, growth_factor float4, hashfn_id int4]) -> int4hashset`:
+ Initialize an empty int4hashset with optional parameters.
+ - `capacity` specifies the initial capacity, which is zero by default.
+ - `load_factor` represents the threshold for resizing the hashset and defaults to 0.75.
+ - `growth_factor` is the multiplier for resizing and defaults to 2.0.
+ - `hashfn_id` represents the hash function used.
+ - 1=Jenkins/lookup3 (default)
+ - 2=MurmurHash32
+ - 3=Naive hash function
- `hashset_add(int4hashset, int) -> int4hashset`: Adds an integer to an int4hashset.
- `hashset_contains(int4hashset, int) -> boolean`: Checks if an int4hashset contains a given integer.
- `hashset_merge(int4hashset, int4hashset) -> int4hashset`: Merges two int4hashsets into a new int4hashset.
- `hashset_to_array(int4hashset) -> int[]`: Converts an int4hashset to an array of integers.
- `hashset_count(int4hashset) -> bigint`: Returns the number of elements in an int4hashset.
- `hashset_capacity(int4hashset) -> bigint`: Returns the current capacity of an int4hashset.
+- `hashset_load_factor(int4hashset) -> float4`: Returns the load factor of an int4hashset.
+- `hashset_growth_factor(int4hashset) -> float4`: Returns the growth factor of an int4hashset.
## Aggregation Functions
-- `hashset(int) -> int4hashset`: Aggregate integers into a hashset.
-- `hashset(int4hashset) -> int4hashset`: Aggregate hashsets into a hashset.
+- `hashset_agg(int) -> int4hashset`: Aggregate integers into a hashset.
+- `hashset_agg(int4hashset) -> int4hashset`: Aggregate hashsets into a hashset.
## Operators
diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
index ea559ca..20d019d 100644
--- a/hashset--0.0.1.sql
+++ b/hashset--0.0.1.sql
@@ -5,24 +5,24 @@
CREATE TYPE int4hashset;
CREATE OR REPLACE FUNCTION int4hashset_in(cstring)
- RETURNS int4hashset
- AS 'hashset', 'int4hashset_in'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_in'
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION int4hashset_out(int4hashset)
- RETURNS cstring
- AS 'hashset', 'int4hashset_out'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS cstring
+AS 'hashset', 'int4hashset_out'
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION int4hashset_send(int4hashset)
- RETURNS bytea
- AS 'hashset', 'int4hashset_send'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS bytea
+AS 'hashset', 'int4hashset_send'
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION int4hashset_recv(internal)
- RETURNS int4hashset
- AS 'hashset', 'int4hashset_recv'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_recv'
+LANGUAGE C IMMUTABLE STRICT;
CREATE TYPE int4hashset (
INPUT = int4hashset_in,
@@ -37,67 +37,71 @@ CREATE TYPE int4hashset (
* Hashset Functions
*/
-CREATE OR REPLACE FUNCTION int4hashset()
- RETURNS int4hashset
- AS 'hashset', 'int4hashset_init'
- LANGUAGE C IMMUTABLE;
-
-CREATE OR REPLACE FUNCTION int4hashset_with_capacity(int)
- RETURNS int4hashset
- AS 'hashset', 'int4hashset_init'
- LANGUAGE C IMMUTABLE;
+CREATE OR REPLACE FUNCTION int4hashset(
+ capacity int DEFAULT 0,
+ load_factor float4 DEFAULT 0.75,
+ growth_factor float4 DEFAULT 2.0,
+ hashfn_id int DEFAULT 1
+)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_init'
+LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION hashset_add(int4hashset, int)
- RETURNS int4hashset
- AS 'hashset', 'int4hashset_add'
- LANGUAGE C IMMUTABLE;
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_add'
+LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int)
- RETURNS bool
- AS 'hashset', 'int4hashset_contains'
- LANGUAGE C IMMUTABLE;
+RETURNS bool
+AS 'hashset', 'int4hashset_contains'
+LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION hashset_merge(int4hashset, int4hashset)
- RETURNS int4hashset
- AS 'hashset', 'int4hashset_merge'
- LANGUAGE C IMMUTABLE;
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_merge'
+LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION hashset_to_array(int4hashset)
- RETURNS int[]
- AS 'hashset', 'int4hashset_to_array'
- LANGUAGE C IMMUTABLE;
+RETURNS int[]
+AS 'hashset', 'int4hashset_to_array'
+LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION hashset_count(int4hashset)
- RETURNS bigint
- AS 'hashset', 'int4hashset_count'
- LANGUAGE C IMMUTABLE;
+RETURNS bigint
+AS 'hashset', 'int4hashset_count'
+LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION hashset_capacity(int4hashset)
- RETURNS bigint
- AS 'hashset', 'int4hashset_capacity'
- LANGUAGE C IMMUTABLE;
+RETURNS bigint
+AS 'hashset', 'int4hashset_capacity'
+LANGUAGE C IMMUTABLE;
+CREATE OR REPLACE FUNCTION hashset_collisions(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_collisions'
+LANGUAGE C IMMUTABLE;
/*
* Aggregation Functions
*/
CREATE OR REPLACE FUNCTION int4hashset_agg_add(p_pointer internal, p_value int)
- RETURNS internal
- AS 'hashset', 'int4hashset_agg_add'
- LANGUAGE C IMMUTABLE;
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_add'
+LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
- RETURNS int4hashset
- AS 'hashset', 'int4hashset_agg_final'
- LANGUAGE C IMMUTABLE;
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_agg_final'
+LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
- RETURNS internal
- AS 'hashset', 'int4hashset_agg_combine'
- LANGUAGE C IMMUTABLE;
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_combine'
+LANGUAGE C IMMUTABLE;
-CREATE AGGREGATE hashset(int) (
+CREATE AGGREGATE hashset_agg(int) (
SFUNC = int4hashset_agg_add,
STYPE = internal,
FINALFUNC = int4hashset_agg_final,
@@ -106,21 +110,21 @@ CREATE AGGREGATE hashset(int) (
);
CREATE OR REPLACE FUNCTION int4hashset_agg_add_set(p_pointer internal, p_value int4hashset)
- RETURNS internal
- AS 'hashset', 'int4hashset_agg_add_set'
- LANGUAGE C IMMUTABLE;
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_add_set'
+LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
- RETURNS int4hashset
- AS 'hashset', 'int4hashset_agg_final'
- LANGUAGE C IMMUTABLE;
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_agg_final'
+LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
- RETURNS internal
- AS 'hashset', 'int4hashset_agg_combine'
- LANGUAGE C IMMUTABLE;
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_combine'
+LANGUAGE C IMMUTABLE;
-CREATE AGGREGATE hashset(int4hashset) (
+CREATE AGGREGATE hashset_agg(int4hashset) (
SFUNC = int4hashset_agg_add_set,
STYPE = internal,
FINALFUNC = int4hashset_agg_final,
@@ -133,9 +137,9 @@ CREATE AGGREGATE hashset(int4hashset) (
*/
CREATE OR REPLACE FUNCTION hashset_equals(int4hashset, int4hashset)
- RETURNS bool
- AS 'hashset', 'int4hashset_equals'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS bool
+AS 'hashset', 'int4hashset_equals'
+LANGUAGE C IMMUTABLE STRICT;
CREATE OPERATOR = (
LEFTARG = int4hashset,
@@ -146,9 +150,9 @@ CREATE OPERATOR = (
);
CREATE OR REPLACE FUNCTION hashset_neq(int4hashset, int4hashset)
- RETURNS bool
- AS 'hashset', 'int4hashset_neq'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS bool
+AS 'hashset', 'int4hashset_neq'
+LANGUAGE C IMMUTABLE STRICT;
CREATE OPERATOR <> (
LEFTARG = int4hashset,
@@ -166,43 +170,43 @@ CREATE OPERATOR <> (
*/
CREATE OR REPLACE FUNCTION hashset_hash(int4hashset)
- RETURNS integer
- AS 'hashset', 'int4hashset_hash'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS integer
+AS 'hashset', 'int4hashset_hash'
+LANGUAGE C IMMUTABLE STRICT;
CREATE OPERATOR CLASS int4hashset_hash_ops
- DEFAULT FOR TYPE int4hashset USING hash AS
- OPERATOR 1 = (int4hashset, int4hashset),
- FUNCTION 1 hashset_hash(int4hashset);
+DEFAULT FOR TYPE int4hashset USING hash AS
+OPERATOR 1 = (int4hashset, int4hashset),
+FUNCTION 1 hashset_hash(int4hashset);
/*
* Hashset Btree Operators
*/
CREATE OR REPLACE FUNCTION hashset_lt(int4hashset, int4hashset)
- RETURNS bool
- AS 'hashset', 'int4hashset_lt'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS bool
+AS 'hashset', 'int4hashset_lt'
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_le(int4hashset, int4hashset)
- RETURNS boolean
- AS 'hashset', 'int4hashset_le'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS boolean
+AS 'hashset', 'int4hashset_le'
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_gt(int4hashset, int4hashset)
- RETURNS boolean
- AS 'hashset', 'int4hashset_gt'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS boolean
+AS 'hashset', 'int4hashset_gt'
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_ge(int4hashset, int4hashset)
- RETURNS boolean
- AS 'hashset', 'int4hashset_ge'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS boolean
+AS 'hashset', 'int4hashset_ge'
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_cmp(int4hashset, int4hashset)
- RETURNS integer
- AS 'hashset', 'int4hashset_cmp'
- LANGUAGE C IMMUTABLE STRICT;
+RETURNS integer
+AS 'hashset', 'int4hashset_cmp'
+LANGUAGE C IMMUTABLE STRICT;
CREATE OPERATOR < (
PROCEDURE = hashset_lt,
@@ -245,10 +249,10 @@ CREATE OPERATOR >= (
);
CREATE OPERATOR CLASS int4hashset_btree_ops
- DEFAULT FOR TYPE int4hashset USING btree AS
- OPERATOR 1 < (int4hashset, int4hashset),
- OPERATOR 2 <= (int4hashset, int4hashset),
- OPERATOR 3 = (int4hashset, int4hashset),
- OPERATOR 4 >= (int4hashset, int4hashset),
- OPERATOR 5 > (int4hashset, int4hashset),
- FUNCTION 1 hashset_cmp(int4hashset, int4hashset);
+DEFAULT FOR TYPE int4hashset USING btree AS
+OPERATOR 1 < (int4hashset, int4hashset),
+OPERATOR 2 <= (int4hashset, int4hashset),
+OPERATOR 3 = (int4hashset, int4hashset),
+OPERATOR 4 >= (int4hashset, int4hashset),
+OPERATOR 5 > (int4hashset, int4hashset),
+FUNCTION 1 hashset_cmp(int4hashset, int4hashset);
diff --git a/hashset.c b/hashset.c
index 569ae91..9a1bd3f 100644
--- a/hashset.c
+++ b/hashset.c
@@ -29,9 +29,12 @@ PG_MODULE_MAGIC;
typedef struct int4hashset_t {
int32 vl_len_; /* varlena header (do not touch directly!) */
int32 flags; /* reserved for future use (versioning, ...) */
- int32 maxelements; /* max number of element we have space for */
+ int32 capacity; /* max number of element we have space for */
int32 nelements; /* number of items added to the hashset */
int32 hashfn_id; /* ID of the hash function used */
+ float4 load_factor; /* Load factor before triggering resize */
+ float4 growth_factor; /* Growth factor when resizing the hashset */
+ int32 ncollisions; /* Number of collisions */
char data[FLEXIBLE_ARRAY_MEMBER];
} int4hashset_t;
@@ -46,6 +49,16 @@ static Datum int32_to_array(FunctionCallInfo fcinfo, int32 * d, int len);
#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
#define HASHSET_STEP 13
#define JENKINS_LOOKUP3_HASHFN_ID 1
+#define MURMURHASH32_HASHFN_ID 2
+#define NAIVE_HASHFN_ID 3
+
+/*
+ * These defaults should match the the SQL function int4hashset()
+ */
+#define DEFAULT_INITIAL_CAPACITY 0
+#define DEFAULT_LOAD_FACTOR 0.75
+#define DEFAULT_GROWTH_FACTOR 2.0
+#define DEFAULT_HASHFN_ID JENKINS_LOOKUP3_HASHFN_ID
PG_FUNCTION_INFO_V1(int4hashset_in);
PG_FUNCTION_INFO_V1(int4hashset_out);
@@ -57,6 +70,7 @@ PG_FUNCTION_INFO_V1(int4hashset_count);
PG_FUNCTION_INFO_V1(int4hashset_merge);
PG_FUNCTION_INFO_V1(int4hashset_init);
PG_FUNCTION_INFO_V1(int4hashset_capacity);
+PG_FUNCTION_INFO_V1(int4hashset_collisions);
PG_FUNCTION_INFO_V1(int4hashset_agg_add_set);
PG_FUNCTION_INFO_V1(int4hashset_agg_add);
PG_FUNCTION_INFO_V1(int4hashset_agg_final);
@@ -81,6 +95,7 @@ Datum int4hashset_count(PG_FUNCTION_ARGS);
Datum int4hashset_merge(PG_FUNCTION_ARGS);
Datum int4hashset_init(PG_FUNCTION_ARGS);
Datum int4hashset_capacity(PG_FUNCTION_ARGS);
+Datum int4hashset_collisions(PG_FUNCTION_ARGS);
Datum int4hashset_agg_add(PG_FUNCTION_ARGS);
Datum int4hashset_agg_add_set(PG_FUNCTION_ARGS);
Datum int4hashset_agg_final(PG_FUNCTION_ARGS);
@@ -118,23 +133,28 @@ hashset_isspace(char ch)
}
static int4hashset_t *
-int4hashset_allocate(int maxelements)
+int4hashset_allocate(
+ int capacity,
+ float4 load_factor,
+ float4 growth_factor,
+ int hashfn_id
+)
{
Size len;
int4hashset_t *set;
char *ptr;
/*
- * Ensure that maxelements is not divisible by HASHSET_STEP;
+ * Ensure that capacity is not divisible by HASHSET_STEP;
* i.e. the step size used in hashset_add_element()
* and hashset_contains_element().
*/
- while (maxelements % HASHSET_STEP == 0)
- maxelements++;
+ while (capacity % HASHSET_STEP == 0)
+ capacity++;
len = offsetof(int4hashset_t, data);
- len += CEIL_DIV(maxelements, 8);
- len += maxelements * sizeof(int32);
+ len += CEIL_DIV(capacity, 8);
+ len += capacity * sizeof(int32);
ptr = palloc0(len);
SET_VARSIZE(ptr, len);
@@ -142,9 +162,11 @@ int4hashset_allocate(int maxelements)
set = (int4hashset_t *) ptr;
set->flags = 0;
- set->maxelements = maxelements;
+ set->capacity = capacity;
set->nelements = 0;
- set->hashfn_id = JENKINS_LOOKUP3_HASHFN_ID;
+ set->hashfn_id = hashfn_id;
+ set->load_factor = load_factor;
+ set->growth_factor = growth_factor;
set->flags |= 0;
@@ -176,7 +198,12 @@ int4hashset_in(PG_FUNCTION_ARGS)
str++;
/* Initial size based on input length (arbitrary, could be optimized) */
- set = int4hashset_allocate(len/2);
+ set = int4hashset_allocate(
+ len/2,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
while (true)
{
@@ -202,7 +229,7 @@ int4hashset_in(PG_FUNCTION_ARGS)
}
/* Add the value to the hashset, resize if needed */
- if (set->nelements >= set->maxelements)
+ if (set->nelements >= set->capacity)
{
set = int4hashset_resize(set);
}
@@ -261,7 +288,7 @@ int4hashset_out(PG_FUNCTION_ARGS)
/* Calculate the pointer to the bitmap and values array */
bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
/* Initialize the StringInfo buffer */
initStringInfo(&str);
@@ -270,7 +297,7 @@ int4hashset_out(PG_FUNCTION_ARGS)
appendStringInfoChar(&str, '{');
/* Loop through the elements and append them to the string */
- for (i = 0; i < set->maxelements; i++)
+ for (i = 0; i < set->capacity; i++)
{
int byte = i / 8;
int bit = i % 8;
@@ -305,7 +332,7 @@ int4hashset_send(PG_FUNCTION_ARGS)
/* Send the non-data fields */
pq_sendint32(&buf, set->flags);
- pq_sendint32(&buf, set->maxelements);
+ pq_sendint32(&buf, set->capacity);
pq_sendint32(&buf, set->nelements);
pq_sendint32(&buf, set->hashfn_id);
@@ -328,7 +355,7 @@ int4hashset_recv(PG_FUNCTION_ARGS)
/* Read fields from buffer */
int32 flags = pq_getmsgint(buf, 4);
- int32 maxelements = pq_getmsgint(buf, 4);
+ int32 capacity = pq_getmsgint(buf, 4);
int32 nelements = pq_getmsgint(buf, 4);
int32 hashfn_id = pq_getmsgint(buf, 4);
@@ -349,7 +376,7 @@ int4hashset_recv(PG_FUNCTION_ARGS)
/* Populate the structure */
set->flags = flags;
- set->maxelements = maxelements;
+ set->capacity = capacity;
set->nelements = nelements;
set->hashfn_id = hashfn_id;
memcpy(set->data, binary_data, data_size);
@@ -376,14 +403,14 @@ int4hashset_to_array(PG_FUNCTION_ARGS)
set = (int4hashset_t *) PG_GETARG_INT4HASHSET(0);
sbitmap = set->data;
- svalues = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
+ svalues = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
/* number of values to store in the array */
nvalues = set->nelements;
values = (int32 *) palloc(sizeof(int32) * nvalues);
idx = 0;
- for (i = 0; i < set->maxelements; i++)
+ for (i = 0; i < set->capacity; i++)
{
int byte = (i / 8);
int bit = (i % 8);
@@ -426,15 +453,22 @@ static int4hashset_t *
int4hashset_resize(int4hashset_t * set)
{
int i;
- int4hashset_t *new = int4hashset_allocate(set->maxelements * 2);
+ int4hashset_t *new;
char *bitmap;
int32 *values;
+ new = int4hashset_allocate(
+ set->capacity * 2,
+ set->load_factor,
+ set->growth_factor,
+ set->hashfn_id
+ );
+
/* Calculate the pointer to the bitmap and values array */
bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
- for (i = 0; i < set->maxelements; i++)
+ for (i = 0; i < set->capacity; i++)
{
int byte = (i / 8);
int bit = (i % 8);
@@ -455,12 +489,20 @@ int4hashset_add_element(int4hashset_t *set, int32 value)
char *bitmap;
int32 *values;
- if (set->nelements > set->maxelements * 0.75)
+ if (set->nelements > set->capacity * set->load_factor)
set = int4hashset_resize(set);
if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
{
- hash = hash_bytes_uint32((uint32) value) % set->maxelements;
+ hash = hash_bytes_uint32((uint32) value) % set->capacity;
+ }
+ else if (set->hashfn_id == MURMURHASH32_HASHFN_ID)
+ {
+ hash = murmurhash32((uint32) value) % set->capacity;
+ }
+ else if (set->hashfn_id == NAIVE_HASHFN_ID)
+ {
+ hash = ((uint32) value * 7691 + 4201) % set->capacity;
}
else
{
@@ -470,7 +512,7 @@ int4hashset_add_element(int4hashset_t *set, int32 value)
}
bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
while (true)
{
@@ -484,7 +526,10 @@ int4hashset_add_element(int4hashset_t *set, int32 value)
if (values[hash] == value)
break;
- hash = (hash + HASHSET_STEP) % set->maxelements;
+ /* Increment the collision counter */
+ set->ncollisions++;
+
+ hash = (hash + HASHSET_STEP) % set->capacity;
continue;
}
@@ -512,7 +557,15 @@ int4hashset_contains_element(int4hashset_t *set, int32 value)
if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
{
- hash = hash_bytes_uint32((uint32) value) % set->maxelements;
+ hash = hash_bytes_uint32((uint32) value) % set->capacity;
+ }
+ else if (set->hashfn_id == MURMURHASH32_HASHFN_ID)
+ {
+ hash = murmurhash32((uint32) value) % set->capacity;
+ }
+ else if (set->hashfn_id == NAIVE_HASHFN_ID)
+ {
+ hash = ((uint32) value * 7691 + 4201) % set->capacity;
}
else
{
@@ -522,7 +575,7 @@ int4hashset_contains_element(int4hashset_t *set, int32 value)
}
bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
while (true)
{
@@ -538,12 +591,12 @@ int4hashset_contains_element(int4hashset_t *set, int32 value)
return true;
/* move to the next element */
- hash = (hash + HASHSET_STEP) % set->maxelements;
+ hash = (hash + HASHSET_STEP) % set->capacity;
num_probes++; /* Increment the number of probes */
/* Check if we have probed all slots */
- if (num_probes >= set->maxelements)
+ if (num_probes >= set->capacity)
return false; /* Avoid infinite loop */
}
}
@@ -563,7 +616,14 @@ int4hashset_add(PG_FUNCTION_ARGS)
/* if there's no hashset allocated, create it now */
if (PG_ARGISNULL(0))
- set = int4hashset_allocate(64);
+ {
+ set = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ }
else
{
/* make sure we are working with a non-toasted and non-shared copy of the input */
@@ -597,9 +657,9 @@ int4hashset_merge(PG_FUNCTION_ARGS)
setb = PG_GETARG_INT4HASHSET(1);
bitmap = setb->data;
- values = (int32 *) (setb->data + CEIL_DIV(setb->maxelements, 8));
+ values = (int32 *) (setb->data + CEIL_DIV(setb->capacity, 8));
- for (i = 0; i < setb->maxelements; i++)
+ for (i = 0; i < setb->capacity; i++)
{
int byte = (i / 8);
int bit = (i % 8);
@@ -614,19 +674,51 @@ int4hashset_merge(PG_FUNCTION_ARGS)
Datum
int4hashset_init(PG_FUNCTION_ARGS)
{
- if (PG_NARGS() == 0) {
- /*
- * No initial capacity argument was passed,
- * allocate hashset with zero capacity
- */
- PG_RETURN_POINTER(int4hashset_allocate(0));
- } else {
- /*
- * Initial capacity argument was passed,
- * allocate hashset with the specified capacity
- */
- PG_RETURN_POINTER(int4hashset_allocate(PG_GETARG_INT32(0)));
+ int4hashset_t *set;
+ int32 initial_capacity = PG_GETARG_INT32(0);
+ float4 load_factor = PG_GETARG_FLOAT4(1);
+ float4 growth_factor = PG_GETARG_FLOAT4(2);
+ int32 hashfn_id = PG_GETARG_INT32(3);
+
+ /* Validate input arguments */
+ if (!(initial_capacity >= 0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("initial capacity cannot be negative")));
}
+
+ if (!(load_factor > 0.0 && load_factor < 1.0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("load factor must be between 0.0 and 1.0")));
+ }
+
+ if (!(growth_factor > 1.0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("growth factor must be greater than 1.0")));
+ }
+
+ if (!(hashfn_id == JENKINS_LOOKUP3_HASHFN_ID ||
+ hashfn_id == MURMURHASH32_HASHFN_ID ||
+ hashfn_id == NAIVE_HASHFN_ID))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Invalid hash function ID")));
+ }
+
+ set = int4hashset_allocate(
+ initial_capacity,
+ load_factor,
+ growth_factor,
+ hashfn_id
+ );
+
+ PG_RETURN_POINTER(set);
}
Datum
@@ -667,7 +759,20 @@ int4hashset_capacity(PG_FUNCTION_ARGS)
set = (int4hashset_t *) PG_GETARG_POINTER(0);
- PG_RETURN_INT64(set->maxelements);
+ PG_RETURN_INT64(set->capacity);
+}
+
+Datum
+int4hashset_collisions(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->ncollisions);
}
Datum
@@ -699,7 +804,12 @@ int4hashset_agg_add(PG_FUNCTION_ARGS)
if (PG_ARGISNULL(0))
{
oldcontext = MemoryContextSwitchTo(aggcontext);
- state = int4hashset_allocate(64);
+ state = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
MemoryContextSwitchTo(oldcontext);
}
else
@@ -741,7 +851,12 @@ int4hashset_agg_add_set(PG_FUNCTION_ARGS)
if (PG_ARGISNULL(0))
{
oldcontext = MemoryContextSwitchTo(aggcontext);
- state = int4hashset_allocate(64);
+ state = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
MemoryContextSwitchTo(oldcontext);
}
else
@@ -758,9 +873,9 @@ int4hashset_agg_add_set(PG_FUNCTION_ARGS)
value = PG_GETARG_INT4HASHSET(1);
bitmap = value->data;
- values = (int32 *) (value->data + CEIL_DIV(value->maxelements, 8));
+ values = (int32 *) (value->data + CEIL_DIV(value->capacity, 8));
- for (i = 0; i < value->maxelements; i++)
+ for (i = 0; i < value->capacity; i++)
{
int byte = (i / 8);
int bit = (i % 8);
@@ -835,9 +950,9 @@ int4hashset_agg_combine(PG_FUNCTION_ARGS)
dst = (int4hashset_t *) PG_GETARG_POINTER(0);
bitmap = src->data;
- values = (int32 *) (src->data + CEIL_DIV(src->maxelements, 8));
+ values = (int32 *) (src->data + CEIL_DIV(src->capacity, 8));
- for (i = 0; i < src->maxelements; i++)
+ for (i = 0; i < src->capacity; i++)
{
int byte = (i / 8);
int bit = (i % 8);
@@ -868,12 +983,12 @@ int4hashset_equals(PG_FUNCTION_ARGS)
PG_RETURN_BOOL(false);
bitmap_a = a->data;
- values_a = (int32 *)(a->data + CEIL_DIV(a->maxelements, 8));
+ values_a = (int32 *)(a->data + CEIL_DIV(a->capacity, 8));
/*
* Check if every element in a is also in b
*/
- for (i = 0; i < a->maxelements; i++)
+ for (i = 0; i < a->capacity; i++)
{
int byte = (i / 8);
int bit = (i % 8);
@@ -918,10 +1033,10 @@ Datum int4hashset_hash(PG_FUNCTION_ARGS)
/* Access the data array */
char *bitmap = set->data;
- int32 *values = (int32 *)(set->data + CEIL_DIV(set->maxelements, 8));
+ int32 *values = (int32 *)(set->data + CEIL_DIV(set->capacity, 8));
/* Iterate through all elements */
- for (int32 i = 0; i < set->maxelements; i++)
+ for (int32 i = 0; i < set->capacity; i++)
{
int byte = i / 8;
int bit = i % 8;
@@ -1010,13 +1125,13 @@ int4hashset_cmp(PG_FUNCTION_ARGS)
int i = 0, j = 0;
bitmap_a = a->data;
- values_a = (int32 *)(a->data + CEIL_DIV(a->maxelements, 8));
+ values_a = (int32 *)(a->data + CEIL_DIV(a->capacity, 8));
bitmap_b = b->data;
- values_b = (int32 *)(b->data + CEIL_DIV(b->maxelements, 8));
+ values_b = (int32 *)(b->data + CEIL_DIV(b->capacity, 8));
/* Iterate over the elements in each hashset independently */
- while(i < a->maxelements && j < b->maxelements)
+ while(i < a->capacity && j < b->capacity)
{
int byte_a = (i / 8);
int bit_a = (i % 8);
@@ -1057,9 +1172,9 @@ int4hashset_cmp(PG_FUNCTION_ARGS)
* If all compared elements are equal,
* then compare the remaining elements in the larger hashset
*/
- if (i < a->maxelements)
+ if (i < a->capacity)
PG_RETURN_INT32(1);
- else if (j < b->maxelements)
+ else if (j < b->capacity)
PG_RETURN_INT32(-1);
else
PG_RETURN_INT32(0);
diff --git a/test/expected/basic.out b/test/expected/basic.out
index a793ef2..65be2a6 100644
--- a/test/expected/basic.out
+++ b/test/expected/basic.out
@@ -30,15 +30,20 @@ LINE 1: SELECT '{2147483648}'::int4hashset;
/*
* Hashset Functions
*/
-SELECT int4hashset(); -- init empty int4hashset with no capacity
+SELECT int4hashset();
int4hashset
-------------
{}
(1 row)
-SELECT int4hashset_with_capacity(10); -- init empty int4hashset with specified capacity
- int4hashset_with_capacity
----------------------------
+SELECT int4hashset(
+ capacity := 10,
+ load_factor := 0.9,
+ growth_factor := 1.1,
+ hashfn_id := 1
+);
+ int4hashset
+-------------
{}
(1 row)
@@ -90,7 +95,7 @@ SELECT hashset_count('{1,2,3}'::int4hashset); -- 3
3
(1 row)
-SELECT hashset_capacity(int4hashset_with_capacity(10)); -- 10
+SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
hashset_capacity
------------------
10
@@ -99,21 +104,21 @@ SELECT hashset_capacity(int4hashset_with_capacity(10)); -- 10
/*
* Aggregation Functions
*/
-SELECT hashset(i) FROM generate_series(1,10) AS i;
- hashset
+SELECT hashset_agg(i) FROM generate_series(1,10) AS i;
+ hashset_agg
------------------------
- {8,1,10,3,9,4,6,2,5,7}
+ {6,10,1,8,2,3,4,5,9,7}
(1 row)
-SELECT hashset(h) FROM
+SELECT hashset_agg(h) FROM
(
- SELECT hashset(i) AS h FROM generate_series(1,5) AS i
+ SELECT hashset_agg(i) AS h FROM generate_series(1,5) AS i
UNION ALL
- SELECT hashset(j) AS h FROM generate_series(6,10) AS j
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
) q;
- hashset
+ hashset_agg
------------------------
- {8,1,10,3,9,4,6,2,5,7}
+ {6,8,1,3,2,10,4,5,9,7}
(1 row)
/*
diff --git a/test/expected/order.out b/test/expected/order.out
index 089bd15..2eb321c 100644
--- a/test/expected/order.out
+++ b/test/expected/order.out
@@ -14,7 +14,7 @@ DECLARE
element INT;
random_set int4hashset;
BEGIN
- random_set := int4hashset_with_capacity(num_elements);
+ random_set := int4hashset(capacity := num_elements);
FOR i IN 1..num_elements LOOP
element := floor(random() * 1000)::INT;
diff --git a/test/expected/reported_bugs.out b/test/expected/reported_bugs.out
index 226e81c..860370d 100644
--- a/test/expected/reported_bugs.out
+++ b/test/expected/reported_bugs.out
@@ -12,16 +12,16 @@
* hashset, thereby preventing alteration of the original data.
*/
SELECT
- q.hashset,
- hashset_add(hashset,4)
+ q.hashset_agg,
+ hashset_add(hashset_agg,4)
FROM
(
SELECT
- hashset(generate_series)
+ hashset_agg(generate_series)
FROM generate_series(1,3)
) q;
- hashset | hashset_add
----------+-------------
- {1,3,2} | {1,3,4,2}
+ hashset_agg | hashset_add
+-------------+-------------
+ {3,1,2} | {3,4,1,2}
(1 row)
diff --git a/test/expected/table.out b/test/expected/table.out
index 3c020b6..9793a49 100644
--- a/test/expected/table.out
+++ b/test/expected/table.out
@@ -1,6 +1,6 @@
CREATE TABLE users (
user_id int PRIMARY KEY,
- user_likes int4hashset DEFAULT int4hashset_with_capacity(2)
+ user_likes int4hashset DEFAULT int4hashset(capacity := 2)
);
INSERT INTO users (user_id) VALUES (1);
UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
diff --git a/test/sql/basic.sql b/test/sql/basic.sql
index 563c626..4882895 100644
--- a/test/sql/basic.sql
+++ b/test/sql/basic.sql
@@ -12,8 +12,13 @@ SELECT '{2147483648}'::int4hashset; -- out of range
* Hashset Functions
*/
-SELECT int4hashset(); -- init empty int4hashset with no capacity
-SELECT int4hashset_with_capacity(10); -- init empty int4hashset with specified capacity
+SELECT int4hashset();
+SELECT int4hashset(
+ capacity := 10,
+ load_factor := 0.9,
+ growth_factor := 1.1,
+ hashfn_id := 1
+);
SELECT hashset_add(int4hashset(), 123);
SELECT hashset_add(NULL::int4hashset, 123);
SELECT hashset_add('{123}'::int4hashset, 456);
@@ -22,19 +27,19 @@ SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
SELECT hashset_merge('{1,2}'::int4hashset, '{2,3}'::int4hashset);
SELECT hashset_to_array('{1,2,3}'::int4hashset);
SELECT hashset_count('{1,2,3}'::int4hashset); -- 3
-SELECT hashset_capacity(int4hashset_with_capacity(10)); -- 10
+SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
/*
* Aggregation Functions
*/
-SELECT hashset(i) FROM generate_series(1,10) AS i;
+SELECT hashset_agg(i) FROM generate_series(1,10) AS i;
-SELECT hashset(h) FROM
+SELECT hashset_agg(h) FROM
(
- SELECT hashset(i) AS h FROM generate_series(1,5) AS i
+ SELECT hashset_agg(i) AS h FROM generate_series(1,5) AS i
UNION ALL
- SELECT hashset(j) AS h FROM generate_series(6,10) AS j
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
) q;
/*
diff --git a/test/sql/benchmark.sql b/test/sql/benchmark.sql
new file mode 100644
index 0000000..6f825dc
--- /dev/null
+++ b/test/sql/benchmark.sql
@@ -0,0 +1,110 @@
+DROP EXTENSION IF EXISTS hashset;
+CREATE EXTENSION hashset;
+
+\timing on
+
+\echo *** Elements in sequence 1..100000
+
+\echo - Testing default hash function (Jenkins/lookup3)
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 1);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing Murmurhash32
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 2);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing naive hash function
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 3);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo *** Testing 100000 random ints
+
+\echo - Testing default hash function (Jenkins/lookup3)
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 1);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing Murmurhash32
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 2);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing naive hash function
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 3);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
diff --git a/test/sql/order.sql b/test/sql/order.sql
index 2dcdb39..ba6af17 100644
--- a/test/sql/order.sql
+++ b/test/sql/order.sql
@@ -10,7 +10,7 @@ DECLARE
element INT;
random_set int4hashset;
BEGIN
- random_set := int4hashset_with_capacity(num_elements);
+ random_set := int4hashset(capacity := num_elements);
FOR i IN 1..num_elements LOOP
element := floor(random() * 1000)::INT;
diff --git a/test/sql/reported_bugs.sql b/test/sql/reported_bugs.sql
index fcd0b9d..a47a6f0 100644
--- a/test/sql/reported_bugs.sql
+++ b/test/sql/reported_bugs.sql
@@ -12,11 +12,11 @@
* hashset, thereby preventing alteration of the original data.
*/
SELECT
- q.hashset,
- hashset_add(hashset,4)
+ q.hashset_agg,
+ hashset_add(hashset_agg,4)
FROM
(
SELECT
- hashset(generate_series)
+ hashset_agg(generate_series)
FROM generate_series(1,3)
) q;
diff --git a/test/sql/table.sql b/test/sql/table.sql
index a63253f..0472352 100644
--- a/test/sql/table.sql
+++ b/test/sql/table.sql
@@ -1,6 +1,6 @@
CREATE TABLE users (
user_id int PRIMARY KEY,
- user_likes int4hashset DEFAULT int4hashset_with_capacity(2)
+ user_likes int4hashset DEFAULT int4hashset(capacity := 2)
);
INSERT INTO users (user_id) VALUES (1);
UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
similar to (int[] || int4) and (int4 || int[])
should we expect ('{1,2}'::int4hashset || 3) == (3 ||
'{1,2}'::int4hashset) == (select hashset_add('{1,2}'::int4hashset,3)); *?*
The following is the general idea on how to make it work by looking at
similar code....
CREATE OPERATOR || (
leftarg = int4hashset,
rightarg = int4,
function = int4hashset_add,
commutator = ||
);
CREATE OR REPLACE FUNCTION int4_add_int4hashset(int4, int4hashset)
RETURNS int4hashset
LANGUAGE sql
IMMUTABLE PARALLEL SAFE STRICT COST 1
RETURN $2 || $1;
CREATE OPERATOR || (
leftarg = int4,
rightarg = int4hashset,
function = int4_add_int4hashset,
commutator = ||
);
while creating an operator. I am not sure how to specify
NEGATOR,RESTRICT,JOIN clause.
-----------------------------------------------------------------------------------------------------------------------------
also. I think the following query should return one row only? but now it
doesn't.
select hashset_cmp('{1,2}','{2,1}')
union
select hashset_cmp('{1,2}','{1,2,1}')
union
select hashset_cmp('{1,2}','{1,2}');
----------------------------------------------------------------------------------------------------------------------
similar to elem_contained_by_range, range_contains_elem. we should already
consider the operator *<@* and @*>? *
CREATE OR REPLACE FUNCTION elem_contained_by_hashset(int4, int4hashset)
RETURNS bool
LANGUAGE sql
IMMUTABLE PARALLEL SAFE STRICT COST 1
RETURN hashset_contains ($2,$1);
Is the integer contained in the int4hashset?
integer <@ int4hashset → boolean
1 <@ int4hashset'{1,7}' → t
CREATE OPERATOR <@ (
leftarg = integer,
rightarg = int4hashset,
function = elem_contained_by_hashset
);
int4hashset @> integer → boolean
Does the int4hashset contain the element?
int4hashset'{1,7}' @> 1 → t
CREATE OPERATOR @> (
leftarg = int4hashset,
rightarg = integer,
function = hashset_contains
);
-------------------
On Fri, Jun 16, 2023, at 13:57, jian he wrote:
similar to (int[] || int4) and (int4 || int[])
should we expect ('{1,2}'::int4hashset || 3) == (3 ||
'{1,2}'::int4hashset) == (select hashset_add('{1,2}'::int4hashset,3));
*?*
Good idea, makes sense to support it.
Implemented in attached patch.
CREATE OPERATOR || (
leftarg = int4,
rightarg = int4hashset,
function = int4_add_int4hashset,
commutator = ||
);
while creating an operator. I am not sure how to specify
NEGATOR,RESTRICT,JOIN clause.
I don't think we need those for this operator, might be wrong though.
-----------------------------------------------------------------------------------------------------------------------------
also. I think the following query should return one row only? but now
it doesn't.
select hashset_cmp('{1,2}','{2,1}')
union
select hashset_cmp('{1,2}','{1,2,1}')
union
select hashset_cmp('{1,2}','{1,2}');
Good point.
I realise int4hashset_hash() is broken,
since two int4hashset's that are considered equal,
can by coincidence get different hashes:
SELECT '{1,2}'::int4hashset = '{2,1}'::int4hashset;
?column?
----------
t
(1 row)
SELECT hashset_hash('{1,2}'::int4hashset);
hashset_hash
--------------
990882385
(1 row)
SELECT hashset_hash('{2,1}'::int4hashset);
hashset_hash
--------------
996377797
(1 row)
Do we have any ideas on how to fix this without sacrificing performance?
We of course want to avoid having to sort the hashsets,
which is the naive solution.
To understand why this is happening, consider this example:
SELECT '{1,2}'::int4hashset;
int4hashset
-------------
{1,2}
(1 row)
SELECT '{2,1}'::int4hashset;
int4hashset
-------------
{2,1}
(1 row)
If the hash of `1` and `2` modulus the capacity results in the same value,
they will be attempted to be inserted into the same position,
and since the input text is parsed left-to-right, in the first case `1` will win
the first position, and `2` will get a collision and try the next position.
In the second case, the opposite happens.
Since we do modulus capacity, the position depends on the capacity,
which is why the output string can be different for the same input.
SELECTint4hashset() || 1 || 2 || 3;
{3,1,2}
SELECTint4hashset(capacity:=1) || 1 || 2 || 3;
{3,1,2}
SELECTint4hashset(capacity:=2) || 1 || 2 || 3;
{3,1,2}
SELECTint4hashset(capacity:=3) || 1 || 2 || 3;
{3,2,1}
SELECTint4hashset(capacity:=4) || 1 || 2 || 3;
{3,1,2}
SELECTint4hashset(capacity:=5) || 1 || 2 || 3;
{1,2,3}
SELECTint4hashset(capacity:=6) || 1 || 2 || 3;
{1,3,2}
----------------------------------------------------------------------------------------------------------------------
similar to elem_contained_by_range, range_contains_elem. we should
already consider the operator *<@* and @*>? *
That could perhaps be nice.
Apart from possible syntax convenience,
are there any other benefits than just using the function hashset_contains(int4hashset, integer) directly?
/Joel
Attachments:
hashset-0.0.1-d83bdef.patchapplication/octet-stream; name=hashset-0.0.1-d83bdef.patchDownload
commit d83bdef5428279be0126d20c7f75187b9f4847b0
Author: Joel Jakobsson <joel@compiler.org>
Date: Fri Jun 16 17:14:49 2023 +0200
Implement || operator to append an int to an int4hashset
To avoid need for another C-function, the new SQL helper function
int4_add_int4hashset() swaps the arguments of the || so that
hashset_add() can be used for both scenarios.
Also setseed() in the benchmark.sql for reproducibility.
diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
index 20d019d..2c2f9e8 100644
--- a/hashset--0.0.1.sql
+++ b/hashset--0.0.1.sql
@@ -82,6 +82,12 @@ RETURNS bigint
AS 'hashset', 'int4hashset_collisions'
LANGUAGE C IMMUTABLE;
+CREATE OR REPLACE FUNCTION int4_add_int4hashset(int4, int4hashset)
+RETURNS int4hashset
+AS $$SELECT $2 || $1$$
+LANGUAGE SQL
+IMMUTABLE PARALLEL SAFE STRICT COST 1;
+
/*
* Aggregation Functions
*/
@@ -165,6 +171,20 @@ CREATE OPERATOR <> (
HASHES
);
+CREATE OPERATOR || (
+ leftarg = int4hashset,
+ rightarg = int4,
+ function = hashset_add,
+ commutator = ||
+);
+
+CREATE OPERATOR || (
+ leftarg = int4,
+ rightarg = int4hashset,
+ function = int4_add_int4hashset,
+ commutator = ||
+);
+
/*
* Hashset Hash Operators
*/
diff --git a/test/expected/basic.out b/test/expected/basic.out
index 65be2a6..5690ba4 100644
--- a/test/expected/basic.out
+++ b/test/expected/basic.out
@@ -220,6 +220,18 @@ SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
t
(1 row)
+SELECT '{1,2,3}'::int4hashset || 4;
+ ?column?
+-----------
+ {1,3,2,4}
+(1 row)
+
+SELECT 4 || '{1,2,3}'::int4hashset;
+ ?column?
+-----------
+ {1,3,2,4}
+(1 row)
+
/*
* Hashset Hash Operators
*/
diff --git a/test/sql/basic.sql b/test/sql/basic.sql
index 4882895..464884e 100644
--- a/test/sql/basic.sql
+++ b/test/sql/basic.sql
@@ -66,6 +66,9 @@ SELECT '{1,2,3}'::int4hashset <> '{4,5,6}'::int4hashset; -- true
SELECT '{1,2,3}'::int4hashset <> '{1,2}'::int4hashset; -- true
SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset || 4;
+SELECT 4 || '{1,2,3}'::int4hashset;
+
/*
* Hashset Hash Operators
*/
diff --git a/test/sql/benchmark.sql b/test/sql/benchmark.sql
index 6f825dc..1697451 100644
--- a/test/sql/benchmark.sql
+++ b/test/sql/benchmark.sql
@@ -58,6 +58,7 @@ $$ LANGUAGE plpgsql;
\echo *** Testing 100000 random ints
+SELECT setseed(0.12345);
\echo - Testing default hash function (Jenkins/lookup3)
DO
@@ -75,6 +76,7 @@ BEGIN
END
$$ LANGUAGE plpgsql;
+SELECT setseed(0.12345);
\echo - Testing Murmurhash32
DO
@@ -92,6 +94,7 @@ BEGIN
END
$$ LANGUAGE plpgsql;
+SELECT setseed(0.12345);
\echo - Testing naive hash function
DO
On Fri, Jun 16, 2023, at 17:42, Joel Jacobson wrote:
I realise int4hashset_hash() is broken,
since two int4hashset's that are considered equal,
can by coincidence get different hashes:
...
Do we have any ideas on how to fix this without sacrificing performance?
The problem was due to hashset_hash() function accumulating the hashes
of individual elements in a non-commutative manner. As a consequence, the
final hash value was sensitive to the order in which elements were inserted
into the hashset. This behavior led to inconsistencies, as logically
equivalent sets (i.e., sets with the same elements but in different orders)
produced different hash values.
Solved by modifying the hashset_hash() function to use a commutative operation
when combining the hashes of individual elements. This change ensures that the
final hash value is independent of the element insertion order, and logically
equivalent sets produce the same hash.
An somewhat unfortunate side-effect of this fix, is that we can no longer
visually sort the hashset output format, since it's not lexicographically sorted.
I think this is an acceptable trade-off for a hashset type,
since the only alternative I see would be to sort the elements,
but then it wouldn't be a hashset, but a treeset, which different
Big-O complexity.
New patch is attached, which will henceforth always be a complete patch,
to avoid the hassle of having to assemble incremental patches.
/Joel
Attachments:
hashset-0.0.1-8da4aa8.patchapplication/octet-stream; name=hashset-0.0.1-8da4aa8.patchDownload
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..91f216e
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,8 @@
+.deps/
+results/
+**/*.o
+**/*.so
+regression.diffs
+regression.out
+.vscode
+test/c_tests/test_send_recv
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..908853d
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,16 @@
+Copyright (c) 2019, Tomas Vondra (tomas.vondra@postgresql.org).
+
+Permission to use, copy, modify, and distribute this software and its documentation
+for any purpose, without fee, and without a written agreement is hereby granted,
+provided that the above copyright notice and this paragraph and the following two
+paragraphs appear in all copies.
+
+IN NO EVENT SHALL $ORGANISATION BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL,
+INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE
+OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF TOMAS VONDRA HAS BEEN ADVISED OF
+THE POSSIBILITY OF SUCH DAMAGE.
+
+TOMAS VONDRA SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
+SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND $ORGANISATION HAS NO
+OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..d8be8ee
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,33 @@
+MODULE_big = hashset
+OBJS = hashset.o
+
+EXTENSION = hashset
+DATA = hashset--0.0.1.sql
+MODULES = hashset
+
+# Keep the CFLAGS separate
+SERVER_INCLUDES=-I$(shell pg_config --includedir-server)
+CLIENT_INCLUDES=-I$(shell pg_config --includedir)
+LIBRARY_PATH = -L$(shell pg_config --libdir)
+
+REGRESS = prelude basic io_varying_lengths random table invalid parsing reported_bugs
+REGRESS_OPTS = --inputdir=test
+
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+
+C_TESTS_DIR = test/c_tests
+
+EXTRA_CLEAN = $(C_TESTS_DIR)/test_send_recv
+
+c_tests: $(C_TESTS_DIR)/test_send_recv
+
+$(C_TESTS_DIR)/test_send_recv: $(C_TESTS_DIR)/test_send_recv.c
+ $(CC) $(SERVER_INCLUDES) $(CLIENT_INCLUDES) -o $@ $< $(LIBRARY_PATH) -lpq
+
+run_c_tests: c_tests
+ cd $(C_TESTS_DIR) && ./test_send_recv.sh
+
+check: all $(REGRESS_PREP) run_c_tests
+
+include $(PGXS)
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9787825
--- /dev/null
+++ b/README.md
@@ -0,0 +1,152 @@
+# hashset
+
+This PostgreSQL extension implements hashset, a data structure (type)
+providing a collection of unique, not null integer items with fast lookup.
+
+
+## Version
+
+0.0.1
+
+🚧 **NOTICE** 🚧 This repository is currently under active development and the hashset
+PostgreSQL extension is **not production-ready**. As the codebase is evolving
+with possible breaking changes, we are not providing any migration scripts
+until we reach our first release.
+
+
+## Usage
+
+After installing the extension, you can use the `int4hashset` data type and
+associated functions within your PostgreSQL queries.
+
+To demonstrate the usage, let's consider a hypothetical table `users` which has
+a `user_id` and a `user_likes` of type `int4hashset`.
+
+Firstly, let's create the table:
+
+```sql
+CREATE TABLE users(
+ user_id int PRIMARY KEY,
+ user_likes int4hashset DEFAULT int4hashset()
+);
+```
+In the above statement, the `int4hashset()` initializes an empty hashset
+with zero capacity. The hashset will automatically resize itself when more
+elements are added.
+
+Now, we can perform operations on this table. Here are some examples:
+
+```sql
+-- Insert a new user with id 1. The user_likes will automatically be initialized
+-- as an empty hashset
+INSERT INTO users (user_id) VALUES (1);
+
+-- Add elements (likes) for a user
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+
+-- Check if a user likes a particular item
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1; -- true
+
+-- Count the number of likes a user has
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1; -- 2
+```
+
+You can also use the aggregate functions to perform operations on multiple rows.
+
+
+## Data types
+
+- **int4hashset**: This data type represents a set of integers. Internally, it uses
+a combination of a bitmap and a value array to store the elements in a set. It's
+a variable-length type.
+
+
+## Functions
+
+- `int4hashset([capacity int, load_factor float4, growth_factor float4, hashfn_id int4]) -> int4hashset`:
+ Initialize an empty int4hashset with optional parameters.
+ - `capacity` specifies the initial capacity, which is zero by default.
+ - `load_factor` represents the threshold for resizing the hashset and defaults to 0.75.
+ - `growth_factor` is the multiplier for resizing and defaults to 2.0.
+ - `hashfn_id` represents the hash function used.
+ - 1=Jenkins/lookup3 (default)
+ - 2=MurmurHash32
+ - 3=Naive hash function
+- `hashset_add(int4hashset, int) -> int4hashset`: Adds an integer to an int4hashset.
+- `hashset_contains(int4hashset, int) -> boolean`: Checks if an int4hashset contains a given integer.
+- `hashset_merge(int4hashset, int4hashset) -> int4hashset`: Merges two int4hashsets into a new int4hashset.
+- `hashset_to_array(int4hashset) -> int[]`: Converts an int4hashset to an array of integers.
+- `hashset_count(int4hashset) -> bigint`: Returns the number of elements in an int4hashset.
+- `hashset_capacity(int4hashset) -> bigint`: Returns the current capacity of an int4hashset.
+
+## Aggregation Functions
+
+- `hashset_agg(int) -> int4hashset`: Aggregate integers into a hashset.
+- `hashset_agg(int4hashset) -> int4hashset`: Aggregate hashsets into a hashset.
+
+
+## Operators
+
+- Equality (`=`): Checks if two hashsets are equal.
+- Inequality (`<>`): Checks if two hashsets are not equal.
+
+
+## Hashset Hash Operators
+
+- `hashset_hash(int4hashset) -> integer`: Returns the hash value of an int4hashset.
+
+
+## Hashset Btree Operators
+
+- `<`, `<=`, `>`, `>=`: Comparison operators for hashsets.
+
+
+## Limitations
+
+- The `int4hashset` data type currently supports integers within the range of int4
+(-2147483648 to 2147483647).
+
+
+## Installation
+
+To install the extension on any platform, follow these general steps:
+
+1. Ensure you have PostgreSQL installed on your system, including the development files.
+2. Clone the repository.
+3. Navigate to the cloned repository directory.
+4. Compile the extension using `make`.
+5. Install the extension using `sudo make install`.
+6. Run the tests using `make installcheck` (optional).
+
+To use a different PostgreSQL installation, point configure to a different `pg_config`, using following command:
+```sh
+make PG_CONFIG=/else/where/pg_config
+sudo make install PG_CONFIG=/else/where/pg_config
+```
+
+In your PostgreSQL connection, enable the hashset extension using the following SQL command:
+```sql
+CREATE EXTENSION hashset;
+```
+
+This extension requires PostgreSQL version ?.? or later.
+
+For Ubuntu 22.04.1 LTS, you would run the following commands:
+
+```sh
+sudo apt install postgresql-15 postgresql-server-dev-15 postgresql-client-15
+git clone https://github.com/tvondra/hashset.git
+cd hashset
+make
+sudo make install
+make installcheck
+```
+
+Please note that this project is currently under active development and is not yet considered production-ready.
+
+## License
+
+This software is distributed under the terms of PostgreSQL license.
+See LICENSE or http://www.opensource.org/licenses/bsd-license.php for
+more details.
diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
new file mode 100644
index 0000000..17e4a45
--- /dev/null
+++ b/hashset--0.0.1.sql
@@ -0,0 +1,278 @@
+/*
+ * Hashset Type Definition
+ */
+
+CREATE TYPE int4hashset;
+
+CREATE OR REPLACE FUNCTION int4hashset_in(cstring)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_in'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_out(int4hashset)
+RETURNS cstring
+AS 'hashset', 'int4hashset_out'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_send(int4hashset)
+RETURNS bytea
+AS 'hashset', 'int4hashset_send'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_recv(internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_recv'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE TYPE int4hashset (
+ INPUT = int4hashset_in,
+ OUTPUT = int4hashset_out,
+ RECEIVE = int4hashset_recv,
+ SEND = int4hashset_send,
+ INTERNALLENGTH = variable,
+ STORAGE = external
+);
+
+/*
+ * Hashset Functions
+ */
+
+CREATE OR REPLACE FUNCTION int4hashset(
+ capacity int DEFAULT 0,
+ load_factor float4 DEFAULT 0.75,
+ growth_factor float4 DEFAULT 2.0,
+ hashfn_id int DEFAULT 1
+)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_init'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_add(int4hashset, int)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_add'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int)
+RETURNS bool
+AS 'hashset', 'int4hashset_contains'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_merge(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_merge'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_to_array(int4hashset)
+RETURNS int[]
+AS 'hashset', 'int4hashset_to_array'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_count(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_count'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_capacity(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_capacity'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_collisions(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_collisions'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4_add_int4hashset(int4, int4hashset)
+RETURNS int4hashset
+AS $$SELECT $2 || $1$$
+LANGUAGE SQL
+IMMUTABLE PARALLEL SAFE STRICT COST 1;
+
+/*
+ * Aggregation Functions
+ */
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_add(p_pointer internal, p_value int)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_add'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_agg_final'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_combine'
+LANGUAGE C IMMUTABLE;
+
+CREATE AGGREGATE hashset_agg(int) (
+ SFUNC = int4hashset_agg_add,
+ STYPE = internal,
+ FINALFUNC = int4hashset_agg_final,
+ COMBINEFUNC = int4hashset_agg_combine,
+ PARALLEL = SAFE
+);
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_add_set(p_pointer internal, p_value int4hashset)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_add_set'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_agg_final'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_combine'
+LANGUAGE C IMMUTABLE;
+
+CREATE AGGREGATE hashset_agg(int4hashset) (
+ SFUNC = int4hashset_agg_add_set,
+ STYPE = internal,
+ FINALFUNC = int4hashset_agg_final,
+ COMBINEFUNC = int4hashset_agg_combine,
+ PARALLEL = SAFE
+);
+
+/*
+ * Operator Definitions
+ */
+
+CREATE OR REPLACE FUNCTION hashset_equals(int4hashset, int4hashset)
+RETURNS bool
+AS 'hashset', 'int4hashset_equals'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR = (
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ PROCEDURE = hashset_equals,
+ COMMUTATOR = =,
+ HASHES
+);
+
+CREATE OR REPLACE FUNCTION hashset_neq(int4hashset, int4hashset)
+RETURNS bool
+AS 'hashset', 'int4hashset_neq'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR <> (
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ PROCEDURE = hashset_neq,
+ COMMUTATOR = '<>',
+ NEGATOR = '=',
+ RESTRICT = neqsel,
+ JOIN = neqjoinsel,
+ HASHES
+);
+
+CREATE OPERATOR || (
+ leftarg = int4hashset,
+ rightarg = int4,
+ function = hashset_add,
+ commutator = ||
+);
+
+CREATE OPERATOR || (
+ leftarg = int4,
+ rightarg = int4hashset,
+ function = int4_add_int4hashset,
+ commutator = ||
+);
+
+/*
+ * Hashset Hash Operators
+ */
+
+CREATE OR REPLACE FUNCTION hashset_hash(int4hashset)
+RETURNS integer
+AS 'hashset', 'int4hashset_hash'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR CLASS int4hashset_hash_ops
+DEFAULT FOR TYPE int4hashset USING hash AS
+OPERATOR 1 = (int4hashset, int4hashset),
+FUNCTION 1 hashset_hash(int4hashset);
+
+/*
+ * Hashset Btree Operators
+ */
+
+CREATE OR REPLACE FUNCTION hashset_lt(int4hashset, int4hashset)
+RETURNS bool
+AS 'hashset', 'int4hashset_lt'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_le(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_le'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_gt(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_gt'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_ge(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_ge'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_cmp(int4hashset, int4hashset)
+RETURNS integer
+AS 'hashset', 'int4hashset_cmp'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR < (
+ PROCEDURE = hashset_lt,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = >,
+ NEGATOR = >=,
+ RESTRICT = scalarltsel,
+ JOIN = scalarltjoinsel
+);
+
+CREATE OPERATOR <= (
+ PROCEDURE = hashset_le,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '>=',
+ NEGATOR = '>',
+ RESTRICT = scalarltsel,
+ JOIN = scalarltjoinsel
+);
+
+CREATE OPERATOR > (
+ PROCEDURE = hashset_gt,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '<',
+ NEGATOR = '<=',
+ RESTRICT = scalargtsel,
+ JOIN = scalargtjoinsel
+);
+
+CREATE OPERATOR >= (
+ PROCEDURE = hashset_ge,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '<=',
+ NEGATOR = '<',
+ RESTRICT = scalargtsel,
+ JOIN = scalargtjoinsel
+);
+
+CREATE OPERATOR CLASS int4hashset_btree_ops
+DEFAULT FOR TYPE int4hashset USING btree AS
+OPERATOR 1 < (int4hashset, int4hashset),
+OPERATOR 2 <= (int4hashset, int4hashset),
+OPERATOR 3 = (int4hashset, int4hashset),
+OPERATOR 4 >= (int4hashset, int4hashset),
+OPERATOR 5 > (int4hashset, int4hashset),
+FUNCTION 1 hashset_cmp(int4hashset, int4hashset);
diff --git a/hashset.c b/hashset.c
new file mode 100644
index 0000000..e289112
--- /dev/null
+++ b/hashset.c
@@ -0,0 +1,1210 @@
+/*
+ * hashset.c
+ *
+ * Copyright (C) Tomas Vondra, 2019
+ */
+
+#include "postgres.h"
+#include "libpq/pqformat.h"
+#include "nodes/memnodes.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "catalog/pg_type.h"
+#include "common/hashfn.h"
+
+#include <stdio.h>
+#include <math.h>
+#include <string.h>
+#include <sys/time.h>
+#include <unistd.h>
+#include <limits.h>
+
+PG_MODULE_MAGIC;
+
+/*
+ * hashset
+ */
+typedef struct int4hashset_t {
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ int32 flags; /* reserved for future use (versioning, ...) */
+ int32 capacity; /* max number of element we have space for */
+ int32 nelements; /* number of items added to the hashset */
+ int32 hashfn_id; /* ID of the hash function used */
+ float4 load_factor; /* Load factor before triggering resize */
+ float4 growth_factor; /* Growth factor when resizing the hashset */
+ int32 ncollisions; /* Number of collisions */
+ int32 hash; /* Stored hash value of the hashset */
+ char data[FLEXIBLE_ARRAY_MEMBER];
+} int4hashset_t;
+
+static int4hashset_t *int4hashset_resize(int4hashset_t * set);
+static int4hashset_t *int4hashset_add_element(int4hashset_t *set, int32 value);
+static bool int4hashset_contains_element(int4hashset_t *set, int32 value);
+
+static Datum int32_to_array(FunctionCallInfo fcinfo, int32 * d, int len);
+
+#define PG_GETARG_INT4HASHSET(x) (int4hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(x))
+#define PG_GETARG_INT4HASHSET_COPY(x) (int4hashset_t *) PG_DETOAST_DATUM_COPY(PG_GETARG_DATUM(x))
+#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
+#define HASHSET_STEP 13
+#define JENKINS_LOOKUP3_HASHFN_ID 1
+#define MURMURHASH32_HASHFN_ID 2
+#define NAIVE_HASHFN_ID 3
+
+/*
+ * These defaults should match the the SQL function int4hashset()
+ */
+#define DEFAULT_INITIAL_CAPACITY 0
+#define DEFAULT_LOAD_FACTOR 0.75
+#define DEFAULT_GROWTH_FACTOR 2.0
+#define DEFAULT_HASHFN_ID JENKINS_LOOKUP3_HASHFN_ID
+
+PG_FUNCTION_INFO_V1(int4hashset_in);
+PG_FUNCTION_INFO_V1(int4hashset_out);
+PG_FUNCTION_INFO_V1(int4hashset_send);
+PG_FUNCTION_INFO_V1(int4hashset_recv);
+PG_FUNCTION_INFO_V1(int4hashset_add);
+PG_FUNCTION_INFO_V1(int4hashset_contains);
+PG_FUNCTION_INFO_V1(int4hashset_count);
+PG_FUNCTION_INFO_V1(int4hashset_merge);
+PG_FUNCTION_INFO_V1(int4hashset_init);
+PG_FUNCTION_INFO_V1(int4hashset_capacity);
+PG_FUNCTION_INFO_V1(int4hashset_collisions);
+PG_FUNCTION_INFO_V1(int4hashset_agg_add_set);
+PG_FUNCTION_INFO_V1(int4hashset_agg_add);
+PG_FUNCTION_INFO_V1(int4hashset_agg_final);
+PG_FUNCTION_INFO_V1(int4hashset_agg_combine);
+PG_FUNCTION_INFO_V1(int4hashset_to_array);
+PG_FUNCTION_INFO_V1(int4hashset_equals);
+PG_FUNCTION_INFO_V1(int4hashset_neq);
+PG_FUNCTION_INFO_V1(int4hashset_hash);
+PG_FUNCTION_INFO_V1(int4hashset_lt);
+PG_FUNCTION_INFO_V1(int4hashset_le);
+PG_FUNCTION_INFO_V1(int4hashset_gt);
+PG_FUNCTION_INFO_V1(int4hashset_ge);
+PG_FUNCTION_INFO_V1(int4hashset_cmp);
+
+Datum int4hashset_in(PG_FUNCTION_ARGS);
+Datum int4hashset_out(PG_FUNCTION_ARGS);
+Datum int4hashset_send(PG_FUNCTION_ARGS);
+Datum int4hashset_recv(PG_FUNCTION_ARGS);
+Datum int4hashset_add(PG_FUNCTION_ARGS);
+Datum int4hashset_contains(PG_FUNCTION_ARGS);
+Datum int4hashset_count(PG_FUNCTION_ARGS);
+Datum int4hashset_merge(PG_FUNCTION_ARGS);
+Datum int4hashset_init(PG_FUNCTION_ARGS);
+Datum int4hashset_capacity(PG_FUNCTION_ARGS);
+Datum int4hashset_collisions(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_add(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_add_set(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_final(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_combine(PG_FUNCTION_ARGS);
+Datum int4hashset_to_array(PG_FUNCTION_ARGS);
+Datum int4hashset_equals(PG_FUNCTION_ARGS);
+Datum int4hashset_neq(PG_FUNCTION_ARGS);
+Datum int4hashset_hash(PG_FUNCTION_ARGS);
+Datum int4hashset_lt(PG_FUNCTION_ARGS);
+Datum int4hashset_le(PG_FUNCTION_ARGS);
+Datum int4hashset_gt(PG_FUNCTION_ARGS);
+Datum int4hashset_ge(PG_FUNCTION_ARGS);
+Datum int4hashset_cmp(PG_FUNCTION_ARGS);
+
+/*
+ * hashset_isspace() --- a non-locale-dependent isspace()
+ *
+ * Identical to array_isspace() in src/backend/utils/adt/arrayfuncs.c.
+ * We used to use isspace() for parsing hashset values, but that has
+ * undesirable results: a hashset value might be silently interpreted
+ * differently depending on the locale setting. So here, we hard-wire
+ * the traditional ASCII definition of isspace().
+ */
+static bool
+hashset_isspace(char ch)
+{
+ if (ch == ' ' ||
+ ch == '\t' ||
+ ch == '\n' ||
+ ch == '\r' ||
+ ch == '\v' ||
+ ch == '\f')
+ return true;
+ return false;
+}
+
+static int4hashset_t *
+int4hashset_allocate(
+ int capacity,
+ float4 load_factor,
+ float4 growth_factor,
+ int hashfn_id
+)
+{
+ Size len;
+ int4hashset_t *set;
+ char *ptr;
+
+ /*
+ * Ensure that capacity is not divisible by HASHSET_STEP;
+ * i.e. the step size used in hashset_add_element()
+ * and hashset_contains_element().
+ */
+ while (capacity % HASHSET_STEP == 0)
+ capacity++;
+
+ len = offsetof(int4hashset_t, data);
+ len += CEIL_DIV(capacity, 8);
+ len += capacity * sizeof(int32);
+
+ ptr = palloc0(len);
+ SET_VARSIZE(ptr, len);
+
+ set = (int4hashset_t *) ptr;
+
+ set->flags = 0;
+ set->capacity = capacity;
+ set->nelements = 0;
+ set->hashfn_id = hashfn_id;
+ set->load_factor = load_factor;
+ set->growth_factor = growth_factor;
+ set->hash = 0; /* Initial hash value */
+
+ set->flags |= 0;
+
+ return set;
+}
+
+Datum
+int4hashset_in(PG_FUNCTION_ARGS)
+{
+ char *str = PG_GETARG_CSTRING(0);
+ char *endptr;
+ int32 len = strlen(str);
+ int4hashset_t *set;
+ int64 value;
+
+ /* Skip initial spaces */
+ while (hashset_isspace(*str)) str++;
+
+ /* Check the opening brace */
+ if (*str != '{')
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for hashset: \"%s\"", str),
+ errdetail("Hashset representation must start with \"{\".")));
+ }
+
+ /* Start parsing from the first number (after the opening brace) */
+ str++;
+
+ /* Initial size based on input length (arbitrary, could be optimized) */
+ set = int4hashset_allocate(
+ len/2,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ while (true)
+ {
+ /* Skip spaces before number */
+ while (hashset_isspace(*str)) str++;
+
+ /* Check for closing brace, handling the case for an empty set */
+ if (*str == '}')
+ {
+ str++; /* Move past the closing brace */
+ break;
+ }
+
+ /* Parse the number */
+ value = strtol(str, &endptr, 10);
+
+ if (errno == ERANGE || value < PG_INT32_MIN || value > PG_INT32_MAX)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value \"%s\" is out of range for type %s", str,
+ "integer")));
+ }
+
+ /* Add the value to the hashset, resize if needed */
+ if (set->nelements >= set->capacity)
+ {
+ set = int4hashset_resize(set);
+ }
+ set = int4hashset_add_element(set, (int32)value);
+
+ /* Error handling for strtol */
+ if (endptr == str)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for integer: \"%s\"", str)));
+ }
+
+ str = endptr; /* Move to next potential number or closing brace */
+
+ /* Skip spaces before the next number or closing brace */
+ while (hashset_isspace(*str)) str++;
+
+ if (*str == ',')
+ {
+ str++; /* Skip comma before next loop iteration */
+ }
+ else if (*str != '}')
+ {
+ /* Unexpected character */
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("unexpected character \"%c\" in hashset input", *str)));
+ }
+ }
+
+ /* Only whitespace is allowed after the closing brace */
+ while (*str)
+ {
+ if (!hashset_isspace(*str))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed hashset literal: \"%s\"", str),
+ errdetail("Junk after closing right brace.")));
+ }
+ str++;
+ }
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_out(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = (int4hashset_t *) PG_GETARG_INT4HASHSET(0);
+ char *bitmap;
+ int32 *values;
+ int i;
+ StringInfoData str;
+
+ /* Calculate the pointer to the bitmap and values array */
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ /* Initialize the StringInfo buffer */
+ initStringInfo(&str);
+
+ /* Append the opening brace for the output hashset string */
+ appendStringInfoChar(&str, '{');
+
+ /* Loop through the elements and append them to the string */
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ /* Check if the bit in the bitmap is set */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Append the value */
+ if (str.len > 1)
+ appendStringInfoChar(&str, ',');
+ appendStringInfo(&str, "%d", values[i]);
+ }
+ }
+
+ /* Append the closing brace for the output hashset string */
+ appendStringInfoChar(&str, '}');
+
+ /* Return the resulting string */
+ PG_RETURN_CSTRING(str.data);
+}
+
+
+Datum
+int4hashset_send(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = (int4hashset_t *) PG_GETARG_INT4HASHSET(0);
+ StringInfoData buf;
+ int32 data_size;
+
+ /* Begin constructing the message */
+ pq_begintypsend(&buf);
+
+ /* Send the non-data fields */
+ pq_sendint32(&buf, set->flags);
+ pq_sendint32(&buf, set->capacity);
+ pq_sendint32(&buf, set->nelements);
+ pq_sendint32(&buf, set->hashfn_id);
+
+ /* Compute and send the size of the data field */
+ data_size = VARSIZE(set) - offsetof(int4hashset_t, data);
+ pq_sendbytes(&buf, set->data, data_size);
+
+ PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+}
+
+
+Datum
+int4hashset_recv(PG_FUNCTION_ARGS)
+{
+ StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
+ int4hashset_t *set;
+ int32 data_size;
+ Size total_size;
+ const char *binary_data;
+
+ /* Read fields from buffer */
+ int32 flags = pq_getmsgint(buf, 4);
+ int32 capacity = pq_getmsgint(buf, 4);
+ int32 nelements = pq_getmsgint(buf, 4);
+ int32 hashfn_id = pq_getmsgint(buf, 4);
+
+ /* Compute the size of the data field */
+ data_size = buf->len - buf->cursor;
+
+ /* Read the binary data */
+ binary_data = pq_getmsgbytes(buf, data_size);
+
+ /* Compute total size of hashset_t */
+ total_size = offsetof(int4hashset_t, data) + data_size;
+
+ /* Allocate memory for hashset including the data field */
+ set = (int4hashset_t *) palloc0(total_size);
+
+ /* Set the size of the variable-length data structure */
+ SET_VARSIZE(set, total_size);
+
+ /* Populate the structure */
+ set->flags = flags;
+ set->capacity = capacity;
+ set->nelements = nelements;
+ set->hashfn_id = hashfn_id;
+ memcpy(set->data, binary_data, data_size);
+
+ PG_RETURN_POINTER(set);
+}
+
+
+Datum
+int4hashset_to_array(PG_FUNCTION_ARGS)
+{
+ int i,
+ idx;
+ int4hashset_t *set;
+ int32 *values;
+ int nvalues;
+
+ char *sbitmap;
+ int32 *svalues;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = (int4hashset_t *) PG_GETARG_INT4HASHSET(0);
+
+ sbitmap = set->data;
+ svalues = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ /* number of values to store in the array */
+ nvalues = set->nelements;
+ values = (int32 *) palloc(sizeof(int32) * nvalues);
+
+ idx = 0;
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (sbitmap[byte] & (0x01 << bit))
+ values[idx++] = svalues[i];
+ }
+
+ Assert(idx == nvalues);
+
+ return int32_to_array(fcinfo, values, nvalues);
+}
+
+
+/*
+ * construct an SQL array from a simple C double array
+ */
+static Datum
+int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len)
+{
+ ArrayBuildState *astate = NULL;
+ int i;
+
+ for (i = 0; i < len; i++)
+ {
+ /* stash away this field */
+ astate = accumArrayResult(astate,
+ Int32GetDatum(d[i]),
+ false,
+ INT4OID,
+ CurrentMemoryContext);
+ }
+
+ PG_RETURN_DATUM(makeArrayResult(astate,
+ CurrentMemoryContext));
+}
+
+
+static int4hashset_t *
+int4hashset_resize(int4hashset_t * set)
+{
+ int i;
+ int4hashset_t *new;
+ char *bitmap;
+ int32 *values;
+
+ new = int4hashset_allocate(
+ set->capacity * 2,
+ set->load_factor,
+ set->growth_factor,
+ set->hashfn_id
+ );
+
+ /* Calculate the pointer to the bitmap and values array */
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ int4hashset_add_element(new, values[i]);
+ }
+
+ return new;
+}
+
+static int4hashset_t *
+int4hashset_add_element(int4hashset_t *set, int32 value)
+{
+ int byte;
+ int bit;
+ uint32 hash;
+ uint32 position;
+ char *bitmap;
+ int32 *values;
+
+ if (set->nelements > set->capacity * set->load_factor)
+ set = int4hashset_resize(set);
+
+ if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
+ {
+ hash = hash_bytes_uint32((uint32) value);
+ }
+ else if (set->hashfn_id == MURMURHASH32_HASHFN_ID)
+ {
+ hash = murmurhash32((uint32) value);
+ }
+ else if (set->hashfn_id == NAIVE_HASHFN_ID)
+ {
+ hash = ((uint32) value * 7691 + 4201);
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid hash function ID: \"%d\"", set->hashfn_id)));
+ }
+
+ position = hash % set->capacity;
+
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ while (true)
+ {
+ byte = (position / 8);
+ bit = (position % 8);
+
+ /* the item is already used - maybe it's the same value? */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* same value, we're done */
+ if (values[position] == value)
+ break;
+
+ /* Increment the collision counter */
+ set->ncollisions++;
+
+ position = (position + HASHSET_STEP) % set->capacity;
+ continue;
+ }
+
+ /* found an empty spot, before hitting the value first */
+ bitmap[byte] |= (0x01 << bit);
+ values[position] = value;
+
+ set->hash ^= hash;
+
+ set->nelements++;
+
+ break;
+ }
+
+ return set;
+}
+
+static bool
+int4hashset_contains_element(int4hashset_t *set, int32 value)
+{
+ int byte;
+ int bit;
+ uint32 hash;
+ char *bitmap;
+ int32 *values;
+ int num_probes = 0; /* Add a counter for the number of probes */
+
+ if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
+ {
+ hash = hash_bytes_uint32((uint32) value) % set->capacity;
+ }
+ else if (set->hashfn_id == MURMURHASH32_HASHFN_ID)
+ {
+ hash = murmurhash32((uint32) value) % set->capacity;
+ }
+ else if (set->hashfn_id == NAIVE_HASHFN_ID)
+ {
+ hash = ((uint32) value * 7691 + 4201) % set->capacity;
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid hash function ID: \"%d\"", set->hashfn_id)));
+ }
+
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ while (true)
+ {
+ byte = (hash / 8);
+ bit = (hash % 8);
+
+ /* found an empty slot, value is not there */
+ if ((bitmap[byte] & (0x01 << bit)) == 0)
+ return false;
+
+ /* is it the same value? */
+ if (values[hash] == value)
+ return true;
+
+ /* move to the next element */
+ hash = (hash + HASHSET_STEP) % set->capacity;
+
+ num_probes++; /* Increment the number of probes */
+
+ /* Check if we have probed all slots */
+ if (num_probes >= set->capacity)
+ return false; /* Avoid infinite loop */
+ }
+}
+
+Datum
+int4hashset_add(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+
+ if (PG_ARGISNULL(1))
+ {
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+ }
+
+ /* if there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ set = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ }
+ else
+ {
+ /* make sure we are working with a non-toasted and non-shared copy of the input */
+ set = PG_GETARG_INT4HASHSET_COPY(0);
+ }
+
+ set = int4hashset_add_element(set, PG_GETARG_INT32(1));
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_merge(PG_FUNCTION_ARGS)
+{
+ int i;
+
+ int4hashset_t *seta;
+ int4hashset_t *setb;
+
+ char *bitmap;
+ int32_t *values;
+
+ if (PG_ARGISNULL(0) && PG_ARGISNULL(1))
+ PG_RETURN_NULL();
+ else if (PG_ARGISNULL(1))
+ PG_RETURN_POINTER(PG_GETARG_INT4HASHSET(0));
+ else if (PG_ARGISNULL(0))
+ PG_RETURN_POINTER(PG_GETARG_INT4HASHSET(1));
+
+ seta = PG_GETARG_INT4HASHSET_COPY(0);
+ setb = PG_GETARG_INT4HASHSET(1);
+
+ bitmap = setb->data;
+ values = (int32 *) (setb->data + CEIL_DIV(setb->capacity, 8));
+
+ for (i = 0; i < setb->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ seta = int4hashset_add_element(seta, values[i]);
+ }
+
+ PG_RETURN_POINTER(seta);
+}
+
+Datum
+int4hashset_init(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ int32 initial_capacity = PG_GETARG_INT32(0);
+ float4 load_factor = PG_GETARG_FLOAT4(1);
+ float4 growth_factor = PG_GETARG_FLOAT4(2);
+ int32 hashfn_id = PG_GETARG_INT32(3);
+
+ /* Validate input arguments */
+ if (!(initial_capacity >= 0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("initial capacity cannot be negative")));
+ }
+
+ if (!(load_factor > 0.0 && load_factor < 1.0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("load factor must be between 0.0 and 1.0")));
+ }
+
+ if (!(growth_factor > 1.0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("growth factor must be greater than 1.0")));
+ }
+
+ if (!(hashfn_id == JENKINS_LOOKUP3_HASHFN_ID ||
+ hashfn_id == MURMURHASH32_HASHFN_ID ||
+ hashfn_id == NAIVE_HASHFN_ID))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Invalid hash function ID")));
+ }
+
+ set = int4hashset_allocate(
+ initial_capacity,
+ load_factor,
+ growth_factor,
+ hashfn_id
+ );
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_contains(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ int32 value;
+
+ if (PG_ARGISNULL(1) || PG_ARGISNULL(0))
+ PG_RETURN_BOOL(false);
+
+ set = PG_GETARG_INT4HASHSET(0);
+ value = PG_GETARG_INT32(1);
+
+ PG_RETURN_BOOL(int4hashset_contains_element(set, value));
+}
+
+Datum
+int4hashset_count(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->nelements);
+}
+
+Datum
+int4hashset_capacity(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ PG_RETURN_INT64(set->capacity);
+}
+
+Datum
+int4hashset_collisions(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->ncollisions);
+}
+
+Datum
+int4hashset_agg_add(PG_FUNCTION_ARGS)
+{
+ MemoryContext oldcontext;
+ int4hashset_t *state;
+
+ MemoryContext aggcontext;
+
+ /* cannot be called directly because of internal-type argument */
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_add_add called in non-aggregate context");
+
+ /*
+ * We want to skip NULL values altogether - we return either the existing
+ * hashset (if it already exists) or NULL.
+ */
+ if (PG_ARGISNULL(1))
+ {
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ /* if there already is a state accumulated, don't forget it */
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+ }
+
+ /* if there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ state = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_add_element(state, PG_GETARG_INT32(1));
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(state);
+}
+
+Datum
+int4hashset_agg_add_set(PG_FUNCTION_ARGS)
+{
+ MemoryContext oldcontext;
+ int4hashset_t *state;
+
+ MemoryContext aggcontext;
+
+ /* cannot be called directly because of internal-type argument */
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_add_add called in non-aggregate context");
+
+ /*
+ * We want to skip NULL values altogether - we return either the existing
+ * hashset (if it already exists) or NULL.
+ */
+ if (PG_ARGISNULL(1))
+ {
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ /* if there already is a state accumulated, don't forget it */
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+ }
+
+ /* if there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ state = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+
+ {
+ int i;
+ char *bitmap;
+ int32 *values;
+ int4hashset_t *value;
+
+ value = PG_GETARG_INT4HASHSET(1);
+
+ bitmap = value->data;
+ values = (int32 *) (value->data + CEIL_DIV(value->capacity, 8));
+
+ for (i = 0; i < value->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ state = int4hashset_add_element(state, values[i]);
+ }
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(state);
+}
+
+Datum
+int4hashset_agg_final(PG_FUNCTION_ARGS)
+{
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ PG_RETURN_POINTER(PG_GETARG_POINTER(0));
+}
+
+static int4hashset_t *
+int4hashset_copy(int4hashset_t *src)
+{
+ return src;
+}
+
+Datum
+int4hashset_agg_combine(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *src;
+ int4hashset_t *dst;
+ MemoryContext aggcontext;
+ MemoryContext oldcontext;
+
+ char *bitmap;
+ int32 *values;
+
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_agg_combine called in non-aggregate context");
+
+ /* if no "merged" state yet, try creating it */
+ if (PG_ARGISNULL(0))
+ {
+ /* nope, the second argument is NULL to, so return NULL */
+ if (PG_ARGISNULL(1))
+ PG_RETURN_NULL();
+
+ /* the second argument is not NULL, so copy it */
+ src = (int4hashset_t *) PG_GETARG_POINTER(1);
+
+ /* copy the hashset into the right long-lived memory context */
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ src = int4hashset_copy(src);
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(src);
+ }
+
+ /*
+ * If the second argument is NULL, just return the first one (we know
+ * it's not NULL at this point).
+ */
+ if (PG_ARGISNULL(1))
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+
+ /* Now we know neither argument is NULL, so merge them. */
+ src = (int4hashset_t *) PG_GETARG_POINTER(1);
+ dst = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ bitmap = src->data;
+ values = (int32 *) (src->data + CEIL_DIV(src->capacity, 8));
+
+ for (i = 0; i < src->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ dst = int4hashset_add_element(dst, values[i]);
+ }
+
+
+ PG_RETURN_POINTER(dst);
+}
+
+
+Datum
+int4hashset_equals(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+
+ char *bitmap_a;
+ int32 *values_a;
+ int i;
+
+ /*
+ * Check if the number of elements is the same
+ */
+ if (a->nelements != b->nelements)
+ PG_RETURN_BOOL(false);
+
+ bitmap_a = a->data;
+ values_a = (int32 *)(a->data + CEIL_DIV(a->capacity, 8));
+
+ /*
+ * Check if every element in a is also in b
+ */
+ for (i = 0; i < a->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap_a[byte] & (0x01 << bit))
+ {
+ int32 value = values_a[i];
+
+ if (!int4hashset_contains_element(b, value))
+ PG_RETURN_BOOL(false);
+ }
+ }
+
+ /*
+ * All elements in a are in b and the number of elements is the same,
+ * so the sets must be equal.
+ */
+ PG_RETURN_BOOL(true);
+}
+
+
+Datum
+int4hashset_neq(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+
+ /* If a is not equal to b, then they are not equal */
+ if (!DatumGetBool(DirectFunctionCall2(int4hashset_equals, PointerGetDatum(a), PointerGetDatum(b))))
+ PG_RETURN_BOOL(true);
+
+ PG_RETURN_BOOL(false);
+}
+
+
+Datum int4hashset_hash(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT32(set->hash);
+}
+
+
+Datum
+int4hashset_lt(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp < 0);
+}
+
+
+Datum
+int4hashset_le(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp <= 0);
+}
+
+
+Datum
+int4hashset_gt(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp > 0);
+}
+
+
+Datum
+int4hashset_ge(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp >= 0);
+}
+
+
+static int
+int32_cmp(const void *a, const void *b)
+{
+ int32 arg1 = *(const int32 *)a;
+ int32 arg2 = *(const int32 *)b;
+
+ if (arg1 < arg2) return -1;
+ if (arg1 > arg2) return 1;
+ return 0;
+}
+
+
+static int32 *
+int4hashset_extract_sorted_elements(int4hashset_t *set)
+{
+ /* Allocate memory for the elements array */
+ int32 *elements = palloc(set->nelements * sizeof(int32));
+
+ /* Access the data array */
+ char *bitmap = set->data;
+ int32 *values = (int32 *)(set->data + CEIL_DIV(set->capacity, 8));
+
+ /* Counter for the number of extracted elements */
+ int32 nextracted = 0;
+
+ /* Iterate through all elements */
+ for (int32 i = 0; i < set->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ /* Check if the current position is occupied */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Add the value to the elements array */
+ elements[nextracted++] = values[i];
+ }
+ }
+
+ /* Make sure we extracted the correct number of elements */
+ Assert(nextracted == set->nelements);
+
+ /* Sort the elements array */
+ qsort(elements, nextracted, sizeof(int32), int32_cmp);
+
+ /* Return the sorted elements array */
+ return elements;
+}
+
+
+Datum
+int4hashset_cmp(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 *elements_a;
+ int32 *elements_b;
+
+ /*
+ * Compare the hashes first, if they are different,
+ * we can immediately tell which set is 'greater'
+ */
+ if (a->hash < b->hash)
+ PG_RETURN_INT32(-1);
+ else if (a->hash > b->hash)
+ PG_RETURN_INT32(1);
+
+ /*
+ * If hashes are equal, perform a more rigorous comparison
+ */
+
+ /*
+ * If number of elements are different,
+ * we can use that to deterministically return -1 or 1
+ */
+ if (a->nelements < b->nelements)
+ PG_RETURN_INT32(-1);
+ else if (a->nelements > b->nelements)
+ PG_RETURN_INT32(1);
+
+ /* Assert that the number of elements in both hashsets are equal */
+ Assert(a->nelements == b->nelements);
+
+ /* Extract and sort elements from each set */
+ elements_a = int4hashset_extract_sorted_elements(a);
+ elements_b = int4hashset_extract_sorted_elements(b);
+
+ /* Now we can perform a lexicographical comparison */
+ for (int32 i = 0; i < a->nelements; i++)
+ {
+ if (elements_a[i] < elements_b[i])
+ {
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(-1);
+ }
+ else if (elements_a[i] > elements_b[i])
+ {
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(1);
+ }
+ }
+
+ /* All elements are equal, so the sets are equal */
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(0);
+}
diff --git a/hashset.control b/hashset.control
new file mode 100644
index 0000000..0743003
--- /dev/null
+++ b/hashset.control
@@ -0,0 +1,3 @@
+comment = 'Provides hashset type.'
+default_version = '0.0.1'
+relocatable = true
diff --git a/test/c_tests/test_send_recv.c b/test/c_tests/test_send_recv.c
new file mode 100644
index 0000000..cc7c48a
--- /dev/null
+++ b/test/c_tests/test_send_recv.c
@@ -0,0 +1,92 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <libpq-fe.h>
+
+void exit_nicely(PGconn *conn) {
+ PQfinish(conn);
+ exit(1);
+}
+
+int main() {
+ /* Connect to database specified by the PGDATABASE environment variable */
+ const char *hostname = getenv("PGHOST");
+ char conninfo[1024];
+ PGconn *conn;
+
+ if (hostname == NULL)
+ hostname = "localhost";
+
+ /* Connect to database specified by the PGDATABASE environment variable */
+ snprintf(conninfo, sizeof(conninfo), "host=%s port=5432", hostname);
+ conn = PQconnectdb(conninfo);
+ if (PQstatus(conn) != CONNECTION_OK) {
+ fprintf(stderr, "Connection to database failed: %s", PQerrorMessage(conn));
+ exit_nicely(conn);
+ }
+
+ /* Create extension */
+ PQexec(conn, "CREATE EXTENSION IF NOT EXISTS hashset");
+
+ /* Create temporary table */
+ PQexec(conn, "CREATE TABLE IF NOT EXISTS test_hashset_send_recv (hashset_col int4hashset)");
+
+ /* Enable binary output */
+ PQexec(conn, "SET bytea_output = 'escape'");
+
+ /* Insert dummy data */
+ const char *insert_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ('{1,2,3}'::int4hashset)";
+ PGresult *res = PQexec(conn, insert_command);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK) {
+ fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+ PQclear(res);
+
+ /* Fetch the data in binary format */
+ const char *select_command = "SELECT hashset_col FROM test_hashset_send_recv";
+ int resultFormat = 1; /* 0 = text, 1 = binary */
+ res = PQexecParams(conn, select_command, 0, NULL, NULL, NULL, NULL, resultFormat);
+ if (PQresultStatus(res) != PGRES_TUPLES_OK) {
+ fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+
+ /* Store binary data for later use */
+ const char *binary_data = PQgetvalue(res, 0, 0);
+ int binary_data_length = PQgetlength(res, 0, 0);
+ PQclear(res);
+
+ /* Re-insert the binary data */
+ const char *insert_binary_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ($1)";
+ const char *paramValues[1] = {binary_data};
+ int paramLengths[1] = {binary_data_length};
+ int paramFormats[1] = {1}; /* binary format */
+ res = PQexecParams(conn, insert_binary_command, 1, NULL, paramValues, paramLengths, paramFormats, 0);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK) {
+ fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+ PQclear(res);
+
+ /* Check the data */
+ const char *check_command = "SELECT COUNT(DISTINCT hashset_col) AS unique_count, COUNT(*) FROM test_hashset_send_recv";
+ res = PQexec(conn, check_command);
+ if (PQresultStatus(res) != PGRES_TUPLES_OK) {
+ fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+
+ /* Print the results */
+ printf("unique_count: %s\n", PQgetvalue(res, 0, 0));
+ printf("count: %s\n", PQgetvalue(res, 0, 1));
+ PQclear(res);
+
+ /* Disconnect */
+ PQfinish(conn);
+
+ return 0;
+}
diff --git a/test/c_tests/test_send_recv.sh b/test/c_tests/test_send_recv.sh
new file mode 100755
index 0000000..ab308b3
--- /dev/null
+++ b/test/c_tests/test_send_recv.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+# Get the directory of this script
+SCRIPT_DIR="$(dirname "$(realpath "$0")")"
+
+# Set up database
+export PGDATABASE=test_hashset_send_recv
+dropdb --if-exists "$PGDATABASE"
+createdb
+
+# Define directories
+EXPECTED_DIR="$SCRIPT_DIR/../expected"
+RESULTS_DIR="$SCRIPT_DIR/../results"
+
+# Create the results directory if it doesn't exist
+mkdir -p "$RESULTS_DIR"
+
+# Run the C test and save its output to the results directory
+"$SCRIPT_DIR/test_send_recv" > "$RESULTS_DIR/test_send_recv.out"
+
+printf "test test_send_recv ... "
+
+# Compare the actual output with the expected output
+if diff -q "$RESULTS_DIR/test_send_recv.out" "$EXPECTED_DIR/test_send_recv.out" > /dev/null 2>&1; then
+ echo "ok"
+ # Clean up by removing the results directory if the test passed
+ rm -r "$RESULTS_DIR"
+else
+ echo "failed"
+ git diff --no-index --color "$EXPECTED_DIR/test_send_recv.out" "$RESULTS_DIR/test_send_recv.out"
+fi
diff --git a/test/expected/basic.out b/test/expected/basic.out
new file mode 100644
index 0000000..19a8bd0
--- /dev/null
+++ b/test/expected/basic.out
@@ -0,0 +1,286 @@
+/*
+ * Hashset Type
+ */
+SELECT '{}'::int4hashset; -- empty int4hashset
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset;
+ int4hashset
+-------------
+ {3,2,1}
+(1 row)
+
+SELECT '{-2147483648,0,2147483647}'::int4hashset;
+ int4hashset
+----------------------------
+ {0,2147483647,-2147483648}
+(1 row)
+
+SELECT '{-2147483649}'::int4hashset; -- out of range
+ERROR: value "-2147483649}" is out of range for type integer
+LINE 1: SELECT '{-2147483649}'::int4hashset;
+ ^
+SELECT '{2147483648}'::int4hashset; -- out of range
+ERROR: value "2147483648}" is out of range for type integer
+LINE 1: SELECT '{2147483648}'::int4hashset;
+ ^
+/*
+ * Hashset Functions
+ */
+SELECT int4hashset();
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT int4hashset(
+ capacity := 10,
+ load_factor := 0.9,
+ growth_factor := 1.1,
+ hashfn_id := 1
+);
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT hashset_add(int4hashset(), 123);
+ hashset_add
+-------------
+ {123}
+(1 row)
+
+SELECT hashset_add(NULL::int4hashset, 123);
+ hashset_add
+-------------
+ {123}
+(1 row)
+
+SELECT hashset_add('{123}'::int4hashset, 456);
+ hashset_add
+-------------
+ {456,123}
+(1 row)
+
+SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
+ hashset_contains
+------------------
+ t
+(1 row)
+
+SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
+ hashset_contains
+------------------
+ f
+(1 row)
+
+SELECT hashset_merge('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+ hashset_merge
+---------------
+ {3,1,2}
+(1 row)
+
+SELECT hashset_to_array('{1,2,3}'::int4hashset);
+ hashset_to_array
+------------------
+ {3,2,1}
+(1 row)
+
+SELECT hashset_count('{1,2,3}'::int4hashset); -- 3
+ hashset_count
+---------------
+ 3
+(1 row)
+
+SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
+ hashset_capacity
+------------------
+ 10
+(1 row)
+
+/*
+ * Aggregation Functions
+ */
+SELECT hashset_agg(i) FROM generate_series(1,10) AS i;
+ hashset_agg
+------------------------
+ {6,10,1,8,2,3,4,5,9,7}
+(1 row)
+
+SELECT hashset_agg(h) FROM
+(
+ SELECT hashset_agg(i) AS h FROM generate_series(1,5) AS i
+ UNION ALL
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
+) q;
+ hashset_agg
+------------------------
+ {6,8,1,3,2,10,4,5,9,7}
+(1 row)
+
+/*
+ * Operator Definitions
+ */
+SELECT '{2}'::int4hashset = '{1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset = '{2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset = '{3}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{2,3,1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{4,5,6}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3,4}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{2,3,1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{4,5,6}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset || 4;
+ ?column?
+-----------
+ {1,3,2,4}
+(1 row)
+
+SELECT 4 || '{1,2,3}'::int4hashset;
+ ?column?
+-----------
+ {1,3,2,4}
+(1 row)
+
+/*
+ * Hashset Hash Operators
+ */
+SELECT hashset_hash('{1,2,3}'::int4hashset);
+ hashset_hash
+--------------
+ 868123687
+(1 row)
+
+SELECT hashset_hash('{3,2,1}'::int4hashset);
+ hashset_hash
+--------------
+ 868123687
+(1 row)
+
+SELECT COUNT(*), COUNT(DISTINCT h)
+FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3,2,1}'::int4hashset AS h
+) q;
+ count | count
+-------+-------
+ 2 | 1
+(1 row)
+
+/*
+ * Hashset Btree Operators
+ *
+ * Ordering of hashsets is not based on lexicographic order of elements.
+ * - If two hashsets are not equal, they retain consistent relative order.
+ * - If two hashsets are equal but have elements in different orders, their
+ * ordering is non-deterministic. This is inherent since the comparison
+ * function must return 0 for equal hashsets, giving no indication of order.
+ */
+SELECT h FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{4,5,6}'::int4hashset AS h
+ UNION ALL
+ SELECT '{7,8,9}'::int4hashset AS h
+) q
+ORDER BY h;
+ h
+---------
+ {9,7,8}
+ {3,2,1}
+ {5,6,4}
+(3 rows)
+
diff --git a/test/expected/invalid.out b/test/expected/invalid.out
new file mode 100644
index 0000000..bd44199
--- /dev/null
+++ b/test/expected/invalid.out
@@ -0,0 +1,4 @@
+SELECT '{1,2s}'::int4hashset;
+ERROR: unexpected character "s" in hashset input
+LINE 1: SELECT '{1,2s}'::int4hashset;
+ ^
diff --git a/test/expected/io_varying_lengths.out b/test/expected/io_varying_lengths.out
new file mode 100644
index 0000000..45e9fb1
--- /dev/null
+++ b/test/expected/io_varying_lengths.out
@@ -0,0 +1,100 @@
+/*
+ * This test verifies the hashset input/output functions for varying
+ * initial capacities, ensuring functionality across different sizes.
+ */
+SELECT hashset_sorted('{1}'::int4hashset);
+ hashset_sorted
+----------------
+ {1}
+(1 row)
+
+SELECT hashset_sorted('{1,2}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::int4hashset);
+ hashset_sorted
+-----------------
+ {1,2,3,4,5,6,7}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::int4hashset);
+ hashset_sorted
+-------------------
+ {1,2,3,4,5,6,7,8}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::int4hashset);
+ hashset_sorted
+---------------------
+ {1,2,3,4,5,6,7,8,9}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::int4hashset);
+ hashset_sorted
+------------------------
+ {1,2,3,4,5,6,7,8,9,10}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::int4hashset);
+ hashset_sorted
+---------------------------
+ {1,2,3,4,5,6,7,8,9,10,11}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::int4hashset);
+ hashset_sorted
+------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::int4hashset);
+ hashset_sorted
+---------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::int4hashset);
+ hashset_sorted
+------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::int4hashset);
+ hashset_sorted
+---------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::int4hashset);
+ hashset_sorted
+------------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
+(1 row)
+
diff --git a/test/expected/parsing.out b/test/expected/parsing.out
new file mode 100644
index 0000000..263797e
--- /dev/null
+++ b/test/expected/parsing.out
@@ -0,0 +1,71 @@
+/* Valid */
+SELECT '{1,23,-456}'::int4hashset;
+ int4hashset
+-------------
+ {1,-456,23}
+(1 row)
+
+SELECT ' { 1 , 23 , -456 } '::int4hashset;
+ int4hashset
+-------------
+ {1,-456,23}
+(1 row)
+
+/* Only whitespace is allowed after the closing brace */
+SELECT ' { 1 , 23 , -456 } 1'::int4hashset; -- error
+ERROR: malformed hashset literal: "1"
+LINE 2: SELECT ' { 1 , 23 , -456 } 1'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } ,'::int4hashset; -- error
+ERROR: malformed hashset literal: ","
+LINE 1: SELECT ' { 1 , 23 , -456 } ,'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } {'::int4hashset; -- error
+ERROR: malformed hashset literal: "{"
+LINE 1: SELECT ' { 1 , 23 , -456 } {'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } }'::int4hashset; -- error
+ERROR: malformed hashset literal: "}"
+LINE 1: SELECT ' { 1 , 23 , -456 } }'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } x'::int4hashset; -- error
+ERROR: malformed hashset literal: "x"
+LINE 1: SELECT ' { 1 , 23 , -456 } x'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+/* Unexpected character when expecting closing brace */
+SELECT ' { 1 , 23 , -456 1'::int4hashset; -- error
+ERROR: unexpected character "1" in hashset input
+LINE 2: SELECT ' { 1 , 23 , -456 1'::int4hashset;
+ ^
+SELECT ' { 1 , 23 , -456 {'::int4hashset; -- error
+ERROR: unexpected character "{" in hashset input
+LINE 1: SELECT ' { 1 , 23 , -456 {'::int4hashset;
+ ^
+SELECT ' { 1 , 23 , -456 x'::int4hashset; -- error
+ERROR: unexpected character "x" in hashset input
+LINE 1: SELECT ' { 1 , 23 , -456 x'::int4hashset;
+ ^
+/* Error handling for strtol */
+SELECT ' { , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for integer: ", 23 , -456 } "
+LINE 2: SELECT ' { , 23 , -456 } '::int4hashset;
+ ^
+SELECT ' { 1 , 23 , '::int4hashset; -- error
+ERROR: invalid input syntax for integer: ""
+LINE 1: SELECT ' { 1 , 23 , '::int4hashset;
+ ^
+SELECT ' { s , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for integer: "s , 23 , -456 } "
+LINE 1: SELECT ' { s , 23 , -456 } '::int4hashset;
+ ^
+/* Missing opening brace */
+SELECT ' 1 , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for hashset: "1 , 23 , -456 } "
+LINE 2: SELECT ' 1 , 23 , -456 } '::int4hashset;
+ ^
+DETAIL: Hashset representation must start with "{".
diff --git a/test/expected/prelude.out b/test/expected/prelude.out
new file mode 100644
index 0000000..f34e190
--- /dev/null
+++ b/test/expected/prelude.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION hashset;
+CREATE OR REPLACE FUNCTION hashset_sorted(int4hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/expected/random.out b/test/expected/random.out
new file mode 100644
index 0000000..9d9026b
--- /dev/null
+++ b/test/expected/random.out
@@ -0,0 +1,38 @@
+SELECT setseed(0.12345);
+ setseed
+---------
+
+(1 row)
+
+\set MAX_INT 2147483647
+CREATE TABLE hashset_random_int4_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(int4hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_int4_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_int4_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
diff --git a/test/expected/reported_bugs.out b/test/expected/reported_bugs.out
new file mode 100644
index 0000000..f258a19
--- /dev/null
+++ b/test/expected/reported_bugs.out
@@ -0,0 +1,68 @@
+/*
+ * Bug in hashset_add() and hashset_merge() functions altering original hashset.
+ *
+ * Previously, the hashset_add() and hashset_merge() functions were modifying the
+ * original hashset in-place, leading to unexpected results as the original data
+ * within the hashset was being altered.
+ *
+ * The issue was addressed by implementing a macro function named
+ * PG_GETARG_INT4HASHSET_COPY() within the C code. This function guarantees that
+ * a copy of the hashset is created and subsequently modified, thereby preserving
+ * the integrity of the original hashset.
+ *
+ * As a result of this fix, hashset_add() and hashset_merge() now operate on
+ * a copied hashset, ensuring that the original data remains unaltered, and
+ * the query executes correctly.
+ */
+SELECT
+ q.hashset_agg,
+ hashset_add(hashset_agg,4)
+FROM
+(
+ SELECT
+ hashset_agg(generate_series)
+ FROM generate_series(1,3)
+) q;
+ hashset_agg | hashset_add
+-------------+-------------
+ {3,1,2} | {3,4,1,2}
+(1 row)
+
+/*
+ * Bug in hashset_hash() function with respect to element insertion order.
+ *
+ * Prior to the fix, the hashset_hash() function was accumulating the hashes
+ * of individual elements in a non-commutative manner. As a consequence, the
+ * final hash value was sensitive to the order in which elements were inserted
+ * into the hashset. This behavior led to inconsistencies, as logically
+ * equivalent sets (i.e., sets with the same elements but in different orders)
+ * produced different hash values.
+ *
+ * The bug was fixed by modifying the hashset_hash() function to use a
+ * commutative operation when combining the hashes of individual elements.
+ * This change ensures that the final hash value is independent of the
+ * element insertion order, and logically equivalent sets produce the
+ * same hash.
+ */
+SELECT hashset_hash('{1,2}'::int4hashset);
+ hashset_hash
+--------------
+ -840053840
+(1 row)
+
+SELECT hashset_hash('{2,1}'::int4hashset);
+ hashset_hash
+--------------
+ -840053840
+(1 row)
+
+SELECT hashset_cmp('{1,2}','{2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2}');
+ hashset_cmp
+-------------
+ 0
+(1 row)
+
diff --git a/test/expected/table.out b/test/expected/table.out
new file mode 100644
index 0000000..9793a49
--- /dev/null
+++ b/test/expected/table.out
@@ -0,0 +1,25 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes int4hashset DEFAULT int4hashset(capacity := 2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+ hashset_contains
+------------------
+ t
+(1 row)
+
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1;
+ hashset_count
+---------------
+ 2
+(1 row)
+
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
+ hashset_sorted
+----------------
+ {101,202}
+(1 row)
+
diff --git a/test/expected/test_send_recv.out b/test/expected/test_send_recv.out
new file mode 100644
index 0000000..12382d5
--- /dev/null
+++ b/test/expected/test_send_recv.out
@@ -0,0 +1,2 @@
+unique_count: 1
+count: 2
diff --git a/test/sql/basic.sql b/test/sql/basic.sql
new file mode 100644
index 0000000..bb885b5
--- /dev/null
+++ b/test/sql/basic.sql
@@ -0,0 +1,105 @@
+/*
+ * Hashset Type
+ */
+
+SELECT '{}'::int4hashset; -- empty int4hashset
+SELECT '{1,2,3}'::int4hashset;
+SELECT '{-2147483648,0,2147483647}'::int4hashset;
+SELECT '{-2147483649}'::int4hashset; -- out of range
+SELECT '{2147483648}'::int4hashset; -- out of range
+
+/*
+ * Hashset Functions
+ */
+
+SELECT int4hashset();
+SELECT int4hashset(
+ capacity := 10,
+ load_factor := 0.9,
+ growth_factor := 1.1,
+ hashfn_id := 1
+);
+SELECT hashset_add(int4hashset(), 123);
+SELECT hashset_add(NULL::int4hashset, 123);
+SELECT hashset_add('{123}'::int4hashset, 456);
+SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
+SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
+SELECT hashset_merge('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+SELECT hashset_to_array('{1,2,3}'::int4hashset);
+SELECT hashset_count('{1,2,3}'::int4hashset); -- 3
+SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
+
+/*
+ * Aggregation Functions
+ */
+
+SELECT hashset_agg(i) FROM generate_series(1,10) AS i;
+
+SELECT hashset_agg(h) FROM
+(
+ SELECT hashset_agg(i) AS h FROM generate_series(1,5) AS i
+ UNION ALL
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
+) q;
+
+/*
+ * Operator Definitions
+ */
+
+SELECT '{2}'::int4hashset = '{1}'::int4hashset; -- false
+SELECT '{2}'::int4hashset = '{2}'::int4hashset; -- true
+SELECT '{2}'::int4hashset = '{3}'::int4hashset; -- false
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset = '{2,3,1}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset = '{4,5,6}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset = '{1,2}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset = '{1,2,3,4}'::int4hashset; -- false
+
+SELECT '{2}'::int4hashset <> '{1}'::int4hashset; -- true
+SELECT '{2}'::int4hashset <> '{2}'::int4hashset; -- false
+SELECT '{2}'::int4hashset <> '{3}'::int4hashset; -- true
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset <> '{2,3,1}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset <> '{4,5,6}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset <> '{1,2}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
+
+SELECT '{1,2,3}'::int4hashset || 4;
+SELECT 4 || '{1,2,3}'::int4hashset;
+
+/*
+ * Hashset Hash Operators
+ */
+
+SELECT hashset_hash('{1,2,3}'::int4hashset);
+SELECT hashset_hash('{3,2,1}'::int4hashset);
+
+SELECT COUNT(*), COUNT(DISTINCT h)
+FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3,2,1}'::int4hashset AS h
+) q;
+
+/*
+ * Hashset Btree Operators
+ *
+ * Ordering of hashsets is not based on lexicographic order of elements.
+ * - If two hashsets are not equal, they retain consistent relative order.
+ * - If two hashsets are equal but have elements in different orders, their
+ * ordering is non-deterministic. This is inherent since the comparison
+ * function must return 0 for equal hashsets, giving no indication of order.
+ */
+
+SELECT h FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{4,5,6}'::int4hashset AS h
+ UNION ALL
+ SELECT '{7,8,9}'::int4hashset AS h
+) q
+ORDER BY h;
diff --git a/test/sql/benchmark.sql b/test/sql/benchmark.sql
new file mode 100644
index 0000000..1697451
--- /dev/null
+++ b/test/sql/benchmark.sql
@@ -0,0 +1,113 @@
+DROP EXTENSION IF EXISTS hashset;
+CREATE EXTENSION hashset;
+
+\timing on
+
+\echo *** Elements in sequence 1..100000
+
+\echo - Testing default hash function (Jenkins/lookup3)
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 1);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing Murmurhash32
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 2);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing naive hash function
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 3);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo *** Testing 100000 random ints
+
+SELECT setseed(0.12345);
+\echo - Testing default hash function (Jenkins/lookup3)
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 1);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+SELECT setseed(0.12345);
+\echo - Testing Murmurhash32
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 2);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+SELECT setseed(0.12345);
+\echo - Testing naive hash function
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 3);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+END
+$$ LANGUAGE plpgsql;
diff --git a/test/sql/invalid.sql b/test/sql/invalid.sql
new file mode 100644
index 0000000..43689ab
--- /dev/null
+++ b/test/sql/invalid.sql
@@ -0,0 +1 @@
+SELECT '{1,2s}'::int4hashset;
diff --git a/test/sql/io_varying_lengths.sql b/test/sql/io_varying_lengths.sql
new file mode 100644
index 0000000..8acb6b8
--- /dev/null
+++ b/test/sql/io_varying_lengths.sql
@@ -0,0 +1,21 @@
+/*
+ * This test verifies the hashset input/output functions for varying
+ * initial capacities, ensuring functionality across different sizes.
+ */
+
+SELECT hashset_sorted('{1}'::int4hashset);
+SELECT hashset_sorted('{1,2}'::int4hashset);
+SELECT hashset_sorted('{1,2,3}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::int4hashset);
diff --git a/test/sql/parsing.sql b/test/sql/parsing.sql
new file mode 100644
index 0000000..1e56bbe
--- /dev/null
+++ b/test/sql/parsing.sql
@@ -0,0 +1,23 @@
+/* Valid */
+SELECT '{1,23,-456}'::int4hashset;
+SELECT ' { 1 , 23 , -456 } '::int4hashset;
+
+/* Only whitespace is allowed after the closing brace */
+SELECT ' { 1 , 23 , -456 } 1'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } ,'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } {'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } }'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } x'::int4hashset; -- error
+
+/* Unexpected character when expecting closing brace */
+SELECT ' { 1 , 23 , -456 1'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 {'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 x'::int4hashset; -- error
+
+/* Error handling for strtol */
+SELECT ' { , 23 , -456 } '::int4hashset; -- error
+SELECT ' { 1 , 23 , '::int4hashset; -- error
+SELECT ' { s , 23 , -456 } '::int4hashset; -- error
+
+/* Missing opening brace */
+SELECT ' 1 , 23 , -456 } '::int4hashset; -- error
diff --git a/test/sql/prelude.sql b/test/sql/prelude.sql
new file mode 100644
index 0000000..2fee0fc
--- /dev/null
+++ b/test/sql/prelude.sql
@@ -0,0 +1,8 @@
+CREATE EXTENSION hashset;
+
+CREATE OR REPLACE FUNCTION hashset_sorted(int4hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/sql/random.sql b/test/sql/random.sql
new file mode 100644
index 0000000..7cc8f87
--- /dev/null
+++ b/test/sql/random.sql
@@ -0,0 +1,27 @@
+SELECT setseed(0.12345);
+
+\set MAX_INT 2147483647
+
+CREATE TABLE hashset_random_int4_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(int4hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_int4_numbers
+) q;
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_int4_numbers
+) q;
diff --git a/test/sql/reported_bugs.sql b/test/sql/reported_bugs.sql
new file mode 100644
index 0000000..48e86d3
--- /dev/null
+++ b/test/sql/reported_bugs.sql
@@ -0,0 +1,50 @@
+/*
+ * Bug in hashset_add() and hashset_merge() functions altering original hashset.
+ *
+ * Previously, the hashset_add() and hashset_merge() functions were modifying the
+ * original hashset in-place, leading to unexpected results as the original data
+ * within the hashset was being altered.
+ *
+ * The issue was addressed by implementing a macro function named
+ * PG_GETARG_INT4HASHSET_COPY() within the C code. This function guarantees that
+ * a copy of the hashset is created and subsequently modified, thereby preserving
+ * the integrity of the original hashset.
+ *
+ * As a result of this fix, hashset_add() and hashset_merge() now operate on
+ * a copied hashset, ensuring that the original data remains unaltered, and
+ * the query executes correctly.
+ */
+SELECT
+ q.hashset_agg,
+ hashset_add(hashset_agg,4)
+FROM
+(
+ SELECT
+ hashset_agg(generate_series)
+ FROM generate_series(1,3)
+) q;
+
+/*
+ * Bug in hashset_hash() function with respect to element insertion order.
+ *
+ * Prior to the fix, the hashset_hash() function was accumulating the hashes
+ * of individual elements in a non-commutative manner. As a consequence, the
+ * final hash value was sensitive to the order in which elements were inserted
+ * into the hashset. This behavior led to inconsistencies, as logically
+ * equivalent sets (i.e., sets with the same elements but in different orders)
+ * produced different hash values.
+ *
+ * The bug was fixed by modifying the hashset_hash() function to use a
+ * commutative operation when combining the hashes of individual elements.
+ * This change ensures that the final hash value is independent of the
+ * element insertion order, and logically equivalent sets produce the
+ * same hash.
+ */
+SELECT hashset_hash('{1,2}'::int4hashset);
+SELECT hashset_hash('{2,1}'::int4hashset);
+
+SELECT hashset_cmp('{1,2}','{2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2}');
diff --git a/test/sql/table.sql b/test/sql/table.sql
new file mode 100644
index 0000000..0472352
--- /dev/null
+++ b/test/sql/table.sql
@@ -0,0 +1,10 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes int4hashset DEFAULT int4hashset(capacity := 2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1;
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
On 2023-06-16 Fr 20:38, Joel Jacobson wrote:
New patch is attached, which will henceforth always be a complete patch,
to avoid the hassle of having to assemble incremental patches.
Cool, thanks.
A couple of random thoughts:
. It might be worth sending a version number with the send function
(c.f. jsonb_send / jsonb_recv). That way would would not be tied forever
to some wire representation.
. I think there are some important set operations missing: most notably
intersection, slightly less importantly asymmetric and symmetric
difference. I have no idea how easy these would be to add, but even for
your stated use I should have thought set intersection would be useful
("Who is a member of both this set of friends and that set of friends?").
. While supporting int4 only is OK for now, I think we would at least
want to support int8, and probably UUID since a number of systems I know
of use that as an object identifier.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Sun, Jun 18, 2023, at 18:45, Andrew Dunstan wrote:
. It might be worth sending a version number with the send function
(c.f. jsonb_send / jsonb_recv). That way would would not be tied forever
to some wire representation.
Great idea; implemented.
. I think there are some important set operations missing: most notably
intersection, slightly less importantly asymmetric and symmetric
difference. I have no idea how easy these would be to add, but even for
your stated use I should have thought set intersection would be useful
("Who is a member of both this set of friends and that set of friends?").
Another great idea; implemented.
. While supporting int4 only is OK for now, I think we would at least
want to support int8, and probably UUID since a number of systems I know
of use that as an object identifier.
I agree that's probably the most logical thing to focus on next. I'm on it.
New patch attached.
/Joel
Attachments:
hashset-0.0.1-75bf3ab.patchapplication/octet-stream; name=hashset-0.0.1-75bf3ab.patchDownload
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..91f216e
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,8 @@
+.deps/
+results/
+**/*.o
+**/*.so
+regression.diffs
+regression.out
+.vscode
+test/c_tests/test_send_recv
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..908853d
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,16 @@
+Copyright (c) 2019, Tomas Vondra (tomas.vondra@postgresql.org).
+
+Permission to use, copy, modify, and distribute this software and its documentation
+for any purpose, without fee, and without a written agreement is hereby granted,
+provided that the above copyright notice and this paragraph and the following two
+paragraphs appear in all copies.
+
+IN NO EVENT SHALL $ORGANISATION BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL,
+INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE
+OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF TOMAS VONDRA HAS BEEN ADVISED OF
+THE POSSIBILITY OF SUCH DAMAGE.
+
+TOMAS VONDRA SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
+SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND $ORGANISATION HAS NO
+OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..85e7691
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,33 @@
+MODULE_big = hashset
+OBJS = hashset.o hashset-api.o
+
+EXTENSION = hashset
+DATA = hashset--0.0.1.sql
+MODULES = hashset
+
+# Keep the CFLAGS separate
+SERVER_INCLUDES=-I$(shell pg_config --includedir-server)
+CLIENT_INCLUDES=-I$(shell pg_config --includedir)
+LIBRARY_PATH = -L$(shell pg_config --libdir)
+
+REGRESS = prelude basic io_varying_lengths random table invalid parsing reported_bugs
+REGRESS_OPTS = --inputdir=test
+
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+
+C_TESTS_DIR = test/c_tests
+
+EXTRA_CLEAN = $(C_TESTS_DIR)/test_send_recv
+
+c_tests: $(C_TESTS_DIR)/test_send_recv
+
+$(C_TESTS_DIR)/test_send_recv: $(C_TESTS_DIR)/test_send_recv.c
+ $(CC) $(SERVER_INCLUDES) $(CLIENT_INCLUDES) -o $@ $< $(LIBRARY_PATH) -lpq
+
+run_c_tests: c_tests
+ cd $(C_TESTS_DIR) && ./test_send_recv.sh
+
+check: all $(REGRESS_PREP) run_c_tests
+
+include $(PGXS)
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9c26cf6
--- /dev/null
+++ b/README.md
@@ -0,0 +1,156 @@
+# hashset
+
+This PostgreSQL extension implements hashset, a data structure (type)
+providing a collection of unique, not null integer items with fast lookup.
+
+
+## Version
+
+0.0.1
+
+🚧 **NOTICE** 🚧 This repository is currently under active development and the hashset
+PostgreSQL extension is **not production-ready**. As the codebase is evolving
+with possible breaking changes, we are not providing any migration scripts
+until we reach our first release.
+
+
+## Usage
+
+After installing the extension, you can use the `int4hashset` data type and
+associated functions within your PostgreSQL queries.
+
+To demonstrate the usage, let's consider a hypothetical table `users` which has
+a `user_id` and a `user_likes` of type `int4hashset`.
+
+Firstly, let's create the table:
+
+```sql
+CREATE TABLE users(
+ user_id int PRIMARY KEY,
+ user_likes int4hashset DEFAULT int4hashset()
+);
+```
+In the above statement, the `int4hashset()` initializes an empty hashset
+with zero capacity. The hashset will automatically resize itself when more
+elements are added.
+
+Now, we can perform operations on this table. Here are some examples:
+
+```sql
+-- Insert a new user with id 1. The user_likes will automatically be initialized
+-- as an empty hashset
+INSERT INTO users (user_id) VALUES (1);
+
+-- Add elements (likes) for a user
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+
+-- Check if a user likes a particular item
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1; -- true
+
+-- Count the number of likes a user has
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1; -- 2
+```
+
+You can also use the aggregate functions to perform operations on multiple rows.
+
+
+## Data types
+
+- **int4hashset**: This data type represents a set of integers. Internally, it uses
+a combination of a bitmap and a value array to store the elements in a set. It's
+a variable-length type.
+
+
+## Functions
+
+- `int4hashset([capacity int, load_factor float4, growth_factor float4, hashfn_id int4]) -> int4hashset`:
+ Initialize an empty int4hashset with optional parameters.
+ - `capacity` specifies the initial capacity, which is zero by default.
+ - `load_factor` represents the threshold for resizing the hashset and defaults to 0.75.
+ - `growth_factor` is the multiplier for resizing and defaults to 2.0.
+ - `hashfn_id` represents the hash function used.
+ - 1=Jenkins/lookup3 (default)
+ - 2=MurmurHash32
+ - 3=Naive hash function
+- `hashset_add(int4hashset, int) -> int4hashset`: Adds an integer to an int4hashset.
+- `hashset_contains(int4hashset, int) -> boolean`: Checks if an int4hashset contains a given integer.
+- `hashset_merge(int4hashset, int4hashset) -> int4hashset`: Merges two int4hashsets into a new int4hashset.
+- `hashset_to_array(int4hashset) -> int[]`: Converts an int4hashset to an array of integers.
+- `hashset_count(int4hashset) -> bigint`: Returns the number of elements in an int4hashset.
+- `hashset_capacity(int4hashset) -> bigint`: Returns the current capacity of an int4hashset.
+- `hashset_max_collisions(int4hashset) -> bigint`: Returns the maximum number of collisions that have occurred for a single element
+- `hashset_intersection(int4hashset, int4hashset) -> int4hashset`: Returns a new int4hashset that is the intersection of the two input sets.
+- `hashset_difference(int4hashset, int4hashset) -> int4hashset`: Returns a new int4hashset that contains the elements present in the first set but not in the second set.
+- `hashset_symmetric_difference(int4hashset, int4hashset) -> int4hashset`: Returns a new int4hashset containing elements that are in either of the input sets, but not in their intersection.
+
+## Aggregation Functions
+
+- `hashset_agg(int) -> int4hashset`: Aggregate integers into a hashset.
+- `hashset_agg(int4hashset) -> int4hashset`: Aggregate hashsets into a hashset.
+
+
+## Operators
+
+- Equality (`=`): Checks if two hashsets are equal.
+- Inequality (`<>`): Checks if two hashsets are not equal.
+
+
+## Hashset Hash Operators
+
+- `hashset_hash(int4hashset) -> integer`: Returns the hash value of an int4hashset.
+
+
+## Hashset Btree Operators
+
+- `<`, `<=`, `>`, `>=`: Comparison operators for hashsets.
+
+
+## Limitations
+
+- The `int4hashset` data type currently supports integers within the range of int4
+(-2147483648 to 2147483647).
+
+
+## Installation
+
+To install the extension on any platform, follow these general steps:
+
+1. Ensure you have PostgreSQL installed on your system, including the development files.
+2. Clone the repository.
+3. Navigate to the cloned repository directory.
+4. Compile the extension using `make`.
+5. Install the extension using `sudo make install`.
+6. Run the tests using `make installcheck` (optional).
+
+To use a different PostgreSQL installation, point configure to a different `pg_config`, using following command:
+```sh
+make PG_CONFIG=/else/where/pg_config
+sudo make install PG_CONFIG=/else/where/pg_config
+```
+
+In your PostgreSQL connection, enable the hashset extension using the following SQL command:
+```sql
+CREATE EXTENSION hashset;
+```
+
+This extension requires PostgreSQL version ?.? or later.
+
+For Ubuntu 22.04.1 LTS, you would run the following commands:
+
+```sh
+sudo apt install postgresql-15 postgresql-server-dev-15 postgresql-client-15
+git clone https://github.com/tvondra/hashset.git
+cd hashset
+make
+sudo make install
+make installcheck
+```
+
+Please note that this project is currently under active development and is not yet considered production-ready.
+
+## License
+
+This software is distributed under the terms of PostgreSQL license.
+See LICENSE or http://www.opensource.org/licenses/bsd-license.php for
+more details.
diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
new file mode 100644
index 0000000..a155190
--- /dev/null
+++ b/hashset--0.0.1.sql
@@ -0,0 +1,298 @@
+/*
+ * Hashset Type Definition
+ */
+
+CREATE TYPE int4hashset;
+
+CREATE OR REPLACE FUNCTION int4hashset_in(cstring)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_in'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_out(int4hashset)
+RETURNS cstring
+AS 'hashset', 'int4hashset_out'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_send(int4hashset)
+RETURNS bytea
+AS 'hashset', 'int4hashset_send'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_recv(internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_recv'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE TYPE int4hashset (
+ INPUT = int4hashset_in,
+ OUTPUT = int4hashset_out,
+ RECEIVE = int4hashset_recv,
+ SEND = int4hashset_send,
+ INTERNALLENGTH = variable,
+ STORAGE = external
+);
+
+/*
+ * Hashset Functions
+ */
+
+CREATE OR REPLACE FUNCTION int4hashset(
+ capacity int DEFAULT 0,
+ load_factor float4 DEFAULT 0.75,
+ growth_factor float4 DEFAULT 2.0,
+ hashfn_id int DEFAULT 1
+)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_init'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_add(int4hashset, int)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_add'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int)
+RETURNS bool
+AS 'hashset', 'int4hashset_contains'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_merge(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_merge'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_to_array(int4hashset)
+RETURNS int[]
+AS 'hashset', 'int4hashset_to_array'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_count(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_count'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_capacity(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_capacity'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_collisions(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_collisions'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_max_collisions(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_max_collisions'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4_add_int4hashset(int4, int4hashset)
+RETURNS int4hashset
+AS $$SELECT $2 || $1$$
+LANGUAGE SQL
+IMMUTABLE PARALLEL SAFE STRICT COST 1;
+
+CREATE OR REPLACE FUNCTION hashset_intersection(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_intersection'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_difference(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_difference'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_symmetric_difference(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_symmetric_difference'
+LANGUAGE C IMMUTABLE;
+
+/*
+ * Aggregation Functions
+ */
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_add(p_pointer internal, p_value int)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_add'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_agg_final'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_combine'
+LANGUAGE C IMMUTABLE;
+
+CREATE AGGREGATE hashset_agg(int) (
+ SFUNC = int4hashset_agg_add,
+ STYPE = internal,
+ FINALFUNC = int4hashset_agg_final,
+ COMBINEFUNC = int4hashset_agg_combine,
+ PARALLEL = SAFE
+);
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_add_set(p_pointer internal, p_value int4hashset)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_add_set'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_agg_final'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_combine'
+LANGUAGE C IMMUTABLE;
+
+CREATE AGGREGATE hashset_agg(int4hashset) (
+ SFUNC = int4hashset_agg_add_set,
+ STYPE = internal,
+ FINALFUNC = int4hashset_agg_final,
+ COMBINEFUNC = int4hashset_agg_combine,
+ PARALLEL = SAFE
+);
+
+/*
+ * Operator Definitions
+ */
+
+CREATE OR REPLACE FUNCTION hashset_equals(int4hashset, int4hashset)
+RETURNS bool
+AS 'hashset', 'int4hashset_equals'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR = (
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ PROCEDURE = hashset_equals,
+ COMMUTATOR = =,
+ HASHES
+);
+
+CREATE OR REPLACE FUNCTION hashset_neq(int4hashset, int4hashset)
+RETURNS bool
+AS 'hashset', 'int4hashset_neq'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR <> (
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ PROCEDURE = hashset_neq,
+ COMMUTATOR = '<>',
+ NEGATOR = '=',
+ RESTRICT = neqsel,
+ JOIN = neqjoinsel,
+ HASHES
+);
+
+CREATE OPERATOR || (
+ leftarg = int4hashset,
+ rightarg = int4,
+ function = hashset_add,
+ commutator = ||
+);
+
+CREATE OPERATOR || (
+ leftarg = int4,
+ rightarg = int4hashset,
+ function = int4_add_int4hashset,
+ commutator = ||
+);
+
+/*
+ * Hashset Hash Operators
+ */
+
+CREATE OR REPLACE FUNCTION hashset_hash(int4hashset)
+RETURNS integer
+AS 'hashset', 'int4hashset_hash'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR CLASS int4hashset_hash_ops
+DEFAULT FOR TYPE int4hashset USING hash AS
+OPERATOR 1 = (int4hashset, int4hashset),
+FUNCTION 1 hashset_hash(int4hashset);
+
+/*
+ * Hashset Btree Operators
+ */
+
+CREATE OR REPLACE FUNCTION hashset_lt(int4hashset, int4hashset)
+RETURNS bool
+AS 'hashset', 'int4hashset_lt'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_le(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_le'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_gt(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_gt'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_ge(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_ge'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_cmp(int4hashset, int4hashset)
+RETURNS integer
+AS 'hashset', 'int4hashset_cmp'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR < (
+ PROCEDURE = hashset_lt,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = >,
+ NEGATOR = >=,
+ RESTRICT = scalarltsel,
+ JOIN = scalarltjoinsel
+);
+
+CREATE OPERATOR <= (
+ PROCEDURE = hashset_le,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '>=',
+ NEGATOR = '>',
+ RESTRICT = scalarltsel,
+ JOIN = scalarltjoinsel
+);
+
+CREATE OPERATOR > (
+ PROCEDURE = hashset_gt,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '<',
+ NEGATOR = '<=',
+ RESTRICT = scalargtsel,
+ JOIN = scalargtjoinsel
+);
+
+CREATE OPERATOR >= (
+ PROCEDURE = hashset_ge,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '<=',
+ NEGATOR = '<',
+ RESTRICT = scalargtsel,
+ JOIN = scalargtjoinsel
+);
+
+CREATE OPERATOR CLASS int4hashset_btree_ops
+DEFAULT FOR TYPE int4hashset USING btree AS
+OPERATOR 1 < (int4hashset, int4hashset),
+OPERATOR 2 <= (int4hashset, int4hashset),
+OPERATOR 3 = (int4hashset, int4hashset),
+OPERATOR 4 >= (int4hashset, int4hashset),
+OPERATOR 5 > (int4hashset, int4hashset),
+FUNCTION 1 hashset_cmp(int4hashset, int4hashset);
diff --git a/hashset-api.c b/hashset-api.c
new file mode 100644
index 0000000..3feb06d
--- /dev/null
+++ b/hashset-api.c
@@ -0,0 +1,1057 @@
+#include "hashset.h"
+
+#include <stdio.h>
+#include <math.h>
+#include <string.h>
+#include <sys/time.h>
+#include <unistd.h>
+#include <limits.h>
+
+#define PG_GETARG_INT4HASHSET(x) (int4hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(x))
+#define PG_GETARG_INT4HASHSET_COPY(x) (int4hashset_t *) PG_DETOAST_DATUM_COPY(PG_GETARG_DATUM(x))
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(int4hashset_in);
+PG_FUNCTION_INFO_V1(int4hashset_out);
+PG_FUNCTION_INFO_V1(int4hashset_send);
+PG_FUNCTION_INFO_V1(int4hashset_recv);
+PG_FUNCTION_INFO_V1(int4hashset_add);
+PG_FUNCTION_INFO_V1(int4hashset_contains);
+PG_FUNCTION_INFO_V1(int4hashset_count);
+PG_FUNCTION_INFO_V1(int4hashset_merge);
+PG_FUNCTION_INFO_V1(int4hashset_init);
+PG_FUNCTION_INFO_V1(int4hashset_capacity);
+PG_FUNCTION_INFO_V1(int4hashset_collisions);
+PG_FUNCTION_INFO_V1(int4hashset_max_collisions);
+PG_FUNCTION_INFO_V1(int4hashset_agg_add);
+PG_FUNCTION_INFO_V1(int4hashset_agg_add_set);
+PG_FUNCTION_INFO_V1(int4hashset_agg_final);
+PG_FUNCTION_INFO_V1(int4hashset_agg_combine);
+PG_FUNCTION_INFO_V1(int4hashset_to_array);
+PG_FUNCTION_INFO_V1(int4hashset_equals);
+PG_FUNCTION_INFO_V1(int4hashset_neq);
+PG_FUNCTION_INFO_V1(int4hashset_hash);
+PG_FUNCTION_INFO_V1(int4hashset_lt);
+PG_FUNCTION_INFO_V1(int4hashset_le);
+PG_FUNCTION_INFO_V1(int4hashset_gt);
+PG_FUNCTION_INFO_V1(int4hashset_ge);
+PG_FUNCTION_INFO_V1(int4hashset_cmp);
+PG_FUNCTION_INFO_V1(int4hashset_intersection);
+PG_FUNCTION_INFO_V1(int4hashset_difference);
+PG_FUNCTION_INFO_V1(int4hashset_symmetric_difference);
+
+Datum int4hashset_in(PG_FUNCTION_ARGS);
+Datum int4hashset_out(PG_FUNCTION_ARGS);
+Datum int4hashset_send(PG_FUNCTION_ARGS);
+Datum int4hashset_recv(PG_FUNCTION_ARGS);
+Datum int4hashset_add(PG_FUNCTION_ARGS);
+Datum int4hashset_contains(PG_FUNCTION_ARGS);
+Datum int4hashset_count(PG_FUNCTION_ARGS);
+Datum int4hashset_merge(PG_FUNCTION_ARGS);
+Datum int4hashset_init(PG_FUNCTION_ARGS);
+Datum int4hashset_capacity(PG_FUNCTION_ARGS);
+Datum int4hashset_collisions(PG_FUNCTION_ARGS);
+Datum int4hashset_max_collisions(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_add(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_add_set(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_final(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_combine(PG_FUNCTION_ARGS);
+Datum int4hashset_to_array(PG_FUNCTION_ARGS);
+Datum int4hashset_equals(PG_FUNCTION_ARGS);
+Datum int4hashset_neq(PG_FUNCTION_ARGS);
+Datum int4hashset_hash(PG_FUNCTION_ARGS);
+Datum int4hashset_lt(PG_FUNCTION_ARGS);
+Datum int4hashset_le(PG_FUNCTION_ARGS);
+Datum int4hashset_gt(PG_FUNCTION_ARGS);
+Datum int4hashset_ge(PG_FUNCTION_ARGS);
+Datum int4hashset_cmp(PG_FUNCTION_ARGS);
+Datum int4hashset_intersection(PG_FUNCTION_ARGS);
+Datum int4hashset_difference(PG_FUNCTION_ARGS);
+Datum int4hashset_symmetric_difference(PG_FUNCTION_ARGS);
+
+Datum
+int4hashset_in(PG_FUNCTION_ARGS)
+{
+ char *str = PG_GETARG_CSTRING(0);
+ char *endptr;
+ int32 len = strlen(str);
+ int4hashset_t *set;
+ int64 value;
+
+ /* Skip initial spaces */
+ while (hashset_isspace(*str)) str++;
+
+ /* Check the opening brace */
+ if (*str != '{')
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for hashset: \"%s\"", str),
+ errdetail("Hashset representation must start with \"{\".")));
+ }
+
+ /* Start parsing from the first number (after the opening brace) */
+ str++;
+
+ /* Initial size based on input length (arbitrary, could be optimized) */
+ set = int4hashset_allocate(
+ len/2,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ while (true)
+ {
+ /* Skip spaces before number */
+ while (hashset_isspace(*str)) str++;
+
+ /* Check for closing brace, handling the case for an empty set */
+ if (*str == '}')
+ {
+ str++; /* Move past the closing brace */
+ break;
+ }
+
+ /* Parse the number */
+ value = strtol(str, &endptr, 10);
+
+ if (errno == ERANGE || value < PG_INT32_MIN || value > PG_INT32_MAX)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value \"%s\" is out of range for type %s", str,
+ "integer")));
+ }
+
+ /* Add the value to the hashset, resize if needed */
+ if (set->nelements >= set->capacity)
+ {
+ set = int4hashset_resize(set);
+ }
+ set = int4hashset_add_element(set, (int32)value);
+
+ /* Error handling for strtol */
+ if (endptr == str)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for integer: \"%s\"", str)));
+ }
+
+ str = endptr; /* Move to next potential number or closing brace */
+
+ /* Skip spaces before the next number or closing brace */
+ while (hashset_isspace(*str)) str++;
+
+ if (*str == ',')
+ {
+ str++; /* Skip comma before next loop iteration */
+ }
+ else if (*str != '}')
+ {
+ /* Unexpected character */
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("unexpected character \"%c\" in hashset input", *str)));
+ }
+ }
+
+ /* Only whitespace is allowed after the closing brace */
+ while (*str)
+ {
+ if (!hashset_isspace(*str))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed hashset literal: \"%s\"", str),
+ errdetail("Junk after closing right brace.")));
+ }
+ str++;
+ }
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_out(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ char *bitmap;
+ int32 *values;
+ int i;
+ StringInfoData str;
+
+ /* Calculate the pointer to the bitmap and values array */
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ /* Initialize the StringInfo buffer */
+ initStringInfo(&str);
+
+ /* Append the opening brace for the output hashset string */
+ appendStringInfoChar(&str, '{');
+
+ /* Loop through the elements and append them to the string */
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ /* Check if the bit in the bitmap is set */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Append the value */
+ if (str.len > 1)
+ appendStringInfoChar(&str, ',');
+ appendStringInfo(&str, "%d", values[i]);
+ }
+ }
+
+ /* Append the closing brace for the output hashset string */
+ appendStringInfoChar(&str, '}');
+
+ /* Return the resulting string */
+ PG_RETURN_CSTRING(str.data);
+}
+
+Datum
+int4hashset_send(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ StringInfoData buf;
+ int32 data_size;
+ int version = 1;
+
+ /* Begin constructing the message */
+ pq_begintypsend(&buf);
+
+ /* Send the version number */
+ pq_sendint8(&buf, version);
+
+ /* Send the non-data fields */
+ pq_sendint32(&buf, set->flags);
+ pq_sendint32(&buf, set->capacity);
+ pq_sendint32(&buf, set->nelements);
+ pq_sendint32(&buf, set->hashfn_id);
+ pq_sendfloat4(&buf, set->load_factor);
+ pq_sendfloat4(&buf, set->growth_factor);
+ pq_sendint32(&buf, set->ncollisions);
+ pq_sendint32(&buf, set->max_collisions);
+ pq_sendint32(&buf, set->hash);
+
+ /* Compute and send the size of the data field */
+ data_size = VARSIZE(set) - offsetof(int4hashset_t, data);
+ pq_sendbytes(&buf, set->data, data_size);
+
+ PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+}
+
+Datum
+int4hashset_recv(PG_FUNCTION_ARGS)
+{
+ StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
+ int4hashset_t *set;
+ int32 data_size;
+ Size total_size;
+ const char *binary_data;
+ int version;
+ int32 flags;
+ int32 capacity;
+ int32 nelements;
+ int32 hashfn_id;
+ float4 load_factor;
+ float4 growth_factor;
+ int32 ncollisions;
+ int32 max_collisions;
+ int32 hash;
+
+ version = pq_getmsgint(buf, 1);
+ if (version != 1)
+ elog(ERROR, "unsupported hashset version number %d", version);
+
+ /* Read fields from buffer */
+ flags = pq_getmsgint(buf, 4);
+ capacity = pq_getmsgint(buf, 4);
+ nelements = pq_getmsgint(buf, 4);
+ hashfn_id = pq_getmsgint(buf, 4);
+ load_factor = pq_getmsgfloat4(buf);
+ growth_factor = pq_getmsgfloat4(buf);
+ ncollisions = pq_getmsgint(buf, 4);
+ max_collisions = pq_getmsgint(buf, 4);
+ hash = pq_getmsgint(buf, 4);
+
+ /* Compute the size of the data field */
+ data_size = buf->len - buf->cursor;
+
+ /* Read the binary data */
+ binary_data = pq_getmsgbytes(buf, data_size);
+
+ /* Make sure that there is no extra data left in the message */
+ pq_getmsgend(buf);
+
+ /* Compute total size of hashset_t */
+ total_size = offsetof(int4hashset_t, data) + data_size;
+
+ /* Allocate memory for hashset including the data field */
+ set = (int4hashset_t *) palloc0(total_size);
+
+ /* Set the size of the variable-length data structure */
+ SET_VARSIZE(set, total_size);
+
+ /* Populate the structure */
+ set->flags = flags;
+ set->capacity = capacity;
+ set->nelements = nelements;
+ set->hashfn_id = hashfn_id;
+ set->load_factor = load_factor;
+ set->growth_factor = growth_factor;
+ set->ncollisions = ncollisions;
+ set->max_collisions = max_collisions;
+ set->hash = hash;
+ memcpy(set->data, binary_data, data_size);
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_add(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+
+ if (PG_ARGISNULL(1))
+ {
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+ }
+
+ /* if there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ set = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ }
+ else
+ {
+ /* make sure we are working with a non-toasted and non-shared copy of the input */
+ set = PG_GETARG_INT4HASHSET_COPY(0);
+ }
+
+ set = int4hashset_add_element(set, PG_GETARG_INT32(1));
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_contains(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ int32 value;
+
+ if (PG_ARGISNULL(1) || PG_ARGISNULL(0))
+ PG_RETURN_BOOL(false);
+
+ set = PG_GETARG_INT4HASHSET(0);
+ value = PG_GETARG_INT32(1);
+
+ PG_RETURN_BOOL(int4hashset_contains_element(set, value));
+}
+
+Datum
+int4hashset_count(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->nelements);
+}
+
+Datum
+int4hashset_merge(PG_FUNCTION_ARGS)
+{
+ int i;
+
+ int4hashset_t *seta;
+ int4hashset_t *setb;
+
+ char *bitmap;
+ int32_t *values;
+
+ if (PG_ARGISNULL(0) && PG_ARGISNULL(1))
+ PG_RETURN_NULL();
+ else if (PG_ARGISNULL(1))
+ PG_RETURN_POINTER(PG_GETARG_INT4HASHSET(0));
+ else if (PG_ARGISNULL(0))
+ PG_RETURN_POINTER(PG_GETARG_INT4HASHSET(1));
+
+ seta = PG_GETARG_INT4HASHSET_COPY(0);
+ setb = PG_GETARG_INT4HASHSET(1);
+
+ bitmap = setb->data;
+ values = (int32 *) (setb->data + CEIL_DIV(setb->capacity, 8));
+
+ for (i = 0; i < setb->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ seta = int4hashset_add_element(seta, values[i]);
+ }
+
+ PG_RETURN_POINTER(seta);
+}
+
+Datum
+int4hashset_init(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ int32 initial_capacity = PG_GETARG_INT32(0);
+ float4 load_factor = PG_GETARG_FLOAT4(1);
+ float4 growth_factor = PG_GETARG_FLOAT4(2);
+ int32 hashfn_id = PG_GETARG_INT32(3);
+
+ /* Validate input arguments */
+ if (!(initial_capacity >= 0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("initial capacity cannot be negative")));
+ }
+
+ if (!(load_factor > 0.0 && load_factor < 1.0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("load factor must be between 0.0 and 1.0")));
+ }
+
+ if (!(growth_factor > 1.0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("growth factor must be greater than 1.0")));
+ }
+
+ if (!(hashfn_id == JENKINS_LOOKUP3_HASHFN_ID ||
+ hashfn_id == MURMURHASH32_HASHFN_ID ||
+ hashfn_id == NAIVE_HASHFN_ID))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Invalid hash function ID")));
+ }
+
+ set = int4hashset_allocate(
+ initial_capacity,
+ load_factor,
+ growth_factor,
+ hashfn_id
+ );
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_capacity(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->capacity);
+}
+
+Datum
+int4hashset_collisions(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->ncollisions);
+}
+
+Datum
+int4hashset_max_collisions(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->max_collisions);
+}
+
+Datum
+int4hashset_agg_add(PG_FUNCTION_ARGS)
+{
+ MemoryContext oldcontext;
+ int4hashset_t *state;
+
+ MemoryContext aggcontext;
+
+ /* cannot be called directly because of internal-type argument */
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_add_add called in non-aggregate context");
+
+ /*
+ * We want to skip NULL values altogether - we return either the existing
+ * hashset (if it already exists) or NULL.
+ */
+ if (PG_ARGISNULL(1))
+ {
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ /* if there already is a state accumulated, don't forget it */
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+ }
+
+ /* if there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ state = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_add_element(state, PG_GETARG_INT32(1));
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(state);
+}
+
+Datum
+int4hashset_agg_add_set(PG_FUNCTION_ARGS)
+{
+ MemoryContext oldcontext;
+ int4hashset_t *state;
+
+ MemoryContext aggcontext;
+
+ /* cannot be called directly because of internal-type argument */
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_add_add called in non-aggregate context");
+
+ /*
+ * We want to skip NULL values altogether - we return either the existing
+ * hashset (if it already exists) or NULL.
+ */
+ if (PG_ARGISNULL(1))
+ {
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ /* if there already is a state accumulated, don't forget it */
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+ }
+
+ /* if there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ state = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+
+ {
+ int i;
+ char *bitmap;
+ int32 *values;
+ int4hashset_t *value;
+
+ value = PG_GETARG_INT4HASHSET(1);
+
+ bitmap = value->data;
+ values = (int32 *) (value->data + CEIL_DIV(value->capacity, 8));
+
+ for (i = 0; i < value->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ state = int4hashset_add_element(state, values[i]);
+ }
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(state);
+}
+
+Datum
+int4hashset_agg_final(PG_FUNCTION_ARGS)
+{
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ PG_RETURN_POINTER(PG_GETARG_POINTER(0));
+}
+
+Datum
+int4hashset_agg_combine(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *src;
+ int4hashset_t *dst;
+ MemoryContext aggcontext;
+ MemoryContext oldcontext;
+
+ char *bitmap;
+ int32 *values;
+
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_agg_combine called in non-aggregate context");
+
+ /* if no "merged" state yet, try creating it */
+ if (PG_ARGISNULL(0))
+ {
+ /* nope, the second argument is NULL to, so return NULL */
+ if (PG_ARGISNULL(1))
+ PG_RETURN_NULL();
+
+ /* the second argument is not NULL, so copy it */
+ src = (int4hashset_t *) PG_GETARG_POINTER(1);
+
+ /* copy the hashset into the right long-lived memory context */
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ src = int4hashset_copy(src);
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(src);
+ }
+
+ /*
+ * If the second argument is NULL, just return the first one (we know
+ * it's not NULL at this point).
+ */
+ if (PG_ARGISNULL(1))
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+
+ /* Now we know neither argument is NULL, so merge them. */
+ src = (int4hashset_t *) PG_GETARG_POINTER(1);
+ dst = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ bitmap = src->data;
+ values = (int32 *) (src->data + CEIL_DIV(src->capacity, 8));
+
+ for (i = 0; i < src->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ dst = int4hashset_add_element(dst, values[i]);
+ }
+
+
+ PG_RETURN_POINTER(dst);
+}
+
+Datum
+int4hashset_to_array(PG_FUNCTION_ARGS)
+{
+ int i,
+ idx;
+ int4hashset_t *set;
+ int32 *values;
+ int nvalues;
+
+ char *sbitmap;
+ int32 *svalues;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ sbitmap = set->data;
+ svalues = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ /* number of values to store in the array */
+ nvalues = set->nelements;
+ values = (int32 *) palloc(sizeof(int32) * nvalues);
+
+ idx = 0;
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (sbitmap[byte] & (0x01 << bit))
+ values[idx++] = svalues[i];
+ }
+
+ Assert(idx == nvalues);
+
+ return int32_to_array(fcinfo, values, nvalues);
+}
+
+Datum
+int4hashset_equals(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+
+ char *bitmap_a;
+ int32 *values_a;
+ int i;
+
+ /*
+ * Check if the number of elements is the same
+ */
+ if (a->nelements != b->nelements)
+ PG_RETURN_BOOL(false);
+
+ bitmap_a = a->data;
+ values_a = (int32 *)(a->data + CEIL_DIV(a->capacity, 8));
+
+ /*
+ * Check if every element in a is also in b
+ */
+ for (i = 0; i < a->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap_a[byte] & (0x01 << bit))
+ {
+ int32 value = values_a[i];
+
+ if (!int4hashset_contains_element(b, value))
+ PG_RETURN_BOOL(false);
+ }
+ }
+
+ /*
+ * All elements in a are in b and the number of elements is the same,
+ * so the sets must be equal.
+ */
+ PG_RETURN_BOOL(true);
+}
+
+
+Datum
+int4hashset_neq(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+
+ /* If a is not equal to b, then they are not equal */
+ if (!DatumGetBool(DirectFunctionCall2(int4hashset_equals, PointerGetDatum(a), PointerGetDatum(b))))
+ PG_RETURN_BOOL(true);
+
+ PG_RETURN_BOOL(false);
+}
+
+
+Datum int4hashset_hash(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT32(set->hash);
+}
+
+
+Datum
+int4hashset_lt(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp < 0);
+}
+
+
+Datum
+int4hashset_le(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp <= 0);
+}
+
+
+Datum
+int4hashset_gt(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp > 0);
+}
+
+
+Datum
+int4hashset_ge(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp >= 0);
+}
+
+Datum
+int4hashset_cmp(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 *elements_a;
+ int32 *elements_b;
+
+ /*
+ * Compare the hashes first, if they are different,
+ * we can immediately tell which set is 'greater'
+ */
+ if (a->hash < b->hash)
+ PG_RETURN_INT32(-1);
+ else if (a->hash > b->hash)
+ PG_RETURN_INT32(1);
+
+ /*
+ * If hashes are equal, perform a more rigorous comparison
+ */
+
+ /*
+ * If number of elements are different,
+ * we can use that to deterministically return -1 or 1
+ */
+ if (a->nelements < b->nelements)
+ PG_RETURN_INT32(-1);
+ else if (a->nelements > b->nelements)
+ PG_RETURN_INT32(1);
+
+ /* Assert that the number of elements in both hashsets are equal */
+ Assert(a->nelements == b->nelements);
+
+ /* Extract and sort elements from each set */
+ elements_a = int4hashset_extract_sorted_elements(a);
+ elements_b = int4hashset_extract_sorted_elements(b);
+
+ /* Now we can perform a lexicographical comparison */
+ for (int32 i = 0; i < a->nelements; i++)
+ {
+ if (elements_a[i] < elements_b[i])
+ {
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(-1);
+ }
+ else if (elements_a[i] > elements_b[i])
+ {
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(1);
+ }
+ }
+
+ /* All elements are equal, so the sets are equal */
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(0);
+}
+
+Datum
+int4hashset_intersection(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *seta;
+ int4hashset_t *setb;
+ int4hashset_t *intersection;
+ char *bitmap;
+ int32_t *values;
+
+ if (PG_ARGISNULL(0) || PG_ARGISNULL(1))
+ PG_RETURN_NULL();
+
+ seta = PG_GETARG_INT4HASHSET(0);
+ setb = PG_GETARG_INT4HASHSET(1);
+
+ intersection = int4hashset_allocate(
+ seta->capacity,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ bitmap = setb->data;
+ values = (int32_t *)(setb->data + CEIL_DIV(setb->capacity, 8));
+
+ for (i = 0; i < setb->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if ((bitmap[byte] & (0x01 << bit)) &&
+ int4hashset_contains_element(seta, values[i]))
+ {
+ intersection = int4hashset_add_element(intersection, values[i]);
+ }
+ }
+
+ PG_RETURN_POINTER(intersection);
+}
+
+Datum
+int4hashset_difference(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *seta;
+ int4hashset_t *setb;
+ int4hashset_t *difference;
+ char *bitmap;
+ int32_t *values;
+
+ if (PG_ARGISNULL(0) || PG_ARGISNULL(1))
+ PG_RETURN_NULL();
+
+ seta = PG_GETARG_INT4HASHSET(0);
+ setb = PG_GETARG_INT4HASHSET(1);
+
+ difference = int4hashset_allocate(
+ seta->capacity,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ bitmap = seta->data;
+ values = (int32_t *)(seta->data + CEIL_DIV(seta->capacity, 8));
+
+ for (i = 0; i < seta->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if ((bitmap[byte] & (0x01 << bit)) &&
+ !int4hashset_contains_element(setb, values[i]))
+ {
+ difference = int4hashset_add_element(difference, values[i]);
+ }
+ }
+
+ PG_RETURN_POINTER(difference);
+}
+
+Datum
+int4hashset_symmetric_difference(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *seta;
+ int4hashset_t *setb;
+ int4hashset_t *result;
+ char *bitmapa;
+ char *bitmapb;
+ int32_t *valuesa;
+ int32_t *valuesb;
+
+ if (PG_ARGISNULL(0) || PG_ARGISNULL(1))
+ ereport(ERROR,
+ (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ errmsg("hashset arguments cannot be null")));
+
+ seta = PG_GETARG_INT4HASHSET(0);
+ setb = PG_GETARG_INT4HASHSET(1);
+
+ bitmapa = seta->data;
+ valuesa = (int32 *) (seta->data + CEIL_DIV(seta->capacity, 8));
+
+ bitmapb = setb->data;
+ valuesb = (int32 *) (setb->data + CEIL_DIV(setb->capacity, 8));
+
+ result = int4hashset_allocate(
+ seta->nelements + setb->nelements,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ /* Add elements that are in seta but not in setb */
+ for (i = 0; i < seta->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ if (bitmapa[byte] & (0x01 << bit))
+ {
+ int32 value = valuesa[i];
+ if (!int4hashset_contains_element(setb, value))
+ result = int4hashset_add_element(result, value);
+ }
+ }
+
+ /* Add elements that are in setb but not in seta */
+ for (i = 0; i < setb->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ if (bitmapb[byte] & (0x01 << bit))
+ {
+ int32 value = valuesb[i];
+ if (!int4hashset_contains_element(seta, value))
+ result = int4hashset_add_element(result, value);
+ }
+ }
+
+ PG_RETURN_POINTER(result);
+}
diff --git a/hashset.c b/hashset.c
new file mode 100644
index 0000000..67cbdf3
--- /dev/null
+++ b/hashset.c
@@ -0,0 +1,329 @@
+/*
+ * hashset.c
+ *
+ * Copyright (C) Tomas Vondra, 2019
+ */
+
+#include "hashset.h"
+
+static int int32_cmp(const void *a, const void *b);
+
+int4hashset_t *
+int4hashset_allocate(
+ int capacity,
+ float4 load_factor,
+ float4 growth_factor,
+ int hashfn_id
+)
+{
+ Size len;
+ int4hashset_t *set;
+ char *ptr;
+
+ /*
+ * Ensure that capacity is not divisible by HASHSET_STEP;
+ * i.e. the step size used in hashset_add_element()
+ * and hashset_contains_element().
+ */
+ while (capacity % HASHSET_STEP == 0)
+ capacity++;
+
+ len = offsetof(int4hashset_t, data);
+ len += CEIL_DIV(capacity, 8);
+ len += capacity * sizeof(int32);
+
+ ptr = palloc0(len);
+ SET_VARSIZE(ptr, len);
+
+ set = (int4hashset_t *) ptr;
+
+ set->flags = 0;
+ set->capacity = capacity;
+ set->nelements = 0;
+ set->hashfn_id = hashfn_id;
+ set->load_factor = load_factor;
+ set->growth_factor = growth_factor;
+ set->ncollisions = 0;
+ set->max_collisions = 0;
+ set->hash = 0; /* Initial hash value */
+
+ set->flags |= 0;
+
+ return set;
+}
+
+int4hashset_t *
+int4hashset_resize(int4hashset_t * set)
+{
+ int i;
+ int4hashset_t *new;
+ char *bitmap;
+ int32 *values;
+ int new_capacity;
+
+ new_capacity = (int)(set->capacity * set->growth_factor);
+
+ /*
+ * If growth factor is too small, new capacity might remain the same as
+ * the old capacity. This can lead to an infinite loop in resizing.
+ * To prevent this, we manually increment the capacity by 1 if new capacity
+ * equals the old capacity.
+ */
+ if (new_capacity == set->capacity)
+ new_capacity = set->capacity + 1;
+
+ new = int4hashset_allocate(
+ new_capacity,
+ set->load_factor,
+ set->growth_factor,
+ set->hashfn_id
+ );
+
+ /* Calculate the pointer to the bitmap and values array */
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ int4hashset_add_element(new, values[i]);
+ }
+
+ return new;
+}
+
+int4hashset_t *
+int4hashset_add_element(int4hashset_t *set, int32 value)
+{
+ int byte;
+ int bit;
+ uint32 hash;
+ uint32 position;
+ char *bitmap;
+ int32 *values;
+ int32 current_collisions = 0;
+
+ if (set->nelements > set->capacity * set->load_factor)
+ set = int4hashset_resize(set);
+
+ if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
+ {
+ hash = hash_bytes_uint32((uint32) value);
+ }
+ else if (set->hashfn_id == MURMURHASH32_HASHFN_ID)
+ {
+ hash = murmurhash32((uint32) value);
+ }
+ else if (set->hashfn_id == NAIVE_HASHFN_ID)
+ {
+ hash = ((uint32) value * 7691 + 4201);
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid hash function ID: \"%d\"", set->hashfn_id)));
+ }
+
+ position = hash % set->capacity;
+
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ while (true)
+ {
+ byte = (position / 8);
+ bit = (position % 8);
+
+ /* The item is already used - maybe it's the same value? */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Same value, we're done */
+ if (values[position] == value)
+ break;
+
+ /* Increment the collision counter */
+ set->ncollisions++;
+ current_collisions++;
+
+ if (current_collisions > set->max_collisions)
+ set->max_collisions = current_collisions;
+
+ position = (position + HASHSET_STEP) % set->capacity;
+ continue;
+ }
+
+ /* Found an empty spot, before hitting the value first */
+ bitmap[byte] |= (0x01 << bit);
+ values[position] = value;
+
+ set->hash ^= hash;
+
+ set->nelements++;
+
+ break;
+ }
+
+ return set;
+}
+
+bool
+int4hashset_contains_element(int4hashset_t *set, int32 value)
+{
+ int byte;
+ int bit;
+ uint32 hash;
+ uint32 position;
+ char *bitmap;
+ int32 *values;
+ int num_probes = 0; /* Counter for the number of probes */
+
+ if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
+ {
+ hash = hash_bytes_uint32((uint32) value);
+ }
+ else if (set->hashfn_id == MURMURHASH32_HASHFN_ID)
+ {
+ hash = murmurhash32((uint32) value);
+ }
+ else if (set->hashfn_id == NAIVE_HASHFN_ID)
+ {
+ hash = ((uint32) value * NAIVE_HASHFN_MULTIPLIER + NAIVE_HASHFN_INCREMENT);
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid hash function ID: \"%d\"", set->hashfn_id)));
+ }
+
+ position = hash % set->capacity;
+
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ while (true)
+ {
+ byte = (position / 8);
+ bit = (position % 8);
+
+ /* Found an empty slot, value is not there */
+ if ((bitmap[byte] & (0x01 << bit)) == 0)
+ return false;
+
+ /* Is it the same value? */
+ if (values[position] == value)
+ return true;
+
+ /* Move to the next element */
+ position = (position + HASHSET_STEP) % set->capacity;
+
+ num_probes++; /* Increment the number of probes */
+
+ /* Check if we have probed all slots */
+ if (num_probes >= set->capacity)
+ return false; /* Avoid infinite loop */
+ }
+}
+
+int32 *
+int4hashset_extract_sorted_elements(int4hashset_t *set)
+{
+ /* Allocate memory for the elements array */
+ int32 *elements = palloc(set->nelements * sizeof(int32));
+
+ /* Access the data array */
+ char *bitmap = set->data;
+ int32 *values = (int32 *)(set->data + CEIL_DIV(set->capacity, 8));
+
+ /* Counter for the number of extracted elements */
+ int32 nextracted = 0;
+
+ /* Iterate through all elements */
+ for (int32 i = 0; i < set->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ /* Check if the current position is occupied */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Add the value to the elements array */
+ elements[nextracted++] = values[i];
+ }
+ }
+
+ /* Make sure we extracted the correct number of elements */
+ Assert(nextracted == set->nelements);
+
+ /* Sort the elements array */
+ qsort(elements, nextracted, sizeof(int32), int32_cmp);
+
+ /* Return the sorted elements array */
+ return elements;
+}
+
+int4hashset_t *
+int4hashset_copy(int4hashset_t *src)
+{
+ return src;
+}
+
+/*
+ * hashset_isspace() --- a non-locale-dependent isspace()
+ *
+ * Identical to array_isspace() in src/backend/utils/adt/arrayfuncs.c.
+ * We used to use isspace() for parsing hashset values, but that has
+ * undesirable results: a hashset value might be silently interpreted
+ * differently depending on the locale setting. So here, we hard-wire
+ * the traditional ASCII definition of isspace().
+ */
+bool
+hashset_isspace(char ch)
+{
+ if (ch == ' ' ||
+ ch == '\t' ||
+ ch == '\n' ||
+ ch == '\r' ||
+ ch == '\v' ||
+ ch == '\f')
+ return true;
+ return false;
+}
+
+/*
+ * Construct an SQL array from a simple C double array
+ */
+Datum
+int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len)
+{
+ ArrayBuildState *astate = NULL;
+ int i;
+
+ for (i = 0; i < len; i++)
+ {
+ /* Stash away this field */
+ astate = accumArrayResult(astate,
+ Int32GetDatum(d[i]),
+ false,
+ INT4OID,
+ CurrentMemoryContext);
+ }
+
+ PG_RETURN_DATUM(makeArrayResult(astate,
+ CurrentMemoryContext));
+}
+
+static int
+int32_cmp(const void *a, const void *b)
+{
+ int32 arg1 = *(const int32 *)a;
+ int32 arg2 = *(const int32 *)b;
+
+ if (arg1 < arg2) return -1;
+ if (arg1 > arg2) return 1;
+ return 0;
+}
diff --git a/hashset.control b/hashset.control
new file mode 100644
index 0000000..0743003
--- /dev/null
+++ b/hashset.control
@@ -0,0 +1,3 @@
+comment = 'Provides hashset type.'
+default_version = '0.0.1'
+relocatable = true
diff --git a/hashset.h b/hashset.h
new file mode 100644
index 0000000..3f22133
--- /dev/null
+++ b/hashset.h
@@ -0,0 +1,53 @@
+#ifndef HASHSET_H
+#define HASHSET_H
+
+#include "postgres.h"
+#include "libpq/pqformat.h"
+#include "nodes/memnodes.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "catalog/pg_type.h"
+#include "common/hashfn.h"
+
+#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
+#define HASHSET_STEP 13
+#define JENKINS_LOOKUP3_HASHFN_ID 1
+#define MURMURHASH32_HASHFN_ID 2
+#define NAIVE_HASHFN_ID 3
+#define NAIVE_HASHFN_MULTIPLIER 7691
+#define NAIVE_HASHFN_INCREMENT 4201
+
+/*
+ * These defaults should match the the SQL function int4hashset()
+ */
+#define DEFAULT_INITIAL_CAPACITY 0
+#define DEFAULT_LOAD_FACTOR 0.75
+#define DEFAULT_GROWTH_FACTOR 2.0
+#define DEFAULT_HASHFN_ID JENKINS_LOOKUP3_HASHFN_ID
+
+typedef struct int4hashset_t {
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ int32 flags; /* reserved for future use (versioning, ...) */
+ int32 capacity; /* max number of element we have space for */
+ int32 nelements; /* number of items added to the hashset */
+ int32 hashfn_id; /* ID of the hash function used */
+ float4 load_factor; /* Load factor before triggering resize */
+ float4 growth_factor; /* Growth factor when resizing the hashset */
+ int32 ncollisions; /* Number of collisions */
+ int32 max_collisions; /* Maximum collisions for a single element */
+ int32 hash; /* Stored hash value of the hashset */
+ char data[FLEXIBLE_ARRAY_MEMBER];
+} int4hashset_t;
+
+int4hashset_t *int4hashset_allocate(int capacity, float4 load_factor, float4 growth_factor, int hashfn_id);
+int4hashset_t *int4hashset_resize(int4hashset_t * set);
+int4hashset_t *int4hashset_add_element(int4hashset_t *set, int32 value);
+bool int4hashset_contains_element(int4hashset_t *set, int32 value);
+int32 *int4hashset_extract_sorted_elements(int4hashset_t *set);
+int4hashset_t *int4hashset_copy(int4hashset_t *src);
+bool hashset_isspace(char ch);
+Datum int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len);
+
+#endif /* HASHSET_H */
diff --git a/test/c_tests/test_send_recv.c b/test/c_tests/test_send_recv.c
new file mode 100644
index 0000000..cc7c48a
--- /dev/null
+++ b/test/c_tests/test_send_recv.c
@@ -0,0 +1,92 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <libpq-fe.h>
+
+void exit_nicely(PGconn *conn) {
+ PQfinish(conn);
+ exit(1);
+}
+
+int main() {
+ /* Connect to database specified by the PGDATABASE environment variable */
+ const char *hostname = getenv("PGHOST");
+ char conninfo[1024];
+ PGconn *conn;
+
+ if (hostname == NULL)
+ hostname = "localhost";
+
+ /* Connect to database specified by the PGDATABASE environment variable */
+ snprintf(conninfo, sizeof(conninfo), "host=%s port=5432", hostname);
+ conn = PQconnectdb(conninfo);
+ if (PQstatus(conn) != CONNECTION_OK) {
+ fprintf(stderr, "Connection to database failed: %s", PQerrorMessage(conn));
+ exit_nicely(conn);
+ }
+
+ /* Create extension */
+ PQexec(conn, "CREATE EXTENSION IF NOT EXISTS hashset");
+
+ /* Create temporary table */
+ PQexec(conn, "CREATE TABLE IF NOT EXISTS test_hashset_send_recv (hashset_col int4hashset)");
+
+ /* Enable binary output */
+ PQexec(conn, "SET bytea_output = 'escape'");
+
+ /* Insert dummy data */
+ const char *insert_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ('{1,2,3}'::int4hashset)";
+ PGresult *res = PQexec(conn, insert_command);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK) {
+ fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+ PQclear(res);
+
+ /* Fetch the data in binary format */
+ const char *select_command = "SELECT hashset_col FROM test_hashset_send_recv";
+ int resultFormat = 1; /* 0 = text, 1 = binary */
+ res = PQexecParams(conn, select_command, 0, NULL, NULL, NULL, NULL, resultFormat);
+ if (PQresultStatus(res) != PGRES_TUPLES_OK) {
+ fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+
+ /* Store binary data for later use */
+ const char *binary_data = PQgetvalue(res, 0, 0);
+ int binary_data_length = PQgetlength(res, 0, 0);
+ PQclear(res);
+
+ /* Re-insert the binary data */
+ const char *insert_binary_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ($1)";
+ const char *paramValues[1] = {binary_data};
+ int paramLengths[1] = {binary_data_length};
+ int paramFormats[1] = {1}; /* binary format */
+ res = PQexecParams(conn, insert_binary_command, 1, NULL, paramValues, paramLengths, paramFormats, 0);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK) {
+ fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+ PQclear(res);
+
+ /* Check the data */
+ const char *check_command = "SELECT COUNT(DISTINCT hashset_col) AS unique_count, COUNT(*) FROM test_hashset_send_recv";
+ res = PQexec(conn, check_command);
+ if (PQresultStatus(res) != PGRES_TUPLES_OK) {
+ fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+
+ /* Print the results */
+ printf("unique_count: %s\n", PQgetvalue(res, 0, 0));
+ printf("count: %s\n", PQgetvalue(res, 0, 1));
+ PQclear(res);
+
+ /* Disconnect */
+ PQfinish(conn);
+
+ return 0;
+}
diff --git a/test/c_tests/test_send_recv.sh b/test/c_tests/test_send_recv.sh
new file mode 100755
index 0000000..ab308b3
--- /dev/null
+++ b/test/c_tests/test_send_recv.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+# Get the directory of this script
+SCRIPT_DIR="$(dirname "$(realpath "$0")")"
+
+# Set up database
+export PGDATABASE=test_hashset_send_recv
+dropdb --if-exists "$PGDATABASE"
+createdb
+
+# Define directories
+EXPECTED_DIR="$SCRIPT_DIR/../expected"
+RESULTS_DIR="$SCRIPT_DIR/../results"
+
+# Create the results directory if it doesn't exist
+mkdir -p "$RESULTS_DIR"
+
+# Run the C test and save its output to the results directory
+"$SCRIPT_DIR/test_send_recv" > "$RESULTS_DIR/test_send_recv.out"
+
+printf "test test_send_recv ... "
+
+# Compare the actual output with the expected output
+if diff -q "$RESULTS_DIR/test_send_recv.out" "$EXPECTED_DIR/test_send_recv.out" > /dev/null 2>&1; then
+ echo "ok"
+ # Clean up by removing the results directory if the test passed
+ rm -r "$RESULTS_DIR"
+else
+ echo "failed"
+ git diff --no-index --color "$EXPECTED_DIR/test_send_recv.out" "$RESULTS_DIR/test_send_recv.out"
+fi
diff --git a/test/expected/basic.out b/test/expected/basic.out
new file mode 100644
index 0000000..b89ab52
--- /dev/null
+++ b/test/expected/basic.out
@@ -0,0 +1,304 @@
+/*
+ * Hashset Type
+ */
+SELECT '{}'::int4hashset; -- empty int4hashset
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset;
+ int4hashset
+-------------
+ {3,2,1}
+(1 row)
+
+SELECT '{-2147483648,0,2147483647}'::int4hashset;
+ int4hashset
+----------------------------
+ {0,2147483647,-2147483648}
+(1 row)
+
+SELECT '{-2147483649}'::int4hashset; -- out of range
+ERROR: value "-2147483649}" is out of range for type integer
+LINE 1: SELECT '{-2147483649}'::int4hashset;
+ ^
+SELECT '{2147483648}'::int4hashset; -- out of range
+ERROR: value "2147483648}" is out of range for type integer
+LINE 1: SELECT '{2147483648}'::int4hashset;
+ ^
+/*
+ * Hashset Functions
+ */
+SELECT int4hashset();
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT int4hashset(
+ capacity := 10,
+ load_factor := 0.9,
+ growth_factor := 1.1,
+ hashfn_id := 1
+);
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT hashset_add(int4hashset(), 123);
+ hashset_add
+-------------
+ {123}
+(1 row)
+
+SELECT hashset_add(NULL::int4hashset, 123);
+ hashset_add
+-------------
+ {123}
+(1 row)
+
+SELECT hashset_add('{123}'::int4hashset, 456);
+ hashset_add
+-------------
+ {456,123}
+(1 row)
+
+SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
+ hashset_contains
+------------------
+ t
+(1 row)
+
+SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
+ hashset_contains
+------------------
+ f
+(1 row)
+
+SELECT hashset_merge('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+ hashset_merge
+---------------
+ {3,1,2}
+(1 row)
+
+SELECT hashset_to_array('{1,2,3}'::int4hashset);
+ hashset_to_array
+------------------
+ {3,2,1}
+(1 row)
+
+SELECT hashset_count('{1,2,3}'::int4hashset); -- 3
+ hashset_count
+---------------
+ 3
+(1 row)
+
+SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
+ hashset_capacity
+------------------
+ 10
+(1 row)
+
+SELECT hashset_intersection('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+ hashset_intersection
+----------------------
+ {2}
+(1 row)
+
+SELECT hashset_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+ hashset_difference
+--------------------
+ {1}
+(1 row)
+
+SELECT hashset_symmetric_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+ hashset_symmetric_difference
+------------------------------
+ {1,3}
+(1 row)
+
+/*
+ * Aggregation Functions
+ */
+SELECT hashset_agg(i) FROM generate_series(1,10) AS i;
+ hashset_agg
+------------------------
+ {6,10,1,8,2,3,4,5,9,7}
+(1 row)
+
+SELECT hashset_agg(h) FROM
+(
+ SELECT hashset_agg(i) AS h FROM generate_series(1,5) AS i
+ UNION ALL
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
+) q;
+ hashset_agg
+------------------------
+ {6,8,1,3,2,10,4,5,9,7}
+(1 row)
+
+/*
+ * Operator Definitions
+ */
+SELECT '{2}'::int4hashset = '{1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset = '{2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset = '{3}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{2,3,1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{4,5,6}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3,4}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{2,3,1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{4,5,6}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset || 4;
+ ?column?
+-----------
+ {1,3,2,4}
+(1 row)
+
+SELECT 4 || '{1,2,3}'::int4hashset;
+ ?column?
+-----------
+ {1,3,2,4}
+(1 row)
+
+/*
+ * Hashset Hash Operators
+ */
+SELECT hashset_hash('{1,2,3}'::int4hashset);
+ hashset_hash
+--------------
+ 868123687
+(1 row)
+
+SELECT hashset_hash('{3,2,1}'::int4hashset);
+ hashset_hash
+--------------
+ 868123687
+(1 row)
+
+SELECT COUNT(*), COUNT(DISTINCT h)
+FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3,2,1}'::int4hashset AS h
+) q;
+ count | count
+-------+-------
+ 2 | 1
+(1 row)
+
+/*
+ * Hashset Btree Operators
+ *
+ * Ordering of hashsets is not based on lexicographic order of elements.
+ * - If two hashsets are not equal, they retain consistent relative order.
+ * - If two hashsets are equal but have elements in different orders, their
+ * ordering is non-deterministic. This is inherent since the comparison
+ * function must return 0 for equal hashsets, giving no indication of order.
+ */
+SELECT h FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{4,5,6}'::int4hashset AS h
+ UNION ALL
+ SELECT '{7,8,9}'::int4hashset AS h
+) q
+ORDER BY h;
+ h
+---------
+ {9,7,8}
+ {3,2,1}
+ {5,6,4}
+(3 rows)
+
diff --git a/test/expected/invalid.out b/test/expected/invalid.out
new file mode 100644
index 0000000..bd44199
--- /dev/null
+++ b/test/expected/invalid.out
@@ -0,0 +1,4 @@
+SELECT '{1,2s}'::int4hashset;
+ERROR: unexpected character "s" in hashset input
+LINE 1: SELECT '{1,2s}'::int4hashset;
+ ^
diff --git a/test/expected/io_varying_lengths.out b/test/expected/io_varying_lengths.out
new file mode 100644
index 0000000..45e9fb1
--- /dev/null
+++ b/test/expected/io_varying_lengths.out
@@ -0,0 +1,100 @@
+/*
+ * This test verifies the hashset input/output functions for varying
+ * initial capacities, ensuring functionality across different sizes.
+ */
+SELECT hashset_sorted('{1}'::int4hashset);
+ hashset_sorted
+----------------
+ {1}
+(1 row)
+
+SELECT hashset_sorted('{1,2}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::int4hashset);
+ hashset_sorted
+-----------------
+ {1,2,3,4,5,6,7}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::int4hashset);
+ hashset_sorted
+-------------------
+ {1,2,3,4,5,6,7,8}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::int4hashset);
+ hashset_sorted
+---------------------
+ {1,2,3,4,5,6,7,8,9}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::int4hashset);
+ hashset_sorted
+------------------------
+ {1,2,3,4,5,6,7,8,9,10}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::int4hashset);
+ hashset_sorted
+---------------------------
+ {1,2,3,4,5,6,7,8,9,10,11}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::int4hashset);
+ hashset_sorted
+------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::int4hashset);
+ hashset_sorted
+---------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::int4hashset);
+ hashset_sorted
+------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::int4hashset);
+ hashset_sorted
+---------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::int4hashset);
+ hashset_sorted
+------------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
+(1 row)
+
diff --git a/test/expected/parsing.out b/test/expected/parsing.out
new file mode 100644
index 0000000..263797e
--- /dev/null
+++ b/test/expected/parsing.out
@@ -0,0 +1,71 @@
+/* Valid */
+SELECT '{1,23,-456}'::int4hashset;
+ int4hashset
+-------------
+ {1,-456,23}
+(1 row)
+
+SELECT ' { 1 , 23 , -456 } '::int4hashset;
+ int4hashset
+-------------
+ {1,-456,23}
+(1 row)
+
+/* Only whitespace is allowed after the closing brace */
+SELECT ' { 1 , 23 , -456 } 1'::int4hashset; -- error
+ERROR: malformed hashset literal: "1"
+LINE 2: SELECT ' { 1 , 23 , -456 } 1'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } ,'::int4hashset; -- error
+ERROR: malformed hashset literal: ","
+LINE 1: SELECT ' { 1 , 23 , -456 } ,'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } {'::int4hashset; -- error
+ERROR: malformed hashset literal: "{"
+LINE 1: SELECT ' { 1 , 23 , -456 } {'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } }'::int4hashset; -- error
+ERROR: malformed hashset literal: "}"
+LINE 1: SELECT ' { 1 , 23 , -456 } }'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } x'::int4hashset; -- error
+ERROR: malformed hashset literal: "x"
+LINE 1: SELECT ' { 1 , 23 , -456 } x'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+/* Unexpected character when expecting closing brace */
+SELECT ' { 1 , 23 , -456 1'::int4hashset; -- error
+ERROR: unexpected character "1" in hashset input
+LINE 2: SELECT ' { 1 , 23 , -456 1'::int4hashset;
+ ^
+SELECT ' { 1 , 23 , -456 {'::int4hashset; -- error
+ERROR: unexpected character "{" in hashset input
+LINE 1: SELECT ' { 1 , 23 , -456 {'::int4hashset;
+ ^
+SELECT ' { 1 , 23 , -456 x'::int4hashset; -- error
+ERROR: unexpected character "x" in hashset input
+LINE 1: SELECT ' { 1 , 23 , -456 x'::int4hashset;
+ ^
+/* Error handling for strtol */
+SELECT ' { , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for integer: ", 23 , -456 } "
+LINE 2: SELECT ' { , 23 , -456 } '::int4hashset;
+ ^
+SELECT ' { 1 , 23 , '::int4hashset; -- error
+ERROR: invalid input syntax for integer: ""
+LINE 1: SELECT ' { 1 , 23 , '::int4hashset;
+ ^
+SELECT ' { s , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for integer: "s , 23 , -456 } "
+LINE 1: SELECT ' { s , 23 , -456 } '::int4hashset;
+ ^
+/* Missing opening brace */
+SELECT ' 1 , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for hashset: "1 , 23 , -456 } "
+LINE 2: SELECT ' 1 , 23 , -456 } '::int4hashset;
+ ^
+DETAIL: Hashset representation must start with "{".
diff --git a/test/expected/prelude.out b/test/expected/prelude.out
new file mode 100644
index 0000000..f34e190
--- /dev/null
+++ b/test/expected/prelude.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION hashset;
+CREATE OR REPLACE FUNCTION hashset_sorted(int4hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/expected/random.out b/test/expected/random.out
new file mode 100644
index 0000000..9d9026b
--- /dev/null
+++ b/test/expected/random.out
@@ -0,0 +1,38 @@
+SELECT setseed(0.12345);
+ setseed
+---------
+
+(1 row)
+
+\set MAX_INT 2147483647
+CREATE TABLE hashset_random_int4_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(int4hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_int4_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_int4_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
diff --git a/test/expected/reported_bugs.out b/test/expected/reported_bugs.out
new file mode 100644
index 0000000..b356b64
--- /dev/null
+++ b/test/expected/reported_bugs.out
@@ -0,0 +1,138 @@
+/*
+ * Bug in hashset_add() and hashset_merge() functions altering original hashset.
+ *
+ * Previously, the hashset_add() and hashset_merge() functions were modifying the
+ * original hashset in-place, leading to unexpected results as the original data
+ * within the hashset was being altered.
+ *
+ * The issue was addressed by implementing a macro function named
+ * PG_GETARG_INT4HASHSET_COPY() within the C code. This function guarantees that
+ * a copy of the hashset is created and subsequently modified, thereby preserving
+ * the integrity of the original hashset.
+ *
+ * As a result of this fix, hashset_add() and hashset_merge() now operate on
+ * a copied hashset, ensuring that the original data remains unaltered, and
+ * the query executes correctly.
+ */
+SELECT
+ q.hashset_agg,
+ hashset_add(hashset_agg,4)
+FROM
+(
+ SELECT
+ hashset_agg(generate_series)
+ FROM generate_series(1,3)
+) q;
+ hashset_agg | hashset_add
+-------------+-------------
+ {3,1,2} | {3,4,1,2}
+(1 row)
+
+/*
+ * Bug in hashset_hash() function with respect to element insertion order.
+ *
+ * Prior to the fix, the hashset_hash() function was accumulating the hashes
+ * of individual elements in a non-commutative manner. As a consequence, the
+ * final hash value was sensitive to the order in which elements were inserted
+ * into the hashset. This behavior led to inconsistencies, as logically
+ * equivalent sets (i.e., sets with the same elements but in different orders)
+ * produced different hash values.
+ *
+ * The bug was fixed by modifying the hashset_hash() function to use a
+ * commutative operation when combining the hashes of individual elements.
+ * This change ensures that the final hash value is independent of the
+ * element insertion order, and logically equivalent sets produce the
+ * same hash.
+ */
+SELECT hashset_hash('{1,2}'::int4hashset);
+ hashset_hash
+--------------
+ -840053840
+(1 row)
+
+SELECT hashset_hash('{2,1}'::int4hashset);
+ hashset_hash
+--------------
+ -840053840
+(1 row)
+
+SELECT hashset_cmp('{1,2}','{2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2}');
+ hashset_cmp
+-------------
+ 0
+(1 row)
+
+/*
+ * Bug in int4hashset_resize() not utilizing growth_factor.
+ *
+ * The previous implementation hard-coded a growth factor of 2, neglecting
+ * the struct's growth_factor field. This bug was addressed by properly
+ * using growth_factor for new capacity calculation, with an additional
+ * safety check to prevent possible infinite loops in resizing.
+ */
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 1.1
+), 123), 456));
+ hashset_capacity
+------------------
+ 2
+(1 row)
+
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 10
+), 123), 456));
+ hashset_capacity
+------------------
+ 10
+(1 row)
+
+/*
+ * Bug in int4hashset_capacity() not detoasting input correctly.
+ */
+SELECT hashset_capacity(int4hashset(capacity:=10)) AS capacity_10;
+ capacity_10
+-------------
+ 10
+(1 row)
+
+SELECT hashset_capacity(int4hashset(capacity:=1000)) AS capacity_1000;
+ capacity_1000
+---------------
+ 1000
+(1 row)
+
+SELECT hashset_capacity(int4hashset(capacity:=100000)) AS capacity_100000;
+ capacity_100000
+-----------------
+ 100000
+(1 row)
+
+CREATE TABLE test_capacity_10 AS SELECT int4hashset(capacity:=10) AS capacity_10;
+CREATE TABLE test_capacity_1000 AS SELECT int4hashset(capacity:=1000) AS capacity_1000;
+CREATE TABLE test_capacity_100000 AS SELECT int4hashset(capacity:=100000) AS capacity_100000;
+SELECT hashset_capacity(capacity_10) AS capacity_10 FROM test_capacity_10;
+ capacity_10
+-------------
+ 10
+(1 row)
+
+SELECT hashset_capacity(capacity_1000) AS capacity_1000 FROM test_capacity_1000;
+ capacity_1000
+---------------
+ 1000
+(1 row)
+
+SELECT hashset_capacity(capacity_100000) AS capacity_100000 FROM test_capacity_100000;
+ capacity_100000
+-----------------
+ 100000
+(1 row)
+
diff --git a/test/expected/table.out b/test/expected/table.out
new file mode 100644
index 0000000..9793a49
--- /dev/null
+++ b/test/expected/table.out
@@ -0,0 +1,25 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes int4hashset DEFAULT int4hashset(capacity := 2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+ hashset_contains
+------------------
+ t
+(1 row)
+
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1;
+ hashset_count
+---------------
+ 2
+(1 row)
+
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
+ hashset_sorted
+----------------
+ {101,202}
+(1 row)
+
diff --git a/test/expected/test_send_recv.out b/test/expected/test_send_recv.out
new file mode 100644
index 0000000..12382d5
--- /dev/null
+++ b/test/expected/test_send_recv.out
@@ -0,0 +1,2 @@
+unique_count: 1
+count: 2
diff --git a/test/sql/basic.sql b/test/sql/basic.sql
new file mode 100644
index 0000000..8688666
--- /dev/null
+++ b/test/sql/basic.sql
@@ -0,0 +1,108 @@
+/*
+ * Hashset Type
+ */
+
+SELECT '{}'::int4hashset; -- empty int4hashset
+SELECT '{1,2,3}'::int4hashset;
+SELECT '{-2147483648,0,2147483647}'::int4hashset;
+SELECT '{-2147483649}'::int4hashset; -- out of range
+SELECT '{2147483648}'::int4hashset; -- out of range
+
+/*
+ * Hashset Functions
+ */
+
+SELECT int4hashset();
+SELECT int4hashset(
+ capacity := 10,
+ load_factor := 0.9,
+ growth_factor := 1.1,
+ hashfn_id := 1
+);
+SELECT hashset_add(int4hashset(), 123);
+SELECT hashset_add(NULL::int4hashset, 123);
+SELECT hashset_add('{123}'::int4hashset, 456);
+SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
+SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
+SELECT hashset_merge('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+SELECT hashset_to_array('{1,2,3}'::int4hashset);
+SELECT hashset_count('{1,2,3}'::int4hashset); -- 3
+SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
+SELECT hashset_intersection('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+SELECT hashset_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+SELECT hashset_symmetric_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+
+/*
+ * Aggregation Functions
+ */
+
+SELECT hashset_agg(i) FROM generate_series(1,10) AS i;
+
+SELECT hashset_agg(h) FROM
+(
+ SELECT hashset_agg(i) AS h FROM generate_series(1,5) AS i
+ UNION ALL
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
+) q;
+
+/*
+ * Operator Definitions
+ */
+
+SELECT '{2}'::int4hashset = '{1}'::int4hashset; -- false
+SELECT '{2}'::int4hashset = '{2}'::int4hashset; -- true
+SELECT '{2}'::int4hashset = '{3}'::int4hashset; -- false
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset = '{2,3,1}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset = '{4,5,6}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset = '{1,2}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset = '{1,2,3,4}'::int4hashset; -- false
+
+SELECT '{2}'::int4hashset <> '{1}'::int4hashset; -- true
+SELECT '{2}'::int4hashset <> '{2}'::int4hashset; -- false
+SELECT '{2}'::int4hashset <> '{3}'::int4hashset; -- true
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset <> '{2,3,1}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset <> '{4,5,6}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset <> '{1,2}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
+
+SELECT '{1,2,3}'::int4hashset || 4;
+SELECT 4 || '{1,2,3}'::int4hashset;
+
+/*
+ * Hashset Hash Operators
+ */
+
+SELECT hashset_hash('{1,2,3}'::int4hashset);
+SELECT hashset_hash('{3,2,1}'::int4hashset);
+
+SELECT COUNT(*), COUNT(DISTINCT h)
+FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3,2,1}'::int4hashset AS h
+) q;
+
+/*
+ * Hashset Btree Operators
+ *
+ * Ordering of hashsets is not based on lexicographic order of elements.
+ * - If two hashsets are not equal, they retain consistent relative order.
+ * - If two hashsets are equal but have elements in different orders, their
+ * ordering is non-deterministic. This is inherent since the comparison
+ * function must return 0 for equal hashsets, giving no indication of order.
+ */
+
+SELECT h FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{4,5,6}'::int4hashset AS h
+ UNION ALL
+ SELECT '{7,8,9}'::int4hashset AS h
+) q
+ORDER BY h;
diff --git a/test/sql/benchmark.sql b/test/sql/benchmark.sql
new file mode 100644
index 0000000..1535c22
--- /dev/null
+++ b/test/sql/benchmark.sql
@@ -0,0 +1,191 @@
+DROP EXTENSION IF EXISTS hashset CASCADE;
+CREATE EXTENSION hashset;
+
+\timing on
+
+\echo * Benchmark array_agg(DISTINCT ...) vs hashset_agg()
+
+DROP TABLE IF EXISTS benchmark_input_100k;
+DROP TABLE IF EXISTS benchmark_input_10M;
+DROP TABLE IF EXISTS benchmark_array_agg;
+DROP TABLE IF EXISTS benchmark_hashset_agg;
+
+SELECT setseed(0.12345);
+
+CREATE TABLE benchmark_input_100k AS
+SELECT
+ i,
+ i/10 AS j,
+ (floor(4294967296 * random()) - 2147483648)::int AS rnd
+FROM generate_series(1,100000) AS i;
+
+CREATE TABLE benchmark_input_10M AS
+SELECT
+ i,
+ i/10 AS j,
+ (floor(4294967296 * random()) - 2147483648)::int AS rnd
+FROM generate_series(1,10000000) AS i;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 100k unique integers
+CREATE TABLE benchmark_array_agg AS
+SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
+CREATE TABLE benchmark_hashset_agg AS
+SELECT hashset_agg(i) FROM benchmark_input_100k;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 10M unique integers
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(i) FROM benchmark_input_10M;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 100k integers (10% uniqueness)
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT j) FROM benchmark_input_100k;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(j) FROM benchmark_input_100k;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 10M integers (10% uniqueness)
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT j) FROM benchmark_input_10M;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(j) FROM benchmark_input_10M;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 100k random integers
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT rnd) FROM benchmark_input_100k;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(rnd) FROM benchmark_input_100k;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 10M random integers
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(rnd) FROM benchmark_input_10M;
+
+SELECT cardinality(array_agg) FROM benchmark_array_agg ORDER BY 1;
+
+SELECT
+ hashset_count(hashset_agg),
+ hashset_capacity(hashset_agg),
+ hashset_collisions(hashset_agg),
+ hashset_max_collisions(hashset_agg)
+FROM benchmark_hashset_agg;
+
+SELECT hashset_capacity(hashset_agg(rnd)) FROM benchmark_input_10M;
+
+\echo * Benchmark different hash functions
+
+\echo *** Elements in sequence 1..100000
+
+\echo - Testing default hash function (Jenkins/lookup3)
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 1);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing Murmurhash32
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 2);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing naive hash function
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 3);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo *** Testing 100000 random ints
+
+SELECT setseed(0.12345);
+\echo - Testing default hash function (Jenkins/lookup3)
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 1);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+SELECT setseed(0.12345);
+\echo - Testing Murmurhash32
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 2);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+SELECT setseed(0.12345);
+\echo - Testing naive hash function
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 3);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
diff --git a/test/sql/invalid.sql b/test/sql/invalid.sql
new file mode 100644
index 0000000..43689ab
--- /dev/null
+++ b/test/sql/invalid.sql
@@ -0,0 +1 @@
+SELECT '{1,2s}'::int4hashset;
diff --git a/test/sql/io_varying_lengths.sql b/test/sql/io_varying_lengths.sql
new file mode 100644
index 0000000..8acb6b8
--- /dev/null
+++ b/test/sql/io_varying_lengths.sql
@@ -0,0 +1,21 @@
+/*
+ * This test verifies the hashset input/output functions for varying
+ * initial capacities, ensuring functionality across different sizes.
+ */
+
+SELECT hashset_sorted('{1}'::int4hashset);
+SELECT hashset_sorted('{1,2}'::int4hashset);
+SELECT hashset_sorted('{1,2,3}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::int4hashset);
diff --git a/test/sql/parsing.sql b/test/sql/parsing.sql
new file mode 100644
index 0000000..1e56bbe
--- /dev/null
+++ b/test/sql/parsing.sql
@@ -0,0 +1,23 @@
+/* Valid */
+SELECT '{1,23,-456}'::int4hashset;
+SELECT ' { 1 , 23 , -456 } '::int4hashset;
+
+/* Only whitespace is allowed after the closing brace */
+SELECT ' { 1 , 23 , -456 } 1'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } ,'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } {'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } }'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } x'::int4hashset; -- error
+
+/* Unexpected character when expecting closing brace */
+SELECT ' { 1 , 23 , -456 1'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 {'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 x'::int4hashset; -- error
+
+/* Error handling for strtol */
+SELECT ' { , 23 , -456 } '::int4hashset; -- error
+SELECT ' { 1 , 23 , '::int4hashset; -- error
+SELECT ' { s , 23 , -456 } '::int4hashset; -- error
+
+/* Missing opening brace */
+SELECT ' 1 , 23 , -456 } '::int4hashset; -- error
diff --git a/test/sql/prelude.sql b/test/sql/prelude.sql
new file mode 100644
index 0000000..2fee0fc
--- /dev/null
+++ b/test/sql/prelude.sql
@@ -0,0 +1,8 @@
+CREATE EXTENSION hashset;
+
+CREATE OR REPLACE FUNCTION hashset_sorted(int4hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/sql/random.sql b/test/sql/random.sql
new file mode 100644
index 0000000..7cc8f87
--- /dev/null
+++ b/test/sql/random.sql
@@ -0,0 +1,27 @@
+SELECT setseed(0.12345);
+
+\set MAX_INT 2147483647
+
+CREATE TABLE hashset_random_int4_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(int4hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_int4_numbers
+) q;
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_int4_numbers
+) q;
diff --git a/test/sql/reported_bugs.sql b/test/sql/reported_bugs.sql
new file mode 100644
index 0000000..9166f5d
--- /dev/null
+++ b/test/sql/reported_bugs.sql
@@ -0,0 +1,85 @@
+/*
+ * Bug in hashset_add() and hashset_merge() functions altering original hashset.
+ *
+ * Previously, the hashset_add() and hashset_merge() functions were modifying the
+ * original hashset in-place, leading to unexpected results as the original data
+ * within the hashset was being altered.
+ *
+ * The issue was addressed by implementing a macro function named
+ * PG_GETARG_INT4HASHSET_COPY() within the C code. This function guarantees that
+ * a copy of the hashset is created and subsequently modified, thereby preserving
+ * the integrity of the original hashset.
+ *
+ * As a result of this fix, hashset_add() and hashset_merge() now operate on
+ * a copied hashset, ensuring that the original data remains unaltered, and
+ * the query executes correctly.
+ */
+SELECT
+ q.hashset_agg,
+ hashset_add(hashset_agg,4)
+FROM
+(
+ SELECT
+ hashset_agg(generate_series)
+ FROM generate_series(1,3)
+) q;
+
+/*
+ * Bug in hashset_hash() function with respect to element insertion order.
+ *
+ * Prior to the fix, the hashset_hash() function was accumulating the hashes
+ * of individual elements in a non-commutative manner. As a consequence, the
+ * final hash value was sensitive to the order in which elements were inserted
+ * into the hashset. This behavior led to inconsistencies, as logically
+ * equivalent sets (i.e., sets with the same elements but in different orders)
+ * produced different hash values.
+ *
+ * The bug was fixed by modifying the hashset_hash() function to use a
+ * commutative operation when combining the hashes of individual elements.
+ * This change ensures that the final hash value is independent of the
+ * element insertion order, and logically equivalent sets produce the
+ * same hash.
+ */
+SELECT hashset_hash('{1,2}'::int4hashset);
+SELECT hashset_hash('{2,1}'::int4hashset);
+
+SELECT hashset_cmp('{1,2}','{2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2}');
+
+/*
+ * Bug in int4hashset_resize() not utilizing growth_factor.
+ *
+ * The previous implementation hard-coded a growth factor of 2, neglecting
+ * the struct's growth_factor field. This bug was addressed by properly
+ * using growth_factor for new capacity calculation, with an additional
+ * safety check to prevent possible infinite loops in resizing.
+ */
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 1.1
+), 123), 456));
+
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 10
+), 123), 456));
+
+/*
+ * Bug in int4hashset_capacity() not detoasting input correctly.
+ */
+SELECT hashset_capacity(int4hashset(capacity:=10)) AS capacity_10;
+SELECT hashset_capacity(int4hashset(capacity:=1000)) AS capacity_1000;
+SELECT hashset_capacity(int4hashset(capacity:=100000)) AS capacity_100000;
+
+CREATE TABLE test_capacity_10 AS SELECT int4hashset(capacity:=10) AS capacity_10;
+CREATE TABLE test_capacity_1000 AS SELECT int4hashset(capacity:=1000) AS capacity_1000;
+CREATE TABLE test_capacity_100000 AS SELECT int4hashset(capacity:=100000) AS capacity_100000;
+
+SELECT hashset_capacity(capacity_10) AS capacity_10 FROM test_capacity_10;
+SELECT hashset_capacity(capacity_1000) AS capacity_1000 FROM test_capacity_1000;
+SELECT hashset_capacity(capacity_100000) AS capacity_100000 FROM test_capacity_100000;
diff --git a/test/sql/table.sql b/test/sql/table.sql
new file mode 100644
index 0000000..0472352
--- /dev/null
+++ b/test/sql/table.sql
@@ -0,0 +1,10 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes int4hashset DEFAULT int4hashset(capacity := 2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+SELECT hashset_count(user_likes) FROM users WHERE user_id = 1;
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
On Sat, Jun 17, 2023 at 8:38 AM Joel Jacobson <joel@compiler.org> wrote:
On Fri, Jun 16, 2023, at 17:42, Joel Jacobson wrote:
I realise int4hashset_hash() is broken,
since two int4hashset's that are considered equal,
can by coincidence get different hashes:...
Do we have any ideas on how to fix this without sacrificing performance?
The problem was due to hashset_hash() function accumulating the hashes
of individual elements in a non-commutative manner. As a consequence, the
final hash value was sensitive to the order in which elements were inserted
into the hashset. This behavior led to inconsistencies, as logically
equivalent sets (i.e., sets with the same elements but in different orders)
produced different hash values.Solved by modifying the hashset_hash() function to use a commutative operation
when combining the hashes of individual elements. This change ensures that the
final hash value is independent of the element insertion order, and logically
equivalent sets produce the same hash.An somewhat unfortunate side-effect of this fix, is that we can no longer
visually sort the hashset output format, since it's not lexicographically sorted.
I think this is an acceptable trade-off for a hashset type,
since the only alternative I see would be to sort the elements,
but then it wouldn't be a hashset, but a treeset, which different
Big-O complexity.New patch is attached, which will henceforth always be a complete patch,
to avoid the hassle of having to assemble incremental patches./Joel
select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?
---------------------------------------------------------------------------------
SELECT attname
,pc.relname
,CASE attstorage
WHEN 'p' THEN 'plain'
WHEN 'e' THEN 'external'
WHEN 'm' THEN 'main'
WHEN 'x' THEN 'extended'
END AS storage
FROM pg_attribute pa
join pg_class pc on pc.oid = pa.attrelid
where attnum > 0 and pa.attstorage = 'e';
In my system catalog, it seems only the hashset type storage =
'external'. most is extended.....
I am not sure the consequence of switch from external to extended.
------------------------------------------------------------------------------------------------------------
select hashset_hash('{-1,1}') as a1
,hashset_hash('{1,-2}') as a2
,hashset_hash('{-3,1}') as a3
,hashset_hash('{4,1}') as a4;
returns:
a1 | a2 | a3 | a4
-------------+-----------+------------+------------
-1735582196 | 998516167 | 1337000903 | 1305426029
(1 row)
values {a1,a2,a3,a4} should be monotone increasing, based on the
function int4hashset_cmp, but now it's not.
so the following queries failed.
--should return only one row.
select hashset_cmp('{2,1}','{3,1}')
union
select hashset_cmp('{3,1}','{4,1}')
union
select hashset_cmp('{1,3}','{4,1}');
select hashset_cmp('{9,10,11}','{10,9,-11}') =
hashset_cmp('{9,10,11}','{10,9,-1}'); --should be true
select '{2,1}'::int4hashset > '{7}'::int4hashset; --should be false.
based on int array comparison,.
-----------------------------------------------------------------------------------------
I comment out following lines in hashset-api.c somewhere between {810,829}
// if (a->hash < b->hash)
// PG_RETURN_INT32(-1);
// else if (a->hash > b->hash)
// PG_RETURN_INT32(1);
// if (a->nelements < b->nelements)
// PG_RETURN_INT32(-1);
// else if (a->nelements > b->nelements)
// PG_RETURN_INT32(1);
// Assert(a->nelements == b->nelements);
So hashset_cmp will directly compare int array. the above queries works.
{int4hashset_equals,int4hashset_neq} two special cases of hashset_cmp.
maybe we can just wrap it just like int4hashset_le?
now store 10 element int4hashset need 99 bytes, similar one dimension
bigint array with length 10, occupy 101 byte....
in int4hashset_send, newly add struct attributes/member {load_factor
growth_factor ncollisions hash} also need send to buf?
On Mon, Jun 19, 2023, at 02:00, jian he wrote:
select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?
Hmm, that's a good philosophical question.
I notice Tomas Vondra in the initial commit opted for allowing NULL inputs,
treating them as empty sets, e.g. in int4hashset_add() we create a
new hashset if the first argument is NULL.
I guess the easiest perhaps most consistent NULL-handling strategy
would be to just mark all relevant functions STRICT except for the agg ones
since we probably want to allow skipping over rows with NULL values
without the entire result becoming NULL.
But if we're not just going the STRICT route, then I think it's a bit more tricky,
since you could argue the hashset_contains() example should return FALSE
since the set doesn't contain the NULL value, but OTOH, since we don't
store NULL values, we don't know if has ever been added, hence a NULL
result would perhaps make more sense.
I think I lean on thinking that if we want to be "NULL-friendly", like we
currently are in hashset_add(), it would probably be most user-friendly
to be consistent and let all functions return non-null return values in
all cases where it is not unreasonable.
Since we're essentially designing a set-theoretic system, I think we should
aim for the logical "soundness" property of it and think about how we can
verify that it is.
Thoughts?
/Joel
On 6/18/23 18:45, Andrew Dunstan wrote:
On 2023-06-16 Fr 20:38, Joel Jacobson wrote:
New patch is attached, which will henceforth always be a complete patch,
to avoid the hassle of having to assemble incremental patches.Cool, thanks.
It might still be convenient to keep it split into smaller, easier to
review, parts. A patch that introduces basic functionality and then
patches adding various "advanced" features.
A couple of random thoughts:
. It might be worth sending a version number with the send function
(c.f. jsonb_send / jsonb_recv). That way would would not be tied forever
to some wire representation.. I think there are some important set operations missing: most notably
intersection, slightly less importantly asymmetric and symmetric
difference. I have no idea how easy these would be to add, but even for
your stated use I should have thought set intersection would be useful
("Who is a member of both this set of friends and that set of friends?").. While supporting int4 only is OK for now, I think we would at least
want to support int8, and probably UUID since a number of systems I know
of use that as an object identifier.
I agree we should aim to support a wider range of data types. Could we
have a polymorphic type, similar to what we do for arrays and ranges? In
fact, CREATE TYPE allows specifying ELEMENT, so wouldn't it be possible
to implement this as a special variant of an array? Would be better than
having a set of functions for every supported data type.
(Note: It might still be possible to have a special implementation for
selected fixed-length data types, as it allows optimization at compile
time. But that could be done later.)
The other thing I've been thinking about is the SQL syntax and what does
the SQL standard says about this.
AFAICS the standard only defines arrays and multisets. Arrays are pretty
much the thing we have, including the ARRAY[] constructor etc. Multisets
are similar to hashset discussed here, except that it tracks the number
of elements for each value (which would be trivial in hashset).
So if we want to make this a built-in feature, maybe we should aim to do
the multiset thing, with the standard SQL syntax? Extending the grammar
should not be hard, I think. I'm not sure of the underlying code
(ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
lot of separate code doing that.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 19, 2023 at 2:51 PM Joel Jacobson <joel@compiler.org> wrote:
On Mon, Jun 19, 2023, at 02:00, jian he wrote:
select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?Hmm, that's a good philosophical question.
I notice Tomas Vondra in the initial commit opted for allowing NULL
inputs,
treating them as empty sets, e.g. in int4hashset_add() we create a
new hashset if the first argument is NULL.I guess the easiest perhaps most consistent NULL-handling strategy
would be to just mark all relevant functions STRICT except for the agg
ones
since we probably want to allow skipping over rows with NULL values
without the entire result becoming NULL.But if we're not just going the STRICT route, then I think it's a bit
more tricky,
since you could argue the hashset_contains() example should return FALSE
since the set doesn't contain the NULL value, but OTOH, since we don't
store NULL values, we don't know if has ever been added, hence a NULL
result would perhaps make more sense.I think I lean on thinking that if we want to be "NULL-friendly", like we
currently are in hashset_add(), it would probably be most user-friendly
to be consistent and let all functions return non-null return values in
all cases where it is not unreasonable.Since we're essentially designing a set-theoretic system, I think we
should
aim for the logical "soundness" property of it and think about how we can
verify that it is.Thoughts?
/Joel
hashset_to_array function should be strict?
I noticed hashset_symmetric_difference and hashset_difference handle null
in a different way, seems they should handle null in a consistent way?
select '{1,2,NULL}'::int[] operator (pg_catalog.@>) '{NULL}'::int[]; --false
select '{1,2,NULL}'::int[] operator (pg_catalog.&&) '{NULL}'::int[];
--false.
So similarly I guess hashset_contains should be false.
select hashset_contains('{1,2}'::int4hashset,NULL::int);
On Mon, Jun 19, 2023, at 11:21, Tomas Vondra wrote:
AFAICS the standard only defines arrays and multisets. Arrays are pretty
much the thing we have, including the ARRAY[] constructor etc. Multisets
are similar to hashset discussed here, except that it tracks the number
of elements for each value (which would be trivial in hashset).So if we want to make this a built-in feature, maybe we should aim to do
the multiset thing, with the standard SQL syntax? Extending the grammar
should not be hard, I think. I'm not sure of the underlying code
(ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
lot of separate code doing that.
Multisets handle duplicates uniquely, this may bring unexpected issues. Sets
and multisets have distinct utility in C++, Rust, Java, etc. However, sets are
more fundamental and prevalent in std libs than multisets.
Despite SQL's multiset possibility, a distinct hashset type is my preference,
helping appropriate data structure choice and reducing misuse.
The necessity of multisets is vague beyond standards compliance.
/Joel
On 2023-06-19 Mo 05:21, Tomas Vondra wrote:
On 6/18/23 18:45, Andrew Dunstan wrote:
On 2023-06-16 Fr 20:38, Joel Jacobson wrote:
New patch is attached, which will henceforth always be a complete patch,
to avoid the hassle of having to assemble incremental patches.Cool, thanks.
It might still be convenient to keep it split into smaller, easier to
review, parts. A patch that introduces basic functionality and then
patches adding various "advanced" features.A couple of random thoughts:
. It might be worth sending a version number with the send function
(c.f. jsonb_send / jsonb_recv). That way would would not be tied forever
to some wire representation.. I think there are some important set operations missing: most notably
intersection, slightly less importantly asymmetric and symmetric
difference. I have no idea how easy these would be to add, but even for
your stated use I should have thought set intersection would be useful
("Who is a member of both this set of friends and that set of friends?").. While supporting int4 only is OK for now, I think we would at least
want to support int8, and probably UUID since a number of systems I know
of use that as an object identifier.I agree we should aim to support a wider range of data types. Could we
have a polymorphic type, similar to what we do for arrays and ranges? In
fact, CREATE TYPE allows specifying ELEMENT, so wouldn't it be possible
to implement this as a special variant of an array? Would be better than
having a set of functions for every supported data type.(Note: It might still be possible to have a special implementation for
selected fixed-length data types, as it allows optimization at compile
time. But that could be done later.)
Interesting idea. There's also the keyword SETOF that we could possibly
make use of.
The other thing I've been thinking about is the SQL syntax and what does
the SQL standard says about this.AFAICS the standard only defines arrays and multisets. Arrays are pretty
much the thing we have, including the ARRAY[] constructor etc. Multisets
are similar to hashset discussed here, except that it tracks the number
of elements for each value (which would be trivial in hashset).So if we want to make this a built-in feature, maybe we should aim to do
the multiset thing, with the standard SQL syntax? Extending the grammar
should not be hard, I think. I'm not sure of the underlying code
(ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
lot of separate code doing that.
Yes, Multisets (a.k.a. bags and a large number of other names) would be
interesting. But I wouldn't like to abandon pure sets either. Maybe a
typmod indicating the allowed multiplicity of the type?
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On Mon, Jun 19, 2023, at 11:49, jian he wrote:
hashset_to_array function should be strict?
I noticed hashset_symmetric_difference and hashset_difference handle
null in a different way, seems they should handle null in a consistent
way?
Yes, I agree, they should be consistent.
I've thought a bit more on this, and came to the conclusion that I think it
would be easiest, safest and least confusing to just mark all functions STRICT.
That way, it's the user's responsibility to ensure null operands are not passed
to the functions, which is simply a WHERE ... or FILTER (WHERE ...). And if
making a mistake and passing, it's better to make the entire result blow up by
letting the result be NULL, than to silently ignore the operand or return some
true/false value that is questionable.
SQL has a quite unique NULL handling compared to other languages, so I think
it's better to let the user use the full arsenal of SQL to deal with nulls,
rather than trying to shoehorn some null semantics into a set-theoretic system.
/Joel
On 6/19/23 13:50, Andrew Dunstan wrote:
On 2023-06-19 Mo 05:21, Tomas Vondra wrote:
On 6/18/23 18:45, Andrew Dunstan wrote:
On 2023-06-16 Fr 20:38, Joel Jacobson wrote:
New patch is attached, which will henceforth always be a complete patch,
to avoid the hassle of having to assemble incremental patches.Cool, thanks.
It might still be convenient to keep it split into smaller, easier to
review, parts. A patch that introduces basic functionality and then
patches adding various "advanced" features.A couple of random thoughts:
. It might be worth sending a version number with the send function
(c.f. jsonb_send / jsonb_recv). That way would would not be tied forever
to some wire representation.. I think there are some important set operations missing: most notably
intersection, slightly less importantly asymmetric and symmetric
difference. I have no idea how easy these would be to add, but even for
your stated use I should have thought set intersection would be useful
("Who is a member of both this set of friends and that set of friends?").. While supporting int4 only is OK for now, I think we would at least
want to support int8, and probably UUID since a number of systems I know
of use that as an object identifier.I agree we should aim to support a wider range of data types. Could we
have a polymorphic type, similar to what we do for arrays and ranges? In
fact, CREATE TYPE allows specifying ELEMENT, so wouldn't it be possible
to implement this as a special variant of an array? Would be better than
having a set of functions for every supported data type.(Note: It might still be possible to have a special implementation for
selected fixed-length data types, as it allows optimization at compile
time. But that could be done later.)Interesting idea. There's also the keyword SETOF that we could possibly
make use of.The other thing I've been thinking about is the SQL syntax and what does
the SQL standard says about this.AFAICS the standard only defines arrays and multisets. Arrays are pretty
much the thing we have, including the ARRAY[] constructor etc. Multisets
are similar to hashset discussed here, except that it tracks the number
of elements for each value (which would be trivial in hashset).So if we want to make this a built-in feature, maybe we should aim to do
the multiset thing, with the standard SQL syntax? Extending the grammar
should not be hard, I think. I'm not sure of the underlying code
(ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
lot of separate code doing that.Yes, Multisets (a.k.a. bags and a large number of other names) would be
interesting. But I wouldn't like to abandon pure sets either. Maybe a
typmod indicating the allowed multiplicity of the type?
Maybe, although I'm not sure if that can be specified with a multiset
constructor, i.e. when using MULTISET[...] in places where we now use
ARRAY[...] to specify arrays.
I was thinking more about having one set of operators, one considering
the duplicity (and thus doing what SQL standard says) and one ignoring
it (thus treating MULTISETS as plain sets).
Anyway, I'm just thinking aloud. I'm not sure if this is the way to do,
but it'd be silly to end up implementing stuff unnecessarily and/or
inventing something that contradicts the SQL standard (or is somehow
inconsistent with similar stuff).
regard
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/19/23 13:33, Joel Jacobson wrote:
On Mon, Jun 19, 2023, at 11:21, Tomas Vondra wrote:
AFAICS the standard only defines arrays and multisets. Arrays are pretty
much the thing we have, including the ARRAY[] constructor etc. Multisets
are similar to hashset discussed here, except that it tracks the number
of elements for each value (which would be trivial in hashset).So if we want to make this a built-in feature, maybe we should aim to do
the multiset thing, with the standard SQL syntax? Extending the grammar
should not be hard, I think. I'm not sure of the underlying code
(ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
lot of separate code doing that.Multisets handle duplicates uniquely, this may bring unexpected issues. Sets
and multisets have distinct utility in C++, Rust, Java, etc. However, sets are
more fundamental and prevalent in std libs than multisets.
What unexpected issues you mean? Sure, if someone uses multisets as if
they were sets (so ignoring the handling of duplicates), things will go
booom! quickly.
I imagined (if we ended up doing MULTISET) we'd provide interface (e.g.
operators) that'd allow perhaps help with this.
Despite SQL's multiset possibility, a distinct hashset type is my preference,
helping appropriate data structure choice and reducing misuse.The necessity of multisets is vague beyond standards compliance.
True - we haven't had any requests/proposal to implement MULTISETs.
I've looked at the SQL standard primarily to check if maybe there's some
precedent that'd give us guidance on the SQL syntax etc. And I think
multisets are that - even if we end up not implementing them, it'd be
sad to have unnecessarily inconsistent syntax (in case someone decides
to add multisets in the future).
We could invent "SET" data type, so while standard has ARRAY / MULTISET,
we'd have ARRAY / MULTISET / SET, and the difference between the last
two would be just handling of duplicates.
The other way to look at sets is that they are pretty similar to arrays,
except that there are no duplicates and order does not matter. Sure, the
on-disk format and code is different, but from the SQL perspective it'd
be nice to allow using sets in most places where arrays are allowed
(which is what the standard does for MULTISETS, more or less).
That'd mean we could probably search through gram.y for places working
with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
make them work with sets too, say by having SET_SUBLINK instead of
ARRAY_SUBLINK, set_expression instead of array_expression, etc.
This might be also "consistent" with defining hashset type using CREATE
TYPE with ELEMENT, because we consider the type to be "array". So that
would be polymorphic type, but we don't have pre-defined array for every
type (and I'm not sure we want to).
Of course, maybe there's some fatal flaw in these idea, I don't know.
And I don't want to move the goalposts too far - but it seems like this
might make some stuff actually simpler to implement (by piggy-backing on
the existing array infrastructure).
A mostly unrelated thought - I wonder if this might be somehow related
to the foreign key array patch ([1]/messages/by-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9=Gq3g@mail.gmail.com might be the most recent attempt in
this direction). Not to hashset itself, but I recalled these patches
because it'd mean we don't need the separate "edges" link table (so the
hashset column would be the think backing the FK).
[1]: /messages/by-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9=Gq3g@mail.gmail.com
/messages/by-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9=Gq3g@mail.gmail.com
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Andrew Dunstan <andrew@dunslane.net> writes:
Yes, Multisets (a.k.a. bags and a large number of other names) would be
interesting. But I wouldn't like to abandon pure sets either. Maybe a
typmod indicating the allowed multiplicity of the type?
I don't think trying to use typmod to carry fundamental semantic
information will work, because we drop it in too many places
(e.g. there's no way to pass it through a function). If you want
both sets and multisets, they'll need to be two different container
types, even if code is shared under the hood.
regards, tom lane
On Mon, Jun 19, 2023, at 14:59, Tomas Vondra wrote:
What unexpected issues you mean? Sure, if someone uses multisets as if
they were sets (so ignoring the handling of duplicates), things will go
booom! quickly.
The unexpected issues I had in mind are subtle bugs due to treating multisets
as sets, which could go undetected due to having no duplicates initially.
Multisets might initially therefore seem equal, but later diverge due to
different element counts, leading to hard-to-detect issues.
I imagined (if we ended up doing MULTISET) we'd provide interface (e.g.
operators) that'd allow perhaps help with this.
Might help. But still think providing both structures would be a more foolproof
solution, offering users the choice to select what's best for their use-case.
Despite SQL's multiset possibility, a distinct hashset type is my preference,
helping appropriate data structure choice and reducing misuse.The necessity of multisets is vague beyond standards compliance.
True - we haven't had any requests/proposal to implement MULTISETs.
I've looked at the SQL standard primarily to check if maybe there's some
precedent that'd give us guidance on the SQL syntax etc. And I think
multisets are that - even if we end up not implementing them, it'd be
sad to have unnecessarily inconsistent syntax (in case someone decides
to add multisets in the future).We could invent "SET" data type, so while standard has ARRAY / MULTISET,
we'd have ARRAY / MULTISET / SET, and the difference between the last
two would be just handling of duplicates.
Is the idea to use the "SET" keyword for the syntax?
Isn't it a risk that will be confusing, since "SET" is currently
only used for configuration and update operations?
The other way to look at sets is that they are pretty similar to arrays,
except that there are no duplicates and order does not matter. Sure, the
on-disk format and code is different, but from the SQL perspective it'd
be nice to allow using sets in most places where arrays are allowed
(which is what the standard does for MULTISETS, more or less).That'd mean we could probably search through gram.y for places working
with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
make them work with sets too, say by having SET_SUBLINK instead of
ARRAY_SUBLINK, set_expression instead of array_expression, etc.This might be also "consistent" with defining hashset type using CREATE
TYPE with ELEMENT, because we consider the type to be "array". So that
would be polymorphic type, but we don't have pre-defined array for every
type (and I'm not sure we want to).Of course, maybe there's some fatal flaw in these idea, I don't know.
And I don't want to move the goalposts too far - but it seems like this
might make some stuff actually simpler to implement (by piggy-backing on
the existing array infrastructure).
I think it's very interesting thoughts and ambitions.
I wonder though, from a user-perspective, if a new hashset type still
wouldn't just be considered simpler, than introducing new SQL syntax?
However, it would be interesting to see how the piggy-backing on the
existing array infrastructure would look in practise code-wise though.
I think it's still meaningful to continue hacking on the int4-type
hashset extension, to see if we can agree on the semantics,
especially around null handling and sorting.
A mostly unrelated thought - I wonder if this might be somehow related
to the foreign key array patch ([1] might be the most recent attempt in
this direction). Not to hashset itself, but I recalled these patches
because it'd mean we don't need the separate "edges" link table (so the
hashset column would be the think backing the FK).[1]
/messages/by-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9=Gq3g@mail.gmail.com
I remember that one! We tried to revive that one, but didn't manage to keep it alive.
It's a really good idea though. Good idea to see if there might be synergies
between arrays and hashsets in this area, since if we envision the elements in
a hashset mostly will be PKs, then it would be nice to enforce reference
integrity.
/Joel
On 6/20/23 00:50, Joel Jacobson wrote:
On Mon, Jun 19, 2023, at 14:59, Tomas Vondra wrote:
What unexpected issues you mean? Sure, if someone uses multisets as if
they were sets (so ignoring the handling of duplicates), things will go
booom! quickly.The unexpected issues I had in mind are subtle bugs due to treating multisets
as sets, which could go undetected due to having no duplicates initially.
Multisets might initially therefore seem equal, but later diverge due to
different element counts, leading to hard-to-detect issues.
Understood.
I imagined (if we ended up doing MULTISET) we'd provide interface (e.g.
operators) that'd allow perhaps help with this.Might help. But still think providing both structures would be a more foolproof
solution, offering users the choice to select what's best for their use-case.
Yeah. Not confusing people is better.
Despite SQL's multiset possibility, a distinct hashset type is my preference,
helping appropriate data structure choice and reducing misuse.The necessity of multisets is vague beyond standards compliance.
True - we haven't had any requests/proposal to implement MULTISETs.
I've looked at the SQL standard primarily to check if maybe there's some
precedent that'd give us guidance on the SQL syntax etc. And I think
multisets are that - even if we end up not implementing them, it'd be
sad to have unnecessarily inconsistent syntax (in case someone decides
to add multisets in the future).We could invent "SET" data type, so while standard has ARRAY / MULTISET,
we'd have ARRAY / MULTISET / SET, and the difference between the last
two would be just handling of duplicates.Is the idea to use the "SET" keyword for the syntax?
Isn't it a risk that will be confusing, since "SET" is currently
only used for configuration and update operations?
I haven't tried doing that, so not sure if there would be any conflicts
in the grammar. But I can't think of a case that'd be confusing for
users - when setting internal GUC variables it's a completely different
context, there's no use for SQL-level collections (arrays, sets, ...).
For UPDATE, it'd be pretty clear too, I think. It's possible to do
UPDATE table SET col = SET[1,2,3]
and it's clear the first is the command SET, while the second is a set
constructor. For SELECT there'd be conflict, and for ALTER TABLE it'd be
possible to do
ALTER TABLE table ALTER COLUMN col SET DEFAULT SET[1,2,3];
Seems clear to me too, I think.
The other way to look at sets is that they are pretty similar to arrays,
except that there are no duplicates and order does not matter. Sure, the
on-disk format and code is different, but from the SQL perspective it'd
be nice to allow using sets in most places where arrays are allowed
(which is what the standard does for MULTISETS, more or less).That'd mean we could probably search through gram.y for places working
with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
make them work with sets too, say by having SET_SUBLINK instead of
ARRAY_SUBLINK, set_expression instead of array_expression, etc.This might be also "consistent" with defining hashset type using CREATE
TYPE with ELEMENT, because we consider the type to be "array". So that
would be polymorphic type, but we don't have pre-defined array for every
type (and I'm not sure we want to).Of course, maybe there's some fatal flaw in these idea, I don't know.
And I don't want to move the goalposts too far - but it seems like this
might make some stuff actually simpler to implement (by piggy-backing on
the existing array infrastructure).I think it's very interesting thoughts and ambitions.
I wonder though, from a user-perspective, if a new hashset type still
wouldn't just be considered simpler, than introducing new SQL syntax?
It's a matter of personal taste, I guess. I'm fine with calling function
API and what not, but a sensible SQL syntax seems nicer.
However, it would be interesting to see how the piggy-backing on the
existing array infrastructure would look in practise code-wise though.I think it's still meaningful to continue hacking on the int4-type
hashset extension, to see if we can agree on the semantics,
especially around null handling and sorting.
Definitely. It certainly was not my intention to derail the work by
proposing more and more stuff. So feel free to pursue what makes sense
to you / helps the use case.
TBH I don't particularly see why we'd want to sort sets.
I wonder if the SQL standard says something about these things (for
MULTISETs), especially for the NULL handling. If it does, I'd try to
stick with those rules.
A mostly unrelated thought - I wonder if this might be somehow related
to the foreign key array patch ([1] might be the most recent attempt in
this direction). Not to hashset itself, but I recalled these patches
because it'd mean we don't need the separate "edges" link table (so the
hashset column would be the think backing the FK).[1]
/messages/by-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9=Gq3g@mail.gmail.comI remember that one! We tried to revive that one, but didn't manage to keep it alive.
It's a really good idea though. Good idea to see if there might be synergies
between arrays and hashsets in this area, since if we envision the elements in
a hashset mostly will be PKs, then it would be nice to enforce reference
integrity.
I haven't followed that at all, but I wonder how difficult would it be
to also support other collection types (like sets) and not just arrays.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 20, 2023, at 02:04, Tomas Vondra wrote:
For UPDATE, it'd be pretty clear too, I think. It's possible to do
UPDATE table SET col = SET[1,2,3]
and it's clear the first is the command SET, while the second is a set
constructor. For SELECT there'd be conflict, and for ALTER TABLE it'd be
possible to doALTER TABLE table ALTER COLUMN col SET DEFAULT SET[1,2,3];
Seems clear to me too, I think.
...
It's a matter of personal taste, I guess. I'm fine with calling function
API and what not, but a sensible SQL syntax seems nicer.
Now when I see it written out, I actually agree looks nice.
I think it's still meaningful to continue hacking on the int4-type
hashset extension, to see if we can agree on the semantics,
especially around null handling and sorting.Definitely. It certainly was not my intention to derail the work by
proposing more and more stuff. So feel free to pursue what makes sense
to you / helps the use case.
OK, cool, and didn't mean at all that you did. I appreciate the long-term
perspective, otherwise our short-term work might go wasted.
TBH I don't particularly see why we'd want to sort sets.
Me neither, sorting sets in the conventional, visually coherent sense
(i.e., lexicographically) doesn't seem necessary. However, for ORDER BY hashset
functionality, we need a we need a stable and deterministic method.
This can be achieved performance-efficiently by computing a commutative hash of
the hashset, XORing each new value's hash with set->hash:
set->hash ^= hash;
...and then sort primarily by set->hash.
Though resulting in an apparently random order, this approach, already employed
in int4hashset_add_element() and int4hashset_cmp(), ensures a deterministic and
stable sorting order.
I think this an acceptable trade-off, better than not supporting ORDER BY.
Jian He had some comments on hashset_cmp() which I will look at.
/Joel
On Mon, Jun 19, 2023, at 02:00, jian he wrote:
select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?
I agree, it should.
I've now changed all functions except int4hashset() (the init function)
and the aggregate functions to be STRICT.
I think this patch is OK to send as an incremental one, since it's an isolated change:
Apply STRICT to hashset functions; clean up null handling in hashset-api.c
Set hashset functions to be STRICT, thereby letting the system reject null
inputs automatically. This change reflects the nature of hashset as an
implementation of a set-theoretic system, where null values are conceptually
unusual.
Alongside, the hashset-api.c code has been refactored for clarity, consolidating
null checks and assignments into single lines.
A 'strict' test case has been added to account for these changes.
/Joel
Attachments:
hashset-0.0.1-1ee0df0.patchapplication/octet-stream; name=hashset-0.0.1-1ee0df0.patchDownload
commit 1ee0df053d139f98f69090672ecc33afb96e9818
Author: Joel Jakobsson <joel@compiler.org>
Date: Tue Jun 20 12:42:03 2023 +0200
Apply STRICT to hashset functions; clean up null handling in hashset-api.c
Set hashset functions to be STRICT, thereby letting the system reject null
inputs automatically. This change reflects the nature of hashset as an
implementation of a set-theoretic system, where null values are conceptually
unusual.
Alongside, the hashset-api.c code has been refactored for clarity, consolidating
null checks and assignments into single lines.
A 'strict' test case has been added to account for these changes.
diff --git a/Makefile b/Makefile
index 85e7691..cfb8362 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ SERVER_INCLUDES=-I$(shell pg_config --includedir-server)
CLIENT_INCLUDES=-I$(shell pg_config --includedir)
LIBRARY_PATH = -L$(shell pg_config --libdir)
-REGRESS = prelude basic io_varying_lengths random table invalid parsing reported_bugs
+REGRESS = prelude basic io_varying_lengths random table invalid parsing reported_bugs strict
REGRESS_OPTS = --inputdir=test
PG_CONFIG = pg_config
diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
index a155190..d48260f 100644
--- a/hashset--0.0.1.sql
+++ b/hashset--0.0.1.sql
@@ -50,42 +50,42 @@ LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION hashset_add(int4hashset, int)
RETURNS int4hashset
AS 'hashset', 'int4hashset_add'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int)
RETURNS bool
AS 'hashset', 'int4hashset_contains'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_merge(int4hashset, int4hashset)
RETURNS int4hashset
AS 'hashset', 'int4hashset_merge'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_to_array(int4hashset)
RETURNS int[]
AS 'hashset', 'int4hashset_to_array'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_count(int4hashset)
RETURNS bigint
AS 'hashset', 'int4hashset_count'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_capacity(int4hashset)
RETURNS bigint
AS 'hashset', 'int4hashset_capacity'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_collisions(int4hashset)
RETURNS bigint
AS 'hashset', 'int4hashset_collisions'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_max_collisions(int4hashset)
RETURNS bigint
AS 'hashset', 'int4hashset_max_collisions'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION int4_add_int4hashset(int4, int4hashset)
RETURNS int4hashset
@@ -96,17 +96,17 @@ IMMUTABLE PARALLEL SAFE STRICT COST 1;
CREATE OR REPLACE FUNCTION hashset_intersection(int4hashset, int4hashset)
RETURNS int4hashset
AS 'hashset', 'int4hashset_intersection'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_difference(int4hashset, int4hashset)
RETURNS int4hashset
AS 'hashset', 'int4hashset_difference'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_symmetric_difference(int4hashset, int4hashset)
RETURNS int4hashset
AS 'hashset', 'int4hashset_symmetric_difference'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
/*
* Aggregation Functions
@@ -120,7 +120,7 @@ LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
RETURNS int4hashset
AS 'hashset', 'int4hashset_agg_final'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
RETURNS internal
@@ -143,7 +143,7 @@ LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
RETURNS int4hashset
AS 'hashset', 'int4hashset_agg_final'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
RETURNS internal
diff --git a/hashset-api.c b/hashset-api.c
index 3feb06d..e00ef0c 100644
--- a/hashset-api.c
+++ b/hashset-api.c
@@ -318,33 +318,10 @@ int4hashset_recv(PG_FUNCTION_ARGS)
Datum
int4hashset_add(PG_FUNCTION_ARGS)
{
- int4hashset_t *set;
-
- if (PG_ARGISNULL(1))
- {
- if (PG_ARGISNULL(0))
- PG_RETURN_NULL();
-
- PG_RETURN_DATUM(PG_GETARG_DATUM(0));
- }
-
- /* if there's no hashset allocated, create it now */
- if (PG_ARGISNULL(0))
- {
- set = int4hashset_allocate(
- DEFAULT_INITIAL_CAPACITY,
- DEFAULT_LOAD_FACTOR,
- DEFAULT_GROWTH_FACTOR,
- DEFAULT_HASHFN_ID
- );
- }
- else
- {
- /* make sure we are working with a non-toasted and non-shared copy of the input */
- set = PG_GETARG_INT4HASHSET_COPY(0);
- }
-
- set = int4hashset_add_element(set, PG_GETARG_INT32(1));
+ int4hashset_t *set = int4hashset_add_element(
+ PG_GETARG_INT4HASHSET_COPY(0),
+ PG_GETARG_INT32(1)
+ );
PG_RETURN_POINTER(set);
}
@@ -352,14 +329,8 @@ int4hashset_add(PG_FUNCTION_ARGS)
Datum
int4hashset_contains(PG_FUNCTION_ARGS)
{
- int4hashset_t *set;
- int32 value;
-
- if (PG_ARGISNULL(1) || PG_ARGISNULL(0))
- PG_RETURN_BOOL(false);
-
- set = PG_GETARG_INT4HASHSET(0);
- value = PG_GETARG_INT32(1);
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ int32 value = PG_GETARG_INT32(1);
PG_RETURN_BOOL(int4hashset_contains_element(set, value));
}
@@ -367,12 +338,7 @@ int4hashset_contains(PG_FUNCTION_ARGS)
Datum
int4hashset_count(PG_FUNCTION_ARGS)
{
- int4hashset_t *set;
-
- if (PG_ARGISNULL(0))
- PG_RETURN_NULL();
-
- set = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
PG_RETURN_INT64(set->nelements);
}
@@ -381,25 +347,10 @@ Datum
int4hashset_merge(PG_FUNCTION_ARGS)
{
int i;
-
- int4hashset_t *seta;
- int4hashset_t *setb;
-
- char *bitmap;
- int32_t *values;
-
- if (PG_ARGISNULL(0) && PG_ARGISNULL(1))
- PG_RETURN_NULL();
- else if (PG_ARGISNULL(1))
- PG_RETURN_POINTER(PG_GETARG_INT4HASHSET(0));
- else if (PG_ARGISNULL(0))
- PG_RETURN_POINTER(PG_GETARG_INT4HASHSET(1));
-
- seta = PG_GETARG_INT4HASHSET_COPY(0);
- setb = PG_GETARG_INT4HASHSET(1);
-
- bitmap = setb->data;
- values = (int32 *) (setb->data + CEIL_DIV(setb->capacity, 8));
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET_COPY(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
+ char *bitmap = setb->data;
+ int32_t *values = (int32 *) (bitmap + CEIL_DIV(setb->capacity, 8));
for (i = 0; i < setb->capacity; i++)
{
@@ -416,11 +367,11 @@ int4hashset_merge(PG_FUNCTION_ARGS)
Datum
int4hashset_init(PG_FUNCTION_ARGS)
{
- int4hashset_t *set;
- int32 initial_capacity = PG_GETARG_INT32(0);
- float4 load_factor = PG_GETARG_FLOAT4(1);
- float4 growth_factor = PG_GETARG_FLOAT4(2);
- int32 hashfn_id = PG_GETARG_INT32(3);
+ int4hashset_t *set;
+ int32 initial_capacity = PG_GETARG_INT32(0);
+ float4 load_factor = PG_GETARG_FLOAT4(1);
+ float4 growth_factor = PG_GETARG_FLOAT4(2);
+ int32 hashfn_id = PG_GETARG_INT32(3);
/* Validate input arguments */
if (!(initial_capacity >= 0))
@@ -466,12 +417,7 @@ int4hashset_init(PG_FUNCTION_ARGS)
Datum
int4hashset_capacity(PG_FUNCTION_ARGS)
{
- int4hashset_t *set;
-
- if (PG_ARGISNULL(0))
- PG_RETURN_NULL();
-
- set = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
PG_RETURN_INT64(set->capacity);
}
@@ -479,12 +425,7 @@ int4hashset_capacity(PG_FUNCTION_ARGS)
Datum
int4hashset_collisions(PG_FUNCTION_ARGS)
{
- int4hashset_t *set;
-
- if (PG_ARGISNULL(0))
- PG_RETURN_NULL();
-
- set = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
PG_RETURN_INT64(set->ncollisions);
}
@@ -492,12 +433,7 @@ int4hashset_collisions(PG_FUNCTION_ARGS)
Datum
int4hashset_max_collisions(PG_FUNCTION_ARGS)
{
- int4hashset_t *set;
-
- if (PG_ARGISNULL(0))
- PG_RETURN_NULL();
-
- set = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
PG_RETURN_INT64(set->max_collisions);
}
@@ -505,11 +441,10 @@ int4hashset_max_collisions(PG_FUNCTION_ARGS)
Datum
int4hashset_agg_add(PG_FUNCTION_ARGS)
{
+ MemoryContext aggcontext;
MemoryContext oldcontext;
int4hashset_t *state;
- MemoryContext aggcontext;
-
/* cannot be called directly because of internal-type argument */
if (!AggCheckCallContext(fcinfo, &aggcontext))
elog(ERROR, "hashset_add_add called in non-aggregate context");
@@ -552,10 +487,9 @@ int4hashset_agg_add(PG_FUNCTION_ARGS)
Datum
int4hashset_agg_add_set(PG_FUNCTION_ARGS)
{
- MemoryContext oldcontext;
- int4hashset_t *state;
-
MemoryContext aggcontext;
+ MemoryContext oldcontext;
+ int4hashset_t *state;
/* cannot be called directly because of internal-type argument */
if (!AggCheckCallContext(fcinfo, &aggcontext))
@@ -620,9 +554,6 @@ int4hashset_agg_add_set(PG_FUNCTION_ARGS)
Datum
int4hashset_agg_final(PG_FUNCTION_ARGS)
{
- if (PG_ARGISNULL(0))
- PG_RETURN_NULL();
-
PG_RETURN_POINTER(PG_GETARG_POINTER(0));
}
@@ -634,7 +565,6 @@ int4hashset_agg_combine(PG_FUNCTION_ARGS)
int4hashset_t *dst;
MemoryContext aggcontext;
MemoryContext oldcontext;
-
char *bitmap;
int32 *values;
@@ -694,13 +624,9 @@ int4hashset_to_array(PG_FUNCTION_ARGS)
int4hashset_t *set;
int32 *values;
int nvalues;
-
char *sbitmap;
int32 *svalues;
- if (PG_ARGISNULL(0))
- PG_RETURN_NULL();
-
set = PG_GETARG_INT4HASHSET(0);
sbitmap = set->data;
@@ -728,12 +654,11 @@ int4hashset_to_array(PG_FUNCTION_ARGS)
Datum
int4hashset_equals(PG_FUNCTION_ARGS)
{
- int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
- int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
-
- char *bitmap_a;
- int32 *values_a;
- int i;
+ int i;
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ char *bitmap_a;
+ int32 *values_a;
/*
* Check if the number of elements is the same
@@ -794,9 +719,9 @@ Datum int4hashset_hash(PG_FUNCTION_ARGS)
Datum
int4hashset_lt(PG_FUNCTION_ARGS)
{
- int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
- int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
- int32 cmp;
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
PointerGetDatum(a),
@@ -809,9 +734,9 @@ int4hashset_lt(PG_FUNCTION_ARGS)
Datum
int4hashset_le(PG_FUNCTION_ARGS)
{
- int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
- int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
- int32 cmp;
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
PointerGetDatum(a),
@@ -824,9 +749,9 @@ int4hashset_le(PG_FUNCTION_ARGS)
Datum
int4hashset_gt(PG_FUNCTION_ARGS)
{
- int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
- int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
- int32 cmp;
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
PointerGetDatum(a),
@@ -839,9 +764,9 @@ int4hashset_gt(PG_FUNCTION_ARGS)
Datum
int4hashset_ge(PG_FUNCTION_ARGS)
{
- int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
- int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
- int32 cmp;
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
PointerGetDatum(a),
@@ -853,10 +778,10 @@ int4hashset_ge(PG_FUNCTION_ARGS)
Datum
int4hashset_cmp(PG_FUNCTION_ARGS)
{
- int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
- int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
- int32 *elements_a;
- int32 *elements_b;
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 *elements_a;
+ int32 *elements_b;
/*
* Compare the hashes first, if they are different,
@@ -914,17 +839,11 @@ Datum
int4hashset_intersection(PG_FUNCTION_ARGS)
{
int i;
- int4hashset_t *seta;
- int4hashset_t *setb;
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
+ char *bitmap = setb->data;
+ int32_t *values = (int32_t *)(bitmap + CEIL_DIV(setb->capacity, 8));
int4hashset_t *intersection;
- char *bitmap;
- int32_t *values;
-
- if (PG_ARGISNULL(0) || PG_ARGISNULL(1))
- PG_RETURN_NULL();
-
- seta = PG_GETARG_INT4HASHSET(0);
- setb = PG_GETARG_INT4HASHSET(1);
intersection = int4hashset_allocate(
seta->capacity,
@@ -933,9 +852,6 @@ int4hashset_intersection(PG_FUNCTION_ARGS)
DEFAULT_HASHFN_ID
);
- bitmap = setb->data;
- values = (int32_t *)(setb->data + CEIL_DIV(setb->capacity, 8));
-
for (i = 0; i < setb->capacity; i++)
{
int byte = (i / 8);
@@ -955,17 +871,11 @@ Datum
int4hashset_difference(PG_FUNCTION_ARGS)
{
int i;
- int4hashset_t *seta;
- int4hashset_t *setb;
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
int4hashset_t *difference;
- char *bitmap;
- int32_t *values;
-
- if (PG_ARGISNULL(0) || PG_ARGISNULL(1))
- PG_RETURN_NULL();
-
- seta = PG_GETARG_INT4HASHSET(0);
- setb = PG_GETARG_INT4HASHSET(1);
+ char *bitmap = seta->data;
+ int32_t *values = (int32_t *)(bitmap + CEIL_DIV(seta->capacity, 8));
difference = int4hashset_allocate(
seta->capacity,
@@ -974,9 +884,6 @@ int4hashset_difference(PG_FUNCTION_ARGS)
DEFAULT_HASHFN_ID
);
- bitmap = seta->data;
- values = (int32_t *)(seta->data + CEIL_DIV(seta->capacity, 8));
-
for (i = 0; i < seta->capacity; i++)
{
int byte = (i / 8);
@@ -996,27 +903,13 @@ Datum
int4hashset_symmetric_difference(PG_FUNCTION_ARGS)
{
int i;
- int4hashset_t *seta;
- int4hashset_t *setb;
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
int4hashset_t *result;
- char *bitmapa;
- char *bitmapb;
- int32_t *valuesa;
- int32_t *valuesb;
-
- if (PG_ARGISNULL(0) || PG_ARGISNULL(1))
- ereport(ERROR,
- (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
- errmsg("hashset arguments cannot be null")));
-
- seta = PG_GETARG_INT4HASHSET(0);
- setb = PG_GETARG_INT4HASHSET(1);
-
- bitmapa = seta->data;
- valuesa = (int32 *) (seta->data + CEIL_DIV(seta->capacity, 8));
-
- bitmapb = setb->data;
- valuesb = (int32 *) (setb->data + CEIL_DIV(setb->capacity, 8));
+ char *bitmapa = seta->data;
+ char *bitmapb = setb->data;
+ int32_t *valuesa = (int32 *) (bitmapa + CEIL_DIV(seta->capacity, 8));
+ int32_t *valuesb = (int32 *) (bitmapb + CEIL_DIV(setb->capacity, 8));
result = int4hashset_allocate(
seta->nelements + setb->nelements,
diff --git a/test/expected/basic.out b/test/expected/basic.out
index b89ab52..b5326f2 100644
--- a/test/expected/basic.out
+++ b/test/expected/basic.out
@@ -53,12 +53,6 @@ SELECT hashset_add(int4hashset(), 123);
{123}
(1 row)
-SELECT hashset_add(NULL::int4hashset, 123);
- hashset_add
--------------
- {123}
-(1 row)
-
SELECT hashset_add('{123}'::int4hashset, 456);
hashset_add
-------------
diff --git a/test/expected/strict.out b/test/expected/strict.out
new file mode 100644
index 0000000..4a9d904
--- /dev/null
+++ b/test/expected/strict.out
@@ -0,0 +1,114 @@
+/*
+ * Test to verify all relevant functions return NULL if any of
+ * the input parameters are NULL, i.e. testing that they are declared as STRICT
+ */
+SELECT hashset_add(int4hashset(), NULL::int);
+ hashset_add
+-------------
+
+(1 row)
+
+SELECT hashset_add(NULL::int4hashset, 123::int);
+ hashset_add
+-------------
+
+(1 row)
+
+SELECT hashset_contains('{123,456}'::int4hashset, NULL::int);
+ hashset_contains
+------------------
+
+(1 row)
+
+SELECT hashset_contains(NULL::int4hashset, 456::int);
+ hashset_contains
+------------------
+
+(1 row)
+
+SELECT hashset_merge('{1,2}'::int4hashset, NULL::int4hashset);
+ hashset_merge
+---------------
+
+(1 row)
+
+SELECT hashset_merge(NULL::int4hashset, '{2,3}'::int4hashset);
+ hashset_merge
+---------------
+
+(1 row)
+
+SELECT hashset_to_array(NULL::int4hashset);
+ hashset_to_array
+------------------
+
+(1 row)
+
+SELECT hashset_count(NULL::int4hashset);
+ hashset_count
+---------------
+
+(1 row)
+
+SELECT hashset_capacity(NULL::int4hashset);
+ hashset_capacity
+------------------
+
+(1 row)
+
+SELECT hashset_intersection('{1,2}'::int4hashset,NULL::int4hashset);
+ hashset_intersection
+----------------------
+
+(1 row)
+
+SELECT hashset_intersection(NULL::int4hashset,'{2,3}'::int4hashset);
+ hashset_intersection
+----------------------
+
+(1 row)
+
+SELECT hashset_difference('{1,2}'::int4hashset,NULL::int4hashset);
+ hashset_difference
+--------------------
+
+(1 row)
+
+SELECT hashset_difference(NULL::int4hashset,'{2,3}'::int4hashset);
+ hashset_difference
+--------------------
+
+(1 row)
+
+SELECT hashset_symmetric_difference('{1,2}'::int4hashset,NULL::int4hashset);
+ hashset_symmetric_difference
+------------------------------
+
+(1 row)
+
+SELECT hashset_symmetric_difference(NULL::int4hashset,'{2,3}'::int4hashset);
+ hashset_symmetric_difference
+------------------------------
+
+(1 row)
+
+/*
+ * For convenience, hashset_agg() is not STRICT and just ignore NULL values
+ */
+SELECT hashset_agg(i) FROM (VALUES (NULL::int),(1::int),(2::int)) q(i);
+ hashset_agg
+-------------
+ {1,2}
+(1 row)
+
+SELECT hashset_agg(h) FROM
+(
+ SELECT NULL::int4hashset AS h
+ UNION ALL
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
+) q;
+ hashset_agg
+--------------
+ {6,7,8,9,10}
+(1 row)
+
diff --git a/test/sql/basic.sql b/test/sql/basic.sql
index 8688666..061794c 100644
--- a/test/sql/basic.sql
+++ b/test/sql/basic.sql
@@ -20,7 +20,6 @@ SELECT int4hashset(
hashfn_id := 1
);
SELECT hashset_add(int4hashset(), 123);
-SELECT hashset_add(NULL::int4hashset, 123);
SELECT hashset_add('{123}'::int4hashset, 456);
SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
diff --git a/test/sql/strict.sql b/test/sql/strict.sql
new file mode 100644
index 0000000..d0f33bd
--- /dev/null
+++ b/test/sql/strict.sql
@@ -0,0 +1,32 @@
+/*
+ * Test to verify all relevant functions return NULL if any of
+ * the input parameters are NULL, i.e. testing that they are declared as STRICT
+ */
+
+SELECT hashset_add(int4hashset(), NULL::int);
+SELECT hashset_add(NULL::int4hashset, 123::int);
+SELECT hashset_contains('{123,456}'::int4hashset, NULL::int);
+SELECT hashset_contains(NULL::int4hashset, 456::int);
+SELECT hashset_merge('{1,2}'::int4hashset, NULL::int4hashset);
+SELECT hashset_merge(NULL::int4hashset, '{2,3}'::int4hashset);
+SELECT hashset_to_array(NULL::int4hashset);
+SELECT hashset_count(NULL::int4hashset);
+SELECT hashset_capacity(NULL::int4hashset);
+SELECT hashset_intersection('{1,2}'::int4hashset,NULL::int4hashset);
+SELECT hashset_intersection(NULL::int4hashset,'{2,3}'::int4hashset);
+SELECT hashset_difference('{1,2}'::int4hashset,NULL::int4hashset);
+SELECT hashset_difference(NULL::int4hashset,'{2,3}'::int4hashset);
+SELECT hashset_symmetric_difference('{1,2}'::int4hashset,NULL::int4hashset);
+SELECT hashset_symmetric_difference(NULL::int4hashset,'{2,3}'::int4hashset);
+
+/*
+ * For convenience, hashset_agg() is not STRICT and just ignore NULL values
+ */
+SELECT hashset_agg(i) FROM (VALUES (NULL::int),(1::int),(2::int)) q(i);
+
+SELECT hashset_agg(h) FROM
+(
+ SELECT NULL::int4hashset AS h
+ UNION ALL
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
+) q;
On 6/20/23 12:59, Joel Jacobson wrote:
On Mon, Jun 19, 2023, at 02:00, jian he wrote:
select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?I agree, it should.
I've now changed all functions except int4hashset() (the init function)
and the aggregate functions to be STRICT.
I don't think this is correct / consistent with what we do elsewhere.
IMHO it's perfectly fine to have a hashset containing a NULL value,
because then it can affect results of membership checks.
Consider these IN / ANY queries:
test=# select 4 in (1,2,3);
?column?
----------
f
(1 row)
test=# select 4 = ANY(ARRAY[1,2,3]);
?column?
----------
f
(1 row)
now add a NULL:
test=# select 4 in (1,2,3,null);
?column?
----------
(1 row)
test=# select 4 = ANY(ARRAY[1,2,3,NULL]);
?column?
----------
(1 row)
I don't see why a (hash)set should behave any differently. It's true
arrays don't behave like this:
test=# select array[1,2,3,4,NULL] @> ARRAY[5];
?column?
----------
f
(1 row)
but I'd say that's more an anomaly than something we should replicate.
This is also what the SQL standard does for multisets - there's SQL:20nn
draft at http://www.wiscorp.com/SQLStandards.html, and the <member
predicate> section (p. 475) explains how this should work with NULL.
So if we see a set as a special case of multiset (with no duplicates),
then we have to handle NULLs this way too. It'd be weird to have this
behavior inconsistent.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/20/23 14:10, Tomas Vondra wrote:
...
This is also what the SQL standard does for multisets - there's SQL:20nn
draft at http://www.wiscorp.com/SQLStandards.html, and the <member
predicate> section (p. 475) explains how this should work with NULL.
BTW I just notices there's also a multiset proposal at the wiscorp page:
http://www.wiscorp.com/sqlmultisets.zip
It's just the initial proposal and I'm not sure how much it changed over
time, but it likely provide way more context for the choices than the
(rather dry) SQL standard.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
On 6/20/23 12:59, Joel Jacobson wrote:
On Mon, Jun 19, 2023, at 02:00, jian he wrote:
select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?I agree, it should.
I've now changed all functions except int4hashset() (the init function)
and the aggregate functions to be STRICT.I don't think this is correct / consistent with what we do elsewhere.
IMHO it's perfectly fine to have a hashset containing a NULL value,
The reference to consistency with what we do elsewhere might not be entirely
applicable in this context, since the set feature we're designing is a new beast
in the SQL landscape.
I think adhering to the theoretical purity of sets by excluding NULLs aligns us
with set theory, simplifies our code, and parallels set implementations in other
languages.
I think we have an opportunity here to innovate and potentially influence a
future set concept in the SQL standard.
However, I see how one could argue against this reasoning, on the basis that
PostgreSQL users might be more familiar with and expect NULLs can exist
everywhere in all data structures.
A different perspective is to look at what use-cases we can foresee.
I've been trying hard, but I can't find compelling use-cases where a NULL element
in a set would offer a more natural SQL query than handling NULLs within SQL and
keeping the set NULL-free.
Does anyone else have a strong realistic example where including NULLs in the
set would simplify the SQL query?
/Joel
On Tue, Jun 20, 2023, at 16:56, Joel Jacobson wrote:
I think we have an opportunity here to innovate and potentially influence a
future set concept in the SQL standard.
Adding to my previous note - If there's a worry about future SQL standards
introducing SETs with NULLs, causing compatibility issues, we could address it
proactively. We could set up set functions to throw errors when passed NULL
inputs, rather than being STRICT. This keeps our theoretical alignment now, and
offers a smooth transition if standards evolve.
Considering we have a flag field in the struct, we could use it to indicate
whether a value stored on disk was written with NULL support or not.
/Joel
On 6/20/23 16:56, Joel Jacobson wrote:
On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
On 6/20/23 12:59, Joel Jacobson wrote:
On Mon, Jun 19, 2023, at 02:00, jian he wrote:
select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?I agree, it should.
I've now changed all functions except int4hashset() (the init function)
and the aggregate functions to be STRICT.I don't think this is correct / consistent with what we do elsewhere.
IMHO it's perfectly fine to have a hashset containing a NULL value,The reference to consistency with what we do elsewhere might not be entirely
applicable in this context, since the set feature we're designing is a new beast
in the SQL landscape.
I don't see how it's new, considering relational algebra is pretty much
based on (multi)sets, and the three-valued logic with NULL values is
pretty well established part of that.
I think adhering to the theoretical purity of sets by excluding NULLs aligns us
with set theory, simplifies our code, and parallels set implementations in other
languages.
I don't see how that would be more theoretically pure, really. The
three-valued logic is a well established part of relational algebra, so
not respecting that is more a violation of the purity.
I think we have an opportunity here to innovate and potentially influence a
future set concept in the SQL standard.
I doubt this going to influence what the SQL standard says, especially
because it already defined the behavior for MULTISETS (of which the sets
are a special case, pretty much). So this has 0% chance of success.
However, I see how one could argue against this reasoning, on the basis that
PostgreSQL users might be more familiar with and expect NULLs can exist
everywhere in all data structures.
Right, it's what we already do for similar cases, and if you have NULLS
in the data, you better be aware of the behavior. Granted, some people
are surprised by three-valued logic, but using a different behavior for
some new features would just increase the confusion.
A different perspective is to look at what use-cases we can foresee.
I've been trying hard, but I can't find compelling use-cases where a NULL element
in a set would offer a more natural SQL query than handling NULLs within SQL and
keeping the set NULL-free.
IMO if you have NULL values in the data, you better be aware of it and
handle the case accordingly (e.g. by filtering them out when building
the set). If you don't have NULLs in the data, there's no issue.
And in the graph case, I don't see why you'd have any NULLs, considering
we're dealing with adjacent nodes, and if there's adjacent node, it's ID
is not NULL.
Does anyone else have a strong realistic example where including NULLs in the
set would simplify the SQL query?
I'm sure there are cases where you have NULLs in the dat aand need to
filter them out, but that's just natural consequence of having NULLs. If
you have them you better know what NULLs do ...
It's too early to make any strong statements, but it's going to be hard
to convince me we should handle NULLs differently from what we already
do elsewhere.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 21, 2023 at 12:25 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 6/20/23 16:56, Joel Jacobson wrote:
On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
On 6/20/23 12:59, Joel Jacobson wrote:
On Mon, Jun 19, 2023, at 02:00, jian he wrote:
select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?I agree, it should.
I've now changed all functions except int4hashset() (the init function)
and the aggregate functions to be STRICT.I don't think this is correct / consistent with what we do elsewhere.
IMHO it's perfectly fine to have a hashset containing a NULL value,The reference to consistency with what we do elsewhere might not be entirely
applicable in this context, since the set feature we're designing is a new beast
in the SQL landscape.I don't see how it's new, considering relational algebra is pretty much
based on (multi)sets, and the three-valued logic with NULL values is
pretty well established part of that.I think adhering to the theoretical purity of sets by excluding NULLs aligns us
with set theory, simplifies our code, and parallels set implementations in other
languages.I don't see how that would be more theoretically pure, really. The
three-valued logic is a well established part of relational algebra, so
not respecting that is more a violation of the purity.I think we have an opportunity here to innovate and potentially influence a
future set concept in the SQL standard.I doubt this going to influence what the SQL standard says, especially
because it already defined the behavior for MULTISETS (of which the sets
are a special case, pretty much). So this has 0% chance of success.However, I see how one could argue against this reasoning, on the basis that
PostgreSQL users might be more familiar with and expect NULLs can exist
everywhere in all data structures.Right, it's what we already do for similar cases, and if you have NULLS
in the data, you better be aware of the behavior. Granted, some people
are surprised by three-valued logic, but using a different behavior for
some new features would just increase the confusion.A different perspective is to look at what use-cases we can foresee.
I've been trying hard, but I can't find compelling use-cases where a NULL element
in a set would offer a more natural SQL query than handling NULLs within SQL and
keeping the set NULL-free.IMO if you have NULL values in the data, you better be aware of it and
handle the case accordingly (e.g. by filtering them out when building
the set). If you don't have NULLs in the data, there's no issue.And in the graph case, I don't see why you'd have any NULLs, considering
we're dealing with adjacent nodes, and if there's adjacent node, it's ID
is not NULL.Does anyone else have a strong realistic example where including NULLs in the
set would simplify the SQL query?I'm sure there are cases where you have NULLs in the dat aand need to
filter them out, but that's just natural consequence of having NULLs. If
you have them you better know what NULLs do ...It's too early to make any strong statements, but it's going to be hard
to convince me we should handle NULLs differently from what we already
do elsewhere.regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Conceptually, a multiset is an unordered collection of elements, all of the same type, with dupli-
cates permitted. Unlike arrays, a multiset is an unbounded collection, with no declared maximum
cardinality. This does not mean that the user can insert elements in a multiset without limit, just
that the standard does not mandate that there should be a limit. This is analogous to tables, which
have no declared maximum number of rows.
Postgres arrays don't have size limits.
unordered means no need to use subscript?
So multiset is a more limited array type?
null is fine. but personally I feel like so far the hashset main
feature is the quickly aggregate unique value using hashset.
I found using hashset count distinct (non null values) is quite faster.
On 6/20/23 20:08, jian he wrote:
On Wed, Jun 21, 2023 at 12:25 AM Tomas Vondra
...Conceptually, a multiset is an unordered collection of elements, all of the same type, with dupli-
cates permitted. Unlike arrays, a multiset is an unbounded collection, with no declared maximum
cardinality. This does not mean that the user can insert elements in a multiset without limit, just
that the standard does not mandate that there should be a limit. This is analogous to tables, which
have no declared maximum number of rows.Postgres arrays don't have size limits.
Right. You can say int[5] but we don't enforce that limit (I haven't
checked why, but presumably because we had arrays before the standard
existed, and it was more like a list in LISP or something.)
unordered means no need to use subscript?
Yeah - there's no obvious way to subscript the items when there's no
implicit ordering.
So multiset is a more limited array type?
Yes and no - both are collection types, so there are similarities and
differences. Multiset does not need to keep the ordering, so in this
sense it's a relaxed version of array.
null is fine. but personally I feel like so far the hashset main
feature is the quickly aggregate unique value using hashset.
I found using hashset count distinct (non null values) is quite faster.
True. That's related to fast membership checks.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 20, 2023, at 18:25, Tomas Vondra wrote:
On 6/20/23 16:56, Joel Jacobson wrote:
The reference to consistency with what we do elsewhere might not be entirely
applicable in this context, since the set feature we're designing is a new beast
in the SQL landscape.I don't see how it's new, considering relational algebra is pretty much
based on (multi)sets, and the three-valued logic with NULL values is
pretty well established part of that.
What I meant was that the SET-feature is new; since it doesn't exist in PostgreSQL nor SQL.
I think adhering to the theoretical purity of sets by excluding NULLs aligns us
with set theory, simplifies our code, and parallels set implementations in other
languages.I don't see how that would be more theoretically pure, really. The
three-valued logic is a well established part of relational algebra, so
not respecting that is more a violation of the purity.
Hmm, I think it's pure in different ways;
Set Theory is well established and is based on two-values logic,
but at the same time SQL's three-valued logic is also well established.
I think we have an opportunity here to innovate and potentially influence a
future set concept in the SQL standard.I doubt this going to influence what the SQL standard says, especially
because it already defined the behavior for MULTISETS (of which the sets
are a special case, pretty much). So this has 0% chance of success.
OK. 0% is 1% too low for me to work with. :)
However, I see how one could argue against this reasoning, on the basis that
PostgreSQL users might be more familiar with and expect NULLs can exist
everywhere in all data structures.Right, it's what we already do for similar cases, and if you have NULLS
in the data, you better be aware of the behavior. Granted, some people
are surprised by three-valued logic, but using a different behavior for
some new features would just increase the confusion.
Good point.
I've been trying hard, but I can't find compelling use-cases where a NULL element
in a set would offer a more natural SQL query than handling NULLs within SQL and
keeping the set NULL-free.IMO if you have NULL values in the data, you better be aware of it and
handle the case accordingly (e.g. by filtering them out when building
the set). If you don't have NULLs in the data, there's no issue.
As long as the data model and queries would ensure there can never be
any NULLs, fine, then there's is no issue.
And in the graph case, I don't see why you'd have any NULLs, considering
we're dealing with adjacent nodes, and if there's adjacent node, it's ID
is not NULL.
Me neither, can't see the need for any NULLs there.
Does anyone else have a strong realistic example where including NULLs in the
set would simplify the SQL query?I'm sure there are cases where you have NULLs in the dat aand need to
filter them out, but that's just natural consequence of having NULLs. If
you have them you better know what NULLs do ...
What I tried to find was an example for was when you wouldn't want to
filter out the NULLs, when you would want to include the NULL
in the set.
If we could just find one should realistic use-case, that would be very
helpful, since it would then kill my argument completely that we couldn't
do without storing a NULL in the set.
It's too early to make any strong statements, but it's going to be hard
to convince me we should handle NULLs differently from what we already
do elsewhere.
I think it's a trade-off, and I don't have any strong preference for the simplicity
of a classical two-valued set-theoretic system vs a three-valued
multiset-based one. I was 51/49 but given your feedback I'm now 49/51.
I think the next step is to think about how the hashset type should work
with three-valued logic, and then implement it to get a feeling for it.
For instance, how should hashset_count() work?
Given the query,
SELECT hashset_count('{1,2,3,null}'::int4hashset);
Should we,
a) threat NULL as a distinct value and return 4?
b) ignore NULL and return 3?
c) return NULL? (since the presence of NULL can be thought to render the entire count indeterminate)
I think my personal preference is (b) since it is then consistent with how COUNT() works.
/Joel
On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
This is also what the SQL standard does for multisets - there's SQL:20nn
draft at http://www.wiscorp.com/SQLStandards.html, and the <member
predicate> section (p. 475) explains how this should work with NULL.
I've looked again at the paper you mentioned and found something intriguing
in section 2.6 (b). I'm a bit puzzled about this: why would we want to return
null when we're certain it's not null but just doesn't have any elements?
In the same vein, it says, "If it has more than one element, an exception is
raised." Makes sense to me, but what about when there are no elements at all?
Why not raise an exception in that case too?
The ELEMENT function is designed to do one simple thing: return the element of
a multiset if the multiset has only 1 element. This seems very similar to how
our INTO STRICT operates, right?
The SQL:20nn seems to still be in draft form, and I can't help but wonder if we
should propose a bit of an improvement here:
"If it doesn't have exactly one element, an exception is raised."
Meaning, it would raise an exception both if there are more elements,
or zero elements (no elements).
I think this would make the semantics more intuitive and less surprising.
/Joel
On 6/22/23 19:52, Joel Jacobson wrote:
On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
This is also what the SQL standard does for multisets - there's SQL:20nn
draft at http://www.wiscorp.com/SQLStandards.html, and the <member
predicate> section (p. 475) explains how this should work with NULL.I've looked again at the paper you mentioned and found something intriguing
in section 2.6 (b). I'm a bit puzzled about this: why would we want to return
null when we're certain it's not null but just doesn't have any elements?In the same vein, it says, "If it has more than one element, an exception is
raised." Makes sense to me, but what about when there are no elements at all?
Why not raise an exception in that case too?The ELEMENT function is designed to do one simple thing: return the element of
a multiset if the multiset has only 1 element. This seems very similar to how
our INTO STRICT operates, right?
I agree this looks a bit weird, but that's what I mentioned - this is an
initial a proposal, outlining the idea. Inevitably some of the stuff
will get reworked or just left out of the final version. It's useful
mostly to explain the motivation / goal.
I believe that's the case here - I don't think the ELEMENT got into the
standard at all, and the NULL rules for the MEMBER OF clause seem not to
have these strange bits.
The SQL:20nn seems to still be in draft form, and I can't help but wonder if we
should propose a bit of an improvement here:"If it doesn't have exactly one element, an exception is raised."
Meaning, it would raise an exception both if there are more elements,
or zero elements (no elements).I think this would make the semantics more intuitive and less surprising.
Well, the simple truth is the draft is freely available, but you'd need
to buy the final version. It doesn't mean it's still being worked on or
that no SQL standard was released since then. In fact, SQL 2023 was
released a couple weeks ago [1]https://www.iso.org/standard/76584.html.
It'd be interesting to know the version that actually got into the SQL
standard (if at all), but I don't have access to the standard yet.
regards
[1]: https://www.iso.org/standard/76584.html
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I played around array_func.c
many of the code can be used for multiset data type.
now I imagine multiset as something like one dimension array. (nested is
somehow beyond the imagination...).
* A standard varlena array has the following internal structure:
* <vl_len_> - standard varlena header word
* <ndim> - number of dimensions of the array
* <dataoffset> - offset to stored data, or 0 if no nulls bitmap
* <elemtype> - element type OID
* <dimensions> - length of each array axis (C array of int)
* <lower bnds> - lower boundary of each dimension (C array of int)
* <null bitmap> - bitmap showing locations of nulls (OPTIONAL)
* <actual data> - whatever is the stored data
in set/multiset, we don't need {ndim,lower bnds}, since we are only one
dimension, also we don't need subscript.
So for set we can have following
* int32 vl_len_; /* varlena header (do not touch directly!) */
* int32 capacity; /* # of capacity */
* int32 dataoffset; /* offset to data, or 0 if no bitmap */
* int32 nelements; /* number of items added to the hashset */
* Oid elemtype; /* element type OID */
* <null bitmap> - bitmap showing locations of nulls (OPTIONAL)
* <bitmap> - bitmap showing this slot is empty or not ( I am not sure
this part)
* <actual data> - whatever is the stored data
many of the code in array_func.c can be reused.
array_isspace ==> set_isspace
ArrayMetaState ==> SetMetastate
ArrayCount ==> SetCount (similar to ArrayCount return the dimension
of set, should be zero (empty set) or one)
ArrayParseState ==> SetParseState
ReadArrayStr ==> ReadSetStr
attached is a demo shows that use array_func.c to parse cstring. have
similar effect of array_in.
for multiset_in set type input function. if no duplicate required then
multiset_in would just like array, so more code can be copied from
array_func.c
but if unique required then we need first palloc0(capacity * datums (size
per type)) then put valid value into to a specific slot?
On Fri, Jun 23, 2023 at 6:27 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
On 6/22/23 19:52, Joel Jacobson wrote:
On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
This is also what the SQL standard does for multisets - there's SQL:20nn
draft at http://www.wiscorp.com/SQLStandards.html, and the <member
predicate> section (p. 475) explains how this should work with NULL.I've looked again at the paper you mentioned and found something
intriguing
in section 2.6 (b). I'm a bit puzzled about this: why would we want to
return
null when we're certain it's not null but just doesn't have any elements?
In the same vein, it says, "If it has more than one element, an
exception is
raised." Makes sense to me, but what about when there are no elements at
all?
Why not raise an exception in that case too?
The ELEMENT function is designed to do one simple thing: return the
element of
a multiset if the multiset has only 1 element. This seems very similar
to how
our INTO STRICT operates, right?
I agree this looks a bit weird, but that's what I mentioned - this is an
initial a proposal, outlining the idea. Inevitably some of the stuff
will get reworked or just left out of the final version. It's useful
mostly to explain the motivation / goal.I believe that's the case here - I don't think the ELEMENT got into the
standard at all, and the NULL rules for the MEMBER OF clause seem not to
have these strange bits.The SQL:20nn seems to still be in draft form, and I can't help but
wonder if we
should propose a bit of an improvement here:
"If it doesn't have exactly one element, an exception is raised."
Meaning, it would raise an exception both if there are more elements,
or zero elements (no elements).I think this would make the semantics more intuitive and less surprising.
Well, the simple truth is the draft is freely available, but you'd need
to buy the final version. It doesn't mean it's still being worked on or
that no SQL standard was released since then. In fact, SQL 2023 was
released a couple weeks ago [1].It'd be interesting to know the version that actually got into the SQL
standard (if at all), but I don't have access to the standard yet.regards
[1] https://www.iso.org/standard/76584.html
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
I recommend David Deutsch's <<The Beginning of Infinity>>
Jian
Attachments:
On Fri, Jun 23, 2023, at 08:40, jian he wrote:
I played around array_func.c
many of the code can be used for multiset data type.
now I imagine multiset as something like one dimension array. (nested
is somehow beyond the imagination...).
Are you suggesting it might be a better idea to start over completely
and work on a new code base that is based on arrayfuncs.c,
and aim for MULTISET/SET or anyhashset from start, that would not
only support int4/int8/uuid but any type?
/Joel
On Fri, Jun 23, 2023 at 4:23 PM Joel Jacobson <joel@compiler.org> wrote:
On Fri, Jun 23, 2023, at 08:40, jian he wrote:
I played around array_func.c
many of the code can be used for multiset data type.
now I imagine multiset as something like one dimension array. (nested
is somehow beyond the imagination...).Are you suggesting it might be a better idea to start over completely
and work on a new code base that is based on arrayfuncs.c,
and aim for MULTISET/SET or anyhashset from start, that would not
only support int4/int8/uuid but any type?/Joel
select prosrc from pg_proc where proname ~*
'(hash.*extended)|(extended.*hash)';
return around 30 rows.
so it's a bit generic?
I tend to think set/multiset as a one dimension array, so the textual input
should be like a one dimension array.
use array_func.c functions to parse and validate the input.
So different types, one input validation function.
Does this make sense?
On 2023-06-23 Fr 04:23, Joel Jacobson wrote:
On Fri, Jun 23, 2023, at 08:40, jian he wrote:
I played around array_func.c
many of the code can be used for multiset data type.
now I imagine multiset as something like one dimension array. (nested
is somehow beyond the imagination...).Are you suggesting it might be a better idea to start over completely
and work on a new code base that is based on arrayfuncs.c,
and aim for MULTISET/SET or anyhashset from start, that would not
only support int4/int8/uuid but any type?
Before we run too far down this rabbit hole, let's discuss the storage
implications of using multisets. ISTM that for small base datums like
integers it will be a substantial increase in size, since you'll need an
addition int for the item count, unless some very clever tricks are played.
As for this older discussion referred to upthread, if the SQL Standards
Committee hasn't acted on it by now it seem reasonable to think they are
unlikely to.
Just for reference, Here's some description of Oracle's suport for
Multisets from
<https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/Oracle-Support-for-Optional-Features-of-SQLFoundation2011.html#GUID-3BA98AEC-FAAD-4F21-A6AD-F696B5D36D56>:
Multisets in the standard are supported as nested table types in
Oracle. The Oracle nested table data type based on a scalar type ST is
equivalent, in standard terminology, to a multiset of rows having a
single field of type ST and named column_value. The Oracle nested
table type based on an object type is equivalent to a multiset of
structured type in the standard.Oracle supports the following elements of this feature on nested
tables using the same syntax as the standard has for multisets:The CARDINALITY function
The SET function
The MEMBER predicate
The IS A SET predicate
The COLLECT aggregate
All other aspects of this feature are supported with non-standard
syntax, as follows:To create an empty multiset, denoted MULTISET[] in the standard,
use an empty constructor of the nested table type.To obtain the sole element of a multiset with one element, denoted
ELEMENT (<multiset value expression>) in the standard, use a scalar
subquery to select the single element from the nested table.To construct a multiset by enumeration, use the constructor of the
nested table type.To construct a multiset by query, use CAST with a multiset
argument, casting to the nested table type.To unnest a multiset, use the TABLE operator in the FROM clause.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On 6/23/23 13:47, Andrew Dunstan wrote:
On 2023-06-23 Fr 04:23, Joel Jacobson wrote:
On Fri, Jun 23, 2023, at 08:40, jian he wrote:
I played around array_func.c
many of the code can be used for multiset data type.
now I imagine multiset as something like one dimension array. (nested
is somehow beyond the imagination...).Are you suggesting it might be a better idea to start over completely
and work on a new code base that is based on arrayfuncs.c,
and aim for MULTISET/SET or anyhashset from start, that would not
only support int4/int8/uuid but any type?Before we run too far down this rabbit hole, let's discuss the storage
implications of using multisets. ISTM that for small base datums like
integers it will be a substantial increase in size, since you'll need an
addition int for the item count, unless some very clever tricks are played.
I honestly don't quite understand what exactly is meant by the proposal
to "reuse array_func.c for multisets". We're implementing sets, not
multisets (those were mentioned only to illustrate behavior). And the
whole point is that sets are not arrays - no duplicates, ordering does
not matter (so no index).
I mentioned that maybe we can model sets based on arrays (say, gram.y
would do similar stuff for SET[] and ARRAY[], polymorphism), not that we
should store sets as arrays. Would it be possible - maybe, if we extend
arrays to also maintain some hash hash table. But I'd bet that'll just
make arrays more complex, and will make sets slower.
Or maybe I just don't understand the proposal. Perhaps it'd be best if
jian wrote a patch illustrating the idea, and showing how it performs
compared to the current approach.
As for the storage size, I don't think an extra "count" field would make
any measurable difference. If we're storing a hash table, we're bound to
have a couple percent of wasted space due to load factor (likely between
0.75 and 0.9).
As for this older discussion referred to upthread, if the SQL Standards
Committee hasn't acted on it by now it seem reasonable to think they are
unlikely to.
AFAIK multisets are included in SQL 2023, pretty much matching the draft
we discussed earlier. Yeah, it's unlikely to change in the future.
Just for reference, Here's some description of Oracle's suport for
Multisets from
<https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/Oracle-Support-for-Optional-Features-of-SQLFoundation2011.html#GUID-3BA98AEC-FAAD-4F21-A6AD-F696B5D36D56>:
good to know
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 22, 2023, at 07:51, Joel Jacobson wrote:
For instance, how should hashset_count() work?
Given the query,
SELECT hashset_count('{1,2,3,null}'::int4hashset);
Should we,
a) threat NULL as a distinct value and return 4?
b) ignore NULL and return 3?
c) return NULL? (since the presence of NULL can be thought to render
the entire count indeterminate)I think my personal preference is (b) since it is then consistent with
how COUNT() works.
Having thought a bit more on this matter,
I think it's better to remove hashset_count() since the semantics are not obvious,
and instead provide a hashset_cardinality() function, that would obviously
include a possible null value in the number of elements:
SELECT hashset_cardinality('{1,2,3,null}'::int4hashset);
4
SELECT hashset_cardinality('{null}'::int4hashset);
1
SELECT hashset_cardinality('{null,null}'::int4hashset);
1
SELECT hashset_cardinality('{}'::int4hashset);
0
SELECT hashset_cardinality(NULL::int4hashset);
NULL
Sounds good?
/Joel
New version of int4hashset_contains() that should follow the same
General Rules as MULTISET's MEMBER OF (8.16 <member predicate>).
The first rule is to return False if the cardinality is 0 (zero).
However, we must first check if the first argument is null,
in which case the cardinality cannot be 0 (zero),
so if the first argument is null then we return Unknown
(represented as null).
We then proceed and check if the set is empty,
which is defined as nelements being 0 (zero)
as well as the new null_element field being false.
If the set is empty, then we always return False,
regardless of the second argument, that is,
even if it would be null we would still return False,
since the set is empty and can therefore not contain
any element.
The second rule is to return Unknown (represented as null)
if any of the arguments are null. We've already checked that
the first argument is not null, so now we check the second
argument, and return Unknown (represented as null) if it is null.
The third rule is to check for the element, and return True if
the set contains the element. Otherwise, if the set contains
the null element, we don't know if the element we're checking
for is in the set, so we then return Unknown (represented as null).
Finally, if the set doesn't contain the null element and nor the
element we're checking for, then we return False.
Datum
int4hashset_contains(PG_FUNCTION_ARGS)
{
int4hashset_t *set;
int32 value;
bool result;
if (PG_ARGISNULL(0))
PG_RETURN_NULL();
set = PG_GETARG_INT4HASHSET(0);
if (set->nelements == 0 && !set->null_element)
PG_RETURN_BOOL(false);
if (PG_ARGISNULL(1))
PG_RETURN_NULL();
value = PG_GETARG_INT32(1);
result = int4hashset_contains_element(set, value);
if (!result && set->null_element)
PG_RETURN_NULL();
PG_RETURN_BOOL(result);
}
Example queries and expected results:
SELECT hashset_contains(NULL::int4hashset, NULL::int); -- null
SELECT hashset_contains(NULL::int4hashset, 1::int); -- null
SELECT hashset_contains('{}'::int4hashset, NULL::int); -- false
SELECT hashset_contains('{}'::int4hashset, 1::int); -- false
SELECT hashset_contains('{null}'::int4hashset, NULL::int); -- null
SELECT hashset_contains('{null}'::int4hashset, 1::int); -- null
SELECT hashset_contains('{1}'::int4hashset, NULL::int); -- null
SELECT hashset_contains('{1}'::int4hashset, 1::int); -- true
SELECT hashset_contains('{1}'::int4hashset, 2::int); -- false
Looks good?
/Joel
On Sat, Jun 24, 2023, at 21:16, Joel Jacobson wrote:
New version of int4hashset_contains() that should follow the same
General Rules as MULTISET's MEMBER OF (8.16 <member predicate>).
...
SELECT hashset_contains('{}'::int4hashset, NULL::int); -- false
...
SELECT hashset_contains('{null}'::int4hashset, NULL::int); -- null
When it comes to SQL, the general rule of thumb is that expressions and functions
handling null usually return the null value. This is why it might feel a bit out
of the ordinary to return False when checking if an empty set contains NULL.
However, that's my understanding of the General Rules on page 553 of
ISO/IEC 9075-2:2023(E). Rule 3 Case a) specifically states:
"If N is 0 (zero), then the <member predicate> is False.",
where N is the cardinality, and for an empty set, that's 0 (zero).
Rule 3 Case b) goes on to say:
"If at least one of XV and MV is the null value, then the
<member predicate> is Unknown."
But since b) follows a), and the condition for a) already matches, b) is out of
the running. This leads me to believe that the result of:
SELECT hashset_contains('{}'::int4hashset, NULL::int);
would be False, according to the General Rules.
Now, this is based on the assumption that the Case conditions are evaluated in
sequence, stopping at the first match. Does that assumption hold water?
Applying the same rules, we'd have to return Unknown (which we represent as
null) for:
SELECT hashset_contains('{null}'::int4hashset, NULL::int);
Here, since the cardinality N is 1, Case a) doesn't apply, but Case b) does
since XV is null.
Looking ahead, we're entertaining the possibility of a future SET SQL-syntax
feature and wondering how our hashset type could be adapted to be compatible and
reusable for such a development. It's a common prediction that any future SET
syntax feature would probably operate on Three-Valued Logic. Therefore, it's key
for our hashset to handle null values, whether storing, identifying, or adding
them.
But here's my two cents, and remember it's just a personal viewpoint. I'm not so
sure that the hashset type functions need to mirror the corresponding MULTISET
language constructs exactly. In my book, our hashset catalog functions could
take a more clear-cut route with null handling, as long as our data structure is
prepared to handle null values.
Think about this possibility:
hashset_contains_null(int4hashset) -> boolean
hashset_add_null(int4hashset) -> int4hashset
hashset_contains(..., NULL) -> ERROR
hashset_add(..., NULL) -> ERROR
In my mind, this explicit null handling could simplify things, clear up any
potential confusion, and at the same time pave the way for compatibility with
any future SET SQL-syntax feature.
Thoughts?
/Joel
Or maybe I just don't understand the proposal. Perhaps it'd be best if
jian wrote a patch illustrating the idea, and showing how it performs
compared to the current approach.
currently joel's idea is a int4hashset. based on the code first tomas wrote.
it looks like a non-nested an collection of unique int4. external text
format looks like {int4, int4,int4}
structure looks like (header + capacity slots * int4).
Within the capacity slots, some slots are empty, some have unique values.
The textual int4hashset looks like a one dimensional array.
so I copied/imitated src/backend/utils/adt/arrayfuncs.c code, rewrote a
slight generic hashset input and output function.
see the attached c file.
It works fine for non-null input output for {int4hashset, int8hashset,
timestamphashset,intervalhashset,uuidhashset).
Attachments:
On 6/25/23 15:32, jian he wrote:
Or maybe I just don't understand the proposal. Perhaps it'd be best if
jian wrote a patch illustrating the idea, and showing how it performs
compared to the current approach.currently joel's idea is a int4hashset. based on the code first tomas wrote.
it looks like a non-nested an collection of unique int4. external text
format looks like {int4, int4,int4}
structure looks like (header + capacity slots * int4).
Within the capacity slots, some slots are empty, some have unique values.The textual int4hashset looks like a one dimensional array.
so I copied/imitated src/backend/utils/adt/arrayfuncs.c code, rewrote a
slight generic hashset input and output function.see the attached c file.
It works fine for non-null input output for {int4hashset, int8hashset,
timestamphashset,intervalhashset,uuidhashset).
So how do you define a table with a "set" column? I mean, with the
original patch we could have done
CREATE TABLE (a int4hashset);
and then store / query this. How do you do that with this approach?
I've looked at the patch only very briefly - it's really difficult to
grok such patches - large, with half the comments possibly obsolete etc.
So what does reusing the array code give us, really?
I'm not against reusing some of the array code, but arrays seem to be
much more elaborate (multiple dimensions, ...) so the code needs to do
significantly more stuff in various cases.
When I previously suggested that maybe we should get "inspiration" from
the array code, I was mostly talking about (a) type polymorphism, i.e.
doing sets for arbitrary types, and (b) integrating this into grammar
(instead of using functions).
I don't see how copying arrayfuncs.c like this achieves either of these
things. It still hardcodes just a handful of selected data types, and
the array polymorphism relies on automatic creation of array type for
every scalar type.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jun 25, 2023, at 11:42, Joel Jacobson wrote:
SELECT hashset_contains('{}'::int4hashset, NULL::int);
would be False, according to the General Rules.
...
Applying the same rules, we'd have to return Unknown (which we represent as
null) for:SELECT hashset_contains('{null}'::int4hashset, NULL::int);
Aha! I just discovered to my surprise that the corresponding array
queries gives the same result:
SELECT NULL = ANY(ARRAY[]::int[]);
?column?
----------
f
(1 row)
SELECT NULL = ANY(ARRAY[NULL]::int[]);
?column?
----------
(1 row)
I have no more objections; let's stick to the same null semantics as arrays and multisets.
/Joel
On Mon, Jun 26, 2023 at 2:56 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
On 6/25/23 15:32, jian he wrote:
Or maybe I just don't understand the proposal. Perhaps it'd be best if
jian wrote a patch illustrating the idea, and showing how it performs
compared to the current approach.currently joel's idea is a int4hashset. based on the code first tomas
wrote.
it looks like a non-nested an collection of unique int4. external text
format looks like {int4, int4,int4}
structure looks like (header + capacity slots * int4).
Within the capacity slots, some slots are empty, some have unique
values.
The textual int4hashset looks like a one dimensional array.
so I copied/imitated src/backend/utils/adt/arrayfuncs.c code, rewrote a
slight generic hashset input and output function.see the attached c file.
It works fine for non-null input output for {int4hashset, int8hashset,
timestamphashset,intervalhashset,uuidhashset).So how do you define a table with a "set" column? I mean, with the
original patch we could have doneCREATE TABLE (a int4hashset);
and then store / query this. How do you do that with this approach?
I've looked at the patch only very briefly - it's really difficult to
grok such patches - large, with half the comments possibly obsolete etc.
So what does reusing the array code give us, really?I'm not against reusing some of the array code, but arrays seem to be
much more elaborate (multiple dimensions, ...) so the code needs to do
significantly more stuff in various cases.When I previously suggested that maybe we should get "inspiration" from
the array code, I was mostly talking about (a) type polymorphism, i.e.
doing sets for arbitrary types, and (b) integrating this into grammar
(instead of using functions).I don't see how copying arrayfuncs.c like this achieves either of these
things. It still hardcodes just a handful of selected data types, and
the array polymorphism relies on automatic creation of array type for
every scalar type.regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
You are right.
I misread sql-createtype.html about type input_function that can take 3
arguments (cstring, oid, integer) part.
I thought while creating data types, I can pass different params to the
input_function.
On Mon, Jun 26, 2023 at 4:36 AM Joel Jacobson <joel@compiler.org> wrote:
On Sun, Jun 25, 2023, at 11:42, Joel Jacobson wrote:
SELECT hashset_contains('{}'::int4hashset, NULL::int);
would be False, according to the General Rules.
...
Applying the same rules, we'd have to return Unknown (which we
represent as
null) for:
SELECT hashset_contains('{null}'::int4hashset, NULL::int);
Aha! I just discovered to my surprise that the corresponding array
queries gives the same result:SELECT NULL = ANY(ARRAY[]::int[]);
?column?
----------
f
(1 row)SELECT NULL = ANY(ARRAY[NULL]::int[]);
?column?
----------(1 row)
I have no more objections; let's stick to the same null semantics as
arrays and multisets.
/Joel
Can you try to glue the attached to the hashset data type input function.
the attached will parse cstring with double quote and not. so '{1,2,3}' ==
'{"1","2","3"}'. obviously quote will preserve the inner string as is.
currently int4hashset input is delimited by comma, if you want deal with
range then you need escape the comma.
Attachments:
On Mon, Jun 26, 2023, at 13:06, jian he wrote:
Can you try to glue the attached to the hashset data type input
function.
the attached will parse cstring with double quote and not. so '{1,2,3}'
== '{"1","2","3"}'. obviously quote will preserve the inner string as
is.
currently int4hashset input is delimited by comma, if you want deal
with range then you need escape the comma.
Not sure what you're trying to do here; what's the problem with
the current int4hashset_in()?
I think it might be best to focus on null semantics / three-valued logic
before moving on and trying to implement support for more types,
otherwise we would need to rewrite more code if we find general
thinkos that are problems in all types.
Help wanted to reason about what the following queries should return:
SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);
SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);
SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);
SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);
Should they return NULL, the empty set or something else?
I've renamed hashset_merge() -> hashset_union() to better match
SQL's MULTISET feature which has a MULTISET UNION.
/Joel
On Mon, Jun 26, 2023 at 4:55 PM Joel Jacobson <joel@compiler.org> wrote:
On Mon, Jun 26, 2023, at 13:06, jian he wrote:
Can you try to glue the attached to the hashset data type input
function.
the attached will parse cstring with double quote and not. so '{1,2,3}'
== '{"1","2","3"}'. obviously quote will preserve the inner string as
is.
currently int4hashset input is delimited by comma, if you want deal
with range then you need escape the comma.Not sure what you're trying to do here; what's the problem with
the current int4hashset_in()?I think it might be best to focus on null semantics / three-valued logic
before moving on and trying to implement support for more types,
otherwise we would need to rewrite more code if we find general
thinkos that are problems in all types.Help wanted to reason about what the following queries should return:
SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);
SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);
SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);
SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);
Should they return NULL, the empty set or something else?
I've renamed hashset_merge() -> hashset_union() to better match
SQL's MULTISET feature which has a MULTISET UNION.
Shouldn't they return the same thing that left(NULL::text,1) returns?
(NULL)...
Typically any operation on NULL is NULL.
Kirk...
On Tue, Jun 27, 2023 at 4:55 AM Joel Jacobson <joel@compiler.org> wrote:
On Mon, Jun 26, 2023, at 13:06, jian he wrote:
Can you try to glue the attached to the hashset data type input
function.
the attached will parse cstring with double quote and not. so '{1,2,3}'
== '{"1","2","3"}'. obviously quote will preserve the inner string as
is.
currently int4hashset input is delimited by comma, if you want deal
with range then you need escape the comma.Not sure what you're trying to do here; what's the problem with
the current int4hashset_in()?I think it might be best to focus on null semantics / three-valued logic
before moving on and trying to implement support for more types,
otherwise we would need to rewrite more code if we find general
thinkos that are problems in all types.Help wanted to reason about what the following queries should return:
SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);
SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);
SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);
SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);
Should they return NULL, the empty set or something else?
I've renamed hashset_merge() -> hashset_union() to better match
SQL's MULTISET feature which has a MULTISET UNION./Joel
in SQLMultiSets.pdf(previously thread) I found a related explanation
on page 45, 46.
(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) INTERSECT SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END)
CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) UNION SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END
(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) EXCEPT SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END)
In page11,
Unlike the corresponding table operators UNION, INTERSECT and EXCEPT, we have chosen ALL as the default, since this is the most natural interpretation of MULTISET UNION, etc
also in page 11 aggregate name FUSION. (I like the name.................)
On Tue, Jun 27, 2023, at 04:35, jian he wrote:
in SQLMultiSets.pdf(previously thread) I found a related explanation
on page 45, 46.(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) INTERSECT SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END)CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) UNION SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) EXCEPT SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END)
Thanks! This was exactly what I was looking for, I knew I've seen it but failed to find it.
Attached is a new incremental patch as well as a full patch, since this is a substantial change:
Align null semantics with SQL:2023 array and multiset standards
* Introduced a new boolean field, null_element, in the int4hashset_t type.
* Rename hashset_count() to hashset_cardinality().
* Rename hashset_merge() to hashset_union().
* Rename hashset_equals() to hashset_eq().
* Rename hashset_neq() to hashset_ne().
* Add hashset_to_sorted_array().
* Handle null semantics to work as in arrays and multisets.
* Update int4hashset_add() to allow creating a new set if none exists.
* Use more portable int32 typedef instead of int32_t.
This also adds a thorough test suite in array-and-multiset-semantics.sql,
which aims to test all relevant combinations of operations and values.
Makefile | 2 +-
README.md | 6 ++--
hashset--0.0.1.sql | 37 +++++++++++---------
hashset-api.c | 208 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------
hashset.c | 12 ++++++-
hashset.h | 11 +++---
test/expected/array-and-multiset-semantics.out | 365 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
test/expected/basic.out | 12 +++----
test/expected/reported_bugs.out | 6 ++--
test/expected/strict.out | 114 ------------------------------------------------------------
test/expected/table.out | 8 ++---
test/sql/array-and-multiset-semantics.sql | 232 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
test/sql/basic.sql | 4 +--
test/sql/benchmark.sql | 14 ++++----
test/sql/reported_bugs.sql | 6 ++--
test/sql/strict.sql | 32 -----------------
test/sql/table.sql | 2 +-
17 files changed, 823 insertions(+), 248 deletions(-)
/Joel
Attachments:
hashset-0.0.1-b7e5614-full.patchapplication/octet-stream; name=hashset-0.0.1-b7e5614-full.patchDownload
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..91f216e
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,8 @@
+.deps/
+results/
+**/*.o
+**/*.so
+regression.diffs
+regression.out
+.vscode
+test/c_tests/test_send_recv
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..908853d
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,16 @@
+Copyright (c) 2019, Tomas Vondra (tomas.vondra@postgresql.org).
+
+Permission to use, copy, modify, and distribute this software and its documentation
+for any purpose, without fee, and without a written agreement is hereby granted,
+provided that the above copyright notice and this paragraph and the following two
+paragraphs appear in all copies.
+
+IN NO EVENT SHALL $ORGANISATION BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL,
+INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE
+OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF TOMAS VONDRA HAS BEEN ADVISED OF
+THE POSSIBILITY OF SUCH DAMAGE.
+
+TOMAS VONDRA SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
+SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND $ORGANISATION HAS NO
+OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..ee62511
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,33 @@
+MODULE_big = hashset
+OBJS = hashset.o hashset-api.o
+
+EXTENSION = hashset
+DATA = hashset--0.0.1.sql
+MODULES = hashset
+
+# Keep the CFLAGS separate
+SERVER_INCLUDES=-I$(shell pg_config --includedir-server)
+CLIENT_INCLUDES=-I$(shell pg_config --includedir)
+LIBRARY_PATH = -L$(shell pg_config --libdir)
+
+REGRESS = prelude basic io_varying_lengths random table invalid parsing reported_bugs array-and-multiset-semantics
+REGRESS_OPTS = --inputdir=test
+
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+
+C_TESTS_DIR = test/c_tests
+
+EXTRA_CLEAN = $(C_TESTS_DIR)/test_send_recv
+
+c_tests: $(C_TESTS_DIR)/test_send_recv
+
+$(C_TESTS_DIR)/test_send_recv: $(C_TESTS_DIR)/test_send_recv.c
+ $(CC) $(SERVER_INCLUDES) $(CLIENT_INCLUDES) -o $@ $< $(LIBRARY_PATH) -lpq
+
+run_c_tests: c_tests
+ cd $(C_TESTS_DIR) && ./test_send_recv.sh
+
+check: all $(REGRESS_PREP) run_c_tests
+
+include $(PGXS)
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..91af6ee
--- /dev/null
+++ b/README.md
@@ -0,0 +1,156 @@
+# hashset
+
+This PostgreSQL extension implements hashset, a data structure (type)
+providing a collection of unique, not null integer items with fast lookup.
+
+
+## Version
+
+0.0.1
+
+🚧 **NOTICE** 🚧 This repository is currently under active development and the hashset
+PostgreSQL extension is **not production-ready**. As the codebase is evolving
+with possible breaking changes, we are not providing any migration scripts
+until we reach our first release.
+
+
+## Usage
+
+After installing the extension, you can use the `int4hashset` data type and
+associated functions within your PostgreSQL queries.
+
+To demonstrate the usage, let's consider a hypothetical table `users` which has
+a `user_id` and a `user_likes` of type `int4hashset`.
+
+Firstly, let's create the table:
+
+```sql
+CREATE TABLE users(
+ user_id int PRIMARY KEY,
+ user_likes int4hashset DEFAULT int4hashset()
+);
+```
+In the above statement, the `int4hashset()` initializes an empty hashset
+with zero capacity. The hashset will automatically resize itself when more
+elements are added.
+
+Now, we can perform operations on this table. Here are some examples:
+
+```sql
+-- Insert a new user with id 1. The user_likes will automatically be initialized
+-- as an empty hashset
+INSERT INTO users (user_id) VALUES (1);
+
+-- Add elements (likes) for a user
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+
+-- Check if a user likes a particular item
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1; -- true
+
+-- Count the number of likes a user has
+SELECT hashset_cardinality(user_likes) FROM users WHERE user_id = 1; -- 2
+```
+
+You can also use the aggregate functions to perform operations on multiple rows.
+
+
+## Data types
+
+- **int4hashset**: This data type represents a set of integers. Internally, it uses
+a combination of a bitmap and a value array to store the elements in a set. It's
+a variable-length type.
+
+
+## Functions
+
+- `int4hashset([capacity int, load_factor float4, growth_factor float4, hashfn_id int4]) -> int4hashset`:
+ Initialize an empty int4hashset with optional parameters.
+ - `capacity` specifies the initial capacity, which is zero by default.
+ - `load_factor` represents the threshold for resizing the hashset and defaults to 0.75.
+ - `growth_factor` is the multiplier for resizing and defaults to 2.0.
+ - `hashfn_id` represents the hash function used.
+ - 1=Jenkins/lookup3 (default)
+ - 2=MurmurHash32
+ - 3=Naive hash function
+- `hashset_add(int4hashset, int) -> int4hashset`: Adds an integer to an int4hashset.
+- `hashset_contains(int4hashset, int) -> boolean`: Checks if an int4hashset contains a given integer.
+- `hashset_union(int4hashset, int4hashset) -> int4hashset`: Merges two int4hashsets into a new int4hashset.
+- `hashset_to_array(int4hashset) -> int[]`: Converts an int4hashset to an array of integers.
+- `hashset_cardinality(int4hashset) -> bigint`: Returns the number of elements in an int4hashset.
+- `hashset_capacity(int4hashset) -> bigint`: Returns the current capacity of an int4hashset.
+- `hashset_max_collisions(int4hashset) -> bigint`: Returns the maximum number of collisions that have occurred for a single element
+- `hashset_intersection(int4hashset, int4hashset) -> int4hashset`: Returns a new int4hashset that is the intersection of the two input sets.
+- `hashset_difference(int4hashset, int4hashset) -> int4hashset`: Returns a new int4hashset that contains the elements present in the first set but not in the second set.
+- `hashset_symmetric_difference(int4hashset, int4hashset) -> int4hashset`: Returns a new int4hashset containing elements that are in either of the input sets, but not in their intersection.
+
+## Aggregation Functions
+
+- `hashset_agg(int) -> int4hashset`: Aggregate integers into a hashset.
+- `hashset_agg(int4hashset) -> int4hashset`: Aggregate hashsets into a hashset.
+
+
+## Operators
+
+- Equality (`=`): Checks if two hashsets are equal.
+- Inequality (`<>`): Checks if two hashsets are not equal.
+
+
+## Hashset Hash Operators
+
+- `hashset_hash(int4hashset) -> integer`: Returns the hash value of an int4hashset.
+
+
+## Hashset Btree Operators
+
+- `<`, `<=`, `>`, `>=`: Comparison operators for hashsets.
+
+
+## Limitations
+
+- The `int4hashset` data type currently supports integers within the range of int4
+(-2147483648 to 2147483647).
+
+
+## Installation
+
+To install the extension on any platform, follow these general steps:
+
+1. Ensure you have PostgreSQL installed on your system, including the development files.
+2. Clone the repository.
+3. Navigate to the cloned repository directory.
+4. Compile the extension using `make`.
+5. Install the extension using `sudo make install`.
+6. Run the tests using `make installcheck` (optional).
+
+To use a different PostgreSQL installation, point configure to a different `pg_config`, using following command:
+```sh
+make PG_CONFIG=/else/where/pg_config
+sudo make install PG_CONFIG=/else/where/pg_config
+```
+
+In your PostgreSQL connection, enable the hashset extension using the following SQL command:
+```sql
+CREATE EXTENSION hashset;
+```
+
+This extension requires PostgreSQL version ?.? or later.
+
+For Ubuntu 22.04.1 LTS, you would run the following commands:
+
+```sh
+sudo apt install postgresql-15 postgresql-server-dev-15 postgresql-client-15
+git clone https://github.com/tvondra/hashset.git
+cd hashset
+make
+sudo make install
+make installcheck
+```
+
+Please note that this project is currently under active development and is not yet considered production-ready.
+
+## License
+
+This software is distributed under the terms of PostgreSQL license.
+See LICENSE or http://www.opensource.org/licenses/bsd-license.php for
+more details.
diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
new file mode 100644
index 0000000..d0478ce
--- /dev/null
+++ b/hashset--0.0.1.sql
@@ -0,0 +1,303 @@
+/*
+ * Hashset Type Definition
+ */
+
+CREATE TYPE int4hashset;
+
+CREATE OR REPLACE FUNCTION int4hashset_in(cstring)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_in'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_out(int4hashset)
+RETURNS cstring
+AS 'hashset', 'int4hashset_out'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_send(int4hashset)
+RETURNS bytea
+AS 'hashset', 'int4hashset_send'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_recv(internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_recv'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE TYPE int4hashset (
+ INPUT = int4hashset_in,
+ OUTPUT = int4hashset_out,
+ RECEIVE = int4hashset_recv,
+ SEND = int4hashset_send,
+ INTERNALLENGTH = variable,
+ STORAGE = external
+);
+
+/*
+ * Hashset Functions
+ */
+
+CREATE OR REPLACE FUNCTION int4hashset(
+ capacity int DEFAULT 0,
+ load_factor float4 DEFAULT 0.75,
+ growth_factor float4 DEFAULT 2.0,
+ hashfn_id int DEFAULT 1
+)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_init'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_add(int4hashset, int)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_add'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int)
+RETURNS boolean
+AS 'hashset', 'int4hashset_contains'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_union(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_union'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_to_array(int4hashset)
+RETURNS int[]
+AS 'hashset', 'int4hashset_to_array'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_to_sorted_array(int4hashset)
+RETURNS int[]
+AS 'hashset', 'int4hashset_to_sorted_array'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_cardinality(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_cardinality'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_capacity(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_capacity'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_collisions(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_collisions'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_max_collisions(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_max_collisions'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4_add_int4hashset(int4, int4hashset)
+RETURNS int4hashset
+AS $$SELECT $2 || $1$$
+LANGUAGE SQL
+IMMUTABLE PARALLEL SAFE STRICT COST 1;
+
+CREATE OR REPLACE FUNCTION hashset_intersection(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_intersection'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_difference(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_difference'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_symmetric_difference(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_symmetric_difference'
+LANGUAGE C IMMUTABLE STRICT;
+
+/*
+ * Aggregation Functions
+ */
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_add(p_pointer internal, p_value int)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_add'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_agg_final'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_combine'
+LANGUAGE C IMMUTABLE;
+
+CREATE AGGREGATE hashset_agg(int) (
+ SFUNC = int4hashset_agg_add,
+ STYPE = internal,
+ FINALFUNC = int4hashset_agg_final,
+ COMBINEFUNC = int4hashset_agg_combine,
+ PARALLEL = SAFE
+);
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_add_set(p_pointer internal, p_value int4hashset)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_add_set'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_agg_final'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_combine'
+LANGUAGE C IMMUTABLE;
+
+CREATE AGGREGATE hashset_agg(int4hashset) (
+ SFUNC = int4hashset_agg_add_set,
+ STYPE = internal,
+ FINALFUNC = int4hashset_agg_final,
+ COMBINEFUNC = int4hashset_agg_combine,
+ PARALLEL = SAFE
+);
+
+/*
+ * Operator Definitions
+ */
+
+CREATE OR REPLACE FUNCTION hashset_eq(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_eq'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR = (
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ PROCEDURE = hashset_eq,
+ COMMUTATOR = =,
+ HASHES
+);
+
+CREATE OR REPLACE FUNCTION hashset_ne(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_ne'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR <> (
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ PROCEDURE = hashset_ne,
+ COMMUTATOR = '<>',
+ NEGATOR = '=',
+ RESTRICT = neqsel,
+ JOIN = neqjoinsel,
+ HASHES
+);
+
+CREATE OPERATOR || (
+ leftarg = int4hashset,
+ rightarg = int4,
+ function = hashset_add,
+ commutator = ||
+);
+
+CREATE OPERATOR || (
+ leftarg = int4,
+ rightarg = int4hashset,
+ function = int4_add_int4hashset,
+ commutator = ||
+);
+
+/*
+ * Hashset Hash Operators
+ */
+
+CREATE OR REPLACE FUNCTION hashset_hash(int4hashset)
+RETURNS integer
+AS 'hashset', 'int4hashset_hash'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR CLASS int4hashset_hash_ops
+DEFAULT FOR TYPE int4hashset USING hash AS
+OPERATOR 1 = (int4hashset, int4hashset),
+FUNCTION 1 hashset_hash(int4hashset);
+
+/*
+ * Hashset Btree Operators
+ */
+
+CREATE OR REPLACE FUNCTION hashset_lt(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_lt'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_le(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_le'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_gt(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_gt'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_ge(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_ge'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_cmp(int4hashset, int4hashset)
+RETURNS integer
+AS 'hashset', 'int4hashset_cmp'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR < (
+ PROCEDURE = hashset_lt,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = >,
+ NEGATOR = >=,
+ RESTRICT = scalarltsel,
+ JOIN = scalarltjoinsel
+);
+
+CREATE OPERATOR <= (
+ PROCEDURE = hashset_le,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '>=',
+ NEGATOR = '>',
+ RESTRICT = scalarltsel,
+ JOIN = scalarltjoinsel
+);
+
+CREATE OPERATOR > (
+ PROCEDURE = hashset_gt,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '<',
+ NEGATOR = '<=',
+ RESTRICT = scalargtsel,
+ JOIN = scalargtjoinsel
+);
+
+CREATE OPERATOR >= (
+ PROCEDURE = hashset_ge,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '<=',
+ NEGATOR = '<',
+ RESTRICT = scalargtsel,
+ JOIN = scalargtjoinsel
+);
+
+CREATE OPERATOR CLASS int4hashset_btree_ops
+DEFAULT FOR TYPE int4hashset USING btree AS
+OPERATOR 1 < (int4hashset, int4hashset),
+OPERATOR 2 <= (int4hashset, int4hashset),
+OPERATOR 3 = (int4hashset, int4hashset),
+OPERATOR 4 >= (int4hashset, int4hashset),
+OPERATOR 5 > (int4hashset, int4hashset),
+FUNCTION 1 hashset_cmp(int4hashset, int4hashset);
diff --git a/hashset-api.c b/hashset-api.c
new file mode 100644
index 0000000..a4beef4
--- /dev/null
+++ b/hashset-api.c
@@ -0,0 +1,1058 @@
+#include "hashset.h"
+
+#include <stdio.h>
+#include <math.h>
+#include <string.h>
+#include <sys/time.h>
+#include <unistd.h>
+#include <limits.h>
+
+#define PG_GETARG_INT4HASHSET(x) (int4hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(x))
+#define PG_GETARG_INT4HASHSET_COPY(x) (int4hashset_t *) PG_DETOAST_DATUM_COPY(PG_GETARG_DATUM(x))
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(int4hashset_in);
+PG_FUNCTION_INFO_V1(int4hashset_out);
+PG_FUNCTION_INFO_V1(int4hashset_send);
+PG_FUNCTION_INFO_V1(int4hashset_recv);
+PG_FUNCTION_INFO_V1(int4hashset_add);
+PG_FUNCTION_INFO_V1(int4hashset_contains);
+PG_FUNCTION_INFO_V1(int4hashset_cardinality);
+PG_FUNCTION_INFO_V1(int4hashset_union);
+PG_FUNCTION_INFO_V1(int4hashset_init);
+PG_FUNCTION_INFO_V1(int4hashset_capacity);
+PG_FUNCTION_INFO_V1(int4hashset_collisions);
+PG_FUNCTION_INFO_V1(int4hashset_max_collisions);
+PG_FUNCTION_INFO_V1(int4hashset_agg_add);
+PG_FUNCTION_INFO_V1(int4hashset_agg_add_set);
+PG_FUNCTION_INFO_V1(int4hashset_agg_final);
+PG_FUNCTION_INFO_V1(int4hashset_agg_combine);
+PG_FUNCTION_INFO_V1(int4hashset_to_array);
+PG_FUNCTION_INFO_V1(int4hashset_to_sorted_array);
+PG_FUNCTION_INFO_V1(int4hashset_eq);
+PG_FUNCTION_INFO_V1(int4hashset_ne);
+PG_FUNCTION_INFO_V1(int4hashset_hash);
+PG_FUNCTION_INFO_V1(int4hashset_lt);
+PG_FUNCTION_INFO_V1(int4hashset_le);
+PG_FUNCTION_INFO_V1(int4hashset_gt);
+PG_FUNCTION_INFO_V1(int4hashset_ge);
+PG_FUNCTION_INFO_V1(int4hashset_cmp);
+PG_FUNCTION_INFO_V1(int4hashset_intersection);
+PG_FUNCTION_INFO_V1(int4hashset_difference);
+PG_FUNCTION_INFO_V1(int4hashset_symmetric_difference);
+
+Datum int4hashset_in(PG_FUNCTION_ARGS);
+Datum int4hashset_out(PG_FUNCTION_ARGS);
+Datum int4hashset_send(PG_FUNCTION_ARGS);
+Datum int4hashset_recv(PG_FUNCTION_ARGS);
+Datum int4hashset_add(PG_FUNCTION_ARGS);
+Datum int4hashset_contains(PG_FUNCTION_ARGS);
+Datum int4hashset_cardinality(PG_FUNCTION_ARGS);
+Datum int4hashset_union(PG_FUNCTION_ARGS);
+Datum int4hashset_init(PG_FUNCTION_ARGS);
+Datum int4hashset_capacity(PG_FUNCTION_ARGS);
+Datum int4hashset_collisions(PG_FUNCTION_ARGS);
+Datum int4hashset_max_collisions(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_add(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_add_set(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_final(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_combine(PG_FUNCTION_ARGS);
+Datum int4hashset_to_array(PG_FUNCTION_ARGS);
+Datum int4hashset_to_sorted_array(PG_FUNCTION_ARGS);
+Datum int4hashset_eq(PG_FUNCTION_ARGS);
+Datum int4hashset_ne(PG_FUNCTION_ARGS);
+Datum int4hashset_hash(PG_FUNCTION_ARGS);
+Datum int4hashset_lt(PG_FUNCTION_ARGS);
+Datum int4hashset_le(PG_FUNCTION_ARGS);
+Datum int4hashset_gt(PG_FUNCTION_ARGS);
+Datum int4hashset_ge(PG_FUNCTION_ARGS);
+Datum int4hashset_cmp(PG_FUNCTION_ARGS);
+Datum int4hashset_intersection(PG_FUNCTION_ARGS);
+Datum int4hashset_difference(PG_FUNCTION_ARGS);
+Datum int4hashset_symmetric_difference(PG_FUNCTION_ARGS);
+
+Datum
+int4hashset_in(PG_FUNCTION_ARGS)
+{
+ char *str = PG_GETARG_CSTRING(0);
+ char *endptr;
+ int32 len = strlen(str);
+ int4hashset_t *set;
+ int64 value;
+
+ /* Skip initial spaces */
+ while (hashset_isspace(*str)) str++;
+
+ /* Check the opening brace */
+ if (*str != '{')
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for hashset: \"%s\"", str),
+ errdetail("Hashset representation must start with \"{\".")));
+ }
+
+ /* Start parsing from the first number (after the opening brace) */
+ str++;
+
+ /* Initial size based on input length (arbitrary, could be optimized) */
+ set = int4hashset_allocate(
+ len/2,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ while (true)
+ {
+ /* Skip spaces before number */
+ while (hashset_isspace(*str)) str++;
+
+ /* Check for closing brace, handling the case for an empty set */
+ if (*str == '}')
+ {
+ str++; /* Move past the closing brace */
+ break;
+ }
+
+ /* Check if "null" is encountered (case-insensitive) */
+ if (strncasecmp(str, "null", 4) == 0)
+ {
+ set->null_element = true;
+ str = str + 4; /* Move past "null" */
+ }
+ else
+ {
+ /* Parse the number */
+ value = strtol(str, &endptr, 10);
+
+ if (errno == ERANGE || value < PG_INT32_MIN || value > PG_INT32_MAX)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value \"%s\" is out of range for type %s", str,
+ "integer")));
+ }
+
+ /* Add the value to the hashset, resize if needed */
+ if (set->nelements >= set->capacity)
+ {
+ set = int4hashset_resize(set);
+ }
+ set = int4hashset_add_element(set, (int32)value);
+
+ /* Error handling for strtol */
+ if (endptr == str)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for integer: \"%s\"", str)));
+ }
+
+ str = endptr; /* Move to next number, "null" or closing brace */
+ }
+
+ /* Skip spaces before the next number or closing brace */
+ while (hashset_isspace(*str)) str++;
+
+ if (*str == ',')
+ {
+ str++; /* Skip comma before next loop iteration */
+ }
+ else if (*str != '}')
+ {
+ /* Unexpected character */
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("unexpected character \"%c\" in hashset input", *str)));
+ }
+ }
+
+ /* Only whitespace is allowed after the closing brace */
+ while (*str)
+ {
+ if (!hashset_isspace(*str))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed hashset literal: \"%s\"", str),
+ errdetail("Junk after closing right brace.")));
+ }
+ str++;
+ }
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_out(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ char *bitmap;
+ int32 *values;
+ int i;
+ StringInfoData str;
+
+ /* Calculate the pointer to the bitmap and values array */
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ /* Initialize the StringInfo buffer */
+ initStringInfo(&str);
+
+ /* Append the opening brace for the output hashset string */
+ appendStringInfoChar(&str, '{');
+
+ /* Loop through the elements and append them to the string */
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ /* Check if the bit in the bitmap is set */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Append the value */
+ if (str.len > 1)
+ appendStringInfoChar(&str, ',');
+ appendStringInfo(&str, "%d", values[i]);
+ }
+ }
+
+ /* Check if the null_element field is set */
+ if (set->null_element)
+ {
+ if (str.len > 1)
+ appendStringInfoChar(&str, ',');
+ appendStringInfoString(&str, "NULL");
+ }
+
+ /* Append the closing brace for the output hashset string */
+ appendStringInfoChar(&str, '}');
+
+ /* Return the resulting string */
+ PG_RETURN_CSTRING(str.data);
+}
+
+Datum
+int4hashset_send(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ StringInfoData buf;
+ int32 data_size;
+ int version = 1;
+
+ /* Begin constructing the message */
+ pq_begintypsend(&buf);
+
+ /* Send the version number */
+ pq_sendint8(&buf, version);
+
+ /* Send the non-data fields */
+ pq_sendint32(&buf, set->flags);
+ pq_sendint32(&buf, set->capacity);
+ pq_sendint32(&buf, set->nelements);
+ pq_sendint32(&buf, set->hashfn_id);
+ pq_sendfloat4(&buf, set->load_factor);
+ pq_sendfloat4(&buf, set->growth_factor);
+ pq_sendint32(&buf, set->ncollisions);
+ pq_sendint32(&buf, set->max_collisions);
+ pq_sendint32(&buf, set->hash);
+ pq_sendbyte(&buf, set->null_element ? 1 : 0);
+
+ /* Compute and send the size of the data field */
+ data_size = VARSIZE(set) - offsetof(int4hashset_t, data);
+ pq_sendbytes(&buf, set->data, data_size);
+
+ PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+}
+
+Datum
+int4hashset_recv(PG_FUNCTION_ARGS)
+{
+ StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
+ int4hashset_t *set;
+ int32 data_size;
+ Size total_size;
+ const char *binary_data;
+ int version;
+ int32 flags;
+ int32 capacity;
+ int32 nelements;
+ int32 hashfn_id;
+ float4 load_factor;
+ float4 growth_factor;
+ int32 ncollisions;
+ int32 max_collisions;
+ int32 hash;
+ bool null_element;
+
+ version = pq_getmsgint(buf, 1);
+ if (version != 1)
+ elog(ERROR, "unsupported hashset version number %d", version);
+
+ /* Read fields from buffer */
+ flags = pq_getmsgint(buf, 4);
+ capacity = pq_getmsgint(buf, 4);
+ nelements = pq_getmsgint(buf, 4);
+ hashfn_id = pq_getmsgint(buf, 4);
+ load_factor = pq_getmsgfloat4(buf);
+ growth_factor = pq_getmsgfloat4(buf);
+ ncollisions = pq_getmsgint(buf, 4);
+ max_collisions = pq_getmsgint(buf, 4);
+ hash = pq_getmsgint(buf, 4);
+ null_element = pq_getmsgbyte(buf) == 1;
+
+ /* Compute the size of the data field */
+ data_size = buf->len - buf->cursor;
+
+ /* Read the binary data */
+ binary_data = pq_getmsgbytes(buf, data_size);
+
+ /* Make sure that there is no extra data left in the message */
+ pq_getmsgend(buf);
+
+ /* Compute total size of hashset_t */
+ total_size = offsetof(int4hashset_t, data) + data_size;
+
+ /* Allocate memory for hashset including the data field */
+ set = (int4hashset_t *) palloc0(total_size);
+
+ /* Set the size of the variable-length data structure */
+ SET_VARSIZE(set, total_size);
+
+ /* Populate the structure */
+ set->flags = flags;
+ set->capacity = capacity;
+ set->nelements = nelements;
+ set->hashfn_id = hashfn_id;
+ set->load_factor = load_factor;
+ set->growth_factor = growth_factor;
+ set->ncollisions = ncollisions;
+ set->max_collisions = max_collisions;
+ set->hash = hash;
+ set->null_element = null_element;
+ memcpy(set->data, binary_data, data_size);
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_add(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ /* If there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ set = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ }
+ else
+ {
+ set = PG_GETARG_INT4HASHSET_COPY(0);
+ }
+
+ if (PG_ARGISNULL(1))
+ {
+ set->null_element = true;
+ }
+ else
+ {
+ int32 element = PG_GETARG_INT32(1);
+ set = int4hashset_add_element(set, element);
+ }
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_contains(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ int32 value;
+ bool result;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ if (set->nelements == 0 && !set->null_element)
+ PG_RETURN_BOOL(false);
+
+ if (PG_ARGISNULL(1))
+ PG_RETURN_NULL();
+
+ value = PG_GETARG_INT32(1);
+ result = int4hashset_contains_element(set, value);
+
+ if (!result && set->null_element)
+ PG_RETURN_NULL();
+
+ PG_RETURN_BOOL(result);
+}
+
+Datum
+int4hashset_cardinality(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ int64 cardinality = set->nelements + (set->null_element ? 1 : 0);
+
+ PG_RETURN_INT64(cardinality);
+}
+
+Datum
+int4hashset_union(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET_COPY(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
+ char *bitmap = setb->data;
+ int32 *values = (int32 *) (bitmap + CEIL_DIV(setb->capacity, 8));
+
+ for (i = 0; i < setb->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ seta = int4hashset_add_element(seta, values[i]);
+ }
+
+ if (!seta->null_element && setb->null_element)
+ seta->null_element = true;
+
+ PG_RETURN_POINTER(seta);
+}
+
+Datum
+int4hashset_init(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ int32 initial_capacity = PG_GETARG_INT32(0);
+ float4 load_factor = PG_GETARG_FLOAT4(1);
+ float4 growth_factor = PG_GETARG_FLOAT4(2);
+ int32 hashfn_id = PG_GETARG_INT32(3);
+
+ /* Validate input arguments */
+ if (!(initial_capacity >= 0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("initial capacity cannot be negative")));
+ }
+
+ if (!(load_factor > 0.0 && load_factor < 1.0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("load factor must be between 0.0 and 1.0")));
+ }
+
+ if (!(growth_factor > 1.0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("growth factor must be greater than 1.0")));
+ }
+
+ if (!(hashfn_id == JENKINS_LOOKUP3_HASHFN_ID ||
+ hashfn_id == MURMURHASH32_HASHFN_ID ||
+ hashfn_id == NAIVE_HASHFN_ID))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Invalid hash function ID")));
+ }
+
+ set = int4hashset_allocate(
+ initial_capacity,
+ load_factor,
+ growth_factor,
+ hashfn_id
+ );
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_capacity(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->capacity);
+}
+
+Datum
+int4hashset_collisions(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->ncollisions);
+}
+
+Datum
+int4hashset_max_collisions(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->max_collisions);
+}
+
+Datum
+int4hashset_agg_add(PG_FUNCTION_ARGS)
+{
+ MemoryContext aggcontext;
+ MemoryContext oldcontext;
+ int4hashset_t *state;
+
+ /* cannot be called directly because of internal-type argument */
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_add_add called in non-aggregate context");
+
+ /*
+ * We want to skip NULL values altogether - we return either the existing
+ * hashset (if it already exists) or NULL.
+ */
+ if (PG_ARGISNULL(1))
+ {
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ /* if there already is a state accumulated, don't forget it */
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+ }
+
+ /* if there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ state = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_add_element(state, PG_GETARG_INT32(1));
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(state);
+}
+
+Datum
+int4hashset_agg_add_set(PG_FUNCTION_ARGS)
+{
+ MemoryContext aggcontext;
+ MemoryContext oldcontext;
+ int4hashset_t *state;
+
+ /* cannot be called directly because of internal-type argument */
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_add_add called in non-aggregate context");
+
+ /*
+ * We want to skip NULL values altogether - we return either the existing
+ * hashset (if it already exists) or NULL.
+ */
+ if (PG_ARGISNULL(1))
+ {
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ /* if there already is a state accumulated, don't forget it */
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+ }
+
+ /* if there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ state = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+
+ {
+ int i;
+ char *bitmap;
+ int32 *values;
+ int4hashset_t *value;
+
+ value = PG_GETARG_INT4HASHSET(1);
+
+ bitmap = value->data;
+ values = (int32 *) (value->data + CEIL_DIV(value->capacity, 8));
+
+ for (i = 0; i < value->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ state = int4hashset_add_element(state, values[i]);
+ }
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(state);
+}
+
+Datum
+int4hashset_agg_final(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_POINTER(PG_GETARG_POINTER(0));
+}
+
+Datum
+int4hashset_agg_combine(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *src;
+ int4hashset_t *dst;
+ MemoryContext aggcontext;
+ MemoryContext oldcontext;
+ char *bitmap;
+ int32 *values;
+
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_agg_combine called in non-aggregate context");
+
+ /* if no "merged" state yet, try creating it */
+ if (PG_ARGISNULL(0))
+ {
+ /* nope, the second argument is NULL to, so return NULL */
+ if (PG_ARGISNULL(1))
+ PG_RETURN_NULL();
+
+ /* the second argument is not NULL, so copy it */
+ src = (int4hashset_t *) PG_GETARG_POINTER(1);
+
+ /* copy the hashset into the right long-lived memory context */
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ src = int4hashset_copy(src);
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(src);
+ }
+
+ /*
+ * If the second argument is NULL, just return the first one (we know
+ * it's not NULL at this point).
+ */
+ if (PG_ARGISNULL(1))
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+
+ /* Now we know neither argument is NULL, so merge them. */
+ src = (int4hashset_t *) PG_GETARG_POINTER(1);
+ dst = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ bitmap = src->data;
+ values = (int32 *) (src->data + CEIL_DIV(src->capacity, 8));
+
+ for (i = 0; i < src->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ dst = int4hashset_add_element(dst, values[i]);
+ }
+
+
+ PG_RETURN_POINTER(dst);
+}
+
+Datum
+int4hashset_to_array(PG_FUNCTION_ARGS)
+{
+ int i,
+ idx;
+ int4hashset_t *set;
+ int32 *values;
+ int nvalues;
+ char *sbitmap;
+ int32 *svalues;
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ /* if hashset is empty and does not contain null, return an empty array */
+ if(set->nelements == 0 && !set->null_element) {
+ Datum d = PointerGetDatum(construct_empty_array(INT4OID));
+ PG_RETURN_ARRAYTYPE_P(d);
+ }
+
+ sbitmap = set->data;
+ svalues = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ /* number of values to store in the array */
+ nvalues = set->nelements;
+ values = (int32 *) palloc(sizeof(int32) * nvalues);
+
+ idx = 0;
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (sbitmap[byte] & (0x01 << bit))
+ values[idx++] = svalues[i];
+ }
+
+ Assert(idx == nvalues);
+
+ return int32_to_array(fcinfo, values, nvalues, set->null_element);
+}
+
+Datum
+int4hashset_to_sorted_array(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ int32 *values;
+ int nvalues;
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ /* if hashset is empty and does not contain null, return an empty array */
+ if(set->nelements == 0 && !set->null_element) {
+ Datum d = PointerGetDatum(construct_empty_array(INT4OID));
+ PG_RETURN_ARRAYTYPE_P(d);
+ }
+
+ /* extract the sorted elements from the hashset */
+ values = int4hashset_extract_sorted_elements(set);
+
+ /* number of values to store in the array */
+ nvalues = set->nelements;
+
+ return int32_to_array(fcinfo, values, nvalues, set->null_element);
+}
+
+Datum
+int4hashset_eq(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ char *bitmap_a;
+ int32 *values_a;
+
+ /*
+ * Check if the number of elements is the same
+ */
+ if (a->nelements != b->nelements)
+ PG_RETURN_BOOL(false);
+
+ bitmap_a = a->data;
+ values_a = (int32 *)(a->data + CEIL_DIV(a->capacity, 8));
+
+ /*
+ * Check if every element in a is also in b
+ */
+ for (i = 0; i < a->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap_a[byte] & (0x01 << bit))
+ {
+ int32 value = values_a[i];
+
+ if (!int4hashset_contains_element(b, value))
+ PG_RETURN_BOOL(false);
+ }
+ }
+
+ if (a->null_element != b->null_element)
+ PG_RETURN_BOOL(false);
+
+ /*
+ * All elements in a are in b and the number of elements is the same,
+ * so the sets must be equal.
+ */
+ PG_RETURN_BOOL(true);
+}
+
+
+Datum
+int4hashset_ne(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+
+ /* If a is not equal to b, then they are not equal */
+ if (!DatumGetBool(DirectFunctionCall2(int4hashset_eq, PointerGetDatum(a), PointerGetDatum(b))))
+ PG_RETURN_BOOL(true);
+
+ PG_RETURN_BOOL(false);
+}
+
+
+Datum int4hashset_hash(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT32(set->hash);
+}
+
+
+Datum
+int4hashset_lt(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp < 0);
+}
+
+
+Datum
+int4hashset_le(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp <= 0);
+}
+
+
+Datum
+int4hashset_gt(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp > 0);
+}
+
+
+Datum
+int4hashset_ge(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp >= 0);
+}
+
+Datum
+int4hashset_cmp(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 *elements_a;
+ int32 *elements_b;
+
+ /*
+ * Compare the hashes first, if they are different,
+ * we can immediately tell which set is 'greater'
+ */
+ if (a->hash < b->hash)
+ PG_RETURN_INT32(-1);
+ else if (a->hash > b->hash)
+ PG_RETURN_INT32(1);
+
+ /*
+ * If hashes are equal, perform a more rigorous comparison
+ */
+
+ /*
+ * If number of elements are different,
+ * we can use that to deterministically return -1 or 1
+ */
+ if (a->nelements < b->nelements)
+ PG_RETURN_INT32(-1);
+ else if (a->nelements > b->nelements)
+ PG_RETURN_INT32(1);
+
+ /* Assert that the number of elements in both hashsets are equal */
+ Assert(a->nelements == b->nelements);
+
+ /* Extract and sort elements from each set */
+ elements_a = int4hashset_extract_sorted_elements(a);
+ elements_b = int4hashset_extract_sorted_elements(b);
+
+ /* Now we can perform a lexicographical comparison */
+ for (int32 i = 0; i < a->nelements; i++)
+ {
+ if (elements_a[i] < elements_b[i])
+ {
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(-1);
+ }
+ else if (elements_a[i] > elements_b[i])
+ {
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(1);
+ }
+ }
+
+ /* All elements are equal, so the sets are equal */
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(0);
+}
+
+Datum
+int4hashset_intersection(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
+ char *bitmap = setb->data;
+ int32 *values = (int32 *)(bitmap + CEIL_DIV(setb->capacity, 8));
+ int4hashset_t *intersection;
+
+ intersection = int4hashset_allocate(
+ seta->capacity,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ for (i = 0; i < setb->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if ((bitmap[byte] & (0x01 << bit)) &&
+ int4hashset_contains_element(seta, values[i]))
+ {
+ intersection = int4hashset_add_element(intersection, values[i]);
+ }
+ }
+
+ if (seta->null_element && setb->null_element)
+ intersection->null_element = true;
+
+ PG_RETURN_POINTER(intersection);
+}
+
+Datum
+int4hashset_difference(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
+ int4hashset_t *difference;
+ char *bitmap = seta->data;
+ int32 *values = (int32 *)(bitmap + CEIL_DIV(seta->capacity, 8));
+
+ difference = int4hashset_allocate(
+ seta->capacity,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ for (i = 0; i < seta->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if ((bitmap[byte] & (0x01 << bit)) &&
+ !int4hashset_contains_element(setb, values[i]))
+ {
+ difference = int4hashset_add_element(difference, values[i]);
+ }
+ }
+
+ if (seta->null_element && !setb->null_element)
+ difference->null_element = true;
+
+ PG_RETURN_POINTER(difference);
+}
+
+Datum
+int4hashset_symmetric_difference(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
+ int4hashset_t *result;
+ char *bitmapa = seta->data;
+ char *bitmapb = setb->data;
+ int32 *valuesa = (int32 *) (bitmapa + CEIL_DIV(seta->capacity, 8));
+ int32 *valuesb = (int32 *) (bitmapb + CEIL_DIV(setb->capacity, 8));
+
+ result = int4hashset_allocate(
+ seta->nelements + setb->nelements,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ /* Add elements that are in seta but not in setb */
+ for (i = 0; i < seta->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ if (bitmapa[byte] & (0x01 << bit))
+ {
+ int32 value = valuesa[i];
+ if (!int4hashset_contains_element(setb, value))
+ result = int4hashset_add_element(result, value);
+ }
+ }
+
+ /* Add elements that are in setb but not in seta */
+ for (i = 0; i < setb->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ if (bitmapb[byte] & (0x01 << bit))
+ {
+ int32 value = valuesb[i];
+ if (!int4hashset_contains_element(seta, value))
+ result = int4hashset_add_element(result, value);
+ }
+ }
+
+ if (seta->null_element ^ setb->null_element)
+ result->null_element = true;
+
+ PG_RETURN_POINTER(result);
+}
diff --git a/hashset.c b/hashset.c
new file mode 100644
index 0000000..91907ab
--- /dev/null
+++ b/hashset.c
@@ -0,0 +1,339 @@
+/*
+ * hashset.c
+ *
+ * Copyright (C) Tomas Vondra, 2019
+ */
+
+#include "hashset.h"
+
+static int int32_cmp(const void *a, const void *b);
+
+int4hashset_t *
+int4hashset_allocate(
+ int capacity,
+ float4 load_factor,
+ float4 growth_factor,
+ int hashfn_id
+)
+{
+ Size len;
+ int4hashset_t *set;
+ char *ptr;
+
+ /*
+ * Ensure that capacity is not divisible by HASHSET_STEP;
+ * i.e. the step size used in hashset_add_element()
+ * and hashset_contains_element().
+ */
+ while (capacity % HASHSET_STEP == 0)
+ capacity++;
+
+ len = offsetof(int4hashset_t, data);
+ len += CEIL_DIV(capacity, 8);
+ len += capacity * sizeof(int32);
+
+ ptr = palloc0(len);
+ SET_VARSIZE(ptr, len);
+
+ set = (int4hashset_t *) ptr;
+
+ set->flags = 0;
+ set->capacity = capacity;
+ set->nelements = 0;
+ set->hashfn_id = hashfn_id;
+ set->load_factor = load_factor;
+ set->growth_factor = growth_factor;
+ set->ncollisions = 0;
+ set->max_collisions = 0;
+ set->hash = 0; /* Initial hash value */
+ set->null_element = false; /* No null element initially */
+
+ set->flags |= 0;
+
+ return set;
+}
+
+int4hashset_t *
+int4hashset_resize(int4hashset_t * set)
+{
+ int i;
+ int4hashset_t *new;
+ char *bitmap;
+ int32 *values;
+ int new_capacity;
+
+ new_capacity = (int)(set->capacity * set->growth_factor);
+
+ /*
+ * If growth factor is too small, new capacity might remain the same as
+ * the old capacity. This can lead to an infinite loop in resizing.
+ * To prevent this, we manually increment the capacity by 1 if new capacity
+ * equals the old capacity.
+ */
+ if (new_capacity == set->capacity)
+ new_capacity = set->capacity + 1;
+
+ new = int4hashset_allocate(
+ new_capacity,
+ set->load_factor,
+ set->growth_factor,
+ set->hashfn_id
+ );
+
+ /* Calculate the pointer to the bitmap and values array */
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ int4hashset_add_element(new, values[i]);
+ }
+
+ return new;
+}
+
+int4hashset_t *
+int4hashset_add_element(int4hashset_t *set, int32 value)
+{
+ int byte;
+ int bit;
+ uint32 hash;
+ uint32 position;
+ char *bitmap;
+ int32 *values;
+ int32 current_collisions = 0;
+
+ if (set->nelements > set->capacity * set->load_factor)
+ set = int4hashset_resize(set);
+
+ if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
+ {
+ hash = hash_bytes_uint32((uint32) value);
+ }
+ else if (set->hashfn_id == MURMURHASH32_HASHFN_ID)
+ {
+ hash = murmurhash32((uint32) value);
+ }
+ else if (set->hashfn_id == NAIVE_HASHFN_ID)
+ {
+ hash = ((uint32) value * 7691 + 4201);
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid hash function ID: \"%d\"", set->hashfn_id)));
+ }
+
+ position = hash % set->capacity;
+
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ while (true)
+ {
+ byte = (position / 8);
+ bit = (position % 8);
+
+ /* The item is already used - maybe it's the same value? */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Same value, we're done */
+ if (values[position] == value)
+ break;
+
+ /* Increment the collision counter */
+ set->ncollisions++;
+ current_collisions++;
+
+ if (current_collisions > set->max_collisions)
+ set->max_collisions = current_collisions;
+
+ position = (position + HASHSET_STEP) % set->capacity;
+ continue;
+ }
+
+ /* Found an empty spot, before hitting the value first */
+ bitmap[byte] |= (0x01 << bit);
+ values[position] = value;
+
+ set->hash ^= hash;
+
+ set->nelements++;
+
+ break;
+ }
+
+ return set;
+}
+
+bool
+int4hashset_contains_element(int4hashset_t *set, int32 value)
+{
+ int byte;
+ int bit;
+ uint32 hash;
+ uint32 position;
+ char *bitmap;
+ int32 *values;
+ int num_probes = 0; /* Counter for the number of probes */
+
+ if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
+ {
+ hash = hash_bytes_uint32((uint32) value);
+ }
+ else if (set->hashfn_id == MURMURHASH32_HASHFN_ID)
+ {
+ hash = murmurhash32((uint32) value);
+ }
+ else if (set->hashfn_id == NAIVE_HASHFN_ID)
+ {
+ hash = ((uint32) value * NAIVE_HASHFN_MULTIPLIER + NAIVE_HASHFN_INCREMENT);
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid hash function ID: \"%d\"", set->hashfn_id)));
+ }
+
+ position = hash % set->capacity;
+
+ bitmap = set->data;
+ values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+
+ while (true)
+ {
+ byte = (position / 8);
+ bit = (position % 8);
+
+ /* Found an empty slot, value is not there */
+ if ((bitmap[byte] & (0x01 << bit)) == 0)
+ return false;
+
+ /* Is it the same value? */
+ if (values[position] == value)
+ return true;
+
+ /* Move to the next element */
+ position = (position + HASHSET_STEP) % set->capacity;
+
+ num_probes++; /* Increment the number of probes */
+
+ /* Check if we have probed all slots */
+ if (num_probes >= set->capacity)
+ return false; /* Avoid infinite loop */
+ }
+}
+
+int32 *
+int4hashset_extract_sorted_elements(int4hashset_t *set)
+{
+ /* Allocate memory for the elements array */
+ int32 *elements = palloc(set->nelements * sizeof(int32));
+
+ /* Access the data array */
+ char *bitmap = set->data;
+ int32 *values = (int32 *)(set->data + CEIL_DIV(set->capacity, 8));
+
+ /* Counter for the number of extracted elements */
+ int32 nextracted = 0;
+
+ /* Iterate through all elements */
+ for (int32 i = 0; i < set->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ /* Check if the current position is occupied */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Add the value to the elements array */
+ elements[nextracted++] = values[i];
+ }
+ }
+
+ /* Make sure we extracted the correct number of elements */
+ Assert(nextracted == set->nelements);
+
+ /* Sort the elements array */
+ qsort(elements, nextracted, sizeof(int32), int32_cmp);
+
+ /* Return the sorted elements array */
+ return elements;
+}
+
+int4hashset_t *
+int4hashset_copy(int4hashset_t *src)
+{
+ return src;
+}
+
+/*
+ * hashset_isspace() --- a non-locale-dependent isspace()
+ *
+ * Identical to array_isspace() in src/backend/utils/adt/arrayfuncs.c.
+ * We used to use isspace() for parsing hashset values, but that has
+ * undesirable results: a hashset value might be silently interpreted
+ * differently depending on the locale setting. So here, we hard-wire
+ * the traditional ASCII definition of isspace().
+ */
+bool
+hashset_isspace(char ch)
+{
+ if (ch == ' ' ||
+ ch == '\t' ||
+ ch == '\n' ||
+ ch == '\r' ||
+ ch == '\v' ||
+ ch == '\f')
+ return true;
+ return false;
+}
+
+/*
+ * Construct an SQL array from a simple C double array
+ */
+Datum
+int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len, bool null_element)
+{
+ ArrayBuildState *astate = NULL;
+ int i;
+
+ for (i = 0; i < len; i++)
+ {
+ /* Stash away this field */
+ astate = accumArrayResult(astate,
+ Int32GetDatum(d[i]),
+ false,
+ INT4OID,
+ CurrentMemoryContext);
+ }
+
+ if (null_element)
+ {
+ astate = accumArrayResult(astate,
+ (Datum) 0,
+ true,
+ INT4OID,
+ CurrentMemoryContext);
+ }
+
+ PG_RETURN_DATUM(makeArrayResult(astate,
+ CurrentMemoryContext));
+}
+
+static int
+int32_cmp(const void *a, const void *b)
+{
+ int32 arg1 = *(const int32 *)a;
+ int32 arg2 = *(const int32 *)b;
+
+ if (arg1 < arg2) return -1;
+ if (arg1 > arg2) return 1;
+ return 0;
+}
diff --git a/hashset.control b/hashset.control
new file mode 100644
index 0000000..0743003
--- /dev/null
+++ b/hashset.control
@@ -0,0 +1,3 @@
+comment = 'Provides hashset type.'
+default_version = '0.0.1'
+relocatable = true
diff --git a/hashset.h b/hashset.h
new file mode 100644
index 0000000..86f5d1b
--- /dev/null
+++ b/hashset.h
@@ -0,0 +1,54 @@
+#ifndef HASHSET_H
+#define HASHSET_H
+
+#include "postgres.h"
+#include "libpq/pqformat.h"
+#include "nodes/memnodes.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "catalog/pg_type.h"
+#include "common/hashfn.h"
+
+#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
+#define HASHSET_STEP 13
+#define JENKINS_LOOKUP3_HASHFN_ID 1
+#define MURMURHASH32_HASHFN_ID 2
+#define NAIVE_HASHFN_ID 3
+#define NAIVE_HASHFN_MULTIPLIER 7691
+#define NAIVE_HASHFN_INCREMENT 4201
+
+/*
+ * These defaults should match the the SQL function int4hashset()
+ */
+#define DEFAULT_INITIAL_CAPACITY 0
+#define DEFAULT_LOAD_FACTOR 0.75
+#define DEFAULT_GROWTH_FACTOR 2.0
+#define DEFAULT_HASHFN_ID JENKINS_LOOKUP3_HASHFN_ID
+
+typedef struct int4hashset_t {
+ int32 vl_len_; /* Varlena header (do not touch directly!) */
+ int32 flags; /* Reserved for future use (versioning, ...) */
+ int32 capacity; /* Max number of element we have space for */
+ int32 nelements; /* Number of items added to the hashset */
+ int32 hashfn_id; /* ID of the hash function used */
+ float4 load_factor; /* Load factor before triggering resize */
+ float4 growth_factor; /* Growth factor when resizing the hashset */
+ int32 ncollisions; /* Number of collisions */
+ int32 max_collisions; /* Maximum collisions for a single element */
+ int32 hash; /* Stored hash value of the hashset */
+ bool null_element; /* Indicates if null is present in hashset */
+ char data[FLEXIBLE_ARRAY_MEMBER];
+} int4hashset_t;
+
+int4hashset_t *int4hashset_allocate(int capacity, float4 load_factor, float4 growth_factor, int hashfn_id);
+int4hashset_t *int4hashset_resize(int4hashset_t * set);
+int4hashset_t *int4hashset_add_element(int4hashset_t *set, int32 value);
+bool int4hashset_contains_element(int4hashset_t *set, int32 value);
+int32 *int4hashset_extract_sorted_elements(int4hashset_t *set);
+int4hashset_t *int4hashset_copy(int4hashset_t *src);
+bool hashset_isspace(char ch);
+Datum int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len, bool null_element);
+
+#endif /* HASHSET_H */
diff --git a/test/c_tests/test_send_recv.c b/test/c_tests/test_send_recv.c
new file mode 100644
index 0000000..cc7c48a
--- /dev/null
+++ b/test/c_tests/test_send_recv.c
@@ -0,0 +1,92 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <libpq-fe.h>
+
+void exit_nicely(PGconn *conn) {
+ PQfinish(conn);
+ exit(1);
+}
+
+int main() {
+ /* Connect to database specified by the PGDATABASE environment variable */
+ const char *hostname = getenv("PGHOST");
+ char conninfo[1024];
+ PGconn *conn;
+
+ if (hostname == NULL)
+ hostname = "localhost";
+
+ /* Connect to database specified by the PGDATABASE environment variable */
+ snprintf(conninfo, sizeof(conninfo), "host=%s port=5432", hostname);
+ conn = PQconnectdb(conninfo);
+ if (PQstatus(conn) != CONNECTION_OK) {
+ fprintf(stderr, "Connection to database failed: %s", PQerrorMessage(conn));
+ exit_nicely(conn);
+ }
+
+ /* Create extension */
+ PQexec(conn, "CREATE EXTENSION IF NOT EXISTS hashset");
+
+ /* Create temporary table */
+ PQexec(conn, "CREATE TABLE IF NOT EXISTS test_hashset_send_recv (hashset_col int4hashset)");
+
+ /* Enable binary output */
+ PQexec(conn, "SET bytea_output = 'escape'");
+
+ /* Insert dummy data */
+ const char *insert_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ('{1,2,3}'::int4hashset)";
+ PGresult *res = PQexec(conn, insert_command);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK) {
+ fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+ PQclear(res);
+
+ /* Fetch the data in binary format */
+ const char *select_command = "SELECT hashset_col FROM test_hashset_send_recv";
+ int resultFormat = 1; /* 0 = text, 1 = binary */
+ res = PQexecParams(conn, select_command, 0, NULL, NULL, NULL, NULL, resultFormat);
+ if (PQresultStatus(res) != PGRES_TUPLES_OK) {
+ fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+
+ /* Store binary data for later use */
+ const char *binary_data = PQgetvalue(res, 0, 0);
+ int binary_data_length = PQgetlength(res, 0, 0);
+ PQclear(res);
+
+ /* Re-insert the binary data */
+ const char *insert_binary_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ($1)";
+ const char *paramValues[1] = {binary_data};
+ int paramLengths[1] = {binary_data_length};
+ int paramFormats[1] = {1}; /* binary format */
+ res = PQexecParams(conn, insert_binary_command, 1, NULL, paramValues, paramLengths, paramFormats, 0);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK) {
+ fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+ PQclear(res);
+
+ /* Check the data */
+ const char *check_command = "SELECT COUNT(DISTINCT hashset_col) AS unique_count, COUNT(*) FROM test_hashset_send_recv";
+ res = PQexec(conn, check_command);
+ if (PQresultStatus(res) != PGRES_TUPLES_OK) {
+ fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+
+ /* Print the results */
+ printf("unique_count: %s\n", PQgetvalue(res, 0, 0));
+ printf("count: %s\n", PQgetvalue(res, 0, 1));
+ PQclear(res);
+
+ /* Disconnect */
+ PQfinish(conn);
+
+ return 0;
+}
diff --git a/test/c_tests/test_send_recv.sh b/test/c_tests/test_send_recv.sh
new file mode 100755
index 0000000..ab308b3
--- /dev/null
+++ b/test/c_tests/test_send_recv.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+# Get the directory of this script
+SCRIPT_DIR="$(dirname "$(realpath "$0")")"
+
+# Set up database
+export PGDATABASE=test_hashset_send_recv
+dropdb --if-exists "$PGDATABASE"
+createdb
+
+# Define directories
+EXPECTED_DIR="$SCRIPT_DIR/../expected"
+RESULTS_DIR="$SCRIPT_DIR/../results"
+
+# Create the results directory if it doesn't exist
+mkdir -p "$RESULTS_DIR"
+
+# Run the C test and save its output to the results directory
+"$SCRIPT_DIR/test_send_recv" > "$RESULTS_DIR/test_send_recv.out"
+
+printf "test test_send_recv ... "
+
+# Compare the actual output with the expected output
+if diff -q "$RESULTS_DIR/test_send_recv.out" "$EXPECTED_DIR/test_send_recv.out" > /dev/null 2>&1; then
+ echo "ok"
+ # Clean up by removing the results directory if the test passed
+ rm -r "$RESULTS_DIR"
+else
+ echo "failed"
+ git diff --no-index --color "$EXPECTED_DIR/test_send_recv.out" "$RESULTS_DIR/test_send_recv.out"
+fi
diff --git a/test/expected/array-and-multiset-semantics.out b/test/expected/array-and-multiset-semantics.out
new file mode 100644
index 0000000..8f989a1
--- /dev/null
+++ b/test/expected/array-and-multiset-semantics.out
@@ -0,0 +1,365 @@
+CREATE OR REPLACE FUNCTION array_union(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_intersection(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ EXCEPT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_symmetric_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) AS q1
+ EXCEPT
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) AS q2
+ ) AS q3
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_sort_distinct(int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ cardinality($1) = 0
+ THEN
+ '{}'::int4[]
+ ELSE
+ (
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM unnest($1)
+ )
+ END
+$$ LANGUAGE sql;
+DROP TABLE IF EXISTS hashset_test_results_1;
+NOTICE: table "hashset_test_results_1" does not exist, skipping
+CREATE TABLE hashset_test_results_1 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_add(arg1::int4hashset, arg2),
+ array_append(arg1::int4[], arg2),
+ hashset_contains(arg1::int4hashset, arg2),
+ arg2 = ANY(arg1::int4[]) AS "= ANY(...)"
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL::int4), (1::int4), (4::int4)) AS b(arg2);
+DROP TABLE IF EXISTS hashset_test_results_2;
+NOTICE: table "hashset_test_results_2" does not exist, skipping
+CREATE TABLE hashset_test_results_2 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_union(arg1::int4hashset, arg2::int4hashset),
+ array_union(arg1::int4[], arg2::int4[]),
+ hashset_intersection(arg1::int4hashset, arg2::int4hashset),
+ array_intersection(arg1::int4[], arg2::int4[]),
+ hashset_difference(arg1::int4hashset, arg2::int4hashset),
+ array_difference(arg1::int4[], arg2::int4[]),
+ hashset_symmetric_difference(arg1::int4hashset, arg2::int4hashset),
+ array_symmetric_difference(arg1::int4[], arg2::int4[]),
+ hashset_eq(arg1::int4hashset, arg2::int4hashset),
+ array_eq(arg1::int4[], arg2::int4[]),
+ hashset_ne(arg1::int4hashset, arg2::int4hashset),
+ array_ne(arg1::int4[], arg2::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS b(arg2);
+DROP TABLE IF EXISTS hashset_test_results_3;
+NOTICE: table "hashset_test_results_3" does not exist, skipping
+CREATE TABLE hashset_test_results_3 AS
+SELECT
+ arg1,
+ hashset_cardinality(arg1::int4hashset),
+ cardinality(arg1::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1);
+SELECT * FROM hashset_test_results_1;
+ arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
+--------+------+-------------+--------------+------------------+------------
+ | | {NULL} | {NULL} | |
+ | 1 | {1} | {1} | |
+ | 4 | {4} | {4} | |
+ {} | | {NULL} | {NULL} | f | f
+ {} | 1 | {1} | {1} | f | f
+ {} | 4 | {4} | {4} | f | f
+ {NULL} | | {NULL} | {NULL,NULL} | |
+ {NULL} | 1 | {1,NULL} | {NULL,1} | |
+ {NULL} | 4 | {4,NULL} | {NULL,4} | |
+ {1} | | {1,NULL} | {1,NULL} | |
+ {1} | 1 | {1} | {1,1} | t | t
+ {1} | 4 | {1,4} | {1,4} | f | f
+ {2} | | {2,NULL} | {2,NULL} | |
+ {2} | 1 | {2,1} | {2,1} | f | f
+ {2} | 4 | {2,4} | {2,4} | f | f
+ {1,2} | | {1,2,NULL} | {1,2,NULL} | |
+ {1,2} | 1 | {1,2} | {1,2,1} | t | t
+ {1,2} | 4 | {4,1,2} | {1,2,4} | f | f
+ {2,3} | | {2,3,NULL} | {2,3,NULL} | |
+ {2,3} | 1 | {1,2,3} | {2,3,1} | f | f
+ {2,3} | 4 | {4,2,3} | {2,3,4} | f | f
+(21 rows)
+
+SELECT * FROM hashset_test_results_2;
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+----------+----------+---------------+--------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+ | | | | | | | | | | | | |
+ | {} | | | | | | | | | | | |
+ | {NULL} | | | | | | | | | | | |
+ | {1} | | | | | | | | | | | |
+ | {1,NULL} | | | | | | | | | | | |
+ | {2} | | | | | | | | | | | |
+ | {1,2} | | | | | | | | | | | |
+ | {2,3} | | | | | | | | | | | |
+ {} | | | | | | | | | | | | |
+ {} | {} | {} | {} | {} | {} | {} | {} | {} | {} | t | t | f | f
+ {} | {NULL} | {NULL} | {NULL} | {} | {} | {} | {} | {NULL} | {NULL} | f | f | t | t
+ {} | {1} | {1} | {1} | {} | {} | {} | {} | {1} | {1} | f | f | t | t
+ {} | {1,NULL} | {1,NULL} | {1,NULL} | {} | {} | {} | {} | {1,NULL} | {1,NULL} | f | f | t | t
+ {} | {2} | {2} | {2} | {} | {} | {} | {} | {2} | {2} | f | f | t | t
+ {} | {1,2} | {1,2} | {1,2} | {} | {} | {} | {} | {1,2} | {1,2} | f | f | t | t
+ {} | {2,3} | {2,3} | {2,3} | {} | {} | {} | {} | {2,3} | {2,3} | f | f | t | t
+ {NULL} | | | | | | | | | | | | |
+ {NULL} | {} | {NULL} | {NULL} | {} | {} | {NULL} | {NULL} | {NULL} | {NULL} | f | f | t | t
+ {NULL} | {NULL} | {NULL} | {NULL} | {NULL} | {NULL} | {} | {} | {} | {} | t | t | f | f
+ {NULL} | {1} | {1,NULL} | {1,NULL} | {} | {} | {NULL} | {NULL} | {1,NULL} | {1,NULL} | f | f | t | t
+ {NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {NULL} | {NULL} | {} | {} | {1} | {1} | f | f | t | t
+ {NULL} | {2} | {2,NULL} | {2,NULL} | {} | {} | {NULL} | {NULL} | {2,NULL} | {2,NULL} | f | f | t | t
+ {NULL} | {1,2} | {2,1,NULL} | {1,2,NULL} | {} | {} | {NULL} | {NULL} | {1,2,NULL} | {1,2,NULL} | f | f | t | t
+ {NULL} | {2,3} | {3,2,NULL} | {2,3,NULL} | {} | {} | {NULL} | {NULL} | {2,3,NULL} | {2,3,NULL} | f | f | t | t
+ {1} | | | | | | | | | | | | |
+ {1} | {} | {1} | {1} | {} | {} | {1} | {1} | {1} | {1} | f | f | t | t
+ {1} | {NULL} | {1,NULL} | {1,NULL} | {} | {} | {1} | {1} | {1,NULL} | {1,NULL} | f | f | t | t
+ {1} | {1} | {1} | {1} | {1} | {1} | {} | {} | {} | {} | t | t | f | f
+ {1} | {1,NULL} | {1,NULL} | {1,NULL} | {1} | {1} | {} | {} | {NULL} | {NULL} | f | f | t | t
+ {1} | {2} | {1,2} | {1,2} | {} | {} | {1} | {1} | {1,2} | {1,2} | f | f | t | t
+ {1} | {1,2} | {1,2} | {1,2} | {1} | {1} | {} | {} | {2} | {2} | f | f | t | t
+ {1} | {2,3} | {3,1,2} | {1,2,3} | {} | {} | {1} | {1} | {3,2,1} | {1,2,3} | f | f | t | t
+ {1,NULL} | | | | | | | | | | | | |
+ {1,NULL} | {} | {1,NULL} | {1,NULL} | {} | {} | {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | f | f | t | t
+ {1,NULL} | {NULL} | {1,NULL} | {1,NULL} | {NULL} | {NULL} | {1} | {1} | {1} | {1} | f | f | t | t
+ {1,NULL} | {1} | {1,NULL} | {1,NULL} | {1} | {1} | {NULL} | {NULL} | {NULL} | {NULL} | f | f | t | t
+ {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {} | {} | {} | {} | t | t | f | f
+ {1,NULL} | {2} | {1,2,NULL} | {1,2,NULL} | {} | {} | {1,NULL} | {1,NULL} | {1,2,NULL} | {1,2,NULL} | f | f | t | t
+ {1,NULL} | {1,2} | {1,2,NULL} | {1,2,NULL} | {1} | {1} | {NULL} | {NULL} | {2,NULL} | {2,NULL} | f | f | t | t
+ {1,NULL} | {2,3} | {3,1,2,NULL} | {1,2,3,NULL} | {} | {} | {1,NULL} | {1,NULL} | {3,2,1,NULL} | {1,2,3,NULL} | f | f | t | t
+ {2} | | | | | | | | | | | | |
+ {2} | {} | {2} | {2} | {} | {} | {2} | {2} | {2} | {2} | f | f | t | t
+ {2} | {NULL} | {2,NULL} | {2,NULL} | {} | {} | {2} | {2} | {2,NULL} | {2,NULL} | f | f | t | t
+ {2} | {1} | {2,1} | {1,2} | {} | {} | {2} | {2} | {2,1} | {1,2} | f | f | t | t
+ {2} | {1,NULL} | {2,1,NULL} | {1,2,NULL} | {} | {} | {2} | {2} | {2,1,NULL} | {1,2,NULL} | f | f | t | t
+ {2} | {2} | {2} | {2} | {2} | {2} | {} | {} | {} | {} | t | t | f | f
+ {2} | {1,2} | {2,1} | {1,2} | {2} | {2} | {} | {} | {1} | {1} | f | f | t | t
+ {2} | {2,3} | {2,3} | {2,3} | {2} | {2} | {} | {} | {3} | {3} | f | f | t | t
+ {1,2} | | | | | | | | | | | | |
+ {1,2} | {} | {1,2} | {1,2} | {} | {} | {1,2} | {1,2} | {1,2} | {1,2} | f | f | t | t
+ {1,2} | {NULL} | {1,2,NULL} | {1,2,NULL} | {} | {} | {1,2} | {1,2} | {1,2,NULL} | {1,2,NULL} | f | f | t | t
+ {1,2} | {1} | {1,2} | {1,2} | {1} | {1} | {2} | {2} | {2} | {2} | f | f | t | t
+ {1,2} | {1,NULL} | {1,2,NULL} | {1,2,NULL} | {1} | {1} | {2} | {2} | {2,NULL} | {2,NULL} | f | f | t | t
+ {1,2} | {2} | {1,2} | {1,2} | {2} | {2} | {1} | {1} | {1} | {1} | f | f | t | t
+ {1,2} | {1,2} | {1,2} | {1,2} | {1,2} | {1,2} | {} | {} | {} | {} | t | t | f | f
+ {1,2} | {2,3} | {3,1,2} | {1,2,3} | {2} | {2} | {1} | {1} | {1,3} | {1,3} | f | f | t | t
+ {2,3} | | | | | | | | | | | | |
+ {2,3} | {} | {2,3} | {2,3} | {} | {} | {2,3} | {2,3} | {2,3} | {2,3} | f | f | t | t
+ {2,3} | {NULL} | {2,3,NULL} | {2,3,NULL} | {} | {} | {2,3} | {2,3} | {2,3,NULL} | {2,3,NULL} | f | f | t | t
+ {2,3} | {1} | {1,2,3} | {1,2,3} | {} | {} | {2,3} | {2,3} | {3,2,1} | {1,2,3} | f | f | t | t
+ {2,3} | {1,NULL} | {1,2,3,NULL} | {1,2,3,NULL} | {} | {} | {2,3} | {2,3} | {3,2,1,NULL} | {1,2,3,NULL} | f | f | t | t
+ {2,3} | {2} | {2,3} | {2,3} | {2} | {2} | {3} | {3} | {3} | {3} | f | f | t | t
+ {2,3} | {1,2} | {1,2,3} | {1,2,3} | {2} | {2} | {3} | {3} | {1,3} | {1,3} | f | f | t | t
+ {2,3} | {2,3} | {2,3} | {2,3} | {2,3} | {2,3} | {} | {} | {} | {} | t | t | f | f
+(64 rows)
+
+SELECT * FROM hashset_test_results_3;
+ arg1 | hashset_cardinality | cardinality
+--------+---------------------+-------------
+ | |
+ {} | 0 | 0
+ {NULL} | 1 | 1
+ {1} | 1 | 1
+ {2} | 1 | 1
+ {1,2} | 2 | 2
+ {2,3} | 2 | 2
+(7 rows)
+
+/*
+ * The queries below should not return any rows since the hashset
+ * semantics should be identical to array semantics, given the array elements
+ * are distinct and both are compared as sorted arrays.
+ */
+\echo *** Testing: hashset_add()
+*** Testing: hashset_add()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_to_sorted_array(hashset_add)
+IS DISTINCT FROM
+ array_sort_distinct(array_append);
+ arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
+------+------+-------------+--------------+------------------+------------
+(0 rows)
+
+\echo *** Testing: hashset_contains()
+*** Testing: hashset_contains()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_contains
+IS DISTINCT FROM
+ "= ANY(...)";
+ arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
+------+------+-------------+--------------+------------------+------------
+(0 rows)
+
+\echo *** Testing: hashset_union()
+*** Testing: hashset_union()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_union)
+IS DISTINCT FROM
+ array_sort_distinct(array_union);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_intersection()
+*** Testing: hashset_intersection()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_intersection)
+IS DISTINCT FROM
+ array_sort_distinct(array_intersection);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_difference()
+*** Testing: hashset_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_difference);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_symmetric_difference()
+*** Testing: hashset_symmetric_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_symmetric_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_symmetric_difference);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_eq()
+*** Testing: hashset_eq()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_eq
+IS DISTINCT FROM
+ array_eq;
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_ne()
+*** Testing: hashset_ne()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_ne
+IS DISTINCT FROM
+ array_ne;
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_cardinality()
+*** Testing: hashset_cardinality()
+SELECT * FROM hashset_test_results_3
+WHERE
+ hashset_cardinality
+IS DISTINCT FROM
+ cardinality;
+ arg1 | hashset_cardinality | cardinality
+------+---------------------+-------------
+(0 rows)
+
diff --git a/test/expected/basic.out b/test/expected/basic.out
new file mode 100644
index 0000000..79c3230
--- /dev/null
+++ b/test/expected/basic.out
@@ -0,0 +1,298 @@
+/*
+ * Hashset Type
+ */
+SELECT '{}'::int4hashset; -- empty int4hashset
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset;
+ int4hashset
+-------------
+ {3,2,1}
+(1 row)
+
+SELECT '{-2147483648,0,2147483647}'::int4hashset;
+ int4hashset
+----------------------------
+ {0,2147483647,-2147483648}
+(1 row)
+
+SELECT '{-2147483649}'::int4hashset; -- out of range
+ERROR: value "-2147483649}" is out of range for type integer
+LINE 1: SELECT '{-2147483649}'::int4hashset;
+ ^
+SELECT '{2147483648}'::int4hashset; -- out of range
+ERROR: value "2147483648}" is out of range for type integer
+LINE 1: SELECT '{2147483648}'::int4hashset;
+ ^
+/*
+ * Hashset Functions
+ */
+SELECT int4hashset();
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT int4hashset(
+ capacity := 10,
+ load_factor := 0.9,
+ growth_factor := 1.1,
+ hashfn_id := 1
+);
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT hashset_add(int4hashset(), 123);
+ hashset_add
+-------------
+ {123}
+(1 row)
+
+SELECT hashset_add('{123}'::int4hashset, 456);
+ hashset_add
+-------------
+ {456,123}
+(1 row)
+
+SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
+ hashset_contains
+------------------
+ t
+(1 row)
+
+SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
+ hashset_contains
+------------------
+ f
+(1 row)
+
+SELECT hashset_union('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+ hashset_union
+---------------
+ {3,1,2}
+(1 row)
+
+SELECT hashset_to_array('{1,2,3}'::int4hashset);
+ hashset_to_array
+------------------
+ {3,2,1}
+(1 row)
+
+SELECT hashset_cardinality('{1,2,3}'::int4hashset); -- 3
+ hashset_cardinality
+---------------------
+ 3
+(1 row)
+
+SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
+ hashset_capacity
+------------------
+ 10
+(1 row)
+
+SELECT hashset_intersection('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+ hashset_intersection
+----------------------
+ {2}
+(1 row)
+
+SELECT hashset_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+ hashset_difference
+--------------------
+ {1}
+(1 row)
+
+SELECT hashset_symmetric_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+ hashset_symmetric_difference
+------------------------------
+ {1,3}
+(1 row)
+
+/*
+ * Aggregation Functions
+ */
+SELECT hashset_agg(i) FROM generate_series(1,10) AS i;
+ hashset_agg
+------------------------
+ {6,10,1,8,2,3,4,5,9,7}
+(1 row)
+
+SELECT hashset_agg(h) FROM
+(
+ SELECT hashset_agg(i) AS h FROM generate_series(1,5) AS i
+ UNION ALL
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
+) q;
+ hashset_agg
+------------------------
+ {6,8,1,3,2,10,4,5,9,7}
+(1 row)
+
+/*
+ * Operator Definitions
+ */
+SELECT '{2}'::int4hashset = '{1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset = '{2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset = '{3}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{2,3,1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{4,5,6}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3,4}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{2,3,1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{4,5,6}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset || 4;
+ ?column?
+-----------
+ {1,3,2,4}
+(1 row)
+
+SELECT 4 || '{1,2,3}'::int4hashset;
+ ?column?
+-----------
+ {1,3,2,4}
+(1 row)
+
+/*
+ * Hashset Hash Operators
+ */
+SELECT hashset_hash('{1,2,3}'::int4hashset);
+ hashset_hash
+--------------
+ 868123687
+(1 row)
+
+SELECT hashset_hash('{3,2,1}'::int4hashset);
+ hashset_hash
+--------------
+ 868123687
+(1 row)
+
+SELECT COUNT(*), COUNT(DISTINCT h)
+FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3,2,1}'::int4hashset AS h
+) q;
+ count | count
+-------+-------
+ 2 | 1
+(1 row)
+
+/*
+ * Hashset Btree Operators
+ *
+ * Ordering of hashsets is not based on lexicographic order of elements.
+ * - If two hashsets are not equal, they retain consistent relative order.
+ * - If two hashsets are equal but have elements in different orders, their
+ * ordering is non-deterministic. This is inherent since the comparison
+ * function must return 0 for equal hashsets, giving no indication of order.
+ */
+SELECT h FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{4,5,6}'::int4hashset AS h
+ UNION ALL
+ SELECT '{7,8,9}'::int4hashset AS h
+) q
+ORDER BY h;
+ h
+---------
+ {9,7,8}
+ {3,2,1}
+ {5,6,4}
+(3 rows)
+
diff --git a/test/expected/invalid.out b/test/expected/invalid.out
new file mode 100644
index 0000000..bd44199
--- /dev/null
+++ b/test/expected/invalid.out
@@ -0,0 +1,4 @@
+SELECT '{1,2s}'::int4hashset;
+ERROR: unexpected character "s" in hashset input
+LINE 1: SELECT '{1,2s}'::int4hashset;
+ ^
diff --git a/test/expected/io_varying_lengths.out b/test/expected/io_varying_lengths.out
new file mode 100644
index 0000000..45e9fb1
--- /dev/null
+++ b/test/expected/io_varying_lengths.out
@@ -0,0 +1,100 @@
+/*
+ * This test verifies the hashset input/output functions for varying
+ * initial capacities, ensuring functionality across different sizes.
+ */
+SELECT hashset_sorted('{1}'::int4hashset);
+ hashset_sorted
+----------------
+ {1}
+(1 row)
+
+SELECT hashset_sorted('{1,2}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::int4hashset);
+ hashset_sorted
+-----------------
+ {1,2,3,4,5,6,7}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::int4hashset);
+ hashset_sorted
+-------------------
+ {1,2,3,4,5,6,7,8}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::int4hashset);
+ hashset_sorted
+---------------------
+ {1,2,3,4,5,6,7,8,9}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::int4hashset);
+ hashset_sorted
+------------------------
+ {1,2,3,4,5,6,7,8,9,10}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::int4hashset);
+ hashset_sorted
+---------------------------
+ {1,2,3,4,5,6,7,8,9,10,11}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::int4hashset);
+ hashset_sorted
+------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::int4hashset);
+ hashset_sorted
+---------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::int4hashset);
+ hashset_sorted
+------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::int4hashset);
+ hashset_sorted
+---------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::int4hashset);
+ hashset_sorted
+------------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
+(1 row)
+
diff --git a/test/expected/parsing.out b/test/expected/parsing.out
new file mode 100644
index 0000000..263797e
--- /dev/null
+++ b/test/expected/parsing.out
@@ -0,0 +1,71 @@
+/* Valid */
+SELECT '{1,23,-456}'::int4hashset;
+ int4hashset
+-------------
+ {1,-456,23}
+(1 row)
+
+SELECT ' { 1 , 23 , -456 } '::int4hashset;
+ int4hashset
+-------------
+ {1,-456,23}
+(1 row)
+
+/* Only whitespace is allowed after the closing brace */
+SELECT ' { 1 , 23 , -456 } 1'::int4hashset; -- error
+ERROR: malformed hashset literal: "1"
+LINE 2: SELECT ' { 1 , 23 , -456 } 1'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } ,'::int4hashset; -- error
+ERROR: malformed hashset literal: ","
+LINE 1: SELECT ' { 1 , 23 , -456 } ,'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } {'::int4hashset; -- error
+ERROR: malformed hashset literal: "{"
+LINE 1: SELECT ' { 1 , 23 , -456 } {'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } }'::int4hashset; -- error
+ERROR: malformed hashset literal: "}"
+LINE 1: SELECT ' { 1 , 23 , -456 } }'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } x'::int4hashset; -- error
+ERROR: malformed hashset literal: "x"
+LINE 1: SELECT ' { 1 , 23 , -456 } x'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+/* Unexpected character when expecting closing brace */
+SELECT ' { 1 , 23 , -456 1'::int4hashset; -- error
+ERROR: unexpected character "1" in hashset input
+LINE 2: SELECT ' { 1 , 23 , -456 1'::int4hashset;
+ ^
+SELECT ' { 1 , 23 , -456 {'::int4hashset; -- error
+ERROR: unexpected character "{" in hashset input
+LINE 1: SELECT ' { 1 , 23 , -456 {'::int4hashset;
+ ^
+SELECT ' { 1 , 23 , -456 x'::int4hashset; -- error
+ERROR: unexpected character "x" in hashset input
+LINE 1: SELECT ' { 1 , 23 , -456 x'::int4hashset;
+ ^
+/* Error handling for strtol */
+SELECT ' { , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for integer: ", 23 , -456 } "
+LINE 2: SELECT ' { , 23 , -456 } '::int4hashset;
+ ^
+SELECT ' { 1 , 23 , '::int4hashset; -- error
+ERROR: invalid input syntax for integer: ""
+LINE 1: SELECT ' { 1 , 23 , '::int4hashset;
+ ^
+SELECT ' { s , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for integer: "s , 23 , -456 } "
+LINE 1: SELECT ' { s , 23 , -456 } '::int4hashset;
+ ^
+/* Missing opening brace */
+SELECT ' 1 , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for hashset: "1 , 23 , -456 } "
+LINE 2: SELECT ' 1 , 23 , -456 } '::int4hashset;
+ ^
+DETAIL: Hashset representation must start with "{".
diff --git a/test/expected/prelude.out b/test/expected/prelude.out
new file mode 100644
index 0000000..f34e190
--- /dev/null
+++ b/test/expected/prelude.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION hashset;
+CREATE OR REPLACE FUNCTION hashset_sorted(int4hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/expected/random.out b/test/expected/random.out
new file mode 100644
index 0000000..9d9026b
--- /dev/null
+++ b/test/expected/random.out
@@ -0,0 +1,38 @@
+SELECT setseed(0.12345);
+ setseed
+---------
+
+(1 row)
+
+\set MAX_INT 2147483647
+CREATE TABLE hashset_random_int4_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(int4hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_int4_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_int4_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
diff --git a/test/expected/reported_bugs.out b/test/expected/reported_bugs.out
new file mode 100644
index 0000000..03cc7c3
--- /dev/null
+++ b/test/expected/reported_bugs.out
@@ -0,0 +1,138 @@
+/*
+ * Bug in hashset_add() and hashset_union() functions altering original hashset.
+ *
+ * Previously, the hashset_add() and hashset_union() functions were modifying the
+ * original hashset in-place, leading to unexpected results as the original data
+ * within the hashset was being altered.
+ *
+ * The issue was addressed by implementing a macro function named
+ * PG_GETARG_INT4HASHSET_COPY() within the C code. This function guarantees that
+ * a copy of the hashset is created and subsequently modified, thereby preserving
+ * the integrity of the original hashset.
+ *
+ * As a result of this fix, hashset_add() and hashset_union() now operate on
+ * a copied hashset, ensuring that the original data remains unaltered, and
+ * the query executes correctly.
+ */
+SELECT
+ q.hashset_agg,
+ hashset_add(hashset_agg,4)
+FROM
+(
+ SELECT
+ hashset_agg(generate_series)
+ FROM generate_series(1,3)
+) q;
+ hashset_agg | hashset_add
+-------------+-------------
+ {3,1,2} | {3,4,1,2}
+(1 row)
+
+/*
+ * Bug in hashset_hash() function with respect to element insertion order.
+ *
+ * Prior to the fix, the hashset_hash() function was accumulating the hashes
+ * of individual elements in a non-commutative manner. As a consequence, the
+ * final hash value was sensitive to the order in which elements were inserted
+ * into the hashset. This behavior led to inconsistencies, as logically
+ * equivalent sets (i.e., sets with the same elements but in different orders)
+ * produced different hash values.
+ *
+ * The bug was fixed by modifying the hashset_hash() function to use a
+ * commutative operation when combining the hashes of individual elements.
+ * This change ensures that the final hash value is independent of the
+ * element insertion order, and logically equivalent sets produce the
+ * same hash.
+ */
+SELECT hashset_hash('{1,2}'::int4hashset);
+ hashset_hash
+--------------
+ -840053840
+(1 row)
+
+SELECT hashset_hash('{2,1}'::int4hashset);
+ hashset_hash
+--------------
+ -840053840
+(1 row)
+
+SELECT hashset_cmp('{1,2}','{2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2}');
+ hashset_cmp
+-------------
+ 0
+(1 row)
+
+/*
+ * Bug in int4hashset_resize() not utilizing growth_factor.
+ *
+ * The previous implementation hard-coded a growth factor of 2, neglecting
+ * the struct's growth_factor field. This bug was addressed by properly
+ * using growth_factor for new capacity calculation, with an additional
+ * safety check to prevent possible infinite loops in resizing.
+ */
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 1.1
+), 123), 456));
+ hashset_capacity
+------------------
+ 2
+(1 row)
+
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 10
+), 123), 456));
+ hashset_capacity
+------------------
+ 10
+(1 row)
+
+/*
+ * Bug in int4hashset_capacity() not detoasting input correctly.
+ */
+SELECT hashset_capacity(int4hashset(capacity:=10)) AS capacity_10;
+ capacity_10
+-------------
+ 10
+(1 row)
+
+SELECT hashset_capacity(int4hashset(capacity:=1000)) AS capacity_1000;
+ capacity_1000
+---------------
+ 1000
+(1 row)
+
+SELECT hashset_capacity(int4hashset(capacity:=100000)) AS capacity_100000;
+ capacity_100000
+-----------------
+ 100000
+(1 row)
+
+CREATE TABLE test_capacity_10 AS SELECT int4hashset(capacity:=10) AS capacity_10;
+CREATE TABLE test_capacity_1000 AS SELECT int4hashset(capacity:=1000) AS capacity_1000;
+CREATE TABLE test_capacity_100000 AS SELECT int4hashset(capacity:=100000) AS capacity_100000;
+SELECT hashset_capacity(capacity_10) AS capacity_10 FROM test_capacity_10;
+ capacity_10
+-------------
+ 10
+(1 row)
+
+SELECT hashset_capacity(capacity_1000) AS capacity_1000 FROM test_capacity_1000;
+ capacity_1000
+---------------
+ 1000
+(1 row)
+
+SELECT hashset_capacity(capacity_100000) AS capacity_100000 FROM test_capacity_100000;
+ capacity_100000
+-----------------
+ 100000
+(1 row)
+
diff --git a/test/expected/table.out b/test/expected/table.out
new file mode 100644
index 0000000..f59494e
--- /dev/null
+++ b/test/expected/table.out
@@ -0,0 +1,25 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes int4hashset DEFAULT int4hashset(capacity := 2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+ hashset_contains
+------------------
+ t
+(1 row)
+
+SELECT hashset_cardinality(user_likes) FROM users WHERE user_id = 1;
+ hashset_cardinality
+---------------------
+ 2
+(1 row)
+
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
+ hashset_sorted
+----------------
+ {101,202}
+(1 row)
+
diff --git a/test/expected/test_send_recv.out b/test/expected/test_send_recv.out
new file mode 100644
index 0000000..12382d5
--- /dev/null
+++ b/test/expected/test_send_recv.out
@@ -0,0 +1,2 @@
+unique_count: 1
+count: 2
diff --git a/test/sql/array-and-multiset-semantics.sql b/test/sql/array-and-multiset-semantics.sql
new file mode 100644
index 0000000..0db7065
--- /dev/null
+++ b/test/sql/array-and-multiset-semantics.sql
@@ -0,0 +1,232 @@
+CREATE OR REPLACE FUNCTION array_union(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_intersection(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ EXCEPT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_symmetric_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) AS q1
+ EXCEPT
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) AS q2
+ ) AS q3
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_sort_distinct(int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ cardinality($1) = 0
+ THEN
+ '{}'::int4[]
+ ELSE
+ (
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM unnest($1)
+ )
+ END
+$$ LANGUAGE sql;
+
+DROP TABLE IF EXISTS hashset_test_results_1;
+CREATE TABLE hashset_test_results_1 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_add(arg1::int4hashset, arg2),
+ array_append(arg1::int4[], arg2),
+ hashset_contains(arg1::int4hashset, arg2),
+ arg2 = ANY(arg1::int4[]) AS "= ANY(...)"
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL::int4), (1::int4), (4::int4)) AS b(arg2);
+
+
+DROP TABLE IF EXISTS hashset_test_results_2;
+CREATE TABLE hashset_test_results_2 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_union(arg1::int4hashset, arg2::int4hashset),
+ array_union(arg1::int4[], arg2::int4[]),
+ hashset_intersection(arg1::int4hashset, arg2::int4hashset),
+ array_intersection(arg1::int4[], arg2::int4[]),
+ hashset_difference(arg1::int4hashset, arg2::int4hashset),
+ array_difference(arg1::int4[], arg2::int4[]),
+ hashset_symmetric_difference(arg1::int4hashset, arg2::int4hashset),
+ array_symmetric_difference(arg1::int4[], arg2::int4[]),
+ hashset_eq(arg1::int4hashset, arg2::int4hashset),
+ array_eq(arg1::int4[], arg2::int4[]),
+ hashset_ne(arg1::int4hashset, arg2::int4hashset),
+ array_ne(arg1::int4[], arg2::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS b(arg2);
+
+DROP TABLE IF EXISTS hashset_test_results_3;
+CREATE TABLE hashset_test_results_3 AS
+SELECT
+ arg1,
+ hashset_cardinality(arg1::int4hashset),
+ cardinality(arg1::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1);
+
+SELECT * FROM hashset_test_results_1;
+SELECT * FROM hashset_test_results_2;
+SELECT * FROM hashset_test_results_3;
+
+/*
+ * The queries below should not return any rows since the hashset
+ * semantics should be identical to array semantics, given the array elements
+ * are distinct and both are compared as sorted arrays.
+ */
+
+\echo *** Testing: hashset_add()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_to_sorted_array(hashset_add)
+IS DISTINCT FROM
+ array_sort_distinct(array_append);
+
+\echo *** Testing: hashset_contains()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_contains
+IS DISTINCT FROM
+ "= ANY(...)";
+
+\echo *** Testing: hashset_union()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_union)
+IS DISTINCT FROM
+ array_sort_distinct(array_union);
+
+\echo *** Testing: hashset_intersection()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_intersection)
+IS DISTINCT FROM
+ array_sort_distinct(array_intersection);
+
+\echo *** Testing: hashset_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_difference);
+
+\echo *** Testing: hashset_symmetric_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_symmetric_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_symmetric_difference);
+
+\echo *** Testing: hashset_eq()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_eq
+IS DISTINCT FROM
+ array_eq;
+
+\echo *** Testing: hashset_ne()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_ne
+IS DISTINCT FROM
+ array_ne;
+
+\echo *** Testing: hashset_cardinality()
+SELECT * FROM hashset_test_results_3
+WHERE
+ hashset_cardinality
+IS DISTINCT FROM
+ cardinality;
diff --git a/test/sql/basic.sql b/test/sql/basic.sql
new file mode 100644
index 0000000..2bf5893
--- /dev/null
+++ b/test/sql/basic.sql
@@ -0,0 +1,107 @@
+/*
+ * Hashset Type
+ */
+
+SELECT '{}'::int4hashset; -- empty int4hashset
+SELECT '{1,2,3}'::int4hashset;
+SELECT '{-2147483648,0,2147483647}'::int4hashset;
+SELECT '{-2147483649}'::int4hashset; -- out of range
+SELECT '{2147483648}'::int4hashset; -- out of range
+
+/*
+ * Hashset Functions
+ */
+
+SELECT int4hashset();
+SELECT int4hashset(
+ capacity := 10,
+ load_factor := 0.9,
+ growth_factor := 1.1,
+ hashfn_id := 1
+);
+SELECT hashset_add(int4hashset(), 123);
+SELECT hashset_add('{123}'::int4hashset, 456);
+SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
+SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
+SELECT hashset_union('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+SELECT hashset_to_array('{1,2,3}'::int4hashset);
+SELECT hashset_cardinality('{1,2,3}'::int4hashset); -- 3
+SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
+SELECT hashset_intersection('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+SELECT hashset_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+SELECT hashset_symmetric_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+
+/*
+ * Aggregation Functions
+ */
+
+SELECT hashset_agg(i) FROM generate_series(1,10) AS i;
+
+SELECT hashset_agg(h) FROM
+(
+ SELECT hashset_agg(i) AS h FROM generate_series(1,5) AS i
+ UNION ALL
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
+) q;
+
+/*
+ * Operator Definitions
+ */
+
+SELECT '{2}'::int4hashset = '{1}'::int4hashset; -- false
+SELECT '{2}'::int4hashset = '{2}'::int4hashset; -- true
+SELECT '{2}'::int4hashset = '{3}'::int4hashset; -- false
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset = '{2,3,1}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset = '{4,5,6}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset = '{1,2}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset = '{1,2,3,4}'::int4hashset; -- false
+
+SELECT '{2}'::int4hashset <> '{1}'::int4hashset; -- true
+SELECT '{2}'::int4hashset <> '{2}'::int4hashset; -- false
+SELECT '{2}'::int4hashset <> '{3}'::int4hashset; -- true
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset <> '{2,3,1}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset <> '{4,5,6}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset <> '{1,2}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
+
+SELECT '{1,2,3}'::int4hashset || 4;
+SELECT 4 || '{1,2,3}'::int4hashset;
+
+/*
+ * Hashset Hash Operators
+ */
+
+SELECT hashset_hash('{1,2,3}'::int4hashset);
+SELECT hashset_hash('{3,2,1}'::int4hashset);
+
+SELECT COUNT(*), COUNT(DISTINCT h)
+FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3,2,1}'::int4hashset AS h
+) q;
+
+/*
+ * Hashset Btree Operators
+ *
+ * Ordering of hashsets is not based on lexicographic order of elements.
+ * - If two hashsets are not equal, they retain consistent relative order.
+ * - If two hashsets are equal but have elements in different orders, their
+ * ordering is non-deterministic. This is inherent since the comparison
+ * function must return 0 for equal hashsets, giving no indication of order.
+ */
+
+SELECT h FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{4,5,6}'::int4hashset AS h
+ UNION ALL
+ SELECT '{7,8,9}'::int4hashset AS h
+) q
+ORDER BY h;
diff --git a/test/sql/benchmark.sql b/test/sql/benchmark.sql
new file mode 100644
index 0000000..e7a53f1
--- /dev/null
+++ b/test/sql/benchmark.sql
@@ -0,0 +1,191 @@
+DROP EXTENSION IF EXISTS hashset CASCADE;
+CREATE EXTENSION hashset;
+
+\timing on
+
+\echo * Benchmark array_agg(DISTINCT ...) vs hashset_agg()
+
+DROP TABLE IF EXISTS benchmark_input_100k;
+DROP TABLE IF EXISTS benchmark_input_10M;
+DROP TABLE IF EXISTS benchmark_array_agg;
+DROP TABLE IF EXISTS benchmark_hashset_agg;
+
+SELECT setseed(0.12345);
+
+CREATE TABLE benchmark_input_100k AS
+SELECT
+ i,
+ i/10 AS j,
+ (floor(4294967296 * random()) - 2147483648)::int AS rnd
+FROM generate_series(1,100000) AS i;
+
+CREATE TABLE benchmark_input_10M AS
+SELECT
+ i,
+ i/10 AS j,
+ (floor(4294967296 * random()) - 2147483648)::int AS rnd
+FROM generate_series(1,10000000) AS i;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 100k unique integers
+CREATE TABLE benchmark_array_agg AS
+SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
+CREATE TABLE benchmark_hashset_agg AS
+SELECT hashset_agg(i) FROM benchmark_input_100k;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 10M unique integers
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(i) FROM benchmark_input_10M;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 100k integers (10% uniqueness)
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT j) FROM benchmark_input_100k;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(j) FROM benchmark_input_100k;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 10M integers (10% uniqueness)
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT j) FROM benchmark_input_10M;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(j) FROM benchmark_input_10M;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 100k random integers
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT rnd) FROM benchmark_input_100k;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(rnd) FROM benchmark_input_100k;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 10M random integers
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(rnd) FROM benchmark_input_10M;
+
+SELECT cardinality(array_agg) FROM benchmark_array_agg ORDER BY 1;
+
+SELECT
+ hashset_cardinality(hashset_agg),
+ hashset_capacity(hashset_agg),
+ hashset_collisions(hashset_agg),
+ hashset_max_collisions(hashset_agg)
+FROM benchmark_hashset_agg;
+
+SELECT hashset_capacity(hashset_agg(rnd)) FROM benchmark_input_10M;
+
+\echo * Benchmark different hash functions
+
+\echo *** Elements in sequence 1..100000
+
+\echo - Testing default hash function (Jenkins/lookup3)
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 1);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing Murmurhash32
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 2);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing naive hash function
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 3);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo *** Testing 100000 random ints
+
+SELECT setseed(0.12345);
+\echo - Testing default hash function (Jenkins/lookup3)
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 1);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+SELECT setseed(0.12345);
+\echo - Testing Murmurhash32
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 2);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+SELECT setseed(0.12345);
+\echo - Testing naive hash function
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 3);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
diff --git a/test/sql/invalid.sql b/test/sql/invalid.sql
new file mode 100644
index 0000000..43689ab
--- /dev/null
+++ b/test/sql/invalid.sql
@@ -0,0 +1 @@
+SELECT '{1,2s}'::int4hashset;
diff --git a/test/sql/io_varying_lengths.sql b/test/sql/io_varying_lengths.sql
new file mode 100644
index 0000000..8acb6b8
--- /dev/null
+++ b/test/sql/io_varying_lengths.sql
@@ -0,0 +1,21 @@
+/*
+ * This test verifies the hashset input/output functions for varying
+ * initial capacities, ensuring functionality across different sizes.
+ */
+
+SELECT hashset_sorted('{1}'::int4hashset);
+SELECT hashset_sorted('{1,2}'::int4hashset);
+SELECT hashset_sorted('{1,2,3}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::int4hashset);
diff --git a/test/sql/parsing.sql b/test/sql/parsing.sql
new file mode 100644
index 0000000..1e56bbe
--- /dev/null
+++ b/test/sql/parsing.sql
@@ -0,0 +1,23 @@
+/* Valid */
+SELECT '{1,23,-456}'::int4hashset;
+SELECT ' { 1 , 23 , -456 } '::int4hashset;
+
+/* Only whitespace is allowed after the closing brace */
+SELECT ' { 1 , 23 , -456 } 1'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } ,'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } {'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } }'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } x'::int4hashset; -- error
+
+/* Unexpected character when expecting closing brace */
+SELECT ' { 1 , 23 , -456 1'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 {'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 x'::int4hashset; -- error
+
+/* Error handling for strtol */
+SELECT ' { , 23 , -456 } '::int4hashset; -- error
+SELECT ' { 1 , 23 , '::int4hashset; -- error
+SELECT ' { s , 23 , -456 } '::int4hashset; -- error
+
+/* Missing opening brace */
+SELECT ' 1 , 23 , -456 } '::int4hashset; -- error
diff --git a/test/sql/prelude.sql b/test/sql/prelude.sql
new file mode 100644
index 0000000..2fee0fc
--- /dev/null
+++ b/test/sql/prelude.sql
@@ -0,0 +1,8 @@
+CREATE EXTENSION hashset;
+
+CREATE OR REPLACE FUNCTION hashset_sorted(int4hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/sql/random.sql b/test/sql/random.sql
new file mode 100644
index 0000000..7cc8f87
--- /dev/null
+++ b/test/sql/random.sql
@@ -0,0 +1,27 @@
+SELECT setseed(0.12345);
+
+\set MAX_INT 2147483647
+
+CREATE TABLE hashset_random_int4_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(int4hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_int4_numbers
+) q;
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_int4_numbers
+) q;
diff --git a/test/sql/reported_bugs.sql b/test/sql/reported_bugs.sql
new file mode 100644
index 0000000..9e6b617
--- /dev/null
+++ b/test/sql/reported_bugs.sql
@@ -0,0 +1,85 @@
+/*
+ * Bug in hashset_add() and hashset_union() functions altering original hashset.
+ *
+ * Previously, the hashset_add() and hashset_union() functions were modifying the
+ * original hashset in-place, leading to unexpected results as the original data
+ * within the hashset was being altered.
+ *
+ * The issue was addressed by implementing a macro function named
+ * PG_GETARG_INT4HASHSET_COPY() within the C code. This function guarantees that
+ * a copy of the hashset is created and subsequently modified, thereby preserving
+ * the integrity of the original hashset.
+ *
+ * As a result of this fix, hashset_add() and hashset_union() now operate on
+ * a copied hashset, ensuring that the original data remains unaltered, and
+ * the query executes correctly.
+ */
+SELECT
+ q.hashset_agg,
+ hashset_add(hashset_agg,4)
+FROM
+(
+ SELECT
+ hashset_agg(generate_series)
+ FROM generate_series(1,3)
+) q;
+
+/*
+ * Bug in hashset_hash() function with respect to element insertion order.
+ *
+ * Prior to the fix, the hashset_hash() function was accumulating the hashes
+ * of individual elements in a non-commutative manner. As a consequence, the
+ * final hash value was sensitive to the order in which elements were inserted
+ * into the hashset. This behavior led to inconsistencies, as logically
+ * equivalent sets (i.e., sets with the same elements but in different orders)
+ * produced different hash values.
+ *
+ * The bug was fixed by modifying the hashset_hash() function to use a
+ * commutative operation when combining the hashes of individual elements.
+ * This change ensures that the final hash value is independent of the
+ * element insertion order, and logically equivalent sets produce the
+ * same hash.
+ */
+SELECT hashset_hash('{1,2}'::int4hashset);
+SELECT hashset_hash('{2,1}'::int4hashset);
+
+SELECT hashset_cmp('{1,2}','{2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2}');
+
+/*
+ * Bug in int4hashset_resize() not utilizing growth_factor.
+ *
+ * The previous implementation hard-coded a growth factor of 2, neglecting
+ * the struct's growth_factor field. This bug was addressed by properly
+ * using growth_factor for new capacity calculation, with an additional
+ * safety check to prevent possible infinite loops in resizing.
+ */
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 1.1
+), 123), 456));
+
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 10
+), 123), 456));
+
+/*
+ * Bug in int4hashset_capacity() not detoasting input correctly.
+ */
+SELECT hashset_capacity(int4hashset(capacity:=10)) AS capacity_10;
+SELECT hashset_capacity(int4hashset(capacity:=1000)) AS capacity_1000;
+SELECT hashset_capacity(int4hashset(capacity:=100000)) AS capacity_100000;
+
+CREATE TABLE test_capacity_10 AS SELECT int4hashset(capacity:=10) AS capacity_10;
+CREATE TABLE test_capacity_1000 AS SELECT int4hashset(capacity:=1000) AS capacity_1000;
+CREATE TABLE test_capacity_100000 AS SELECT int4hashset(capacity:=100000) AS capacity_100000;
+
+SELECT hashset_capacity(capacity_10) AS capacity_10 FROM test_capacity_10;
+SELECT hashset_capacity(capacity_1000) AS capacity_1000 FROM test_capacity_1000;
+SELECT hashset_capacity(capacity_100000) AS capacity_100000 FROM test_capacity_100000;
diff --git a/test/sql/table.sql b/test/sql/table.sql
new file mode 100644
index 0000000..bf05ffa
--- /dev/null
+++ b/test/sql/table.sql
@@ -0,0 +1,10 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes int4hashset DEFAULT int4hashset(capacity := 2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+SELECT hashset_cardinality(user_likes) FROM users WHERE user_id = 1;
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
hashset-0.0.1-b7e5614-incremental.patchapplication/octet-stream; name="=?UTF-8?Q?hashset-0.0.1-b7e5614-incremental.patch?="Download
commit b7e56144366a0b97ddd0a77c06aac9e6051b5f5a
Author: Joel Jakobsson <joel@compiler.org>
Date: Tue Jun 27 09:39:15 2023 +0200
Align null semantics with SQL:2023 array and multiset standards
* Introduced a new boolean field, null_element, in the int4hashset_t type.
* Rename hashset_count() to hashset_cardinality().
* Rename hashset_merge() to hashset_union().
* Rename hashset_equals() to hashset_eq().
* Rename hashset_neq() to hashset_ne().
* Add hashset_to_sorted_array().
* Handle null semantics to work as in arrays and multisets.
* Update int4hashset_add() to allow creating a new set if none exists.
* Use more portable int32 typedef instead of int32_t.
This also adds a thorough test suite in array-and-multiset-semantics.sql,
which aims to test all relevant combinations of operations and values.
diff --git a/Makefile b/Makefile
index cfb8362..ee62511 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ SERVER_INCLUDES=-I$(shell pg_config --includedir-server)
CLIENT_INCLUDES=-I$(shell pg_config --includedir)
LIBRARY_PATH = -L$(shell pg_config --libdir)
-REGRESS = prelude basic io_varying_lengths random table invalid parsing reported_bugs strict
+REGRESS = prelude basic io_varying_lengths random table invalid parsing reported_bugs array-and-multiset-semantics
REGRESS_OPTS = --inputdir=test
PG_CONFIG = pg_config
diff --git a/README.md b/README.md
index 9c26cf6..91af6ee 100644
--- a/README.md
+++ b/README.md
@@ -49,7 +49,7 @@ UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1; -- true
-- Count the number of likes a user has
-SELECT hashset_count(user_likes) FROM users WHERE user_id = 1; -- 2
+SELECT hashset_cardinality(user_likes) FROM users WHERE user_id = 1; -- 2
```
You can also use the aggregate functions to perform operations on multiple rows.
@@ -75,9 +75,9 @@ a variable-length type.
- 3=Naive hash function
- `hashset_add(int4hashset, int) -> int4hashset`: Adds an integer to an int4hashset.
- `hashset_contains(int4hashset, int) -> boolean`: Checks if an int4hashset contains a given integer.
-- `hashset_merge(int4hashset, int4hashset) -> int4hashset`: Merges two int4hashsets into a new int4hashset.
+- `hashset_union(int4hashset, int4hashset) -> int4hashset`: Merges two int4hashsets into a new int4hashset.
- `hashset_to_array(int4hashset) -> int[]`: Converts an int4hashset to an array of integers.
-- `hashset_count(int4hashset) -> bigint`: Returns the number of elements in an int4hashset.
+- `hashset_cardinality(int4hashset) -> bigint`: Returns the number of elements in an int4hashset.
- `hashset_capacity(int4hashset) -> bigint`: Returns the current capacity of an int4hashset.
- `hashset_max_collisions(int4hashset) -> bigint`: Returns the maximum number of collisions that have occurred for a single element
- `hashset_intersection(int4hashset, int4hashset) -> int4hashset`: Returns a new int4hashset that is the intersection of the two input sets.
diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
index d48260f..d0478ce 100644
--- a/hashset--0.0.1.sql
+++ b/hashset--0.0.1.sql
@@ -50,16 +50,16 @@ LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION hashset_add(int4hashset, int)
RETURNS int4hashset
AS 'hashset', 'int4hashset_add'
-LANGUAGE C IMMUTABLE STRICT;
+LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int)
-RETURNS bool
+RETURNS boolean
AS 'hashset', 'int4hashset_contains'
-LANGUAGE C IMMUTABLE STRICT;
+LANGUAGE C IMMUTABLE;
-CREATE OR REPLACE FUNCTION hashset_merge(int4hashset, int4hashset)
+CREATE OR REPLACE FUNCTION hashset_union(int4hashset, int4hashset)
RETURNS int4hashset
-AS 'hashset', 'int4hashset_merge'
+AS 'hashset', 'int4hashset_union'
LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_to_array(int4hashset)
@@ -67,9 +67,14 @@ RETURNS int[]
AS 'hashset', 'int4hashset_to_array'
LANGUAGE C IMMUTABLE STRICT;
-CREATE OR REPLACE FUNCTION hashset_count(int4hashset)
+CREATE OR REPLACE FUNCTION hashset_to_sorted_array(int4hashset)
+RETURNS int[]
+AS 'hashset', 'int4hashset_to_sorted_array'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_cardinality(int4hashset)
RETURNS bigint
-AS 'hashset', 'int4hashset_count'
+AS 'hashset', 'int4hashset_cardinality'
LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_capacity(int4hashset)
@@ -162,28 +167,28 @@ CREATE AGGREGATE hashset_agg(int4hashset) (
* Operator Definitions
*/
-CREATE OR REPLACE FUNCTION hashset_equals(int4hashset, int4hashset)
-RETURNS bool
-AS 'hashset', 'int4hashset_equals'
+CREATE OR REPLACE FUNCTION hashset_eq(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_eq'
LANGUAGE C IMMUTABLE STRICT;
CREATE OPERATOR = (
LEFTARG = int4hashset,
RIGHTARG = int4hashset,
- PROCEDURE = hashset_equals,
+ PROCEDURE = hashset_eq,
COMMUTATOR = =,
HASHES
);
-CREATE OR REPLACE FUNCTION hashset_neq(int4hashset, int4hashset)
-RETURNS bool
-AS 'hashset', 'int4hashset_neq'
+CREATE OR REPLACE FUNCTION hashset_ne(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_ne'
LANGUAGE C IMMUTABLE STRICT;
CREATE OPERATOR <> (
LEFTARG = int4hashset,
RIGHTARG = int4hashset,
- PROCEDURE = hashset_neq,
+ PROCEDURE = hashset_ne,
COMMUTATOR = '<>',
NEGATOR = '=',
RESTRICT = neqsel,
@@ -224,7 +229,7 @@ FUNCTION 1 hashset_hash(int4hashset);
*/
CREATE OR REPLACE FUNCTION hashset_lt(int4hashset, int4hashset)
-RETURNS bool
+RETURNS boolean
AS 'hashset', 'int4hashset_lt'
LANGUAGE C IMMUTABLE STRICT;
diff --git a/hashset-api.c b/hashset-api.c
index e00ef0c..a4beef4 100644
--- a/hashset-api.c
+++ b/hashset-api.c
@@ -18,8 +18,8 @@ PG_FUNCTION_INFO_V1(int4hashset_send);
PG_FUNCTION_INFO_V1(int4hashset_recv);
PG_FUNCTION_INFO_V1(int4hashset_add);
PG_FUNCTION_INFO_V1(int4hashset_contains);
-PG_FUNCTION_INFO_V1(int4hashset_count);
-PG_FUNCTION_INFO_V1(int4hashset_merge);
+PG_FUNCTION_INFO_V1(int4hashset_cardinality);
+PG_FUNCTION_INFO_V1(int4hashset_union);
PG_FUNCTION_INFO_V1(int4hashset_init);
PG_FUNCTION_INFO_V1(int4hashset_capacity);
PG_FUNCTION_INFO_V1(int4hashset_collisions);
@@ -29,8 +29,9 @@ PG_FUNCTION_INFO_V1(int4hashset_agg_add_set);
PG_FUNCTION_INFO_V1(int4hashset_agg_final);
PG_FUNCTION_INFO_V1(int4hashset_agg_combine);
PG_FUNCTION_INFO_V1(int4hashset_to_array);
-PG_FUNCTION_INFO_V1(int4hashset_equals);
-PG_FUNCTION_INFO_V1(int4hashset_neq);
+PG_FUNCTION_INFO_V1(int4hashset_to_sorted_array);
+PG_FUNCTION_INFO_V1(int4hashset_eq);
+PG_FUNCTION_INFO_V1(int4hashset_ne);
PG_FUNCTION_INFO_V1(int4hashset_hash);
PG_FUNCTION_INFO_V1(int4hashset_lt);
PG_FUNCTION_INFO_V1(int4hashset_le);
@@ -47,8 +48,8 @@ Datum int4hashset_send(PG_FUNCTION_ARGS);
Datum int4hashset_recv(PG_FUNCTION_ARGS);
Datum int4hashset_add(PG_FUNCTION_ARGS);
Datum int4hashset_contains(PG_FUNCTION_ARGS);
-Datum int4hashset_count(PG_FUNCTION_ARGS);
-Datum int4hashset_merge(PG_FUNCTION_ARGS);
+Datum int4hashset_cardinality(PG_FUNCTION_ARGS);
+Datum int4hashset_union(PG_FUNCTION_ARGS);
Datum int4hashset_init(PG_FUNCTION_ARGS);
Datum int4hashset_capacity(PG_FUNCTION_ARGS);
Datum int4hashset_collisions(PG_FUNCTION_ARGS);
@@ -58,8 +59,9 @@ Datum int4hashset_agg_add_set(PG_FUNCTION_ARGS);
Datum int4hashset_agg_final(PG_FUNCTION_ARGS);
Datum int4hashset_agg_combine(PG_FUNCTION_ARGS);
Datum int4hashset_to_array(PG_FUNCTION_ARGS);
-Datum int4hashset_equals(PG_FUNCTION_ARGS);
-Datum int4hashset_neq(PG_FUNCTION_ARGS);
+Datum int4hashset_to_sorted_array(PG_FUNCTION_ARGS);
+Datum int4hashset_eq(PG_FUNCTION_ARGS);
+Datum int4hashset_ne(PG_FUNCTION_ARGS);
Datum int4hashset_hash(PG_FUNCTION_ARGS);
Datum int4hashset_lt(PG_FUNCTION_ARGS);
Datum int4hashset_le(PG_FUNCTION_ARGS);
@@ -114,33 +116,42 @@ int4hashset_in(PG_FUNCTION_ARGS)
break;
}
- /* Parse the number */
- value = strtol(str, &endptr, 10);
-
- if (errno == ERANGE || value < PG_INT32_MIN || value > PG_INT32_MAX)
+ /* Check if "null" is encountered (case-insensitive) */
+ if (strncasecmp(str, "null", 4) == 0)
{
- ereport(ERROR,
- (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
- errmsg("value \"%s\" is out of range for type %s", str,
- "integer")));
+ set->null_element = true;
+ str = str + 4; /* Move past "null" */
}
-
- /* Add the value to the hashset, resize if needed */
- if (set->nelements >= set->capacity)
+ else
{
- set = int4hashset_resize(set);
- }
- set = int4hashset_add_element(set, (int32)value);
+ /* Parse the number */
+ value = strtol(str, &endptr, 10);
- /* Error handling for strtol */
- if (endptr == str)
- {
- ereport(ERROR,
- (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
- errmsg("invalid input syntax for integer: \"%s\"", str)));
- }
+ if (errno == ERANGE || value < PG_INT32_MIN || value > PG_INT32_MAX)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value \"%s\" is out of range for type %s", str,
+ "integer")));
+ }
+
+ /* Add the value to the hashset, resize if needed */
+ if (set->nelements >= set->capacity)
+ {
+ set = int4hashset_resize(set);
+ }
+ set = int4hashset_add_element(set, (int32)value);
- str = endptr; /* Move to next potential number or closing brace */
+ /* Error handling for strtol */
+ if (endptr == str)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for integer: \"%s\"", str)));
+ }
+
+ str = endptr; /* Move to next number, "null" or closing brace */
+ }
/* Skip spaces before the next number or closing brace */
while (hashset_isspace(*str)) str++;
@@ -209,6 +220,14 @@ int4hashset_out(PG_FUNCTION_ARGS)
}
}
+ /* Check if the null_element field is set */
+ if (set->null_element)
+ {
+ if (str.len > 1)
+ appendStringInfoChar(&str, ',');
+ appendStringInfoString(&str, "NULL");
+ }
+
/* Append the closing brace for the output hashset string */
appendStringInfoChar(&str, '}');
@@ -240,6 +259,7 @@ int4hashset_send(PG_FUNCTION_ARGS)
pq_sendint32(&buf, set->ncollisions);
pq_sendint32(&buf, set->max_collisions);
pq_sendint32(&buf, set->hash);
+ pq_sendbyte(&buf, set->null_element ? 1 : 0);
/* Compute and send the size of the data field */
data_size = VARSIZE(set) - offsetof(int4hashset_t, data);
@@ -266,6 +286,7 @@ int4hashset_recv(PG_FUNCTION_ARGS)
int32 ncollisions;
int32 max_collisions;
int32 hash;
+ bool null_element;
version = pq_getmsgint(buf, 1);
if (version != 1)
@@ -281,6 +302,7 @@ int4hashset_recv(PG_FUNCTION_ARGS)
ncollisions = pq_getmsgint(buf, 4);
max_collisions = pq_getmsgint(buf, 4);
hash = pq_getmsgint(buf, 4);
+ null_element = pq_getmsgbyte(buf) == 1;
/* Compute the size of the data field */
data_size = buf->len - buf->cursor;
@@ -310,6 +332,7 @@ int4hashset_recv(PG_FUNCTION_ARGS)
set->ncollisions = ncollisions;
set->max_collisions = max_collisions;
set->hash = hash;
+ set->null_element = null_element;
memcpy(set->data, binary_data, data_size);
PG_RETURN_POINTER(set);
@@ -318,10 +341,31 @@ int4hashset_recv(PG_FUNCTION_ARGS)
Datum
int4hashset_add(PG_FUNCTION_ARGS)
{
- int4hashset_t *set = int4hashset_add_element(
- PG_GETARG_INT4HASHSET_COPY(0),
- PG_GETARG_INT32(1)
- );
+ int4hashset_t *set;
+ /* If there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ set = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ }
+ else
+ {
+ set = PG_GETARG_INT4HASHSET_COPY(0);
+ }
+
+ if (PG_ARGISNULL(1))
+ {
+ set->null_element = true;
+ }
+ else
+ {
+ int32 element = PG_GETARG_INT32(1);
+ set = int4hashset_add_element(set, element);
+ }
PG_RETURN_POINTER(set);
}
@@ -329,28 +373,47 @@ int4hashset_add(PG_FUNCTION_ARGS)
Datum
int4hashset_contains(PG_FUNCTION_ARGS)
{
- int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
- int32 value = PG_GETARG_INT32(1);
+ int4hashset_t *set;
+ int32 value;
+ bool result;
- PG_RETURN_BOOL(int4hashset_contains_element(set, value));
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ if (set->nelements == 0 && !set->null_element)
+ PG_RETURN_BOOL(false);
+
+ if (PG_ARGISNULL(1))
+ PG_RETURN_NULL();
+
+ value = PG_GETARG_INT32(1);
+ result = int4hashset_contains_element(set, value);
+
+ if (!result && set->null_element)
+ PG_RETURN_NULL();
+
+ PG_RETURN_BOOL(result);
}
Datum
-int4hashset_count(PG_FUNCTION_ARGS)
+int4hashset_cardinality(PG_FUNCTION_ARGS)
{
int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ int64 cardinality = set->nelements + (set->null_element ? 1 : 0);
- PG_RETURN_INT64(set->nelements);
+ PG_RETURN_INT64(cardinality);
}
Datum
-int4hashset_merge(PG_FUNCTION_ARGS)
+int4hashset_union(PG_FUNCTION_ARGS)
{
int i;
int4hashset_t *seta = PG_GETARG_INT4HASHSET_COPY(0);
int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
char *bitmap = setb->data;
- int32_t *values = (int32 *) (bitmap + CEIL_DIV(setb->capacity, 8));
+ int32 *values = (int32 *) (bitmap + CEIL_DIV(setb->capacity, 8));
for (i = 0; i < setb->capacity; i++)
{
@@ -361,6 +424,9 @@ int4hashset_merge(PG_FUNCTION_ARGS)
seta = int4hashset_add_element(seta, values[i]);
}
+ if (!seta->null_element && setb->null_element)
+ seta->null_element = true;
+
PG_RETURN_POINTER(seta);
}
@@ -629,6 +695,12 @@ int4hashset_to_array(PG_FUNCTION_ARGS)
set = PG_GETARG_INT4HASHSET(0);
+ /* if hashset is empty and does not contain null, return an empty array */
+ if(set->nelements == 0 && !set->null_element) {
+ Datum d = PointerGetDatum(construct_empty_array(INT4OID));
+ PG_RETURN_ARRAYTYPE_P(d);
+ }
+
sbitmap = set->data;
svalues = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
@@ -648,11 +720,35 @@ int4hashset_to_array(PG_FUNCTION_ARGS)
Assert(idx == nvalues);
- return int32_to_array(fcinfo, values, nvalues);
+ return int32_to_array(fcinfo, values, nvalues, set->null_element);
}
Datum
-int4hashset_equals(PG_FUNCTION_ARGS)
+int4hashset_to_sorted_array(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ int32 *values;
+ int nvalues;
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ /* if hashset is empty and does not contain null, return an empty array */
+ if(set->nelements == 0 && !set->null_element) {
+ Datum d = PointerGetDatum(construct_empty_array(INT4OID));
+ PG_RETURN_ARRAYTYPE_P(d);
+ }
+
+ /* extract the sorted elements from the hashset */
+ values = int4hashset_extract_sorted_elements(set);
+
+ /* number of values to store in the array */
+ nvalues = set->nelements;
+
+ return int32_to_array(fcinfo, values, nvalues, set->null_element);
+}
+
+Datum
+int4hashset_eq(PG_FUNCTION_ARGS)
{
int i;
int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
@@ -686,6 +782,9 @@ int4hashset_equals(PG_FUNCTION_ARGS)
}
}
+ if (a->null_element != b->null_element)
+ PG_RETURN_BOOL(false);
+
/*
* All elements in a are in b and the number of elements is the same,
* so the sets must be equal.
@@ -695,13 +794,13 @@ int4hashset_equals(PG_FUNCTION_ARGS)
Datum
-int4hashset_neq(PG_FUNCTION_ARGS)
+int4hashset_ne(PG_FUNCTION_ARGS)
{
int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
/* If a is not equal to b, then they are not equal */
- if (!DatumGetBool(DirectFunctionCall2(int4hashset_equals, PointerGetDatum(a), PointerGetDatum(b))))
+ if (!DatumGetBool(DirectFunctionCall2(int4hashset_eq, PointerGetDatum(a), PointerGetDatum(b))))
PG_RETURN_BOOL(true);
PG_RETURN_BOOL(false);
@@ -842,7 +941,7 @@ int4hashset_intersection(PG_FUNCTION_ARGS)
int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
char *bitmap = setb->data;
- int32_t *values = (int32_t *)(bitmap + CEIL_DIV(setb->capacity, 8));
+ int32 *values = (int32 *)(bitmap + CEIL_DIV(setb->capacity, 8));
int4hashset_t *intersection;
intersection = int4hashset_allocate(
@@ -864,6 +963,9 @@ int4hashset_intersection(PG_FUNCTION_ARGS)
}
}
+ if (seta->null_element && setb->null_element)
+ intersection->null_element = true;
+
PG_RETURN_POINTER(intersection);
}
@@ -875,7 +977,7 @@ int4hashset_difference(PG_FUNCTION_ARGS)
int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
int4hashset_t *difference;
char *bitmap = seta->data;
- int32_t *values = (int32_t *)(bitmap + CEIL_DIV(seta->capacity, 8));
+ int32 *values = (int32 *)(bitmap + CEIL_DIV(seta->capacity, 8));
difference = int4hashset_allocate(
seta->capacity,
@@ -896,6 +998,9 @@ int4hashset_difference(PG_FUNCTION_ARGS)
}
}
+ if (seta->null_element && !setb->null_element)
+ difference->null_element = true;
+
PG_RETURN_POINTER(difference);
}
@@ -908,8 +1013,8 @@ int4hashset_symmetric_difference(PG_FUNCTION_ARGS)
int4hashset_t *result;
char *bitmapa = seta->data;
char *bitmapb = setb->data;
- int32_t *valuesa = (int32 *) (bitmapa + CEIL_DIV(seta->capacity, 8));
- int32_t *valuesb = (int32 *) (bitmapb + CEIL_DIV(setb->capacity, 8));
+ int32 *valuesa = (int32 *) (bitmapa + CEIL_DIV(seta->capacity, 8));
+ int32 *valuesb = (int32 *) (bitmapb + CEIL_DIV(setb->capacity, 8));
result = int4hashset_allocate(
seta->nelements + setb->nelements,
@@ -946,5 +1051,8 @@ int4hashset_symmetric_difference(PG_FUNCTION_ARGS)
}
}
+ if (seta->null_element ^ setb->null_element)
+ result->null_element = true;
+
PG_RETURN_POINTER(result);
}
diff --git a/hashset.c b/hashset.c
index 67cbdf3..91907ab 100644
--- a/hashset.c
+++ b/hashset.c
@@ -46,6 +46,7 @@ int4hashset_allocate(
set->ncollisions = 0;
set->max_collisions = 0;
set->hash = 0; /* Initial hash value */
+ set->null_element = false; /* No null element initially */
set->flags |= 0;
@@ -298,7 +299,7 @@ hashset_isspace(char ch)
* Construct an SQL array from a simple C double array
*/
Datum
-int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len)
+int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len, bool null_element)
{
ArrayBuildState *astate = NULL;
int i;
@@ -313,6 +314,15 @@ int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len)
CurrentMemoryContext);
}
+ if (null_element)
+ {
+ astate = accumArrayResult(astate,
+ (Datum) 0,
+ true,
+ INT4OID,
+ CurrentMemoryContext);
+ }
+
PG_RETURN_DATUM(makeArrayResult(astate,
CurrentMemoryContext));
}
diff --git a/hashset.h b/hashset.h
index 3f22133..86f5d1b 100644
--- a/hashset.h
+++ b/hashset.h
@@ -28,16 +28,17 @@
#define DEFAULT_HASHFN_ID JENKINS_LOOKUP3_HASHFN_ID
typedef struct int4hashset_t {
- int32 vl_len_; /* varlena header (do not touch directly!) */
- int32 flags; /* reserved for future use (versioning, ...) */
- int32 capacity; /* max number of element we have space for */
- int32 nelements; /* number of items added to the hashset */
+ int32 vl_len_; /* Varlena header (do not touch directly!) */
+ int32 flags; /* Reserved for future use (versioning, ...) */
+ int32 capacity; /* Max number of element we have space for */
+ int32 nelements; /* Number of items added to the hashset */
int32 hashfn_id; /* ID of the hash function used */
float4 load_factor; /* Load factor before triggering resize */
float4 growth_factor; /* Growth factor when resizing the hashset */
int32 ncollisions; /* Number of collisions */
int32 max_collisions; /* Maximum collisions for a single element */
int32 hash; /* Stored hash value of the hashset */
+ bool null_element; /* Indicates if null is present in hashset */
char data[FLEXIBLE_ARRAY_MEMBER];
} int4hashset_t;
@@ -48,6 +49,6 @@ bool int4hashset_contains_element(int4hashset_t *set, int32 value);
int32 *int4hashset_extract_sorted_elements(int4hashset_t *set);
int4hashset_t *int4hashset_copy(int4hashset_t *src);
bool hashset_isspace(char ch);
-Datum int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len);
+Datum int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len, bool null_element);
#endif /* HASHSET_H */
diff --git a/test/expected/array-and-multiset-semantics.out b/test/expected/array-and-multiset-semantics.out
new file mode 100644
index 0000000..8f989a1
--- /dev/null
+++ b/test/expected/array-and-multiset-semantics.out
@@ -0,0 +1,365 @@
+CREATE OR REPLACE FUNCTION array_union(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_intersection(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ EXCEPT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_symmetric_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) AS q1
+ EXCEPT
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) AS q2
+ ) AS q3
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_sort_distinct(int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ cardinality($1) = 0
+ THEN
+ '{}'::int4[]
+ ELSE
+ (
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM unnest($1)
+ )
+ END
+$$ LANGUAGE sql;
+DROP TABLE IF EXISTS hashset_test_results_1;
+NOTICE: table "hashset_test_results_1" does not exist, skipping
+CREATE TABLE hashset_test_results_1 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_add(arg1::int4hashset, arg2),
+ array_append(arg1::int4[], arg2),
+ hashset_contains(arg1::int4hashset, arg2),
+ arg2 = ANY(arg1::int4[]) AS "= ANY(...)"
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL::int4), (1::int4), (4::int4)) AS b(arg2);
+DROP TABLE IF EXISTS hashset_test_results_2;
+NOTICE: table "hashset_test_results_2" does not exist, skipping
+CREATE TABLE hashset_test_results_2 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_union(arg1::int4hashset, arg2::int4hashset),
+ array_union(arg1::int4[], arg2::int4[]),
+ hashset_intersection(arg1::int4hashset, arg2::int4hashset),
+ array_intersection(arg1::int4[], arg2::int4[]),
+ hashset_difference(arg1::int4hashset, arg2::int4hashset),
+ array_difference(arg1::int4[], arg2::int4[]),
+ hashset_symmetric_difference(arg1::int4hashset, arg2::int4hashset),
+ array_symmetric_difference(arg1::int4[], arg2::int4[]),
+ hashset_eq(arg1::int4hashset, arg2::int4hashset),
+ array_eq(arg1::int4[], arg2::int4[]),
+ hashset_ne(arg1::int4hashset, arg2::int4hashset),
+ array_ne(arg1::int4[], arg2::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS b(arg2);
+DROP TABLE IF EXISTS hashset_test_results_3;
+NOTICE: table "hashset_test_results_3" does not exist, skipping
+CREATE TABLE hashset_test_results_3 AS
+SELECT
+ arg1,
+ hashset_cardinality(arg1::int4hashset),
+ cardinality(arg1::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1);
+SELECT * FROM hashset_test_results_1;
+ arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
+--------+------+-------------+--------------+------------------+------------
+ | | {NULL} | {NULL} | |
+ | 1 | {1} | {1} | |
+ | 4 | {4} | {4} | |
+ {} | | {NULL} | {NULL} | f | f
+ {} | 1 | {1} | {1} | f | f
+ {} | 4 | {4} | {4} | f | f
+ {NULL} | | {NULL} | {NULL,NULL} | |
+ {NULL} | 1 | {1,NULL} | {NULL,1} | |
+ {NULL} | 4 | {4,NULL} | {NULL,4} | |
+ {1} | | {1,NULL} | {1,NULL} | |
+ {1} | 1 | {1} | {1,1} | t | t
+ {1} | 4 | {1,4} | {1,4} | f | f
+ {2} | | {2,NULL} | {2,NULL} | |
+ {2} | 1 | {2,1} | {2,1} | f | f
+ {2} | 4 | {2,4} | {2,4} | f | f
+ {1,2} | | {1,2,NULL} | {1,2,NULL} | |
+ {1,2} | 1 | {1,2} | {1,2,1} | t | t
+ {1,2} | 4 | {4,1,2} | {1,2,4} | f | f
+ {2,3} | | {2,3,NULL} | {2,3,NULL} | |
+ {2,3} | 1 | {1,2,3} | {2,3,1} | f | f
+ {2,3} | 4 | {4,2,3} | {2,3,4} | f | f
+(21 rows)
+
+SELECT * FROM hashset_test_results_2;
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+----------+----------+---------------+--------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+ | | | | | | | | | | | | |
+ | {} | | | | | | | | | | | |
+ | {NULL} | | | | | | | | | | | |
+ | {1} | | | | | | | | | | | |
+ | {1,NULL} | | | | | | | | | | | |
+ | {2} | | | | | | | | | | | |
+ | {1,2} | | | | | | | | | | | |
+ | {2,3} | | | | | | | | | | | |
+ {} | | | | | | | | | | | | |
+ {} | {} | {} | {} | {} | {} | {} | {} | {} | {} | t | t | f | f
+ {} | {NULL} | {NULL} | {NULL} | {} | {} | {} | {} | {NULL} | {NULL} | f | f | t | t
+ {} | {1} | {1} | {1} | {} | {} | {} | {} | {1} | {1} | f | f | t | t
+ {} | {1,NULL} | {1,NULL} | {1,NULL} | {} | {} | {} | {} | {1,NULL} | {1,NULL} | f | f | t | t
+ {} | {2} | {2} | {2} | {} | {} | {} | {} | {2} | {2} | f | f | t | t
+ {} | {1,2} | {1,2} | {1,2} | {} | {} | {} | {} | {1,2} | {1,2} | f | f | t | t
+ {} | {2,3} | {2,3} | {2,3} | {} | {} | {} | {} | {2,3} | {2,3} | f | f | t | t
+ {NULL} | | | | | | | | | | | | |
+ {NULL} | {} | {NULL} | {NULL} | {} | {} | {NULL} | {NULL} | {NULL} | {NULL} | f | f | t | t
+ {NULL} | {NULL} | {NULL} | {NULL} | {NULL} | {NULL} | {} | {} | {} | {} | t | t | f | f
+ {NULL} | {1} | {1,NULL} | {1,NULL} | {} | {} | {NULL} | {NULL} | {1,NULL} | {1,NULL} | f | f | t | t
+ {NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {NULL} | {NULL} | {} | {} | {1} | {1} | f | f | t | t
+ {NULL} | {2} | {2,NULL} | {2,NULL} | {} | {} | {NULL} | {NULL} | {2,NULL} | {2,NULL} | f | f | t | t
+ {NULL} | {1,2} | {2,1,NULL} | {1,2,NULL} | {} | {} | {NULL} | {NULL} | {1,2,NULL} | {1,2,NULL} | f | f | t | t
+ {NULL} | {2,3} | {3,2,NULL} | {2,3,NULL} | {} | {} | {NULL} | {NULL} | {2,3,NULL} | {2,3,NULL} | f | f | t | t
+ {1} | | | | | | | | | | | | |
+ {1} | {} | {1} | {1} | {} | {} | {1} | {1} | {1} | {1} | f | f | t | t
+ {1} | {NULL} | {1,NULL} | {1,NULL} | {} | {} | {1} | {1} | {1,NULL} | {1,NULL} | f | f | t | t
+ {1} | {1} | {1} | {1} | {1} | {1} | {} | {} | {} | {} | t | t | f | f
+ {1} | {1,NULL} | {1,NULL} | {1,NULL} | {1} | {1} | {} | {} | {NULL} | {NULL} | f | f | t | t
+ {1} | {2} | {1,2} | {1,2} | {} | {} | {1} | {1} | {1,2} | {1,2} | f | f | t | t
+ {1} | {1,2} | {1,2} | {1,2} | {1} | {1} | {} | {} | {2} | {2} | f | f | t | t
+ {1} | {2,3} | {3,1,2} | {1,2,3} | {} | {} | {1} | {1} | {3,2,1} | {1,2,3} | f | f | t | t
+ {1,NULL} | | | | | | | | | | | | |
+ {1,NULL} | {} | {1,NULL} | {1,NULL} | {} | {} | {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | f | f | t | t
+ {1,NULL} | {NULL} | {1,NULL} | {1,NULL} | {NULL} | {NULL} | {1} | {1} | {1} | {1} | f | f | t | t
+ {1,NULL} | {1} | {1,NULL} | {1,NULL} | {1} | {1} | {NULL} | {NULL} | {NULL} | {NULL} | f | f | t | t
+ {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {} | {} | {} | {} | t | t | f | f
+ {1,NULL} | {2} | {1,2,NULL} | {1,2,NULL} | {} | {} | {1,NULL} | {1,NULL} | {1,2,NULL} | {1,2,NULL} | f | f | t | t
+ {1,NULL} | {1,2} | {1,2,NULL} | {1,2,NULL} | {1} | {1} | {NULL} | {NULL} | {2,NULL} | {2,NULL} | f | f | t | t
+ {1,NULL} | {2,3} | {3,1,2,NULL} | {1,2,3,NULL} | {} | {} | {1,NULL} | {1,NULL} | {3,2,1,NULL} | {1,2,3,NULL} | f | f | t | t
+ {2} | | | | | | | | | | | | |
+ {2} | {} | {2} | {2} | {} | {} | {2} | {2} | {2} | {2} | f | f | t | t
+ {2} | {NULL} | {2,NULL} | {2,NULL} | {} | {} | {2} | {2} | {2,NULL} | {2,NULL} | f | f | t | t
+ {2} | {1} | {2,1} | {1,2} | {} | {} | {2} | {2} | {2,1} | {1,2} | f | f | t | t
+ {2} | {1,NULL} | {2,1,NULL} | {1,2,NULL} | {} | {} | {2} | {2} | {2,1,NULL} | {1,2,NULL} | f | f | t | t
+ {2} | {2} | {2} | {2} | {2} | {2} | {} | {} | {} | {} | t | t | f | f
+ {2} | {1,2} | {2,1} | {1,2} | {2} | {2} | {} | {} | {1} | {1} | f | f | t | t
+ {2} | {2,3} | {2,3} | {2,3} | {2} | {2} | {} | {} | {3} | {3} | f | f | t | t
+ {1,2} | | | | | | | | | | | | |
+ {1,2} | {} | {1,2} | {1,2} | {} | {} | {1,2} | {1,2} | {1,2} | {1,2} | f | f | t | t
+ {1,2} | {NULL} | {1,2,NULL} | {1,2,NULL} | {} | {} | {1,2} | {1,2} | {1,2,NULL} | {1,2,NULL} | f | f | t | t
+ {1,2} | {1} | {1,2} | {1,2} | {1} | {1} | {2} | {2} | {2} | {2} | f | f | t | t
+ {1,2} | {1,NULL} | {1,2,NULL} | {1,2,NULL} | {1} | {1} | {2} | {2} | {2,NULL} | {2,NULL} | f | f | t | t
+ {1,2} | {2} | {1,2} | {1,2} | {2} | {2} | {1} | {1} | {1} | {1} | f | f | t | t
+ {1,2} | {1,2} | {1,2} | {1,2} | {1,2} | {1,2} | {} | {} | {} | {} | t | t | f | f
+ {1,2} | {2,3} | {3,1,2} | {1,2,3} | {2} | {2} | {1} | {1} | {1,3} | {1,3} | f | f | t | t
+ {2,3} | | | | | | | | | | | | |
+ {2,3} | {} | {2,3} | {2,3} | {} | {} | {2,3} | {2,3} | {2,3} | {2,3} | f | f | t | t
+ {2,3} | {NULL} | {2,3,NULL} | {2,3,NULL} | {} | {} | {2,3} | {2,3} | {2,3,NULL} | {2,3,NULL} | f | f | t | t
+ {2,3} | {1} | {1,2,3} | {1,2,3} | {} | {} | {2,3} | {2,3} | {3,2,1} | {1,2,3} | f | f | t | t
+ {2,3} | {1,NULL} | {1,2,3,NULL} | {1,2,3,NULL} | {} | {} | {2,3} | {2,3} | {3,2,1,NULL} | {1,2,3,NULL} | f | f | t | t
+ {2,3} | {2} | {2,3} | {2,3} | {2} | {2} | {3} | {3} | {3} | {3} | f | f | t | t
+ {2,3} | {1,2} | {1,2,3} | {1,2,3} | {2} | {2} | {3} | {3} | {1,3} | {1,3} | f | f | t | t
+ {2,3} | {2,3} | {2,3} | {2,3} | {2,3} | {2,3} | {} | {} | {} | {} | t | t | f | f
+(64 rows)
+
+SELECT * FROM hashset_test_results_3;
+ arg1 | hashset_cardinality | cardinality
+--------+---------------------+-------------
+ | |
+ {} | 0 | 0
+ {NULL} | 1 | 1
+ {1} | 1 | 1
+ {2} | 1 | 1
+ {1,2} | 2 | 2
+ {2,3} | 2 | 2
+(7 rows)
+
+/*
+ * The queries below should not return any rows since the hashset
+ * semantics should be identical to array semantics, given the array elements
+ * are distinct and both are compared as sorted arrays.
+ */
+\echo *** Testing: hashset_add()
+*** Testing: hashset_add()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_to_sorted_array(hashset_add)
+IS DISTINCT FROM
+ array_sort_distinct(array_append);
+ arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
+------+------+-------------+--------------+------------------+------------
+(0 rows)
+
+\echo *** Testing: hashset_contains()
+*** Testing: hashset_contains()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_contains
+IS DISTINCT FROM
+ "= ANY(...)";
+ arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
+------+------+-------------+--------------+------------------+------------
+(0 rows)
+
+\echo *** Testing: hashset_union()
+*** Testing: hashset_union()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_union)
+IS DISTINCT FROM
+ array_sort_distinct(array_union);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_intersection()
+*** Testing: hashset_intersection()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_intersection)
+IS DISTINCT FROM
+ array_sort_distinct(array_intersection);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_difference()
+*** Testing: hashset_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_difference);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_symmetric_difference()
+*** Testing: hashset_symmetric_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_symmetric_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_symmetric_difference);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_eq()
+*** Testing: hashset_eq()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_eq
+IS DISTINCT FROM
+ array_eq;
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_ne()
+*** Testing: hashset_ne()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_ne
+IS DISTINCT FROM
+ array_ne;
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_cardinality()
+*** Testing: hashset_cardinality()
+SELECT * FROM hashset_test_results_3
+WHERE
+ hashset_cardinality
+IS DISTINCT FROM
+ cardinality;
+ arg1 | hashset_cardinality | cardinality
+------+---------------------+-------------
+(0 rows)
+
diff --git a/test/expected/basic.out b/test/expected/basic.out
index b5326f2..79c3230 100644
--- a/test/expected/basic.out
+++ b/test/expected/basic.out
@@ -71,8 +71,8 @@ SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
f
(1 row)
-SELECT hashset_merge('{1,2}'::int4hashset, '{2,3}'::int4hashset);
- hashset_merge
+SELECT hashset_union('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+ hashset_union
---------------
{3,1,2}
(1 row)
@@ -83,10 +83,10 @@ SELECT hashset_to_array('{1,2,3}'::int4hashset);
{3,2,1}
(1 row)
-SELECT hashset_count('{1,2,3}'::int4hashset); -- 3
- hashset_count
----------------
- 3
+SELECT hashset_cardinality('{1,2,3}'::int4hashset); -- 3
+ hashset_cardinality
+---------------------
+ 3
(1 row)
SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
diff --git a/test/expected/reported_bugs.out b/test/expected/reported_bugs.out
index b356b64..03cc7c3 100644
--- a/test/expected/reported_bugs.out
+++ b/test/expected/reported_bugs.out
@@ -1,7 +1,7 @@
/*
- * Bug in hashset_add() and hashset_merge() functions altering original hashset.
+ * Bug in hashset_add() and hashset_union() functions altering original hashset.
*
- * Previously, the hashset_add() and hashset_merge() functions were modifying the
+ * Previously, the hashset_add() and hashset_union() functions were modifying the
* original hashset in-place, leading to unexpected results as the original data
* within the hashset was being altered.
*
@@ -10,7 +10,7 @@
* a copy of the hashset is created and subsequently modified, thereby preserving
* the integrity of the original hashset.
*
- * As a result of this fix, hashset_add() and hashset_merge() now operate on
+ * As a result of this fix, hashset_add() and hashset_union() now operate on
* a copied hashset, ensuring that the original data remains unaltered, and
* the query executes correctly.
*/
diff --git a/test/expected/strict.out b/test/expected/strict.out
deleted file mode 100644
index 4a9d904..0000000
--- a/test/expected/strict.out
+++ /dev/null
@@ -1,114 +0,0 @@
-/*
- * Test to verify all relevant functions return NULL if any of
- * the input parameters are NULL, i.e. testing that they are declared as STRICT
- */
-SELECT hashset_add(int4hashset(), NULL::int);
- hashset_add
--------------
-
-(1 row)
-
-SELECT hashset_add(NULL::int4hashset, 123::int);
- hashset_add
--------------
-
-(1 row)
-
-SELECT hashset_contains('{123,456}'::int4hashset, NULL::int);
- hashset_contains
-------------------
-
-(1 row)
-
-SELECT hashset_contains(NULL::int4hashset, 456::int);
- hashset_contains
-------------------
-
-(1 row)
-
-SELECT hashset_merge('{1,2}'::int4hashset, NULL::int4hashset);
- hashset_merge
----------------
-
-(1 row)
-
-SELECT hashset_merge(NULL::int4hashset, '{2,3}'::int4hashset);
- hashset_merge
----------------
-
-(1 row)
-
-SELECT hashset_to_array(NULL::int4hashset);
- hashset_to_array
-------------------
-
-(1 row)
-
-SELECT hashset_count(NULL::int4hashset);
- hashset_count
----------------
-
-(1 row)
-
-SELECT hashset_capacity(NULL::int4hashset);
- hashset_capacity
-------------------
-
-(1 row)
-
-SELECT hashset_intersection('{1,2}'::int4hashset,NULL::int4hashset);
- hashset_intersection
-----------------------
-
-(1 row)
-
-SELECT hashset_intersection(NULL::int4hashset,'{2,3}'::int4hashset);
- hashset_intersection
-----------------------
-
-(1 row)
-
-SELECT hashset_difference('{1,2}'::int4hashset,NULL::int4hashset);
- hashset_difference
---------------------
-
-(1 row)
-
-SELECT hashset_difference(NULL::int4hashset,'{2,3}'::int4hashset);
- hashset_difference
---------------------
-
-(1 row)
-
-SELECT hashset_symmetric_difference('{1,2}'::int4hashset,NULL::int4hashset);
- hashset_symmetric_difference
-------------------------------
-
-(1 row)
-
-SELECT hashset_symmetric_difference(NULL::int4hashset,'{2,3}'::int4hashset);
- hashset_symmetric_difference
-------------------------------
-
-(1 row)
-
-/*
- * For convenience, hashset_agg() is not STRICT and just ignore NULL values
- */
-SELECT hashset_agg(i) FROM (VALUES (NULL::int),(1::int),(2::int)) q(i);
- hashset_agg
--------------
- {1,2}
-(1 row)
-
-SELECT hashset_agg(h) FROM
-(
- SELECT NULL::int4hashset AS h
- UNION ALL
- SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
-) q;
- hashset_agg
---------------
- {6,7,8,9,10}
-(1 row)
-
diff --git a/test/expected/table.out b/test/expected/table.out
index 9793a49..f59494e 100644
--- a/test/expected/table.out
+++ b/test/expected/table.out
@@ -11,10 +11,10 @@ SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
t
(1 row)
-SELECT hashset_count(user_likes) FROM users WHERE user_id = 1;
- hashset_count
----------------
- 2
+SELECT hashset_cardinality(user_likes) FROM users WHERE user_id = 1;
+ hashset_cardinality
+---------------------
+ 2
(1 row)
SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
diff --git a/test/sql/array-and-multiset-semantics.sql b/test/sql/array-and-multiset-semantics.sql
new file mode 100644
index 0000000..0db7065
--- /dev/null
+++ b/test/sql/array-and-multiset-semantics.sql
@@ -0,0 +1,232 @@
+CREATE OR REPLACE FUNCTION array_union(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_intersection(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ EXCEPT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_symmetric_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) AS q1
+ EXCEPT
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) AS q2
+ ) AS q3
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_sort_distinct(int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ cardinality($1) = 0
+ THEN
+ '{}'::int4[]
+ ELSE
+ (
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM unnest($1)
+ )
+ END
+$$ LANGUAGE sql;
+
+DROP TABLE IF EXISTS hashset_test_results_1;
+CREATE TABLE hashset_test_results_1 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_add(arg1::int4hashset, arg2),
+ array_append(arg1::int4[], arg2),
+ hashset_contains(arg1::int4hashset, arg2),
+ arg2 = ANY(arg1::int4[]) AS "= ANY(...)"
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL::int4), (1::int4), (4::int4)) AS b(arg2);
+
+
+DROP TABLE IF EXISTS hashset_test_results_2;
+CREATE TABLE hashset_test_results_2 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_union(arg1::int4hashset, arg2::int4hashset),
+ array_union(arg1::int4[], arg2::int4[]),
+ hashset_intersection(arg1::int4hashset, arg2::int4hashset),
+ array_intersection(arg1::int4[], arg2::int4[]),
+ hashset_difference(arg1::int4hashset, arg2::int4hashset),
+ array_difference(arg1::int4[], arg2::int4[]),
+ hashset_symmetric_difference(arg1::int4hashset, arg2::int4hashset),
+ array_symmetric_difference(arg1::int4[], arg2::int4[]),
+ hashset_eq(arg1::int4hashset, arg2::int4hashset),
+ array_eq(arg1::int4[], arg2::int4[]),
+ hashset_ne(arg1::int4hashset, arg2::int4hashset),
+ array_ne(arg1::int4[], arg2::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS b(arg2);
+
+DROP TABLE IF EXISTS hashset_test_results_3;
+CREATE TABLE hashset_test_results_3 AS
+SELECT
+ arg1,
+ hashset_cardinality(arg1::int4hashset),
+ cardinality(arg1::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1);
+
+SELECT * FROM hashset_test_results_1;
+SELECT * FROM hashset_test_results_2;
+SELECT * FROM hashset_test_results_3;
+
+/*
+ * The queries below should not return any rows since the hashset
+ * semantics should be identical to array semantics, given the array elements
+ * are distinct and both are compared as sorted arrays.
+ */
+
+\echo *** Testing: hashset_add()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_to_sorted_array(hashset_add)
+IS DISTINCT FROM
+ array_sort_distinct(array_append);
+
+\echo *** Testing: hashset_contains()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_contains
+IS DISTINCT FROM
+ "= ANY(...)";
+
+\echo *** Testing: hashset_union()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_union)
+IS DISTINCT FROM
+ array_sort_distinct(array_union);
+
+\echo *** Testing: hashset_intersection()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_intersection)
+IS DISTINCT FROM
+ array_sort_distinct(array_intersection);
+
+\echo *** Testing: hashset_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_difference);
+
+\echo *** Testing: hashset_symmetric_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_symmetric_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_symmetric_difference);
+
+\echo *** Testing: hashset_eq()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_eq
+IS DISTINCT FROM
+ array_eq;
+
+\echo *** Testing: hashset_ne()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_ne
+IS DISTINCT FROM
+ array_ne;
+
+\echo *** Testing: hashset_cardinality()
+SELECT * FROM hashset_test_results_3
+WHERE
+ hashset_cardinality
+IS DISTINCT FROM
+ cardinality;
diff --git a/test/sql/basic.sql b/test/sql/basic.sql
index 061794c..2bf5893 100644
--- a/test/sql/basic.sql
+++ b/test/sql/basic.sql
@@ -23,9 +23,9 @@ SELECT hashset_add(int4hashset(), 123);
SELECT hashset_add('{123}'::int4hashset, 456);
SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
-SELECT hashset_merge('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+SELECT hashset_union('{1,2}'::int4hashset, '{2,3}'::int4hashset);
SELECT hashset_to_array('{1,2,3}'::int4hashset);
-SELECT hashset_count('{1,2,3}'::int4hashset); -- 3
+SELECT hashset_cardinality('{1,2,3}'::int4hashset); -- 3
SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
SELECT hashset_intersection('{1,2}'::int4hashset,'{2,3}'::int4hashset);
SELECT hashset_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
diff --git a/test/sql/benchmark.sql b/test/sql/benchmark.sql
index 1535c22..e7a53f1 100644
--- a/test/sql/benchmark.sql
+++ b/test/sql/benchmark.sql
@@ -65,7 +65,7 @@ SELECT hashset_agg(rnd) FROM benchmark_input_10M;
SELECT cardinality(array_agg) FROM benchmark_array_agg ORDER BY 1;
SELECT
- hashset_count(hashset_agg),
+ hashset_cardinality(hashset_agg),
hashset_capacity(hashset_agg),
hashset_collisions(hashset_agg),
hashset_max_collisions(hashset_agg)
@@ -88,7 +88,7 @@ BEGIN
FOR i IN 1..100000 LOOP
h := hashset_add(h, i);
END LOOP;
- RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
@@ -106,7 +106,7 @@ BEGIN
FOR i IN 1..100000 LOOP
h := hashset_add(h, i);
END LOOP;
- RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
@@ -124,7 +124,7 @@ BEGIN
FOR i IN 1..100000 LOOP
h := hashset_add(h, i);
END LOOP;
- RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
@@ -145,7 +145,7 @@ BEGIN
FOR i IN 1..100000 LOOP
h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
END LOOP;
- RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
@@ -164,7 +164,7 @@ BEGIN
FOR i IN 1..100000 LOOP
h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
END LOOP;
- RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
@@ -183,7 +183,7 @@ BEGIN
FOR i IN 1..100000 LOOP
h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
END LOOP;
- RAISE NOTICE 'hashset_count: %', hashset_count(h);
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
diff --git a/test/sql/reported_bugs.sql b/test/sql/reported_bugs.sql
index 9166f5d..9e6b617 100644
--- a/test/sql/reported_bugs.sql
+++ b/test/sql/reported_bugs.sql
@@ -1,7 +1,7 @@
/*
- * Bug in hashset_add() and hashset_merge() functions altering original hashset.
+ * Bug in hashset_add() and hashset_union() functions altering original hashset.
*
- * Previously, the hashset_add() and hashset_merge() functions were modifying the
+ * Previously, the hashset_add() and hashset_union() functions were modifying the
* original hashset in-place, leading to unexpected results as the original data
* within the hashset was being altered.
*
@@ -10,7 +10,7 @@
* a copy of the hashset is created and subsequently modified, thereby preserving
* the integrity of the original hashset.
*
- * As a result of this fix, hashset_add() and hashset_merge() now operate on
+ * As a result of this fix, hashset_add() and hashset_union() now operate on
* a copied hashset, ensuring that the original data remains unaltered, and
* the query executes correctly.
*/
diff --git a/test/sql/strict.sql b/test/sql/strict.sql
deleted file mode 100644
index d0f33bd..0000000
--- a/test/sql/strict.sql
+++ /dev/null
@@ -1,32 +0,0 @@
-/*
- * Test to verify all relevant functions return NULL if any of
- * the input parameters are NULL, i.e. testing that they are declared as STRICT
- */
-
-SELECT hashset_add(int4hashset(), NULL::int);
-SELECT hashset_add(NULL::int4hashset, 123::int);
-SELECT hashset_contains('{123,456}'::int4hashset, NULL::int);
-SELECT hashset_contains(NULL::int4hashset, 456::int);
-SELECT hashset_merge('{1,2}'::int4hashset, NULL::int4hashset);
-SELECT hashset_merge(NULL::int4hashset, '{2,3}'::int4hashset);
-SELECT hashset_to_array(NULL::int4hashset);
-SELECT hashset_count(NULL::int4hashset);
-SELECT hashset_capacity(NULL::int4hashset);
-SELECT hashset_intersection('{1,2}'::int4hashset,NULL::int4hashset);
-SELECT hashset_intersection(NULL::int4hashset,'{2,3}'::int4hashset);
-SELECT hashset_difference('{1,2}'::int4hashset,NULL::int4hashset);
-SELECT hashset_difference(NULL::int4hashset,'{2,3}'::int4hashset);
-SELECT hashset_symmetric_difference('{1,2}'::int4hashset,NULL::int4hashset);
-SELECT hashset_symmetric_difference(NULL::int4hashset,'{2,3}'::int4hashset);
-
-/*
- * For convenience, hashset_agg() is not STRICT and just ignore NULL values
- */
-SELECT hashset_agg(i) FROM (VALUES (NULL::int),(1::int),(2::int)) q(i);
-
-SELECT hashset_agg(h) FROM
-(
- SELECT NULL::int4hashset AS h
- UNION ALL
- SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
-) q;
diff --git a/test/sql/table.sql b/test/sql/table.sql
index 0472352..bf05ffa 100644
--- a/test/sql/table.sql
+++ b/test/sql/table.sql
@@ -6,5 +6,5 @@ INSERT INTO users (user_id) VALUES (1);
UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
-SELECT hashset_count(user_likes) FROM users WHERE user_id = 1;
+SELECT hashset_cardinality(user_likes) FROM users WHERE user_id = 1;
SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
On Tue, Jun 27, 2023, at 10:26, Joel Jacobson wrote:
Attachments:
* hashset-0.0.1-b7e5614-full.patch
* hashset-0.0.1-b7e5614-incremental.patch
To help verify that the semantics, I thought it might be helpful to provide
a comprehensive set of examples that tries to cover all different ways of varying
the arguments to the functions.
Please let me know if you find any possible errors or if you think it looks good.
SELECT NULL::int4hashset;
int4hashset
-------------
(1 row)
SELECT '{}'::int4hashset;
int4hashset
-------------
{}
(1 row)
SELECT int4hashset();
int4hashset
-------------
{}
(1 row)
SELECT '{NULL}'::int4hashset;
int4hashset
-------------
{NULL}
(1 row)
SELECT '{NULL,NULL}'::int4hashset;
int4hashset
-------------
{NULL}
(1 row)
SELECT '{1,3,2,NULL,2,NULL,3,1}'::int4hashset;
int4hashset
--------------
{2,1,3,NULL}
(1 row)
SELECT hashset_add(NULL, NULL);
hashset_add
-------------
{NULL}
(1 row)
SELECT hashset_add(NULL, 1);
hashset_add
-------------
{1}
(1 row)
SELECT hashset_add('{}', 1);
hashset_add
-------------
{1}
(1 row)
SELECT hashset_add('{NULL}', 1);
hashset_add
-------------
{1,NULL}
(1 row)
SELECT hashset_add('{1}', 1);
hashset_add
-------------
{1}
(1 row)
SELECT hashset_add('{1}', 2);
hashset_add
-------------
{1,2}
(1 row)
SELECT hashset_add('{1}', NULL);
hashset_add
-------------
{1,NULL}
(1 row)
SELECT hashset_contains(NULL, NULL);
hashset_contains
------------------
(1 row)
SELECT hashset_contains('{}', NULL);
hashset_contains
------------------
f
(1 row)
SELECT hashset_contains('{NULL}', NULL);
hashset_contains
------------------
(1 row)
SELECT hashset_contains('{1}', 1);
hashset_contains
------------------
t
(1 row)
SELECT hashset_contains('{1,NULL}', 1);
hashset_contains
------------------
t
(1 row)
SELECT hashset_contains('{1}', 2);
hashset_contains
------------------
f
(1 row)
SELECT hashset_contains('{1,NULL}', 2);
hashset_contains
------------------
(1 row)
SELECT hashset_to_array(NULL);
hashset_to_array
------------------
(1 row)
SELECT hashset_to_array('{}');
hashset_to_array
------------------
{}
(1 row)
SELECT hashset_to_array('{NULL}');
hashset_to_array
------------------
{NULL}
(1 row)
SELECT hashset_to_array('{3,1,NULL,2}');
hashset_to_array
------------------
{1,3,2,NULL}
(1 row)
SELECT hashset_to_sorted_array(NULL);
hashset_to_sorted_array
-------------------------
(1 row)
SELECT hashset_to_sorted_array('{}');
hashset_to_sorted_array
-------------------------
{}
(1 row)
SELECT hashset_to_sorted_array('{NULL}');
hashset_to_sorted_array
-------------------------
{NULL}
(1 row)
SELECT hashset_to_sorted_array('{3,1,NULL,2}');
hashset_to_sorted_array
-------------------------
{1,2,3,NULL}
(1 row)
SELECT hashset_cardinality(NULL);
hashset_cardinality
---------------------
(1 row)
SELECT hashset_cardinality('{}');
hashset_cardinality
---------------------
0
(1 row)
SELECT hashset_cardinality('{NULL}');
hashset_cardinality
---------------------
1
(1 row)
SELECT hashset_cardinality('{NULL,NULL}');
hashset_cardinality
---------------------
1
(1 row)
SELECT hashset_cardinality('{1}');
hashset_cardinality
---------------------
1
(1 row)
SELECT hashset_cardinality('{1,1}');
hashset_cardinality
---------------------
1
(1 row)
SELECT hashset_cardinality('{1,2}');
hashset_cardinality
---------------------
2
(1 row)
SELECT hashset_cardinality('{1,2,NULL}');
hashset_cardinality
---------------------
3
(1 row)
SELECT hashset_union(NULL, NULL);
hashset_union
---------------
(1 row)
SELECT hashset_union(NULL, '{}');
hashset_union
---------------
(1 row)
SELECT hashset_union('{}', NULL);
hashset_union
---------------
(1 row)
SELECT hashset_union('{}', '{}');
hashset_union
---------------
{}
(1 row)
SELECT hashset_union('{}', '{NULL}');
hashset_union
---------------
{NULL}
(1 row)
SELECT hashset_union('{NULL}', '{}');
hashset_union
---------------
{NULL}
(1 row)
SELECT hashset_union('{NULL}', '{NULL}');
hashset_union
---------------
{NULL}
(1 row)
SELECT hashset_union('{}', '{1}');
hashset_union
---------------
{1}
(1 row)
SELECT hashset_union('{1}', '{}');
hashset_union
---------------
{1}
(1 row)
SELECT hashset_union('{1}', '{1}');
hashset_union
---------------
{1}
(1 row)
SELECT hashset_union('{1}', NULL);
hashset_union
---------------
(1 row)
SELECT hashset_union(NULL, '{1}');
hashset_union
---------------
(1 row)
SELECT hashset_union('{1}', '{NULL}');
hashset_union
---------------
{1,NULL}
(1 row)
SELECT hashset_union('{NULL}', '{1}');
hashset_union
---------------
{1,NULL}
(1 row)
SELECT hashset_union('{1}', '{2}');
hashset_union
---------------
{1,2}
(1 row)
SELECT hashset_union('{1,2}', '{2,3}');
hashset_union
---------------
{3,1,2}
(1 row)
SELECT hashset_intersection(NULL, NULL);
hashset_intersection
----------------------
(1 row)
SELECT hashset_intersection(NULL, '{}');
hashset_intersection
----------------------
(1 row)
SELECT hashset_intersection('{}', NULL);
hashset_intersection
----------------------
(1 row)
SELECT hashset_intersection('{}', '{}');
hashset_intersection
----------------------
{}
(1 row)
SELECT hashset_intersection('{}', '{NULL}');
hashset_intersection
----------------------
{}
(1 row)
SELECT hashset_intersection('{NULL}', '{}');
hashset_intersection
----------------------
{}
(1 row)
SELECT hashset_intersection('{NULL}', '{NULL}');
hashset_intersection
----------------------
{NULL}
(1 row)
SELECT hashset_intersection('{}', '{1}');
hashset_intersection
----------------------
{}
(1 row)
SELECT hashset_intersection('{1}', '{}');
hashset_intersection
----------------------
{}
(1 row)
SELECT hashset_intersection('{1}', '{1}');
hashset_intersection
----------------------
{1}
(1 row)
SELECT hashset_intersection('{1}', NULL);
hashset_intersection
----------------------
(1 row)
SELECT hashset_intersection(NULL, '{1}');
hashset_intersection
----------------------
(1 row)
SELECT hashset_intersection('{1}', '{NULL}');
hashset_intersection
----------------------
{}
(1 row)
SELECT hashset_intersection('{NULL}', '{1}');
hashset_intersection
----------------------
{}
(1 row)
SELECT hashset_intersection('{1}', '{2}');
hashset_intersection
----------------------
{}
(1 row)
SELECT hashset_intersection('{1,2}', '{2,3}');
hashset_intersection
----------------------
{2}
(1 row)
SELECT hashset_difference(NULL, NULL);
hashset_difference
--------------------
(1 row)
SELECT hashset_difference(NULL, '{}');
hashset_difference
--------------------
(1 row)
SELECT hashset_difference('{}', NULL);
hashset_difference
--------------------
(1 row)
SELECT hashset_difference('{}', '{}');
hashset_difference
--------------------
{}
(1 row)
SELECT hashset_difference('{}', '{NULL}');
hashset_difference
--------------------
{}
(1 row)
SELECT hashset_difference('{NULL}', '{}');
hashset_difference
--------------------
{NULL}
(1 row)
SELECT hashset_difference('{NULL}', '{NULL}');
hashset_difference
--------------------
{}
(1 row)
SELECT hashset_difference('{}', '{1}');
hashset_difference
--------------------
{}
(1 row)
SELECT hashset_difference('{1}', '{}');
hashset_difference
--------------------
{1}
(1 row)
SELECT hashset_difference('{1}', '{1}');
hashset_difference
--------------------
{}
(1 row)
SELECT hashset_difference('{1}', NULL);
hashset_difference
--------------------
(1 row)
SELECT hashset_difference(NULL, '{1}');
hashset_difference
--------------------
(1 row)
SELECT hashset_difference('{1}', '{NULL}');
hashset_difference
--------------------
{1}
(1 row)
SELECT hashset_difference('{NULL}', '{1}');
hashset_difference
--------------------
{NULL}
(1 row)
SELECT hashset_difference('{1}', '{2}');
hashset_difference
--------------------
{1}
(1 row)
SELECT hashset_difference('{1,2}', '{2,3}');
hashset_difference
--------------------
{1}
(1 row)
SELECT hashset_symmetric_difference(NULL, NULL);
hashset_symmetric_difference
------------------------------
(1 row)
SELECT hashset_symmetric_difference(NULL, '{}');
hashset_symmetric_difference
------------------------------
(1 row)
SELECT hashset_symmetric_difference('{}', NULL);
hashset_symmetric_difference
------------------------------
(1 row)
SELECT hashset_symmetric_difference('{}', '{}');
hashset_symmetric_difference
------------------------------
{}
(1 row)
SELECT hashset_symmetric_difference('{}', '{NULL}');
hashset_symmetric_difference
------------------------------
{NULL}
(1 row)
SELECT hashset_symmetric_difference('{NULL}', '{}');
hashset_symmetric_difference
------------------------------
{NULL}
(1 row)
SELECT hashset_symmetric_difference('{NULL}', '{NULL}');
hashset_symmetric_difference
------------------------------
{}
(1 row)
SELECT hashset_symmetric_difference('{}', '{1}');
hashset_symmetric_difference
------------------------------
{1}
(1 row)
SELECT hashset_symmetric_difference('{1}', '{}');
hashset_symmetric_difference
------------------------------
{1}
(1 row)
SELECT hashset_symmetric_difference('{1}', '{1}');
hashset_symmetric_difference
------------------------------
{}
(1 row)
SELECT hashset_symmetric_difference('{1}', NULL);
hashset_symmetric_difference
------------------------------
(1 row)
SELECT hashset_symmetric_difference(NULL, '{1}');
hashset_symmetric_difference
------------------------------
(1 row)
SELECT hashset_symmetric_difference('{1}', '{NULL}');
hashset_symmetric_difference
------------------------------
{1,NULL}
(1 row)
SELECT hashset_symmetric_difference('{NULL}', '{1}');
hashset_symmetric_difference
------------------------------
{1,NULL}
(1 row)
SELECT hashset_symmetric_difference('{1}', '{2}');
hashset_symmetric_difference
------------------------------
{1,2}
(1 row)
SELECT hashset_symmetric_difference('{1,2}', '{2,3}');
hashset_symmetric_difference
------------------------------
{1,3}
(1 row)
SELECT hashset_eq(NULL, NULL);
hashset_eq
------------
(1 row)
SELECT hashset_eq(NULL, '{}');
hashset_eq
------------
(1 row)
SELECT hashset_eq('{}', NULL);
hashset_eq
------------
(1 row)
SELECT hashset_eq('{}', '{}');
hashset_eq
------------
t
(1 row)
SELECT hashset_eq('{}', '{NULL}');
hashset_eq
------------
f
(1 row)
SELECT hashset_eq('{NULL}', '{}');
hashset_eq
------------
f
(1 row)
SELECT hashset_eq('{NULL}', '{NULL}');
hashset_eq
------------
t
(1 row)
SELECT hashset_eq('{}', '{1}');
hashset_eq
------------
f
(1 row)
SELECT hashset_eq('{1}', '{}');
hashset_eq
------------
f
(1 row)
SELECT hashset_eq('{1}', '{1}');
hashset_eq
------------
t
(1 row)
SELECT hashset_eq('{1}', NULL);
hashset_eq
------------
(1 row)
SELECT hashset_eq(NULL, '{1}');
hashset_eq
------------
(1 row)
SELECT hashset_eq('{1}', '{NULL}');
hashset_eq
------------
f
(1 row)
SELECT hashset_eq('{NULL}', '{1}');
hashset_eq
------------
f
(1 row)
SELECT hashset_eq('{1}', '{2}');
hashset_eq
------------
f
(1 row)
SELECT hashset_eq('{1,2}', '{2,3}');
hashset_eq
------------
f
(1 row)
SELECT hashset_ne(NULL, NULL);
hashset_ne
------------
(1 row)
SELECT hashset_ne(NULL, '{}');
hashset_ne
------------
(1 row)
SELECT hashset_ne('{}', NULL);
hashset_ne
------------
(1 row)
SELECT hashset_ne('{}', '{}');
hashset_ne
------------
f
(1 row)
SELECT hashset_ne('{}', '{NULL}');
hashset_ne
------------
t
(1 row)
SELECT hashset_ne('{NULL}', '{}');
hashset_ne
------------
t
(1 row)
SELECT hashset_ne('{NULL}', '{NULL}');
hashset_ne
------------
f
(1 row)
SELECT hashset_ne('{}', '{1}');
hashset_ne
------------
t
(1 row)
SELECT hashset_ne('{1}', '{}');
hashset_ne
------------
t
(1 row)
SELECT hashset_ne('{1}', '{1}');
hashset_ne
------------
f
(1 row)
SELECT hashset_ne('{1}', NULL);
hashset_ne
------------
(1 row)
SELECT hashset_ne(NULL, '{1}');
hashset_ne
------------
(1 row)
SELECT hashset_ne('{1}', '{NULL}');
hashset_ne
------------
t
(1 row)
SELECT hashset_ne('{NULL}', '{1}');
hashset_ne
------------
t
(1 row)
SELECT hashset_ne('{1}', '{2}');
hashset_ne
------------
t
(1 row)
SELECT hashset_ne('{1,2}', '{2,3}');
hashset_ne
------------
t
(1 row)
/Joel
On Tue, Jun 27, 2023 at 4:27 PM Joel Jacobson <joel@compiler.org> wrote:
On Tue, Jun 27, 2023, at 04:35, jian he wrote:
in SQLMultiSets.pdf(previously thread) I found a related explanation
on page 45, 46.
/home/jian/hashset/0001-make-int4hashset_contains-strict-and-header-file-change.patch
(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) INTERSECT SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END)CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) UNION SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) EXCEPT SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END)Thanks! This was exactly what I was looking for, I knew I've seen it but failed to find it.
Attached is a new incremental patch as well as a full patch, since this is a substantial change:
Align null semantics with SQL:2023 array and multiset standards
* Introduced a new boolean field, null_element, in the int4hashset_t type.
* Rename hashset_count() to hashset_cardinality().
* Rename hashset_merge() to hashset_union().
* Rename hashset_equals() to hashset_eq().
* Rename hashset_neq() to hashset_ne().
* Add hashset_to_sorted_array().
* Handle null semantics to work as in arrays and multisets.
* Update int4hashset_add() to allow creating a new set if none exists.
* Use more portable int32 typedef instead of int32_t.
This also adds a thorough test suite in array-and-multiset-semantics.sql,
which aims to test all relevant combinations of operations and values.Makefile | 2 +-
README.md | 6 ++--
hashset--0.0.1.sql | 37 +++++++++++---------
hashset-api.c | 208 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------
hashset.c | 12 ++++++-
hashset.h | 11 +++---
test/expected/array-and-multiset-semantics.out | 365 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
test/expected/basic.out | 12 +++----
test/expected/reported_bugs.out | 6 ++--
test/expected/strict.out | 114 ------------------------------------------------------------
test/expected/table.out | 8 ++---
test/sql/array-and-multiset-semantics.sql | 232 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
test/sql/basic.sql | 4 +--
test/sql/benchmark.sql | 14 ++++----
test/sql/reported_bugs.sql | 6 ++--
test/sql/strict.sql | 32 -----------------
test/sql/table.sql | 2 +-
17 files changed, 823 insertions(+), 248 deletions(-)/Joel
Hi there.
I changed the function hashset_contains to strict.
also change the way to return an empty array.
in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that
int4hashset can speed distinct aggregate and distinct counts?
like the following:
explain(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3
explain(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3
explain(costs off,timing off, analyze,buffers)
select count(distinct rnd) from benchmark_input_100k \watch c=3
explain(costs off,timing off, analyze,buffers)
SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
benchmark_input_100k) sub(x) \watch c=3
Attachments:
0001-make-int4hashset_contains-strict-and-header-file-change.patchtext/x-patch; charset=US-ASCII; name=0001-make-int4hashset_contains-strict-and-header-file-change.patchDownload
From 9030adbf9e46f66812fb11849c367bbcf5b3a427 Mon Sep 17 00:00:00 2001
From: pgaddict <jian.universality@gmail.com>
Date: Wed, 28 Jun 2023 14:09:58 +0800
Subject: [PATCH] make int4hashset_contains strict and header file changes.
---
hashset--0.0.1.sql | 2 +-
hashset-api.c | 20 ++++----------------
2 files changed, 5 insertions(+), 17 deletions(-)
diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
index d0478ce9..d448ee69 100644
--- a/hashset--0.0.1.sql
+++ b/hashset--0.0.1.sql
@@ -55,7 +55,7 @@ LANGUAGE C IMMUTABLE;
CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int)
RETURNS boolean
AS 'hashset', 'int4hashset_contains'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION hashset_union(int4hashset, int4hashset)
RETURNS int4hashset
diff --git a/hashset-api.c b/hashset-api.c
index a4beef4e..ff948b55 100644
--- a/hashset-api.c
+++ b/hashset-api.c
@@ -1,8 +1,6 @@
#include "hashset.h"
-#include <stdio.h>
#include <math.h>
-#include <string.h>
#include <sys/time.h>
#include <unistd.h>
#include <limits.h>
@@ -377,17 +375,11 @@ int4hashset_contains(PG_FUNCTION_ARGS)
int32 value;
bool result;
- if (PG_ARGISNULL(0))
- PG_RETURN_NULL();
-
set = PG_GETARG_INT4HASHSET(0);
if (set->nelements == 0 && !set->null_element)
PG_RETURN_BOOL(false);
- if (PG_ARGISNULL(1))
- PG_RETURN_NULL();
-
value = PG_GETARG_INT32(1);
result = int4hashset_contains_element(set, value);
@@ -696,10 +688,8 @@ int4hashset_to_array(PG_FUNCTION_ARGS)
set = PG_GETARG_INT4HASHSET(0);
/* if hashset is empty and does not contain null, return an empty array */
- if(set->nelements == 0 && !set->null_element) {
- Datum d = PointerGetDatum(construct_empty_array(INT4OID));
- PG_RETURN_ARRAYTYPE_P(d);
- }
+ if(set->nelements == 0 && !set->null_element)
+ PG_RETURN_ARRAYTYPE_P(construct_empty_array(INT4OID));
sbitmap = set->data;
svalues = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
@@ -733,10 +723,8 @@ int4hashset_to_sorted_array(PG_FUNCTION_ARGS)
set = PG_GETARG_INT4HASHSET(0);
/* if hashset is empty and does not contain null, return an empty array */
- if(set->nelements == 0 && !set->null_element) {
- Datum d = PointerGetDatum(construct_empty_array(INT4OID));
- PG_RETURN_ARRAYTYPE_P(d);
- }
+ if(set->nelements == 0 && !set->null_element)
+ PG_RETURN_ARRAYTYPE_P(construct_empty_array(INT4OID));
/* extract the sorted elements from the hashset */
values = int4hashset_extract_sorted_elements(set);
--
2.34.1
On Wed, Jun 28, 2023, at 08:26, jian he wrote:
Hi there.
I changed the function hashset_contains to strict.
Changing hashset_contains to STRICT would cause it to return NULL
if any of the operands are NULL, which I don't believe is correct, since:
SELECT NULL = ANY('{}'::int4[]);
?column?
----------
f
(1 row)
Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE,
to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023.
Did you try running `make installcheck` after your change?
You would then have seen one of the tests failing:
test array-and-multiset-semantics ... FAILED 21 ms
Check the content of `regression.diffs` to see why:
% cat regression.diffs
diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out /Users/joel/src/hashset/results/array-and-multiset-semantics.out
--- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 2023-06-27 10:07:38
+++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out 2023-06-28 10:13:27
@@ -158,7 +158,7 @@
| | {NULL} | {NULL} | |
| 1 | {1} | {1} | |
| 4 | {4} | {4} | |
- {} | | {NULL} | {NULL} | f | f
+ {} | | {NULL} | {NULL} | | f
{} | 1 | {1} | {1} | f | f
{} | 4 | {4} | {4} | f | f
{NULL} | | {NULL} | {NULL,NULL} | |
@@ -284,7 +284,8 @@
"= ANY(...)";
arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
------+------+-------------+--------------+------------------+------------
-(0 rows)
+ {} | | {NULL} | {NULL} | | f
+(1 row)
also change the way to return an empty array.
Nice.
I agree the `Datum d` variable was unnecessary.
I also removed the unused includes.
in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that
int4hashset can speed distinct aggregate and distinct counts?
like the following:explain(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3explain(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3
The 100k tables seems to be too small to give any meaningful results,
when trying to measure individual queries:
EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
Execution Time: 26.790 ms
Execution Time: 30.616 ms
Execution Time: 33.253 ms
EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_100k;
Execution Time: 32.797 ms
Execution Time: 27.605 ms
Execution Time: 26.228 ms
If we instead try the 10M tables, it looks like array_agg(DISTINCT ...)
is actually faster for the `i` column where all input integers are unique:
EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
Execution Time: 799.017 ms
Execution Time: 796.008 ms
Execution Time: 799.121 ms
EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_10M;
Execution Time: 1204.873 ms
Execution Time: 1221.822 ms
Execution Time: 1216.340 ms
For random integers, hashset is a win though:
EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
Execution Time: 1874.722 ms
Execution Time: 1878.760 ms
Execution Time: 1861.640 ms
EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(rnd) FROM benchmark_input_10M;
Execution Time: 1253.709 ms
Execution Time: 1222.651 ms
Execution Time: 1237.849 ms
explain(costs off,timing off, analyze,buffers)
select count(distinct rnd) from benchmark_input_100k \watch c=3explain(costs off,timing off, analyze,buffers)
SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
benchmark_input_100k) sub(x) \watch c=3
I tried these with 10M:
EXPLAIN(costs off,timing off, analyze,buffers)
SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M;
Execution Time: 1733.320 ms
Execution Time: 1725.214 ms
Execution Time: 1716.636 ms
EXPLAIN(costs off,timing off, analyze,buffers)
SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_10M) sub(x);
Execution Time: 1249.612 ms
Execution Time: 1240.558 ms
Execution Time: 1252.103 ms
Not sure what I think of the current benchmark suite.
I think it would be better to only include some realistic examples from
real-life, such as the graph query which was the reason I personally started
working on this. Otherwise there is a risk we optimise for some hypothetical
scenario that is not relevant in practise.
Would be good with more examples of typical work loads for when the hashset
type would be useful.
/Joel
On Wed, Jun 28, 2023 at 4:50 PM Joel Jacobson <joel@compiler.org> wrote:
On Wed, Jun 28, 2023, at 08:26, jian he wrote:
Hi there.
I changed the function hashset_contains to strict.Changing hashset_contains to STRICT would cause it to return NULL
if any of the operands are NULL, which I don't believe is correct, since:SELECT NULL = ANY('{}'::int4[]);
?column?
----------
f
(1 row)Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE,
to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023.Did you try running `make installcheck` after your change?
You would then have seen one of the tests failing:test array-and-multiset-semantics ... FAILED 21 ms
Check the content of `regression.diffs` to see why:
% cat regression.diffs diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out /Users/joel/src/hashset/results/array-and-multiset-semantics.out --- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 2023-06-27 10:07:38 +++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out 2023-06-28 10:13:27 @@ -158,7 +158,7 @@ | | {NULL} | {NULL} | | | 1 | {1} | {1} | | | 4 | {4} | {4} | | - {} | | {NULL} | {NULL} | f | f + {} | | {NULL} | {NULL} | | f {} | 1 | {1} | {1} | f | f {} | 4 | {4} | {4} | f | f {NULL} | | {NULL} | {NULL,NULL} | | @@ -284,7 +284,8 @@ "= ANY(...)"; arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...) ------+------+-------------+--------------+------------------+------------ -(0 rows) + {} | | {NULL} | {NULL} | | f +(1 row)also change the way to return an empty array.
Nice.
I agree the `Datum d` variable was unnecessary.
I also removed the unused includes.in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that
int4hashset can speed distinct aggregate and distinct counts?
like the following:explain(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3explain(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3The 100k tables seems to be too small to give any meaningful results,
when trying to measure individual queries:EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
Execution Time: 26.790 ms
Execution Time: 30.616 ms
Execution Time: 33.253 msEXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_100k;
Execution Time: 32.797 ms
Execution Time: 27.605 ms
Execution Time: 26.228 msIf we instead try the 10M tables, it looks like array_agg(DISTINCT ...)
is actually faster for the `i` column where all input integers are unique:EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
Execution Time: 799.017 ms
Execution Time: 796.008 ms
Execution Time: 799.121 msEXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_10M;
Execution Time: 1204.873 ms
Execution Time: 1221.822 ms
Execution Time: 1216.340 msFor random integers, hashset is a win though:
EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
Execution Time: 1874.722 ms
Execution Time: 1878.760 ms
Execution Time: 1861.640 msEXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(rnd) FROM benchmark_input_10M;
Execution Time: 1253.709 ms
Execution Time: 1222.651 ms
Execution Time: 1237.849 msexplain(costs off,timing off, analyze,buffers)
select count(distinct rnd) from benchmark_input_100k \watch c=3explain(costs off,timing off, analyze,buffers)
SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
benchmark_input_100k) sub(x) \watch c=3I tried these with 10M:
EXPLAIN(costs off,timing off, analyze,buffers)
SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M;
Execution Time: 1733.320 ms
Execution Time: 1725.214 ms
Execution Time: 1716.636 msEXPLAIN(costs off,timing off, analyze,buffers)
SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_10M) sub(x);
Execution Time: 1249.612 ms
Execution Time: 1240.558 ms
Execution Time: 1252.103 msNot sure what I think of the current benchmark suite.
I think it would be better to only include some realistic examples from
real-life, such as the graph query which was the reason I personally started
working on this. Otherwise there is a risk we optimise for some hypothetical
scenario that is not relevant in practise.Would be good with more examples of typical work loads for when the hashset
type would be useful./Joel
Did you try running `make installcheck` after your change?
First I use make installcheck
PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config
I found out it uses another active cluster.
So I killed another active cluster.
later i found another database port so I took me sometime to found out
I need use following:
make installcheck
PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config
PGPORT=5421
Anyway, this time, I added another macro,which seems to simplify the code.
#define SET_DATA_PTR(a) \
(((char *) (a->data)) + CEIL_DIV(a->capacity, 8))
it passed all the tests on my local machine.
I should have only made a patch, but when I was committed, I forgot to
mention one file, so I needed 2 commits.
Not sure what I think of the current benchmark suite.
your result is so different from mine. I use the default config. I
see a big difference. yech, I agree, the performance test should be
more careful.
Attachments:
0002-marco-SET_DATA_PTR-to-quicly-access-hashset-data-reg.patchtext/x-patch; charset=US-ASCII; name=0002-marco-SET_DATA_PTR-to-quicly-access-hashset-data-reg.patchDownload
From ece7e6bf34facb67c7a938b2862f3d3af06aefc0 Mon Sep 17 00:00:00 2001
From: pgaddict <jian.universality@gmail.com>
Date: Thu, 29 Jun 2023 14:27:02 +0800
Subject: [PATCH 2/2] marco SET_DATA_PTR to quicly access hashset data region
---
hashset-api.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/hashset-api.c b/hashset-api.c
index 28eb2387..2d856b47 100644
--- a/hashset-api.c
+++ b/hashset-api.c
@@ -194,7 +194,7 @@ int4hashset_out(PG_FUNCTION_ARGS)
/* Calculate the pointer to the bitmap and values array */
bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+ values = (int32 *) SET_DATA_PTR(set);
/* Initialize the StringInfo buffer */
initStringInfo(&str);
@@ -411,7 +411,7 @@ int4hashset_union(PG_FUNCTION_ARGS)
int4hashset_t *seta = PG_GETARG_INT4HASHSET_COPY(0);
int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
char *bitmap = setb->data;
- int32 *values = (int32 *) (bitmap + CEIL_DIV(setb->capacity, 8));
+ int32 *values = (int32 *) SET_DATA_PTR(seta);
for (i = 0; i < setb->capacity; i++)
{
@@ -598,7 +598,7 @@ int4hashset_agg_add_set(PG_FUNCTION_ARGS)
value = PG_GETARG_INT4HASHSET(1);
bitmap = value->data;
- values = (int32 *) (value->data + CEIL_DIV(value->capacity, 8));
+ values = (int32 *) SET_DATA_PTR(value);
for (i = 0; i < value->capacity; i++)
{
@@ -665,7 +665,7 @@ int4hashset_agg_combine(PG_FUNCTION_ARGS)
dst = (int4hashset_t *) PG_GETARG_POINTER(0);
bitmap = src->data;
- values = (int32 *) (src->data + CEIL_DIV(src->capacity, 8));
+ values = (int32 *) SET_DATA_PTR(src);
for (i = 0; i < src->capacity; i++)
{
@@ -698,7 +698,7 @@ int4hashset_to_array(PG_FUNCTION_ARGS)
PG_RETURN_ARRAYTYPE_P(construct_empty_array(INT4OID));
sbitmap = set->data;
- svalues = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+ svalues = (int32 *) (int32 *) SET_DATA_PTR(set);
/* number of values to store in the array */
nvalues = set->nelements;
@@ -757,7 +757,7 @@ int4hashset_eq(PG_FUNCTION_ARGS)
PG_RETURN_BOOL(false);
bitmap_a = a->data;
- values_a = (int32 *)(a->data + CEIL_DIV(a->capacity, 8));
+ values_a = (int32 *) SET_DATA_PTR(a);
/*
* Check if every element in a is also in b
@@ -935,7 +935,8 @@ int4hashset_intersection(PG_FUNCTION_ARGS)
int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
char *bitmap = setb->data;
- int32 *values = (int32 *)(bitmap + CEIL_DIV(setb->capacity, 8));
+ int32 *values = (int32 *) SET_DATA_PTR(setb);
+
int4hashset_t *intersection;
intersection = int4hashset_allocate(
@@ -971,7 +972,7 @@ int4hashset_difference(PG_FUNCTION_ARGS)
int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
int4hashset_t *difference;
char *bitmap = seta->data;
- int32 *values = (int32 *)(bitmap + CEIL_DIV(seta->capacity, 8));
+ int32 *values = (int32 *) SET_DATA_PTR(seta);
difference = int4hashset_allocate(
seta->capacity,
@@ -1007,8 +1008,8 @@ int4hashset_symmetric_difference(PG_FUNCTION_ARGS)
int4hashset_t *result;
char *bitmapa = seta->data;
char *bitmapb = setb->data;
- int32 *valuesa = (int32 *) (bitmapa + CEIL_DIV(seta->capacity, 8));
- int32 *valuesb = (int32 *) (bitmapb + CEIL_DIV(setb->capacity, 8));
+ int32 *valuesa = (int32 *) SET_DATA_PTR(seta);
+ int32 *valuesb = (int32 *) SET_DATA_PTR(setb);
result = int4hashset_allocate(
seta->nelements + setb->nelements,
--
2.34.1
0001-marco-SET_DATA_PTR-to-quicly-access-hashset-data-reg.patchtext/x-patch; charset=US-ASCII; name=0001-marco-SET_DATA_PTR-to-quicly-access-hashset-data-reg.patchDownload
From f94c1261f691c6473cd27de2a8d9465f77e73b40 Mon Sep 17 00:00:00 2001
From: pgaddict <jian.universality@gmail.com>
Date: Thu, 29 Jun 2023 14:20:00 +0800
Subject: [PATCH 1/2] marco SET_DATA_PTR to quicly access hashset data region
---
hashset.c | 8 ++++----
hashset.h | 3 +++
2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/hashset.c b/hashset.c
index 91907abc..d786d379 100644
--- a/hashset.c
+++ b/hashset.c
@@ -82,7 +82,7 @@ int4hashset_resize(int4hashset_t * set)
/* Calculate the pointer to the bitmap and values array */
bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+ values = (int32 *) SET_DATA_PTR(set);
for (i = 0; i < set->capacity; i++)
{
@@ -132,7 +132,7 @@ int4hashset_add_element(int4hashset_t *set, int32 value)
position = hash % set->capacity;
bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+ values = (int32 *) SET_DATA_PTR(set);
while (true)
{
@@ -204,7 +204,7 @@ int4hashset_contains_element(int4hashset_t *set, int32 value)
position = hash % set->capacity;
bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+ values = (int32 *) SET_DATA_PTR(set);
while (true)
{
@@ -238,7 +238,7 @@ int4hashset_extract_sorted_elements(int4hashset_t *set)
/* Access the data array */
char *bitmap = set->data;
- int32 *values = (int32 *)(set->data + CEIL_DIV(set->capacity, 8));
+ int32 *values = (int32 *) SET_DATA_PTR(set);
/* Counter for the number of extracted elements */
int32 nextracted = 0;
diff --git a/hashset.h b/hashset.h
index 86f5d1b0..2513ab55 100644
--- a/hashset.h
+++ b/hashset.h
@@ -42,6 +42,9 @@ typedef struct int4hashset_t {
char data[FLEXIBLE_ARRAY_MEMBER];
} int4hashset_t;
+#define SET_DATA_PTR(a) \
+ (((char *) (a->data)) + CEIL_DIV(a->capacity, 8))
+
int4hashset_t *int4hashset_allocate(int capacity, float4 load_factor, float4 growth_factor, int hashfn_id);
int4hashset_t *int4hashset_resize(int4hashset_t * set);
int4hashset_t *int4hashset_add_element(int4hashset_t *set, int32 value);
--
2.34.1
On Thu, Jun 29, 2023, at 08:54, jian he wrote:
Anyway, this time, I added another macro,which seems to simplify the code.
#define SET_DATA_PTR(a) \
(((char *) (a->data)) + CEIL_DIV(a->capacity, 8))it passed all the tests on my local machine.
Hmm, this is interesting. There is a bug in your second patch,
that the tests catch, so it's really surprising if they pass on your machine.
Can you try to run `make clean && make && make install && make installcheck`?
I would guess you forgot to recompile or reinstall.
This is the bug in 0002-marco-SET_DATA_PTR-to-quicly-access-hashset-data-reg.patch:
@@ -411,7 +411,7 @@ int4hashset_union(PG_FUNCTION_ARGS)
int4hashset_t *seta = PG_GETARG_INT4HASHSET_COPY(0);
int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
char *bitmap = setb->data;
- int32 *values = (int32 *) (bitmap + CEIL_DIV(setb->capacity, 8));
+ int32 *values = (int32 *) SET_DATA_PTR(seta);
You accidentally replaced `setb` with `seta`.
I renamed the macro to HASHSET_GET_VALUES and changed it slightly,
also added a HASHSET_GET_BITMAP for completeness:
#define HASHSET_GET_BITMAP(set) ((set)->data)
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))
Instead of your version:
#define SET_DATA_PTR(a) \
(((char *) (a->data)) + CEIL_DIV(a->capacity, 8))
Changes:
* Parenthesize macro parameters.
* Prefix the macro names with "HASHSET_" to avoid potential conflicts.
* "GET_VALUES" more clearly communicates that it's the values we're extracting.
New patch attached.
Other changes in same commit:
* Add original friends-of-friends graph query to new benchmark/ directory
* Add table of content to README
* Update docs: Explain null semantics and add function examples
* Simplify empty hashset handling, remove unused includes
/Joel
Attachments:
hashset-0.0.1-a775594-full.patchapplication/octet-stream; name=hashset-0.0.1-a775594-full.patchDownload
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..91f216e
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,8 @@
+.deps/
+results/
+**/*.o
+**/*.so
+regression.diffs
+regression.out
+.vscode
+test/c_tests/test_send_recv
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..908853d
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,16 @@
+Copyright (c) 2019, Tomas Vondra (tomas.vondra@postgresql.org).
+
+Permission to use, copy, modify, and distribute this software and its documentation
+for any purpose, without fee, and without a written agreement is hereby granted,
+provided that the above copyright notice and this paragraph and the following two
+paragraphs appear in all copies.
+
+IN NO EVENT SHALL $ORGANISATION BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL,
+INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE
+OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF TOMAS VONDRA HAS BEEN ADVISED OF
+THE POSSIBILITY OF SUCH DAMAGE.
+
+TOMAS VONDRA SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
+SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND $ORGANISATION HAS NO
+OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..ee62511
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,33 @@
+MODULE_big = hashset
+OBJS = hashset.o hashset-api.o
+
+EXTENSION = hashset
+DATA = hashset--0.0.1.sql
+MODULES = hashset
+
+# Keep the CFLAGS separate
+SERVER_INCLUDES=-I$(shell pg_config --includedir-server)
+CLIENT_INCLUDES=-I$(shell pg_config --includedir)
+LIBRARY_PATH = -L$(shell pg_config --libdir)
+
+REGRESS = prelude basic io_varying_lengths random table invalid parsing reported_bugs array-and-multiset-semantics
+REGRESS_OPTS = --inputdir=test
+
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+
+C_TESTS_DIR = test/c_tests
+
+EXTRA_CLEAN = $(C_TESTS_DIR)/test_send_recv
+
+c_tests: $(C_TESTS_DIR)/test_send_recv
+
+$(C_TESTS_DIR)/test_send_recv: $(C_TESTS_DIR)/test_send_recv.c
+ $(CC) $(SERVER_INCLUDES) $(CLIENT_INCLUDES) -o $@ $< $(LIBRARY_PATH) -lpq
+
+run_c_tests: c_tests
+ cd $(C_TESTS_DIR) && ./test_send_recv.sh
+
+check: all $(REGRESS_PREP) run_c_tests
+
+include $(PGXS)
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..aaa84fd
--- /dev/null
+++ b/README.md
@@ -0,0 +1,343 @@
+# hashset
+
+This PostgreSQL extension implements hashset, a data structure (type)
+providing a collection of unique integer items with fast lookup.
+
+It provides several functions for working with these sets, including operations
+like addition, containment check, conversion to array, union, intersection,
+difference, equality check, and cardinality calculation.
+
+`NULL` values are also allowed in the hash set, and are considered as a unique
+element. When multiple `NULL` values are present in the input, they are treated
+as a single `NULL`.
+
+## Table of Contents
+1. [Version](#version)
+2. [Data Types](#data-types)
+ - [int4hashset](#int4hashset)
+3. [Functions](#functions)
+ - [int4hashset](#int4hashset-1)
+ - [hashset_add](#hashset_add)
+ - [hashset_contains](#hashset_contains)
+ - [hashset_to_array](#hashset_to_array)
+ - [hashset_to_sorted_array](#hashset_to_sorted_array)
+ - [hashset_cardinality](#hashset_cardinality)
+ - [hashset_capacity](#hashset_capacity)
+ - [hashset_max_collisions](#hashset_max_collisions)
+ - [hashset_union](#hashset_union)
+ - [hashset_intersection](#hashset_intersection)
+ - [hashset_difference](#hashset_difference)
+ - [hashset_symmetric_difference](#hashset_symmetric_difference)
+4. [Aggregation Functions](#aggregation-functions)
+5. [Operators](#operators)
+6. [Hashset Hash Operators](#hashset-hash-operators)
+7. [Hashset Btree Operators](#hashset-btree-operators)
+8. [Limitations](#limitations)
+9. [Installation](#installation)
+10. [License](#license)
+
+## Version
+
+0.0.1
+
+🚧 **NOTICE** 🚧 This repository is currently under active development and the hashset
+PostgreSQL extension is **not production-ready**. As the codebase is evolving
+with possible breaking changes, we are not providing any migration scripts
+until we reach our first release.
+
+
+## Data types
+
+### int4hashset
+
+This data type represents a set of integers. Internally, it uses a combination
+of a bitmap and a value array to store the elements in a set. It's a
+variable-length type.
+
+
+## Functions
+
+### int4hashset()
+
+`int4hashset([capacity int, load_factor float4, growth_factor float4, hashfn_id int4]) -> int4hashset`
+
+Initialize an empty int4hashset with optional parameters.
+ - `capacity` specifies the initial capacity, which is zero by default.
+ - `load_factor` represents the threshold for resizing the hashset and defaults to 0.75.
+ - `growth_factor` is the multiplier for resizing and defaults to 2.0.
+ - `hashfn_id` represents the hash function used.
+ - 1=Jenkins/lookup3 (default)
+ - 2=MurmurHash32
+ - 3=Naive hash function
+
+
+### hashset_add()
+
+`hashset_add(int4hashset, int) -> int4hashset`
+
+Adds an integer to an int4hashset.
+
+```sql
+SELECT hashset_add(NULL, 1); -- {1}
+SELECT hashset_add('{NULL}', 1); -- {1,NULL}
+SELECT hashset_add('{1}', NULL); -- {1,NULL}
+SELECT hashset_add('{1}', 1); -- {1}
+SELECT hashset_add('{1}', 2); -- {1,2}
+```
+
+
+### hashset_contains()
+
+`hashset_contains(int4hashset, int) -> boolean`
+
+Checks if an int4hashset contains a given integer.
+
+```sql
+SELECT hashset_contains('{1}', 1); -- TRUE
+SELECT hashset_contains('{1}', 2); -- FALSE
+```
+
+If the *cardinality* of the hashset is zero (0), it is known that it doesn't
+contain any value, not even an Unknown value represented as `NULL`, so even in
+that case it returns `FALSE`.
+
+```sql
+SELECT hashset_contains('{}', 1); -- FALSE
+SELECT hashset_contains('{}', NULL); -- FALSE
+```
+
+If the hashset is `NULL`, then the result is `NULL`.
+
+```sql
+SELECT hashset_contains(NULL, NULL); -- NULL
+SELECT hashset_contains(NULL, 1); -- NULL
+```
+
+
+### hashset_to_array()
+
+`hashset_to_array(int4hashset) -> int[]`
+
+Converts an int4hashset to an array of unsorted integers.
+
+```sql
+SELECT hashset_to_array('{2,1,3}'); -- {3,2,1}
+```
+
+
+### hashset_to_sorted_array()
+
+`hashset_to_sorted_array(int4hashset) -> int[]`
+
+Converts an int4hashset to an array of sorted integers.
+
+```sql
+SELECT hashset_to_sorted_array('{2,1,3}'); -- {1,2,3}
+```
+
+If the hashset contains a `NULL` element, it follows the same behavior as the
+`ORDER BY` clause in SQL: the `NULL` element is positioned at the end of the
+sorted array.
+
+```sql
+SELECT hashset_to_sorted_array('{2,1,NULL,3}'); -- {1,2,3,NULL}
+```
+
+
+### hashset_cardinality()
+
+`hashset_cardinality(int4hashset) -> bigint`
+
+Returns the number of elements in an int4hashset.
+
+```sql
+SELECT hashset_cardinality(NULL); -- NULL
+SELECT hashset_cardinality('{}'); -- 0
+SELECT hashset_cardinality('{1}'); -- 1
+SELECT hashset_cardinality('{1,1}'); -- 1
+SELECT hashset_cardinality('{NULL,NULL}'); -- 1
+SELECT hashset_cardinality('{1,NULL}'); -- 2
+SELECT hashset_cardinality('{1,2,3}'); -- 3
+```
+
+
+### hashset_capacity()
+
+`hashset_capacity(int4hashset) -> bigint`
+
+Returns the current capacity of an int4hashset.
+
+
+### hashset_max_collisions()
+
+`hashset_max_collisions(int4hashset) -> bigint`
+
+Returns the maximum number of collisions that have occurred for a single element
+
+
+### hashset_union()
+
+`hashset_union(int4hashset, int4hashset) -> int4hashset`
+
+Merges two int4hashsets into a new int4hashset.
+
+```sql
+SELECT hashset_union('{1,2}', '{2,3}'); -- '{1,2,3}
+```
+
+If any of the operands are `NULL`, the result is `NULL`.
+
+```sql
+SELECT hashset_union('{1}', NULL); -- NULL
+SELECT hashset_union(NULL, '{1}'); -- NULL
+```
+
+
+### hashset_intersection()
+
+`hashset_intersection(int4hashset, int4hashset) -> int4hashset`
+
+Returns a new int4hashset that is the intersection of the two input sets.
+
+```sql
+SELECT hashset_intersection('{1,2}', '{2,3}'); -- {2}
+SELECT hashset_intersection('{1,2,NULL}', '{2,3,NULL}'); -- {2,NULL}
+```
+
+If any of the operands are `NULL`, the result is `NULL`.
+
+```sql
+SELECT hashset_intersection('{1,2}', NULL); -- NULL
+SELECT hashset_intersection(NULL, '{2,3}'); -- NULL
+```
+
+
+### hashset_difference()
+
+`hashset_difference(int4hashset, int4hashset) -> int4hashset`
+
+Returns a new int4hashset that contains the elements present in the first set
+but not in the second set.
+
+```sql
+SELECT hashset_difference('{1,2}', '{2,3}'); -- {1}
+SELECT hashset_difference('{1,2,NULL}', '{2,3,NULL}'); -- {1}
+SELECT hashset_difference('{1,2,NULL}', '{2,3}'); -- {1,NULL}
+```
+
+If any of the operands are `NULL`, the result is `NULL`.
+
+```sql
+SELECT hashset_difference('{1,2}', NULL); -- NULL
+SELECT hashset_difference(NULL, '{2,3}'); -- NULL
+```
+
+
+### hashset_symmetric_difference()
+
+`hashset_symmetric_difference(int4hashset, int4hashset) -> int4hashset`
+
+Returns a new int4hashset containing elements that are in either of the input sets, but not in their intersection.
+
+```sql
+SELECT hashset_symmetric_difference('{1,2}', '{2,3}'); -- {1,3}
+SELECT hashset_symmetric_difference('{1,2,NULL}', '{2,3,NULL}'); -- {1,3}
+SELECT hashset_symmetric_difference('{1,2,NULL}', '{2,3}'); -- {1,3,NULL}
+```
+
+If any of the operands are `NULL`, the result is `NULL`.
+
+```sql
+SELECT hashset_symmetric_difference('{1,2}', NULL); -- NULL
+SELECT hashset_symmetric_difference(NULL, '{2,3}'); -- NULL
+```
+
+
+## Aggregation Functions
+
+### hashset_agg(int4)
+
+`hashset_agg(int4) -> int4hashset`
+
+Aggregate integers into a hashset.
+
+```sql
+SELECT hashset_agg(some_int4_column) FROM some_table;
+```
+
+
+### hashset_agg(int4hashset)
+
+`hashset_agg(int4hashset) -> int4hashset`
+
+Aggregate hashsets into a hashset.
+
+```sql
+SELECT hashset_agg(some_int4hashset_column) FROM some_table;
+```
+
+
+## Operators
+
+- Equality (`=`): Checks if two hashsets are equal.
+- Inequality (`<>`): Checks if two hashsets are not equal.
+
+
+## Hashset Hash Operators
+
+- `hashset_hash(int4hashset) -> integer`: Returns the hash value of an int4hashset.
+
+
+## Hashset Btree Operators
+
+- `<`, `<=`, `>`, `>=`: Comparison operators for hashsets.
+
+
+## Limitations
+
+- The `int4hashset` data type currently supports integers within the range of int4
+(-2147483648 to 2147483647).
+
+
+## Installation
+
+To install the extension on any platform, follow these general steps:
+
+1. Ensure you have PostgreSQL installed on your system, including the development files.
+2. Clone the repository.
+3. Navigate to the cloned repository directory.
+4. Compile the extension using `make`.
+5. Install the extension using `sudo make install`.
+6. Run the tests using `make installcheck` (optional).
+
+To use a different PostgreSQL installation, point configure to a different `pg_config`, using following command:
+```sh
+make PG_CONFIG=/else/where/pg_config
+sudo make install PG_CONFIG=/else/where/pg_config
+```
+
+In your PostgreSQL connection, enable the hashset extension using the following SQL command:
+```sql
+CREATE EXTENSION hashset;
+```
+
+This extension requires PostgreSQL version ?.? or later.
+
+For Ubuntu 22.04.1 LTS, you would run the following commands:
+
+```sh
+sudo apt install postgresql-15 postgresql-server-dev-15 postgresql-client-15
+git clone https://github.com/tvondra/hashset.git
+cd hashset
+make
+sudo make install
+make installcheck
+```
+
+Please note that this project is currently under active development and is not yet considered production-ready.
+
+## License
+
+This software is distributed under the terms of PostgreSQL license.
+See LICENSE or http://www.opensource.org/licenses/bsd-license.php for
+more details.
diff --git a/benchmark/.gitignore b/benchmark/.gitignore
new file mode 100644
index 0000000..f3c4e1c
--- /dev/null
+++ b/benchmark/.gitignore
@@ -0,0 +1,2 @@
+soc-pokec-relationships.txt.gz
+soc-pokec-relationships.txt
diff --git a/benchmark/friends_of_friends.sh b/benchmark/friends_of_friends.sh
new file mode 100755
index 0000000..d4570c0
--- /dev/null
+++ b/benchmark/friends_of_friends.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+if [ ! -f "soc-pokec-relationships.txt" ]; then
+ wget https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz
+ gunzip soc-pokec-relationships.txt.gz
+fi
+
+psql -X -c "CREATE TABLE edges (from_node INT, to_node INT);"
+psql -X -c "\COPY edges FROM soc-pokec-relationships.txt;"
+psql -X -c "ALTER TABLE edges ADD PRIMARY KEY (from_node, to_node);"
+psql -X -f friends_of_friends.sql
diff --git a/benchmark/friends_of_friends.sql b/benchmark/friends_of_friends.sql
new file mode 100644
index 0000000..ecc8f0e
--- /dev/null
+++ b/benchmark/friends_of_friends.sql
@@ -0,0 +1,72 @@
+CREATE EXTENSION IF NOT EXISTS hashset;
+
+\timing on
+
+CREATE OR REPLACE VIEW vfriends_of_friends_array_agg_distinct AS
+WITH RECURSIVE friends_of_friends AS (
+ SELECT
+ ARRAY[5867::bigint] AS current,
+ 0 AS depth
+ UNION ALL
+ SELECT
+ new_current,
+ friends_of_friends.depth + 1
+ FROM
+ friends_of_friends
+ CROSS JOIN LATERAL (
+ SELECT
+ array_agg(DISTINCT edges.to_node) AS new_current
+ FROM
+ edges
+ WHERE
+ from_node = ANY(friends_of_friends.current)
+ ) q
+ WHERE
+ friends_of_friends.depth < 3
+)
+SELECT
+ COALESCE(array_length(current, 1), 0) AS count_friends_at_depth_3
+FROM
+ friends_of_friends
+WHERE
+ depth = 3;
+
+CREATE OR REPLACE VIEW vfriends_of_friends_hashset_agg AS
+WITH RECURSIVE friends_of_friends AS
+(
+ SELECT
+ '{5867}'::int4hashset AS current,
+ 0 AS depth
+ UNION ALL
+ SELECT
+ new_current,
+ friends_of_friends.depth + 1
+ FROM
+ friends_of_friends
+ CROSS JOIN LATERAL
+ (
+ SELECT
+ hashset_agg(edges.to_node) AS new_current
+ FROM
+ edges
+ WHERE
+ from_node = ANY(hashset_to_array(friends_of_friends.current))
+ ) q
+ WHERE
+ friends_of_friends.depth < 3
+)
+SELECT
+ depth,
+ hashset_cardinality(current)
+FROM
+ friends_of_friends
+WHERE
+ depth = 3;
+
+SELECT * FROM vfriends_of_friends_array_agg_distinct;
+SELECT * FROM vfriends_of_friends_array_agg_distinct;
+SELECT * FROM vfriends_of_friends_array_agg_distinct;
+
+SELECT * FROM vfriends_of_friends_hashset_agg;
+SELECT * FROM vfriends_of_friends_hashset_agg;
+SELECT * FROM vfriends_of_friends_hashset_agg;
diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
new file mode 100644
index 0000000..d0478ce
--- /dev/null
+++ b/hashset--0.0.1.sql
@@ -0,0 +1,303 @@
+/*
+ * Hashset Type Definition
+ */
+
+CREATE TYPE int4hashset;
+
+CREATE OR REPLACE FUNCTION int4hashset_in(cstring)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_in'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_out(int4hashset)
+RETURNS cstring
+AS 'hashset', 'int4hashset_out'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_send(int4hashset)
+RETURNS bytea
+AS 'hashset', 'int4hashset_send'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_recv(internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_recv'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE TYPE int4hashset (
+ INPUT = int4hashset_in,
+ OUTPUT = int4hashset_out,
+ RECEIVE = int4hashset_recv,
+ SEND = int4hashset_send,
+ INTERNALLENGTH = variable,
+ STORAGE = external
+);
+
+/*
+ * Hashset Functions
+ */
+
+CREATE OR REPLACE FUNCTION int4hashset(
+ capacity int DEFAULT 0,
+ load_factor float4 DEFAULT 0.75,
+ growth_factor float4 DEFAULT 2.0,
+ hashfn_id int DEFAULT 1
+)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_init'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_add(int4hashset, int)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_add'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int)
+RETURNS boolean
+AS 'hashset', 'int4hashset_contains'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION hashset_union(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_union'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_to_array(int4hashset)
+RETURNS int[]
+AS 'hashset', 'int4hashset_to_array'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_to_sorted_array(int4hashset)
+RETURNS int[]
+AS 'hashset', 'int4hashset_to_sorted_array'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_cardinality(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_cardinality'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_capacity(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_capacity'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_collisions(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_collisions'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_max_collisions(int4hashset)
+RETURNS bigint
+AS 'hashset', 'int4hashset_max_collisions'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4_add_int4hashset(int4, int4hashset)
+RETURNS int4hashset
+AS $$SELECT $2 || $1$$
+LANGUAGE SQL
+IMMUTABLE PARALLEL SAFE STRICT COST 1;
+
+CREATE OR REPLACE FUNCTION hashset_intersection(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_intersection'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_difference(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_difference'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_symmetric_difference(int4hashset, int4hashset)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_symmetric_difference'
+LANGUAGE C IMMUTABLE STRICT;
+
+/*
+ * Aggregation Functions
+ */
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_add(p_pointer internal, p_value int)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_add'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_agg_final'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_combine'
+LANGUAGE C IMMUTABLE;
+
+CREATE AGGREGATE hashset_agg(int) (
+ SFUNC = int4hashset_agg_add,
+ STYPE = internal,
+ FINALFUNC = int4hashset_agg_final,
+ COMBINEFUNC = int4hashset_agg_combine,
+ PARALLEL = SAFE
+);
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_add_set(p_pointer internal, p_value int4hashset)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_add_set'
+LANGUAGE C IMMUTABLE;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_final(p_pointer internal)
+RETURNS int4hashset
+AS 'hashset', 'int4hashset_agg_final'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION int4hashset_agg_combine(p_pointer internal, p_pointer2 internal)
+RETURNS internal
+AS 'hashset', 'int4hashset_agg_combine'
+LANGUAGE C IMMUTABLE;
+
+CREATE AGGREGATE hashset_agg(int4hashset) (
+ SFUNC = int4hashset_agg_add_set,
+ STYPE = internal,
+ FINALFUNC = int4hashset_agg_final,
+ COMBINEFUNC = int4hashset_agg_combine,
+ PARALLEL = SAFE
+);
+
+/*
+ * Operator Definitions
+ */
+
+CREATE OR REPLACE FUNCTION hashset_eq(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_eq'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR = (
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ PROCEDURE = hashset_eq,
+ COMMUTATOR = =,
+ HASHES
+);
+
+CREATE OR REPLACE FUNCTION hashset_ne(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_ne'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR <> (
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ PROCEDURE = hashset_ne,
+ COMMUTATOR = '<>',
+ NEGATOR = '=',
+ RESTRICT = neqsel,
+ JOIN = neqjoinsel,
+ HASHES
+);
+
+CREATE OPERATOR || (
+ leftarg = int4hashset,
+ rightarg = int4,
+ function = hashset_add,
+ commutator = ||
+);
+
+CREATE OPERATOR || (
+ leftarg = int4,
+ rightarg = int4hashset,
+ function = int4_add_int4hashset,
+ commutator = ||
+);
+
+/*
+ * Hashset Hash Operators
+ */
+
+CREATE OR REPLACE FUNCTION hashset_hash(int4hashset)
+RETURNS integer
+AS 'hashset', 'int4hashset_hash'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR CLASS int4hashset_hash_ops
+DEFAULT FOR TYPE int4hashset USING hash AS
+OPERATOR 1 = (int4hashset, int4hashset),
+FUNCTION 1 hashset_hash(int4hashset);
+
+/*
+ * Hashset Btree Operators
+ */
+
+CREATE OR REPLACE FUNCTION hashset_lt(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_lt'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_le(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_le'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_gt(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_gt'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_ge(int4hashset, int4hashset)
+RETURNS boolean
+AS 'hashset', 'int4hashset_ge'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION hashset_cmp(int4hashset, int4hashset)
+RETURNS integer
+AS 'hashset', 'int4hashset_cmp'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OPERATOR < (
+ PROCEDURE = hashset_lt,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = >,
+ NEGATOR = >=,
+ RESTRICT = scalarltsel,
+ JOIN = scalarltjoinsel
+);
+
+CREATE OPERATOR <= (
+ PROCEDURE = hashset_le,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '>=',
+ NEGATOR = '>',
+ RESTRICT = scalarltsel,
+ JOIN = scalarltjoinsel
+);
+
+CREATE OPERATOR > (
+ PROCEDURE = hashset_gt,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '<',
+ NEGATOR = '<=',
+ RESTRICT = scalargtsel,
+ JOIN = scalargtjoinsel
+);
+
+CREATE OPERATOR >= (
+ PROCEDURE = hashset_ge,
+ LEFTARG = int4hashset,
+ RIGHTARG = int4hashset,
+ COMMUTATOR = '<=',
+ NEGATOR = '<',
+ RESTRICT = scalargtsel,
+ JOIN = scalargtjoinsel
+);
+
+CREATE OPERATOR CLASS int4hashset_btree_ops
+DEFAULT FOR TYPE int4hashset USING btree AS
+OPERATOR 1 < (int4hashset, int4hashset),
+OPERATOR 2 <= (int4hashset, int4hashset),
+OPERATOR 3 = (int4hashset, int4hashset),
+OPERATOR 4 >= (int4hashset, int4hashset),
+OPERATOR 5 > (int4hashset, int4hashset),
+FUNCTION 1 hashset_cmp(int4hashset, int4hashset);
diff --git a/hashset-api.c b/hashset-api.c
new file mode 100644
index 0000000..2b92363
--- /dev/null
+++ b/hashset-api.c
@@ -0,0 +1,1042 @@
+#include "hashset.h"
+
+#include <math.h>
+#include <sys/time.h>
+#include <unistd.h>
+#include <limits.h>
+
+#define PG_GETARG_INT4HASHSET(x) (int4hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(x))
+#define PG_GETARG_INT4HASHSET_COPY(x) (int4hashset_t *) PG_DETOAST_DATUM_COPY(PG_GETARG_DATUM(x))
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(int4hashset_in);
+PG_FUNCTION_INFO_V1(int4hashset_out);
+PG_FUNCTION_INFO_V1(int4hashset_send);
+PG_FUNCTION_INFO_V1(int4hashset_recv);
+PG_FUNCTION_INFO_V1(int4hashset_add);
+PG_FUNCTION_INFO_V1(int4hashset_contains);
+PG_FUNCTION_INFO_V1(int4hashset_cardinality);
+PG_FUNCTION_INFO_V1(int4hashset_union);
+PG_FUNCTION_INFO_V1(int4hashset_init);
+PG_FUNCTION_INFO_V1(int4hashset_capacity);
+PG_FUNCTION_INFO_V1(int4hashset_collisions);
+PG_FUNCTION_INFO_V1(int4hashset_max_collisions);
+PG_FUNCTION_INFO_V1(int4hashset_agg_add);
+PG_FUNCTION_INFO_V1(int4hashset_agg_add_set);
+PG_FUNCTION_INFO_V1(int4hashset_agg_final);
+PG_FUNCTION_INFO_V1(int4hashset_agg_combine);
+PG_FUNCTION_INFO_V1(int4hashset_to_array);
+PG_FUNCTION_INFO_V1(int4hashset_to_sorted_array);
+PG_FUNCTION_INFO_V1(int4hashset_eq);
+PG_FUNCTION_INFO_V1(int4hashset_ne);
+PG_FUNCTION_INFO_V1(int4hashset_hash);
+PG_FUNCTION_INFO_V1(int4hashset_lt);
+PG_FUNCTION_INFO_V1(int4hashset_le);
+PG_FUNCTION_INFO_V1(int4hashset_gt);
+PG_FUNCTION_INFO_V1(int4hashset_ge);
+PG_FUNCTION_INFO_V1(int4hashset_cmp);
+PG_FUNCTION_INFO_V1(int4hashset_intersection);
+PG_FUNCTION_INFO_V1(int4hashset_difference);
+PG_FUNCTION_INFO_V1(int4hashset_symmetric_difference);
+
+Datum int4hashset_in(PG_FUNCTION_ARGS);
+Datum int4hashset_out(PG_FUNCTION_ARGS);
+Datum int4hashset_send(PG_FUNCTION_ARGS);
+Datum int4hashset_recv(PG_FUNCTION_ARGS);
+Datum int4hashset_add(PG_FUNCTION_ARGS);
+Datum int4hashset_contains(PG_FUNCTION_ARGS);
+Datum int4hashset_cardinality(PG_FUNCTION_ARGS);
+Datum int4hashset_union(PG_FUNCTION_ARGS);
+Datum int4hashset_init(PG_FUNCTION_ARGS);
+Datum int4hashset_capacity(PG_FUNCTION_ARGS);
+Datum int4hashset_collisions(PG_FUNCTION_ARGS);
+Datum int4hashset_max_collisions(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_add(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_add_set(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_final(PG_FUNCTION_ARGS);
+Datum int4hashset_agg_combine(PG_FUNCTION_ARGS);
+Datum int4hashset_to_array(PG_FUNCTION_ARGS);
+Datum int4hashset_to_sorted_array(PG_FUNCTION_ARGS);
+Datum int4hashset_eq(PG_FUNCTION_ARGS);
+Datum int4hashset_ne(PG_FUNCTION_ARGS);
+Datum int4hashset_hash(PG_FUNCTION_ARGS);
+Datum int4hashset_lt(PG_FUNCTION_ARGS);
+Datum int4hashset_le(PG_FUNCTION_ARGS);
+Datum int4hashset_gt(PG_FUNCTION_ARGS);
+Datum int4hashset_ge(PG_FUNCTION_ARGS);
+Datum int4hashset_cmp(PG_FUNCTION_ARGS);
+Datum int4hashset_intersection(PG_FUNCTION_ARGS);
+Datum int4hashset_difference(PG_FUNCTION_ARGS);
+Datum int4hashset_symmetric_difference(PG_FUNCTION_ARGS);
+
+Datum
+int4hashset_in(PG_FUNCTION_ARGS)
+{
+ char *str = PG_GETARG_CSTRING(0);
+ char *endptr;
+ int32 len = strlen(str);
+ int4hashset_t *set;
+ int64 value;
+
+ /* Skip initial spaces */
+ while (hashset_isspace(*str)) str++;
+
+ /* Check the opening brace */
+ if (*str != '{')
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for hashset: \"%s\"", str),
+ errdetail("Hashset representation must start with \"{\".")));
+ }
+
+ /* Start parsing from the first number (after the opening brace) */
+ str++;
+
+ /* Initial size based on input length (arbitrary, could be optimized) */
+ set = int4hashset_allocate(
+ len/2,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ while (true)
+ {
+ /* Skip spaces before number */
+ while (hashset_isspace(*str)) str++;
+
+ /* Check for closing brace, handling the case for an empty set */
+ if (*str == '}')
+ {
+ str++; /* Move past the closing brace */
+ break;
+ }
+
+ /* Check if "null" is encountered (case-insensitive) */
+ if (strncasecmp(str, "null", 4) == 0)
+ {
+ set->null_element = true;
+ str = str + 4; /* Move past "null" */
+ }
+ else
+ {
+ /* Parse the number */
+ value = strtol(str, &endptr, 10);
+
+ if (errno == ERANGE || value < PG_INT32_MIN || value > PG_INT32_MAX)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value \"%s\" is out of range for type %s", str,
+ "integer")));
+ }
+
+ /* Add the value to the hashset, resize if needed */
+ if (set->nelements >= set->capacity)
+ {
+ set = int4hashset_resize(set);
+ }
+ set = int4hashset_add_element(set, (int32)value);
+
+ /* Error handling for strtol */
+ if (endptr == str)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for integer: \"%s\"", str)));
+ }
+
+ str = endptr; /* Move to next number, "null" or closing brace */
+ }
+
+ /* Skip spaces before the next number or closing brace */
+ while (hashset_isspace(*str)) str++;
+
+ if (*str == ',')
+ {
+ str++; /* Skip comma before next loop iteration */
+ }
+ else if (*str != '}')
+ {
+ /* Unexpected character */
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("unexpected character \"%c\" in hashset input", *str)));
+ }
+ }
+
+ /* Only whitespace is allowed after the closing brace */
+ while (*str)
+ {
+ if (!hashset_isspace(*str))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed hashset literal: \"%s\"", str),
+ errdetail("Junk after closing right brace.")));
+ }
+ str++;
+ }
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_out(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ char *bitmap = HASHSET_GET_BITMAP(set);
+ int32 *values = HASHSET_GET_VALUES(set);
+ StringInfoData str;
+
+ /* Initialize the StringInfo buffer */
+ initStringInfo(&str);
+
+ /* Append the opening brace for the output hashset string */
+ appendStringInfoChar(&str, '{');
+
+ /* Loop through the elements and append them to the string */
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ /* Check if the bit in the bitmap is set */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Append the value */
+ if (str.len > 1)
+ appendStringInfoChar(&str, ',');
+ appendStringInfo(&str, "%d", values[i]);
+ }
+ }
+
+ /* Check if the null_element field is set */
+ if (set->null_element)
+ {
+ if (str.len > 1)
+ appendStringInfoChar(&str, ',');
+ appendStringInfoString(&str, "NULL");
+ }
+
+ /* Append the closing brace for the output hashset string */
+ appendStringInfoChar(&str, '}');
+
+ /* Return the resulting string */
+ PG_RETURN_CSTRING(str.data);
+}
+
+Datum
+int4hashset_send(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ StringInfoData buf;
+ int32 data_size;
+ int version = 1;
+
+ /* Begin constructing the message */
+ pq_begintypsend(&buf);
+
+ /* Send the version number */
+ pq_sendint8(&buf, version);
+
+ /* Send the non-data fields */
+ pq_sendint32(&buf, set->flags);
+ pq_sendint32(&buf, set->capacity);
+ pq_sendint32(&buf, set->nelements);
+ pq_sendint32(&buf, set->hashfn_id);
+ pq_sendfloat4(&buf, set->load_factor);
+ pq_sendfloat4(&buf, set->growth_factor);
+ pq_sendint32(&buf, set->ncollisions);
+ pq_sendint32(&buf, set->max_collisions);
+ pq_sendint32(&buf, set->hash);
+ pq_sendbyte(&buf, set->null_element ? 1 : 0);
+
+ /* Compute and send the size of the data field */
+ data_size = VARSIZE(set) - offsetof(int4hashset_t, data);
+ pq_sendbytes(&buf, set->data, data_size);
+
+ PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+}
+
+Datum
+int4hashset_recv(PG_FUNCTION_ARGS)
+{
+ StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
+ int4hashset_t *set;
+ int32 data_size;
+ Size total_size;
+ const char *binary_data;
+ int version;
+ int32 flags;
+ int32 capacity;
+ int32 nelements;
+ int32 hashfn_id;
+ float4 load_factor;
+ float4 growth_factor;
+ int32 ncollisions;
+ int32 max_collisions;
+ int32 hash;
+ bool null_element;
+
+ version = pq_getmsgint(buf, 1);
+ if (version != 1)
+ elog(ERROR, "unsupported hashset version number %d", version);
+
+ /* Read fields from buffer */
+ flags = pq_getmsgint(buf, 4);
+ capacity = pq_getmsgint(buf, 4);
+ nelements = pq_getmsgint(buf, 4);
+ hashfn_id = pq_getmsgint(buf, 4);
+ load_factor = pq_getmsgfloat4(buf);
+ growth_factor = pq_getmsgfloat4(buf);
+ ncollisions = pq_getmsgint(buf, 4);
+ max_collisions = pq_getmsgint(buf, 4);
+ hash = pq_getmsgint(buf, 4);
+ null_element = pq_getmsgbyte(buf) == 1;
+
+ /* Compute the size of the data field */
+ data_size = buf->len - buf->cursor;
+
+ /* Read the binary data */
+ binary_data = pq_getmsgbytes(buf, data_size);
+
+ /* Make sure that there is no extra data left in the message */
+ pq_getmsgend(buf);
+
+ /* Compute total size of hashset_t */
+ total_size = offsetof(int4hashset_t, data) + data_size;
+
+ /* Allocate memory for hashset including the data field */
+ set = (int4hashset_t *) palloc0(total_size);
+
+ /* Set the size of the variable-length data structure */
+ SET_VARSIZE(set, total_size);
+
+ /* Populate the structure */
+ set->flags = flags;
+ set->capacity = capacity;
+ set->nelements = nelements;
+ set->hashfn_id = hashfn_id;
+ set->load_factor = load_factor;
+ set->growth_factor = growth_factor;
+ set->ncollisions = ncollisions;
+ set->max_collisions = max_collisions;
+ set->hash = hash;
+ set->null_element = null_element;
+ memcpy(set->data, binary_data, data_size);
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_add(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ /* If there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ set = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ }
+ else
+ {
+ set = PG_GETARG_INT4HASHSET_COPY(0);
+ }
+
+ if (PG_ARGISNULL(1))
+ {
+ set->null_element = true;
+ }
+ else
+ {
+ int32 element = PG_GETARG_INT32(1);
+ set = int4hashset_add_element(set, element);
+ }
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_contains(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ int32 value;
+ bool result;
+
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ if (set->nelements == 0 && !set->null_element)
+ PG_RETURN_BOOL(false);
+
+ if (PG_ARGISNULL(1))
+ PG_RETURN_NULL();
+
+ value = PG_GETARG_INT32(1);
+ result = int4hashset_contains_element(set, value);
+
+ if (!result && set->null_element)
+ PG_RETURN_NULL();
+
+ PG_RETURN_BOOL(result);
+}
+
+Datum
+int4hashset_cardinality(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ int64 cardinality = set->nelements + (set->null_element ? 1 : 0);
+
+ PG_RETURN_INT64(cardinality);
+}
+
+Datum
+int4hashset_union(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET_COPY(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
+ char *bitmap = HASHSET_GET_BITMAP(setb);
+ int32 *values = HASHSET_GET_VALUES(setb);
+
+ for (i = 0; i < setb->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ seta = int4hashset_add_element(seta, values[i]);
+ }
+
+ if (!seta->null_element && setb->null_element)
+ seta->null_element = true;
+
+ PG_RETURN_POINTER(seta);
+}
+
+Datum
+int4hashset_init(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ int32 initial_capacity = PG_GETARG_INT32(0);
+ float4 load_factor = PG_GETARG_FLOAT4(1);
+ float4 growth_factor = PG_GETARG_FLOAT4(2);
+ int32 hashfn_id = PG_GETARG_INT32(3);
+
+ /* Validate input arguments */
+ if (!(initial_capacity >= 0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("initial capacity cannot be negative")));
+ }
+
+ if (!(load_factor > 0.0 && load_factor < 1.0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("load factor must be between 0.0 and 1.0")));
+ }
+
+ if (!(growth_factor > 1.0))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("growth factor must be greater than 1.0")));
+ }
+
+ if (!(hashfn_id == JENKINS_LOOKUP3_HASHFN_ID ||
+ hashfn_id == MURMURHASH32_HASHFN_ID ||
+ hashfn_id == NAIVE_HASHFN_ID))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Invalid hash function ID")));
+ }
+
+ set = int4hashset_allocate(
+ initial_capacity,
+ load_factor,
+ growth_factor,
+ hashfn_id
+ );
+
+ PG_RETURN_POINTER(set);
+}
+
+Datum
+int4hashset_capacity(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->capacity);
+}
+
+Datum
+int4hashset_collisions(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->ncollisions);
+}
+
+Datum
+int4hashset_max_collisions(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT64(set->max_collisions);
+}
+
+Datum
+int4hashset_agg_add(PG_FUNCTION_ARGS)
+{
+ MemoryContext aggcontext;
+ MemoryContext oldcontext;
+ int4hashset_t *state;
+
+ /* cannot be called directly because of internal-type argument */
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_add_add called in non-aggregate context");
+
+ /*
+ * We want to skip NULL values altogether - we return either the existing
+ * hashset (if it already exists) or NULL.
+ */
+ if (PG_ARGISNULL(1))
+ {
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ /* if there already is a state accumulated, don't forget it */
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+ }
+
+ /* if there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ state = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_add_element(state, PG_GETARG_INT32(1));
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(state);
+}
+
+Datum
+int4hashset_agg_add_set(PG_FUNCTION_ARGS)
+{
+ MemoryContext aggcontext;
+ MemoryContext oldcontext;
+ int4hashset_t *state;
+
+ /* cannot be called directly because of internal-type argument */
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_add_add called in non-aggregate context");
+
+ /*
+ * We want to skip NULL values altogether - we return either the existing
+ * hashset (if it already exists) or NULL.
+ */
+ if (PG_ARGISNULL(1))
+ {
+ if (PG_ARGISNULL(0))
+ PG_RETURN_NULL();
+
+ /* if there already is a state accumulated, don't forget it */
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+ }
+
+ /* if there's no hashset allocated, create it now */
+ if (PG_ARGISNULL(0))
+ {
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ state = int4hashset_allocate(
+ DEFAULT_INITIAL_CAPACITY,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ state = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+
+ {
+ int i;
+ int4hashset_t *value = PG_GETARG_INT4HASHSET(1);
+ char *bitmap = HASHSET_GET_BITMAP(value);
+ int32 *values = HASHSET_GET_VALUES(value);
+
+ for (i = 0; i < value->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ state = int4hashset_add_element(state, values[i]);
+ }
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(state);
+}
+
+Datum
+int4hashset_agg_final(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_POINTER(PG_GETARG_POINTER(0));
+}
+
+Datum
+int4hashset_agg_combine(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *src;
+ int4hashset_t *dst;
+ MemoryContext aggcontext;
+ MemoryContext oldcontext;
+ char *bitmap;
+ int32 *values;
+
+ if (!AggCheckCallContext(fcinfo, &aggcontext))
+ elog(ERROR, "hashset_agg_combine called in non-aggregate context");
+
+ /* if no "merged" state yet, try creating it */
+ if (PG_ARGISNULL(0))
+ {
+ /* nope, the second argument is NULL to, so return NULL */
+ if (PG_ARGISNULL(1))
+ PG_RETURN_NULL();
+
+ /* the second argument is not NULL, so copy it */
+ src = (int4hashset_t *) PG_GETARG_POINTER(1);
+
+ /* copy the hashset into the right long-lived memory context */
+ oldcontext = MemoryContextSwitchTo(aggcontext);
+ src = int4hashset_copy(src);
+ MemoryContextSwitchTo(oldcontext);
+
+ PG_RETURN_POINTER(src);
+ }
+
+ /*
+ * If the second argument is NULL, just return the first one (we know
+ * it's not NULL at this point).
+ */
+ if (PG_ARGISNULL(1))
+ PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+
+ /* Now we know neither argument is NULL, so merge them. */
+ src = (int4hashset_t *) PG_GETARG_POINTER(1);
+ dst = (int4hashset_t *) PG_GETARG_POINTER(0);
+
+ bitmap = HASHSET_GET_BITMAP(src);
+ values = HASHSET_GET_VALUES(src);
+
+ for (i = 0; i < src->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ dst = int4hashset_add_element(dst, values[i]);
+ }
+
+
+ PG_RETURN_POINTER(dst);
+}
+
+Datum
+int4hashset_to_array(PG_FUNCTION_ARGS)
+{
+ int i,
+ idx;
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ int32 *values;
+ int nvalues;
+ char *sbitmap;
+ int32 *svalues;
+
+ /* if hashset is empty and does not contain null, return an empty array */
+ if(set->nelements == 0 && !set->null_element)
+ PG_RETURN_ARRAYTYPE_P(construct_empty_array(INT4OID));
+
+ sbitmap = HASHSET_GET_BITMAP(set);
+ svalues = HASHSET_GET_VALUES(set);
+
+ /* number of values to store in the array */
+ nvalues = set->nelements;
+ values = (int32 *) palloc(sizeof(int32) * nvalues);
+
+ idx = 0;
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (sbitmap[byte] & (0x01 << bit))
+ values[idx++] = svalues[i];
+ }
+
+ Assert(idx == nvalues);
+
+ return int32_to_array(fcinfo, values, nvalues, set->null_element);
+}
+
+Datum
+int4hashset_to_sorted_array(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set;
+ int32 *values;
+ int nvalues;
+
+ set = PG_GETARG_INT4HASHSET(0);
+
+ /* if hashset is empty and does not contain null, return an empty array */
+ if(set->nelements == 0 && !set->null_element)
+ PG_RETURN_ARRAYTYPE_P(construct_empty_array(INT4OID));
+
+ /* extract the sorted elements from the hashset */
+ values = int4hashset_extract_sorted_elements(set);
+
+ /* number of values to store in the array */
+ nvalues = set->nelements;
+
+ return int32_to_array(fcinfo, values, nvalues, set->null_element);
+}
+
+Datum
+int4hashset_eq(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ char *bitmap_a;
+ int32 *values_a;
+
+ /*
+ * Check if the number of elements is the same
+ */
+ if (a->nelements != b->nelements)
+ PG_RETURN_BOOL(false);
+
+ bitmap_a = HASHSET_GET_BITMAP(a);
+ values_a = HASHSET_GET_VALUES(a);
+
+ /*
+ * Check if every element in a is also in b
+ */
+ for (i = 0; i < a->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap_a[byte] & (0x01 << bit))
+ {
+ int32 value = values_a[i];
+
+ if (!int4hashset_contains_element(b, value))
+ PG_RETURN_BOOL(false);
+ }
+ }
+
+ if (a->null_element != b->null_element)
+ PG_RETURN_BOOL(false);
+
+ /*
+ * All elements in a are in b and the number of elements is the same,
+ * so the sets must be equal.
+ */
+ PG_RETURN_BOOL(true);
+}
+
+
+Datum
+int4hashset_ne(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+
+ /* If a is not equal to b, then they are not equal */
+ if (!DatumGetBool(DirectFunctionCall2(int4hashset_eq, PointerGetDatum(a), PointerGetDatum(b))))
+ PG_RETURN_BOOL(true);
+
+ PG_RETURN_BOOL(false);
+}
+
+
+Datum int4hashset_hash(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+
+ PG_RETURN_INT32(set->hash);
+}
+
+
+Datum
+int4hashset_lt(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp < 0);
+}
+
+
+Datum
+int4hashset_le(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp <= 0);
+}
+
+
+Datum
+int4hashset_gt(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp > 0);
+}
+
+
+Datum
+int4hashset_ge(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 cmp;
+
+ cmp = DatumGetInt32(DirectFunctionCall2(int4hashset_cmp,
+ PointerGetDatum(a),
+ PointerGetDatum(b)));
+
+ PG_RETURN_BOOL(cmp >= 0);
+}
+
+Datum
+int4hashset_cmp(PG_FUNCTION_ARGS)
+{
+ int4hashset_t *a = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *b = PG_GETARG_INT4HASHSET(1);
+ int32 *elements_a;
+ int32 *elements_b;
+
+ /*
+ * Compare the hashes first, if they are different,
+ * we can immediately tell which set is 'greater'
+ */
+ if (a->hash < b->hash)
+ PG_RETURN_INT32(-1);
+ else if (a->hash > b->hash)
+ PG_RETURN_INT32(1);
+
+ /*
+ * If hashes are equal, perform a more rigorous comparison
+ */
+
+ /*
+ * If number of elements are different,
+ * we can use that to deterministically return -1 or 1
+ */
+ if (a->nelements < b->nelements)
+ PG_RETURN_INT32(-1);
+ else if (a->nelements > b->nelements)
+ PG_RETURN_INT32(1);
+
+ /* Assert that the number of elements in both hashsets are equal */
+ Assert(a->nelements == b->nelements);
+
+ /* Extract and sort elements from each set */
+ elements_a = int4hashset_extract_sorted_elements(a);
+ elements_b = int4hashset_extract_sorted_elements(b);
+
+ /* Now we can perform a lexicographical comparison */
+ for (int32 i = 0; i < a->nelements; i++)
+ {
+ if (elements_a[i] < elements_b[i])
+ {
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(-1);
+ }
+ else if (elements_a[i] > elements_b[i])
+ {
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(1);
+ }
+ }
+
+ /* All elements are equal, so the sets are equal */
+ pfree(elements_a);
+ pfree(elements_b);
+ PG_RETURN_INT32(0);
+}
+
+Datum
+int4hashset_intersection(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
+ char *bitmap = HASHSET_GET_BITMAP(setb);
+ int32 *values = HASHSET_GET_VALUES(setb);
+
+ int4hashset_t *intersection;
+
+ intersection = int4hashset_allocate(
+ seta->capacity,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ for (i = 0; i < setb->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if ((bitmap[byte] & (0x01 << bit)) &&
+ int4hashset_contains_element(seta, values[i]))
+ {
+ intersection = int4hashset_add_element(intersection, values[i]);
+ }
+ }
+
+ if (seta->null_element && setb->null_element)
+ intersection->null_element = true;
+
+ PG_RETURN_POINTER(intersection);
+}
+
+Datum
+int4hashset_difference(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
+ int4hashset_t *difference;
+ char *bitmap = HASHSET_GET_BITMAP(seta);
+ int32 *values = HASHSET_GET_VALUES(seta);
+
+ difference = int4hashset_allocate(
+ seta->capacity,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ for (i = 0; i < seta->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if ((bitmap[byte] & (0x01 << bit)) &&
+ !int4hashset_contains_element(setb, values[i]))
+ {
+ difference = int4hashset_add_element(difference, values[i]);
+ }
+ }
+
+ if (seta->null_element && !setb->null_element)
+ difference->null_element = true;
+
+ PG_RETURN_POINTER(difference);
+}
+
+Datum
+int4hashset_symmetric_difference(PG_FUNCTION_ARGS)
+{
+ int i;
+ int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
+ int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
+ int4hashset_t *result;
+ char *bitmapa = HASHSET_GET_BITMAP(seta);
+ char *bitmapb = HASHSET_GET_BITMAP(setb);
+ int32 *valuesa = HASHSET_GET_VALUES(seta);
+ int32 *valuesb = HASHSET_GET_VALUES(setb);
+
+ result = int4hashset_allocate(
+ seta->nelements + setb->nelements,
+ DEFAULT_LOAD_FACTOR,
+ DEFAULT_GROWTH_FACTOR,
+ DEFAULT_HASHFN_ID
+ );
+
+ /* Add elements that are in seta but not in setb */
+ for (i = 0; i < seta->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ if (bitmapa[byte] & (0x01 << bit))
+ {
+ int32 value = valuesa[i];
+ if (!int4hashset_contains_element(setb, value))
+ result = int4hashset_add_element(result, value);
+ }
+ }
+
+ /* Add elements that are in setb but not in seta */
+ for (i = 0; i < setb->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ if (bitmapb[byte] & (0x01 << bit))
+ {
+ int32 value = valuesb[i];
+ if (!int4hashset_contains_element(seta, value))
+ result = int4hashset_add_element(result, value);
+ }
+ }
+
+ if (seta->null_element ^ setb->null_element)
+ result->null_element = true;
+
+ PG_RETURN_POINTER(result);
+}
diff --git a/hashset.c b/hashset.c
new file mode 100644
index 0000000..65ab25f
--- /dev/null
+++ b/hashset.c
@@ -0,0 +1,327 @@
+/*
+ * hashset.c
+ *
+ * Copyright (C) Tomas Vondra, 2019
+ */
+
+#include "hashset.h"
+
+static int int32_cmp(const void *a, const void *b);
+
+int4hashset_t *
+int4hashset_allocate(
+ int capacity,
+ float4 load_factor,
+ float4 growth_factor,
+ int hashfn_id
+)
+{
+ Size len;
+ int4hashset_t *set;
+ char *ptr;
+
+ /*
+ * Ensure that capacity is not divisible by HASHSET_STEP;
+ * i.e. the step size used in hashset_add_element()
+ * and hashset_contains_element().
+ */
+ while (capacity % HASHSET_STEP == 0)
+ capacity++;
+
+ len = offsetof(int4hashset_t, data);
+ len += CEIL_DIV(capacity, 8);
+ len += capacity * sizeof(int32);
+
+ ptr = palloc0(len);
+ SET_VARSIZE(ptr, len);
+
+ set = (int4hashset_t *) ptr;
+
+ set->flags = 0;
+ set->capacity = capacity;
+ set->nelements = 0;
+ set->hashfn_id = hashfn_id;
+ set->load_factor = load_factor;
+ set->growth_factor = growth_factor;
+ set->ncollisions = 0;
+ set->max_collisions = 0;
+ set->hash = 0; /* Initial hash value */
+ set->null_element = false; /* No null element initially */
+
+ set->flags |= 0;
+
+ return set;
+}
+
+int4hashset_t *
+int4hashset_resize(int4hashset_t * set)
+{
+ int i;
+ int4hashset_t *new;
+ char *bitmap = HASHSET_GET_BITMAP(set);
+ int32 *values = HASHSET_GET_VALUES(set);
+ int new_capacity;
+
+ new_capacity = (int)(set->capacity * set->growth_factor);
+
+ /*
+ * If growth factor is too small, new capacity might remain the same as
+ * the old capacity. This can lead to an infinite loop in resizing.
+ * To prevent this, we manually increment the capacity by 1 if new capacity
+ * equals the old capacity.
+ */
+ if (new_capacity == set->capacity)
+ new_capacity = set->capacity + 1;
+
+ new = int4hashset_allocate(
+ new_capacity,
+ set->load_factor,
+ set->growth_factor,
+ set->hashfn_id
+ );
+
+ for (i = 0; i < set->capacity; i++)
+ {
+ int byte = (i / 8);
+ int bit = (i % 8);
+
+ if (bitmap[byte] & (0x01 << bit))
+ int4hashset_add_element(new, values[i]);
+ }
+
+ return new;
+}
+
+int4hashset_t *
+int4hashset_add_element(int4hashset_t *set, int32 value)
+{
+ int byte;
+ int bit;
+ uint32 hash;
+ uint32 position;
+ char *bitmap;
+ int32 *values;
+ int32 current_collisions = 0;
+
+ if (set->nelements > set->capacity * set->load_factor)
+ set = int4hashset_resize(set);
+
+ if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
+ {
+ hash = hash_bytes_uint32((uint32) value);
+ }
+ else if (set->hashfn_id == MURMURHASH32_HASHFN_ID)
+ {
+ hash = murmurhash32((uint32) value);
+ }
+ else if (set->hashfn_id == NAIVE_HASHFN_ID)
+ {
+ hash = ((uint32) value * 7691 + 4201);
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid hash function ID: \"%d\"", set->hashfn_id)));
+ }
+
+ position = hash % set->capacity;
+
+ bitmap = HASHSET_GET_BITMAP(set);
+ values = HASHSET_GET_VALUES(set);
+
+ while (true)
+ {
+ byte = (position / 8);
+ bit = (position % 8);
+
+ /* The item is already used - maybe it's the same value? */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Same value, we're done */
+ if (values[position] == value)
+ break;
+
+ /* Increment the collision counter */
+ set->ncollisions++;
+ current_collisions++;
+
+ if (current_collisions > set->max_collisions)
+ set->max_collisions = current_collisions;
+
+ position = (position + HASHSET_STEP) % set->capacity;
+ continue;
+ }
+
+ /* Found an empty spot, before hitting the value first */
+ bitmap[byte] |= (0x01 << bit);
+ values[position] = value;
+
+ set->hash ^= hash;
+
+ set->nelements++;
+
+ break;
+ }
+
+ return set;
+}
+
+bool
+int4hashset_contains_element(int4hashset_t *set, int32 value)
+{
+ int byte;
+ int bit;
+ uint32 hash;
+ uint32 position;
+ char *bitmap = HASHSET_GET_BITMAP(set);
+ int32 *values = HASHSET_GET_VALUES(set);
+ int num_probes = 0; /* Counter for the number of probes */
+
+ if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
+ {
+ hash = hash_bytes_uint32((uint32) value);
+ }
+ else if (set->hashfn_id == MURMURHASH32_HASHFN_ID)
+ {
+ hash = murmurhash32((uint32) value);
+ }
+ else if (set->hashfn_id == NAIVE_HASHFN_ID)
+ {
+ hash = ((uint32) value * NAIVE_HASHFN_MULTIPLIER + NAIVE_HASHFN_INCREMENT);
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid hash function ID: \"%d\"", set->hashfn_id)));
+ }
+
+ position = hash % set->capacity;
+
+ while (true)
+ {
+ byte = (position / 8);
+ bit = (position % 8);
+
+ /* Found an empty slot, value is not there */
+ if ((bitmap[byte] & (0x01 << bit)) == 0)
+ return false;
+
+ /* Is it the same value? */
+ if (values[position] == value)
+ return true;
+
+ /* Move to the next element */
+ position = (position + HASHSET_STEP) % set->capacity;
+
+ num_probes++; /* Increment the number of probes */
+
+ /* Check if we have probed all slots */
+ if (num_probes >= set->capacity)
+ return false; /* Avoid infinite loop */
+ }
+}
+
+int32 *
+int4hashset_extract_sorted_elements(int4hashset_t *set)
+{
+ int32 *elements = palloc(set->nelements * sizeof(int32));
+ char *bitmap = HASHSET_GET_BITMAP(set);
+ int32 *values = HASHSET_GET_VALUES(set);
+ int32 nextracted = 0;
+
+ /* Iterate through all elements */
+ for (int32 i = 0; i < set->capacity; i++)
+ {
+ int byte = i / 8;
+ int bit = i % 8;
+
+ /* Check if the current position is occupied */
+ if (bitmap[byte] & (0x01 << bit))
+ {
+ /* Add the value to the elements array */
+ elements[nextracted++] = values[i];
+ }
+ }
+
+ /* Make sure we extracted the correct number of elements */
+ Assert(nextracted == set->nelements);
+
+ /* Sort the elements array */
+ qsort(elements, nextracted, sizeof(int32), int32_cmp);
+
+ /* Return the sorted elements array */
+ return elements;
+}
+
+int4hashset_t *
+int4hashset_copy(int4hashset_t *src)
+{
+ return src;
+}
+
+/*
+ * hashset_isspace() --- a non-locale-dependent isspace()
+ *
+ * Identical to array_isspace() in src/backend/utils/adt/arrayfuncs.c.
+ * We used to use isspace() for parsing hashset values, but that has
+ * undesirable results: a hashset value might be silently interpreted
+ * differently depending on the locale setting. So here, we hard-wire
+ * the traditional ASCII definition of isspace().
+ */
+bool
+hashset_isspace(char ch)
+{
+ if (ch == ' ' ||
+ ch == '\t' ||
+ ch == '\n' ||
+ ch == '\r' ||
+ ch == '\v' ||
+ ch == '\f')
+ return true;
+ return false;
+}
+
+/*
+ * Construct an SQL array from a simple C double array
+ */
+Datum
+int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len, bool null_element)
+{
+ ArrayBuildState *astate = NULL;
+ int i;
+
+ for (i = 0; i < len; i++)
+ {
+ /* Stash away this field */
+ astate = accumArrayResult(astate,
+ Int32GetDatum(d[i]),
+ false,
+ INT4OID,
+ CurrentMemoryContext);
+ }
+
+ if (null_element)
+ {
+ astate = accumArrayResult(astate,
+ (Datum) 0,
+ true,
+ INT4OID,
+ CurrentMemoryContext);
+ }
+
+ PG_RETURN_DATUM(makeArrayResult(astate,
+ CurrentMemoryContext));
+}
+
+static int
+int32_cmp(const void *a, const void *b)
+{
+ int32 arg1 = *(const int32 *)a;
+ int32 arg2 = *(const int32 *)b;
+
+ if (arg1 < arg2) return -1;
+ if (arg1 > arg2) return 1;
+ return 0;
+}
diff --git a/hashset.control b/hashset.control
new file mode 100644
index 0000000..0743003
--- /dev/null
+++ b/hashset.control
@@ -0,0 +1,3 @@
+comment = 'Provides hashset type.'
+default_version = '0.0.1'
+relocatable = true
diff --git a/hashset.h b/hashset.h
new file mode 100644
index 0000000..3631e22
--- /dev/null
+++ b/hashset.h
@@ -0,0 +1,56 @@
+#ifndef HASHSET_H
+#define HASHSET_H
+
+#include "postgres.h"
+#include "libpq/pqformat.h"
+#include "nodes/memnodes.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "catalog/pg_type.h"
+#include "common/hashfn.h"
+
+#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
+#define HASHSET_GET_BITMAP(set) ((set)->data)
+#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))
+#define HASHSET_STEP 13
+#define JENKINS_LOOKUP3_HASHFN_ID 1
+#define MURMURHASH32_HASHFN_ID 2
+#define NAIVE_HASHFN_ID 3
+#define NAIVE_HASHFN_MULTIPLIER 7691
+#define NAIVE_HASHFN_INCREMENT 4201
+
+/*
+ * These defaults should match the the SQL function int4hashset()
+ */
+#define DEFAULT_INITIAL_CAPACITY 0
+#define DEFAULT_LOAD_FACTOR 0.75
+#define DEFAULT_GROWTH_FACTOR 2.0
+#define DEFAULT_HASHFN_ID JENKINS_LOOKUP3_HASHFN_ID
+
+typedef struct int4hashset_t {
+ int32 vl_len_; /* Varlena header (do not touch directly!) */
+ int32 flags; /* Reserved for future use (versioning, ...) */
+ int32 capacity; /* Max number of element we have space for */
+ int32 nelements; /* Number of items added to the hashset */
+ int32 hashfn_id; /* ID of the hash function used */
+ float4 load_factor; /* Load factor before triggering resize */
+ float4 growth_factor; /* Growth factor when resizing the hashset */
+ int32 ncollisions; /* Number of collisions */
+ int32 max_collisions; /* Maximum collisions for a single element */
+ int32 hash; /* Stored hash value of the hashset */
+ bool null_element; /* Indicates if null is present in hashset */
+ char data[FLEXIBLE_ARRAY_MEMBER];
+} int4hashset_t;
+
+int4hashset_t *int4hashset_allocate(int capacity, float4 load_factor, float4 growth_factor, int hashfn_id);
+int4hashset_t *int4hashset_resize(int4hashset_t * set);
+int4hashset_t *int4hashset_add_element(int4hashset_t *set, int32 value);
+bool int4hashset_contains_element(int4hashset_t *set, int32 value);
+int32 *int4hashset_extract_sorted_elements(int4hashset_t *set);
+int4hashset_t *int4hashset_copy(int4hashset_t *src);
+bool hashset_isspace(char ch);
+Datum int32_to_array(FunctionCallInfo fcinfo, int32 *d, int len, bool null_element);
+
+#endif /* HASHSET_H */
diff --git a/test/c_tests/test_send_recv.c b/test/c_tests/test_send_recv.c
new file mode 100644
index 0000000..cc7c48a
--- /dev/null
+++ b/test/c_tests/test_send_recv.c
@@ -0,0 +1,92 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <libpq-fe.h>
+
+void exit_nicely(PGconn *conn) {
+ PQfinish(conn);
+ exit(1);
+}
+
+int main() {
+ /* Connect to database specified by the PGDATABASE environment variable */
+ const char *hostname = getenv("PGHOST");
+ char conninfo[1024];
+ PGconn *conn;
+
+ if (hostname == NULL)
+ hostname = "localhost";
+
+ /* Connect to database specified by the PGDATABASE environment variable */
+ snprintf(conninfo, sizeof(conninfo), "host=%s port=5432", hostname);
+ conn = PQconnectdb(conninfo);
+ if (PQstatus(conn) != CONNECTION_OK) {
+ fprintf(stderr, "Connection to database failed: %s", PQerrorMessage(conn));
+ exit_nicely(conn);
+ }
+
+ /* Create extension */
+ PQexec(conn, "CREATE EXTENSION IF NOT EXISTS hashset");
+
+ /* Create temporary table */
+ PQexec(conn, "CREATE TABLE IF NOT EXISTS test_hashset_send_recv (hashset_col int4hashset)");
+
+ /* Enable binary output */
+ PQexec(conn, "SET bytea_output = 'escape'");
+
+ /* Insert dummy data */
+ const char *insert_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ('{1,2,3}'::int4hashset)";
+ PGresult *res = PQexec(conn, insert_command);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK) {
+ fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+ PQclear(res);
+
+ /* Fetch the data in binary format */
+ const char *select_command = "SELECT hashset_col FROM test_hashset_send_recv";
+ int resultFormat = 1; /* 0 = text, 1 = binary */
+ res = PQexecParams(conn, select_command, 0, NULL, NULL, NULL, NULL, resultFormat);
+ if (PQresultStatus(res) != PGRES_TUPLES_OK) {
+ fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+
+ /* Store binary data for later use */
+ const char *binary_data = PQgetvalue(res, 0, 0);
+ int binary_data_length = PQgetlength(res, 0, 0);
+ PQclear(res);
+
+ /* Re-insert the binary data */
+ const char *insert_binary_command = "INSERT INTO test_hashset_send_recv (hashset_col) VALUES ($1)";
+ const char *paramValues[1] = {binary_data};
+ int paramLengths[1] = {binary_data_length};
+ int paramFormats[1] = {1}; /* binary format */
+ res = PQexecParams(conn, insert_binary_command, 1, NULL, paramValues, paramLengths, paramFormats, 0);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK) {
+ fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+ PQclear(res);
+
+ /* Check the data */
+ const char *check_command = "SELECT COUNT(DISTINCT hashset_col) AS unique_count, COUNT(*) FROM test_hashset_send_recv";
+ res = PQexec(conn, check_command);
+ if (PQresultStatus(res) != PGRES_TUPLES_OK) {
+ fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
+ PQclear(res);
+ exit_nicely(conn);
+ }
+
+ /* Print the results */
+ printf("unique_count: %s\n", PQgetvalue(res, 0, 0));
+ printf("count: %s\n", PQgetvalue(res, 0, 1));
+ PQclear(res);
+
+ /* Disconnect */
+ PQfinish(conn);
+
+ return 0;
+}
diff --git a/test/c_tests/test_send_recv.sh b/test/c_tests/test_send_recv.sh
new file mode 100755
index 0000000..ab308b3
--- /dev/null
+++ b/test/c_tests/test_send_recv.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+# Get the directory of this script
+SCRIPT_DIR="$(dirname "$(realpath "$0")")"
+
+# Set up database
+export PGDATABASE=test_hashset_send_recv
+dropdb --if-exists "$PGDATABASE"
+createdb
+
+# Define directories
+EXPECTED_DIR="$SCRIPT_DIR/../expected"
+RESULTS_DIR="$SCRIPT_DIR/../results"
+
+# Create the results directory if it doesn't exist
+mkdir -p "$RESULTS_DIR"
+
+# Run the C test and save its output to the results directory
+"$SCRIPT_DIR/test_send_recv" > "$RESULTS_DIR/test_send_recv.out"
+
+printf "test test_send_recv ... "
+
+# Compare the actual output with the expected output
+if diff -q "$RESULTS_DIR/test_send_recv.out" "$EXPECTED_DIR/test_send_recv.out" > /dev/null 2>&1; then
+ echo "ok"
+ # Clean up by removing the results directory if the test passed
+ rm -r "$RESULTS_DIR"
+else
+ echo "failed"
+ git diff --no-index --color "$EXPECTED_DIR/test_send_recv.out" "$RESULTS_DIR/test_send_recv.out"
+fi
diff --git a/test/expected/array-and-multiset-semantics.out b/test/expected/array-and-multiset-semantics.out
new file mode 100644
index 0000000..8f989a1
--- /dev/null
+++ b/test/expected/array-and-multiset-semantics.out
@@ -0,0 +1,365 @@
+CREATE OR REPLACE FUNCTION array_union(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_intersection(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ EXCEPT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_symmetric_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) AS q1
+ EXCEPT
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) AS q2
+ ) AS q3
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+CREATE OR REPLACE FUNCTION array_sort_distinct(int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ cardinality($1) = 0
+ THEN
+ '{}'::int4[]
+ ELSE
+ (
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM unnest($1)
+ )
+ END
+$$ LANGUAGE sql;
+DROP TABLE IF EXISTS hashset_test_results_1;
+NOTICE: table "hashset_test_results_1" does not exist, skipping
+CREATE TABLE hashset_test_results_1 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_add(arg1::int4hashset, arg2),
+ array_append(arg1::int4[], arg2),
+ hashset_contains(arg1::int4hashset, arg2),
+ arg2 = ANY(arg1::int4[]) AS "= ANY(...)"
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL::int4), (1::int4), (4::int4)) AS b(arg2);
+DROP TABLE IF EXISTS hashset_test_results_2;
+NOTICE: table "hashset_test_results_2" does not exist, skipping
+CREATE TABLE hashset_test_results_2 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_union(arg1::int4hashset, arg2::int4hashset),
+ array_union(arg1::int4[], arg2::int4[]),
+ hashset_intersection(arg1::int4hashset, arg2::int4hashset),
+ array_intersection(arg1::int4[], arg2::int4[]),
+ hashset_difference(arg1::int4hashset, arg2::int4hashset),
+ array_difference(arg1::int4[], arg2::int4[]),
+ hashset_symmetric_difference(arg1::int4hashset, arg2::int4hashset),
+ array_symmetric_difference(arg1::int4[], arg2::int4[]),
+ hashset_eq(arg1::int4hashset, arg2::int4hashset),
+ array_eq(arg1::int4[], arg2::int4[]),
+ hashset_ne(arg1::int4hashset, arg2::int4hashset),
+ array_ne(arg1::int4[], arg2::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS b(arg2);
+DROP TABLE IF EXISTS hashset_test_results_3;
+NOTICE: table "hashset_test_results_3" does not exist, skipping
+CREATE TABLE hashset_test_results_3 AS
+SELECT
+ arg1,
+ hashset_cardinality(arg1::int4hashset),
+ cardinality(arg1::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1);
+SELECT * FROM hashset_test_results_1;
+ arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
+--------+------+-------------+--------------+------------------+------------
+ | | {NULL} | {NULL} | |
+ | 1 | {1} | {1} | |
+ | 4 | {4} | {4} | |
+ {} | | {NULL} | {NULL} | f | f
+ {} | 1 | {1} | {1} | f | f
+ {} | 4 | {4} | {4} | f | f
+ {NULL} | | {NULL} | {NULL,NULL} | |
+ {NULL} | 1 | {1,NULL} | {NULL,1} | |
+ {NULL} | 4 | {4,NULL} | {NULL,4} | |
+ {1} | | {1,NULL} | {1,NULL} | |
+ {1} | 1 | {1} | {1,1} | t | t
+ {1} | 4 | {1,4} | {1,4} | f | f
+ {2} | | {2,NULL} | {2,NULL} | |
+ {2} | 1 | {2,1} | {2,1} | f | f
+ {2} | 4 | {2,4} | {2,4} | f | f
+ {1,2} | | {1,2,NULL} | {1,2,NULL} | |
+ {1,2} | 1 | {1,2} | {1,2,1} | t | t
+ {1,2} | 4 | {4,1,2} | {1,2,4} | f | f
+ {2,3} | | {2,3,NULL} | {2,3,NULL} | |
+ {2,3} | 1 | {1,2,3} | {2,3,1} | f | f
+ {2,3} | 4 | {4,2,3} | {2,3,4} | f | f
+(21 rows)
+
+SELECT * FROM hashset_test_results_2;
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+----------+----------+---------------+--------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+ | | | | | | | | | | | | |
+ | {} | | | | | | | | | | | |
+ | {NULL} | | | | | | | | | | | |
+ | {1} | | | | | | | | | | | |
+ | {1,NULL} | | | | | | | | | | | |
+ | {2} | | | | | | | | | | | |
+ | {1,2} | | | | | | | | | | | |
+ | {2,3} | | | | | | | | | | | |
+ {} | | | | | | | | | | | | |
+ {} | {} | {} | {} | {} | {} | {} | {} | {} | {} | t | t | f | f
+ {} | {NULL} | {NULL} | {NULL} | {} | {} | {} | {} | {NULL} | {NULL} | f | f | t | t
+ {} | {1} | {1} | {1} | {} | {} | {} | {} | {1} | {1} | f | f | t | t
+ {} | {1,NULL} | {1,NULL} | {1,NULL} | {} | {} | {} | {} | {1,NULL} | {1,NULL} | f | f | t | t
+ {} | {2} | {2} | {2} | {} | {} | {} | {} | {2} | {2} | f | f | t | t
+ {} | {1,2} | {1,2} | {1,2} | {} | {} | {} | {} | {1,2} | {1,2} | f | f | t | t
+ {} | {2,3} | {2,3} | {2,3} | {} | {} | {} | {} | {2,3} | {2,3} | f | f | t | t
+ {NULL} | | | | | | | | | | | | |
+ {NULL} | {} | {NULL} | {NULL} | {} | {} | {NULL} | {NULL} | {NULL} | {NULL} | f | f | t | t
+ {NULL} | {NULL} | {NULL} | {NULL} | {NULL} | {NULL} | {} | {} | {} | {} | t | t | f | f
+ {NULL} | {1} | {1,NULL} | {1,NULL} | {} | {} | {NULL} | {NULL} | {1,NULL} | {1,NULL} | f | f | t | t
+ {NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {NULL} | {NULL} | {} | {} | {1} | {1} | f | f | t | t
+ {NULL} | {2} | {2,NULL} | {2,NULL} | {} | {} | {NULL} | {NULL} | {2,NULL} | {2,NULL} | f | f | t | t
+ {NULL} | {1,2} | {2,1,NULL} | {1,2,NULL} | {} | {} | {NULL} | {NULL} | {1,2,NULL} | {1,2,NULL} | f | f | t | t
+ {NULL} | {2,3} | {3,2,NULL} | {2,3,NULL} | {} | {} | {NULL} | {NULL} | {2,3,NULL} | {2,3,NULL} | f | f | t | t
+ {1} | | | | | | | | | | | | |
+ {1} | {} | {1} | {1} | {} | {} | {1} | {1} | {1} | {1} | f | f | t | t
+ {1} | {NULL} | {1,NULL} | {1,NULL} | {} | {} | {1} | {1} | {1,NULL} | {1,NULL} | f | f | t | t
+ {1} | {1} | {1} | {1} | {1} | {1} | {} | {} | {} | {} | t | t | f | f
+ {1} | {1,NULL} | {1,NULL} | {1,NULL} | {1} | {1} | {} | {} | {NULL} | {NULL} | f | f | t | t
+ {1} | {2} | {1,2} | {1,2} | {} | {} | {1} | {1} | {1,2} | {1,2} | f | f | t | t
+ {1} | {1,2} | {1,2} | {1,2} | {1} | {1} | {} | {} | {2} | {2} | f | f | t | t
+ {1} | {2,3} | {3,1,2} | {1,2,3} | {} | {} | {1} | {1} | {3,2,1} | {1,2,3} | f | f | t | t
+ {1,NULL} | | | | | | | | | | | | |
+ {1,NULL} | {} | {1,NULL} | {1,NULL} | {} | {} | {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | f | f | t | t
+ {1,NULL} | {NULL} | {1,NULL} | {1,NULL} | {NULL} | {NULL} | {1} | {1} | {1} | {1} | f | f | t | t
+ {1,NULL} | {1} | {1,NULL} | {1,NULL} | {1} | {1} | {NULL} | {NULL} | {NULL} | {NULL} | f | f | t | t
+ {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {1,NULL} | {} | {} | {} | {} | t | t | f | f
+ {1,NULL} | {2} | {1,2,NULL} | {1,2,NULL} | {} | {} | {1,NULL} | {1,NULL} | {1,2,NULL} | {1,2,NULL} | f | f | t | t
+ {1,NULL} | {1,2} | {1,2,NULL} | {1,2,NULL} | {1} | {1} | {NULL} | {NULL} | {2,NULL} | {2,NULL} | f | f | t | t
+ {1,NULL} | {2,3} | {3,1,2,NULL} | {1,2,3,NULL} | {} | {} | {1,NULL} | {1,NULL} | {3,2,1,NULL} | {1,2,3,NULL} | f | f | t | t
+ {2} | | | | | | | | | | | | |
+ {2} | {} | {2} | {2} | {} | {} | {2} | {2} | {2} | {2} | f | f | t | t
+ {2} | {NULL} | {2,NULL} | {2,NULL} | {} | {} | {2} | {2} | {2,NULL} | {2,NULL} | f | f | t | t
+ {2} | {1} | {2,1} | {1,2} | {} | {} | {2} | {2} | {2,1} | {1,2} | f | f | t | t
+ {2} | {1,NULL} | {2,1,NULL} | {1,2,NULL} | {} | {} | {2} | {2} | {2,1,NULL} | {1,2,NULL} | f | f | t | t
+ {2} | {2} | {2} | {2} | {2} | {2} | {} | {} | {} | {} | t | t | f | f
+ {2} | {1,2} | {2,1} | {1,2} | {2} | {2} | {} | {} | {1} | {1} | f | f | t | t
+ {2} | {2,3} | {2,3} | {2,3} | {2} | {2} | {} | {} | {3} | {3} | f | f | t | t
+ {1,2} | | | | | | | | | | | | |
+ {1,2} | {} | {1,2} | {1,2} | {} | {} | {1,2} | {1,2} | {1,2} | {1,2} | f | f | t | t
+ {1,2} | {NULL} | {1,2,NULL} | {1,2,NULL} | {} | {} | {1,2} | {1,2} | {1,2,NULL} | {1,2,NULL} | f | f | t | t
+ {1,2} | {1} | {1,2} | {1,2} | {1} | {1} | {2} | {2} | {2} | {2} | f | f | t | t
+ {1,2} | {1,NULL} | {1,2,NULL} | {1,2,NULL} | {1} | {1} | {2} | {2} | {2,NULL} | {2,NULL} | f | f | t | t
+ {1,2} | {2} | {1,2} | {1,2} | {2} | {2} | {1} | {1} | {1} | {1} | f | f | t | t
+ {1,2} | {1,2} | {1,2} | {1,2} | {1,2} | {1,2} | {} | {} | {} | {} | t | t | f | f
+ {1,2} | {2,3} | {3,1,2} | {1,2,3} | {2} | {2} | {1} | {1} | {1,3} | {1,3} | f | f | t | t
+ {2,3} | | | | | | | | | | | | |
+ {2,3} | {} | {2,3} | {2,3} | {} | {} | {2,3} | {2,3} | {2,3} | {2,3} | f | f | t | t
+ {2,3} | {NULL} | {2,3,NULL} | {2,3,NULL} | {} | {} | {2,3} | {2,3} | {2,3,NULL} | {2,3,NULL} | f | f | t | t
+ {2,3} | {1} | {1,2,3} | {1,2,3} | {} | {} | {2,3} | {2,3} | {3,2,1} | {1,2,3} | f | f | t | t
+ {2,3} | {1,NULL} | {1,2,3,NULL} | {1,2,3,NULL} | {} | {} | {2,3} | {2,3} | {3,2,1,NULL} | {1,2,3,NULL} | f | f | t | t
+ {2,3} | {2} | {2,3} | {2,3} | {2} | {2} | {3} | {3} | {3} | {3} | f | f | t | t
+ {2,3} | {1,2} | {1,2,3} | {1,2,3} | {2} | {2} | {3} | {3} | {1,3} | {1,3} | f | f | t | t
+ {2,3} | {2,3} | {2,3} | {2,3} | {2,3} | {2,3} | {} | {} | {} | {} | t | t | f | f
+(64 rows)
+
+SELECT * FROM hashset_test_results_3;
+ arg1 | hashset_cardinality | cardinality
+--------+---------------------+-------------
+ | |
+ {} | 0 | 0
+ {NULL} | 1 | 1
+ {1} | 1 | 1
+ {2} | 1 | 1
+ {1,2} | 2 | 2
+ {2,3} | 2 | 2
+(7 rows)
+
+/*
+ * The queries below should not return any rows since the hashset
+ * semantics should be identical to array semantics, given the array elements
+ * are distinct and both are compared as sorted arrays.
+ */
+\echo *** Testing: hashset_add()
+*** Testing: hashset_add()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_to_sorted_array(hashset_add)
+IS DISTINCT FROM
+ array_sort_distinct(array_append);
+ arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
+------+------+-------------+--------------+------------------+------------
+(0 rows)
+
+\echo *** Testing: hashset_contains()
+*** Testing: hashset_contains()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_contains
+IS DISTINCT FROM
+ "= ANY(...)";
+ arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
+------+------+-------------+--------------+------------------+------------
+(0 rows)
+
+\echo *** Testing: hashset_union()
+*** Testing: hashset_union()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_union)
+IS DISTINCT FROM
+ array_sort_distinct(array_union);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_intersection()
+*** Testing: hashset_intersection()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_intersection)
+IS DISTINCT FROM
+ array_sort_distinct(array_intersection);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_difference()
+*** Testing: hashset_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_difference);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_symmetric_difference()
+*** Testing: hashset_symmetric_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_symmetric_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_symmetric_difference);
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_eq()
+*** Testing: hashset_eq()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_eq
+IS DISTINCT FROM
+ array_eq;
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_ne()
+*** Testing: hashset_ne()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_ne
+IS DISTINCT FROM
+ array_ne;
+ arg1 | arg2 | hashset_union | array_union | hashset_intersection | array_intersection | hashset_difference | array_difference | hashset_symmetric_difference | array_symmetric_difference | hashset_eq | array_eq | hashset_ne | array_ne
+------+------+---------------+-------------+----------------------+--------------------+--------------------+------------------+------------------------------+----------------------------+------------+----------+------------+----------
+(0 rows)
+
+\echo *** Testing: hashset_cardinality()
+*** Testing: hashset_cardinality()
+SELECT * FROM hashset_test_results_3
+WHERE
+ hashset_cardinality
+IS DISTINCT FROM
+ cardinality;
+ arg1 | hashset_cardinality | cardinality
+------+---------------------+-------------
+(0 rows)
+
diff --git a/test/expected/basic.out b/test/expected/basic.out
new file mode 100644
index 0000000..79c3230
--- /dev/null
+++ b/test/expected/basic.out
@@ -0,0 +1,298 @@
+/*
+ * Hashset Type
+ */
+SELECT '{}'::int4hashset; -- empty int4hashset
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset;
+ int4hashset
+-------------
+ {3,2,1}
+(1 row)
+
+SELECT '{-2147483648,0,2147483647}'::int4hashset;
+ int4hashset
+----------------------------
+ {0,2147483647,-2147483648}
+(1 row)
+
+SELECT '{-2147483649}'::int4hashset; -- out of range
+ERROR: value "-2147483649}" is out of range for type integer
+LINE 1: SELECT '{-2147483649}'::int4hashset;
+ ^
+SELECT '{2147483648}'::int4hashset; -- out of range
+ERROR: value "2147483648}" is out of range for type integer
+LINE 1: SELECT '{2147483648}'::int4hashset;
+ ^
+/*
+ * Hashset Functions
+ */
+SELECT int4hashset();
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT int4hashset(
+ capacity := 10,
+ load_factor := 0.9,
+ growth_factor := 1.1,
+ hashfn_id := 1
+);
+ int4hashset
+-------------
+ {}
+(1 row)
+
+SELECT hashset_add(int4hashset(), 123);
+ hashset_add
+-------------
+ {123}
+(1 row)
+
+SELECT hashset_add('{123}'::int4hashset, 456);
+ hashset_add
+-------------
+ {456,123}
+(1 row)
+
+SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
+ hashset_contains
+------------------
+ t
+(1 row)
+
+SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
+ hashset_contains
+------------------
+ f
+(1 row)
+
+SELECT hashset_union('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+ hashset_union
+---------------
+ {3,1,2}
+(1 row)
+
+SELECT hashset_to_array('{1,2,3}'::int4hashset);
+ hashset_to_array
+------------------
+ {3,2,1}
+(1 row)
+
+SELECT hashset_cardinality('{1,2,3}'::int4hashset); -- 3
+ hashset_cardinality
+---------------------
+ 3
+(1 row)
+
+SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
+ hashset_capacity
+------------------
+ 10
+(1 row)
+
+SELECT hashset_intersection('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+ hashset_intersection
+----------------------
+ {2}
+(1 row)
+
+SELECT hashset_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+ hashset_difference
+--------------------
+ {1}
+(1 row)
+
+SELECT hashset_symmetric_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+ hashset_symmetric_difference
+------------------------------
+ {1,3}
+(1 row)
+
+/*
+ * Aggregation Functions
+ */
+SELECT hashset_agg(i) FROM generate_series(1,10) AS i;
+ hashset_agg
+------------------------
+ {6,10,1,8,2,3,4,5,9,7}
+(1 row)
+
+SELECT hashset_agg(h) FROM
+(
+ SELECT hashset_agg(i) AS h FROM generate_series(1,5) AS i
+ UNION ALL
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
+) q;
+ hashset_agg
+------------------------
+ {6,8,1,3,2,10,4,5,9,7}
+(1 row)
+
+/*
+ * Operator Definitions
+ */
+SELECT '{2}'::int4hashset = '{1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset = '{2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset = '{3}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{2,3,1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{4,5,6}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3,4}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{1}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{2}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{2}'::int4hashset <> '{3}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{2,3,1}'::int4hashset; -- false
+ ?column?
+----------
+ f
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{4,5,6}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '{1,2,3}'::int4hashset || 4;
+ ?column?
+-----------
+ {1,3,2,4}
+(1 row)
+
+SELECT 4 || '{1,2,3}'::int4hashset;
+ ?column?
+-----------
+ {1,3,2,4}
+(1 row)
+
+/*
+ * Hashset Hash Operators
+ */
+SELECT hashset_hash('{1,2,3}'::int4hashset);
+ hashset_hash
+--------------
+ 868123687
+(1 row)
+
+SELECT hashset_hash('{3,2,1}'::int4hashset);
+ hashset_hash
+--------------
+ 868123687
+(1 row)
+
+SELECT COUNT(*), COUNT(DISTINCT h)
+FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3,2,1}'::int4hashset AS h
+) q;
+ count | count
+-------+-------
+ 2 | 1
+(1 row)
+
+/*
+ * Hashset Btree Operators
+ *
+ * Ordering of hashsets is not based on lexicographic order of elements.
+ * - If two hashsets are not equal, they retain consistent relative order.
+ * - If two hashsets are equal but have elements in different orders, their
+ * ordering is non-deterministic. This is inherent since the comparison
+ * function must return 0 for equal hashsets, giving no indication of order.
+ */
+SELECT h FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{4,5,6}'::int4hashset AS h
+ UNION ALL
+ SELECT '{7,8,9}'::int4hashset AS h
+) q
+ORDER BY h;
+ h
+---------
+ {9,7,8}
+ {3,2,1}
+ {5,6,4}
+(3 rows)
+
diff --git a/test/expected/invalid.out b/test/expected/invalid.out
new file mode 100644
index 0000000..bd44199
--- /dev/null
+++ b/test/expected/invalid.out
@@ -0,0 +1,4 @@
+SELECT '{1,2s}'::int4hashset;
+ERROR: unexpected character "s" in hashset input
+LINE 1: SELECT '{1,2s}'::int4hashset;
+ ^
diff --git a/test/expected/io_varying_lengths.out b/test/expected/io_varying_lengths.out
new file mode 100644
index 0000000..45e9fb1
--- /dev/null
+++ b/test/expected/io_varying_lengths.out
@@ -0,0 +1,100 @@
+/*
+ * This test verifies the hashset input/output functions for varying
+ * initial capacities, ensuring functionality across different sizes.
+ */
+SELECT hashset_sorted('{1}'::int4hashset);
+ hashset_sorted
+----------------
+ {1}
+(1 row)
+
+SELECT hashset_sorted('{1,2}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6}'::int4hashset);
+ hashset_sorted
+----------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::int4hashset);
+ hashset_sorted
+-----------------
+ {1,2,3,4,5,6,7}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::int4hashset);
+ hashset_sorted
+-------------------
+ {1,2,3,4,5,6,7,8}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::int4hashset);
+ hashset_sorted
+---------------------
+ {1,2,3,4,5,6,7,8,9}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::int4hashset);
+ hashset_sorted
+------------------------
+ {1,2,3,4,5,6,7,8,9,10}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::int4hashset);
+ hashset_sorted
+---------------------------
+ {1,2,3,4,5,6,7,8,9,10,11}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::int4hashset);
+ hashset_sorted
+------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::int4hashset);
+ hashset_sorted
+---------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::int4hashset);
+ hashset_sorted
+------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::int4hashset);
+ hashset_sorted
+---------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
+(1 row)
+
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::int4hashset);
+ hashset_sorted
+------------------------------------------
+ {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
+(1 row)
+
diff --git a/test/expected/parsing.out b/test/expected/parsing.out
new file mode 100644
index 0000000..263797e
--- /dev/null
+++ b/test/expected/parsing.out
@@ -0,0 +1,71 @@
+/* Valid */
+SELECT '{1,23,-456}'::int4hashset;
+ int4hashset
+-------------
+ {1,-456,23}
+(1 row)
+
+SELECT ' { 1 , 23 , -456 } '::int4hashset;
+ int4hashset
+-------------
+ {1,-456,23}
+(1 row)
+
+/* Only whitespace is allowed after the closing brace */
+SELECT ' { 1 , 23 , -456 } 1'::int4hashset; -- error
+ERROR: malformed hashset literal: "1"
+LINE 2: SELECT ' { 1 , 23 , -456 } 1'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } ,'::int4hashset; -- error
+ERROR: malformed hashset literal: ","
+LINE 1: SELECT ' { 1 , 23 , -456 } ,'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } {'::int4hashset; -- error
+ERROR: malformed hashset literal: "{"
+LINE 1: SELECT ' { 1 , 23 , -456 } {'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } }'::int4hashset; -- error
+ERROR: malformed hashset literal: "}"
+LINE 1: SELECT ' { 1 , 23 , -456 } }'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+SELECT ' { 1 , 23 , -456 } x'::int4hashset; -- error
+ERROR: malformed hashset literal: "x"
+LINE 1: SELECT ' { 1 , 23 , -456 } x'::int4hashset;
+ ^
+DETAIL: Junk after closing right brace.
+/* Unexpected character when expecting closing brace */
+SELECT ' { 1 , 23 , -456 1'::int4hashset; -- error
+ERROR: unexpected character "1" in hashset input
+LINE 2: SELECT ' { 1 , 23 , -456 1'::int4hashset;
+ ^
+SELECT ' { 1 , 23 , -456 {'::int4hashset; -- error
+ERROR: unexpected character "{" in hashset input
+LINE 1: SELECT ' { 1 , 23 , -456 {'::int4hashset;
+ ^
+SELECT ' { 1 , 23 , -456 x'::int4hashset; -- error
+ERROR: unexpected character "x" in hashset input
+LINE 1: SELECT ' { 1 , 23 , -456 x'::int4hashset;
+ ^
+/* Error handling for strtol */
+SELECT ' { , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for integer: ", 23 , -456 } "
+LINE 2: SELECT ' { , 23 , -456 } '::int4hashset;
+ ^
+SELECT ' { 1 , 23 , '::int4hashset; -- error
+ERROR: invalid input syntax for integer: ""
+LINE 1: SELECT ' { 1 , 23 , '::int4hashset;
+ ^
+SELECT ' { s , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for integer: "s , 23 , -456 } "
+LINE 1: SELECT ' { s , 23 , -456 } '::int4hashset;
+ ^
+/* Missing opening brace */
+SELECT ' 1 , 23 , -456 } '::int4hashset; -- error
+ERROR: invalid input syntax for hashset: "1 , 23 , -456 } "
+LINE 2: SELECT ' 1 , 23 , -456 } '::int4hashset;
+ ^
+DETAIL: Hashset representation must start with "{".
diff --git a/test/expected/prelude.out b/test/expected/prelude.out
new file mode 100644
index 0000000..f34e190
--- /dev/null
+++ b/test/expected/prelude.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION hashset;
+CREATE OR REPLACE FUNCTION hashset_sorted(int4hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/expected/random.out b/test/expected/random.out
new file mode 100644
index 0000000..9d9026b
--- /dev/null
+++ b/test/expected/random.out
@@ -0,0 +1,38 @@
+SELECT setseed(0.12345);
+ setseed
+---------
+
+(1 row)
+
+\set MAX_INT 2147483647
+CREATE TABLE hashset_random_int4_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(int4hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_int4_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_int4_numbers
+) q;
+ md5
+----------------------------------
+ 4ad6e4233861becbeb4a665376952a16
+(1 row)
+
diff --git a/test/expected/reported_bugs.out b/test/expected/reported_bugs.out
new file mode 100644
index 0000000..03cc7c3
--- /dev/null
+++ b/test/expected/reported_bugs.out
@@ -0,0 +1,138 @@
+/*
+ * Bug in hashset_add() and hashset_union() functions altering original hashset.
+ *
+ * Previously, the hashset_add() and hashset_union() functions were modifying the
+ * original hashset in-place, leading to unexpected results as the original data
+ * within the hashset was being altered.
+ *
+ * The issue was addressed by implementing a macro function named
+ * PG_GETARG_INT4HASHSET_COPY() within the C code. This function guarantees that
+ * a copy of the hashset is created and subsequently modified, thereby preserving
+ * the integrity of the original hashset.
+ *
+ * As a result of this fix, hashset_add() and hashset_union() now operate on
+ * a copied hashset, ensuring that the original data remains unaltered, and
+ * the query executes correctly.
+ */
+SELECT
+ q.hashset_agg,
+ hashset_add(hashset_agg,4)
+FROM
+(
+ SELECT
+ hashset_agg(generate_series)
+ FROM generate_series(1,3)
+) q;
+ hashset_agg | hashset_add
+-------------+-------------
+ {3,1,2} | {3,4,1,2}
+(1 row)
+
+/*
+ * Bug in hashset_hash() function with respect to element insertion order.
+ *
+ * Prior to the fix, the hashset_hash() function was accumulating the hashes
+ * of individual elements in a non-commutative manner. As a consequence, the
+ * final hash value was sensitive to the order in which elements were inserted
+ * into the hashset. This behavior led to inconsistencies, as logically
+ * equivalent sets (i.e., sets with the same elements but in different orders)
+ * produced different hash values.
+ *
+ * The bug was fixed by modifying the hashset_hash() function to use a
+ * commutative operation when combining the hashes of individual elements.
+ * This change ensures that the final hash value is independent of the
+ * element insertion order, and logically equivalent sets produce the
+ * same hash.
+ */
+SELECT hashset_hash('{1,2}'::int4hashset);
+ hashset_hash
+--------------
+ -840053840
+(1 row)
+
+SELECT hashset_hash('{2,1}'::int4hashset);
+ hashset_hash
+--------------
+ -840053840
+(1 row)
+
+SELECT hashset_cmp('{1,2}','{2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2}');
+ hashset_cmp
+-------------
+ 0
+(1 row)
+
+/*
+ * Bug in int4hashset_resize() not utilizing growth_factor.
+ *
+ * The previous implementation hard-coded a growth factor of 2, neglecting
+ * the struct's growth_factor field. This bug was addressed by properly
+ * using growth_factor for new capacity calculation, with an additional
+ * safety check to prevent possible infinite loops in resizing.
+ */
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 1.1
+), 123), 456));
+ hashset_capacity
+------------------
+ 2
+(1 row)
+
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 10
+), 123), 456));
+ hashset_capacity
+------------------
+ 10
+(1 row)
+
+/*
+ * Bug in int4hashset_capacity() not detoasting input correctly.
+ */
+SELECT hashset_capacity(int4hashset(capacity:=10)) AS capacity_10;
+ capacity_10
+-------------
+ 10
+(1 row)
+
+SELECT hashset_capacity(int4hashset(capacity:=1000)) AS capacity_1000;
+ capacity_1000
+---------------
+ 1000
+(1 row)
+
+SELECT hashset_capacity(int4hashset(capacity:=100000)) AS capacity_100000;
+ capacity_100000
+-----------------
+ 100000
+(1 row)
+
+CREATE TABLE test_capacity_10 AS SELECT int4hashset(capacity:=10) AS capacity_10;
+CREATE TABLE test_capacity_1000 AS SELECT int4hashset(capacity:=1000) AS capacity_1000;
+CREATE TABLE test_capacity_100000 AS SELECT int4hashset(capacity:=100000) AS capacity_100000;
+SELECT hashset_capacity(capacity_10) AS capacity_10 FROM test_capacity_10;
+ capacity_10
+-------------
+ 10
+(1 row)
+
+SELECT hashset_capacity(capacity_1000) AS capacity_1000 FROM test_capacity_1000;
+ capacity_1000
+---------------
+ 1000
+(1 row)
+
+SELECT hashset_capacity(capacity_100000) AS capacity_100000 FROM test_capacity_100000;
+ capacity_100000
+-----------------
+ 100000
+(1 row)
+
diff --git a/test/expected/table.out b/test/expected/table.out
new file mode 100644
index 0000000..f59494e
--- /dev/null
+++ b/test/expected/table.out
@@ -0,0 +1,25 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes int4hashset DEFAULT int4hashset(capacity := 2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+ hashset_contains
+------------------
+ t
+(1 row)
+
+SELECT hashset_cardinality(user_likes) FROM users WHERE user_id = 1;
+ hashset_cardinality
+---------------------
+ 2
+(1 row)
+
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
+ hashset_sorted
+----------------
+ {101,202}
+(1 row)
+
diff --git a/test/expected/test_send_recv.out b/test/expected/test_send_recv.out
new file mode 100644
index 0000000..12382d5
--- /dev/null
+++ b/test/expected/test_send_recv.out
@@ -0,0 +1,2 @@
+unique_count: 1
+count: 2
diff --git a/test/sql/array-and-multiset-semantics.sql b/test/sql/array-and-multiset-semantics.sql
new file mode 100644
index 0000000..0db7065
--- /dev/null
+++ b/test/sql/array-and-multiset-semantics.sql
@@ -0,0 +1,232 @@
+CREATE OR REPLACE FUNCTION array_union(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_intersection(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT unnest($1)
+ EXCEPT
+ SELECT unnest($2)
+ ) q
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_symmetric_difference(int4[], int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ $1 IS NULL OR $2 IS NULL
+ THEN
+ NULL
+ ELSE
+ COALESCE((
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM
+ (
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ UNION
+ SELECT unnest($2)
+ ) AS q1
+ EXCEPT
+ SELECT
+ *
+ FROM
+ (
+ SELECT unnest($1)
+ INTERSECT
+ SELECT unnest($2)
+ ) AS q2
+ ) AS q3
+ ),'{}'::int4[])
+ END
+$$ LANGUAGE sql;
+
+CREATE OR REPLACE FUNCTION array_sort_distinct(int4[])
+RETURNS int4[]
+AS
+$$
+SELECT
+ CASE
+ WHEN
+ cardinality($1) = 0
+ THEN
+ '{}'::int4[]
+ ELSE
+ (
+ SELECT array_agg(DISTINCT unnest ORDER BY unnest) FROM unnest($1)
+ )
+ END
+$$ LANGUAGE sql;
+
+DROP TABLE IF EXISTS hashset_test_results_1;
+CREATE TABLE hashset_test_results_1 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_add(arg1::int4hashset, arg2),
+ array_append(arg1::int4[], arg2),
+ hashset_contains(arg1::int4hashset, arg2),
+ arg2 = ANY(arg1::int4[]) AS "= ANY(...)"
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL::int4), (1::int4), (4::int4)) AS b(arg2);
+
+
+DROP TABLE IF EXISTS hashset_test_results_2;
+CREATE TABLE hashset_test_results_2 AS
+SELECT
+ arg1,
+ arg2,
+ hashset_union(arg1::int4hashset, arg2::int4hashset),
+ array_union(arg1::int4[], arg2::int4[]),
+ hashset_intersection(arg1::int4hashset, arg2::int4hashset),
+ array_intersection(arg1::int4[], arg2::int4[]),
+ hashset_difference(arg1::int4hashset, arg2::int4hashset),
+ array_difference(arg1::int4[], arg2::int4[]),
+ hashset_symmetric_difference(arg1::int4hashset, arg2::int4hashset),
+ array_symmetric_difference(arg1::int4[], arg2::int4[]),
+ hashset_eq(arg1::int4hashset, arg2::int4hashset),
+ array_eq(arg1::int4[], arg2::int4[]),
+ hashset_ne(arg1::int4hashset, arg2::int4hashset),
+ array_ne(arg1::int4[], arg2::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1)
+CROSS JOIN (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{1,NULL}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS b(arg2);
+
+DROP TABLE IF EXISTS hashset_test_results_3;
+CREATE TABLE hashset_test_results_3 AS
+SELECT
+ arg1,
+ hashset_cardinality(arg1::int4hashset),
+ cardinality(arg1::int4[])
+FROM (VALUES (NULL), ('{}'), ('{NULL}'), ('{1}'), ('{2}'), ('{1,2}'), ('{2,3}')) AS a(arg1);
+
+SELECT * FROM hashset_test_results_1;
+SELECT * FROM hashset_test_results_2;
+SELECT * FROM hashset_test_results_3;
+
+/*
+ * The queries below should not return any rows since the hashset
+ * semantics should be identical to array semantics, given the array elements
+ * are distinct and both are compared as sorted arrays.
+ */
+
+\echo *** Testing: hashset_add()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_to_sorted_array(hashset_add)
+IS DISTINCT FROM
+ array_sort_distinct(array_append);
+
+\echo *** Testing: hashset_contains()
+SELECT * FROM hashset_test_results_1
+WHERE
+ hashset_contains
+IS DISTINCT FROM
+ "= ANY(...)";
+
+\echo *** Testing: hashset_union()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_union)
+IS DISTINCT FROM
+ array_sort_distinct(array_union);
+
+\echo *** Testing: hashset_intersection()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_intersection)
+IS DISTINCT FROM
+ array_sort_distinct(array_intersection);
+
+\echo *** Testing: hashset_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_difference);
+
+\echo *** Testing: hashset_symmetric_difference()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_to_sorted_array(hashset_symmetric_difference)
+IS DISTINCT FROM
+ array_sort_distinct(array_symmetric_difference);
+
+\echo *** Testing: hashset_eq()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_eq
+IS DISTINCT FROM
+ array_eq;
+
+\echo *** Testing: hashset_ne()
+SELECT * FROM hashset_test_results_2
+WHERE
+ hashset_ne
+IS DISTINCT FROM
+ array_ne;
+
+\echo *** Testing: hashset_cardinality()
+SELECT * FROM hashset_test_results_3
+WHERE
+ hashset_cardinality
+IS DISTINCT FROM
+ cardinality;
diff --git a/test/sql/basic.sql b/test/sql/basic.sql
new file mode 100644
index 0000000..2bf5893
--- /dev/null
+++ b/test/sql/basic.sql
@@ -0,0 +1,107 @@
+/*
+ * Hashset Type
+ */
+
+SELECT '{}'::int4hashset; -- empty int4hashset
+SELECT '{1,2,3}'::int4hashset;
+SELECT '{-2147483648,0,2147483647}'::int4hashset;
+SELECT '{-2147483649}'::int4hashset; -- out of range
+SELECT '{2147483648}'::int4hashset; -- out of range
+
+/*
+ * Hashset Functions
+ */
+
+SELECT int4hashset();
+SELECT int4hashset(
+ capacity := 10,
+ load_factor := 0.9,
+ growth_factor := 1.1,
+ hashfn_id := 1
+);
+SELECT hashset_add(int4hashset(), 123);
+SELECT hashset_add('{123}'::int4hashset, 456);
+SELECT hashset_contains('{123,456}'::int4hashset, 456); -- true
+SELECT hashset_contains('{123,456}'::int4hashset, 789); -- false
+SELECT hashset_union('{1,2}'::int4hashset, '{2,3}'::int4hashset);
+SELECT hashset_to_array('{1,2,3}'::int4hashset);
+SELECT hashset_cardinality('{1,2,3}'::int4hashset); -- 3
+SELECT hashset_capacity(int4hashset(capacity := 10)); -- 10
+SELECT hashset_intersection('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+SELECT hashset_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+SELECT hashset_symmetric_difference('{1,2}'::int4hashset,'{2,3}'::int4hashset);
+
+/*
+ * Aggregation Functions
+ */
+
+SELECT hashset_agg(i) FROM generate_series(1,10) AS i;
+
+SELECT hashset_agg(h) FROM
+(
+ SELECT hashset_agg(i) AS h FROM generate_series(1,5) AS i
+ UNION ALL
+ SELECT hashset_agg(j) AS h FROM generate_series(6,10) AS j
+) q;
+
+/*
+ * Operator Definitions
+ */
+
+SELECT '{2}'::int4hashset = '{1}'::int4hashset; -- false
+SELECT '{2}'::int4hashset = '{2}'::int4hashset; -- true
+SELECT '{2}'::int4hashset = '{3}'::int4hashset; -- false
+
+SELECT '{1,2,3}'::int4hashset = '{1,2,3}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset = '{2,3,1}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset = '{4,5,6}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset = '{1,2}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset = '{1,2,3,4}'::int4hashset; -- false
+
+SELECT '{2}'::int4hashset <> '{1}'::int4hashset; -- true
+SELECT '{2}'::int4hashset <> '{2}'::int4hashset; -- false
+SELECT '{2}'::int4hashset <> '{3}'::int4hashset; -- true
+
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset <> '{2,3,1}'::int4hashset; -- false
+SELECT '{1,2,3}'::int4hashset <> '{4,5,6}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset <> '{1,2}'::int4hashset; -- true
+SELECT '{1,2,3}'::int4hashset <> '{1,2,3,4}'::int4hashset; -- true
+
+SELECT '{1,2,3}'::int4hashset || 4;
+SELECT 4 || '{1,2,3}'::int4hashset;
+
+/*
+ * Hashset Hash Operators
+ */
+
+SELECT hashset_hash('{1,2,3}'::int4hashset);
+SELECT hashset_hash('{3,2,1}'::int4hashset);
+
+SELECT COUNT(*), COUNT(DISTINCT h)
+FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{3,2,1}'::int4hashset AS h
+) q;
+
+/*
+ * Hashset Btree Operators
+ *
+ * Ordering of hashsets is not based on lexicographic order of elements.
+ * - If two hashsets are not equal, they retain consistent relative order.
+ * - If two hashsets are equal but have elements in different orders, their
+ * ordering is non-deterministic. This is inherent since the comparison
+ * function must return 0 for equal hashsets, giving no indication of order.
+ */
+
+SELECT h FROM
+(
+ SELECT '{1,2,3}'::int4hashset AS h
+ UNION ALL
+ SELECT '{4,5,6}'::int4hashset AS h
+ UNION ALL
+ SELECT '{7,8,9}'::int4hashset AS h
+) q
+ORDER BY h;
diff --git a/test/sql/benchmark.sql b/test/sql/benchmark.sql
new file mode 100644
index 0000000..e7a53f1
--- /dev/null
+++ b/test/sql/benchmark.sql
@@ -0,0 +1,191 @@
+DROP EXTENSION IF EXISTS hashset CASCADE;
+CREATE EXTENSION hashset;
+
+\timing on
+
+\echo * Benchmark array_agg(DISTINCT ...) vs hashset_agg()
+
+DROP TABLE IF EXISTS benchmark_input_100k;
+DROP TABLE IF EXISTS benchmark_input_10M;
+DROP TABLE IF EXISTS benchmark_array_agg;
+DROP TABLE IF EXISTS benchmark_hashset_agg;
+
+SELECT setseed(0.12345);
+
+CREATE TABLE benchmark_input_100k AS
+SELECT
+ i,
+ i/10 AS j,
+ (floor(4294967296 * random()) - 2147483648)::int AS rnd
+FROM generate_series(1,100000) AS i;
+
+CREATE TABLE benchmark_input_10M AS
+SELECT
+ i,
+ i/10 AS j,
+ (floor(4294967296 * random()) - 2147483648)::int AS rnd
+FROM generate_series(1,10000000) AS i;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 100k unique integers
+CREATE TABLE benchmark_array_agg AS
+SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
+CREATE TABLE benchmark_hashset_agg AS
+SELECT hashset_agg(i) FROM benchmark_input_100k;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 10M unique integers
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(i) FROM benchmark_input_10M;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 100k integers (10% uniqueness)
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT j) FROM benchmark_input_100k;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(j) FROM benchmark_input_100k;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 10M integers (10% uniqueness)
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT j) FROM benchmark_input_10M;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(j) FROM benchmark_input_10M;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 100k random integers
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT rnd) FROM benchmark_input_100k;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(rnd) FROM benchmark_input_100k;
+
+\echo *** Benchmark array_agg(DISTINCT ...) vs hashset_agg(...) for 10M random integers
+INSERT INTO benchmark_array_agg
+SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
+INSERT INTO benchmark_hashset_agg
+SELECT hashset_agg(rnd) FROM benchmark_input_10M;
+
+SELECT cardinality(array_agg) FROM benchmark_array_agg ORDER BY 1;
+
+SELECT
+ hashset_cardinality(hashset_agg),
+ hashset_capacity(hashset_agg),
+ hashset_collisions(hashset_agg),
+ hashset_max_collisions(hashset_agg)
+FROM benchmark_hashset_agg;
+
+SELECT hashset_capacity(hashset_agg(rnd)) FROM benchmark_input_10M;
+
+\echo * Benchmark different hash functions
+
+\echo *** Elements in sequence 1..100000
+
+\echo - Testing default hash function (Jenkins/lookup3)
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 1);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing Murmurhash32
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 2);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo - Testing naive hash function
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 3);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, i);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+\echo *** Testing 100000 random ints
+
+SELECT setseed(0.12345);
+\echo - Testing default hash function (Jenkins/lookup3)
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 1);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+SELECT setseed(0.12345);
+\echo - Testing Murmurhash32
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 2);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
+
+SELECT setseed(0.12345);
+\echo - Testing naive hash function
+
+DO
+$$
+DECLARE
+ h int4hashset;
+BEGIN
+ h := int4hashset(hashfn_id := 3);
+ FOR i IN 1..100000 LOOP
+ h := hashset_add(h, (floor(4294967296 * random()) - 2147483648)::int);
+ END LOOP;
+ RAISE NOTICE 'hashset_cardinality: %', hashset_cardinality(h);
+ RAISE NOTICE 'hashset_capacity: %', hashset_capacity(h);
+ RAISE NOTICE 'hashset_collisions: %', hashset_collisions(h);
+ RAISE NOTICE 'hashset_max_collisions: %', hashset_max_collisions(h);
+END
+$$ LANGUAGE plpgsql;
diff --git a/test/sql/invalid.sql b/test/sql/invalid.sql
new file mode 100644
index 0000000..43689ab
--- /dev/null
+++ b/test/sql/invalid.sql
@@ -0,0 +1 @@
+SELECT '{1,2s}'::int4hashset;
diff --git a/test/sql/io_varying_lengths.sql b/test/sql/io_varying_lengths.sql
new file mode 100644
index 0000000..8acb6b8
--- /dev/null
+++ b/test/sql/io_varying_lengths.sql
@@ -0,0 +1,21 @@
+/*
+ * This test verifies the hashset input/output functions for varying
+ * initial capacities, ensuring functionality across different sizes.
+ */
+
+SELECT hashset_sorted('{1}'::int4hashset);
+SELECT hashset_sorted('{1,2}'::int4hashset);
+SELECT hashset_sorted('{1,2,3}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}'::int4hashset);
+SELECT hashset_sorted('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}'::int4hashset);
diff --git a/test/sql/parsing.sql b/test/sql/parsing.sql
new file mode 100644
index 0000000..1e56bbe
--- /dev/null
+++ b/test/sql/parsing.sql
@@ -0,0 +1,23 @@
+/* Valid */
+SELECT '{1,23,-456}'::int4hashset;
+SELECT ' { 1 , 23 , -456 } '::int4hashset;
+
+/* Only whitespace is allowed after the closing brace */
+SELECT ' { 1 , 23 , -456 } 1'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } ,'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } {'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } }'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 } x'::int4hashset; -- error
+
+/* Unexpected character when expecting closing brace */
+SELECT ' { 1 , 23 , -456 1'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 {'::int4hashset; -- error
+SELECT ' { 1 , 23 , -456 x'::int4hashset; -- error
+
+/* Error handling for strtol */
+SELECT ' { , 23 , -456 } '::int4hashset; -- error
+SELECT ' { 1 , 23 , '::int4hashset; -- error
+SELECT ' { s , 23 , -456 } '::int4hashset; -- error
+
+/* Missing opening brace */
+SELECT ' 1 , 23 , -456 } '::int4hashset; -- error
diff --git a/test/sql/prelude.sql b/test/sql/prelude.sql
new file mode 100644
index 0000000..2fee0fc
--- /dev/null
+++ b/test/sql/prelude.sql
@@ -0,0 +1,8 @@
+CREATE EXTENSION hashset;
+
+CREATE OR REPLACE FUNCTION hashset_sorted(int4hashset)
+RETURNS TEXT AS
+$$
+SELECT array_agg(i ORDER BY i::int)::text
+FROM regexp_split_to_table(regexp_replace($1::text,'^{|}$','','g'),',') i
+$$ LANGUAGE sql;
diff --git a/test/sql/random.sql b/test/sql/random.sql
new file mode 100644
index 0000000..7cc8f87
--- /dev/null
+++ b/test/sql/random.sql
@@ -0,0 +1,27 @@
+SELECT setseed(0.12345);
+
+\set MAX_INT 2147483647
+
+CREATE TABLE hashset_random_int4_numbers AS
+ SELECT
+ (random()*:MAX_INT)::int AS i
+ FROM generate_series(1,(random()*10000)::int)
+;
+
+SELECT
+ md5(hashset_sorted)
+FROM
+(
+ SELECT
+ hashset_sorted(int4hashset(format('{%s}',string_agg(i::text,','))))
+ FROM hashset_random_int4_numbers
+) q;
+
+SELECT
+ md5(input_sorted)
+FROM
+(
+ SELECT
+ format('{%s}',string_agg(i::text,',' ORDER BY i)) AS input_sorted
+ FROM hashset_random_int4_numbers
+) q;
diff --git a/test/sql/reported_bugs.sql b/test/sql/reported_bugs.sql
new file mode 100644
index 0000000..9e6b617
--- /dev/null
+++ b/test/sql/reported_bugs.sql
@@ -0,0 +1,85 @@
+/*
+ * Bug in hashset_add() and hashset_union() functions altering original hashset.
+ *
+ * Previously, the hashset_add() and hashset_union() functions were modifying the
+ * original hashset in-place, leading to unexpected results as the original data
+ * within the hashset was being altered.
+ *
+ * The issue was addressed by implementing a macro function named
+ * PG_GETARG_INT4HASHSET_COPY() within the C code. This function guarantees that
+ * a copy of the hashset is created and subsequently modified, thereby preserving
+ * the integrity of the original hashset.
+ *
+ * As a result of this fix, hashset_add() and hashset_union() now operate on
+ * a copied hashset, ensuring that the original data remains unaltered, and
+ * the query executes correctly.
+ */
+SELECT
+ q.hashset_agg,
+ hashset_add(hashset_agg,4)
+FROM
+(
+ SELECT
+ hashset_agg(generate_series)
+ FROM generate_series(1,3)
+) q;
+
+/*
+ * Bug in hashset_hash() function with respect to element insertion order.
+ *
+ * Prior to the fix, the hashset_hash() function was accumulating the hashes
+ * of individual elements in a non-commutative manner. As a consequence, the
+ * final hash value was sensitive to the order in which elements were inserted
+ * into the hashset. This behavior led to inconsistencies, as logically
+ * equivalent sets (i.e., sets with the same elements but in different orders)
+ * produced different hash values.
+ *
+ * The bug was fixed by modifying the hashset_hash() function to use a
+ * commutative operation when combining the hashes of individual elements.
+ * This change ensures that the final hash value is independent of the
+ * element insertion order, and logically equivalent sets produce the
+ * same hash.
+ */
+SELECT hashset_hash('{1,2}'::int4hashset);
+SELECT hashset_hash('{2,1}'::int4hashset);
+
+SELECT hashset_cmp('{1,2}','{2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2,1}')
+UNION
+SELECT hashset_cmp('{1,2}','{1,2}');
+
+/*
+ * Bug in int4hashset_resize() not utilizing growth_factor.
+ *
+ * The previous implementation hard-coded a growth factor of 2, neglecting
+ * the struct's growth_factor field. This bug was addressed by properly
+ * using growth_factor for new capacity calculation, with an additional
+ * safety check to prevent possible infinite loops in resizing.
+ */
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 1.1
+), 123), 456));
+
+SELECT hashset_capacity(hashset_add(hashset_add(int4hashset(
+ capacity := 0,
+ load_factor := 0.75,
+ growth_factor := 10
+), 123), 456));
+
+/*
+ * Bug in int4hashset_capacity() not detoasting input correctly.
+ */
+SELECT hashset_capacity(int4hashset(capacity:=10)) AS capacity_10;
+SELECT hashset_capacity(int4hashset(capacity:=1000)) AS capacity_1000;
+SELECT hashset_capacity(int4hashset(capacity:=100000)) AS capacity_100000;
+
+CREATE TABLE test_capacity_10 AS SELECT int4hashset(capacity:=10) AS capacity_10;
+CREATE TABLE test_capacity_1000 AS SELECT int4hashset(capacity:=1000) AS capacity_1000;
+CREATE TABLE test_capacity_100000 AS SELECT int4hashset(capacity:=100000) AS capacity_100000;
+
+SELECT hashset_capacity(capacity_10) AS capacity_10 FROM test_capacity_10;
+SELECT hashset_capacity(capacity_1000) AS capacity_1000 FROM test_capacity_1000;
+SELECT hashset_capacity(capacity_100000) AS capacity_100000 FROM test_capacity_100000;
diff --git a/test/sql/table.sql b/test/sql/table.sql
new file mode 100644
index 0000000..bf05ffa
--- /dev/null
+++ b/test/sql/table.sql
@@ -0,0 +1,10 @@
+CREATE TABLE users (
+ user_id int PRIMARY KEY,
+ user_likes int4hashset DEFAULT int4hashset(capacity := 2)
+);
+INSERT INTO users (user_id) VALUES (1);
+UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
+UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
+SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1;
+SELECT hashset_cardinality(user_likes) FROM users WHERE user_id = 1;
+SELECT hashset_sorted(user_likes) FROM users WHERE user_id = 1;
hashset-0.0.1-a775594-incremental.patchapplication/octet-stream; name="=?UTF-8?Q?hashset-0.0.1-a775594-incremental.patch?="Download
diff --git a/README.md b/README.md
index 91af6ee..aaa84fd 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,40 @@
# hashset
This PostgreSQL extension implements hashset, a data structure (type)
-providing a collection of unique, not null integer items with fast lookup.
+providing a collection of unique integer items with fast lookup.
+It provides several functions for working with these sets, including operations
+like addition, containment check, conversion to array, union, intersection,
+difference, equality check, and cardinality calculation.
+
+`NULL` values are also allowed in the hash set, and are considered as a unique
+element. When multiple `NULL` values are present in the input, they are treated
+as a single `NULL`.
+
+## Table of Contents
+1. [Version](#version)
+2. [Data Types](#data-types)
+ - [int4hashset](#int4hashset)
+3. [Functions](#functions)
+ - [int4hashset](#int4hashset-1)
+ - [hashset_add](#hashset_add)
+ - [hashset_contains](#hashset_contains)
+ - [hashset_to_array](#hashset_to_array)
+ - [hashset_to_sorted_array](#hashset_to_sorted_array)
+ - [hashset_cardinality](#hashset_cardinality)
+ - [hashset_capacity](#hashset_capacity)
+ - [hashset_max_collisions](#hashset_max_collisions)
+ - [hashset_union](#hashset_union)
+ - [hashset_intersection](#hashset_intersection)
+ - [hashset_difference](#hashset_difference)
+ - [hashset_symmetric_difference](#hashset_symmetric_difference)
+4. [Aggregation Functions](#aggregation-functions)
+5. [Operators](#operators)
+6. [Hashset Hash Operators](#hashset-hash-operators)
+7. [Hashset Btree Operators](#hashset-btree-operators)
+8. [Limitations](#limitations)
+9. [Installation](#installation)
+10. [License](#license)
## Version
@@ -14,80 +46,235 @@ with possible breaking changes, we are not providing any migration scripts
until we reach our first release.
-## Usage
-
-After installing the extension, you can use the `int4hashset` data type and
-associated functions within your PostgreSQL queries.
-
-To demonstrate the usage, let's consider a hypothetical table `users` which has
-a `user_id` and a `user_likes` of type `int4hashset`.
-
-Firstly, let's create the table:
-
-```sql
-CREATE TABLE users(
- user_id int PRIMARY KEY,
- user_likes int4hashset DEFAULT int4hashset()
-);
-```
-In the above statement, the `int4hashset()` initializes an empty hashset
-with zero capacity. The hashset will automatically resize itself when more
-elements are added.
-
-Now, we can perform operations on this table. Here are some examples:
-
-```sql
--- Insert a new user with id 1. The user_likes will automatically be initialized
--- as an empty hashset
-INSERT INTO users (user_id) VALUES (1);
-
--- Add elements (likes) for a user
-UPDATE users SET user_likes = hashset_add(user_likes, 101) WHERE user_id = 1;
-UPDATE users SET user_likes = hashset_add(user_likes, 202) WHERE user_id = 1;
-
--- Check if a user likes a particular item
-SELECT hashset_contains(user_likes, 101) FROM users WHERE user_id = 1; -- true
-
--- Count the number of likes a user has
-SELECT hashset_cardinality(user_likes) FROM users WHERE user_id = 1; -- 2
-```
-
-You can also use the aggregate functions to perform operations on multiple rows.
-
-
## Data types
-- **int4hashset**: This data type represents a set of integers. Internally, it uses
-a combination of a bitmap and a value array to store the elements in a set. It's
-a variable-length type.
+### int4hashset
+
+This data type represents a set of integers. Internally, it uses a combination
+of a bitmap and a value array to store the elements in a set. It's a
+variable-length type.
## Functions
-- `int4hashset([capacity int, load_factor float4, growth_factor float4, hashfn_id int4]) -> int4hashset`:
- Initialize an empty int4hashset with optional parameters.
- - `capacity` specifies the initial capacity, which is zero by default.
- - `load_factor` represents the threshold for resizing the hashset and defaults to 0.75.
- - `growth_factor` is the multiplier for resizing and defaults to 2.0.
- - `hashfn_id` represents the hash function used.
- - 1=Jenkins/lookup3 (default)
- - 2=MurmurHash32
- - 3=Naive hash function
-- `hashset_add(int4hashset, int) -> int4hashset`: Adds an integer to an int4hashset.
-- `hashset_contains(int4hashset, int) -> boolean`: Checks if an int4hashset contains a given integer.
-- `hashset_union(int4hashset, int4hashset) -> int4hashset`: Merges two int4hashsets into a new int4hashset.
-- `hashset_to_array(int4hashset) -> int[]`: Converts an int4hashset to an array of integers.
-- `hashset_cardinality(int4hashset) -> bigint`: Returns the number of elements in an int4hashset.
-- `hashset_capacity(int4hashset) -> bigint`: Returns the current capacity of an int4hashset.
-- `hashset_max_collisions(int4hashset) -> bigint`: Returns the maximum number of collisions that have occurred for a single element
-- `hashset_intersection(int4hashset, int4hashset) -> int4hashset`: Returns a new int4hashset that is the intersection of the two input sets.
-- `hashset_difference(int4hashset, int4hashset) -> int4hashset`: Returns a new int4hashset that contains the elements present in the first set but not in the second set.
-- `hashset_symmetric_difference(int4hashset, int4hashset) -> int4hashset`: Returns a new int4hashset containing elements that are in either of the input sets, but not in their intersection.
+### int4hashset()
+
+`int4hashset([capacity int, load_factor float4, growth_factor float4, hashfn_id int4]) -> int4hashset`
+
+Initialize an empty int4hashset with optional parameters.
+ - `capacity` specifies the initial capacity, which is zero by default.
+ - `load_factor` represents the threshold for resizing the hashset and defaults to 0.75.
+ - `growth_factor` is the multiplier for resizing and defaults to 2.0.
+ - `hashfn_id` represents the hash function used.
+ - 1=Jenkins/lookup3 (default)
+ - 2=MurmurHash32
+ - 3=Naive hash function
+
+
+### hashset_add()
+
+`hashset_add(int4hashset, int) -> int4hashset`
+
+Adds an integer to an int4hashset.
+
+```sql
+SELECT hashset_add(NULL, 1); -- {1}
+SELECT hashset_add('{NULL}', 1); -- {1,NULL}
+SELECT hashset_add('{1}', NULL); -- {1,NULL}
+SELECT hashset_add('{1}', 1); -- {1}
+SELECT hashset_add('{1}', 2); -- {1,2}
+```
+
+
+### hashset_contains()
+
+`hashset_contains(int4hashset, int) -> boolean`
+
+Checks if an int4hashset contains a given integer.
+
+```sql
+SELECT hashset_contains('{1}', 1); -- TRUE
+SELECT hashset_contains('{1}', 2); -- FALSE
+```
+
+If the *cardinality* of the hashset is zero (0), it is known that it doesn't
+contain any value, not even an Unknown value represented as `NULL`, so even in
+that case it returns `FALSE`.
+
+```sql
+SELECT hashset_contains('{}', 1); -- FALSE
+SELECT hashset_contains('{}', NULL); -- FALSE
+```
+
+If the hashset is `NULL`, then the result is `NULL`.
+
+```sql
+SELECT hashset_contains(NULL, NULL); -- NULL
+SELECT hashset_contains(NULL, 1); -- NULL
+```
+
+
+### hashset_to_array()
+
+`hashset_to_array(int4hashset) -> int[]`
+
+Converts an int4hashset to an array of unsorted integers.
+
+```sql
+SELECT hashset_to_array('{2,1,3}'); -- {3,2,1}
+```
+
+
+### hashset_to_sorted_array()
+
+`hashset_to_sorted_array(int4hashset) -> int[]`
+
+Converts an int4hashset to an array of sorted integers.
+
+```sql
+SELECT hashset_to_sorted_array('{2,1,3}'); -- {1,2,3}
+```
+
+If the hashset contains a `NULL` element, it follows the same behavior as the
+`ORDER BY` clause in SQL: the `NULL` element is positioned at the end of the
+sorted array.
+
+```sql
+SELECT hashset_to_sorted_array('{2,1,NULL,3}'); -- {1,2,3,NULL}
+```
+
+
+### hashset_cardinality()
+
+`hashset_cardinality(int4hashset) -> bigint`
+
+Returns the number of elements in an int4hashset.
+
+```sql
+SELECT hashset_cardinality(NULL); -- NULL
+SELECT hashset_cardinality('{}'); -- 0
+SELECT hashset_cardinality('{1}'); -- 1
+SELECT hashset_cardinality('{1,1}'); -- 1
+SELECT hashset_cardinality('{NULL,NULL}'); -- 1
+SELECT hashset_cardinality('{1,NULL}'); -- 2
+SELECT hashset_cardinality('{1,2,3}'); -- 3
+```
+
+
+### hashset_capacity()
+
+`hashset_capacity(int4hashset) -> bigint`
+
+Returns the current capacity of an int4hashset.
+
+
+### hashset_max_collisions()
+
+`hashset_max_collisions(int4hashset) -> bigint`
+
+Returns the maximum number of collisions that have occurred for a single element
+
+
+### hashset_union()
+
+`hashset_union(int4hashset, int4hashset) -> int4hashset`
+
+Merges two int4hashsets into a new int4hashset.
+
+```sql
+SELECT hashset_union('{1,2}', '{2,3}'); -- '{1,2,3}
+```
+
+If any of the operands are `NULL`, the result is `NULL`.
+
+```sql
+SELECT hashset_union('{1}', NULL); -- NULL
+SELECT hashset_union(NULL, '{1}'); -- NULL
+```
+
+
+### hashset_intersection()
+
+`hashset_intersection(int4hashset, int4hashset) -> int4hashset`
+
+Returns a new int4hashset that is the intersection of the two input sets.
+
+```sql
+SELECT hashset_intersection('{1,2}', '{2,3}'); -- {2}
+SELECT hashset_intersection('{1,2,NULL}', '{2,3,NULL}'); -- {2,NULL}
+```
+
+If any of the operands are `NULL`, the result is `NULL`.
+
+```sql
+SELECT hashset_intersection('{1,2}', NULL); -- NULL
+SELECT hashset_intersection(NULL, '{2,3}'); -- NULL
+```
+
+
+### hashset_difference()
+
+`hashset_difference(int4hashset, int4hashset) -> int4hashset`
+
+Returns a new int4hashset that contains the elements present in the first set
+but not in the second set.
+
+```sql
+SELECT hashset_difference('{1,2}', '{2,3}'); -- {1}
+SELECT hashset_difference('{1,2,NULL}', '{2,3,NULL}'); -- {1}
+SELECT hashset_difference('{1,2,NULL}', '{2,3}'); -- {1,NULL}
+```
+
+If any of the operands are `NULL`, the result is `NULL`.
+
+```sql
+SELECT hashset_difference('{1,2}', NULL); -- NULL
+SELECT hashset_difference(NULL, '{2,3}'); -- NULL
+```
+
+
+### hashset_symmetric_difference()
+
+`hashset_symmetric_difference(int4hashset, int4hashset) -> int4hashset`
+
+Returns a new int4hashset containing elements that are in either of the input sets, but not in their intersection.
+
+```sql
+SELECT hashset_symmetric_difference('{1,2}', '{2,3}'); -- {1,3}
+SELECT hashset_symmetric_difference('{1,2,NULL}', '{2,3,NULL}'); -- {1,3}
+SELECT hashset_symmetric_difference('{1,2,NULL}', '{2,3}'); -- {1,3,NULL}
+```
+
+If any of the operands are `NULL`, the result is `NULL`.
+
+```sql
+SELECT hashset_symmetric_difference('{1,2}', NULL); -- NULL
+SELECT hashset_symmetric_difference(NULL, '{2,3}'); -- NULL
+```
+
## Aggregation Functions
-- `hashset_agg(int) -> int4hashset`: Aggregate integers into a hashset.
-- `hashset_agg(int4hashset) -> int4hashset`: Aggregate hashsets into a hashset.
+### hashset_agg(int4)
+
+`hashset_agg(int4) -> int4hashset`
+
+Aggregate integers into a hashset.
+
+```sql
+SELECT hashset_agg(some_int4_column) FROM some_table;
+```
+
+
+### hashset_agg(int4hashset)
+
+`hashset_agg(int4hashset) -> int4hashset`
+
+Aggregate hashsets into a hashset.
+
+```sql
+SELECT hashset_agg(some_int4hashset_column) FROM some_table;
+```
## Operators
diff --git a/benchmark/.gitignore b/benchmark/.gitignore
new file mode 100644
index 0000000..f3c4e1c
--- /dev/null
+++ b/benchmark/.gitignore
@@ -0,0 +1,2 @@
+soc-pokec-relationships.txt.gz
+soc-pokec-relationships.txt
diff --git a/benchmark/friends_of_friends.sh b/benchmark/friends_of_friends.sh
new file mode 100755
index 0000000..d4570c0
--- /dev/null
+++ b/benchmark/friends_of_friends.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+if [ ! -f "soc-pokec-relationships.txt" ]; then
+ wget https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz
+ gunzip soc-pokec-relationships.txt.gz
+fi
+
+psql -X -c "CREATE TABLE edges (from_node INT, to_node INT);"
+psql -X -c "\COPY edges FROM soc-pokec-relationships.txt;"
+psql -X -c "ALTER TABLE edges ADD PRIMARY KEY (from_node, to_node);"
+psql -X -f friends_of_friends.sql
diff --git a/benchmark/friends_of_friends.sql b/benchmark/friends_of_friends.sql
new file mode 100644
index 0000000..ecc8f0e
--- /dev/null
+++ b/benchmark/friends_of_friends.sql
@@ -0,0 +1,72 @@
+CREATE EXTENSION IF NOT EXISTS hashset;
+
+\timing on
+
+CREATE OR REPLACE VIEW vfriends_of_friends_array_agg_distinct AS
+WITH RECURSIVE friends_of_friends AS (
+ SELECT
+ ARRAY[5867::bigint] AS current,
+ 0 AS depth
+ UNION ALL
+ SELECT
+ new_current,
+ friends_of_friends.depth + 1
+ FROM
+ friends_of_friends
+ CROSS JOIN LATERAL (
+ SELECT
+ array_agg(DISTINCT edges.to_node) AS new_current
+ FROM
+ edges
+ WHERE
+ from_node = ANY(friends_of_friends.current)
+ ) q
+ WHERE
+ friends_of_friends.depth < 3
+)
+SELECT
+ COALESCE(array_length(current, 1), 0) AS count_friends_at_depth_3
+FROM
+ friends_of_friends
+WHERE
+ depth = 3;
+
+CREATE OR REPLACE VIEW vfriends_of_friends_hashset_agg AS
+WITH RECURSIVE friends_of_friends AS
+(
+ SELECT
+ '{5867}'::int4hashset AS current,
+ 0 AS depth
+ UNION ALL
+ SELECT
+ new_current,
+ friends_of_friends.depth + 1
+ FROM
+ friends_of_friends
+ CROSS JOIN LATERAL
+ (
+ SELECT
+ hashset_agg(edges.to_node) AS new_current
+ FROM
+ edges
+ WHERE
+ from_node = ANY(hashset_to_array(friends_of_friends.current))
+ ) q
+ WHERE
+ friends_of_friends.depth < 3
+)
+SELECT
+ depth,
+ hashset_cardinality(current)
+FROM
+ friends_of_friends
+WHERE
+ depth = 3;
+
+SELECT * FROM vfriends_of_friends_array_agg_distinct;
+SELECT * FROM vfriends_of_friends_array_agg_distinct;
+SELECT * FROM vfriends_of_friends_array_agg_distinct;
+
+SELECT * FROM vfriends_of_friends_hashset_agg;
+SELECT * FROM vfriends_of_friends_hashset_agg;
+SELECT * FROM vfriends_of_friends_hashset_agg;
diff --git a/hashset-api.c b/hashset-api.c
index a4beef4..2b92363 100644
--- a/hashset-api.c
+++ b/hashset-api.c
@@ -1,8 +1,6 @@
#include "hashset.h"
-#include <stdio.h>
#include <math.h>
-#include <string.h>
#include <sys/time.h>
#include <unistd.h>
#include <limits.h>
@@ -188,15 +186,11 @@ int4hashset_in(PG_FUNCTION_ARGS)
Datum
int4hashset_out(PG_FUNCTION_ARGS)
{
- int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
- char *bitmap;
- int32 *values;
- int i;
- StringInfoData str;
-
- /* Calculate the pointer to the bitmap and values array */
- bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+ int i;
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
+ char *bitmap = HASHSET_GET_BITMAP(set);
+ int32 *values = HASHSET_GET_VALUES(set);
+ StringInfoData str;
/* Initialize the StringInfo buffer */
initStringInfo(&str);
@@ -412,8 +406,8 @@ int4hashset_union(PG_FUNCTION_ARGS)
int i;
int4hashset_t *seta = PG_GETARG_INT4HASHSET_COPY(0);
int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
- char *bitmap = setb->data;
- int32 *values = (int32 *) (bitmap + CEIL_DIV(setb->capacity, 8));
+ char *bitmap = HASHSET_GET_BITMAP(setb);
+ int32 *values = HASHSET_GET_VALUES(setb);
for (i = 0; i < setb->capacity; i++)
{
@@ -593,14 +587,9 @@ int4hashset_agg_add_set(PG_FUNCTION_ARGS)
{
int i;
- char *bitmap;
- int32 *values;
- int4hashset_t *value;
-
- value = PG_GETARG_INT4HASHSET(1);
-
- bitmap = value->data;
- values = (int32 *) (value->data + CEIL_DIV(value->capacity, 8));
+ int4hashset_t *value = PG_GETARG_INT4HASHSET(1);
+ char *bitmap = HASHSET_GET_BITMAP(value);
+ int32 *values = HASHSET_GET_VALUES(value);
for (i = 0; i < value->capacity; i++)
{
@@ -666,8 +655,8 @@ int4hashset_agg_combine(PG_FUNCTION_ARGS)
src = (int4hashset_t *) PG_GETARG_POINTER(1);
dst = (int4hashset_t *) PG_GETARG_POINTER(0);
- bitmap = src->data;
- values = (int32 *) (src->data + CEIL_DIV(src->capacity, 8));
+ bitmap = HASHSET_GET_BITMAP(src);
+ values = HASHSET_GET_VALUES(src);
for (i = 0; i < src->capacity; i++)
{
@@ -687,22 +676,18 @@ int4hashset_to_array(PG_FUNCTION_ARGS)
{
int i,
idx;
- int4hashset_t *set;
+ int4hashset_t *set = PG_GETARG_INT4HASHSET(0);
int32 *values;
int nvalues;
char *sbitmap;
int32 *svalues;
- set = PG_GETARG_INT4HASHSET(0);
-
/* if hashset is empty and does not contain null, return an empty array */
- if(set->nelements == 0 && !set->null_element) {
- Datum d = PointerGetDatum(construct_empty_array(INT4OID));
- PG_RETURN_ARRAYTYPE_P(d);
- }
+ if(set->nelements == 0 && !set->null_element)
+ PG_RETURN_ARRAYTYPE_P(construct_empty_array(INT4OID));
- sbitmap = set->data;
- svalues = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+ sbitmap = HASHSET_GET_BITMAP(set);
+ svalues = HASHSET_GET_VALUES(set);
/* number of values to store in the array */
nvalues = set->nelements;
@@ -733,10 +718,8 @@ int4hashset_to_sorted_array(PG_FUNCTION_ARGS)
set = PG_GETARG_INT4HASHSET(0);
/* if hashset is empty and does not contain null, return an empty array */
- if(set->nelements == 0 && !set->null_element) {
- Datum d = PointerGetDatum(construct_empty_array(INT4OID));
- PG_RETURN_ARRAYTYPE_P(d);
- }
+ if(set->nelements == 0 && !set->null_element)
+ PG_RETURN_ARRAYTYPE_P(construct_empty_array(INT4OID));
/* extract the sorted elements from the hashset */
values = int4hashset_extract_sorted_elements(set);
@@ -762,8 +745,8 @@ int4hashset_eq(PG_FUNCTION_ARGS)
if (a->nelements != b->nelements)
PG_RETURN_BOOL(false);
- bitmap_a = a->data;
- values_a = (int32 *)(a->data + CEIL_DIV(a->capacity, 8));
+ bitmap_a = HASHSET_GET_BITMAP(a);
+ values_a = HASHSET_GET_VALUES(a);
/*
* Check if every element in a is also in b
@@ -940,8 +923,9 @@ int4hashset_intersection(PG_FUNCTION_ARGS)
int i;
int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
- char *bitmap = setb->data;
- int32 *values = (int32 *)(bitmap + CEIL_DIV(setb->capacity, 8));
+ char *bitmap = HASHSET_GET_BITMAP(setb);
+ int32 *values = HASHSET_GET_VALUES(setb);
+
int4hashset_t *intersection;
intersection = int4hashset_allocate(
@@ -976,8 +960,8 @@ int4hashset_difference(PG_FUNCTION_ARGS)
int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
int4hashset_t *difference;
- char *bitmap = seta->data;
- int32 *values = (int32 *)(bitmap + CEIL_DIV(seta->capacity, 8));
+ char *bitmap = HASHSET_GET_BITMAP(seta);
+ int32 *values = HASHSET_GET_VALUES(seta);
difference = int4hashset_allocate(
seta->capacity,
@@ -1011,10 +995,10 @@ int4hashset_symmetric_difference(PG_FUNCTION_ARGS)
int4hashset_t *seta = PG_GETARG_INT4HASHSET(0);
int4hashset_t *setb = PG_GETARG_INT4HASHSET(1);
int4hashset_t *result;
- char *bitmapa = seta->data;
- char *bitmapb = setb->data;
- int32 *valuesa = (int32 *) (bitmapa + CEIL_DIV(seta->capacity, 8));
- int32 *valuesb = (int32 *) (bitmapb + CEIL_DIV(setb->capacity, 8));
+ char *bitmapa = HASHSET_GET_BITMAP(seta);
+ char *bitmapb = HASHSET_GET_BITMAP(setb);
+ int32 *valuesa = HASHSET_GET_VALUES(seta);
+ int32 *valuesb = HASHSET_GET_VALUES(setb);
result = int4hashset_allocate(
seta->nelements + setb->nelements,
diff --git a/hashset.c b/hashset.c
index 91907ab..65ab25f 100644
--- a/hashset.c
+++ b/hashset.c
@@ -58,8 +58,8 @@ int4hashset_resize(int4hashset_t * set)
{
int i;
int4hashset_t *new;
- char *bitmap;
- int32 *values;
+ char *bitmap = HASHSET_GET_BITMAP(set);
+ int32 *values = HASHSET_GET_VALUES(set);
int new_capacity;
new_capacity = (int)(set->capacity * set->growth_factor);
@@ -80,10 +80,6 @@ int4hashset_resize(int4hashset_t * set)
set->hashfn_id
);
- /* Calculate the pointer to the bitmap and values array */
- bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
-
for (i = 0; i < set->capacity; i++)
{
int byte = (i / 8);
@@ -131,8 +127,8 @@ int4hashset_add_element(int4hashset_t *set, int32 value)
position = hash % set->capacity;
- bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+ bitmap = HASHSET_GET_BITMAP(set);
+ values = HASHSET_GET_VALUES(set);
while (true)
{
@@ -178,8 +174,8 @@ int4hashset_contains_element(int4hashset_t *set, int32 value)
int bit;
uint32 hash;
uint32 position;
- char *bitmap;
- int32 *values;
+ char *bitmap = HASHSET_GET_BITMAP(set);
+ int32 *values = HASHSET_GET_VALUES(set);
int num_probes = 0; /* Counter for the number of probes */
if (set->hashfn_id == JENKINS_LOOKUP3_HASHFN_ID)
@@ -203,9 +199,6 @@ int4hashset_contains_element(int4hashset_t *set, int32 value)
position = hash % set->capacity;
- bitmap = set->data;
- values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
-
while (true)
{
byte = (position / 8);
@@ -233,15 +226,10 @@ int4hashset_contains_element(int4hashset_t *set, int32 value)
int32 *
int4hashset_extract_sorted_elements(int4hashset_t *set)
{
- /* Allocate memory for the elements array */
- int32 *elements = palloc(set->nelements * sizeof(int32));
-
- /* Access the data array */
- char *bitmap = set->data;
- int32 *values = (int32 *)(set->data + CEIL_DIV(set->capacity, 8));
-
- /* Counter for the number of extracted elements */
- int32 nextracted = 0;
+ int32 *elements = palloc(set->nelements * sizeof(int32));
+ char *bitmap = HASHSET_GET_BITMAP(set);
+ int32 *values = HASHSET_GET_VALUES(set);
+ int32 nextracted = 0;
/* Iterate through all elements */
for (int32 i = 0; i < set->capacity; i++)
diff --git a/hashset.h b/hashset.h
index 86f5d1b..3631e22 100644
--- a/hashset.h
+++ b/hashset.h
@@ -12,6 +12,8 @@
#include "common/hashfn.h"
#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
+#define HASHSET_GET_BITMAP(set) ((set)->data)
+#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))
#define HASHSET_STEP 13
#define JENKINS_LOOKUP3_HASHFN_ID 1
#define MURMURHASH32_HASHFN_ID 2
On Thu, Jun 29, 2023 at 4:43 PM Joel Jacobson <joel@compiler.org> wrote:
On Thu, Jun 29, 2023, at 08:54, jian he wrote:
Anyway, this time, I added another macro,which seems to simplify the code.
#define SET_DATA_PTR(a) \
(((char *) (a->data)) + CEIL_DIV(a->capacity, 8))it passed all the tests on my local machine.
Hmm, this is interesting. There is a bug in your second patch,
that the tests catch, so it's really surprising if they pass on your machine.Can you try to run `make clean && make && make install && make installcheck`?
I would guess you forgot to recompile or reinstall.
This is the bug in 0002-marco-SET_DATA_PTR-to-quicly-access-hashset-data-reg.patch:
@@ -411,7 +411,7 @@ int4hashset_union(PG_FUNCTION_ARGS) int4hashset_t *seta = PG_GETARG_INT4HASHSET_COPY(0); int4hashset_t *setb = PG_GETARG_INT4HASHSET(1); char *bitmap = setb->data; - int32 *values = (int32 *) (bitmap + CEIL_DIV(setb->capacity, 8)); + int32 *values = (int32 *) SET_DATA_PTR(seta);You accidentally replaced `setb` with `seta`.
I renamed the macro to HASHSET_GET_VALUES and changed it slightly,
also added a HASHSET_GET_BITMAP for completeness:#define HASHSET_GET_BITMAP(set) ((set)->data)
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))Instead of your version:
#define SET_DATA_PTR(a) \
(((char *) (a->data)) + CEIL_DIV(a->capacity, 8))Changes:
* Parenthesize macro parameters.
* Prefix the macro names with "HASHSET_" to avoid potential conflicts.
* "GET_VALUES" more clearly communicates that it's the values we're extracting.New patch attached.
Other changes in same commit:
* Add original friends-of-friends graph query to new benchmark/ directory
* Add table of content to README
* Update docs: Explain null semantics and add function examples
* Simplify empty hashset handling, remove unused includes/Joel
more like a C questions
in this context does
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
CEIL_DIV((set)->capacity, 8)))
define first, then define struct int4hashset_t. Is this normally ok?
Also does
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
CEIL_DIV((set)->capacity, 8)))
remove (int32 *) will make it generic? then when you use it, you can
cast whatever type you like?
On Fri, Jun 30, 2023, at 06:50, jian he wrote:
more like a C questions
in this context does
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
CEIL_DIV((set)->capacity, 8)))
define first, then define struct int4hashset_t. Is this normally ok?
Yes, it's fine. Macros are just text substitutions done pre-compilation.
Also does
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
CEIL_DIV((set)->capacity, 8)))remove (int32 *) will make it generic? then when you use it, you can
cast whatever type you like?
Maybe, but might be less error-prone more descriptive with different
macros for each type, e.g. INT4HASHSET_GET_VALUES,
similar to the existing PG_GETARG_INT4HASHSET
Curious to hear what everybody thinks about the interface, documentation,
semantics and implementation?
Is there anything missing or something that you think should be changed/improved?
/Joel
Has anyone put this in a git repo / extension package or similar ?
I’d like to try it out outside the core pg tree.
Show quoted text
On 1 Jul 2023, at 12:04 PM, Joel Jacobson <joel@compiler.org> wrote:
On Fri, Jun 30, 2023, at 06:50, jian he wrote:
more like a C questions
in this context does
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
CEIL_DIV((set)->capacity, 8)))
define first, then define struct int4hashset_t. Is this normally ok?Yes, it's fine. Macros are just text substitutions done pre-compilation.
Also does
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
CEIL_DIV((set)->capacity, 8)))remove (int32 *) will make it generic? then when you use it, you can
cast whatever type you like?Maybe, but might be less error-prone more descriptive with different
macros for each type, e.g. INT4HASHSET_GET_VALUES,
similar to the existing PG_GETARG_INT4HASHSETCurious to hear what everybody thinks about the interface, documentation,
semantics and implementation?Is there anything missing or something that you think should be changed/improved?
/Joel
https://github.com/tvondra/hashset
On Mon, Aug 14, 2023 at 11:23 PM Florents Tselai
<florents.tselai@gmail.com> wrote:
Has anyone put this in a git repo / extension package or similar ?
I’d like to try it out outside the core pg tree.
On 1 Jul 2023, at 12:04 PM, Joel Jacobson <joel@compiler.org> wrote:
On Fri, Jun 30, 2023, at 06:50, jian he wrote:
more like a C questions
in this context does
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
CEIL_DIV((set)->capacity, 8)))
define first, then define struct int4hashset_t. Is this normally ok?Yes, it's fine. Macros are just text substitutions done pre-compilation.
Also does
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
CEIL_DIV((set)->capacity, 8)))remove (int32 *) will make it generic? then when you use it, you can
cast whatever type you like?Maybe, but might be less error-prone more descriptive with different
macros for each type, e.g. INT4HASHSET_GET_VALUES,
similar to the existing PG_GETARG_INT4HASHSETCurious to hear what everybody thinks about the interface, documentation,
semantics and implementation?Is there anything missing or something that you think should be changed/improved?
/Joel
--
I recommend David Deutsch's <<The Beginning of Infinity>>
Jian