Safer hash table initialization macro
Hi hackers,
Currently to create a hash table we do things like:
A) create a struct, say:
typedef struct SeenRelsEntry
{
	Oid			rel_id;
	int			list_index;
} SeenRelsEntry;
where the first member is the hash key, and then later:
B)
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(SeenRelsEntry);
ctl.hcxt = CurrentMemoryContext;
seen_rels = hash_create("find_all_inheritors temporary table",
                        32, /* start small and extend */
                        &ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
I can see 2 possible issues:
1)
We manually specify the type for keysize, which could be incorrect from the
start, or become incorrect later if the key member's type changes.
2)
It may be possible to remove the key member without the compiler noticing it.
Take this example, where the key member is removed:
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 929bb53b620..eb11976afef 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -36,7 +36,6 @@
*/
typedef struct SeenRelsEntry
{
- Oid rel_id; /* relation oid */
int list_index; /* its position in output list(s) */
} SeenRelsEntry;
That would compile without any issues because this rel_id member is not
referenced in the code (for this particular example). That's rare but possible.
But then, on my machine, during make check:
TRAP: failed Assert("!found"), File: "nodeModifyTable.c", Line: 5157, PID: 140430
The reason is that the key is only ever accessed through byte-level
operations (within the hash-related macros), so it's easy to think that this
member is unused (because it is never referenced by name in the code).
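Conceptually, dynahash only ever treats the first keysize bytes of the entry as
the key (a simplified sketch below, not the actual dynahash code), so the
compiler never sees a named access to rel_id:
"
static bool
key_matches(const void *entry, const void *key, Size keysize)
{
	/* the key is just the first keysize bytes of the entry */
	return memcmp(entry, key, keysize) == 0;
}
"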
I'm thinking about what kind of safety we could put in place to better deal with
1) and 2).
What about adding a macro that:
- requests the key member name
- ensures that it is at offset 0
- computes the key size based on the member
Something like:
"
#define HASH_ELEM_INIT(ctl, entrytype, keymember) \
do { \
	StaticAssertStmt(offsetof(entrytype, keymember) == 0, \
					 #keymember " must be first member in " #entrytype); \
	(ctl).keysize = sizeof(((entrytype *) 0)->keymember); \
	(ctl).entrysize = sizeof(entrytype); \
} while (0)
"
That way:
- The key member is explicitly referenced in the code (preventing "unused"
false positives)
- The key size is automatically computed from the actual member type (preventing
type mismatches)
- We enforce that the key is at offset 0
An additional benefit: it avoids repeating the "keysize =" followed by "entrysize ="
in a lot of places in the code (currently about 100 times).
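To make the intended usage concrete, the find_all_inheritors example above
would then become something like this (a sketch assuming the proposed
HASH_ELEM_INIT macro):
"
HASH_ELEM_INIT(ctl, SeenRelsEntry, rel_id);	/* sets keysize and entrysize */
ctl.hcxt = CurrentMemoryContext;

seen_rels = hash_create("find_all_inheritors temporary table",
                        32, /* start small and extend */
                        &ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
"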
If that sounds like a good idea, I could work on a patch doing so.
Thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, 1 Dec 2025 at 14:45, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Thoughts?
I think the hashtable creation API in postgres is so terrible that it
actively discourages usage. At Citus we definitely had the problem
that we would use Lists for cases where a hashtable was preferable
perf wise, just because the API was so terrible.
That's why I eventually implemented some wrapper helper functions to
make it less verbose and error-prone to use in by far the most common
patterns we had [1].
[1] https://github.com/citusdata/citus/blob/ae2eb65be082d52db646b68a644474e24bc6cea1/src/include/distributed/hash_helpers.h#L74
So I'm definitely in favor of improving this API (probably by adding a
few new functions). I have a few main thoughts on what could be
improved:
1. Automatically determine keysize and entrysize given a keymember and
entrytype (like you suggested).
2. Autodetect most of the flags.
a. HASH_PARTITION, HASH_SEGMENT, HASH_FUNCTION, HASH_DIRSIZE,
HASH_COMPARE, HASH_KEYCOPY, HASH_ALLOC can all simply be detected
based on the relevant fields from HASHCTL. Passing them in explicitly
is just duplication that causes code noise and is easy to forget
accidentally.
b. HASH_ELEM is useless noise because it is required
c. HASH_BLOBS could be the default (and maybe use HASH_STRINGS by
default if the keytype is char*)
3. Default to CurrentMemoryContext instead of TopMemoryContext, like
all of the other functions that allocate, because right now it's way
too easy to accidentally use TopMemoryContext when you did not intend
to.
4. Have a creation function that doesn't require a HASHCTL at all (just
takes entrytype and keymember, and maybe a version of this that takes a
memory context).
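To make 2-4 a bit more concrete, here is a rough sketch of the kind of helper
I have in mind (the names here are purely illustrative, not a final API):

/*
 * Illustrative sketch only: flags are derived by the wrapper, the context
 * is explicit, and there is no HASHCTL at the call site.
 */
static inline HTAB *
hash_create_in_context(const char *name, long nelem,
                       Size keysize, Size entrysize, MemoryContext cxt)
{
    HASHCTL     ctl = {0};

    ctl.keysize = keysize;
    ctl.entrysize = entrysize;
    ctl.hcxt = cxt;

    /* HASH_ELEM is always required; HASH_BLOBS is the common default */
    return hash_create(name, nelem, &ctl,
                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
}

/* ... and a keymember-based macro on top, per points 1 and 4, defaulting to
 * CurrentMemoryContext per point 3. */
#define hash_create_simple(name, nelem, entrytype, keymember) \
    hash_create_in_context((name), (nelem), \
                           sizeof(((entrytype *) 0)->keymember), \
                           sizeof(entrytype), \
                           CurrentMemoryContext)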
Hi,
On Mon, Dec 01, 2025 at 03:44:41PM +0100, Jelte Fennema-Nio wrote:
On Mon, 1 Dec 2025 at 14:45, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Thoughts?
I think the hashtable creation API in postgres is so terrible that it
actively discourages usage.
Thanks for sharing your thoughts!
So I'm definitely in favor of improving this API (probably by adding a
few new functions). I have a few main thoughts on what could be
improved:
1. Automatically determine keysize and entrysize given a keymember and
entrytype (like you suggested).
PFA a patch introducing and using the new macro. Note that it also introduces
HASH_ELEM_INIT_FULL for the rare cases where the whole struct is the
key.
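Roughly, that variant looks like this (a sketch; the exact definition is in the
attached patch):
"
#define HASH_ELEM_INIT_FULL(ctl, entrytype) \
do { \
	(ctl).keysize = sizeof(entrytype); \
	(ctl).entrysize = sizeof(entrytype); \
} while (0)
"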
Also one case remains untouched:
$ git grep "entrysize = sizeof" "*.c"
src/backend/replication/logical/relation.c: ctl.entrysize = sizeof(LogicalRepRelMapEntry);
That's because the key is a member of a nested struct, so the new
macro cannot be used. As there is only one occurrence, I think we can keep
it as it is. But we could create a dedicated macro for those cases if we feel
the need. Now that I'm writing this, that might be a better idea: that way we'd
avoid any "entrysize/keysize = " in the .c files.
Also a nice side effect of using the macros:
138 insertions(+), 203 deletions(-)
2. Autodetect most of the flags.
a. HASH_PARTITION, HASH_SEGMENT, HASH_FUNCTION, HASH_DIRSIZE,
HASH_COMPARE, HASH_KEYCOPY, HASH_ALLOC can all simply be detected
based on the relevant fields from HASHCTL. Passing them in explicitly
is just duplication that causes code noise and is easy to forget
accidentally.
b. HASH_ELEM is useless noise because it is required
c. HASH_BLOBS could be the default (and maybe use HASH_STRINGS by
default if the keytype is char*)
3. Default to CurrentMemoryContext instead of TopMemoryContext, like
all of the other functions that allocate, because right now it's way
too easy to accidentally use TopMemoryContext when you did not intend
to.
4. Have a creation function that doesn't require a HASHCTL at all (just
takes entrytype and keymember, and maybe a version of this that takes a
memory context).
Thanks for the above suggestions! I did not think about this as deeply as you did
during your Citus time, but will think about those too. I suggest we move forward one
step at a time, first step being the new macros. Does that make sense to you?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
On Wed Dec 3, 2025 at 12:17 PM CET, Bertrand Drouvot wrote:
I suggest we move forward one
step at a time, first step being the new macros. Does that make sense to you?
Normally I would agree, but in this case I think the new macros you're
proposing would become obsolete once we have the better hash table
creation functions I have in mind. And if we're going to update all
places where we create hash tables, I'd rather update them to something
really nice than a small improvement.
I couldn't let it go (nerd-sniped). So here's a patchset that adds some
macros that I think are pretty nice. Including a foreach_hash macro.
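As a rough idea of the shape of that macro (a hypothetical definition built on
hash_seq_init()/hash_seq_search(); the real one is in the attached patchset):

/* Hypothetical sketch: the caller supplies a HASH_SEQ_STATUS and a suitably
 * typed entry pointer, and gets each entry in turn. */
#define foreach_hash(entry, status, htab) \
	for (hash_seq_init(&(status), (htab)); \
		 ((entry) = hash_seq_search(&(status))) != NULL; )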
I'm a bit on the fence about the C11 _Generic code to determine whether
we should use HASH_BLOBS or HASH_STRINGS based on the type of the key.
It works really nicely in practice, but I'm worried it's a bit too much
magic. Probably we should at least have an override to allow using
HASH_BLOBS anyway for a char array (in case it's not null terminated).
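The kind of type-based dispatch I mean looks roughly like this (a sketch with
hypothetical names; the real macros are in the attachment):

/* Hypothetical sketch: derive the hashing flag from the key's type.
 * A char * (or decayed char array) key gets HASH_STRINGS, everything else
 * gets HASH_BLOBS. */
#define PG_HASH_KEY_FLAG(keyexpr) \
	_Generic((keyexpr), \
			 char *: HASH_STRINGS, \
			 default: HASH_BLOBS)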
Attachments:
On Fri, Dec 5, 2025 at 4:30 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
I'm a bit on the fence about the C11 _Generic code to determine whether
we should use HASH_BLOBS or HASH_STRINGS based on the type of the key.
It works really nicely in practice, but I'm worried it's a bit too much
magic. Probably we should at least have an override to allow using
HASH_BLOBS anyway for a char array (in case it's not null terminated).
How much of our header stuff is supposed to work from C++ too? I
assume that if this is passing cpluspluscheck then it's only because
those _Generic macros aren't being expanded. I suppose you could
write the typeof-based version you already hinted at, but only use it
for __cplusplus (where typeof exists as decltype).
What I mean is, if something like DuckDB (C++ IIRC) wants to actually
use these APIs with the new interfaces, they won't work if the macros
expand to _Generic (not legal C++), but a typeof/decltype-based
version should work just fine. At the same time, I don't think
we'd want to use typeof just because it is available, or we'd never
test the _Generic version: all 3 C compilers have something we can use
for typeof, it's just not standard C before C23, and we want
PostgreSQL to be a valid C11 program and actually have coverage of
those code branches.
Another consideration is what impact we have on the Rust world, and
potentially other languages used for extensions that call C via FFI
etc, if we start using _Generic in our public interfaces, as opposed
to just using it in smaller ways as part of our internal assertions
and C implementation code that doesn't affect extensions. Of course
people using C APIs have to deal with the complexities of C evolution
too, I'm not saying that's a blocker, I'm just trying to think through
the impact of these sorts of API changes on the ecosystem.
On Fri, 5 Dec 2025 at 02:30, Thomas Munro <thomas.munro@gmail.com> wrote:
How much of our header stuff is supposed to work from C++ too?
I think it's nice if it works, but it doesn't seem the most important.
Especially since C++ has its own hashmaps. And if it really needs to
create a hashmap it's still possible to call the.
I suppose you could
write the typeof-based version you already hinted at, but only use it
for __cplusplus (where typeof exists as decltype).
I tried to figure something out that would work in C++ (with help of
Claude), but I wasn't able to create a version of the macros without
also needing to add:
#ifdef __cplusplus
}
#include <type_traits>
extern "C" {
#endif
It seems quite ugly to escape the extern "C" from the parent like that
and then re-enter it. Overall it doesn't seem worth the hassle to me
to make these macros work in C++.
Another consideration is what impact we have on the Rust world, and
potentially other languages used for extensions that call C via FFI
etc
FFI generally cannot call macros anyway, only actual symbols.
On Sat, Dec 6, 2025 at 3:32 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
On Fri, 5 Dec 2025 at 02:30, Thomas Munro <thomas.munro@gmail.com> wrote:
How much of our header stuff is supposed to work from C++ too?
I think it's nice if it works, but it doesn't seem the most important.
Especially since C++ has its own hashmaps. And if it really needs to
create a hashmap it's still possible to call the.
... C functions without the helper macros. Yeah. That seems OK to me.
I suppose you could
write the typeof-based version you already hinted at, but only use it
for __cplusplus (where typeof exists as decltype).
I tried to figure something out that would work in C++ (with help of
Claude), but I wasn't able to create a version of the macros without
also needing to add:
#ifdef __cplusplus
}
#include <type_traits>
extern "C" {
#endif
It seems quite ugly to escape the extern "C" from the parent like that
and then re-enter it. Overall it doesn't seem worth the hassle to me
to make these macros work in C++.
Yeah. I don't think we want that sort of thing all over the place.
We could eventually come up with a small set of tools in a central
place though, so people can work with this stuff without also knowing
C++ meta-programming voodoo. For example something like (untested, I
didn't think about char[size], just spitballing here...):
(pg_expr_has_type_p(ptr, char *) || pg_expr_has_type_p(ptr, NameData *))
... given the definition I posted recently[1].
I take your point that it's not really important for this case though.
Another consideration is what impact we have on the Rust world, and
potentially other languages used for extensions that call C via FFI
etc
FFI generally cannot call macros anyway, only actual symbols.
Sure, I was just thinking about how such cross-language usage would be
forced to unpick our macrology and call the underlying C functions
without them. Doesn't seem like the end of the world anyway, I was
just thinking out loud about the consequences of this phenomenon in
headers.
[1]: /messages/by-id/CA+hUKGL7trhWiJ4qxpksBztMMTWDyPnP1QN+Lq341V7QL775DA@mail.gmail.com
On Sat Dec 6, 2025 at 1:56 AM CET, Thomas Munro wrote:
On Sat, Dec 6, 2025 at 3:32 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
On Fri, 5 Dec 2025 at 02:30, Thomas Munro <thomas.munro@gmail.com> wrote:
create a hashmap it's still possible to call the.... C functions without the helper macros.
Oops, forgot to finish that sentence.
Yeah. I don't think we want that sort of thing all over the place.
We could eventually come up with a small set of tools in a central
place though, so people can work with this stuff without also knowing
C++ meta-programming voodoo. For example something like (untested, I
didn't think about char[size], just spitballing here...):
(pg_expr_has_type_p(ptr, char *) || pg_expr_has_type_p(ptr, NameData *))
... given the definition I posted recently[1].
Ugh... It would have saved me some time if I'd seen that email before. I
also had no clue that 'extern "C++"' was a thing. Attached is a new
patchset where your proposed macro is used. I also needed a
pg_nullptr_of macro, because the trick that worked in C didn't in C++.
Also, it's probably worth checking out this thread[2], especially
because it overlaps significantly with [1].
Sure, I was just thinking about how such cross-language usage would be
forced to unpick our macrology and call the underlying C functions
without them. Doesn't seem like the end of the world anyway, I was
just thinking out loud about the consequences of this phenomenon in
headers.
It's not great, but yeah that's the situation. Stuff like PG_TRY and
PG_CATCH are especially painful to reimplement. Luckily those things
can usually be re-implemented similarly in the target language, so the
macros only need to be dissected once.
[1]: /messages/by-id/CA+hUKGL7trhWiJ4qxpksBztMMTWDyPnP1QN+Lq341V7QL775DA@mail.gmail.com
[2]: /messages/by-id/CAGECzQR21OnnKiZO_1rLWO0-16kg1JBxnVq-wymYW0-_1cUNtg@mail.gmail.com
Attachments:
Hi,
On Mon, Dec 08, 2025 at 11:53:02AM +0100, Jelte Fennema-Nio wrote:
On Sat Dec 6, 2025 at 1:56 AM CET, Thomas Munro wrote:
On Sat, Dec 6, 2025 at 3:32 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
On Fri, 5 Dec 2025 at 02:30, Thomas Munro <thomas.munro@gmail.com> wrote:
create a hashmap it's still possible to call the.... C functions without the helper macros.
Oops, forgot to finish that sentence.
Thanks for this patch series!
Yeah. I don't think we want that sort of thing all over the place.
We could eventually come up with a small set of tools in a central
place though, so people can work with this stuff without also knowing
C++ meta-programming voodoo. For example something like (untested, I
didn't think about char[size], just spitballing here...):
(pg_expr_has_type_p(ptr, char *) || pg_expr_has_type_p(ptr, NameData *))
... given the definition I posted recently[1].
+#if defined(__cplusplus)
+#define pg_expr_has_type_p(expr, type) (std::is_same<decltype(expr), type>::value)
+#else
+#define pg_expr_has_type_p(expr, type) \
+ _Generic((expr), type: 1, default: 0)
+#endif
What about relying on the existing __builtin_types_compatible_p() instead of
_Generic() here?
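i.e. something along these lines (just a sketch; since
__builtin_types_compatible_p() compares types, the expression would need to go
through __typeof__()):
"
#define pg_expr_has_type_p(expr, type) \
	__builtin_types_compatible_p(__typeof__(expr), type)
"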
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Dec 9, 2025 at 8:27 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
On Mon, Dec 08, 2025 at 11:53:02AM +0100, Jelte Fennema-Nio wrote:
On Sat Dec 6, 2025 at 1:56 AM CET, Thomas Munro wrote:
+#if defined(__cplusplus)
+#define pg_expr_has_type_p(expr, type) (std::is_same<decltype(expr), type>::value)
+#else
+#define pg_expr_has_type_p(expr, type) \
+ _Generic((expr), type: 1, default: 0)
+#endif
What about relying on the existing __builtin_types_compatible_p() instead of
_Generic() here?
If we used standard C/C++ it'd work on MSVC too.
On Tue, 9 Dec 2025 at 10:11, Thomas Munro <thomas.munro@gmail.com> wrote:
What about relying on the existing __builtin_types_compatible_p() instead of
_Generic() here?
If we used standard C/C++ it'd work on MSVC too.
And to be clear, that's important because the result of
pg_expr_has_type_p fundamentally impacts the meaning of the code (i.e.
it determines the hash function). Our existing usage of
__builtin_types_compatible_p only adds some *optional* type checking,
so for that it's not critical that it works on MSVC too.
On Tue, 9 Dec 2025 at 08:27, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Thanks for this patch series!
To clarify: Does that mean +1 from you on the proposed API?
Hi,
On Tue, Dec 09, 2025 at 11:45:11AM +0100, Jelte Fennema-Nio wrote:
On Tue, 9 Dec 2025 at 08:27, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Thanks for this patch series!
To clarify: Does that mean +1 from you on the proposed API?
Yeah, I think the hash table API needs improvement. I did not look at all the
details of your patches, but from what I have seen, I think your API proposal
makes sense.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com