Fix "could not find memoization table entry"
Hi,
Recently, I encountered an error: could not find memoization table
entry on TPCH(S=1) test.
The query was as follows:
psql (19devel)
Type "help" for help.
postgres=# SELECT
FROM partsupp AS ref_0,
LATERAL (SELECT ref_0.ps_suppkey AS c3,
(SELECT n_regionkey
FROM nation
LIMIT 1) AS c4
LIMIT ALL) AS subq_0
WHERE hash_numeric(ref_0.ps_supplycost) = subq_0.c4;
ERROR: could not find memoization table entry
postgres=# explain SELECT
FROM partsupp AS ref_0,
LATERAL (SELECT ref_0.ps_suppkey AS c3,
(SELECT n_regionkey
FROM nation
LIMIT 1) AS c4
LIMIT ALL) AS subq_0
WHERE hash_numeric(ref_0.ps_supplycost) = subq_0.c4;
QUERY PLAN
-----------------------------------------------------------------------------------------
Nested Loop (cost=0.06..57942.04 rows=4000 width=0)
-> Seq Scan on partsupp ref_0 (cost=0.00..25560.00 rows=800000 width=10)
-> Memoize (cost=0.06..0.09 rows=1 width=4)
Cache Key: hash_numeric(ref_0.ps_supplycost), ref_0.ps_suppkey
Cache Mode: binary
Estimates: capacity=80659 distinct keys=88915 lookups=800000
hit percent=80.63%
-> Subquery Scan on subq_0 (cost=0.05..0.08 rows=1 width=4)
Filter: (hash_numeric(ref_0.ps_supplycost) = subq_0.c4)
-> Result (cost=0.05..0.06 rows=1 width=8)
InitPlan expr_1
-> Limit (cost=0.00..0.05 rows=1 width=4)
-> Seq Scan on nation (cost=0.00..1.25
rows=25 width=4)
(12 rows)
The hash_numeric result type is int. If I forced binary_mode to
logical, there was no error anymore.
So I think this may be a bug.
How to easily reproduce:
1. prepare tpch(s=1) data
2. Using gdb to set the mstate->mem_limit to 170 after the first tuple
was put into the cache in func cache_lookup()
3. When putting the second tuple into the cache, the mem_used will
exceed the mem_limit, so
calling the cache_reduce_memory() to remove the first tuple. But it
cannot find the first tuple in the hash table.
I did some research about this issue. When we insert the first tuple
into the cache, the first column(hash_numeric(ref_0.ps_supplycost))
value is:
(gdb) p /x pslot->tts_values[i]
$2 = 0xaf27c7c7
(gdb) p hkey
$2 = 38469220
But in the cache_reduce_memory(), its value is like this:
(gdb) p /x pslot->tts_values[i]
$5 = 0xffffffffaf27c7c7
(gdb) p hkey
$7 = 288723292
The hkeys returned by datum_image_hash() are different, so we can't
find the entry in the hash table.
In the datum_image_hash(), if typByVal is true, calling
result = hash_bytes((unsigned char *) &value, sizeof(Datum));
I think we should use typLen here, not sizeof(Datum).
I tried this way and didn't encounter any errors again.
I added David to the cc list. He may know more about this module.
--
Thanks,
Tender Wang
Attachments:
0001-Fix-could-not-find-memoization-table-entry.patchapplication/octet-stream; name=0001-Fix-could-not-find-memoization-table-entry.patchDownload+1-2
On Mon, 23 Mar 2026 at 19:30, Tender Wang <tndrwang@gmail.com> wrote:
Recently, I encountered an error: could not find memoization table
The hkeys returned by datum_image_hash() are different, so we can't
find the entry in the hash table.In the datum_image_hash(), if typByVal is true, calling
result = hash_bytes((unsigned char *) &value, sizeof(Datum));I think we should use typLen here, not sizeof(Datum).
I tried this way and didn't encounter any errors again.
The Datum values should be the same. You can't just compare the lowest
attlen bytes of a Datum.
It looks to me like the bug is in hash_numeric(). Seems like it has no
idea what type it's meant to return. hash_numeric_extended() doesn't
seem to be much better.
Do you still get the ERROR after patching with the attached?
David
Attachments:
numeric.patchapplication/octet-stream; name=numeric.patchDownload+3-3
David Rowley <dgrowleyml@gmail.com> 于2026年3月24日周二 08:12写道:
On Mon, 23 Mar 2026 at 19:30, Tender Wang <tndrwang@gmail.com> wrote:
Recently, I encountered an error: could not find memoization table
The hkeys returned by datum_image_hash() are different, so we can't
find the entry in the hash table.In the datum_image_hash(), if typByVal is true, calling
result = hash_bytes((unsigned char *) &value, sizeof(Datum));I think we should use typLen here, not sizeof(Datum).
I tried this way and didn't encounter any errors again.The Datum values should be the same. You can't just compare the lowest
attlen bytes of a Datum.
Thanks for pointing this out.
Actually, I haven't quite figured out why `typLen` cannot be used here.
It looks to me like the bug is in hash_numeric(). Seems like it has no
idea what type it's meant to return. hash_numeric_extended() doesn't
seem to be much better.Do you still get the ERROR after patching with the attached?
No error anymore with your patch.
--
Thanks,
Tender Wang
On Tue, 24 Mar 2026 at 14:10, Tender Wang <tndrwang@gmail.com> wrote:
Actually, I haven't quite figured out why `typLen` cannot be used here.
This is because we sign-extend rather than zero-extend up to Datum.
Consider this code from master:
typedef uint64_t Datum;
static inline Datum
Int32GetDatum(int32 X)
{
return (Datum) X;
}
So we cast to uint64_t when converting int32 to Datum.
And consider the output of this C program:
drowley@amd3990x:~$ cat datum.c
#include <stdio.h>
#include <stdint.h>
int main(void)
{
int32_t i32 = -1;
uint64_t i64 = (uint64_t) i32;
for (int i = 0; i < 64; i++)
{
putchar(i64 & 0x8000000000000000ULL ? '1' : '0');
i64 <<= 1;
}
putchar('\n');
return 0;
}
drowley@amd3990x:~$ gcc datum.c -o datum && ./datum
1111111111111111111111111111111111111111111111111111111111111111
If you only look at the lower 32 bits of that Datum, then you're not
looking at the entire value.
I don't have hardware to try it, but I also don't suppose comparing
the first attlen bytes of a Datum does the same thing on big-endian
machines as that would look at the most significant side of the Datum
rather than the least significant side as it would on little-endian.
David
David Rowley <dgrowleyml@gmail.com> 于2026年3月24日周二 10:39写道:
On Tue, 24 Mar 2026 at 14:10, Tender Wang <tndrwang@gmail.com> wrote:
Actually, I haven't quite figured out why `typLen` cannot be used here.
This is because we sign-extend rather than zero-extend up to Datum.
Consider this code from master:typedef uint64_t Datum;
static inline Datum
Int32GetDatum(int32 X)
{
return (Datum) X;
}So we cast to uint64_t when converting int32 to Datum.
And consider the output of this C program:
drowley@amd3990x:~$ cat datum.c
#include <stdio.h>
#include <stdint.h>int main(void)
{
int32_t i32 = -1;
uint64_t i64 = (uint64_t) i32;for (int i = 0; i < 64; i++)
{
putchar(i64 & 0x8000000000000000ULL ? '1' : '0');
i64 <<= 1;
}
putchar('\n');
return 0;
}drowley@amd3990x:~$ gcc datum.c -o datum && ./datum
1111111111111111111111111111111111111111111111111111111111111111If you only look at the lower 32 bits of that Datum, then you're not
looking at the entire value.
Got it. Thanks for your explanation.
I don't have hardware to try it, but I also don't suppose comparing
the first attlen bytes of a Datum does the same thing on big-endian
machines as that would look at the most significant side of the Datum
rather than the least significant side as it would on little-endian.
I find comments in datumIsEqual() when typByVal is true:
* just compare the two datums. NOTE: just comparing "len" bytes will
* not do the work, because we do not know how these bytes are aligned
* inside the "Datum". We assume instead that any given datatype is
* consistent about how it fills extraneous bits in the Datum.
Using sizeof(Datum) in datum_image_hash() is totally correct.
--
Thanks,
Tender Wang
On Tue, 24 Mar 2026 at 13:12, David Rowley <dgrowleyml@gmail.com> wrote:
It looks to me like the bug is in hash_numeric(). Seems like it has no
idea what type it's meant to return. hash_numeric_extended() doesn't
seem to be much better.
I recreated this locally and spent some time debugging to understand
the issue more clearly. Because hash_numeric() isn't being careful to
convert the Datum return value into int32, in cases where the Datum is
a number above 2^31, (which would cause an int32 to wrap to negative),
that isn't happening when the expression evaluation code is run and
the Memoize hash table is populated. During the eviction code, when
we re-lookup the hash table with the stored key later, we deform the
MinimalTuple for the Memoize key, because the Datum has been formed
into a MinimalTuple, it's been forced into the correct 32-bit
representation via store_att_byval()'s DatumGetInt32() call. The
values below are examples of what we're dealing with:
1) 3041168208 (0xB5448B50)
2) 18446744072455752528 (0xFFFFFFFFB5448B50)
#1 is the number that the expression evaluation code gets from
hash_numeric(). This is the value that's used to choose the hashtable
bucket. #2 is what we probe the hash table for after reading the
Memoize key from the MinimalTuple, which has wrapped to negative
correctly from being stored in the MinimalTuple via store_att_byval().
If we fix hash_numeric() so it casts its value to int32, it will
return #2, the sign-extended representation.
Obviously, we don't want to back-patch anything that would cause a
user-visible change in the return value of hash_numeric(), so I've
been experimenting to see if there's any way to get PostgreSQL to
output any value from hash_numeric() larger than 2^31 and I've been
unable to. I tried:
Experiment a:
create table bigint (a bigint);
insert into bigint select hash_numeric(n) from
(values('1234.124'::numeric),('1234.124'::numeric)) n(n);
In this case, the hash_numeric() value goes through int48() to cast it
to the bigint before storing it in the table. int48() does
PG_GETARG_INT32(0), which fixes the issue.
Experiment b:
I tried just getting PostgreSQL to output the data:
select hash_numeric(n) from
(values('1234.124'::numeric),('1234.124'::numeric)) n(n);
The larger than 2^31 makes it all the way to int4out() in this case,
but it gets fixed by int4out's PG_GETARG_INT32(0) call.
My 2 experiments aren't exactly exhaustive, so they've only slightly
reduced my concern level.
Does anyone else have any opinions on this one?
I'm also wondering about other places where we're using the wrong
return macro in functions. Maybe there's something we can do in debug
builds and put the return type in the fcinfo and add some Asserts to
the PG_RETURN_* macros.
David
On Wed, 25 Mar 2026 at 13:31, David Rowley <dgrowleyml@gmail.com> wrote:
Obviously, we don't want to back-patch anything that would cause a
user-visible change in the return value of hash_numeric(), so I've
been experimenting to see if there's any way to get PostgreSQL to
output any value from hash_numeric() larger than 2^31 and I've been
unable to. I tried:
I was experimenting with a less risky fix by having the datum_image
functions force the sign-extended representation of the Datum before
hashing or comparing.
Attached is a quick PoC of what that would look like. It does fix the
reported problem. But it is a hack and doesn't fix the root cause of
the issue.
Despite the hackiness, I feel this might be better than the
whack-a-mole approach of just fixing incorrect usages of the
PG_RETURN_* macros as and when we find them.
David
Attachments:
hack_to_fix_datum_image_problem.patchapplication/octet-stream; name=hack_to_fix_datum_image_problem.patchDownload+42-1
David Rowley <dgrowleyml@gmail.com> 于2026年3月25日周三 10:09写道:
On Wed, 25 Mar 2026 at 13:31, David Rowley <dgrowleyml@gmail.com> wrote:
Obviously, we don't want to back-patch anything that would cause a
user-visible change in the return value of hash_numeric(), so I've
been experimenting to see if there's any way to get PostgreSQL to
output any value from hash_numeric() larger than 2^31 and I've been
unable to. I tried:I was experimenting with a less risky fix by having the datum_image
functions force the sign-extended representation of the Datum before
hashing or comparing.Attached is a quick PoC of what that would look like. It does fix the
reported problem. But it is a hack and doesn't fix the root cause of
the issue.Despite the hackiness, I feel this might be better than the
whack-a-mole approach of just fixing incorrect usages of the
PG_RETURN_* macros as and when we find them.
No objection from me.
It seems no users have complained about hash_numberic(), and except
for this reported issue, no internal errors have been reported due to
hash_numberic().
--
Thanks,
Tender Wang
On Wed, 25 Mar 2026 at 22:00, Tender Wang <tndrwang@gmail.com> wrote:
David Rowley <dgrowleyml@gmail.com> 于2026年3月25日周三 10:09写道:
On Wed, 25 Mar 2026 at 13:31, David Rowley <dgrowleyml@gmail.com> wrote:
I was experimenting with a less risky fix by having the datum_image
functions force the sign-extended representation of the Datum before
hashing or comparing.Attached is a quick PoC of what that would look like. It does fix the
reported problem. But it is a hack and doesn't fix the root cause of
the issue.Despite the hackiness, I feel this might be better than the
whack-a-mole approach of just fixing incorrect usages of the
PG_RETURN_* macros as and when we find them.No objection from me.
It seems no users have complained about hash_numberic(), and except
for this reported issue, no internal errors have been reported due to
hash_numberic().
I cleaned that method up and pushed it.
Thanks for the report.
David