Don't overwrite scan key in systable_beginscan()
When systable_beginscan() and systable_beginscan_ordered() choose an
index scan, they remap the attribute numbers in the passed-in scan keys
to the attribute numbers of the index, and then write those remapped
attribute numbers back into the scan key passed by the caller. This
second part is surprising and gratuitous. It means that a scan key
cannot safely be used more than once (but it might sometimes work,
depending on circumstances). Also, there is no value in providing these
remapped attribute numbers back to the caller, since they can't do
anything with that.
I propose to fix that by making a copy of the scan keys passed by the
caller and make the modifications there.
In order to prove to myself that there are no other cases where
caller-provided scan keys are modified, I went through and
const-qualified all the APIs. This works out correctly. Several levels
down in the stack, the access methods make their own copy of the scan
keys that they store in their scan descriptors, and they use those in
non-const-clean ways, but that's ok, that's their business. As far as
the top-level callers are concerned, they can rely on their scan keys to
be const after this.
I'm not proposing this second patch for committing at this time, since
that would modify the public access method APIs in an incompatible way.
I've made a proposal of a similar nature in [0]/messages/by-id/14c31f4a-0347-0805-dce8-93a9072c05a5@eisentraut.org. At some point, it
might be worth batching these and other changes together and make the
change. I might come back to that later.
[0]: /messages/by-id/14c31f4a-0347-0805-dce8-93a9072c05a5@eisentraut.org
/messages/by-id/14c31f4a-0347-0805-dce8-93a9072c05a5@eisentraut.org
While researching how the scan keys get copied around, I noticed that
the index access methods all use memmove() to make the above-mentioned
copy into their own scan descriptor. This is fine, but memmove() is
usually only used when something special is going on that would prevent
memcpy() from working, which is not the case there. So to avoid the
confusion for future readers, I changed those to memcpy(). I suspect
that this code has been copied between the different index AM over time.
(The nbtree version of this code is literally unchanged since July 1996.)
Attachments:
0001-Don-t-overwrite-scan-key-in-systable_beginscan.patchtext/plain; charset=UTF-8; name=0001-Don-t-overwrite-scan-key-in-systable_beginscan.patchDownload+17-8
0002-Augment-ScanKey-arguments-with-const-qualifiers.patchtext/plain; charset=UTF-8; name=0002-Augment-ScanKey-arguments-with-const-qualifiers.patchDownload+55-56
0003-Replace-gratuitous-memmove-with-memcpy.patchtext/plain; charset=UTF-8; name=0003-Replace-gratuitous-memmove-with-memcpy.patchDownload+9-27
On Thu, Aug 8, 2024 at 2:46 AM Peter Eisentraut <peter@eisentraut.org> wrote:
When systable_beginscan() and systable_beginscan_ordered() choose an
index scan, they remap the attribute numbers in the passed-in scan keys
to the attribute numbers of the index, and then write those remapped
attribute numbers back into the scan key passed by the caller. This
second part is surprising and gratuitous. It means that a scan key
cannot safely be used more than once (but it might sometimes work,
depending on circumstances). Also, there is no value in providing these
remapped attribute numbers back to the caller, since they can't do
anything with that.I propose to fix that by making a copy of the scan keys passed by the
caller and make the modifications there.
This does have the disadvantage of adding more palloc overhead.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Aug 8, 2024 at 2:46 AM Peter Eisentraut <peter@eisentraut.org> wrote:
I propose to fix that by making a copy of the scan keys passed by the
caller and make the modifications there.
This does have the disadvantage of adding more palloc overhead.
It seems hard to believe that one more palloc + memcpy is going to be
noticeable in the context of an index scan.
regards, tom lane
On Thu, Aug 08, 2024 at 08:46:35AM +0200, Peter Eisentraut wrote:
When systable_beginscan() and systable_beginscan_ordered() choose an index
scan, they remap the attribute numbers in the passed-in scan keys to the
attribute numbers of the index, and then write those remapped attribute
numbers back into the scan key passed by the caller. This second part is
surprising and gratuitous. It means that a scan key cannot safely be used
more than once (but it might sometimes work, depending on circumstances).
Also, there is no value in providing these remapped attribute numbers back
to the caller, since they can't do anything with that.I propose to fix that by making a copy of the scan keys passed by the caller
and make the modifications there.
No objection, but this would obsolete at least some of these comments (the
catcache.c ones if nothing else):
$ git grep -in scankey | grep -i copy
src/backend/access/gist/gistscan.c:257: * Copy consistent support function to ScanKey structure instead
src/backend/access/gist/gistscan.c:330: * Copy distance support function to ScanKey structure instead of
src/backend/access/nbtree/nbtutils.c:281: ScanKey arrayKeyData; /* modified copy of scan->keyData */
src/backend/access/nbtree/nbtutils.c:3303: * We copy the appropriate indoption value into the scankey sk_flags
src/backend/access/nbtree/nbtutils.c:3318: * It's a bit ugly to modify the caller's copy of the scankey but in practice
src/backend/access/spgist/spgscan.c:385: /* copy scankeys into local storage */
src/backend/utils/cache/catcache.c:1474: * Ok, need to make a lookup in the relation, copy the scankey and
src/backend/utils/cache/catcache.c:1794: * Ok, need to make a lookup in the relation, copy the scankey and
src/backend/utils/cache/relfilenumbermap.c:192: /* copy scankey to local copy, it will be modified during the scan */
In order to prove to myself that there are no other cases where
caller-provided scan keys are modified, I went through and const-qualified
all the APIs. This works out correctly. Several levels down in the stack,
the access methods make their own copy of the scan keys that they store in
their scan descriptors, and they use those in non-const-clean ways, but
that's ok, that's their business. As far as the top-level callers are
concerned, they can rely on their scan keys to be const after this.
We do still have code mutating IndexScanDescData.keyData, but that's fine. We
must be copying function args to form IndexScanDescData.keyData, or else your
proof patch would have noticed an assignment of const to non-const.
Noah Misch <noah@leadboat.com> writes:
On Thu, Aug 08, 2024 at 08:46:35AM +0200, Peter Eisentraut wrote:
I propose to fix that by making a copy of the scan keys passed by the caller
and make the modifications there.
No objection, but this would obsolete at least some of these comments (the
catcache.c ones if nothing else):
That ties into something I forgot to ask: aren't there any copy steps
or other overhead that we could remove, given this new API constraint?
That would help address Robert's concern.
regards, tom lane
On 09.08.24 06:55, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
On Thu, Aug 08, 2024 at 08:46:35AM +0200, Peter Eisentraut wrote:
I propose to fix that by making a copy of the scan keys passed by the caller
and make the modifications there.No objection, but this would obsolete at least some of these comments (the
catcache.c ones if nothing else):That ties into something I forgot to ask: aren't there any copy steps
or other overhead that we could remove, given this new API constraint?
That would help address Robert's concern.
I added two more patches to the series here.
First (or 0004), some additional cleanup for code that had to workaround
systable_beginscan() overwriting the scan keys, along the lines
suggested by Noah.
Second (or 0005), an alternative to palloc is to make the converted scan
keys a normal local variable. Then it's just a question of whether a
smaller palloc is preferred over an over-allocated local variable. I
think I still prefer the palloc version, but it doesn't matter much
either way I think.
Attachments:
v2-0001-Don-t-overwrite-scan-key-in-systable_beginscan.patchtext/plain; charset=UTF-8; name=v2-0001-Don-t-overwrite-scan-key-in-systable_beginscan.patchDownload+17-8
v2-0002-Augment-ScanKey-arguments-with-const-qualifiers.patchtext/plain; charset=UTF-8; name=v2-0002-Augment-ScanKey-arguments-with-const-qualifiers.patchDownload+55-56
v2-0003-Replace-gratuitous-memmove-with-memcpy.patchtext/plain; charset=UTF-8; name=v2-0003-Replace-gratuitous-memmove-with-memcpy.patchDownload+9-27
v2-0004-Update-some-code-that-handled-systable_beginscan-.patchtext/plain; charset=UTF-8; name=v2-0004-Update-some-code-that-handled-systable_beginscan-.patchDownload+23-27
v2-0005-Alternative-Store-converted-key-in-local-variable.patchtext/plain; charset=UTF-8; name=v2-0005-Alternative-Store-converted-key-in-local-variable.patchDownload+6-11
On 12.08.24 09:44, Peter Eisentraut wrote:
On 09.08.24 06:55, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
On Thu, Aug 08, 2024 at 08:46:35AM +0200, Peter Eisentraut wrote:
I propose to fix that by making a copy of the scan keys passed by
the caller
and make the modifications there.No objection, but this would obsolete at least some of these comments
(the
catcache.c ones if nothing else):That ties into something I forgot to ask: aren't there any copy steps
or other overhead that we could remove, given this new API constraint?
That would help address Robert's concern.I added two more patches to the series here.
First (or 0004), some additional cleanup for code that had to workaround
systable_beginscan() overwriting the scan keys, along the lines
suggested by Noah.Second (or 0005), an alternative to palloc is to make the converted scan
keys a normal local variable. Then it's just a question of whether a
smaller palloc is preferred over an over-allocated local variable. I
think I still prefer the palloc version, but it doesn't matter much
either way I think.
Looks like the discussion here has settled, so I plan to go head with
patches
[PATCH v2 1/5] Don't overwrite scan key in systable_beginscan()
[PATCH v2 3/5] Replace gratuitous memmove() with memcpy()
[PATCH v2 4/5] Update some code that handled systable_beginscan()
overwriting scan key
(folding patch 4 into patch 1)
On Mon, Aug 12, 2024 at 09:44:02AM +0200, Peter Eisentraut wrote:
Second (or 0005), an alternative to palloc is to make the converted scan
keys a normal local variable. Then it's just a question of whether a
smaller palloc is preferred over an over-allocated local variable. I think
I still prefer the palloc version, but it doesn't matter much either way I
think.
Since 811af9786b, the palloc'd idxkey's seem to be leaking/accumulating
throughout the command.
I noticed this on the master branch while running ANALYZE on partitioned
table with 600 attributes, even though only 6 were being analyzed.
LOG: level: 3; BuildRelationExtStatistics: 1239963512 total in 278 blocks; 5082984 free (296 chunks); 1234880528 used
Several indexes are being scanned many thousands of times.
310690 beginscan 2659
310689 beginscan 2662
77672 beginscan 2678
postgres=# SELECT 2659::REGCLASS;
regclass | pg_attribute_relid_attnum_index
postgres=# SELECT 2662::REGCLASS;
regclass | pg_class_oid_index
postgres=# SELECT 2678::REGCLASS;
regclass | pg_index_indrelid_index
From 2f84e4d21ce611a4e6f428f72a18518e22776400 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 8 Aug 2024 08:27:26 +0200
Subject: [PATCH v2 1/5] Don't overwrite scan key in systable_beginscan()
...
Show quoted text
Fix that by making a copy of the scan keys passed by the caller and
make the modifications there.diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c index de751e8e4a3..b5eba549d3e 100644 --- a/src/backend/access/index/genam.c +++ b/src/backend/access/index/genam.c @@ -419,17 +419,22 @@ systable_beginscan(Relation heapRelation, if (irel) { int i; + ScanKey idxkey;- /* Change attribute numbers to be index column numbers. */ + idxkey = palloc_array(ScanKeyData, nkeys); + + /* Convert attribute numbers to be index column numbers. */ for (i = 0; i < nkeys; i++)
On 26.11.24 14:56, Justin Pryzby wrote:
Since 811af9786b, the palloc'd idxkey's seem to be leaking/accumulating
throughout the command.I noticed this on the master branch while running ANALYZE on partitioned
table with 600 attributes, even though only 6 were being analyzed.LOG: level: 3; BuildRelationExtStatistics: 1239963512 total in 278 blocks; 5082984 free (296 chunks); 1234880528 used
Several indexes are being scanned many thousands of times.
Hmm, this patch inserts one additional palloc() call per
systable_beginscan(). So it won't have much of an impact for isolated
calls, but for thousands of scans you get thousands of small chunks of
memory.
Does your test case get better if you insert corresponding pfree() calls?
A higher-level question would be whether there should be better memory
management in the caller. But it would be okay to address it with a
pfree() as well, I think.
On Wed, Nov 27, 2024 at 04:33:25PM +0100, Peter Eisentraut wrote:
On 26.11.24 14:56, Justin Pryzby wrote:
Since 811af9786b, the palloc'd idxkey's seem to be leaking/accumulating
throughout the command.I noticed this on the master branch while running ANALYZE on partitioned
table with 600 attributes, even though only 6 were being analyzed.LOG: level: 3; BuildRelationExtStatistics: 1239963512 total in 278 blocks; 5082984 free (296 chunks); 1234880528 used
Several indexes are being scanned many thousands of times.
Hmm, this patch inserts one additional palloc() call per
systable_beginscan(). So it won't have much of an impact for isolated
calls, but for thousands of scans you get thousands of small chunks of
memory.Does your test case get better if you insert corresponding pfree() calls?
Yes -- I'd already checked.
--
Justin
On 27.11.24 16:35, Justin Pryzby wrote:
On Wed, Nov 27, 2024 at 04:33:25PM +0100, Peter Eisentraut wrote:
On 26.11.24 14:56, Justin Pryzby wrote:
Since 811af9786b, the palloc'd idxkey's seem to be leaking/accumulating
throughout the command.I noticed this on the master branch while running ANALYZE on partitioned
table with 600 attributes, even though only 6 were being analyzed.LOG: level: 3; BuildRelationExtStatistics: 1239963512 total in 278 blocks; 5082984 free (296 chunks); 1234880528 used
Several indexes are being scanned many thousands of times.
Hmm, this patch inserts one additional palloc() call per
systable_beginscan(). So it won't have much of an impact for isolated
calls, but for thousands of scans you get thousands of small chunks of
memory.Does your test case get better if you insert corresponding pfree() calls?
Yes -- I'd already checked.
Ok, I committed a fix that inserts some pfree() calls. Thanks for the
report.