Expanding HOT updates for expression and partial indexes
Attached find a patch that expands the cases where heap-only tuple (HOT) updates are possible without changing the basic semantics of HOT. This is accomplished by examining expression indexes for changes to determine if indexes require updating or not. A similar approach is taken for partial indexes, the predicate is evaluated and, in some cases, HOT updates are allowed. Even with this patch if any index is changed, all indexes are updated. Only in cases where none are modified will this patch allow the HOT path. Previously, an expression index on a modified column would disqualify the update from the HOT path in the heap access manager. This patch is functional, includes new tests, and passes check-world however I’m sure it will require more work after community review.
Why is this important? A growing number of Postgres users work with JSONB data. Indexes on fields within JSONB columns use expressions preventing all updates on data within JSONB columns from the HOT path. This is unnecessary and a major drawback when using Postgres.
This is not a new idea; indeed, this patch grew out of Surjective functional indexes [1]/messages/by-id/4d9928ee-a9e6-15f9-9c82-5981f13ffca6@postgrespro.ru which was applied [2]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c203d6cf8 and then reverted [3]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=05f84605dbeb9cf8279a157234b24bbb706c5256 after it was discovered that it caused a bug and had other quality issues. This patch is a new approach that hopefully address the identified bug and most of the other concerns raised in that and other email threads on the subject. I’m also aware of PHOT [4]/messages/by-id/2ECBBCA0-4D8D-4841-8872-4A5BBDC063D2@amazon.com and WARM [5]/messages/by-id/CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd+buUkJ4iDPw@mail.gmail.com which allow for updating some, but not all indexes while remaining on the HOT update path, this patch does not attempt to accomplish that.
Attached you’ll find two slightly different approaches. The first (v3) patch is slightly less intrusive than the second (v4), but both apply to master (59d6c03956193f622c069a4ab985bade27384ac4). Tests have been added (heap_hot_updates.sql) that exercise this new feature set. Of the two patches I personally prefer v4 as it cleans up the summarizing index logic and removes TU_UpdateIndexes. This opens the door to future improvements by providing a way to pass a bitmap of modified indexes along to be addressed by something similar to the PHOT/WARM logic.
I have a few concerns with the patch, things I’d greatly appreciate your thoughts on:
First, I pass an EState along the update path to enable running the checks in heapam, this works but leaves me feeling as if I violated separation of concerns. If there is a better way to do this let me know or if you think the cost of creating one in the execIndexing.c ExecIndexesRequiringUpdates() is okay that’s another possibility.
Second, I’m sure that creating the rd_indexinfolist should be improved/changed and likely cached via relcache.c as this is likely part of the performance overhead mentioned below.
Third, there is overhead to this patch, it is no longer a single simple bitmap test to choose HOT or not in heap_update(). Sometimes this patch will perform expensive additional checks and ultimately not go down the HOT path, new overhead with no benefit. Some expressions are more expensive than others to evaluate, there is no logic to adjust for that. The Surjective patch/email thread had quite a bit of discussion on this without resolution. I’ve chosen to add a GUC that optionally avoids the expression evaluation. I’m open to ideas here as well, addition of another GUC or removal of the one I’ve added. I’ve tried to avoid rechecking indexes for changes when possible.
Fourth, I’d like to know which version the community prefers (v3 or v4). I think v4 moves the code in a direction that is cleaner overall, but you may disagree. I realize that the way I use the modified_indexes bitmapset is a tad overloaded (NULL means all indexes should be updated, otherwise only update the indexes in the set which may be all/some/none of the indexes) and that may violate the principal of least surprise but I feel that it is better than the TU_UpdateIndexes enum in the code today.
I’ve run two performance tests against this; a very synthetic workload that updates only non-indexed fields in a JSONB document that is small enough not to be TOASTed, and one that measures TPC-C-like workload using a MongoDB API mapped into JSONB.
The synthetic workload randomly updated non-indexed fields within the JSONB documents as fast as possible from 50 client connections. The test started pre-loaded with 10,000,000 documents within a single column with 50 expression indexes (BTREE) into fields in those documents. This test showed a dramatic increase in throughput (28-110%), reduction in per-operation latency (22-52%), and lower storage requirements all while CPU remained within 1% of the unpatched server. The tests ran for 2hrs during which time we observed about a 20-30% reduction in IOPs. On the unpatched server there were no HOT updates, on the patched server with default fillfactor HOT updates made up at least 30% and at times much more as the random-access pattern and pruneheap opened space on pages for HOT updates.
The TPC-C-like workload [7]https://github.com/mongodb-labs/py-tpcc ran [8]python3 tpcc.py —no-load —duration 3600 —warehouses 2000 —clients 50 —stop-on-error —config gpure.config mongodb first with —no-execute then [8]python3 tpcc.py —no-load —duration 3600 —warehouses 2000 —clients 50 —stop-on-error —config gpure.config mongodb with —no-load. This test showed HOT updates for CUSTOMER (7%), DISTRICT (48%), and WAREHOUSE (99%) but zero for STOCK and ORDERS. When compared to a non-patched server the performance with this patch was 7.8% slower than without. This was clearly not the result I expected. I believe that the lower performance may in part be due to how I build and maintain the rd_indexinfolist and the overhead of executing expressions on indexes repeatedly only to find that the update still doesn’t qualify for the HOT path. I’d be very happy to hear thoughts on how I might reduce this gap if you have suggestions.
I hope to further develop this patch into a final form acceptable to this community.
best regards,
-greg
[1]: /messages/by-id/4d9928ee-a9e6-15f9-9c82-5981f13ffca6@postgrespro.ru
[2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c203d6cf8
[3]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=05f84605dbeb9cf8279a157234b24bbb706c5256
[4]: /messages/by-id/2ECBBCA0-4D8D-4841-8872-4A5BBDC063D2@amazon.com
[5]: /messages/by-id/CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd+buUkJ4iDPw@mail.gmail.com
[6]: /messages/by-id/CABOikdMNy6yowA+wTGK9RVd8iw+CzqHeQSGpW7Yka_4RSZ_LOQ@mail.gmail.com
[7]: https://github.com/mongodb-labs/py-tpcc
[8]: python3 tpcc.py —no-load —duration 3600 —warehouses 2000 —clients 50 —stop-on-error —config gpure.config mongodb
Attachments:
v3-0001-Expand-HOT-update-path-to-include-expression-and-.patchapplication/octet-stream; name=v3-0001-Expand-HOT-update-path-to-include-expression-and-.patchDownload+867-25
v4-0001-Expand-HOT-update-path-to-include-expression-and-.patchapplication/octet-stream; name=v4-0001-Expand-HOT-update-path-to-include-expression-and-.patchDownload+921-176
On Thu, 2025-02-06 at 22:24 +0000, Burd, Greg wrote:
Attached find a patch that expands the cases where heap-only tuple (HOT) updates are possible
without changing the basic semantics of HOT. This is accomplished by examining expression
indexes for changes to determine if indexes require updating or not. A similar approach is
taken for partial indexes, the predicate is evaluated and, in some cases, HOT updates are
allowed.[...]
Third, there is overhead to this patch, it is no longer a single simple bitmap test to choose
HOT or not in heap_update(). Sometimes this patch will perform expensive additional checks
and ultimately not go down the HOT path, new overhead with no benefit. Some expressions are
more expensive than others to evaluate, there is no logic to adjust for that. The Surjective
patch/email thread had quite a bit of discussion on this without resolution. I’ve chosen to
add a GUC that optionally avoids the expression evaluation. I’m open to ideas here as well,
addition of another GUC or removal of the one I’ve added. I’ve tried to avoid rechecking
indexes for changes when possible.
I think that the goal of this patch is interesting and desirable.
The greatest concern for me is the performance impact. I think that a switch is warranted,
but I am not sure if it should be a GUC. Wouldn't it be better to have a reloption, so that
this can be configured per table? I am not sure if a global switch is necessary, but I am
not fundamentally against it.
Yours,
Laurenz Albe
On Feb 9, 2025, at 1:14 AM, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
I think that the goal of this patch is interesting and desirable.
Thanks for taking a look at it. Which version did you prefer, v3 or v4?
The greatest concern for me is the performance impact.
Agreed, I’m still looking for ways to minimize it (suggestions welcome).
I think that a switch is warranted, but I am not sure if it should be a GUC.
Wouldn't it be better to have a reloption, so that this can be configured per table?
I can remove the GUC in favor of a reloption, that makes sense to me.
Yours,
Laurenz Albe
best,
-greg
On Thu, 6 Feb 2025 at 23:24, Burd, Greg <gregburd@amazon.com> wrote:
Attached find a patch that expands the cases where heap-only tuple (HOT) updates are possible without changing the basic semantics of HOT. This is accomplished by examining expression indexes for changes to determine if indexes require updating or not. A similar approach is taken for partial indexes, the predicate is evaluated and, in some cases, HOT updates are allowed. Even with this patch if any index is changed, all indexes are updated. Only in cases where none are modified will this patch allow the HOT path.
So, effectively this disables the amsummarizing-based optimizations of
https://postgr.es/c/19d8e2308 ? That sounds like a bad degradation in
behaviour.
I’m also aware of PHOT [4] and WARM [5] which allow for updating some, but not all indexes while remaining on the HOT update path, this patch does not attempt to accomplish that.
[...] This opens the door to future improvements by providing a way to pass a bitmap of modified indexes along to be addressed by something similar to the PHOT/WARM logic.
<sidetrack>
I have serious doubts about the viability of any proposal working to
implement PHOT/WARM in PostgreSQL, as they seem to have an inherent
nature of fundamentally breaking the TID lifecycle:
We won't be able to clean up dead-to-everyone TIDs that were
PHOT-updated, because some index Y may still rely on it, and we can't
remove the TID from that same index Y because there is still a live
PHOT/WARM tuple later in the chain whose values for that index haven't
changed since that dead-to-everyone tuple, and thus this PHOT/WARM
tuple is the one pointed to by that index.
For HOT, this isn't much of an issue, because there is just one TID
that's impacted (and it only occupies a single LP slot, with
LP_REDIRECT). However, with PHOT/WARM, you'd relatively easily be able
to fill a page with TIDs (or even full tuples) you can't clean up with
VACUUM until the moment a the PHOT/WARM/HOT chain is broken (due to
UPDATE leaving the page or the final entry getting DELETE-d).
Unless we are somehow are able to replace the TIDs in indexes from
"intermediate dead PHOT" to "base TID"/"latest TID" (either of which
is probably also problematic for indexes that expect a TID to appear
exactly once in the index at any point in time) I don't think the
system is viable if we maintain only a single data structure to
contain all dead TIDs. If we had a datastore for dead items per index,
that'd be more likely to work, but it also would significantly
increase the memory overhead of vacuuming tables.
</sidetrack>
I have a few concerns with the patch, things I’d greatly appreciate your thoughts on:
First, I pass an EState along the update path to enable running the checks in heapam, this works but leaves me feeling as if I violated separation of concerns. If there is a better way to do this let me know or if you think the cost of creating one in the execIndexing.c ExecIndexesRequiringUpdates() is okay that’s another possibility.
I think that doesn't have to be bad.
Third, there is overhead to this patch, it is no longer a single simple bitmap test to choose HOT or not in heap_update().
Why can't it mostly be that simple in simple cases?
I mean, it's clear that "updated indexed column's value == non-HOT
update". And that to determine whether an updated *projected* column's
value (i.e., expression index column's value) was actually updated we
need to calculate the previous and current index value, thus execute
the projection twice. But why would we have significant additional
overhead if there are no expression indexes, or when we can know by
bitmap overlap that the only interesting cases are summarizing
indexes?
I would've implemented this with (1) two new bitmaps, one each for
normal and summarizing indexes, each containing which columns are
exclusively used in expression indexes (and which should thus be used
to trigger the (comparatively) expensive recalculation).
Then, I'd maintain a (cached) list of unique projections/expressions
found in indexes, so that 30 indexes on e.g.
((mycolumn::jsonb)->>'metadata') only extend to 1 check for
differences, rather than 30. The "new" output of these expression
evaluations would be stored to be used later as index datums, reducing
the number of per-expression evaluations down to 2 at most, rather
than 2+1 when the index needs an insertion but the expression itself
wasn't updated.
So, it'd be something like (pseudocode):
if (bms_overlap(updated_columns, hotblocking))
/* if columns only indexed through expressions were updated, do
expensive stuff. Otherwise, it's a normal non-HOT update. */
if (bms_subset_compare(updated_columns, hot_expression_columns) in
(BMS_EQUAL, BMS_SUBSET1))
expensive check for expression changes + populate index column data
else
normal_update
else if (bms_overlap(updated_columns, summarizing))
/* same as above for hotblocking, but now summarizing */
if (bms_subset_compare(updated_columns, sum_expression_columns) in
(BMS_EQUAL, BMS_SUBSET1))
expensive check for summarized expression changes + populate
summarized index column data
else
summarizing_update
else
hot_update
Note that it is relatively expensive to do check whether any one index
needs to be updated. It's generally cheaper to do all those checks at
once, where possible; using one or 2 more bitmaps would be sufficient.
Also note that this approach doesn't update specific summarizing
indexes, just all of them or none. I think that "update only
summarizing indexes that were updated" should be a separate patch from
"check if indexed expressions' values changed", potentially in the
patchset, but not as part of the main bulk.
Fourth, I’d like to know which version the community prefers (v3 or v4). I think v4 moves the code in a direction that is cleaner overall, but you may disagree. I realize that the way I use the modified_indexes bitmapset is a tad overloaded (NULL means all indexes should be updated, otherwise only update the indexes in the set which may be all/some/none of the indexes) and that may violate the principal of least surprise but I feel that it is better than the TU_UpdateIndexes enum in the code today.
I would be hesitant to let table AMs decide which indexes to update at
that precision. Note that this API would allow the AM to update only
(say) the PK index and no other indexes, which is not allowed to
happen if index consistentcy is required (which it is).
----->8-----
Do you have any documentation on the approaches used, and the specific
differences between v3 and v4? I don't see much of that in your
initial mail, and the patches themselves also don't show much of that
in their details. I'd like at least some documentation of the new
behaviour in src/backend/access/heap/README.HOT at some point before
this got marked as RFC in the commitfest app, though preferably sooner
rather than later.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Apologies for not being clear, this preserves the current behavior for summarizing indexes allowing for HOT updates while also updating the index. No degradation here that I’m aware of, indeed the tests that ensure that behavior are unchanged and pass.
-greg
Show quoted text
On Feb 10, 2025, at 12:17 PM, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
So, effectively this disables the amsummarizing-based optimizations of
https://postgr.es/c/19d8e2308 ? That sounds like a bad degradation in
behaviour.
On Feb 10, 2025, at 12:17 PM, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
I have a few concerns with the patch, things I’d greatly appreciate your thoughts on:
First, I pass an EState along the update path to enable running the checks in heapam, this works but leaves me feeling as if I violated separation of concerns. If there is a better way to do this let me know or if you think the cost of creating one in the execIndexing.c ExecIndexesRequiringUpdates() is okay that’s another possibility.
I think that doesn't have to be bad.
Meaning that the approach I’ve taken is okay with you?
Third, there is overhead to this patch, it is no longer a single simple bitmap test to choose HOT or not in heap_update().
Why can't it mostly be that simple in simple cases?
It can remain that simple in the cases you mention. In relcache the hot blocking attributes are nearly the same, only the summarizing attributes are removed. The first test then in heap_update() is for overlap with the modified set. When there is none, the update will proceed on the HOT path.
The presence of a summarizing index is determined in ExecIndexesRequiringUpdates() in execIndexing.c, so a slightly longer code path but not much new overhead.
I mean, it's clear that "updated indexed column's value == non-HOT
update". And that to determine whether an updated *projected* column's
value (i.e., expression index column's value) was actually updated we
need to calculate the previous and current index value, thus execute
the projection twice. But why would we have significant additional
overhead if there are no expression indexes, or when we can know by
bitmap overlap that the only interesting cases are summarizing
indexes?
You’re right, there’s not a lot of new overhead in that case except what happens in ExecIndexesRequiringUpdates() to scan over the list of IndexInfo. It is really only when there are many expressions/predicates requiring examination that there is any significant cost to this approach AFAICT (but if you see something please point it out).
I would've implemented this with (1) two new bitmaps, one each for
normal and summarizing indexes, each containing which columns are
exclusively used in expression indexes (and which should thus be used
to trigger the (comparatively) expensive recalculation).
That was one where I started, over time that became harder to work as the bitmaps contain the union of index attributes for the table not per-column. Now there is one bitmap to cover the broadest case and then a function to find the modified set of indexes where each is examined against bitmaps that contain only attributes specific to the index in question. This helped in cases where there were both expression and non-expression indexes on the same attribute.
Then, I'd maintain a (cached) list of unique projections/expressions
found in indexes, so that 30 indexes on e.g.
((mycolumn::jsonb)->>'metadata') only extend to 1 check for
differences, rather than 30.
An optimization to avoid rechecking isn’t a bad idea. I wonder how hard it would be to surface the field (->>’metadata’) from the index expression to track for redundancy, I’ll have to look into that.
The "new" output of these expression
evaluations would be stored to be used later as index datums, reducing
the number of per-expression evaluations down to 2 at most, rather
than 2+1 when the index needs an insertion but the expression itself
wasn't updated.
Not reforming the new index tuples is also an interesting optimization. I wonder how that can be passed from within heapam’s call into a function in execIndexing up into nodeModifiyTable and back down into execIndexing and on to the index access method? I’ll have to think about that, ideas welcome.
So, it'd be something like (pseudocode):
if (bms_overlap(updated_columns, hotblocking))
/* if columns only indexed through expressions were updated, do
expensive stuff. Otherwise, it's a normal non-HOT update. */
if (bms_subset_compare(updated_columns, hot_expression_columns) in
(BMS_EQUAL, BMS_SUBSET1))
expensive check for expression changes + populate index column data
else
normal_update
else if (bms_overlap(updated_columns, summarizing))
/* same as above for hotblocking, but now summarizing */
if (bms_subset_compare(updated_columns, sum_expression_columns) in
(BMS_EQUAL, BMS_SUBSET1))
expensive check for summarized expression changes + populate
summarized index column data
else
summarizing_update
else
hot_updateNote that it is relatively expensive to do check whether any one index
needs to be updated. It's generally cheaper to do all those checks at
once, where possible; using one or 2 more bitmaps would be sufficient.Also note that this approach doesn't update specific summarizing
indexes, just all of them or none. I think that "update only
summarizing indexes that were updated" should be a separate patch from
"check if indexed expressions' values changed", potentially in the
patchset, but not as part of the main bulk.Fourth, I’d like to know which version the community prefers (v3 or v4). I think v4 moves the code in a direction that is cleaner overall, but you may disagree. I realize that the way I use the modified_indexes bitmapset is a tad overloaded (NULL means all indexes should be updated, otherwise only update the indexes in the set which may be all/some/none of the indexes) and that may violate the principal of least surprise but I feel that it is better than the TU_UpdateIndexes enum in the code today.
I would be hesitant to let table AMs decide which indexes to update at
that precision. Note that this API would allow the AM to update only
(say) the PK index and no other indexes, which is not allowed to
happen if index consistentcy is required (which it is).
Interesting, thanks for the feedback. I’ll think on this a bit more and provide more detail with the next update.
----->8-----
Do you have any documentation on the approaches used, and the specific
differences between v3 and v4? I don't see much of that in your
initial mail, and the patches themselves also don't show much of that
in their details. I'd like at least some documentation of the new
behaviour in src/backend/access/heap/README.HOT at some point before
this got marked as RFC in the commitfest app, though preferably sooner
rather than later.
Good point, I should have updated README.HOT with the initial patchset. I’ll jump on that and update ASAP.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
thanks for the thoughtful reply.
-greg
On Mon, Feb 10, 2025 at 06:17:42PM +0100, Matthias van de Meent wrote:
I have serious doubts about the viability of any proposal working to
implement PHOT/WARM in PostgreSQL, as they seem to have an inherent
nature of fundamentally breaking the TID lifecycle:
We won't be able to clean up dead-to-everyone TIDs that were
PHOT-updated, because some index Y may still rely on it, and we can't
remove the TID from that same index Y because there is still a live
PHOT/WARM tuple later in the chain whose values for that index haven't
changed since that dead-to-everyone tuple, and thus this PHOT/WARM
tuple is the one pointed to by that index.
For HOT, this isn't much of an issue, because there is just one TID
that's impacted (and it only occupies a single LP slot, with
LP_REDIRECT). However, with PHOT/WARM, you'd relatively easily be able
to fill a page with TIDs (or even full tuples) you can't clean up with
VACUUM until the moment a the PHOT/WARM/HOT chain is broken (due to
UPDATE leaving the page or the final entry getting DELETE-d).Unless we are somehow are able to replace the TIDs in indexes from
"intermediate dead PHOT" to "base TID"/"latest TID" (either of which
is probably also problematic for indexes that expect a TID to appear
exactly once in the index at any point in time) I don't think the
system is viable if we maintain only a single data structure to
contain all dead TIDs. If we had a datastore for dead items per index,
that'd be more likely to work, but it also would significantly
increase the memory overhead of vacuuming tables.
I share your concerns, but I don't think things are as dire as you suggest.
For example, perhaps we put a limit on how long a PHOT chain can be, or
maybe we try to detect update patterns that don't work well with PHOT.
Another option could be to limit PHOT updates to only when the same set of
indexed columns are updated or when <50% of the indexed columns are
updated. These aren't fully fleshed-out ideas, of course, but I am at
least somewhat optimistic we could find appropriate trade-offs.
--
nathan
On Tue, 11 Feb 2025 at 00:20, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Feb 10, 2025 at 06:17:42PM +0100, Matthias van de Meent wrote:
I have serious doubts about the viability of any proposal working to
implement PHOT/WARM in PostgreSQL, as they seem to have an inherent
nature of fundamentally breaking the TID lifecycle:
[... concerns]I share your concerns, but I don't think things are as dire as you suggest.
For example, perhaps we put a limit on how long a PHOT chain can be, or
maybe we try to detect update patterns that don't work well with PHOT.
Another option could be to limit PHOT updates to only when the same set of
indexed columns are updated or when <50% of the indexed columns are
updated. These aren't fully fleshed-out ideas, of course, but I am at
least somewhat optimistic we could find appropriate trade-offs.
Yes, there are methods which could limit the overhead. But I'm not
sure there are cheap-enough designs which would make PHOT a
universally good choice (i.e. not tunable with guc/table option),
considering its significantly larger un-reclaimable storage overhead
vs HOT.
Kind regards,
Matthias van de Meent.
On Mon, 10 Feb 2025 at 20:11, Burd, Greg <gregburd@amazon.com> wrote:
On Feb 10, 2025, at 12:17 PM, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
I have a few concerns with the patch, things I’d greatly appreciate your thoughts on:
First, I pass an EState along the update path to enable running the checks in heapam, this works but leaves me feeling as if I violated separation of concerns. If there is a better way to do this let me know or if you think the cost of creating one in the execIndexing.c ExecIndexesRequiringUpdates() is okay that’s another possibility.
I think that doesn't have to be bad.
Meaning that the approach I’ve taken is okay with you?
If you mean "passing EState down through the AM so we can check if we
really need HOT updates", then Yes, that's OK. I don't think the basic
idea (executing the projections to check for differences) is bad per
se, however I do think we may need more design work on the exact shape
of how ExecIndexesRequiringUpdates() receives its information (which
includes the EState).
E.g. we could pass an opaquely typed pointer that's passed from the
executor state through the table_update method into this
ExecIndexesRequiringUpdates(). That opaque struct would then contain
the important information for index update state checking, so that the
AM can't realistically break things without bypassing the separation
of concerns, and doesn't have to know about any executor nodes.
Third, there is overhead to this patch, it is no longer a single simple bitmap test to choose HOT or not in heap_update().
Why can't it mostly be that simple in simple cases?
It can remain that simple in the cases you mention. In relcache the hot blocking attributes are nearly the same, only the summarizing attributes are removed. The first test then in heap_update() is for overlap with the modified set. When there is none, the update will proceed on the HOT path.
The presence of a summarizing index is determined in ExecIndexesRequiringUpdates() in execIndexing.c, so a slightly longer code path but not much new overhead.
Yes, but that's only determined at an index-by-index level, rather
than determined all at once, and that's real bad when you have
hundreds of indexes to go through (however unlikely it might be, I've
heard of cases where there are 1000s of indexes on one table). So, I
prefer limiting any O(n_indexes) operations to only the most critical
cases.
I mean, it's clear that "updated indexed column's value == non-HOT
update". And that to determine whether an updated *projected* column's
value (i.e., expression index column's value) was actually updated we
need to calculate the previous and current index value, thus execute
the projection twice. But why would we have significant additional
overhead if there are no expression indexes, or when we can know by
bitmap overlap that the only interesting cases are summarizing
indexes?You’re right, there’s not a lot of new overhead in that case except what happens in ExecIndexesRequiringUpdates() to scan over the list of IndexInfo. It is really only when there are many expressions/predicates requiring examination that there is any significant cost to this approach AFAICT (but if you see something please point it out).
See the attached approach. Evaluation of the expressions only has to
happen if there are any HOT-blocking attributes which are exclusively
hot-blockingly indexed through expressions, so if the updated
attribute numbers are a subset of hotblockingexprattrs. (substitute
hotblocking with summarizing for the summarizing approach)
I would've implemented this with (1) two new bitmaps, one each for
normal and summarizing indexes, each containing which columns are
exclusively used in expression indexes (and which should thus be used
to trigger the (comparatively) expensive recalculation).That was one where I started, over time that became harder to work as the bitmaps contain the union of index attributes for the table not per-column.
I think it's fairly easy to create, though.
Now there is one bitmap to cover the broadest case and then a function to find the modified set of indexes where each is examined against bitmaps that contain only attributes specific to the index in question. This helped in cases where there were both expression and non-expression indexes on the same attribute.
Fair, but do we care about one expression index on (attr1->>'data')'s
value *not* changing when an index on (attr1) exists and attr1 has
changed? That index on att1 would block HOT updates regardless of the
(lack of) changes to the (att1->>'data') index, so doing those
expensive calculations seems quite wasteful.
So, in my opinion, we should also keep track of those attributes only
included in expressions of indexes, and that's fairly easy: see
attached prototype.diff.txt (might need some work, the patch was
drafted on v16's codebase, but the idea is clear).
The resulting %exprattrs bitmap contains attributes that are used only
in expressions of those index types.
The "new" output of these expression
evaluations would be stored to be used later as index datums, reducing
the number of per-expression evaluations down to 2 at most, rather
than 2+1 when the index needs an insertion but the expression itself
wasn't updated.Not reforming the new index tuples is also an interesting optimization. I wonder how that can be passed from within heapam’s call into a function in execIndexing up into nodeModifiyTable and back down into execIndexing and on to the index access method? I’ll have to think about that, ideas welcome.
Note that index tuple forming happens only in the index AM, it's the
Datum construction (i.e. projection from attributes/tuple to indexed
value) that I'd like to deduplicate. Though looking at the current
code, I don't think it's reasonable to have that as a requirement for
this work. It'd be a nice-to-have for sure, but not as requirement.
Do you have any documentation on the approaches used, and the specific
differences between v3 and v4? I don't see much of that in your
initial mail, and the patches themselves also don't show much of that
in their details. I'd like at least some documentation of the new
behaviour in src/backend/access/heap/README.HOT at some point before
this got marked as RFC in the commitfest app, though preferably sooner
rather than later.Good point, I should have updated README.HOT with the initial patchset. I’ll jump on that and update ASAP.
Thanks in advance.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
prototype.diff.txttext/plain; charset=US-ASCII; name=prototype.diff.txtDownload+29-4
On Mon, 10 Feb 2025 at 19:15, Burd, Greg <gregburd@amazon.com> wrote:
Apologies for not being clear, this preserves the current behavior for summarizing indexes allowing for HOT updates while also updating the index. No degradation here that I’m aware of, indeed the tests that ensure that behavior are unchanged and pass.
Looking at the code again, while it does indeed preserve the current
behaviour, it doesn't actually improve the behavior for summarizing
indexes when that would be expected.
Example:
CREATE INDEX hotblocking ON mytab USING btree((att1->'data'));
CREATE INDEX summarizing ON mytab USING BRIN(att2);
UPDATE mytab SET att1 = att1 || '{"check": "mate"}';
In v3 (same code present in v4), I notice that in the above case we
hit the "indexed attribute updated" path (hotblocking indeed indexes
the updated attribute att1), go into ExecIndexesRequiringUpdates, and
mark index 'summarizing' as 'needs an update', even though no
attribute of that index has a new value. Then we notice that
att1->'data' hasn't changed, and so we don't need to update the
'hotblocking' index, but we do update the (unchanged) 'summarizing'
index.
This indicates that in practice (with this version of the patch) this
will improve the HOT applicability situation while summarizing indexes
don't really gain a benefit from this - they're always updated when
any indexed column is updated, even if we could detect that there were
no changes to any indexed values.
Actually, you could say we find ourselves in the counter-intuitive
situation that the addition of the 'hotblocking' index whose value
were not updated now caused index insertions into summarizing indexes.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Matthias,
Thanks for the in-depth review, you are correct and I appreciate you uncovering that oversight with summarizing indexes. I’ll add a test case and modify the logic to prevent updates to unchanged summarizing indexes by testing their attributes against the modified set while keeping the HOT optimization when only summarizing indexes are changed.
thanks for finding this,
-greg
Show quoted text
On Feb 11, 2025, at 4:40 PM, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Mon, 10 Feb 2025 at 19:15, Burd, Greg <gregburd@amazon.com> wrote:
Apologies for not being clear, this preserves the current behavior for summarizing indexes allowing for HOT updates while also updating the index. No degradation here that I’m aware of, indeed the tests that ensure that behavior are unchanged and pass.
Looking at the code again, while it does indeed preserve the current
behaviour, it doesn't actually improve the behavior for summarizing
indexes when that would be expected.Example:
CREATE INDEX hotblocking ON mytab USING btree((att1->'data'));
CREATE INDEX summarizing ON mytab USING BRIN(att2);
UPDATE mytab SET att1 = att1 || '{"check": "mate"}';In v3 (same code present in v4), I notice that in the above case we
hit the "indexed attribute updated" path (hotblocking indeed indexes
the updated attribute att1), go into ExecIndexesRequiringUpdates, and
mark index 'summarizing' as 'needs an update', even though no
attribute of that index has a new value. Then we notice that
att1->'data' hasn't changed, and so we don't need to update the
'hotblocking' index, but we do update the (unchanged) 'summarizing'
index.This indicates that in practice (with this version of the patch) this
will improve the HOT applicability situation while summarizing indexes
don't really gain a benefit from this - they're always updated when
any indexed column is updated, even if we could detect that there were
no changes to any indexed values.Actually, you could say we find ourselves in the counter-intuitive
situation that the addition of the 'hotblocking' index whose value
were not updated now caused index insertions into summarizing indexes.Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attached find an updated patchset v5 that is an evolution of v4.
Changes v4 to v5 are:
* replaced GUC with table reloption called "expression_checks" (open to other name ideas)
* minimal documentation updates to README.HOT to address changes
* avoid, when possible, the expensive path that requires evaluating an estate using bitmaps
* determines the set of summarized indexes requiring updates, only updates those
* more tests in heap_hot_updates.sql (perhaps too many...)
* rebased to master, formatted, and make check-world passes
More comments in context below...
-greg
On Feb 11, 2025, at 4:18 PM, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Mon, 10 Feb 2025 at 20:11, Burd, Greg <gregburd@amazon.com> wrote:
On Feb 10, 2025, at 12:17 PM, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
I have a few concerns with the patch, things I’d greatly appreciate your thoughts on:
First, I pass an EState along the update path to enable running the checks in heapam, this works but leaves me feeling as if I violated separation of concerns. If there is a better way to do this let me know or if you think the cost of creating one in the execIndexing.c ExecIndexesRequiringUpdates() is okay that’s another possibility.
I think that doesn't have to be bad.
Meaning that the approach I’ve taken is okay with you?
If you mean "passing EState down through the AM so we can check if we
really need HOT updates", then Yes, that's OK. I don't think the basic
idea (executing the projections to check for differences) is bad per
se, however I do think we may need more design work on the exact shape
of how ExecIndexesRequiringUpdates() receives its information (which
includes the EState).E.g. we could pass an opaquely typed pointer that's passed from the
executor state through the table_update method into this
ExecIndexesRequiringUpdates(). That opaque struct would then contain
the important information for index update state checking, so that the
AM can't realistically break things without bypassing the separation
of concerns, and doesn't have to know about any executor nodes.
I'm open to this idea and will attempt an implementation in v6, ideas welcome.
Third, there is overhead to this patch, it is no longer a single simple bitmap test to choose HOT or not in heap_update().
Why can't it mostly be that simple in simple cases?
It can remain that simple in the cases you mention. In relcache the hot blocking attributes are nearly the same, only the summarizing attributes are removed. The first test then in heap_update() is for overlap with the modified set. When there is none, the update will proceed on the HOT path.
The presence of a summarizing index is determined in ExecIndexesRequiringUpdates() in execIndexing.c, so a slightly longer code path but not much new overhead.
Yes, but that's only determined at an index-by-index level, rather
than determined all at once, and that's real bad when you have
hundreds of indexes to go through (however unlikely it might be, I've
heard of cases where there are 1000s of indexes on one table). So, I
prefer limiting any O(n_indexes) operations to only the most critical
cases.
This makes sense, and I agree that avoiding O(n_indexes) operations is a good goal when possible.
I mean, it's clear that "updated indexed column's value == non-HOT
update". And that to determine whether an updated *projected* column's
value (i.e., expression index column's value) was actually updated we
need to calculate the previous and current index value, thus execute
the projection twice. But why would we have significant additional
overhead if there are no expression indexes, or when we can know by
bitmap overlap that the only interesting cases are summarizing
indexes?You’re right, there’s not a lot of new overhead in that case except what happens in ExecIndexesRequiringUpdates() to scan over the list of IndexInfo. It is really only when there are many expressions/predicates requiring examination that there is any significant cost to this approach AFAICT (but if you see something please point it out).
See the attached approach. Evaluation of the expressions only has to
happen if there are any HOT-blocking attributes which are exclusively
hot-blockingly indexed through expressions, so if the updated
attribute numbers are a subset of hotblockingexprattrs. (substitute
hotblocking with summarizing for the summarizing approach)
I believe I've incorporated the gist of your idea in this v5 patch, let me know if I missed something.
I would've implemented this with (1) two new bitmaps, one each for
normal and summarizing indexes, each containing which columns are
exclusively used in expression indexes (and which should thus be used
to trigger the (comparatively) expensive recalculation).That was one where I started, over time that became harder to work as the bitmaps contain the union of index attributes for the table not per-column.
I think it's fairly easy to create, though.
Now there is one bitmap to cover the broadest case and then a function to find the modified set of indexes where each is examined against bitmaps that contain only attributes specific to the index in question. This helped in cases where there were both expression and non-expression indexes on the same attribute.
Fair, but do we care about one expression index on (attr1->>'data')'s
value *not* changing when an index on (attr1) exists and attr1 has
changed? That index on att1 would block HOT updates regardless of the
(lack of) changes to the (att1->>'data') index, so doing those
expensive calculations seems quite wasteful.
Agreed, when both a non-expression and an expression index exist on the same attribute then the expression checks are unnecessary and should be avoided. In this v5 patchset this case becomes two checks of bitmaps (first hot_attrs, then exclusively exp_attrs) before proceeding with a non-HOT update.
So, in my opinion, we should also keep track of those attributes only
included in expressions of indexes, and that's fairly easy: see
attached prototype.diff.txt (might need some work, the patch was
drafted on v16's codebase, but the idea is clear).
Thank you for your patch, I've included and expanded it.
The resulting %exprattrs bitmap contains attributes that are used only
in expressions of those index types.The "new" output of these expression
evaluations would be stored to be used later as index datums, reducing
the number of per-expression evaluations down to 2 at most, rather
than 2+1 when the index needs an insertion but the expression itself
wasn't updated.Not reforming the new index tuples is also an interesting optimization. I wonder how that can be passed from within heapam’s call into a function in execIndexing up into nodeModifiyTable and back down into execIndexing and on to the index access method? I’ll have to think about that, ideas welcome.
Note that index tuple forming happens only in the index AM, it's the
Datum construction (i.e. projection from attributes/tuple to indexed
value) that I'd like to deduplicate. Though looking at the current
code, I don't think it's reasonable to have that as a requirement for
this work. It'd be a nice-to-have for sure, but not as requirement.
Agreed that it's a nice-to-have, but not a priority.
Show quoted text
Do you have any documentation on the approaches used, and the specific
differences between v3 and v4? I don't see much of that in your
initial mail, and the patches themselves also don't show much of that
in their details. I'd like at least some documentation of the new
behaviour in src/backend/access/heap/README.HOT at some point before
this got marked as RFC in the commitfest app, though preferably sooner
rather than later.Good point, I should have updated README.HOT with the initial patchset. I’ll jump on that and update ASAP.
Thanks in advance.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
<prototype.diff.txt>
Attachments:
v5-0001-Expand-HOT-update-path-to-include-expression-and-.patchapplication/octet-stream; name=v5-0001-Expand-HOT-update-path-to-include-expression-and-.patchDownload+1550-173
On Thu, 13 Feb 2025 at 19:46, Burd, Greg <gregburd@amazon.com> wrote:
Attached find an updated patchset v5 that is an evolution of v4.
Changes v4 to v5 are:
* replaced GUC with table reloption called "expression_checks" (open to other name ideas)
* minimal documentation updates to README.HOT to address changes
* avoid, when possible, the expensive path that requires evaluating an estate using bitmaps
* determines the set of summarized indexes requiring updates, only updates those
* more tests in heap_hot_updates.sql (perhaps too many...)
* rebased to master, formatted, and make check-world passes
Thank you for the update. Below some comments, in no particular order.
-----
I'm not a fan of how you replaced TU_UpdateIndexes with a bitmap. It
seems unergonomic and a waste of performance.
In HEAD, we don't do any expensive computations for the fast path of
"indexed attribute updated" - at most we do the bitmap compare and
then set a pointer. With this patch, the way we signal that is by
allocating a bitmap of size O(n_indexes). That's potentially quite
expensive (given 1000s of indexes), and definitely more expensive than
only a pointer assignment.
In HEAD, we have a clear indication of which classes of indexes to
update, with TU_UpdateIndexes. With this patch, we have to derive that
from the (lack of) bits in the bitmap that might be output by the
table_update procedure.
I think we can do with an additional parameter for which indexes would
be updated (or store that info in the parameter which also will hold
EState et al). I think it's cheaper that way, too - only when
update_indexes could be TU_SUMMARIZING we might need the exact
information for which indexes to insert new tuples into, and it only
really needs to be sized to the number of summarizing indexes (usually
small/nonexistent, but potentially huge).
-----
I think your patch design came from trying to include at least two
distinct optimizations:
1) to make the HOT-or-not check include whether the expressions of
indexes were updated, and
2) to only insert index tuples into indexes that got updated values
when table_tuple_update returns update_indexes=TU_Summarizing.
While they touch similar code (clearly seen here), I think those
should be implemented in different patches. For (1), the current API
surface is good enough when the EState is passed down. For (2), you'll
indeed also need an additional argument we can use to fill with the
right summarizing indexes, but I don't think that can nor should
replace the function of TU_UpdateIndexes.
If you agree with my observation of those being distinct
optimizations, could you split this patch into parts (but still within
the same series) so that these are separately reviewable?
-----
I notice that ExecIndexesRequiringUpdates() does work on all indexes,
rather than just indexes relevant to this exact phase of checking. I
think that is a waste of time, so if we sort the indexes in order of
[hotblocking without expressions, hotblocking with expressions,
summarizing], then (with stored start/end indexes) we can save time in
cases where there are comparatively few of the types we're not going
to look at.
As an extreme example: we shouldn't do the (comparatively) expensive
work evaluating expressions to determine which of 1000s of summarizing
indexes has been updated when we're still not sure if we can apply HOT
at all.
(Sidenote: Though, arguably, we could be smarter by skipping index
insertions into unmodified summarizing indexes altogether regardless
of HOT status, as long as the update is on the same page - but that's
getting ahead of ourselves and not relevant to this discussion.)
-----
I noticed you've disabled any passing of "HOT or not" in the
simple_update cases, and have done away with the various checks that
are in place to prevent corruption. I don't think that's a great idea,
it's quite likely to cause bugs.
-----
You're extracting type info from the opclass, to use in
datum_image_eq(). Couldn't you instead use the index relation's
TupleDesc and its stored attribute information instead? That saves us
from having to do further catalog lookups during execution. I'm also
fairly sure that that information is supposed to be a more accurate
representation of attributes' expression output types than the
opclass' type information (though, they probably should match).
-----
The operations applied in ExecIndexesRequiringUpdates partially
duplicate those done in index_unchanged_by_update. Can we (partially)
unify this, and pass which indexes were updated through the IndexInfo,
rather than the current bitmap?
-----
I don't see a good reason to add IndexInfo to Relation, by way of
rd_indexInfoList. It seems like an ad-hoc way of passing data around,
and I don't think that's the right way.
I mean, it's clear that "updated indexed column's value == non-HOT
update". And that to determine whether an updated *projected* column's
value (i.e., expression index column's value) was actually updated we
need to calculate the previous and current index value, thus execute
the projection twice. But why would we have significant additional
overhead if there are no expression indexes, or when we can know by
bitmap overlap that the only interesting cases are summarizing
indexes?See the attached approach. Evaluation of the expressions only has to
happen if there are any HOT-blocking attributes which are exclusively
hot-blockingly indexed through expressions, so if the updated
attribute numbers are a subset of hotblockingexprattrs. (substitute
hotblocking with summarizing for the summarizing approach)I believe I've incorporated the gist of your idea in this v5 patch, let me know if I missed something.
Seems about accurate.
Show quoted text
I would've implemented this with (1) two new bitmaps, one each for
normal and summarizing indexes, each containing which columns are
exclusively used in expression indexes (and which should thus be used
to trigger the (comparatively) expensive recalculation).That was one where I started, over time that became harder to work as the bitmaps contain the union of index attributes for the table not per-column.
I think it's fairly easy to create, though.
Now there is one bitmap to cover the broadest case and then a function to find the modified set of indexes where each is examined against bitmaps that contain only attributes specific to the index in question. This helped in cases where there were both expression and non-expression indexes on the same attribute.
Fair, but do we care about one expression index on (attr1->>'data')'s
value *not* changing when an index on (attr1) exists and attr1 has
changed? That index on att1 would block HOT updates regardless of the
(lack of) changes to the (att1->>'data') index, so doing those
expensive calculations seems quite wasteful.Agreed, when both a non-expression and an expression index exist on the same attribute then the expression checks are unnecessary and should be avoided. In this v5 patchset this case becomes two checks of bitmaps (first hot_attrs, then exclusively exp_attrs) before proceeding with a non-HOT update.
So, in my opinion, we should also keep track of those attributes only
included in expressions of indexes, and that's fairly easy: see
attached prototype.diff.txt (might need some work, the patch was
drafted on v16's codebase, but the idea is clear).Thank you for your patch, I've included and expanded it.
Matthias,
First off, I can't thank you enough for taking the time to review in detail the patch. I appreciate and value your time and excellent feedback.
Second, I think that I should admit to the fact that I've also been working on making PHOT functional again. I have it rebased against master, however it is still at the proof-of-concept phase and there are some larger issues to flesh out. One of the changes in this patch would have enabled that future with PHOT, specifically the bitmap as the method for conveying what indexes need updating.
That said, I agree that this patch should simply focus on expanding HOT to expression indexes and partial indexes. With that in mind I will return to the TU_UpdateIndexes approach, even though I'm not a fan of it, and later propose the bitmap approach (or something better) as part of PHOT if and when that's ready.
Changes v5 to v6:
* reverted to TU_UpdateIndexes rather than Bitmapset
* renamed ExecIndexesRequiringUpdates to ExecIndexesExpressionsWereNotUpdated
* ExecIndexesExpressionsWereNotUpdated returns bool, exits early if possible
* simple_update paths can be HOT once again
* removed efforts to determine the subset of updated summarizing indexes
* create filtered IndexInfo list in relcache containing only indexes with expressions
* now using index TupleDesc CompactAttributes for arguments to datum_is_equal() <- did I get this one right, I'm not sure it is what you had in mind
best.
-greg
On Feb 15, 2025, at 5:49 AM, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Thu, 13 Feb 2025 at 19:46, Burd, Greg <gregburd@amazon.com> wrote:
-----
I'm not a fan of how you replaced TU_UpdateIndexes with a bitmap. It
seems unergonomic and a waste of performance.In HEAD, we don't do any expensive computations for the fast path of
"indexed attribute updated" - at most we do the bitmap compare and
then set a pointer. With this patch, the way we signal that is by
allocating a bitmap of size O(n_indexes). That's potentially quite
expensive (given 1000s of indexes), and definitely more expensive than
only a pointer assignment.
Fair point. I'm not a fan of the TU_UpdateIndexes enum, but I appreciate your argument against the bitmapset. Under the conditions you describe (1000s of indexes) it could grow unwieldy and impact performance.
In HEAD, we have a clear indication of which classes of indexes to
update, with TU_UpdateIndexes. With this patch, we have to derive that
from the (lack of) bits in the bitmap that might be output by the
table_update procedure.
Yes, but... that "clear indication" is lacking the ability to convey more detailed information. It doesn't tell you which summarizing indexes really need updating just that as a result of being on the HOT path all summarizing indexes require updates.
I think we can do with an additional parameter for which indexes would
be updated (or store that info in the parameter which also will hold
EState et al). I think it's cheaper that way, too - only when
update_indexes could be TU_SUMMARIZING we might need the exact
information for which indexes to insert new tuples into, and it only
really needs to be sized to the number of summarizing indexes (usually
small/nonexistent, but potentially huge).
Okay, yes with this patch we need only concern ourselves with all, none, or some subset of summarizing as before. I'll work on the opaque parameter next iteration.
-----
I think your patch design came from trying to include at least two
distinct optimizations:
1) to make the HOT-or-not check include whether the expressions of
indexes were updated, and
2) to only insert index tuples into indexes that got updated values
when table_tuple_update returns update_indexes=TU_Summarizing.
It did. (1) for sure, but the second is more related to the PHOT work (as mentioned above). With that work almost all updates are on the HOT/PHOT path and so the bitmap of changed indexes is small. It would take an update of a table with 1000s of indexes where almost all those were modified to create a bitmap that was large, which certainly could (and somewhere likely does) exist, but it's not the common case (I'd imagine). In that case the work to update all those indexes will likely dwarf the work to build that bitmap, but I could be wrong.
But you're fundamentally right, I'm conflating two ideas and I shouldn't. I will focus the changes required for this idea without pulling in changes useful to a future ones.
While they touch similar code (clearly seen here), I think those
should be implemented in different patches. For (1), the current API
surface is good enough when the EState is passed down. For (2), you'll
indeed also need an additional argument we can use to fill with the
right summarizing indexes, but I don't think that can nor should
replace the function of TU_UpdateIndexes.
Okay, I'll combine the earlier v3 patch with the changes in v5. That should leave TU_UpdateIndexes in place and allow for HOT with expression indexes. I'll change the ExecIndexesRequiringUpdates() a bit (and rename it) so that it is just a boolean test for any index that spoils the HOT path and can exit early in that case potentially avoiding extra work.
If you agree with my observation of those being distinct
optimizations, could you split this patch into parts (but still within
the same series) so that these are separately reviewable?
I agree, but I think that a single simple more focused patch will suffice.
-----
I notice that ExecIndexesRequiringUpdates() does work on all indexes,
rather than just indexes relevant to this exact phase of checking. I
think that is a waste of time, so if we sort the indexes in order of
[hotblocking without expressions, hotblocking with expressions,
summarizing], then (with stored start/end indexes) we can save time in
cases where there are comparatively few of the types we're not going
to look at.
If I plan on just having ExecIndexesRequiringUpdates() return a bool rather than a bitmap then sorting, or even just filtering the list of IndexInfo to only include indexes with expressions, makes sense. That way the only indexes in question in that function's loop will be those that may spoil the HOT path. When that list is length 0, we can skip the tests entirely.
As an extreme example: we shouldn't do the (comparatively) expensive
work evaluating expressions to determine which of 1000s of summarizing
indexes has been updated when we're still not sure if we can apply HOT
at all.
That makes sense, and your sorting idea would inform that kind of work. I'll keep that in mind if I reintroduce code that aims to only update changed summarized indexes.
(Sidenote: Though, arguably, we could be smarter by skipping index
insertions into unmodified summarizing indexes altogether regardless
of HOT status, as long as the update is on the same page - but that's
getting ahead of ourselves and not relevant to this discussion.)-----
I noticed you've disabled any passing of "HOT or not" in the
simple_update cases, and have done away with the various checks that
are in place to prevent corruption. I don't think that's a great idea,
it's quite likely to cause bugs.
Yes. I'll resurrect that.
-----
You're extracting type info from the opclass, to use in
datum_image_eq(). Couldn't you instead use the index relation's
TupleDesc and its stored attribute information instead? That saves us
from having to do further catalog lookups during execution. I'm also
fairly sure that that information is supposed to be a more accurate
representation of attributes' expression output types than the
opclass' type information (though, they probably should match).
I hadn't thought of that, I think it's a valid idea and I'll update accordingly. I think I understand what you are suggesting.
-----
The operations applied in ExecIndexesRequiringUpdates partially
duplicate those done in index_unchanged_by_update. Can we (partially)
unify this, and pass which indexes were updated through the IndexInfo,
rather than the current bitmap?
I think I do that now, feel free to say otherwise. When the expression is checked in ExecIndexesExpressionsWereNotUpdated() I set:
/* Shortcut index_unchanged_by_update(), we know the answer. */ indexInfo->ii_CheckedUnchanged = true; indexInfo->ii_IndexUnchanged = !changed;
That prevents duplicate effort in index_unchanged_by_update().
-----
I don't see a good reason to add IndexInfo to Relation, by way of
rd_indexInfoList. It seems like an ad-hoc way of passing data around,
and I don't think that's the right way.
At one point I'd created a way to get this set via relcache, I will resurrect that approach but I'm not sure it is what you were hinting at. The current method avoids pulling a the lock on the index to build the list, but doing that once in relcache isn't horrible. Maybe you were suggesting using that opaque struct to pass around the list of IndexInfo? Let me know on this one if you had a specific idea. The swap I've made in v6 really just moves the IndexInfo list to a filtered list with a new name created in relcache.
Attachments:
v6-0001-Expand-HOT-update-path-to-include-expression-and-.patchapplication/octet-stream; name=v6-0001-Expand-HOT-update-path-to-include-expression-and-.patchDownload+1560-61
Changes v6 to v7:
* Fixed documentation oversight causing build failure
* Changed how I convey attribute len/by-val in IndexInfo
* Fixed method to shortcut index_unchanged_by_update() when possible
-greg
Attachments:
v7-0001-Expand-HOT-update-path-to-include-expression-and-.patchapplication/octet-stream; name=v7-0001-Expand-HOT-update-path-to-include-expression-and-.patchDownload+1566-62
Hello,
I've rebased and updated the patch a bit. The biggest change is that the performance penalty measured with v1 of this patch is essentially gone in v10. The overhead was due to re-creating IndexInfo information unnecessarily, which I found existed in the estate. I've added a few fields in IndexInfo that are not populated by default but necessary when checking expression indexes, those fields are populated on demand and only once limiting their overhead.
Here's what you'll find if you look into execIndexing.c where the majority of changes happened.
* assumes estate->es_result_relations[0] is the ResultRelInfo being updated
* uses ri_IndexRelationInfo[] from within estate rather than re-creating it
* augments IndexInfo only when needed for testing expressions and only once
* only creates a local old/new TupleTableSlot when not present in estate
* retains existing summarized index HOT update logic
One remaining concern stems from the assumption that estate->es_result_relations[0] is always going to be the relation being updated. This is guarded by assert()'s in the patch. It seems this is safe, all tests are passing (including TAP) and my review of the code seems to line up with that assumption. That said... opinions?
Another lingering question is under what conditions the old/new TupleTableSlots are not created and available via the ResultRelInfo found in estate. I've only seen this happen when there is an INSERT ... ON CONFLICT UPDATE ... with expression indexes. I was hopeful that in all cases I could avoid re-creating those when checking expression indexes to avoid that repeated overhead. I still avoid it when possible in this patch.
When you have time I'd appreciate any feedback.
-greg
Amazon Web Services: https://aws.amazon.com
Attachments:
v10-0001-Expand-HOT-update-path-to-include-expression-and.patchapplication/octet-stream; name=v10-0001-Expand-HOT-update-path-to-include-expression-and.patchDownload+1619-53
Hi,
Sorry for the delay. This is a reply for the mail thread up to 17 Feb,
so it might be very out-of-date by now, in which case sorry for the
noise.
On Mon, 17 Feb 2025 at 20:54, Burd, Greg <gregburd@amazon.com> wrote:
On Feb 15, 2025, at 5:49 AM, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
In HEAD, we have a clear indication of which classes of indexes to
update, with TU_UpdateIndexes. With this patch, we have to derive that
from the (lack of) bits in the bitmap that might be output by the
table_update procedure.Yes, but... that "clear indication" is lacking the ability to convey more detailed information. It doesn't tell you which summarizing indexes really need updating just that as a result of being on the HOT path all summarizing indexes require updates.
Agreed that it's not great if you want to know about which indexes
were meaningfully updated. I think that barring significant advances
in performance of update checks, we can devise a way of transfering
this info to the table_tuple_update caller once we get a need for more
detailed information (e.g. this could be transfered through the
IndexInfo* that's currently also used by index_unchanged_by_update).
I think we can do with an additional parameter for which indexes would
be updated (or store that info in the parameter which also will hold
EState et al). I think it's cheaper that way, too - only when
update_indexes could be TU_SUMMARIZING we might need the exact
information for which indexes to insert new tuples into, and it only
really needs to be sized to the number of summarizing indexes (usually
small/nonexistent, but potentially huge).Okay, yes with this patch we need only concern ourselves with all, none, or some subset of summarizing as before. I'll work on the opaque parameter next iteration.
Thanks!
-----
I notice that ExecIndexesRequiringUpdates() does work on all indexes,
rather than just indexes relevant to this exact phase of checking. I
think that is a waste of time, so if we sort the indexes in order of
[hotblocking without expressions, hotblocking with expressions,
summarizing], then (with stored start/end indexes) we can save time in
cases where there are comparatively few of the types we're not going
to look at.If I plan on just having ExecIndexesRequiringUpdates() return a bool rather than a bitmap then sorting, or even just filtering the list of IndexInfo to only include indexes with expressions, makes sense. That way the only indexes in question in that function's loop will be those that may spoil the HOT path. When that list is length 0, we can skip the tests entirely.
Yes, exactly. Though I'm not sure it should hit that path if the
length is 0, as would mean we had expression indexes that matched the
updated columns, but somehow none are in the list?
You're extracting type info from the opclass, to use in
datum_image_eq(). Couldn't you instead use the index relation's
TupleDesc and its stored attribute information instead? That saves us
from having to do further catalog lookups during execution. I'm also
fairly sure that that information is supposed to be a more accurate
representation of attributes' expression output types than the
opclass' type information (though, they probably should match).I hadn't thought of that, I think it's a valid idea and I'll update accordingly. I think I understand what you are suggesting.
Thanks, that change was exactly what I meant.
-----
The operations applied in ExecIndexesRequiringUpdates partially
duplicate those done in index_unchanged_by_update. Can we (partially)
unify this, and pass which indexes were updated through the IndexInfo,
rather than the current bitmap?I think I do that now, feel free to say otherwise. When the expression is checked in ExecIndexesExpressionsWereNotUpdated() I set:
/* Shortcut index_unchanged_by_update(), we know the answer. */ indexInfo->ii_CheckedUnchanged = true; indexInfo->ii_IndexUnchanged = !changed;
That prevents duplicate effort in index_unchanged_by_update().
Exactly, yes.
-----
I don't see a good reason to add IndexInfo to Relation, by way of
rd_indexInfoList. It seems like an ad-hoc way of passing data around,
and I don't think that's the right way.At one point I'd created a way to get this set via relcache, I will resurrect that approach but I'm not sure it is what you were hinting at.
AFAIK, we don't have IndexInfo in the relcaches currently. I'm very
hesitant to add an executor node (!) subtype to catalog caches, as
IndexInfos are also used to store temporary information about e.g.
index tuple insertion state, which (if IndexInfo is stored in
relcaches) would imply modifying relcache entries without any further
locks, and I'm not sure that's at all an OK thing to do.
The current method avoids pulling a the lock on the index to build the list, but doing that once in relcache isn't horrible. Maybe you were suggesting using that opaque struct to pass around the list of IndexInfo? Let me know on this one if you had a specific idea. The swap I've made in v6 really just moves the IndexInfo list to a filtered list with a new name created in relcache.
My main concern is the storage of executor nodes(!) directly in the
relcache. I don't think we need that: We have relatively direct access
to the right IndexInfo** in ResultRelInfo->ri_IndexRelationInfo, which
I think should be sufficient for this purpose. (The relevant RRI is
available in table_tuple_update caller ExecUpdateAct; and could be
passed down by ExecSimpleRelationUpdate to simple_table_tuple_update,
covering both (current, core) callers of table_tuple_update). That
would then be passed down to the TableAM using an opaque pointer type;
for example (names, file locations, exact layout all bikesheddable):
/* tableam.h */
/* exact definition somewhere else, in e.g. an executor_internal.h */
typedef struct TU_UpdateIndexData TU_UpdateIndexData;
table_tuple_update(..., TU_UpdateIndexData *idxupdate, ...)
/* executor.h */
TU_UpdateIndexes
UpdateDetermineChangedIndexes(TU_UpdateIndexData *idxupdate,
TableTupleSlot *old, TableTupleSlot *new, bitmap *changed_atts, ...);
/* executor_internal.h */
struct TU_UpdateIndexData
{
EState estate;
IndexInfo **idxinfos;
...
}
-----
Looking at your later comments about RRI in patch v8, I think that
would solve and clean up the way that you currently get access to the
RRI and thus index set.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Wed, 5 Mar 2025 at 18:21, Burd, Greg <gregburd@amazon.com> wrote:
Hello,
I've rebased and updated the patch a bit. The biggest change is that the performance penalty measured with v1 of this patch is essentially gone in v10. The overhead was due to re-creating IndexInfo information unnecessarily, which I found existed in the estate. I've added a few fields in IndexInfo that are not populated by default but necessary when checking expression indexes, those fields are populated on demand and only once limiting their overhead.
This review is based on a light reading of patch v10. I have not read
all 90kB, and am unlikely to finish a full review soon:
* assumes estate->es_result_relations[0] is the ResultRelInfo being updated
I'm not sure that's a valid assumption. I suspect it might be false in
cases of nested updates, like
$ UPDATE table1 SET value = other.value FROM (UPDATE table2 SET value
= 2 ) other WHERE other.id = table1.id;
If this table1 or table2 has expression indexes I suspect it may
result in this assertion failing (but I haven't spun up a server with
the patch).
Alternatively, please also check that it doesn't break if any of these
two tables is partitioned with multiple partitions (and/or has
expression indexes, etc.).
* uses ri_IndexRelationInfo[] from within estate rather than re-creating it
As I mentioned above, I think it's safer to pass the known-correct RRI
(known by callers of table_tuple_update) down the stack.
* augments IndexInfo only when needed for testing expressions and only once
ExecExpressionIndexesUpdated seems to always loop over all indexes,
always calling AttributeIndexInfo which always updates the fields in
the IndexInfo when the index has only !byval attributes (e.g. text,
json, or other such varlena types). You say it happens only once, have
I missed something?
I'm also somewhat concerned about the use of typecache lookups on
index->rd_opcintype[i], rather than using
TupleDescCompactAttr(index->rd_att, i); the latter of which I think
should be faster, especially when multiple wide indexes are scanned
with various column types. In hot loops of single-tuple update
statements I think this may make a few 0.1%pt difference - not a lot,
but worth considering.
* only creates a local old/new TupleTableSlot when not present in estate
I'm not sure it's safe for us to touch that RRI's tupleslots.
* retains existing summarized index HOT update logic
Great, thanks!
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Mar 5, 2025, at 5:56 PM, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
Hi,
Sorry for the delay. This is a reply for the mail thread up to 17 Feb,
so it might be very out-of-date by now, in which case sorry for the
noise.
Never noise, always helpful.
On Mon, 17 Feb 2025 at 20:54, Burd, Greg <gregburd@amazon.com> wrote:
On Feb 15, 2025, at 5:49 AM, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
In HEAD, we have a clear indication of which classes of indexes to
update, with TU_UpdateIndexes. With this patch, we have to derive that
from the (lack of) bits in the bitmap that might be output by the
table_update procedure.Yes, but... that "clear indication" is lacking the ability to convey more detailed information. It doesn't tell you which summarizing indexes really need updating just that as a result of being on the HOT path all summarizing indexes require updates.
Agreed that it's not great if you want to know about which indexes
were meaningfully updated. I think that barring significant advances
in performance of update checks, we can devise a way of transfering
this info to the table_tuple_update caller once we get a need for more
detailed information (e.g. this could be transfered through the
IndexInfo* that's currently also used by index_unchanged_by_update).
One idea I had and tested a bit was to re-order the arrays of ri_IndexRelationInfo/Desc[] and then have a ri_NumModifiedIndices. This avoided allocation of a Bitmapset and was something downstream code could use or not depending on the requirements within that path. It may be, and I didn't check, that the order of indexes in that array has meaning in other contexts in the code so I put this aside. I think I'll keep focused as much as possible and consider that within the context of another patch later if needed.
I think we can do with an additional parameter for which indexes would
be updated (or store that info in the parameter which also will hold
EState et al). I think it's cheaper that way, too - only when
update_indexes could be TU_SUMMARIZING we might need the exact
information for which indexes to insert new tuples into, and it only
really needs to be sized to the number of summarizing indexes (usually
small/nonexistent, but potentially huge).Okay, yes with this patch we need only concern ourselves with all, none, or some subset of summarizing as before. I'll work on the opaque parameter next iteration.
Thanks!
I still haven't added the opaque parameter as suggested, but it's on my mind to give it a shot.
-----
I don't see a good reason to add IndexInfo to Relation, by way of
rd_indexInfoList. It seems like an ad-hoc way of passing data around,
and I don't think that's the right way.At one point I'd created a way to get this set via relcache, I will resurrect that approach but I'm not sure it is what you were hinting at.
AFAIK, we don't have IndexInfo in the relcaches currently. I'm very
hesitant to add an executor node (!) subtype to catalog caches, as
IndexInfos are also used to store temporary information about e.g.
index tuple insertion state, which (if IndexInfo is stored in
relcaches) would imply modifying relcache entries without any further
locks, and I'm not sure that's at all an OK thing to do.
This is gone in the v10 patch in favor of finding IndexInfo within the EState's ri_IndexRelationInfo[].
Thanks again for your continued support!
-greg
On Mar 5, 2025, at 6:39 PM, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Wed, 5 Mar 2025 at 18:21, Burd, Greg <gregburd@amazon.com> wrote:
Hello,
I've rebased and updated the patch a bit. The biggest change is that the performance penalty measured with v1 of this patch is essentially gone in v10. The overhead was due to re-creating IndexInfo information unnecessarily, which I found existed in the estate. I've added a few fields in IndexInfo that are not populated by default but necessary when checking expression indexes, those fields are populated on demand and only once limiting their overhead.
This review is based on a light reading of patch v10. I have not read
all 90kB, and am unlikely to finish a full review soon:* assumes estate->es_result_relations[0] is the ResultRelInfo being updated
I'm not sure that's a valid assumption. I suspect it might be false in
cases of nested updates, like$ UPDATE table1 SET value = other.value FROM (UPDATE table2 SET value
= 2 ) other WHERE other.id = table1.id;If this table1 or table2 has expression indexes I suspect it may
result in this assertion failing (but I haven't spun up a server with
the patch).
Alternatively, please also check that it doesn't break if any of these
two tables is partitioned with multiple partitions (and/or has
expression indexes, etc.).
Valid, and possible. I'll check and find a way to pass along the known-correct RRI index into that array.
* uses ri_IndexRelationInfo[] from within estate rather than re-creating it
As I mentioned above, I think it's safer to pass the known-correct RRI
(known by callers of table_tuple_update) down the stack.
I think passing the known-correct RRI index is the way to go as I need information from both ri_IndexRelationInfo/Desc[] arrays.
* augments IndexInfo only when needed for testing expressions and only once
ExecExpressionIndexesUpdated seems to always loop over all indexes,
always calling AttributeIndexInfo which always updates the fields in
the IndexInfo when the index has only !byval attributes (e.g. text,
json, or other such varlena types). You say it happens only once, have
I missed something?
There's a test that avoids doing it more than once, but I'm going to rename this as BuildExpressionIndexInfo() and call it from ExecOpenIndices() if there are expressions on the index. I think that's cleaner and there's precedent for it in the form of BuildSpeculativeIndexInfo().
I'm also somewhat concerned about the use of typecache lookups on
index->rd_opcintype[i], rather than using
TupleDescCompactAttr(index->rd_att, i); the latter of which I think
should be faster, especially when multiple wide indexes are scanned
with various column types. In hot loops of single-tuple update
statements I think this may make a few 0.1%pt difference - not a lot,
but worth considering.
I was just working on that. Good idea.
* only creates a local old/new TupleTableSlot when not present in estate
I'm not sure it's safe for us to touch that RRI's tupleslots.
Me neither, that's why I mentioned it. It was my attempt to avoid the work to create/destroy temp slots over and over that led to that idea. It's working, but needs more thought.
* retains existing summarized index HOT update logic
Great, thanks!
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
I might widen this patch a bit to include support for testing equality of index tuples using custom operators when they exist for the index. In the use case I'm solving for we use a custom operator for equality that is not the same as a memcmp(). Do you have thoughts on that? It may be hard to accomplish this as the notion of an equality operator is specific to the index access method and not well-defined outside that AFAICT. If that's the case I'd have to augment the definition of an index access method to provide that information.
-greg