Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
Hello, hackers!
I am thinking about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
improvements) in some lighter way.
Yes, a serious bug (2) was caused by this optimization, and it has now been reverted.
But what about a safer idea in that direction:
1) add a new horizon which ignores PROC_IN_SAFE_IC backends and standby queries
2) use this horizon for setting the LP_DEAD bit in indexes (excluding
indexes being built, of course)
Index LP_DEAD hints are not used by standbys in any way (they are just
ignored), and the heap scan done by index building does not use them
either.
But, at the same time:
1) index scans will be much faster during index creation or standby
reporting queries
2) indexes can keep themselves fit using the various LP_DEAD-based optimizations
3) less WAL, because a huge amount of full-page writes is currently
caused by the tons of LP_DEAD bits set in indexes
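As a rough sketch of the proposed horizon (the field name
index_hint_oldest_nonremovable is hypothetical, not existing code), the
branch in ComputeXidHorizons could look like this:

    /* all backends keep holding back the regular data horizon, as today */
    h->data_oldest_nonremovable =
        TransactionIdOlder(h->data_oldest_nonremovable, xmin);
    /* ...but safe-IC backends do not hold back the LP_DEAD hint horizon
     * (standby queries would be excluded here as well) */
    if (!(statusFlags & PROC_IN_SAFE_IC))
        h->index_hint_oldest_nonremovable =
            TransactionIdOlder(h->index_hint_oldest_nonremovable, xmin);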
The patch seems more or less easy to implement.
Is it worth implementing? Or is it too scary?
[1]: /messages/by-id/20210115133858.GA18931@alvherre.pgsql
[2]: /messages/by-id/17485-396609c6925b982d@postgresql.org
On Fri, 15 Dec 2023, 20:07 Michail Nikolaev, <michail.nikolaev@gmail.com>
wrote:
I highly doubt this is worth the additional cognitive overhead of another
liveness state, and I think there might be other issues with marking index
tuples dead before the table tuple is dead that I can't think of
right now.
I've thought about alternative solutions, too: how about getting a new
snapshot every so often?
We don't really care about the liveness of the already-scanned data; the
snapshots used for RIC are used only during the scan. C/RIC's relation's
lock level means vacuum can't run to clean up dead line items, so as long
as we only swap the backend's reported snapshot (thus xmin) while the scan
is between pages we should be able to reduce the time C/RIC is the one
backend holding back cleanup of old tuples.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hello!
I've thought about alternative solutions, too: how about getting a new snapshot every so often?
We don't really care about the liveness of the already-scanned data; the snapshots used for RIC
are used only during the scan. C/RIC's relation's lock level means vacuum can't run to clean up
dead line items, so as long as we only swap the backend's reported snapshot (thus xmin) while
the scan is between pages we should be able to reduce the time C/RIC is the one backend
holding back cleanup of old tuples.
Hm, it looks like an interesting idea! It may be more dangerous, but
at least it feels much more elegant than an LP_DEAD-related way.
Also, it feels like we may apply this to both phases (the first and the second scans).
The original patch (1) helped only with the second one (after the call
to set_indexsafe_procflags).
But for the first scan we were allowed to do so only for non-unique indexes
because of:
* The reason for doing that is to avoid
* bogus unique-index failures due to concurrent UPDATEs (we might see
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
Also, (1) was limited to indexes without expressions and predicates
(2) because such expressions may execute queries against other tables (sic!).
One possible solution is to add some checks to make sure no
user-defined functions are used.
But as far as I understand, it affects only CIC for now and does not
affect the ability to use the proposed technique (updating the snapshot
from time to time).
However, I think we need a more or less formal proof that it is safe - it
is really challenging to keep all the possible cases in one's head. I'll
try to do something here.
Another possible issue may be caused by the new locking pattern - we
will be required to wait for all transactions started before the end
of the phase to exit.
[1]: /messages/by-id/20210115133858.GA18931@alvherre.pgsql
[2]: /messages/by-id/CAAaqYe_tq_Mtd9tdeGDsgQh+wMvouithAmcOXvCbLaH2PPGHvA@mail.gmail.com
On Sun, 17 Dec 2023, 21:14 Michail Nikolaev, <michail.nikolaev@gmail.com> wrote:
Hello!
I've thought about alternative solutions, too: how about getting a new snapshot every so often?
We don't really care about the liveness of the already-scanned data; the snapshots used for RIC
are used only during the scan. C/RIC's relation's lock level means vacuum can't run to clean up
dead line items, so as long as we only swap the backend's reported snapshot (thus xmin) while
the scan is between pages we should be able to reduce the time C/RIC is the one backend
holding back cleanup of old tuples.
Hm, it looks like an interesting idea! It may be more dangerous, but
at least it feels much more elegant than an LP_DEAD-related way.
Also, it feels like we may apply this to both phases (the first and the second scans).
The original patch (1) helped only with the second one (after the call
to set_indexsafe_procflags).
But for the first scan we were allowed to do so only for non-unique indexes
because of:
* The reason for doing that is to avoid
* bogus unique-index failures due to concurrent UPDATEs (we might see
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
Yes, for that we'd need an extra scan of the index that validates
uniqueness. I think there was a proposal (though it may only have been
an idea) some time ago, about turning existing non-unique indexes into
unique ones by validating the data. Such a system would likely be very
useful to enable this optimization.
Also, (1) was limited to indexes without expressions and predicates
(2) because such may execute queries to other tables (sic!).
Note that the use of such expressions would be a violation of the
function's definition; it would depend on data from other tables which
makes the function behave like a STABLE function, as opposed to the
IMMUTABLE that is required for index expressions. So, I don't think we
should specially care about being correct for incorrectly marked
function definitions.
One possible solution is to add some checks to make sure no
user-defined functions are used.
But as far as I understand, it affects only CIC for now and does not
affect the ability to use the proposed technique (updating the snapshot
from time to time).
However, I think we need a more or less formal proof that it is safe - it
is really challenging to keep all the possible cases in one's head. I'll
try to do something here.
I just realised there is one issue with this design: We can't cheaply
reset the snapshot during the second table scan:
It is critically important that the second scan of R/CIC uses an index
contents summary (made with index_bulk_delete) that was created while
the current snapshot was already registered. If we didn't do that, the
following would occur:
1. The index is marked ready for inserts from concurrent backends, but
not yet ready for queries.
2. We get the bulkdelete data
3. A concurrent backend inserts a new tuple T on heap page P, inserts
it into the index, and commits. This tuple is not in the summary, but
has been inserted into the index.
4. R/CIC resets the snapshot, making T visible.
5. R/CIC scans page P, finds that tuple T has to be indexed but is not
present in the summary, thus inserts that tuple into the index (which
already had it inserted at 3)
This thus would be a logic bug, as indexes assume at-most-once
semantics for index tuple insertion; duplicate insertions are an error.
So, the "reset the snapshot every so often" trick cannot be applied in
phase 3 (the rescan), or we'd have to do an index_bulk_delete call
every time we reset the snapshot. Rescanning might be worth the cost
(e.g. when using BRIN), but that is very unlikely.
Alternatively, we'd need to find another way to prevent us from
inserting these duplicate entries - maybe by storing the scan's data
in a buffer to later load into the index after another
index_bulk_delete()? Counterpoint: for BRIN indexes that'd likely
require a buffer much larger than the result index would be.
Either way, for the first scan (i.e. phase 2 "build new indexes") this
is not an issue: we don't care about what transaction adds/deletes
tuples at that point.
For all we know, all tuples of the table may be deleted concurrently
before we even allow concurrent backends to start inserting tuples,
and the algorithm would still work as it does right now.
Another possible issue may be caused by the new locking pattern - we
will be required to wait for all transactions started before the end
of the phase to exit.
What do you mean by "new locking pattern"? We already keep an
ShareUpdateExclusiveLock on every heap table we're accessing during
R/CIC, and that should already prevent any concurrent VACUUM
operations, right?
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hello!
Also, feels like we may apply this to both phases (first and the second scans).
The original patch (1) was helping only to the second one (after call
to set_indexsafe_procflags).
Oops, I was wrong here. The original version of the patch was also applied to
both phases.
Note that the use of such expressions would be a violation of the
function's definition; it would depend on data from other tables which
makes the function behave like a STABLE function, as opposed to the
IMMUTABLE that is required for index expressions. So, I don't think we
should specially care about being correct for incorrectly marked
function definitions.
Yes, but such cases could probably cause crashes as well...
So, I think it is better to check for custom functions here. But I am
still not sure whether such limitations are still required for the
proposed optimization or not.
I just realised there is one issue with this design: We can't cheaply
reset the snapshot during the second table scan:
It is critically important that the second scan of R/CIC uses an index
contents summary (made with index_bulk_delete) that was created while
the current snapshot was already registered.
So, the "reset the snapshot every so often" trick cannot be applied in
phase 3 (the rescan), or we'd have to do an index_bulk_delete call
every time we reset the snapshot. Rescanning might be worth the cost
(e.g. when using BRIN), but that is very unlikely.
Hm, I think it is still possible. We could just manually recheck the
tuples we see against the snapshot currently used for the scan. If the "old"
snapshot can also see the tuple (HeapTupleSatisfiesHistoricMVCC), then
search for it in the index summary.
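A sketch of that recheck during the validation scan (tuple_in_summary and
reference_snapshot are hypothetical names; as corrected later in this
thread, the visibility test would be HeapTupleSatisfiesVisibility rather
than the Historic variant):

    /* Only tuples missing from the index summary need the recheck. */
    if (!tuple_in_summary(&heapTuple->t_self))
    {
        /* Visible to the reference snapshot => the first scan missed it,
         * so the validation scan must insert it.  Otherwise it was
         * inserted concurrently (after indisready) and is already indexed. */
        if (HeapTupleSatisfiesVisibility(heapTuple, reference_snapshot, buffer))
            must_insert = true;
    }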
What do you mean by "new locking pattern"? We already keep an
ShareUpdateExclusiveLock on every heap table we're accessing during
R/CIC, and that should already prevent any concurrent VACUUM
operations, right?
I was thinking not about "classical" locking, but about waiting for
other backends via WaitForLockers(heaplocktag, ShareLock, true).
But I think everything should be fine.
Best regards,
Michail.
On Wed, 20 Dec 2023 at 10:56, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Note that the use of such expressions would be a violation of the
function's definition; it would depend on data from other tables which
makes the function behave like a STABLE function, as opposed to the
IMMUTABLE that is required for index expressions. So, I don't think we
should specially care about being correct for incorrectly marked
function definitions.Yes, but such cases could probably cause crashes also...
So, I think it is better to check them for custom functions. But I
still not sure -
if such limitations still required for proposed optimization or not.
I think the contents could be inconsistent, but not more inconsistent than
if the index was filled across multiple transactions using inserts.
Either way, I don't see it breaking anything that is not already
broken in that way in other places - at most it will introduce another
path that exposes the broken state caused by mislabeled functions.
I just realised there is one issue with this design: We can't cheaply
reset the snapshot during the second table scan:
It is critically important that the second scan of R/CIC uses an index
contents summary (made with index_bulk_delete) that was created while
the current snapshot was already registered.
So, the "reset the snapshot every so often" trick cannot be applied in
phase 3 (the rescan), or we'd have to do an index_bulk_delete call
every time we reset the snapshot. Rescanning might be worth the cost
(e.g. when using BRIN), but that is very unlikely.
Hm, I think it is still possible. We could just manually recheck the
tuples we see against the snapshot currently used for the scan. If the "old"
snapshot can also see the tuple (HeapTupleSatisfiesHistoricMVCC), then
search for it in the index summary.
That's an interesting method.
How would this deal with tuples not visible to the old snapshot?
Presumably we can assume they're newer than that snapshot (the old
snapshot didn't have it, but the new one does, so it's committed after
the old snapshot, making them newer), so that backend must have
inserted it into the index already, right?
HeapTupleSatisfiesHistoricMVCC
That function has this comment marker:
"Only usable on tuples from catalog tables!"
Is that correct even for this?
Should this deal with any potential XID wraparound, too?
How does this behave when the newly inserted tuple's xmin gets frozen?
This would be allowed to happen during heap page pruning, afaik - no
rules that I know of which are against that - but it would create
issues where normal snapshot visibility rules would indicate it
visible to both snapshots regardless of whether it actually was
visible to the older snapshot when that snapshot was created...
Either way, "Historic snapshot" isn't something I've worked with
before, so that goes onto my "figure out how it works" pile.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hello!
How would this deal with tuples not visible to the old snapshot?
Presumably we can assume they're newer than that snapshot (the old
snapshot didn't have it, but the new one does, so it's committed after
the old snapshot, making them newer), so that backend must have
inserted it into the index already, right?
Yes, exactly.
HeapTupleSatisfiesHistoricMVCC
That function has this comment marker:
"Only usable on tuples from catalog tables!"
Is that correct even for this?
Yeah, we just need HeapTupleSatisfiesVisibility (which calls
HeapTupleSatisfiesMVCC) instead.
Should this deal with any potential XID wraparound, too?
Yeah, it looks like we should handle such a case somehow.
Possible options here:
1) Skip vac_truncate_clog while CIC is running. In fact, I think it's
not that much worse than the current state - datfrozenxid is still
updated in the catalog and will be considered the next time
vac_update_datfrozenxid is called (the next VACUUM on any table).
2) Delay vac_truncate_clog while CIC is running.
In such a case, if it was skipped, we will need to re-run it using the
index-building backend later.
3) Wait for 64-bit xids :)
4) Any ideas?
In addition, for the first and second options, we need logic to cancel
the second phase in the case of ForceTransactionIdLimitUpdate.
But maybe I'm missing something, and the tuples may be frozen ignoring
the datfrozenxid values set (over some horizon calculated at runtime
based on the backends' xmin).
How does this behave when the newly inserted tuple's xmin gets frozen?
This would be allowed to happen during heap page pruning, afaik - no
rules that I know of which are against that - but it would create
issues where normal snapshot visibility rules would indicate it
visible to both snapshots regardless of whether it actually was
visible to the older snapshot when that snapshot was created...
Yes, good catch.
Assuming we have somehow prevented vac_truncate_clog from occurring
during CIC, we can leave frozen and potentially frozen
(xmin < frozenXID) tuples for the second phase.
So, the first phase processes items which are:
* not frozen
* xmin > frozenXID (may not be frozen)
* visible by the snapshot
The second phase processes items which are:
* frozen
* xmin < frozenXID (may be frozen)
* not in the index summary
* visible by the "old" snapshot
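As a quick sketch of that routing (frozenXID here is assumed to be the
cutoff captured at the start of the build; note that this scheme is
retracted a few messages below):

    /* Route a heap tuple to the appropriate phase. */
    if (!HeapTupleHeaderXminFrozen(tuple->t_data) &&
        TransactionIdFollows(HeapTupleHeaderGetRawXmin(tuple->t_data),
                             frozenXID))
    {
        /* cannot be frozen: first phase handles it, if visible to the
         * snapshot */
    }
    else
    {
        /* frozen or potentially frozen: left to the second phase, which
         * also consults the index summary and the "old" snapshot */
    }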
You might also think – why is the first stage needed at all? Just use
batch processing during initial index building?
Best regards,
Mikhail.
Yes, good catch.
Assuming we have somehow prevented vac_truncate_clog from occurring
during CIC, we can leave frozen and potentially frozen
(xmin<frozenXID) for the second phase.
Just realized that we can leave this to the first stage to improve efficiency.
Since the frozen ID is held in place, anything that can be frozen will be
visible in the first stage.
Hello.
I realized my last idea is invalid (because tuples are frozen using a
dynamically calculated horizon) - so, don't waste your time on it :)
I need to think a little bit more here.
Thanks,
Mikhail.
Hello!
It seems like the idea of an "old" snapshot is still a valid one.
Should this deal with any potential XID wraparound, too?
As far as I understand, in our case we are not affected by this in any way.
Vacuum on our table is not possible because of locking, so nothing
may be frozen (see below).
In the case of super-long index building, the transaction-id limits will
stop new connections using the current regular infrastructure, because it
is based on relation data (and not the actual xmin of backends).
How does this behave when the newly inserted tuple's xmin gets frozen?
This would be allowed to happen during heap page pruning, afaik - no
rules that I know of which are against that - but it would create
issues where normal snapshot visibility rules would indicate it
visible to both snapshots regardless of whether it actually was
visible to the older snapshot when that snapshot was created...
As far as I can see, heap_page_prune never freezes any tuples.
In the case of regular vacuum, it is used this way: call heap_page_prune,
then call heap_prepare_freeze_tuple, and then
heap_freeze_execute_prepared.
Merry Christmas,
Mikhail.
On Mon, 25 Dec 2023 at 15:12, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Correct, but there are changes being discussed where we would freeze
tuples during pruning as well [0], which would invalidate that
implementation detail. And, if I had to choose between improved
opportunistic freezing and improved R/CIC, I'd probably choose
improved freezing over R/CIC.
As an alternative, we _could_ keep track of concurrent index inserts
using a dummy index (with the same predicate) which only holds the
TIDs of the inserted tuples. We'd keep it as an empty index in phase
1, and every time we reset the visibility snapshot we now only need to
scan that index to know what tuples were concurrently inserted. This
should have a significantly lower IO overhead than repeated full index
bulkdelete scans for the new index in the second table scan phase of
R/CIC. However, in a worst case it could still require another
O(tablesize) of storage.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
[0]: /messages/by-id/CAAKRu_a+g2oe6aHJCbibFtNFiy2aib4E31X9QYJ_qKjxZmZQEg@mail.gmail.com
Hello!
Correct, but there are changes being discussed where we would freeze
tuples during pruning as well [0], which would invalidate that
implementation detail. And, if I had to choose between improved
opportunistic freezing and improved R/CIC, I'd probably choose
improved freezing over R/CIC.
As another option, we could extract a dedicated horizon value for
opportunistic freezing, and use some flags in the R/CIC backend to keep
it at the required value.
Best regards,
Michail.
Hello, Melanie!
Sorry to interrupt you, just a quick question.
Correct, but there are changes being discussed where we would freeze
tuples during pruning as well [0], which would invalidate that
implementation detail. And, if I had to choose between improved
opportunistic freezing and improved R/CIC, I'd probably choose
improved freezing over R/CIC.
Do you have any patches/threads related to that refactoring
(opportunistic freezing of tuples during pruning) [0]?
This may affect the idea of the current thread (the latest version of it
is mostly in [1]) - it may be required to disable such a feature for a
particular relation temporarily, or to affect the horizon used for pruning
(without holding xmin).
Just not sure - is it reasonable to start coding right now, or to wait for
some prune-freeze-related patch first?
[0]: /messages/by-id/CAAKRu_a+g2oe6aHJCbibFtNFiy2aib4E31X9QYJ_qKjxZmZQEg@mail.gmail.com
[1]: /messages/by-id/CANtu0ojRX=osoiXL9JJG6g6qOowXVbVYX+mDsN+2jmFVe=eG7w@mail.gmail.com
I just realised there is one issue with this design: We can't cheaply
reset the snapshot during the second table scan:
It is critically important that the second scan of R/CIC uses an index
contents summary (made with index_bulk_delete) that was created while
the current snapshot was already registered.
So, the "reset the snapshot every so often" trick cannot be applied in
phase 3 (the rescan), or we'd have to do an index_bulk_delete call
every time we reset the snapshot. Rescanning might be worth the cost
(e.g. when using BRIN), but that is very unlikely.
Hm, I think it is still possible. We could just manually recheck the
tuples we see against the snapshot currently used for the scan. If the "old"
snapshot can also see the tuple (HeapTupleSatisfiesHistoricMVCC), then
search for it in the index summary.
That's an interesting method.
How would this deal with tuples not visible to the old snapshot?
Presumably we can assume they're newer than that snapshot (the old
snapshot didn't have it, but the new one does, so it's committed after
the old snapshot, making them newer), so that backend must have
inserted it into the index already, right?
I made a draft of the patch, and this idea is not working.
The problem is generally the same:
* the reference snapshot sees tuple X
* the reference snapshot is used to create the index summary (but there is
no tuple X in the index summary)
* tuple X is updated to Y, creating a HOT chain
* we start the scan with a new temporary snapshot (it sees Y; X is too old for it)
* tuple X is pruned from the HOT chain because it is not protected by any snapshot
* we see tuple Y in the scan with the temporary snapshot
* it is not in the index summary - so, we need to check whether the
reference snapshot can see it
* there is no way to tell whether the reference snapshot was able
to see tuple X - because we would need the full HOT chain (including tuple X)
for that
Best regards,
Michail.
On Thu, 1 Feb 2024, 17:06 Michail Nikolaev, <michail.nikolaev@gmail.com> wrote:
I just realised there is one issue with this design: We can't cheaply
reset the snapshot during the second table scan:
It is critically important that the second scan of R/CIC uses an index
contents summary (made with index_bulk_delete) that was created while
the current snapshot was already registered.
I think the best way for this to work would be an index method that
exclusively stores TIDs, and in which we can quickly determine new
tuples, too. I was thinking about something like GIN's format, but
using (generation number, tid) instead of ([colno, colvalue], tid) as
key data for the internal trees, and it would be unlogged (because the
data wouldn't have to survive a crash). Then we could do something
like this for the second table scan phase:
0. index->indisready is set
[...]
1. Empty the "changelog index", resetting storage and the generation number.
2. Take index contents snapshot of new index, store this.
3. Loop until completion:
4a. Take visibility snapshot
4b. Update generation number of the changelog index, store this.
4c. Take an index snapshot of the "changelog index" for data up to the
current stored generation number. Not including, because we only need
to scan the part of the index that was added before we created our
visibility snapshot, i.e. TIDs labeled with generation numbers between
the previous iteration's generation number (incl.) and this
iteration's generation (excl.).
4d. Combine the current index snapshot with that of the "changelog"
index, and save this.
Note that this needs to take care to remove duplicates.
4e. Scan segment of table (using the combined index snapshot) until we
need to update our visibility snapshot or have scanned the whole
table.
This should give similar, if not the same, behaviour as what we
have when we RIC a table with several small indexes, without requiring
us to scan a full index of data several times.
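A rough pseudo-C sketch of that loop; every changelog_* call here is a
hypothetical API of the not-yet-existing TID-only "changelog" AM:

    changelog_reset(clog);                                /* step 1 */
    summary = index_contents_snapshot(newindex);          /* step 2 */
    prev_gen = 0;
    while (!table_fully_scanned)                          /* step 3 */
    {
        snap = RegisterSnapshot(GetLatestSnapshot());     /* step 4a */
        gen = changelog_advance_generation(clog);         /* step 4b */
        new_tids = changelog_scan(clog, prev_gen, gen);   /* step 4c */
        summary = merge_deduplicated(summary, new_tids);  /* step 4d */
        scan_table_segment(heap, summary, snap);          /* step 4e */
        UnregisterSnapshot(snap);
        prev_gen = gen;
    }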
Attempt at proving this approach's correctness:
In phase 3, after each step 4b:
All matching tuples of the table that are in the visibility snapshot:
* Were created before scan 1's snapshot, thus in the new index's snapshot, or
* Were created after scan 1's snapshot but before index->indisready,
thus not in the new index's snapshot, nor in the changelog index, or
* Were created after the index was set as indisready, and committed
before the previous iteration's visibility snapshot, thus in the
combined index snapshot, or
* Were created after the index was set as indisready, after the
previous visibility snapshot was taken, but before the current
visibility snapshot was taken, and thus definitely included in the
changelog index.
Because we hold a snapshot, no data in the table that we should see is
removed, so we don't have a chance of broken HOT chains.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hello!
I think the best way for this to work would be an index method that
exclusively stores TIDs, and of which we can quickly determine new
tuples, too. I was thinking about something like GIN's format, but
using (generation number, tid) instead of ([colno, colvalue], tid) as
key data for the internal trees, and would be unlogged (because the
data wouldn't have to survive a crash)
Yeah, this seems to be a reasonable approach, but there are some
doubts about it - it needs a new index type as well as unlogged
indexes to be introduced - and this may make the patch too invasive to be
merged. Also, some way to remove the index from the catalog in the case of
a crash may be required.
A few more thoughts:
* it is possible to go without a generation number - we may provide a
way to do some kind of fast index lookup (by TID) directly during the
second table scan phase.
* one more option is to maintain a Tuplesort (instead of an index)
with TIDs as the changelog, and merge it with the index snapshot after
taking a new visibility snapshot. But it is not clear how to share the same
Tuplesort with multiple inserting backends.
* crazy idea - what about doing the scan in the index we are
building? We have the tuple, so we have all the data indexed in the
index. We may try to do an index scan using that data to get all
tuples and find the one with our TID :) Yes, in some cases it may be
too costly because of the huge number of TIDs we need to scan, and also
btree copies the whole page even though we need a single item. But some
additional index method may help - it feels like something related to
uniqueness (but that exists only in btree anyway).
Thanks,
Mikhail.
One more idea - just forbid HOT pruning while the second phase is
running. It is not possible currently anyway because of the snapshot
being held. Possible enhancements:
* we may apply the restriction only to particular tables
* we may apply the restriction only to part of a table (the part not yet
scanned by R/CIC).
Yes, it is not an elegant solution - limited, and not reliable in terms of
architecture - but a simple one.
On Wed, 21 Feb 2024 at 00:33, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Hello!
I think the best way for this to work would be an index method that
exclusively stores TIDs, and of which we can quickly determine new
tuples, too. I was thinking about something like GIN's format, but
using (generation number, tid) instead of ([colno, colvalue], tid) as
key data for the internal trees, and would be unlogged (because the
data wouldn't have to survive a crash)
Yeah, this seems to be a reasonable approach, but there are some
doubts about it - it needs a new index type as well as unlogged
indexes to be introduced - and this may make the patch too invasive to be
merged.
I suppose so, though persistence is usually just to keep things
correct in case of crashes, and this "index" is only there to support
processes that don't expect to survive crashes.
Also, some way to remove the index from the catalog in case of
a crash may be required.
That's less of an issue though, we already accept that a crash during
CIC/RIC leaves unusable indexes around, so "needs more cleanup" is not
exactly a blocker.
A few more thoughts:
* it is possible to go without generation number - we may provide a
way to do some kind of fast index lookup (by TID) directly during the
second table scan phase.
While possible, I don't think this would be more performant than the
combination approach, and it could cost potentially much more random IO
when the table is being aggressively updated.
* one more option is to maintain a Tuplesort (instead of an index)
with TIDs as the changelog, and merge it with the index snapshot after
taking a new visibility snapshot. But it is not clear how to share the same
Tuplesort with multiple inserting backends.
Tuplesort requires the leader process to wait for concurrent backends
to finish their sort before it can start consuming their runs. This
would make it a very bad alternative to the "changelog index" as the
CIC process would require on-demand actions from concurrent backends
(flush of sort state). I'm not convinced that's somehow easier.
* crazy idea - what about doing the scan in the index we are
building? We have the tuple, so we have all the data indexed in the
index. We may try to do an index scan using that data to get all
tuples and find the one with our TID :)
We can't rely on that, because we have no guarantee we can find the
tuple quickly enough. Equality-based indexing is very much optional,
and so are TID-based checks (outside the current vacuum-related APIs),
so finding one TID can (and probably will) take O(indexsize) when the
tuple is not in the index, which is one reason for ambulkdelete() to
exist.
Kind regards,
Matthias van de Meent
On Wed, 21 Feb 2024 at 09:35, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
How do you suppose this would work differently from a long-lived
normal snapshot, which is how it works right now?
Would it be exclusively for that relation? How would this be
integrated with e.g. heap_page_prune_opt?
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hi!
How do you suppose this would work differently from a long-lived
normal snapshot, which is how it works right now?
The difference is in the ability to take a new visibility snapshot periodically
during the second phase, while rechecking the visibility of tuples according
to the "reference" snapshot (which is taken only once, like now).
It is the approach from (1), but with a workaround for the issues
caused by heap_page_prune_opt.
Would it be exclusively for that relation?
Yes, only for that affected relation. Other relations are unaffected.
How would this be integrated with e.g. heap_page_prune_opt?
Probably by some flag in RelationData, but I am not sure here yet.
If the idea looks sane, I could try to extend my POC - it should not be
too hard, likely (I already have tests to make sure it is
correct).
(1): /messages/by-id/CANtu0oijWPRGRpaRR_OvT2R5YALzscvcOTFh-=uZKUpNJmuZtw@mail.gmail.com
On Wed, 21 Feb 2024 at 12:37, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Hi!
How do you suppose this would work differently from a long-lived
normal snapshot, which is how it works right now?
The difference is in the ability to take a new visibility snapshot periodically
during the second phase, while rechecking the visibility of tuples according
to the "reference" snapshot (which is taken only once, like now).
It is the approach from (1), but with a workaround for the issues
caused by heap_page_prune_opt.
Would it be exclusively for that relation?
Yes, only for that affected relation. Other relations are unaffected.
I suppose this could work. We'd also need to be very sure that the
toast relation isn't cleaned up either: even though that's currently
DELETE+INSERT only and can't apply HOT, it would be an issue if we
couldn't find the TOAST data of a tuple that is deleted for everyone
else (but still visible to us).
Note that disabling cleanup for a relation will also disable cleanup
of tuple versions in that table that are not used for the R/CIC
snapshots, and that'd be an issue, too.
How would this be integrated with e.g. heap_page_prune_opt?
Probably by some flag in RelationData, but not sure here yet.
If the idea looks sane, I could try to extend my POC - it should be
not too hard, likely (I already have tests to make sure it is
correct).
I'm not a fan of this approach. Changing visibility and cleanup
semantics to only benefit R/CIC sounds like a pain to work with in
essentially all visibility-related code. I'd much rather have to deal
with another index AM, even if it takes more time: the changes in
semantics will be limited to a new plug in the index AM system and a
behaviour change in R/CIC, rather than behaviour that changes in all
visibility-checking code.
But regardless of second scan snapshots, I think we can worry about
that part at a later moment: The first scan phase is usually the most
expensive and takes the most time of all phases that hold snapshots,
and in the above discussion we agreed that we can already reduce the
time that a snapshot is held during that phase significantly. Sure, it
isn't great that we have to scan the table again with only a single
snapshot, but generally phase 2 doesn't have that much to do (except
when BRIN indexes are involved) so this is likely less of an issue.
And even if it is, we would still have reduced the number of
long-lived snapshots by half.
-Matthias
Hello!
I'm not a fan of this approach. Changing visibility and cleanup
semantics to only benefit R/CIC sounds like a pain to work with in
essentially all visibility-related code. I'd much rather have to deal
with another index AM, even if it takes more time: the changes in
semantics will be limited to a new plug in the index AM system and a
behaviour change in R/CIC, rather than behaviour that changes in all
visibility-checking code.
Technically, this does not affect the visibility logic, only the
cleanup semantics.
All visibility-related code remains untouched.
But yes, it is still an inelegant and a little strange-looking option.
At the same time, perhaps it can be made more respectable somehow - for
example, by adding to ComputeXidHorizonsResult, as a first-class citizen,
a list of relations for which cleanup is blocked.
But regardless of second scan snapshots, I think we can worry about
that part at a later moment: The first scan phase is usually the most
expensive and takes the most time of all phases that hold snapshots,
and in the above discussion we agreed that we can already reduce the
time that a snapshot is held during that phase significantly. Sure, it
isn't great that we have to scan the table again with only a single
snapshot, but generally phase 2 doesn't have that much to do (except
when BRIN indexes are involved) so this is likely less of an issue.
And even if it is, we would still have reduced the number of
long-lived snapshots by half.
Hmm, but it looks like we don't have the infrastructure to "update" the xmin
propagated to the horizon after the first snapshot in a transaction is taken.
One option I know of is to reuse the
d9d076222f5b94a85e0e318339cfc44b8f26022d (1) approach.
But if this is the case, then there is no point in re-taking the
snapshot during the first phase - just apply this "if" only for the
first phase - and you're done.
Do you know of any less hacky way? Or is this a fine way to go?
On Thu, 7 Mar 2024 at 19:37, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Hello!
I'm not a fan of this approach. Changing visibility and cleanup
semantics to only benefit R/CIC sounds like a pain to work with in
essentially all visibility-related code. I'd much rather have to deal
with another index AM, even if it takes more time: the changes in
semantics will be limited to a new plug in the index AM system and a
behaviour change in R/CIC, rather than behaviour that changes in all
visibility-checking code.
Technically, this does not affect the visibility logic, only the
cleanup semantics.
All visibility-related code remains untouched.
Yeah, correct. But it still needs to update the table relations'
information after finishing creating the indexes, which I'd rather not
have to do.
But yes, it is still an inelegant and a little strange-looking option.
At the same time, perhaps it can be made more respectable somehow - for
example, by adding to ComputeXidHorizonsResult, as a first-class citizen,
a list of relations for which cleanup is blocked.
Not sure what you mean here, but I don't think
ComputeXidHorizonsResult should have anything to do with actual
relations.
But regardless of second scan snapshots, I think we can worry about
that part at a later moment: The first scan phase is usually the most
expensive and takes the most time of all phases that hold snapshots,
and in the above discussion we agreed that we can already reduce the
time that a snapshot is held during that phase significantly. Sure, it
isn't great that we have to scan the table again with only a single
snapshot, but generally phase 2 doesn't have that much to do (except
when BRIN indexes are involved) so this is likely less of an issue.
And even if it is, we would still have reduced the number of
long-lived snapshots by half.
Hmm, but it looks like we don't have the infrastructure to "update" the xmin
propagated to the horizon after the first snapshot in a transaction is taken.
We can just release the current snapshot, and get a new one, right? I
mean, we don't actually use the transaction for much else than
visibility during the first scan, and I don't think there is a need
for an actual transaction ID until we're ready to mark the index entry
with indisready.
One option I know of is to reuse the
d9d076222f5b94a85e0e318339cfc44b8f26022d (1) approach.
But if this is the case, then there is no point in re-taking the
snapshot during the first phase - just apply this "if" only for the
first phase - and you're done.
Not a fan of that, as it is too sensitive to abuse. Note that
extensions will also have access to these tools, and I think we should
build a system here that's not easy to break, rather than one that is.
Do you know of any less hacky way? Or is this a fine way to go?
I suppose we could be resetting the snapshot every so often? Or use
multiple successive TID range scans with a new snapshot each?
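A sketch of the latter, using the table AM's TID-range scan API (the chunk
size and loop shape here are assumptions, not a worked-out design):

    /* Scan the heap in block ranges, with a fresh snapshot per range. */
    for (BlockNumber blk = 0; blk < nblocks; blk += BLOCKS_PER_CHUNK)
    {
        ItemPointerData mintid, maxtid;
        Snapshot    snap = RegisterSnapshot(GetLatestSnapshot());
        TableScanDesc scan;
        TupleTableSlot *slot = table_slot_create(heap, NULL);

        ItemPointerSet(&mintid, blk, 1);
        ItemPointerSet(&maxtid,
                       Min(blk + BLOCKS_PER_CHUNK, nblocks) - 1,
                       MaxOffsetNumber);
        scan = table_beginscan_tidrange(heap, snap, &mintid, &maxtid);
        while (table_scan_getnextslot_tidrange(scan, ForwardScanDirection,
                                               slot))
            ;                   /* feed the tuple to the index build here */
        table_endscan(scan);
        ExecDropSingleTupleTableSlot(slot);
        UnregisterSnapshot(snap);
    }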
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hello, Matthias!
We can just release the current snapshot, and get a new one, right? I
mean, we don't actually use the transaction for much else than
visibility during the first scan, and I don't think there is a need
for an actual transaction ID until we're ready to mark the index entry
with indisready.
I suppose we could be resetting the snapshot every so often? Or use
multiple successive TID range scans with a new snapshot each?
It seems like it is not so easy in this case. Because we still need to hold
the catalog snapshot's xmin, releasing the snapshot used for the scan does
not affect the xmin propagated to the horizon.
That's why d9d076222f5b94a85e0e318339cfc44b8f26022d (1) affects only the
data horizon, but not the catalog one.
So, in such a situation, we may:
1) restart the scan from scratch with some TID range, multiple times. But
such an approach feels too complex and error-prone to me.
2) split the horizons propagated by `MyProc` into a data-related xmin and a
catalog-related xmin, like `xmin` and `catalogXmin`. We may just mark
snapshots as affecting some of the horizons, or both. Such a change feels
easy to make, but it touches pretty core logic, so we probably need
someone's approval for such a proposal.
3) provide some less invasive (but more kludgy) way: add some kind of
process flag like `PROC_IN_SAFE_IC_XMIN` and a function like
`AdvanceIndexSafeXmin` which changes the way the backend affects horizon
calculation. In the case of `PROC_IN_SAFE_IC_XMIN`, `ComputeXidHorizons`
uses the value from `proc->safeIcXmin`, which is updated by
`AdvanceIndexSafeXmin` while switching scan snapshots.
So, with option 2 or 3, we may avoid holding back the data horizon during
the first-phase scan by resetting the scan snapshot every so often (and,
optionally, using `AdvanceIndexSafeXmin` in the case of the 3rd approach).
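A sketch of option 3; `PROC_IN_SAFE_IC_XMIN`, `safeIcXmin` and
`AdvanceIndexSafeXmin` are proposed names here, none of this exists yet:

    /* Let a safe-IC backend advertise a separately tracked xmin. */
    void
    AdvanceIndexSafeXmin(Snapshot new_snapshot)
    {
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
        MyProc->safeIcXmin = new_snapshot->xmin;  /* assumed new PGPROC field */
        LWLockRelease(ProcArrayLock);
    }

    /* ...and in ComputeXidHorizons(), for such backends: */
    if (statusFlags & PROC_IN_SAFE_IC_XMIN)
        xmin = proc->safeIcXmin;        /* instead of the regular xmin */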
The same will be possible for the second phase (validate).
We may apply the same "resetting the snapshot every so often" technique, but
there is still the issue of how we distinguish tuples which were
missed by the first-phase scan from those inserted into the index after the
visibility snapshot was taken.
So, I see two options here:
1) the approach with an additional index with some custom AM, proposed by you.
It looks correct and reliable, but feels complex to implement and
maintain. Also, it negatively affects the performance of table access (because
of the additional index) and of the validation scan (because we need to merge
the additional index content with the visibility snapshot).
2) one more, trickier approach.
We may add some boolean flag to `Relation` carrying the information that an
index build is in progress (`indexisbuilding`).
It may be easily calculated using `(index->indisready &&
!index->indisvalid)`. For a more reliable solution, we also need to somehow
check whether the backend/transaction building the index is still in
progress. Also, it is better to check whether the index is being built
concurrently in the "safe_index" way.
I think there is a not-too-complex and inexpensive way to do so, probably by
adding some flag to the index catalog record.
Once we have such a flag, we may "legally" prohibit `heap_page_prune_opt`
from affecting the relation by updating `GlobalVisHorizonKindForRel` like this:
    if (rel != NULL && rel->rd_indexvalid && rel->rd_indexisbuilding)
        return VISHORIZON_CATALOG;
So, in general it works this way:
* the backend building the index affects the catalog horizon as usual, but
the data horizon regularly moves forward during the scan. So, other
relations are processed by vacuum and `heap_page_prune_opt` without any
restrictions
* but our relation (with CIC in progress) is accessed by `heap_page_prune_opt`
(or any other vacuum-like mechanics) with the catalog horizon, to honor the
CIC work. Therefore, the validating scan may be sure that none of the HOT
chains will be truncated. Even regular vacuum can't affect it (but yes, it
can't anyway because of relation locking).
As a result, we may easily distinguish tuples missed by the first-phase scan,
just by testing them against the reference snapshot (which was used to take
the visibility snapshot).
So, for me, this approach feels non-kludgy enough, and safe and effective at
the same time.
I have a prototype of this approach, and it looks like it works (I have a
good test catching issues with index content for CIC).
[1]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793
Hello, Matthias!
I just realized there is a much simpler and safer way to deal with the
problem.
So, d9d076222f5b94a85e0e318339cfc44b8f26022d (1) had a bug because the scan
was not protected by a snapshot. At the same time, we want this snapshot to
affect not all relations, but only a subset of them. And there is
already a proper way to achieve that - different types of visibility
horizons!
So, to resolve the issue, we just need to create a separate horizon value
for situations such as building an index concurrently.
For now, let's name it `VISHORIZON_BUILD_INDEX_CONCURRENTLY`, for example.
By default, its value is equal to `VISHORIZON_DATA`. But in some cases it
"stops" moving forward while a concurrent index is being built, like this:
    h->create_index_concurrently_oldest_nonremovable =
        TransactionIdOlder(h->create_index_concurrently_oldest_nonremovable, xmin);
    if (!(statusFlags & PROC_IN_SAFE_IC))
        h->data_oldest_nonremovable =
            TransactionIdOlder(h->data_oldest_nonremovable, xmin);
`PROC_IN_SAFE_IC` marks the backend's xmin as ignored by `VISHORIZON_DATA`
but not by `VISHORIZON_BUILD_INDEX_CONCURRENTLY`.
Then, we need to use the appropriate horizon for relations which are
processed by `PROC_IN_SAFE_IC` backends. There are a few ways to do it; we
may start prototyping with `rd_indexisbuilding` from the previous message:
    static inline GlobalVisHorizonKind
    GlobalVisHorizonKindForRel(Relation rel)
    ........
        if (rel != NULL && rel->rd_indexvalid && rel->rd_indexisbuilding)
            return VISHORIZON_BUILD_INDEX_CONCURRENTLY;
There are a few more points that need to be considered:
* Does it move the horizon backwards?
It is allowed for the horizon to move backwards (as noted in
`ComputeXidHorizons`), but anyway - in that case the horizon for particular
relations just starts to lag behind the horizon for other relations.
The invariant is: `VISHORIZON_BUILD_INDEX_CONCURRENTLY` <=
`VISHORIZON_DATA` <= `VISHORIZON_CATALOG` <= `VISHORIZON_SHARED`.
* What about old cached versions of `Relation` objects without
`rd_indexisbuilding` set yet?
This is not a problem, because once the backend registers a new index, it
waits for all transactions without that knowledge to end
(`WaitForLockers`). So, new ones will also get the information about the new
horizon for that particular relation.
* What about TOAST?
To keep the TOAST horizon aligned with the relation whose index is being
built, we may do the following (as a first implementation iteration):
    else if (rel != NULL && ((rel->rd_indexvalid && rel->rd_indexisbuilding) ||
             IsToastRelation(rel)))
        return VISHORIZON_BUILD_INDEX_CONCURRENTLY;
For the normal case, `VISHORIZON_BUILD_INDEX_CONCURRENTLY` is equal to
`VISHORIZON_DATA` - nothing is changed at all. But while the concurrent
index is being built, the TOAST horizon is guaranteed to be aligned with its
parent relation's. And yes, it would be better to find an easy way to affect
only the TOAST relations related to the relation with the index build in
progress.
The new horizon adds some complexity, but not too much, in my opinion. I am
pretty sure it is worth doing, because the ability to rebuild indexes
without performance degradation is an extremely useful feature.
Things to be improved:
* a better way to track relations with concurrent indexes being built (with
mechanics to understand that an index build has failed)
* a better way to affect only the TOAST tables related to a concurrent
index build
* better naming
A patch prototype is attached.
Also, maybe it is worth committing the test separately - it is based on
Andrey Borodin's work (2). The test fails reliably in the case of an
incorrect implementation.
[1]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793
[2]: https://github.com/x4m/postgres_g/commit/d0651e7d0d14862d5a4dac076355
Attachments:
v1-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch (text/x-patch)
From b463dd180ab5820ac5b4144ff22622b2a6340b09 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Mon, 6 May 2024 01:08:49 +0200
Subject: [PATCH v1] WIP: fix d9d076222f5b "VACUUM: ignore indexing operations
with CONCURRENTLY" which was reverted by e28bb8851969.
Introduce new type of visibility horizon to be used for relation with concurrently build indexes (in the case of "safe" index).
---
src/backend/catalog/index.c | 3 +
src/backend/storage/ipc/procarray.c | 72 ++++++++++-
src/backend/utils/cache/relcache.c | 6 +
src/bin/pg_amcheck/t/006_concurrently.pl | 155 +++++++++++++++++++++++
src/include/utils/rel.h | 1 +
5 files changed, 231 insertions(+), 6 deletions(-)
create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a8568c55c..6ad9254d49 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3320,6 +3320,9 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
/* Open and lock the parent heap relation */
heapRelation = table_open(heapId, ShareUpdateExclusiveLock);
+ /* Load information about the building indexes */
+ RelationGetIndexList(heapRelation);
+ Assert(heapRelation->rd_indexisbuilding);
/*
* Switch to the table owner's userid, so that any index functions are run
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 1a83c4220b..a2fe173bb6 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -236,6 +236,12 @@ typedef struct ComputeXidHorizonsResult
*/
TransactionId data_oldest_nonremovable;
+ /*
+ * Oldest xid for which deleted tuples need to be retained in normal user
+ * defined tables with index building in progress.
+ */
+ TransactionId create_index_concurrently_oldest_nonremovable;
+
/*
* Oldest xid for which deleted tuples need to be retained in this
* session's temporary tables.
@@ -251,6 +257,7 @@ typedef enum GlobalVisHorizonKind
VISHORIZON_SHARED,
VISHORIZON_CATALOG,
VISHORIZON_DATA,
+ VISHORIZON_BUILD_INDEX_CONCURRENTLY,
VISHORIZON_TEMP,
} GlobalVisHorizonKind;
@@ -297,6 +304,7 @@ static TransactionId standbySnapshotPendingXmin;
static GlobalVisState GlobalVisSharedRels;
static GlobalVisState GlobalVisCatalogRels;
static GlobalVisState GlobalVisDataRels;
+static GlobalVisState GlobalVisBuildIndexConcurrentlyRels;
static GlobalVisState GlobalVisTempRels;
/*
@@ -1727,9 +1735,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
bool in_recovery = RecoveryInProgress();
TransactionId *other_xids = ProcGlobal->xids;
- /* inferred after ProcArrayLock is released */
- h->catalog_oldest_nonremovable = InvalidTransactionId;
-
LWLockAcquire(ProcArrayLock, LW_SHARED);
h->latest_completed = TransamVariables->latestCompletedXid;
@@ -1749,7 +1754,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->oldest_considered_running = initial;
h->shared_oldest_nonremovable = initial;
+ h->catalog_oldest_nonremovable = initial;
h->data_oldest_nonremovable = initial;
+ h->create_index_concurrently_oldest_nonremovable = initial;
/*
* Only modifications made by this backend affect the horizon for
@@ -1847,11 +1854,28 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
(statusFlags & PROC_AFFECTS_ALL_HORIZONS) ||
in_recovery)
{
- h->data_oldest_nonremovable =
- TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+ h->create_index_concurrently_oldest_nonremovable =
+ TransactionIdOlder(h->create_index_concurrently_oldest_nonremovable, xmin);
+
+ if (!(statusFlags & PROC_IN_SAFE_IC))
+ h->data_oldest_nonremovable =
+ TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+
+ /* Catalog tables need to consider all backends in this db */
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
+
}
}
+ /* catalog horizon should never be later than data */
+ Assert(TransactionIdPrecedesOrEquals(h->catalog_oldest_nonremovable,
+ h->data_oldest_nonremovable));
+
+ /* data horizon should never be later than index building horizon */
+ Assert(TransactionIdPrecedesOrEquals(h->create_index_concurrently_oldest_nonremovable,
+ h->data_oldest_nonremovable));
+
/*
* If in recovery fetch oldest xid in KnownAssignedXids, will be applied
* after lock is released.
@@ -1873,6 +1897,10 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+ h->create_index_concurrently_oldest_nonremovable =
+ TransactionIdOlder(h->create_index_concurrently_oldest_nonremovable, kaxmin);
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable, kaxmin);
/* temp relations cannot be accessed in recovery */
}
@@ -1880,6 +1908,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->shared_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+ h->create_index_concurrently_oldest_nonremovable));
/*
* Check whether there are replication slots requiring an older xmin.
@@ -1888,6 +1918,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
+ h->create_index_concurrently_oldest_nonremovable =
+ TransactionIdOlder(h->create_index_concurrently_oldest_nonremovable, h->slot_xmin);
/*
* The only difference between catalog / data horizons is that the slot's
@@ -1900,7 +1932,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->shared_oldest_nonremovable =
TransactionIdOlder(h->shared_oldest_nonremovable,
h->slot_catalog_xmin);
- h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable,
+ h->slot_xmin);
h->catalog_oldest_nonremovable =
TransactionIdOlder(h->catalog_oldest_nonremovable,
h->slot_catalog_xmin);
@@ -1918,6 +1952,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->oldest_considered_running =
TransactionIdOlder(h->oldest_considered_running,
h->data_oldest_nonremovable);
+ h->oldest_considered_running =
+ TransactionIdOlder(h->oldest_considered_running,
+ h->create_index_concurrently_oldest_nonremovable);
/*
* shared horizons have to be at least as old as the oldest visible in
@@ -1925,6 +1962,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
*/
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+ h->create_index_concurrently_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->catalog_oldest_nonremovable));
@@ -1938,6 +1977,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->catalog_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+ h->create_index_concurrently_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->temp_oldest_nonremovable));
Assert(!TransactionIdIsValid(h->slot_xmin) ||
@@ -1972,6 +2013,8 @@ GlobalVisHorizonKindForRel(Relation rel)
else if (IsCatalogRelation(rel) ||
RelationIsAccessibleInLogicalDecoding(rel))
return VISHORIZON_CATALOG;
+ else if (rel != NULL && ((rel->rd_indexvalid && rel->rd_indexisbuilding) || IsToastRelation(rel)))
+ return VISHORIZON_BUILD_INDEX_CONCURRENTLY;
else if (!RELATION_IS_LOCAL(rel))
return VISHORIZON_DATA;
else
@@ -2004,6 +2047,8 @@ GetOldestNonRemovableTransactionId(Relation rel)
return horizons.catalog_oldest_nonremovable;
case VISHORIZON_DATA:
return horizons.data_oldest_nonremovable;
+ case VISHORIZON_BUILD_INDEX_CONCURRENTLY:
+ return horizons.create_index_concurrently_oldest_nonremovable;
case VISHORIZON_TEMP:
return horizons.temp_oldest_nonremovable;
}
@@ -2454,6 +2499,9 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(def_vis_fxid_data,
GlobalVisDataRels.definitely_needed);
+ GlobalVisBuildIndexConcurrentlyRels.definitely_needed =
+ FullTransactionIdNewer(def_vis_fxid_data,
+ GlobalVisBuildIndexConcurrentlyRels.definitely_needed);
/* See temp_oldest_nonremovable computation in ComputeXidHorizons() */
if (TransactionIdIsNormal(myxid))
GlobalVisTempRels.definitely_needed =
@@ -2478,6 +2526,9 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisCatalogRels.maybe_needed =
FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
oldestfxid);
+ GlobalVisBuildIndexConcurrentlyRels.maybe_needed =
+ FullTransactionIdNewer(GlobalVisBuildIndexConcurrentlyRels.maybe_needed,
+ oldestfxid);
GlobalVisDataRels.maybe_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
oldestfxid);
@@ -4106,6 +4157,9 @@ GlobalVisTestFor(Relation rel)
case VISHORIZON_DATA:
state = &GlobalVisDataRels;
break;
+ case VISHORIZON_BUILD_INDEX_CONCURRENTLY:
+ state = &GlobalVisBuildIndexConcurrentlyRels;
+ break;
case VISHORIZON_TEMP:
state = &GlobalVisTempRels;
break;
@@ -4158,6 +4212,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.maybe_needed =
FullXidRelativeTo(horizons->latest_completed,
horizons->data_oldest_nonremovable);
+ GlobalVisBuildIndexConcurrentlyRels.maybe_needed =
+ FullXidRelativeTo(horizons->latest_completed,
+ horizons->create_index_concurrently_oldest_nonremovable);
GlobalVisTempRels.maybe_needed =
FullXidRelativeTo(horizons->latest_completed,
horizons->temp_oldest_nonremovable);
@@ -4176,6 +4233,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
GlobalVisDataRels.definitely_needed);
+ GlobalVisBuildIndexConcurrentlyRels.definitely_needed =
+ FullTransactionIdNewer(GlobalVisBuildIndexConcurrentlyRels.maybe_needed,
+ GlobalVisBuildIndexConcurrentlyRels.definitely_needed);
GlobalVisTempRels.definitely_needed = GlobalVisTempRels.maybe_needed;
ComputeXidHorizonsResultLastXmin = RecentXmin;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 262c9878dd..677ba61205 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4769,6 +4769,7 @@ RelationGetIndexList(Relation relation)
Oid pkeyIndex = InvalidOid;
Oid candidateIndex = InvalidOid;
bool pkdeferrable = false;
+ bool indexisbuilding = false;
MemoryContext oldcxt;
/* Quick exit if we already computed the list. */
@@ -4809,6 +4810,10 @@ RelationGetIndexList(Relation relation)
/* add index's OID to result list */
result = lappend_oid(result, index->indexrelid);
+ /* consider an index as being built if it is ready but not yet valid */
+ if (index->indisready && !index->indisvalid)
+ indexisbuilding = true;
+
/*
* Non-unique or predicate indexes aren't interesting for either oid
* indexes or replication identity indexes, so don't check them.
@@ -4869,6 +4874,7 @@ RelationGetIndexList(Relation relation)
relation->rd_indexlist = list_copy(result);
relation->rd_pkindex = pkeyIndex;
relation->rd_ispkdeferrable = pkdeferrable;
+ relation->rd_indexisbuilding = indexisbuilding;
if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex) && !pkdeferrable)
relation->rd_replidindex = pkeyIndex;
else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 0000000000..7b8afeead5
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($@ || $windows_os)
+{
+ plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key, $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+ c1 money default 0,c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+ # fork returned undef, so unsuccessful
+ die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+ $node->pgbench(
+ '--no-vacuum --client=5 --transactions=25000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent INSERTs, UPDATES and RC',
+ {
+ '002_pgbench_concurrent_transaction_inserts' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ ),
+ # Ensure some HOT updates happen
+ '002_pgbench_concurrent_transaction_updates' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ )
+ });
+
+ if ($child->is_passing()) {
+ shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+ } else {
+ shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+ }
+
+ sleep(1);
+} else {
+ my $pg_bench_fork_flag;
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+ subtest 'reindex run subtest' => sub {
+ is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+ my %psql = (stdin => '', stdout => '', stderr => '');
+ $psql{run} = IPC::Run::start(
+ [ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+ '<',
+ \$psql{stdin},
+ '>',
+ \$psql{stdout},
+ '2>',
+ \$psql{stderr},
+ $psql_timeout);
+
+ my ($result, $stdout, $stderr);
+ while (1)
+ {
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+ is($result, '0', 'REINDEX is correct');
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', true, true);));
+ is($result, '0', 'bt_index_parent_check is correct');
+ if ($result)
+ {
+ diag($stderr);
+ }
+
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ last if $pg_bench_fork_flag ne "wait";
+ }
+
+ # explicitly shut down psql instances gracefully
+ $psql{stdin} .= "\\q\n";
+ $psql{run}->finish;
+
+ is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+ };
+
+
+ $child->finalize();
+ $child->summary();
+ $node->stop;
+ done_testing();
+}
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..a9e2d1beab 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -152,6 +152,7 @@ typedef struct RelationData
List *rd_indexlist; /* list of OIDs of indexes on relation */
Oid rd_pkindex; /* OID of (deferrable?) primary key, if any */
bool rd_ispkdeferrable; /* is rd_pkindex a deferrable PK? */
+ bool rd_indexisbuilding; /* is an index build in progress for the relation? */
Oid rd_replidindex; /* OID of replica identity index, if any */
/* data managed by RelationGetStatExtList: */
--
2.34.1
Hello, Matthias and others!
Updated WIP attached.
Changes are:
* renaming; the new names feel better to me
* a more reliable approach in `GlobalVisHorizonKindForRel`: make sure we
have not missed `rd_safeindexconcurrentlybuilding` by calling
`RelationGetIndexList` if required
* an optimization to avoid any additional `RelationGetIndexList` calls
when no safe concurrent index builds are in progress (see the sketch below)
* TOAST moved to TODO, since it looks out of scope - but I am not sure
yet and need to dive deeper
TODO:
* TOAST
* docs and comments
* make sure non-data tables are not affected
* per-database scope for the optimization
* handle index build errors correctly in the optimization code
* more tests: CREATE INDEX, multiple REINDEXes, multiple tables
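The horizon-selection fast path, as a condensed sketch (it uses only names
from the attached patch, but is an illustration rather than the exact diff):

/* Inside GlobalVisHorizonKindForRel(), for a non-local, non-catalog rel */
if (!rel->rd_indexvalid)
{
    /*
     * Cheap shared-memory check first: skip RelationGetIndexList()
     * entirely when no safe concurrent index build is running anywhere
     * in the cluster.
     */
    if (!IsAnySafeIndexBuildsConcurrently())
        return VISHORIZON_DATA;

    /* Populate rd_safeindexconcurrentlybuilding before inspecting it */
    RelationGetIndexList(rel);
}
return rel->rd_safeindexconcurrentlybuilding ?
    VISHORIZON_DATA_SAFE_IC : VISHORIZON_DATA;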
Thanks,
Michail.
Attachments:
v2-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch (text/x-patch)
From 63677046efc9b6a1d93f9248c6d9dce14a945a42 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Tue, 7 May 2024 14:24:09 +0200
Subject: [PATCH v2] WIP: fix d9d076222f5b "VACUUM: ignore indexing operations
with CONCURRENTLY" which was reverted by e28bb8851969.
The issue was caused by the absence of any snapshot that actually protected the data in the relation, as required to build the index correctly.
Introduce a new type of visibility horizon to be used for relations with concurrently built indexes (in the case of a "safe" index).
`GlobalVisHorizonKindForRel` may now dynamically decide which horizon to use, based on the information about safe indexes being built concurrently.
To reduce the performance impact, the counter of concurrently built indexes is kept in shared memory.
---
src/backend/catalog/index.c | 36 ++++++
src/backend/commands/indexcmds.c | 20 +++
src/backend/storage/ipc/ipci.c | 2 +
src/backend/storage/ipc/procarray.c | 88 ++++++++++++-
src/backend/utils/cache/relcache.c | 11 ++
src/bin/pg_amcheck/t/006_concurrently.pl | 155 +++++++++++++++++++++++
src/include/catalog/index.h | 5 +
src/include/utils/rel.h | 1 +
8 files changed, 311 insertions(+), 7 deletions(-)
create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a8568c55c..3caa2bab12 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -97,6 +97,11 @@ typedef struct
Oid pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
} SerializedReindexState;
+typedef struct {
+ pg_atomic_uint32 numSafeConcurrentlyBuiltIndexes;
+} SafeICSharedState;
+static SafeICSharedState *SafeICStateShmem;
+
/* non-export function prototypes */
static bool relationHasPrimaryKey(Relation rel);
static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
@@ -176,6 +181,37 @@ relationHasPrimaryKey(Relation rel)
return result;
}
+
+void SafeICStateShmemInit(void)
+{
+ bool found;
+
+ SafeICStateShmem = (SafeICSharedState *)
+ ShmemInitStruct("Safe Concurrently Build Indexes",
+ sizeof(SafeICSharedState),
+ &found);
+
+ if (!IsUnderPostmaster)
+ {
+ Assert(!found);
+ pg_atomic_init_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 0);
+ } else
+ Assert(found);
+}
+
+void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment)
+{
+ if (increment)
+ pg_atomic_fetch_add_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+ else
+ pg_atomic_fetch_sub_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+}
+
+bool IsAnySafeIndexBuildsConcurrently(void)
+{
+ return pg_atomic_read_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes) > 0;
+}
+
/*
* index_check_primary_key
* Apply special checks needed before creating a PRIMARY KEY index
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487..663450ba20 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1636,6 +1636,8 @@ DefineIndex(Oid tableId,
* hold lock on the parent table. This might need to change later.
*/
LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
+ if (safe_index && concurrent)
+ UpdateNumSafeConcurrentlyBuiltIndexes(true);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -1804,7 +1806,15 @@ DefineIndex(Oid tableId,
* to replan; so relcache flush on the index itself was sufficient.)
*/
CacheInvalidateRelcacheByRelid(heaprelid.relId);
+ /* Commit index as valid before reducing the counter of safe concurrently built indexes */
+ CommitTransactionCommand();
+ Assert(concurrent);
+ if (safe_index)
+ UpdateNumSafeConcurrentlyBuiltIndexes(false);
+
+ /* Start a new transaction to finish process properly */
+ StartTransactionCommand();
/*
* Last thing to do is release the session-level lock on the parent table.
*/
@@ -3902,6 +3912,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
indexRel->rd_indpred == NIL);
idx->tableId = RelationGetRelid(heapRel);
idx->amId = indexRel->rd_rel->relam;
+ if (idx->safe)
+ UpdateNumSafeConcurrentlyBuiltIndexes(true);
/* This function shouldn't be called for temporary relations. */
if (indexRel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
@@ -4345,6 +4357,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
UnlockRelationIdForSession(lockrelid, ShareUpdateExclusiveLock);
}
+ /* Now we may clear the safe index build flags */
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+ if (newidx->safe)
+ UpdateNumSafeConcurrentlyBuiltIndexes(false);
+ }
+
/* Start a new transaction to finish process properly */
StartTransactionCommand();
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..260a634f1b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "catalog/index.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -357,6 +358,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventExtensionShmemInit();
InjectionPointShmemInit();
+ SafeICStateShmemInit();
}
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 1a83c4220b..de3b3a5c0c 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -53,6 +53,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
+#include "catalog/index.h"
#include "catalog/pg_authid.h"
#include "commands/dbcommands.h"
#include "miscadmin.h"
@@ -236,6 +237,12 @@ typedef struct ComputeXidHorizonsResult
*/
TransactionId data_oldest_nonremovable;
+ /*
+ * Oldest xid for which deleted tuples need to be retained in normal user
+ * defined tables with an index build in progress by a process with PROC_IN_SAFE_IC.
+ */
+ TransactionId data_safe_ic_oldest_nonremovable;
+
/*
* Oldest xid for which deleted tuples need to be retained in this
* session's temporary tables.
@@ -251,6 +258,7 @@ typedef enum GlobalVisHorizonKind
VISHORIZON_SHARED,
VISHORIZON_CATALOG,
VISHORIZON_DATA,
+ VISHORIZON_DATA_SAFE_IC,
VISHORIZON_TEMP,
} GlobalVisHorizonKind;
@@ -297,6 +305,7 @@ static TransactionId standbySnapshotPendingXmin;
static GlobalVisState GlobalVisSharedRels;
static GlobalVisState GlobalVisCatalogRels;
static GlobalVisState GlobalVisDataRels;
+static GlobalVisState GlobalVisDataSafeIcRels;
static GlobalVisState GlobalVisTempRels;
/*
@@ -1727,9 +1736,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
bool in_recovery = RecoveryInProgress();
TransactionId *other_xids = ProcGlobal->xids;
- /* inferred after ProcArrayLock is released */
- h->catalog_oldest_nonremovable = InvalidTransactionId;
-
LWLockAcquire(ProcArrayLock, LW_SHARED);
h->latest_completed = TransamVariables->latestCompletedXid;
@@ -1749,7 +1755,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->oldest_considered_running = initial;
h->shared_oldest_nonremovable = initial;
+ h->catalog_oldest_nonremovable = initial;
h->data_oldest_nonremovable = initial;
+ h->data_safe_ic_oldest_nonremovable = initial;
/*
* Only modifications made by this backend affect the horizon for
@@ -1847,11 +1855,28 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
(statusFlags & PROC_AFFECTS_ALL_HORIZONS) ||
in_recovery)
{
- h->data_oldest_nonremovable =
- TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+ h->data_safe_ic_oldest_nonremovable =
+ TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, xmin);
+
+ if (!(statusFlags & PROC_IN_SAFE_IC))
+ h->data_oldest_nonremovable =
+ TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+
+ /* Catalog tables need to consider all backends in this db */
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
+
}
}
+ /* catalog horizon should never be later than data */
+ Assert(TransactionIdPrecedesOrEquals(h->catalog_oldest_nonremovable,
+ h->data_oldest_nonremovable));
+
+ /* data horizon should never be later than safe index building horizon */
+ Assert(TransactionIdPrecedesOrEquals(h->data_safe_ic_oldest_nonremovable,
+ h->data_oldest_nonremovable));
+
/*
* If in recovery fetch oldest xid in KnownAssignedXids, will be applied
* after lock is released.
@@ -1873,6 +1898,10 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+ h->data_safe_ic_oldest_nonremovable =
+ TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, kaxmin);
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable, kaxmin);
/* temp relations cannot be accessed in recovery */
}
@@ -1880,6 +1909,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->shared_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+ h->data_safe_ic_oldest_nonremovable));
/*
* Check whether there are replication slots requiring an older xmin.
@@ -1888,6 +1919,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
+ h->data_safe_ic_oldest_nonremovable =
+ TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, h->slot_xmin);
/*
* The only difference between catalog / data horizons is that the slot's
@@ -1900,7 +1933,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->shared_oldest_nonremovable =
TransactionIdOlder(h->shared_oldest_nonremovable,
h->slot_catalog_xmin);
- h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable,
+ h->slot_xmin);
h->catalog_oldest_nonremovable =
TransactionIdOlder(h->catalog_oldest_nonremovable,
h->slot_catalog_xmin);
@@ -1918,6 +1953,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->oldest_considered_running =
TransactionIdOlder(h->oldest_considered_running,
h->data_oldest_nonremovable);
+ h->oldest_considered_running =
+ TransactionIdOlder(h->oldest_considered_running,
+ h->data_safe_ic_oldest_nonremovable);
/*
* shared horizons have to be at least as old as the oldest visible in
@@ -1925,6 +1963,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
*/
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+ h->data_safe_ic_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->catalog_oldest_nonremovable));
@@ -1938,6 +1978,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->catalog_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+ h->data_safe_ic_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->temp_oldest_nonremovable));
Assert(!TransactionIdIsValid(h->slot_xmin) ||
@@ -1973,7 +2015,22 @@ GlobalVisHorizonKindForRel(Relation rel)
RelationIsAccessibleInLogicalDecoding(rel))
return VISHORIZON_CATALOG;
else if (!RELATION_IS_LOCAL(rel))
- return VISHORIZON_DATA;
+ {
+ /* TODO: do we need to do something special about TOAST? */
+ if (!rel->rd_indexvalid)
+ {
+ /* Skip loading indexes if we know there are no safe concurrent index builds in the cluster */
+ if (IsAnySafeIndexBuildsConcurrently())
+ {
+ RelationGetIndexList(rel);
+ Assert(rel->rd_indexvalid);
+
+ if (rel->rd_safeindexconcurrentlybuilding)
+ return VISHORIZON_DATA_SAFE_IC;
+ }
+ return VISHORIZON_DATA;
+ }
+ }
else
return VISHORIZON_TEMP;
}
@@ -2004,6 +2061,8 @@ GetOldestNonRemovableTransactionId(Relation rel)
return horizons.catalog_oldest_nonremovable;
case VISHORIZON_DATA:
return horizons.data_oldest_nonremovable;
+ case VISHORIZON_DATA_SAFE_IC:
+ return horizons.data_safe_ic_oldest_nonremovable;
case VISHORIZON_TEMP:
return horizons.temp_oldest_nonremovable;
}
@@ -2454,6 +2513,9 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(def_vis_fxid_data,
GlobalVisDataRels.definitely_needed);
+ GlobalVisDataSafeIcRels.definitely_needed =
+ FullTransactionIdNewer(def_vis_fxid_data,
+ GlobalVisDataSafeIcRels.definitely_needed);
/* See temp_oldest_nonremovable computation in ComputeXidHorizons() */
if (TransactionIdIsNormal(myxid))
GlobalVisTempRels.definitely_needed =
@@ -2478,6 +2540,9 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisCatalogRels.maybe_needed =
FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
oldestfxid);
+ GlobalVisDataSafeIcRels.maybe_needed =
+ FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+ oldestfxid);
GlobalVisDataRels.maybe_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
oldestfxid);
@@ -4106,6 +4171,9 @@ GlobalVisTestFor(Relation rel)
case VISHORIZON_DATA:
state = &GlobalVisDataRels;
break;
+ case VISHORIZON_DATA_SAFE_IC:
+ state = &GlobalVisDataSafeIcRels;
+ break;
case VISHORIZON_TEMP:
state = &GlobalVisTempRels;
break;
@@ -4158,6 +4226,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.maybe_needed =
FullXidRelativeTo(horizons->latest_completed,
horizons->data_oldest_nonremovable);
+ GlobalVisDataSafeIcRels.maybe_needed =
+ FullXidRelativeTo(horizons->latest_completed,
+ horizons->data_safe_ic_oldest_nonremovable);
GlobalVisTempRels.maybe_needed =
FullXidRelativeTo(horizons->latest_completed,
horizons->temp_oldest_nonremovable);
@@ -4176,6 +4247,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
GlobalVisDataRels.definitely_needed);
+ GlobalVisDataSafeIcRels.definitely_needed =
+ FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+ GlobalVisDataSafeIcRels.definitely_needed);
GlobalVisTempRels.definitely_needed = GlobalVisTempRels.maybe_needed;
ComputeXidHorizonsResultLastXmin = RecentXmin;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 262c9878dd..21e8521ab8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -41,6 +41,7 @@
#include "access/xact.h"
#include "catalog/binary_upgrade.h"
#include "catalog/catalog.h"
+#include "catalog/index.h"
#include "catalog/indexing.h"
#include "catalog/namespace.h"
#include "catalog/partition.h"
@@ -4769,6 +4770,7 @@ RelationGetIndexList(Relation relation)
Oid pkeyIndex = InvalidOid;
Oid candidateIndex = InvalidOid;
bool pkdeferrable = false;
+ bool safeindexconcurrentlybuilding = false;
MemoryContext oldcxt;
/* Quick exit if we already computed the list. */
@@ -4809,6 +4811,14 @@ RelationGetIndexList(Relation relation)
/* add index's OID to result list */
result = lappend_oid(result, index->indexrelid);
+ /*
+ * Consider an index as being built if it is ready but not yet valid.
+ * Also, we must deal only with indexes that are built using the
+ * concurrency-safe mode.
+ */
+ if (index->indisready && !index->indisvalid)
+ safeindexconcurrentlybuilding |= IsAnySafeIndexBuildsConcurrently();
+
/*
* Non-unique or predicate indexes aren't interesting for either oid
* indexes or replication identity indexes, so don't check them.
@@ -4869,6 +4879,7 @@ RelationGetIndexList(Relation relation)
relation->rd_indexlist = list_copy(result);
relation->rd_pkindex = pkeyIndex;
relation->rd_ispkdeferrable = pkdeferrable;
+ relation->rd_safeindexconcurrentlybuilding = safeindexconcurrentlybuilding;
if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex) && !pkdeferrable)
relation->rd_replidindex = pkeyIndex;
else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 0000000000..7b8afeead5
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($@ || $windows_os)
+{
+ plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key, $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+ c1 money default 0,c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+ # fork returned undef, so unsuccessful
+ die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+ $node->pgbench(
+ '--no-vacuum --client=5 --transactions=25000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent INSERTs, UPDATES and RC',
+ {
+ '002_pgbench_concurrent_transaction_inserts' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ ),
+ # Ensure some HOT updates happen
+ '002_pgbench_concurrent_transaction_updates' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ )
+ });
+
+ if ($child->is_passing()) {
+ shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+ } else {
+ shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+ }
+
+ sleep(1);
+} else {
+ my $pg_bench_fork_flag;
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+ subtest 'reindex run subtest' => sub {
+ is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+ my %psql = (stdin => '', stdout => '', stderr => '');
+ $psql{run} = IPC::Run::start(
+ [ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+ '<',
+ \$psql{stdin},
+ '>',
+ \$psql{stdout},
+ '2>',
+ \$psql{stderr},
+ $psql_timeout);
+
+ my ($result, $stdout, $stderr);
+ while (1)
+ {
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+ is($result, '0', 'REINDEX is correct');
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', true, true);));
+ is($result, '0', 'bt_index_parent_check is correct');
+ if ($result)
+ {
+ diag($stderr);
+ }
+
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ last if $pg_bench_fork_flag ne "wait";
+ }
+
+ # explicitly shut down psql instances gracefully
+ $psql{stdin} .= "\\q\n";
+ $psql{run}->finish;
+
+ is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+ };
+
+
+ $child->finalize();
+ $child->summary();
+ $node->stop;
+ done_testing();
+}
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..cac413e5eb 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,11 @@ extern void RestoreReindexState(const void *reindexstate);
extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
+extern void SafeICStateShmemInit(void);
+/* TODO: scope this by relation or database */
+extern void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment);
+extern bool IsAnySafeIndexBuildsConcurrently(void);
+
/*
* itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..e3c7899203 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -152,6 +152,7 @@ typedef struct RelationData
List *rd_indexlist; /* list of OIDs of indexes on relation */
Oid rd_pkindex; /* OID of (deferrable?) primary key, if any */
bool rd_ispkdeferrable; /* is rd_pkindex a deferrable PK? */
+ bool rd_safeindexconcurrentlybuilding; /* is a safe concurrent index build in progress for the relation? */
Oid rd_replidindex; /* OID of replica identity index, if any */
/* data managed by RelationGetStatExtList: */
--
2.34.1
Hi again!
I made an error in the `GlobalVisHorizonKindForRel` logic, and it was
caught by a new test.
Fixed version attached.
Attachments:
v3-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch (text/x-patch)
From 9a8ea366f6d2d144979e825c4ac0bdd2937bf7c1 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Tue, 7 May 2024 22:10:56 +0200
Subject: [PATCH v3] WIP: fix d9d076222f5b "VACUUM: ignore indexing operations
with CONCURRENTLY" which was reverted by e28bb8851969.
The issue was caused by the absence of any snapshot that actually protected the data in the relation, as required to build the index correctly.
Introduce a new type of visibility horizon to be used for relations with concurrently built indexes (in the case of a "safe" index).
`GlobalVisHorizonKindForRel` may now dynamically decide which horizon to use, based on the information about safe indexes being built concurrently.
To reduce the performance impact, the counter of concurrently built indexes is kept in shared memory.
---
src/backend/catalog/index.c | 36 ++++++
src/backend/commands/indexcmds.c | 20 +++
src/backend/storage/ipc/ipci.c | 2 +
src/backend/storage/ipc/procarray.c | 85 ++++++++++++-
src/backend/utils/cache/relcache.c | 11 ++
src/bin/pg_amcheck/t/006_concurrently.pl | 155 +++++++++++++++++++++++
src/include/catalog/index.h | 5 +
src/include/utils/rel.h | 1 +
8 files changed, 309 insertions(+), 6 deletions(-)
create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a8568c55c..3caa2bab12 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -97,6 +97,11 @@ typedef struct
Oid pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
} SerializedReindexState;
+typedef struct {
+ pg_atomic_uint32 numSafeConcurrentlyBuiltIndexes;
+} SafeICSharedState;
+static SafeICSharedState *SafeICStateShmem;
+
/* non-export function prototypes */
static bool relationHasPrimaryKey(Relation rel);
static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
@@ -176,6 +181,37 @@ relationHasPrimaryKey(Relation rel)
return result;
}
+
+void SafeICStateShmemInit(void)
+{
+ bool found;
+
+ SafeICStateShmem = (SafeICSharedState *)
+ ShmemInitStruct("Safe Concurrently Build Indexes",
+ sizeof(SafeICSharedState),
+ &found);
+
+ if (!IsUnderPostmaster)
+ {
+ Assert(!found);
+ pg_atomic_init_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 0);
+ } else
+ Assert(found);
+}
+
+void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment)
+{
+ if (increment)
+ pg_atomic_fetch_add_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+ else
+ pg_atomic_fetch_sub_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+}
+
+bool IsAnySafeIndexBuildsConcurrently(void)
+{
+ return pg_atomic_read_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes) > 0;
+}
+
/*
* index_check_primary_key
* Apply special checks needed before creating a PRIMARY KEY index
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487..663450ba20 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1636,6 +1636,8 @@ DefineIndex(Oid tableId,
* hold lock on the parent table. This might need to change later.
*/
LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
+ if (safe_index && concurrent)
+ UpdateNumSafeConcurrentlyBuiltIndexes(true);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -1804,7 +1806,15 @@ DefineIndex(Oid tableId,
* to replan; so relcache flush on the index itself was sufficient.)
*/
CacheInvalidateRelcacheByRelid(heaprelid.relId);
+ /* Commit index as valid before reducing the counter of safe concurrently built indexes */
+ CommitTransactionCommand();
+ Assert(concurrent);
+ if (safe_index)
+ UpdateNumSafeConcurrentlyBuiltIndexes(false);
+
+ /* Start a new transaction to finish process properly */
+ StartTransactionCommand();
/*
* Last thing to do is release the session-level lock on the parent table.
*/
@@ -3902,6 +3912,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
indexRel->rd_indpred == NIL);
idx->tableId = RelationGetRelid(heapRel);
idx->amId = indexRel->rd_rel->relam;
+ if (idx->safe)
+ UpdateNumSafeConcurrentlyBuiltIndexes(true);
/* This function shouldn't be called for temporary relations. */
if (indexRel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
@@ -4345,6 +4357,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
UnlockRelationIdForSession(lockrelid, ShareUpdateExclusiveLock);
}
+ /* Now we may clear the safe index build flags */
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+ if (newidx->safe)
+ UpdateNumSafeConcurrentlyBuiltIndexes(false);
+ }
+
/* Start a new transaction to finish process properly */
StartTransactionCommand();
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..260a634f1b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "catalog/index.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -357,6 +358,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventExtensionShmemInit();
InjectionPointShmemInit();
+ SafeICStateShmemInit();
}
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 1a83c4220b..446df34dab 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -53,6 +53,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
+#include "catalog/index.h"
#include "catalog/pg_authid.h"
#include "commands/dbcommands.h"
#include "miscadmin.h"
@@ -236,6 +237,12 @@ typedef struct ComputeXidHorizonsResult
*/
TransactionId data_oldest_nonremovable;
+ /*
+ * Oldest xid for which deleted tuples need to be retained in normal user
+ * defined tables with an index build in progress by a process with PROC_IN_SAFE_IC.
+ */
+ TransactionId data_safe_ic_oldest_nonremovable;
+
/*
* Oldest xid for which deleted tuples need to be retained in this
* session's temporary tables.
@@ -251,6 +258,7 @@ typedef enum GlobalVisHorizonKind
VISHORIZON_SHARED,
VISHORIZON_CATALOG,
VISHORIZON_DATA,
+ VISHORIZON_DATA_SAFE_IC,
VISHORIZON_TEMP,
} GlobalVisHorizonKind;
@@ -297,6 +305,7 @@ static TransactionId standbySnapshotPendingXmin;
static GlobalVisState GlobalVisSharedRels;
static GlobalVisState GlobalVisCatalogRels;
static GlobalVisState GlobalVisDataRels;
+static GlobalVisState GlobalVisDataSafeIcRels;
static GlobalVisState GlobalVisTempRels;
/*
@@ -1727,9 +1736,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
bool in_recovery = RecoveryInProgress();
TransactionId *other_xids = ProcGlobal->xids;
- /* inferred after ProcArrayLock is released */
- h->catalog_oldest_nonremovable = InvalidTransactionId;
-
LWLockAcquire(ProcArrayLock, LW_SHARED);
h->latest_completed = TransamVariables->latestCompletedXid;
@@ -1749,7 +1755,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->oldest_considered_running = initial;
h->shared_oldest_nonremovable = initial;
+ h->catalog_oldest_nonremovable = initial;
h->data_oldest_nonremovable = initial;
+ h->data_safe_ic_oldest_nonremovable = initial;
/*
* Only modifications made by this backend affect the horizon for
@@ -1847,11 +1855,28 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
(statusFlags & PROC_AFFECTS_ALL_HORIZONS) ||
in_recovery)
{
- h->data_oldest_nonremovable =
- TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+ h->data_safe_ic_oldest_nonremovable =
+ TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, xmin);
+
+ if (!(statusFlags & PROC_IN_SAFE_IC))
+ h->data_oldest_nonremovable =
+ TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+
+ /* Catalog tables need to consider all backends in this db */
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
+
}
}
+ /* catalog horizon should never be later than data */
+ Assert(TransactionIdPrecedesOrEquals(h->catalog_oldest_nonremovable,
+ h->data_oldest_nonremovable));
+
+ /* data horizon should never be later than safe index building horizon */
+ Assert(TransactionIdPrecedesOrEquals(h->data_safe_ic_oldest_nonremovable,
+ h->data_oldest_nonremovable));
+
/*
* If in recovery fetch oldest xid in KnownAssignedXids, will be applied
* after lock is released.
@@ -1873,6 +1898,10 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+ h->data_safe_ic_oldest_nonremovable =
+ TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, kaxmin);
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable, kaxmin);
/* temp relations cannot be accessed in recovery */
}
@@ -1880,6 +1909,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->shared_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+ h->data_safe_ic_oldest_nonremovable));
/*
* Check whether there are replication slots requiring an older xmin.
@@ -1888,6 +1919,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
+ h->data_safe_ic_oldest_nonremovable =
+ TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, h->slot_xmin);
/*
* The only difference between catalog / data horizons is that the slot's
@@ -1900,7 +1933,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->shared_oldest_nonremovable =
TransactionIdOlder(h->shared_oldest_nonremovable,
h->slot_catalog_xmin);
- h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable,
+ h->slot_xmin);
h->catalog_oldest_nonremovable =
TransactionIdOlder(h->catalog_oldest_nonremovable,
h->slot_catalog_xmin);
@@ -1918,6 +1953,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->oldest_considered_running =
TransactionIdOlder(h->oldest_considered_running,
h->data_oldest_nonremovable);
+ h->oldest_considered_running =
+ TransactionIdOlder(h->oldest_considered_running,
+ h->data_safe_ic_oldest_nonremovable);
/*
* shared horizons have to be at least as old as the oldest visible in
@@ -1925,6 +1963,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
*/
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+ h->data_safe_ic_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->catalog_oldest_nonremovable));
@@ -1938,6 +1978,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->catalog_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+ h->data_safe_ic_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->temp_oldest_nonremovable));
Assert(!TransactionIdIsValid(h->slot_xmin) ||
@@ -1973,7 +2015,21 @@ GlobalVisHorizonKindForRel(Relation rel)
RelationIsAccessibleInLogicalDecoding(rel))
return VISHORIZON_CATALOG;
else if (!RELATION_IS_LOCAL(rel))
+ {
+ /* TODO: do we need to do something special about TOAST? */
+ if (!rel->rd_indexvalid)
+ {
+ /* Skip loading indexes if we know there are no safe concurrent index builds in the cluster */
+ if (IsAnySafeIndexBuildsConcurrently())
+ {
+ RelationGetIndexList(rel);
+ Assert(rel->rd_indexvalid);
+ } else return VISHORIZON_DATA;
+ }
+ if (rel->rd_safeindexconcurrentlybuilding)
+ return VISHORIZON_DATA_SAFE_IC;
return VISHORIZON_DATA;
+ }
else
return VISHORIZON_TEMP;
}
@@ -2004,6 +2060,8 @@ GetOldestNonRemovableTransactionId(Relation rel)
return horizons.catalog_oldest_nonremovable;
case VISHORIZON_DATA:
return horizons.data_oldest_nonremovable;
+ case VISHORIZON_DATA_SAFE_IC:
+ return horizons.data_safe_ic_oldest_nonremovable;
case VISHORIZON_TEMP:
return horizons.temp_oldest_nonremovable;
}
@@ -2454,6 +2512,9 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(def_vis_fxid_data,
GlobalVisDataRels.definitely_needed);
+ GlobalVisDataSafeIcRels.definitely_needed =
+ FullTransactionIdNewer(def_vis_fxid_data,
+ GlobalVisDataSafeIcRels.definitely_needed);
/* See temp_oldest_nonremovable computation in ComputeXidHorizons() */
if (TransactionIdIsNormal(myxid))
GlobalVisTempRels.definitely_needed =
@@ -2478,6 +2539,9 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisCatalogRels.maybe_needed =
FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
oldestfxid);
+ GlobalVisDataSafeIcRels.maybe_needed =
+ FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+ oldestfxid);
GlobalVisDataRels.maybe_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
oldestfxid);
@@ -4106,6 +4170,9 @@ GlobalVisTestFor(Relation rel)
case VISHORIZON_DATA:
state = &GlobalVisDataRels;
break;
+ case VISHORIZON_DATA_SAFE_IC:
+ state = &GlobalVisDataSafeIcRels;
+ break;
case VISHORIZON_TEMP:
state = &GlobalVisTempRels;
break;
@@ -4158,6 +4225,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.maybe_needed =
FullXidRelativeTo(horizons->latest_completed,
horizons->data_oldest_nonremovable);
+ GlobalVisDataSafeIcRels.maybe_needed =
+ FullXidRelativeTo(horizons->latest_completed,
+ horizons->data_safe_ic_oldest_nonremovable);
GlobalVisTempRels.maybe_needed =
FullXidRelativeTo(horizons->latest_completed,
horizons->temp_oldest_nonremovable);
@@ -4176,6 +4246,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
GlobalVisDataRels.definitely_needed);
+ GlobalVisDataSafeIcRels.definitely_needed =
+ FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+ GlobalVisDataSafeIcRels.definitely_needed);
GlobalVisTempRels.definitely_needed = GlobalVisTempRels.maybe_needed;
ComputeXidHorizonsResultLastXmin = RecentXmin;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 262c9878dd..21e8521ab8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -41,6 +41,7 @@
#include "access/xact.h"
#include "catalog/binary_upgrade.h"
#include "catalog/catalog.h"
+#include "catalog/index.h"
#include "catalog/indexing.h"
#include "catalog/namespace.h"
#include "catalog/partition.h"
@@ -4769,6 +4770,7 @@ RelationGetIndexList(Relation relation)
Oid pkeyIndex = InvalidOid;
Oid candidateIndex = InvalidOid;
bool pkdeferrable = false;
+ bool safeindexconcurrentlybuilding = false;
MemoryContext oldcxt;
/* Quick exit if we already computed the list. */
@@ -4809,6 +4811,14 @@ RelationGetIndexList(Relation relation)
/* add index's OID to result list */
result = lappend_oid(result, index->indexrelid);
+ /*
+ * Consider an index as being built if it is ready but not yet valid.
+ * Also, we must deal only with indexes that are built using the
+ * concurrency-safe mode.
+ */
+ if (index->indisready && !index->indisvalid)
+ safeindexconcurrentlybuilding |= IsAnySafeIndexBuildsConcurrently();
+
/*
* Non-unique or predicate indexes aren't interesting for either oid
* indexes or replication identity indexes, so don't check them.
@@ -4869,6 +4879,7 @@ RelationGetIndexList(Relation relation)
relation->rd_indexlist = list_copy(result);
relation->rd_pkindex = pkeyIndex;
relation->rd_ispkdeferrable = pkdeferrable;
+ relation->rd_safeindexconcurrentlybuilding = safeindexconcurrentlybuilding;
if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex) && !pkdeferrable)
relation->rd_replidindex = pkeyIndex;
else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 0000000000..7b8afeead5
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($@ || $windows_os)
+{
+ plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key, $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+ c1 money default 0,c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+ # fork returned undef, so unsuccessful
+ die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+ $node->pgbench(
+ '--no-vacuum --client=5 --transactions=25000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent INSERTs, UPDATES and RC',
+ {
+ '002_pgbench_concurrent_transaction_inserts' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ ),
+ # Ensure some HOT updates happen
+ '002_pgbench_concurrent_transaction_updates' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ )
+ });
+
+ if ($child->is_passing()) {
+ shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+ } else {
+ shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+ }
+
+ sleep(1);
+} else {
+ my $pg_bench_fork_flag;
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+ subtest 'reindex run subtest' => sub {
+ is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+ my %psql = (stdin => '', stdout => '', stderr => '');
+ $psql{run} = IPC::Run::start(
+ [ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+ '<',
+ \$psql{stdin},
+ '>',
+ \$psql{stdout},
+ '2>',
+ \$psql{stderr},
+ $psql_timeout);
+
+ my ($result, $stdout, $stderr);
+ while (1)
+ {
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+ is($result, '0', 'REINDEX is correct');
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', true, true);));
+ is($result, '0', 'bt_index_parent_check is correct');
+ if ($result)
+ {
+ diag($stderr);
+ }
+
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ last if $pg_bench_fork_flag ne "wait";
+ }
+
+ # explicitly shut down psql instances gracefully
+ $psql{stdin} .= "\\q\n";
+ $psql{run}->finish;
+
+ is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+ };
+
+
+ $child->finalize();
+ $child->summary();
+ $node->stop;
+ done_testing();
+}
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..cac413e5eb 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,11 @@ extern void RestoreReindexState(const void *reindexstate);
extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
+extern void SafeICStateShmemInit(void);
+/* TODO: scope this by relation or database */
+extern void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment);
+extern bool IsAnySafeIndexBuildsConcurrently(void);
+
/*
* itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..e3c7899203 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -152,6 +152,7 @@ typedef struct RelationData
List *rd_indexlist; /* list of OIDs of indexes on relation */
Oid rd_pkindex; /* OID of (deferrable?) primary key, if any */
bool rd_ispkdeferrable; /* is rd_pkindex a deferrable PK? */
+ bool rd_safeindexconcurrentlybuilding; /* is a safe concurrent index build in progress for the relation? */
Oid rd_replidindex; /* OID of replica identity index, if any */
/* data managed by RelationGetStatExtList: */
--
2.34.1
Hello, Matthias and others!
I realized the new horizon was applied only during the validation phase
(once the index is marked as ready).
Now it is applied whenever the index is not yet marked as valid.
Updated version attached.
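To illustrate the fix (a hypothetical simplification; the exact change is
in the attached patch), the relcache check effectively moves from

    /* before: only indexes already marked ready were considered */
    if (index->indisready && !index->indisvalid)
        safeindexconcurrentlybuilding = true;

to covering any index that is not yet valid, including the build phase
before indisready is set:

    /* after: the whole not-yet-valid lifetime is covered */
    if (!index->indisvalid)
        safeindexconcurrentlybuilding = true;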
--------------------------------------------------
I think the best way for this to work would be an index method that
exclusively stores TIDs, and of which we can quickly determine new
tuples, too. I was thinking about something like GIN's format, but
using (generation number, tid) instead of ([colno, colvalue], tid) as
key data for the internal trees, and would be unlogged (because the
data wouldn't have to survive a crash). Then we could do something
like this for the second table scan phase:
Regarding that approach to dealing with the validation phase and resetting
of the snapshot: I was thinking about it and realized that once we go for
an additional index, we don't need the second heap scan at all!
We may do it this way:
* create the target index, not yet marked as indisready
* create a temporary unlogged index with the same parameters to store tids
(optionally with the index's column data, see below), marked as indisready
(but not indisvalid)
* commit them both in a single transaction
* wait for other transactions to learn about them and honor them in HOT
constraints and in new inserts (for the temporary index)
* now our temporary index is being filled by the tuples inserted into the
table
* start building the target index, resetting the snapshot every so often
(if it is a "safe" index)
* finish the target index build phase
* mark the target index as indisready
* now start validation of the index:
  * take the reference snapshot
  * take a visibility snapshot of the target index and sort it (as is done
currently)
  * take a visibility snapshot of our temporary index and sort it
  * start a merging loop using two synchronized cursors over both
visibility snapshots (see the sketch after this list)
    * if we encounter a tid which is not present in the target visibility
snapshot, insert it into the target index
      * if the temporary index contains the columns' data, we may even
avoid the tuple fetch
      * if the temporary index is tid-only, we fetch the tuple from the
heap, but as a plus we also skip inserting dead tuples into the new index
(I think this is the better option)
  * commit everything, release the reference snapshot
* wait for transactions older than the reference snapshot (as is done
currently)
* mark the target index as indisvalid, drop the temporary index
* done
So, pros:
* just a single heap scan
* the snapshot is reset periodically
Cons:
* we need to maintain the additional index during the main building phase
* one more tuplesort
If the temporary index is unlogged and cheap to maintain (just append-only
mechanics), this feels like a perfect tradeoff to me.
This approach works best with a low number of tuple inserts during the
building phase, and even in the worst case it still looks better than the
current approach.
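To make the merging loop concrete, here is a minimal standalone sketch
(plain C: two sorted arrays stand in for the sorted visibility snapshots,
and insert_to_target_index() is a hypothetical stand-in for the
fetch-and-insert step, not patch code):

#include <stdio.h>

typedef struct
{
	unsigned	blk;
	unsigned	off;
} Tid;

static int
tid_cmp(Tid a, Tid b)
{
	if (a.blk != b.blk)
		return a.blk < b.blk ? -1 : 1;
	if (a.off != b.off)
		return a.off < b.off ? -1 : 1;
	return 0;
}

/* stand-in for fetching the heap tuple and inserting it into the target */
static void
insert_to_target_index(Tid t)
{
	printf("insert (%u,%u)\n", t.blk, t.off);
}

int
main(void)
{
	/* sorted visibility snapshot of the target index */
	Tid			target[] = {{1, 1}, {1, 3}, {2, 5}};
	/* sorted visibility snapshot of the temporary index */
	Tid			aux[] = {{1, 1}, {1, 2}, {1, 3}, {2, 5}, {3, 1}};
	int			i = 0;

	/*
	 * Every tid known to the temporary index but missing from the target
	 * index was inserted concurrently and must be added to the target.
	 */
	for (int j = 0; j < 5; j++)
	{
		while (i < 3 && tid_cmp(target[i], aux[j]) < 0)
			i++;
		if (i >= 3 || tid_cmp(target[i], aux[j]) != 0)
			insert_to_target_index(aux[j]);	/* (1,2) and (3,1) here */
	}
	return 0;
}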
What do you think? Have I missed something?
Thanks,
Michail.
Attachments:
v4-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch (text/x-patch)
From 4878cc22c9176e5bf2b7d3d9d8c95cc66c8ac007 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Wed, 8 May 2024 22:31:33 +0200
Subject: [PATCH v4] WIP: fix d9d076222f5b "VACUUM: ignore indexing operations
with CONCURRENTLY" which was reverted by e28bb8851969.
The issue was caused by the absence of any snapshot actually protecting the data in the relation, which is required to build the index correctly.
Introduce a new type of visibility horizon to be used for relations with concurrently built indexes (in the case of a "safe" index).
Now `GlobalVisHorizonKindForRel` may dynamically decide which horizon to use based on the data about safe indexes being built concurrently.
To reduce the performance impact, the counter of concurrently built indexes is kept in shared memory.
---
src/backend/catalog/index.c | 36 ++++++
src/backend/commands/indexcmds.c | 20 +++
src/backend/storage/ipc/ipci.c | 2 +
src/backend/storage/ipc/procarray.c | 85 ++++++++++++-
src/backend/utils/cache/relcache.c | 11 ++
src/bin/pg_amcheck/t/006_concurrently.pl | 155 +++++++++++++++++++++++
src/include/catalog/index.h | 5 +
src/include/utils/rel.h | 1 +
8 files changed, 309 insertions(+), 6 deletions(-)
create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a8568c55c..3caa2bab12 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -97,6 +97,11 @@ typedef struct
Oid pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
} SerializedReindexState;
+typedef struct {
+ pg_atomic_uint32 numSafeConcurrentlyBuiltIndexes;
+} SafeICSharedState;
+static SafeICSharedState *SafeICStateShmem;
+
/* non-export function prototypes */
static bool relationHasPrimaryKey(Relation rel);
static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
@@ -176,6 +181,37 @@ relationHasPrimaryKey(Relation rel)
return result;
}
+
+void SafeICStateShmemInit(void)
+{
+ bool found;
+
+ SafeICStateShmem = (SafeICSharedState *)
+ ShmemInitStruct("Safe Concurrently Build Indexes",
+ sizeof(SafeICSharedState),
+ &found);
+
+ if (!IsUnderPostmaster)
+ {
+ Assert(!found);
+ pg_atomic_init_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 0);
+ } else
+ Assert(found);
+}
+
+void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment)
+{
+ if (increment)
+ pg_atomic_fetch_add_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+ else
+ pg_atomic_fetch_sub_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+}
+
+bool IsAnySafeIndexBuildsConcurrently(void)
+{
+ return pg_atomic_read_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes) > 0;
+}
+
/*
* index_check_primary_key
* Apply special checks needed before creating a PRIMARY KEY index
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487..663450ba20 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1636,6 +1636,8 @@ DefineIndex(Oid tableId,
* hold lock on the parent table. This might need to change later.
*/
LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
+ if (safe_index && concurrent)
+ UpdateNumSafeConcurrentlyBuiltIndexes(true);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -1804,7 +1806,15 @@ DefineIndex(Oid tableId,
* to replan; so relcache flush on the index itself was sufficient.)
*/
CacheInvalidateRelcacheByRelid(heaprelid.relId);
+ /* Commit index as valid before reducing the counter of safe concurrently built indexes */
+ CommitTransactionCommand();
+ Assert(concurrent);
+ if (safe_index)
+ UpdateNumSafeConcurrentlyBuiltIndexes(false);
+
+ /* Start a new transaction to finish process properly */
+ StartTransactionCommand();
/*
* Last thing to do is release the session-level lock on the parent table.
*/
@@ -3902,6 +3912,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
indexRel->rd_indpred == NIL);
idx->tableId = RelationGetRelid(heapRel);
idx->amId = indexRel->rd_rel->relam;
+ if (idx->safe)
+ UpdateNumSafeConcurrentlyBuiltIndexes(true);
/* This function shouldn't be called for temporary relations. */
if (indexRel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
@@ -4345,6 +4357,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
UnlockRelationIdForSession(lockrelid, ShareUpdateExclusiveLock);
}
+ /* now we may clear safe index building flags */
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+ if (newidx->safe)
+ UpdateNumSafeConcurrentlyBuiltIndexes(false);
+ }
+
/* Start a new transaction to finish process properly */
StartTransactionCommand();
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..260a634f1b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "catalog/index.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -357,6 +358,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventExtensionShmemInit();
InjectionPointShmemInit();
+ SafeICStateShmemInit();
}
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 1a83c4220b..446df34dab 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -53,6 +53,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
+#include "catalog/index.h"
#include "catalog/pg_authid.h"
#include "commands/dbcommands.h"
#include "miscadmin.h"
@@ -236,6 +237,12 @@ typedef struct ComputeXidHorizonsResult
*/
TransactionId data_oldest_nonremovable;
+ /*
+ * Oldest xid for which deleted tuples need to be retained in normal
+ * user-defined tables with index building in progress by a process with
+ * PROC_IN_SAFE_IC.
+ */
+ TransactionId data_safe_ic_oldest_nonremovable;
+
/*
* Oldest xid for which deleted tuples need to be retained in this
* session's temporary tables.
@@ -251,6 +258,7 @@ typedef enum GlobalVisHorizonKind
VISHORIZON_SHARED,
VISHORIZON_CATALOG,
VISHORIZON_DATA,
+ VISHORIZON_DATA_SAFE_IC,
VISHORIZON_TEMP,
} GlobalVisHorizonKind;
@@ -297,6 +305,7 @@ static TransactionId standbySnapshotPendingXmin;
static GlobalVisState GlobalVisSharedRels;
static GlobalVisState GlobalVisCatalogRels;
static GlobalVisState GlobalVisDataRels;
+static GlobalVisState GlobalVisDataSafeIcRels;
static GlobalVisState GlobalVisTempRels;
/*
@@ -1727,9 +1736,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
bool in_recovery = RecoveryInProgress();
TransactionId *other_xids = ProcGlobal->xids;
- /* inferred after ProcArrayLock is released */
- h->catalog_oldest_nonremovable = InvalidTransactionId;
-
LWLockAcquire(ProcArrayLock, LW_SHARED);
h->latest_completed = TransamVariables->latestCompletedXid;
@@ -1749,7 +1755,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->oldest_considered_running = initial;
h->shared_oldest_nonremovable = initial;
+ h->catalog_oldest_nonremovable = initial;
h->data_oldest_nonremovable = initial;
+ h->data_safe_ic_oldest_nonremovable = initial;
/*
* Only modifications made by this backend affect the horizon for
@@ -1847,11 +1855,28 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
(statusFlags & PROC_AFFECTS_ALL_HORIZONS) ||
in_recovery)
{
- h->data_oldest_nonremovable =
- TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+ h->data_safe_ic_oldest_nonremovable =
+ TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, xmin);
+
+ if (!(statusFlags & PROC_IN_SAFE_IC))
+ h->data_oldest_nonremovable =
+ TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+
+ /* Catalog tables need to consider all backends in this db */
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
+
}
}
+ /* catalog horizon should never be later than data */
+ Assert(TransactionIdPrecedesOrEquals(h->catalog_oldest_nonremovable,
+ h->data_oldest_nonremovable));
+
+ /* data horizon should never be later than safe index building horizon */
+ Assert(TransactionIdPrecedesOrEquals(h->data_safe_ic_oldest_nonremovable,
+ h->data_oldest_nonremovable));
+
/*
* If in recovery fetch oldest xid in KnownAssignedXids, will be applied
* after lock is released.
@@ -1873,6 +1898,10 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+ h->data_safe_ic_oldest_nonremovable =
+ TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, kaxmin);
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable, kaxmin);
/* temp relations cannot be accessed in recovery */
}
@@ -1880,6 +1909,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->shared_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+ h->data_safe_ic_oldest_nonremovable));
/*
* Check whether there are replication slots requiring an older xmin.
@@ -1888,6 +1919,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
+ h->data_safe_ic_oldest_nonremovable =
+ TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, h->slot_xmin);
/*
* The only difference between catalog / data horizons is that the slot's
@@ -1900,7 +1933,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->shared_oldest_nonremovable =
TransactionIdOlder(h->shared_oldest_nonremovable,
h->slot_catalog_xmin);
- h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable,
+ h->slot_xmin);
h->catalog_oldest_nonremovable =
TransactionIdOlder(h->catalog_oldest_nonremovable,
h->slot_catalog_xmin);
@@ -1918,6 +1953,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->oldest_considered_running =
TransactionIdOlder(h->oldest_considered_running,
h->data_oldest_nonremovable);
+ h->oldest_considered_running =
+ TransactionIdOlder(h->oldest_considered_running,
+ h->data_safe_ic_oldest_nonremovable);
/*
* shared horizons have to be at least as old as the oldest visible in
@@ -1925,6 +1963,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
*/
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+ h->data_safe_ic_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
h->catalog_oldest_nonremovable));
@@ -1938,6 +1978,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
h->catalog_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->data_oldest_nonremovable));
+ Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+ h->data_safe_ic_oldest_nonremovable));
Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
h->temp_oldest_nonremovable));
Assert(!TransactionIdIsValid(h->slot_xmin) ||
@@ -1973,7 +2015,21 @@ GlobalVisHorizonKindForRel(Relation rel)
RelationIsAccessibleInLogicalDecoding(rel))
return VISHORIZON_CATALOG;
else if (!RELATION_IS_LOCAL(rel))
+ {
+ /* TODO: do we need to do something special about TOAST? */
+ if (!rel->rd_indexvalid)
+ {
+ /* skip loading indexes if we know there are no safe concurrent index builds in the cluster */
+ if (IsAnySafeIndexBuildsConcurrently())
+ {
+ RelationGetIndexList(rel);
+ Assert(rel->rd_indexvalid);
+ } else return VISHORIZON_DATA;
+ }
+ if (rel->rd_safeindexconcurrentlybuilding)
+ return VISHORIZON_DATA_SAFE_IC;
return VISHORIZON_DATA;
+ }
else
return VISHORIZON_TEMP;
}
@@ -2004,6 +2060,8 @@ GetOldestNonRemovableTransactionId(Relation rel)
return horizons.catalog_oldest_nonremovable;
case VISHORIZON_DATA:
return horizons.data_oldest_nonremovable;
+ case VISHORIZON_DATA_SAFE_IC:
+ return horizons.data_safe_ic_oldest_nonremovable;
case VISHORIZON_TEMP:
return horizons.temp_oldest_nonremovable;
}
@@ -2454,6 +2512,9 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(def_vis_fxid_data,
GlobalVisDataRels.definitely_needed);
+ GlobalVisDataSafeIcRels.definitely_needed =
+ FullTransactionIdNewer(def_vis_fxid_data,
+ GlobalVisDataSafeIcRels.definitely_needed);
/* See temp_oldest_nonremovable computation in ComputeXidHorizons() */
if (TransactionIdIsNormal(myxid))
GlobalVisTempRels.definitely_needed =
@@ -2478,6 +2539,9 @@ GetSnapshotData(Snapshot snapshot)
GlobalVisCatalogRels.maybe_needed =
FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
oldestfxid);
+ GlobalVisDataSafeIcRels.maybe_needed =
+ FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+ oldestfxid);
GlobalVisDataRels.maybe_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
oldestfxid);
@@ -4106,6 +4170,9 @@ GlobalVisTestFor(Relation rel)
case VISHORIZON_DATA:
state = &GlobalVisDataRels;
break;
+ case VISHORIZON_DATA_SAFE_IC:
+ state = &GlobalVisDataSafeIcRels;
+ break;
case VISHORIZON_TEMP:
state = &GlobalVisTempRels;
break;
@@ -4158,6 +4225,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.maybe_needed =
FullXidRelativeTo(horizons->latest_completed,
horizons->data_oldest_nonremovable);
+ GlobalVisDataSafeIcRels.maybe_needed =
+ FullXidRelativeTo(horizons->latest_completed,
+ horizons->data_safe_ic_oldest_nonremovable);
GlobalVisTempRels.maybe_needed =
FullXidRelativeTo(horizons->latest_completed,
horizons->temp_oldest_nonremovable);
@@ -4176,6 +4246,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
GlobalVisDataRels.definitely_needed =
FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
GlobalVisDataRels.definitely_needed);
+ GlobalVisDataSafeIcRels.definitely_needed =
+ FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+ GlobalVisDataSafeIcRels.definitely_needed);
GlobalVisTempRels.definitely_needed = GlobalVisTempRels.maybe_needed;
ComputeXidHorizonsResultLastXmin = RecentXmin;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 262c9878dd..93b7794b48 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -41,6 +41,7 @@
#include "access/xact.h"
#include "catalog/binary_upgrade.h"
#include "catalog/catalog.h"
+#include "catalog/index.h"
#include "catalog/indexing.h"
#include "catalog/namespace.h"
#include "catalog/partition.h"
@@ -4769,6 +4770,7 @@ RelationGetIndexList(Relation relation)
Oid pkeyIndex = InvalidOid;
Oid candidateIndex = InvalidOid;
bool pkdeferrable = false;
+ bool safeindexconcurrentlybuilding = false;
MemoryContext oldcxt;
/* Quick exit if we already computed the list. */
@@ -4809,6 +4811,14 @@ RelationGetIndexList(Relation relation)
/* add index's OID to result list */
result = lappend_oid(result, index->indexrelid);
+ /*
+ * Consider the index as being built if it is not yet valid.
+ * Also, we must deal only with indexes which are built using the
+ * concurrent safe mode.
+ */
+ if (!index->indisvalid)
+ safeindexconcurrentlybuilding |= IsAnySafeIndexBuildsConcurrently();
+
/*
* Non-unique or predicate indexes aren't interesting for either oid
* indexes or replication identity indexes, so don't check them.
@@ -4869,6 +4879,7 @@ RelationGetIndexList(Relation relation)
relation->rd_indexlist = list_copy(result);
relation->rd_pkindex = pkeyIndex;
relation->rd_ispkdeferrable = pkdeferrable;
+ relation->rd_safeindexconcurrentlybuilding = safeindexconcurrentlybuilding;
if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex) && !pkdeferrable)
relation->rd_replidindex = pkeyIndex;
else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 0000000000..7b8afeead5
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($@ || $windows_os)
+{
+ plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key, $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+ c1 money default 0,c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+ # fork returned undef, so unsuccessful
+ die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+ $node->pgbench(
+ '--no-vacuum --client=5 --transactions=25000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent INSERTs, UPDATES and RC',
+ {
+ '002_pgbench_concurrent_transaction_inserts' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ ),
+ # Ensure some HOT updates happen
+ '002_pgbench_concurrent_transaction_updates' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ )
+ });
+
+ if ($child->is_passing()) {
+ shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+ } else {
+ shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+ }
+
+ sleep(1);
+} else {
+ my $pg_bench_fork_flag;
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+ subtest 'reindex run subtest' => sub {
+ is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+ my %psql = (stdin => '', stdout => '', stderr => '');
+ $psql{run} = IPC::Run::start(
+ [ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+ '<',
+ \$psql{stdin},
+ '>',
+ \$psql{stdout},
+ '2>',
+ \$psql{stderr},
+ $psql_timeout);
+
+ my ($result, $stdout, $stderr);
+ while (1)
+ {
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+ is($result, '0', 'REINDEX is correct');
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', true, true);));
+ is($result, '0', 'bt_index_check is correct');
+ if ($result)
+ {
+ diag($stderr);
+ }
+
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ last if $pg_bench_fork_flag ne "wait";
+ }
+
+ # explicitly shut down psql instances gracefully
+ $psql{stdin} .= "\\q\n";
+ $psql{run}->finish;
+
+ is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+ };
+
+
+ $child->finalize();
+ $child->summary();
+ $node->stop;
+ done_testing();
+}
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..cac413e5eb 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,11 @@ extern void RestoreReindexState(const void *reindexstate);
extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
+extern void SafeICStateShmemInit(void);
+/* TODO: bound by relation or database */
+extern void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment);
+extern bool IsAnySafeIndexBuildsConcurrently(void);
+
/*
* itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..e3c7899203 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -152,6 +152,7 @@ typedef struct RelationData
List *rd_indexlist; /* list of OIDs of indexes on relation */
Oid rd_pkindex; /* OID of (deferrable?) primary key, if any */
bool rd_ispkdeferrable; /* is rd_pkindex a deferrable PK? */
+ bool rd_safeindexconcurrentlybuilding; /* is safe concurrent index building in progress for relation */
Oid rd_replidindex; /* OID of replica identity index, if any */
/* data managed by RelationGetStatExtList: */
--
2.34.1
Hello.
I did a POC (1) of the method described in the previous email, and it
looks promising.
It doesn't block VACUUM, and indexes are built about 30% faster (22 min vs
15 min). The additional index is lightweight and does not produce any WAL.
I'll continue stress testing for a while. Also, I need to restructure the
commits (my path was not direct) into meaningful and reviewable patches.
[1]: https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach
On Tue, 11 Jun 2024 at 10:58, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Hello.
I did a POC (1) of the method described in the previous email, and it looks promising.
It doesn't block VACUUM, and indexes are built about 30% faster (22 min vs 15 min).
That's a nice improvement.
Additional index is lightweight and does not produce any WAL.
That doesn't seem to be what I see in the current patchset:
https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach#diff-cc3cb8968cf833c4b8498ad2c561c786099c910515c4bf397ba853ae60aa2bf7R311
I'll continue stress testing for a while. Also, I need to restructure the commits (my path was not direct) into meaningful and reviewable patches.
While waiting for this, here are some initial comments on the github diffs:
- I notice you've added a new argument to
heapam_index_build_range_scan. I think this could just as well be
implemented by reading the indexInfo->ii_Concurrent field, as the
values should be equivalent, right?
- In heapam_index_build_range_scan, it seems like you're popping the
snapshot and registering a new one while holding a tuple from
heap_getnext(), thus while holding a page lock. I'm not so sure that's
OK, especially when catalogs are also involved (specifically for
expression indexes, where functions could potentially be updated or
dropped if we re-create the visibility snapshot)
- In heapam_index_build_range_scan, you pop the snapshot before the
returned heaptuple is processed and passed to the index-provided
callback. I think that's incorrect, as it'll change the visibility of
the returned tuple before it's passed to the index's callback. I think
the snapshot manipulation is best added at the end of the loop, if we
add it at all in that function.
- The snapshot reset interval is quite high, at 500ms. Why did you
configure it that low, and didn't you make this configurable?
- You seem to be using WAL in the STIR index, while it doesn't seem
that relevant for the use case of auxiliary indexes that won't return
any data and are only used on the primary. It would imply that the
data is being sent to replicas and more data being written than
strictly necessary, which to me seems wasteful.
- The locking in stirinsert can probably be improved significantly if
we use things like atomic operations on STIR pages. We'd need an
exclusive lock only for page initialization, while share locks are
enough if the page's data is modified without WAL. That should improve
concurrent insert performance significantly, as it would further
reduce the length of the exclusively locked hot path.
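To illustrate the direction (a standalone sketch: C11 atomics stand in for
PostgreSQL's pg_atomic API and buffer locks, and the flat page layout,
sizes, and names are simplified assumptions, not the STIR format),
share-locked inserters could reserve space with a single fetch-add, falling
back to the exclusively locked path only for a full or uninitialized page:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define ITEM_SIZE 6				/* a bare TID, as a stand-in */

typedef struct
{
	atomic_uint pd_used;		/* bytes already handed out on this page */
	char		data[8192 - sizeof(atomic_uint)];
} StirPage;

/*
 * Reserve ITEM_SIZE bytes with one atomic fetch-add; many backends can do
 * this concurrently under a mere share lock.  On overshoot, undo the
 * reservation and report "page full" so the caller can extend the index
 * under an exclusive lock instead.
 */
static bool
stir_reserve(StirPage *page, unsigned *offset)
{
	unsigned	old = atomic_fetch_add(&page->pd_used, ITEM_SIZE);

	if (old + ITEM_SIZE > sizeof(page->data))
	{
		atomic_fetch_sub(&page->pd_used, ITEM_SIZE);
		return false;
	}
	*offset = old;
	return true;
}

int
main(void)
{
	static StirPage page;		/* zero-initialized: empty page */
	unsigned	off;

	while (stir_reserve(&page, &off))
		;						/* fill the page */
	printf("page full after %u bytes\n", atomic_load(&page.pd_used));
	return 0;
}

A real implementation would also have to make the undo-on-overshoot path
safe against concurrent readers, but the hot path shrinks to one atomic add.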
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Hello, Matthias!
While waiting for this, here are some initial comments on the github
diffs:
Thanks for your review!
While stress testing the POC, I found some issues unrelated to the patch
that need to be fixed first.
This is [1] and [2].
Additional index is lightweight and does not produce any WAL.
That doesn't seem to be what I see in the current patchset:
Persistence is passed as a parameter [3] and set to RELPERSISTENCE_UNLOGGED
for auxiliary indexes [4].
- I notice you've added a new argument to
heapam_index_build_range_scan. I think this could just as well be
implemented by reading the indexInfo->ii_Concurrent field, as the
values should be equivalent, right?
Not always; currently, it is set by ResetSnapshotsAllowed [5].
We fall back to a regular index build if there is a predicate or expression
in the index (which should be considered "safe" according to [6]).
However, we may remove this check later.
Additionally, there is no sense in resetting the snapshot if we already
have an xmin assigned to the backend for some reason.
In heapam_index_build_range_scan, it seems like you're popping the
snapshot and registering a new one while holding a tuple from
heap_getnext(), thus while holding a page lock. I'm not so sure that's
OK, expecially when catalogs are also involved (specifically for
expression indexes, where functions could potentially be updated or
dropped if we re-create the visibility snapshot)
Yeah, good catch.
Initially, I implemented a different approach by extracting the catalog
xmin to a separate horizon [7]. It might be better to return to this option.
In heapam_index_build_range_scan, you pop the snapshot before the
returned heaptuple is processed and passed to the index-provided
callback. I think that's incorrect, as it'll change the visibility of
the returned tuple before it's passed to the index's callback. I think
the snapshot manipulation is best added at the end of the loop, if we
add it at all in that function.
Yes, this needs to be fixed as well.
The snapshot reset interval is quite high, at 500ms. Why did you
configure it that low, and didn't you make this configurable?
It is just a random value for testing purposes.
I don't think there is a need to make it configurable.
Getting a new snapshot is a cheap operation now, so we can do it more often
if required.
Internally, I was testing it with a 0ms interval.
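For reference, the reset itself is roughly the following between-pages swap
(a sketch against the snapmgr API; the helper name, the interval argument,
and the calling contract are mine, not the exact patch code):

#include "postgres.h"
#include "portability/instr_time.h"
#include "utils/snapmgr.h"

/*
 * Swap the backend's registered snapshot (and thus its advertised xmin)
 * for a fresh one once reset_interval_ms has elapsed.  Must only be called
 * between pages, while no tuple fetched under the old snapshot is held.
 */
static Snapshot
maybe_reset_snapshot(Snapshot snapshot, instr_time *last_reset,
					 double reset_interval_ms)
{
	instr_time	now,
				elapsed;

	INSTR_TIME_SET_CURRENT(now);
	elapsed = now;
	INSTR_TIME_SUBTRACT(elapsed, *last_reset);

	if (INSTR_TIME_GET_MILLISEC(elapsed) >= reset_interval_ms)
	{
		PopActiveSnapshot();
		UnregisterSnapshot(snapshot);
		/* xmin is now invalid, so the horizon may advance past us */
		snapshot = RegisterSnapshot(GetLatestSnapshot());
		PushActiveSnapshot(snapshot);
		*last_reset = now;
	}
	return snapshot;
}

With a 0ms interval this degenerates to taking a fresh snapshot at every
page boundary.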
You seem to be using WAL in the STIR index, while it doesn't seem
that relevant for the use case of auxiliary indexes that won't return
any data and are only used on the primary. It would imply that the
data is being sent to replicas and more data being written than
strictly necessary, which to me seems wasteful.
It just looks like an index with WAL, but as mentioned above, it is
unlogged in actual usage.
The locking in stirinsert can probably be improved significantly if
we use things like atomic operations on STIR pages. We'd need an
exclusive lock only for page initialization, while share locks are
enough if the page's data is modified without WAL. That should improve
concurrent insert performance significantly, as it would further
reduce the length of the exclusively locked hot path.
Hm, good idea. I'll check it later.
Best regards & thanks again,
Mikhail
[1]: /messages/by-id/CANtu0ohHmYXsK5bxU9Thcq1FbELLAk0S2Zap0r8AnU3OTmcCOA@mail.gmail.com
[2]: /messages/by-id/CANtu0ojga8s9+J89cAgLzn2e-bQgy3L0iQCKaCnTL=ppot=qhw@mail.gmail.com
[3]: https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach#diff-50abc48efcc362f0d3194aceba6969429f46fa1f07a119e555255545e6655933R93
[4]: https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/catalog/index.c#L1600
[5]: https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/catalog/index.c#L2657
[6]: https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/commands/indexcmds.c#L1129
[7]: https://github.com/postgres/postgres/commit/38b243d6cc7358a44cb1a865b919bf9633825b0c
Hello, Matthias!
Just wanted to update you with some information about the next steps of
this work.
In heapam_index_build_range_scan, it seems like you're popping the
snapshot and registering a new one while holding a tuple from
heap_getnext(), thus while holding a page lock. I'm not so sure that's
OK, especially when catalogs are also involved (specifically for
expression indexes, where functions could potentially be updated or
dropped if we re-create the visibility snapshot)
I have returned to the solution with a dedicated catalog_xmin for
backends [1].
Additionally, I have added catalog_xmin to pg_stat_activity [2].
In heapam_index_build_range_scan, you pop the snapshot before the
returned heaptuple is processed and passed to the index-provided
callback. I think that's incorrect, as it'll change the visibility of
the returned tuple before it's passed to the index's callback. I think
the snapshot manipulation is best added at the end of the loop, if we
add it at all in that function.
Now it's fixed, and the snapshot is reset between pages [3].
Additionally, I resolved the issue with potential duplicates in unique
indexes. It looks a bit clunky, but it works for now [4].
A single commit from [5] is also included, just for stable stress testing.
The full diff is available at [6].
Best regards,
Mikhail.
[1]: https://github.com/michail-nikolaev/postgres/commit/01a47623571592c52c7a367f85b1cff9d8b593c0
[2]: https://github.com/michail-nikolaev/postgres/commit/d3345d60bd51fe2e0e4a73806774b828f34ba7b6
[3]: https://github.com/michail-nikolaev/postgres/commit/7d1dd4f971e8d03f38de95f82b730635ffe09aaf
[4]: https://github.com/michail-nikolaev/postgres/commit/4ad56e14dd504d5530657069068c2bdf172e482d
[5]: https://commitfest.postgresql.org/49/5160/
[6]: https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach?diff=split&w=
Hello, Matthias!
- I notice you've added a new argument to
heapam_index_build_range_scan. I think this could just as well be
implemented by reading the indexInfo->ii_Concurrent field, as the
values should be equivalent, right?
Not always; currently, it is set by ResetSnapshotsAllowed [5].
We fall back to regular index build if there is a predicate or expression
in the index (which should be considered "safe" according to [6]).
However, we may remove this check later.
Additionally, there is no sense in resetting the snapshot if we already
have an xmin assigned to the backend for some reason.
I realized you were right. It's always possible to reset snapshots for
concurrent index building without any limitations related to predicates or
expressions.
Additionally, the PROC_IN_SAFE_IC flag is no longer necessary since
snapshots are rotating quickly, and it's possible to wait for them without
requiring any special exceptions for CREATE/REINDEX INDEX CONCURRENTLY.
Currently, it looks like this [1]. I've also attached a single large patch
just in case.
I plan to restructure the patch into the following set:
* Introduce catalogXmin as a separate value to calculate the horizon for
the catalog.
* Add the STIR access method.
* Modify concurrent build/reindex to use an aux-index approach without
snapshot rotation.
* Add support for snapshot rotation for non-parallel and non-unique cases.
* Extend support for snapshot rotation in parallel index builds.
* Implement snapshot rotation support for unique indexes.
Best regards,
Mikhail
[1]: https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach_rebased?expand=1
Attachments:
create_index_concurrently_with_aux_index_or_rotated_snapshots.patch (text/x-patch)
Subject: [PATCH] a lot of refactoring
Ensure the correct determination of index safety to be used with set_indexsafe_procflags during REINDEX CONCURRENTLY
Revert "Revert "backend_catalog_xmin in pg_stat_activity""
revert the revert of catalogXmin
fix resetting snapshot during heapam_index_build_range_scan (snapshot is reset between pages)
apply v3-0002-Modify-the-infer_arbiter_indexes-function-to-cons.patch for test stability
fix unique check for building unique indexes
support for unique indexes
revert ThereAreNoPriorRegisteredSnapshots changes
revert ThereAreNoPriorRegisteredSnapshots changes
do not hold xmin while inserting to the index
rename jam to stir
delete ii_Auxiliary
Revert "introduce PROC->catalogXmin"
Revert "backend_catalog_xmin in pg_stat_activity"
some fixes for jam
few tunes
backend_catalog_xmin in pg_stat_activity
disable snapshot reset for unique indexes
just access method to use as index for validation
support for parallel building with snapshot reset
resetting snapshot during heap scan in the case of serial index build
resetting snapshot during validate_index
introduce PROC->catalogXmin
create index concurrently using auxiliary index
---
Index: src/backend/access/heap/heapam_handler.c
===================================================================
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
--- a/src/backend/access/heap/heapam_handler.c (revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/backend/access/heap/heapam_handler.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -41,10 +41,12 @@
#include "storage/bufpage.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/rel.h"
+#include "utils/injection_point.h"
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
@@ -1191,11 +1193,11 @@
ExprContext *econtext;
Snapshot snapshot;
bool need_unregister_snapshot = false;
+ bool pop_active_snapshot = false;
TransactionId OldestXmin;
BlockNumber previous_blkno = InvalidBlockNumber;
BlockNumber root_blkno = InvalidBlockNumber;
OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-
/*
* sanity checks
*/
@@ -1213,6 +1215,8 @@
* only one of those is requested.
*/
Assert(!(anyvisible && checking_uniqueness));
+ Assert(!(anyvisible && indexInfo->ii_Concurrent));
+ Assert(!indexInfo->ii_Concurrent || !HaveRegisteredOrActiveSnapshot() || scan);
/*
* Need an EState for evaluation of index expressions and partial-index
@@ -1252,17 +1256,22 @@
if (!TransactionIdIsValid(OldestXmin))
{
snapshot = RegisterSnapshot(GetTransactionSnapshot());
- need_unregister_snapshot = true;
+ PushActiveSnapshot(snapshot);
+ need_unregister_snapshot = pop_active_snapshot = !indexInfo->ii_Concurrent;
}
else
+ {
+ Assert(!indexInfo->ii_Concurrent);
snapshot = SnapshotAny;
+ }
scan = table_beginscan_strat(heapRelation, /* relation */
snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- allow_sync); /* syncscan OK? */
+ allow_sync, /* syncscan OK? */
+ indexInfo->ii_Concurrent);
}
else
{
@@ -1726,8 +1735,12 @@
table_endscan(scan);
/* we can now forget our snapshot, if set and registered by us */
+ if (pop_active_snapshot)
+ PopActiveSnapshot();
if (need_unregister_snapshot)
UnregisterSnapshot(snapshot);
+ if (indexInfo->ii_Concurrent && !hscan)
+ Assert(!TransactionIdIsValid(MyProc->xmin));
ExecDropSingleTupleTableSlot(slot);
@@ -1740,245 +1753,206 @@
return reltuples;
}
-static void
-heapam_index_validate_scan(Relation heapRelation,
- Relation indexRelation,
- IndexInfo *indexInfo,
+static TransactionId
+heapam_index_validate_scan(Relation table_rel,
+ Relation index_rel,
+ Relation aux_index_rel,
+ struct IndexInfo *index_info,
+ struct IndexInfo *aux_index_info,
Snapshot snapshot,
- ValidateIndexState *state)
+ struct ValidateIndexState *state,
+ struct ValidateIndexState *aux_state)
{
- TableScanDesc scan;
- HeapScanDesc hscan;
- HeapTuple heapTuple;
+ IndexFetchTableData *fetch;
+ TransactionId limitXmin;
+
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
- ExprState *predicate;
- TupleTableSlot *slot;
- EState *estate;
- ExprContext *econtext;
- BlockNumber root_blkno = InvalidBlockNumber;
- OffsetNumber root_offsets[MaxHeapTuplesPerPage];
- bool in_index[MaxHeapTuplesPerPage];
- BlockNumber previous_blkno = InvalidBlockNumber;
+
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
/* state variables for the merge */
- ItemPointer indexcursor = NULL;
- ItemPointerData decoded;
- bool tuplesort_empty = false;
+ ItemPointer indexcursor = NULL,
+ auxindexcursor = NULL,
+ prev_indexcursor = NULL;
+ ItemPointerData decoded,
+ auxdecoded,
+ prev_decoded,
+ fetched;
+ bool tuplesort_empty = false,
+ auxtuplesort_empty = false;
+ instr_time snapshotTime,
+ currentTime,
+ elapsed;
+
+ Assert(!HaveRegisteredOrActiveSnapshot());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+
+ snapshot = RegisterSnapshot(GetLatestSnapshot());
+ PushActiveSnapshot(snapshot);
+ INSTR_TIME_SET_CURRENT(snapshotTime);
+ limitXmin = snapshot->xmin;
/*
* sanity checks
*/
- Assert(OidIsValid(indexRelation->rd_rel->relam));
+ Assert(OidIsValid(index_rel->rd_rel->relam));
+ Assert(OidIsValid(aux_index_rel->rd_rel->relam));
- /*
- * Need an EState for evaluation of index expressions and partial-index
- * predicates. Also a slot to hold the current tuple.
- */
estate = CreateExecutorState();
econtext = GetPerTupleExprContext(estate);
- slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
- &TTSOpsHeapTuple);
+
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(table_rel),
+ &TTSOpsBufferHeapTuple);
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
- /* Set up execution state for predicate, if any. */
- predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+ fetch = heapam_index_fetch_begin(table_rel);
- /*
- * Prepare for scan of the base relation. We need just those tuples
- * satisfying the passed-in reference snapshot. We must disable syncscan
- * here, because it's critical that we read from block zero forward to
- * match the sorted TIDs.
- */
- scan = table_beginscan_strat(heapRelation, /* relation */
- snapshot, /* snapshot */
- 0, /* number of keys */
- NULL, /* scan key */
- true, /* buffer access strategy OK */
- false); /* syncscan not OK */
- hscan = (HeapScanDesc) scan;
+ ItemPointerSetInvalid(&decoded);
+ ItemPointerSetInvalid(&prev_decoded);
+ ItemPointerSetInvalid(&auxdecoded);
+ ItemPointerSetInvalid(&fetched);
- pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
- hscan->rs_nblocks);
+ prev_indexcursor = &prev_decoded;
- /*
- * Scan all tuples matching the snapshot.
- */
- while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ while (!auxtuplesort_empty)
{
- ItemPointer heapcursor = &heapTuple->t_self;
- ItemPointerData rootTuple;
- OffsetNumber root_offnum;
-
CHECK_FOR_INTERRUPTS();
- state->htups += 1;
-
- if ((previous_blkno == InvalidBlockNumber) ||
- (hscan->rs_cblock != previous_blkno))
+ INSTR_TIME_SET_CURRENT(currentTime);
+ elapsed = currentTime;
+ INSTR_TIME_SUBTRACT(elapsed, snapshotTime);
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >= VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL)
{
- pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- hscan->rs_cblock);
- previous_blkno = hscan->rs_cblock;
- }
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
- /*
- * As commented in table_index_build_scan, we should index heap-only
- * tuples under the TIDs of their root tuples; so when we advance onto
- * a new heap page, build a map of root item offsets on the page.
- *
- * This complicates merging against the tuplesort output: we will
- * visit the live tuples in order by their offsets, but the root
- * offsets that we need to compare against the index contents might be
- * ordered differently. So we might have to "look back" within the
- * tuplesort output, but only within the current page. We handle that
- * by keeping a bool array in_index[] showing all the
- * already-passed-over tuplesort output TIDs of the current page. We
- * clear that array here, when advancing onto a new heap page.
- */
- if (hscan->rs_cblock != root_blkno)
- {
- Page page = BufferGetPage(hscan->rs_cbuf);
+ Assert(!TransactionIdIsValid(MyProc->xmin));
- LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
- heap_get_root_tuples(page, root_offsets);
- LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
- memset(in_index, 0, sizeof(in_index));
-
- root_blkno = hscan->rs_cblock;
+ snapshot = RegisterSnapshot(GetLatestSnapshot());
+ PushActiveSnapshot(snapshot);
+ limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+ INSTR_TIME_SET_CURRENT(snapshotTime);
}
- /* Convert actual tuple TID to root TID */
- rootTuple = *heapcursor;
- root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
- if (HeapTupleIsHeapOnly(heapTuple))
- {
- root_offnum = root_offsets[root_offnum - 1];
- if (!OffsetNumberIsValid(root_offnum))
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
- ItemPointerGetBlockNumber(heapcursor),
- ItemPointerGetOffsetNumber(heapcursor),
- RelationGetRelationName(heapRelation))));
- ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
- }
-
- /*
- * "merge" by skipping through the index tuples until we find or pass
- * the current root tuple.
- */
- while (!tuplesort_empty &&
- (!indexcursor ||
- ItemPointerCompare(indexcursor, &rootTuple) < 0))
{
- Datum ts_val;
- bool ts_isnull;
-
- if (indexcursor)
+ Datum ts_val;
+ bool ts_isnull;
+ auxtuplesort_empty = !tuplesort_getdatum(aux_state->tuplesort, true,
+ false, &ts_val, &ts_isnull,
+ NULL);
+ Assert(auxtuplesort_empty || !ts_isnull);
+ if (!auxtuplesort_empty)
+ {
+ itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+ auxindexcursor = &auxdecoded;
+ }
+ else
{
- /*
- * Remember index items seen earlier on the current heap page
- */
- if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
- in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+ auxindexcursor = NULL;
}
+ }
- tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
- false, &ts_val, &ts_isnull,
- NULL);
- Assert(tuplesort_empty || !ts_isnull);
- if (!tuplesort_empty)
- {
- itemptr_decode(&decoded, DatumGetInt64(ts_val));
- indexcursor = &decoded;
- }
- else
- {
- /* Be tidy */
- indexcursor = NULL;
- }
- }
+ if (!auxtuplesort_empty)
+ {
+ while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+ ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+ {
+ Datum ts_val;
+ bool ts_isnull;
+ prev_decoded = decoded;
+ tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+ false, &ts_val, &ts_isnull,
+ NULL);
+ Assert(tuplesort_empty || !ts_isnull);
+ if (!tuplesort_empty)
+ {
+ itemptr_decode(&decoded, DatumGetInt64(ts_val));
+ indexcursor = &decoded;
+
+ if (ItemPointerCompare(prev_indexcursor, indexcursor) == 0)
+ {
+ elog(DEBUG5, "skipping duplicate tid in target index snapshot: (%u,%u)",
+ ItemPointerGetBlockNumber(indexcursor),
+ ItemPointerGetOffsetNumber(indexcursor));
+ }
+ }
+ else
+ {
+ indexcursor = NULL;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
- /*
- * If the tuplesort has overshot *and* we didn't see a match earlier,
- * then this tuple is missing from the index, so insert it.
- */
- if ((tuplesort_empty ||
- ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
- !in_index[root_offnum - 1])
- {
- MemoryContextReset(econtext->ecxt_per_tuple_memory);
+ if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+ {
+ bool call_again = false;
+ bool all_dead = false;
+ ItemPointer tid;
+
+ fetched = *auxindexcursor;
+ tid = &fetched;
+
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
- /* Set up for predicate or expression evaluation */
- ExecStoreHeapTuple(heapTuple, slot, false);
-
- /*
- * In a partial index, discard tuples that don't satisfy the
- * predicate.
- */
- if (predicate != NULL)
- {
- if (!ExecQual(predicate, econtext))
- continue;
- }
+ if (heapam_index_fetch_tuple(fetch, tid, snapshot, slot, &call_again, &all_dead))
+ {
- /*
- * For the current heap tuple, extract all the attributes we use
- * in this index, and note which are null. This also performs
- * evaluation of any expressions needed.
- */
- FormIndexDatum(indexInfo,
- slot,
- estate,
- values,
- isnull);
+ FormIndexDatum(index_info,
+ slot,
+ estate,
+ values,
+ isnull);
- /*
- * You'd think we should go ahead and build the index tuple here,
- * but some index AMs want to do further processing on the data
- * first. So pass the values[] and isnull[] arrays, instead.
- */
-
- /*
- * If the tuple is already committed dead, you might think we
- * could suppress uniqueness checking, but this is no longer true
- * in the presence of HOT, because the insert is actually a proxy
- * for a uniqueness check on the whole HOT-chain. That is, the
- * tuple we have here could be dead because it was already
- * HOT-updated, and if so the updating transaction will not have
- * thought it should insert index entries. The index AM will
- * check the whole HOT-chain and correctly detect a conflict if
- * there is one.
- */
-
- index_insert(indexRelation,
- values,
- isnull,
- &rootTuple,
- heapRelation,
- indexInfo->ii_Unique ?
- UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- false,
- indexInfo);
+ index_insert(index_rel,
+ values,
+ isnull,
+ auxindexcursor, /* insert root tuple */
+ table_rel,
+ index_info->ii_Unique ?
+ UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+ false,
+ index_info);
- state->tups_inserted += 1;
+ state->tups_inserted += 1;
+
+ elog(DEBUG5, "inserted tid: (%u,%u), root: (%u, %u)",
+ ItemPointerGetBlockNumber(auxindexcursor),
+ ItemPointerGetOffsetNumber(auxindexcursor),
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
+ }
+ else
+ {
+ elog(DEBUG5, "skipping insert to target index because tid not visible: (%u,%u)",
+ ItemPointerGetBlockNumber(auxindexcursor),
+ ItemPointerGetOffsetNumber(auxindexcursor));
+ }
+ }
}
}
-
- table_endscan(scan);
ExecDropSingleTupleTableSlot(slot);
FreeExecutorState(estate);
- /* These may have been pointing to the now-gone estate */
- indexInfo->ii_ExpressionsState = NIL;
- indexInfo->ii_PredicateState = NULL;
+ heapam_index_fetch_end(fetch);
+
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+ InvalidateCatalogSnapshot();
+ Assert(MyProc->xmin == InvalidTransactionId);
+#if USE_INJECTION_POINTS
+ if (MyProc->xid == InvalidTransactionId)
+ INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
+
+ return limitXmin;
}
/*
Index: src/backend/catalog/index.c
===================================================================
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
--- a/src/backend/catalog/index.c (revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/backend/catalog/index.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -67,6 +67,7 @@
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "storage/proc.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/fmgroids.h"
@@ -741,7 +742,8 @@
bits16 constr_flags,
bool allow_system_table_mods,
bool is_internal,
- Oid *constraintId)
+ Oid *constraintId,
+ char relpersistence)
{
Oid heapRelationId = RelationGetRelid(heapRelation);
Relation pg_class;
@@ -752,7 +754,6 @@
bool is_exclusion;
Oid namespaceId;
int i;
- char relpersistence;
bool isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
bool invalid = (flags & INDEX_CREATE_INVALID) != 0;
bool concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
@@ -782,7 +783,6 @@
namespaceId = RelationGetNamespace(heapRelation);
shared_relation = heapRelation->rd_rel->relisshared;
mapped_relation = RelationIsMapped(heapRelation);
- relpersistence = heapRelation->rd_rel->relpersistence;
/*
* check parameters
@@ -1459,13 +1459,151 @@
0,
true, /* allow table to be a system catalog? */
false, /* is_internal? */
- NULL);
+ NULL,
+ heapRelation->rd_rel->relpersistence);
/* Close the relations used and clean up */
index_close(indexRelation, NoLock);
ReleaseSysCache(indexTuple);
ReleaseSysCache(classTuple);
+ return newIndexId;
+}
+
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+ Oid tablespaceOid, const char *newName)
+{
+ Relation indexRelation;
+ IndexInfo *oldInfo,
+ *newInfo;
+ Oid newIndexId = InvalidOid;
+ HeapTuple indexTuple;
+
+ List *indexColNames = NIL;
+ List *indexExprs = NIL;
+ List *indexPreds = NIL;
+
+ Oid *auxOpclassIds;
+ int16 *auxColoptions;
+
+ indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+ /* The new index needs some information from the old index */
+ oldInfo = BuildIndexInfo(indexRelation);
+
+ /*
+ * Build of an auxiliary index with exclusion constraints is not
+ * supported.
+ */
+ if (oldInfo->ii_ExclusionOps != NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+ /* Get the array of class and column options IDs from index info */
+ indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+ if (!HeapTupleIsValid(indexTuple))
+ elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+ /*
+ * Fetch the list of expressions and predicates directly from the
+ * catalogs. This cannot rely on the information from IndexInfo of the
+ * old index as these have been flattened for the planner.
+ */
+ if (oldInfo->ii_Expressions != NIL)
+ {
+ Datum exprDatum;
+ char *exprString;
+
+ exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+ Anum_pg_index_indexprs);
+ exprString = TextDatumGetCString(exprDatum);
+ indexExprs = (List *) stringToNode(exprString);
+ pfree(exprString);
+ }
+ if (oldInfo->ii_Predicate != NIL)
+ {
+ Datum predDatum;
+ char *predString;
+
+ predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+ Anum_pg_index_indpred);
+ predString = TextDatumGetCString(predDatum);
+ indexPreds = (List *) stringToNode(predString);
+
+ /* Also convert to implicit-AND format */
+ indexPreds = make_ands_implicit((Expr *) indexPreds);
+ pfree(predString);
+ }
+
+ /*
+ * Build the index information for the new index. Note that rebuild of
+ * indexes with exclusion constraints is not supported, hence there is no
+ * need to fill all the ii_Exclusion* fields.
+ */
+ newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+ oldInfo->ii_NumIndexKeyAttrs,
+ STIR_AM_OID,
+ indexExprs,
+ indexPreds,
+ false, /* aux indexes are not unique */
+ oldInfo->ii_NullsNotDistinct,
+ false, /* not ready for inserts */
+ true,
+ false); /* aux indexes are not summarizing */
+
+ /*
+ * Extract the list of column names and the column numbers for the new
+ * index information. All this information will be used for the index
+ * creation.
+ */
+ for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+ {
+ TupleDesc indexTupDesc = RelationGetDescr(indexRelation);
+ Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+ indexColNames = lappend(indexColNames, NameStr(att->attname));
+ newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+ }
+
+ auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+ auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+ for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+ {
+ auxOpclassIds[i] = RECORD_STIR_OPS_OID;
+ auxColoptions[i] = 0;
+ }
+
+ newIndexId = index_create(heapRelation,
+ newName,
+ InvalidOid, /* indexRelationId */
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidRelFileNumber, /* relFileNumber */
+ newInfo,
+ indexColNames,
+ STIR_AM_OID,
+ tablespaceOid,
+ indexRelation->rd_indcollation,
+ auxOpclassIds,
+ NULL,
+ auxColoptions,
+ NULL,
+ (Datum) 0,
+ INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT,
+ 0,
+ true, /* allow table to be a system catalog? */
+ false, /* is_internal? */
+ NULL,
+ RELPERSISTENCE_UNLOGGED);
+
+ /* Close the relations used and clean up */
+ index_close(indexRelation, NoLock);
+ ReleaseSysCache(indexTuple);
+
return newIndexId;
}
@@ -1488,9 +1626,7 @@
int save_nestlevel;
Relation indexRelation;
IndexInfo *indexInfo;
-
- /* This had better make sure that a snapshot is active */
- Assert(ActiveSnapshotSet());
+ Snapshot snapshot = InvalidSnapshot;
/* Open and lock the parent heap relation */
heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1508,6 +1644,12 @@
indexRelation = index_open(indexRelationId, RowExclusiveLock);
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ Assert(!TransactionIdIsValid(MyProc->xid));
+
+	/* BuildIndexInfo requires a snapshot for expressions and predicates */
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
/*
* We have to re-build the IndexInfo struct, since it was lost in the
* commit of the transaction where this concurrent index was created at
@@ -1518,11 +1660,17 @@
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+ snapshot = InvalidSnapshot;
+
/* Now build the index */
- index_build(heapRel, indexRelation, indexInfo, false, true);
+ index_build(heapRel, indexRelation, indexInfo, false, true);
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+
/* Roll back any GUC changes executed by index functions */
- AtEOXact_GUC(false, save_nestlevel);
+ AtEOXact_GUC(false, save_nestlevel);
/* Restore userid and security context */
SetUserIdAndSecContext(save_userid, save_sec_context);
@@ -3177,7 +3325,8 @@
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK */
+ true, /* syncscan OK */
+	 false); /* no snapshot reset */
while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
{
@@ -3288,34 +3437,59 @@
* making the table append-only by setting use_fsm). However that would
* add yet more locking issues.
*/
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
{
Relation heapRelation,
- indexRelation;
- IndexInfo *indexInfo;
- IndexVacuumInfo ivinfo;
- ValidateIndexState state;
+ indexRelation,
+ auxIndexRelation;
+ IndexInfo *indexInfo,
+ *auxIndexInfo;
+ Snapshot snapshot;
+ TransactionId limitXmin;
+ IndexVacuumInfo ivinfo, auxivinfo;
+ ValidateIndexState state, auxState;
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+	int main_work_mem_part = (maintenance_work_mem * 8) / 10; /* ~80% for the main index sort */
{
const int progress_index[] = {
- PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_TUPLES_DONE,
- PROGRESS_CREATEIDX_TUPLES_TOTAL,
- PROGRESS_SCAN_BLOCKS_DONE,
- PROGRESS_SCAN_BLOCKS_TOTAL
+ PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_TUPLES_DONE,
+ PROGRESS_CREATEIDX_TUPLES_TOTAL,
+ PROGRESS_SCAN_BLOCKS_DONE,
+ PROGRESS_SCAN_BLOCKS_TOTAL
};
const int64 progress_vals[] = {
- PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN,
- 0, 0, 0, 0
+ PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN,
+ 0, 0, 0, 0
};
pgstat_progress_update_multi_param(5, progress_index, progress_vals);
}
+ /*
+	 * Now take the "reference snapshot" that will be used below
+ * to filter candidate tuples. Beware! There might still be snapshots in
+ * use that treat some transaction as in-progress that our reference
+ * snapshot treats as committed. If such a recently-committed transaction
+ * deleted tuples in the table, we will not include them in the index; yet
+ * those transactions which see the deleting one as still-in-progress will
+ * expect such tuples to be there once we mark the index as valid.
+ *
+ * We solve this by waiting for all endangered transactions to exit before
+ * we mark the index as valid.
+ *
+ * We also set ActiveSnapshot to this snap, since functions in indexes may
+ * need a snapshot.
+ */
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+
+ Assert(TransactionIdIsValid(MyProc->xmin));
+
/* Open and lock the parent heap relation */
heapRelation = table_open(heapId, ShareUpdateExclusiveLock);
@@ -3331,6 +3505,7 @@
RestrictSearchPath();
indexRelation = index_open(indexId, RowExclusiveLock);
+ auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
/*
* Fetch info needed for index_insert. (You might think this should be
@@ -3338,9 +3513,11 @@
* been built in a previous transaction.)
*/
indexInfo = BuildIndexInfo(indexRelation);
+ auxIndexInfo = BuildIndexInfo(auxIndexRelation);
/* mark build is concurrent just for consistency */
indexInfo->ii_Concurrent = true;
+ auxIndexInfo->ii_Concurrent = true;
/*
* Scan the index and gather up all the TIDs into a tuplesort object.
@@ -3353,6 +3530,10 @@
ivinfo.message_level = DEBUG2;
ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
ivinfo.strategy = NULL;
+ ivinfo.validate_index = true;
+
+ auxivinfo = ivinfo;
+ auxivinfo.index = auxIndexRelation;
/*
* Encode TIDs as int8 values for the sort, rather than directly sorting
@@ -3360,9 +3541,27 @@
* is a pass-by-reference type on all platforms, whereas int8 is
* pass-by-value on most platforms.
*/
+ auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+ InvalidOid, false,
+ maintenance_work_mem - main_work_mem_part,
+ NULL, TUPLESORT_NONE);
+ auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+ (void) index_bulk_delete(&auxivinfo, NULL,
+ validate_index_callback, (void *) &auxState);
+
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+
+ snapshot = RegisterSnapshot(GetLatestSnapshot());
+ PushActiveSnapshot(snapshot);
+
state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
InvalidOid, false,
- maintenance_work_mem,
+ main_work_mem_part,
NULL, TUPLESORT_NONE);
state.htups = state.itups = state.tups_inserted = 0;
@@ -3370,38 +3569,63 @@
(void) index_bulk_delete(&ivinfo, NULL,
validate_index_callback, (void *) &state);
+
/* Execute the sort */
{
const int progress_index[] = {
- PROGRESS_CREATEIDX_PHASE,
- PROGRESS_SCAN_BLOCKS_DONE,
- PROGRESS_SCAN_BLOCKS_TOTAL
+ PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_SCAN_BLOCKS_DONE,
+ PROGRESS_SCAN_BLOCKS_TOTAL
};
const int64 progress_vals[] = {
- PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT,
- 0, 0
+ PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT,
+ 0, 0
};
pgstat_progress_update_multi_param(3, progress_index, progress_vals);
}
tuplesort_performsort(state.tuplesort);
+ tuplesort_performsort(auxState.tuplesort);
+
+ /*
+ * Drop the reference snapshot. We must do this before waiting out other
+ * snapshot holders, else we will deadlock against other processes also
+ * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+ * they must wait for. But first, save the snapshot's xmin to use as
+ * limitXmin for GetCurrentVirtualXIDs().
+ */
+ limitXmin = snapshot->xmin;
+
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+ snapshot = InvalidSnapshot;
+
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+
/*
* Now scan the heap and "merge" it with the index
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
- table_index_validate_scan(heapRelation,
+ limitXmin = TransactionIdNewer(limitXmin, table_index_validate_scan(heapRelation,
indexRelation,
+ auxIndexRelation,
indexInfo,
- snapshot,
- &state);
+ auxIndexInfo,
+ snapshot, /* may be invalid */
+ &state,
+ &auxState));
/* Done with tuplesort object */
tuplesort_end(state.tuplesort);
+ tuplesort_end(auxState.tuplesort);
/* Make sure to release resources cached in indexInfo (if needed). */
index_insert_cleanup(indexRelation, indexInfo);
+ index_insert_cleanup(auxIndexRelation, auxIndexInfo);
elog(DEBUG2,
"validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
@@ -3414,8 +3638,13 @@
SetUserIdAndSecContext(save_userid, save_sec_context);
/* Close rels, but keep locks */
+ index_close(auxIndexRelation, NoLock);
index_close(indexRelation, NoLock);
table_close(heapRelation, NoLock);
+
+ Assert(!HaveRegisteredOrActiveSnapshot());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ return limitXmin;
}
/*
@@ -3466,6 +3695,12 @@
Assert(!indexForm->indisready);
Assert(!indexForm->indisvalid);
indexForm->indisready = true;
+ break;
+ case INDEX_DROP_CLEAR_READY:
+ Assert(indexForm->indislive);
+ Assert(indexForm->indisready);
+ Assert(!indexForm->indisvalid);
+ indexForm->indisready = false;
break;
case INDEX_CREATE_SET_VALID:
/* Set indisvalid during a CREATE INDEX CONCURRENTLY sequence */
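To illustrate the idea behind the validate_index() changes above: TIDs are collected from both the main index and the auxiliary STIR index into two sorted streams, and the table scan then has to recheck (and possibly insert) the TIDs known only to the auxiliary index. Here is a toy standalone sketch of that merge, not part of the patch - plain arrays stand in for the two tuplesorts, and the merge rule is my reading of the code, not a confirmed one:

#include <stdio.h>
#include <stdint.h>

/* Merge two sorted TID streams; report TIDs present only in the aux one. */
static void
report_missing(const int64_t *main_tids, int nmain,
               const int64_t *aux_tids, int naux)
{
    int i = 0, j = 0;

    while (j < naux)
    {
        if (i < nmain && main_tids[i] < aux_tids[j])
            i++;                        /* only in main: already indexed */
        else if (i < nmain && main_tids[i] == aux_tids[j])
        {
            i++;                        /* in both: nothing to do */
            j++;
        }
        else
            printf("recheck/insert TID %lld\n",
                   (long long) aux_tids[j++]);      /* only in aux */
    }
}

int
main(void)
{
    int64_t main_tids[] = {10, 11, 42};
    int64_t aux_tids[] = {11, 42, 57, 58};  /* 57 and 58 arrived during the build */

    report_missing(main_tids, 3, aux_tids, 4);
    return 0;
}

The 80/20 split of maintenance_work_mem above follows the same asymmetry, presumably because the main index sort sees the whole relation while the auxiliary one only sees tuples inserted during the build.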
Index: src/backend/catalog/toasting.c
===================================================================
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
--- a/src/backend/catalog/toasting.c (revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/backend/catalog/toasting.c (revision 6973360aaf4eb9012a60a5f2d5d46f022ac2d38c)
@@ -324,7 +324,8 @@
BTREE_AM_OID,
rel->rd_rel->reltablespace,
collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
- INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+ INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+ toast_rel->rd_rel->relpersistence);
table_close(toast_rel, NoLock);
Index: src/backend/commands/indexcmds.c
===================================================================
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
--- a/src/backend/commands/indexcmds.c (revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/backend/commands/indexcmds.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -69,6 +69,7 @@
#include "utils/regproc.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* non-export function prototypes */
@@ -112,7 +113,6 @@
Oid relationOid,
const ReindexParams *params);
static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
/*
* callback argument type for RangeVarCallbackForReindexIndex()
@@ -428,8 +428,7 @@
VirtualTransactionId *old_snapshots;
old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
- PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
- | PROC_IN_SAFE_IC,
+ PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
&n_old_snapshots);
if (progress)
pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -449,8 +448,7 @@
newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
true, false,
- PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
- | PROC_IN_SAFE_IC,
+ PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
&n_newer_snapshots);
for (j = i; j < n_old_snapshots; j++)
{
@@ -542,7 +540,9 @@
{
bool concurrent;
char *indexRelationName;
+ char *auxIndexRelationName;
char *accessMethodName;
+ Oid auxIndexRelationId;
Oid *typeIds;
Oid *collationIds;
Oid *opclassIds;
@@ -561,7 +561,6 @@
bool amissummarizing;
amoptions_function amoptions;
bool partitioned;
- bool safe_index;
Datum reloptions;
int16 *coloptions;
IndexInfo *indexInfo;
@@ -571,10 +570,10 @@
int numberOfKeyAttributes;
TransactionId limitXmin;
ObjectAddress address;
+ ObjectAddress auxAddress;
LockRelId heaprelid;
LOCKTAG heaplocktag;
LOCKMODE lockmode;
- Snapshot snapshot;
Oid root_save_userid;
int root_save_sec_context;
int root_save_nestlevel;
@@ -808,6 +807,7 @@
* Select name for index if caller didn't specify
*/
indexRelationName = stmt->idxname;
+ auxIndexRelationName = NULL;
if (indexRelationName == NULL)
indexRelationName = ChooseIndexName(RelationGetRelationName(rel),
namespaceId,
@@ -815,6 +815,12 @@
stmt->excludeOpNames,
stmt->primary,
stmt->isconstraint);
+ if (concurrent)
+ auxIndexRelationName = ChooseRelationName(indexRelationName,
+ NULL,
+ "ccaux",
+ namespaceId,
+ false);
/*
* look up the access method, verify it can handle the requested features
@@ -1116,10 +1122,6 @@
}
}
- /* Is index safe for others to ignore? See set_indexsafe_procflags() */
- safe_index = indexInfo->ii_Expressions == NIL &&
- indexInfo->ii_Predicate == NIL;
-
/*
* Report index creation if appropriate (delay this till after most of the
* error checks)
@@ -1199,7 +1201,8 @@
coloptions, NULL, reloptions,
flags, constr_flags,
allowSystemTableMods, !check_rights,
- &createdConstraintId);
+ &createdConstraintId,
+ rel->rd_rel->relpersistence);
ObjectAddressSet(address, RelationRelationId, indexRelationId);
@@ -1595,6 +1598,28 @@
return address;
}
+ else
+ {
+ Oid save_userid;
+ int save_sec_context;
+ int save_nestlevel;
+
+ GetUserIdAndSecContext(&save_userid, &save_sec_context);
+ SetUserIdAndSecContext(rel->rd_rel->relowner,
+ save_sec_context | SECURITY_RESTRICTED_OPERATION);
+ save_nestlevel = NewGUCNestLevel();
+ RestrictSearchPath();
+
+ auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+ tablespaceId, auxIndexRelationName);
+ ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+
+ /* Roll back any GUC changes executed by index functions */
+ AtEOXact_GUC(false, save_nestlevel);
+
+ /* Restore userid and security context */
+ SetUserIdAndSecContext(save_userid, save_sec_context);
+ }
/* save lockrelid and locktag for below, then close rel */
heaprelid = rel->rd_lockInfo.lockRelId;
@@ -1626,11 +1651,18 @@
PopActiveSnapshot();
CommitTransactionCommand();
- StartTransactionCommand();
+
+ {
+ StartTransactionCommand();
- /* Tell concurrent index builds to ignore us, if index qualifies */
- if (safe_index)
- set_indexsafe_procflags();
+ WaitForLockers(heaplocktag, ShareLock, true);
+ index_concurrently_build(tableId, auxIndexRelationId);
+
+ CommitTransactionCommand();
+ }
+
+ StartTransactionCommand();
+
/*
* The index is now visible, so we can report the OID. While on it,
@@ -1685,25 +1717,15 @@
* HOT-chain or the extension of the chain is HOT-safe for this index.
*/
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/* Perform concurrent build of index */
index_concurrently_build(tableId, indexRelationId);
- /* we can do away with our snapshot */
- PopActiveSnapshot();
-
/*
* Commit this transaction to make the indisready update visible.
*/
CommitTransactionCommand();
StartTransactionCommand();
- /* Tell concurrent index builds to ignore us, if index qualifies */
- if (safe_index)
- set_indexsafe_procflags();
-
/*
* Phase 3 of concurrent index build
*
@@ -1713,41 +1735,17 @@
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_2);
WaitForLockers(heaplocktag, ShareLock, true);
+ index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+ CommitTransactionCommand();
- /*
- * Now take the "reference snapshot" that will be used by validate_index()
- * to filter candidate tuples. Beware! There might still be snapshots in
- * use that treat some transaction as in-progress that our reference
- * snapshot treats as committed. If such a recently-committed transaction
- * deleted tuples in the table, we will not include them in the index; yet
- * those transactions which see the deleting one as still-in-progress will
- * expect such tuples to be there once we mark the index as valid.
- *
- * We solve this by waiting for all endangered transactions to exit before
- * we mark the index as valid.
- *
- * We also set ActiveSnapshot to this snap, since functions in indexes may
- * need a snapshot.
- */
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
+ StartTransactionCommand();
/*
* Scan the index and the heap, insert any missing index entries.
*/
- validate_index(tableId, indexRelationId, snapshot);
+ limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
+ Assert(!TransactionIdIsValid(MyProc->xmin));
- /*
- * Drop the reference snapshot. We must do this before waiting out other
- * snapshot holders, else we will deadlock against other processes also
- * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
- * they must wait for. But first, save the snapshot's xmin to use as
- * limitXmin for GetCurrentVirtualXIDs().
- */
- limitXmin = snapshot->xmin;
-
- PopActiveSnapshot();
- UnregisterSnapshot(snapshot);
/*
* The snapshot subsystem could still contain registered snapshots that
@@ -1758,14 +1756,32 @@
* transaction, and do our wait before any snapshot has been taken in it.
*/
CommitTransactionCommand();
+
+ {
+ StartTransactionCommand();
+ index_concurrently_set_dead(tableId, auxIndexRelationId);
+ CommitTransactionCommand();
+ }
+
+ WaitForLockers(heaplocktag, ShareLock, true);
+
+ {
+ StartTransactionCommand();
+
+ /*
+ * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+ * right lock level.
+ */
+ performDeletion(&auxAddress, DROP_RESTRICT,
+ PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
+ CommitTransactionCommand();
+ }
+
StartTransactionCommand();
- /* Tell concurrent index builds to ignore us, if index qualifies */
- if (safe_index)
- set_indexsafe_procflags();
-
/* We should now definitely not be advertising any xmin. */
- Assert(MyProc->xmin == InvalidTransactionId);
+ Assert(MyProc->xmin == InvalidTransactionId && MyProc->catalogXmin == InvalidTransactionId);
/*
* The index is now valid in the sense that it contains all currently
@@ -3431,9 +3447,9 @@
typedef struct ReindexIndexInfo
{
Oid indexId;
+ Oid auxIndexId;
Oid tableId;
Oid amId;
- bool safe; /* for set_indexsafe_procflags */
} ReindexIndexInfo;
List *heapRelationIds = NIL;
List *indexIds = NIL;
@@ -3558,6 +3574,7 @@
oldcontext = MemoryContextSwitchTo(private_context);
idx = palloc_object(ReindexIndexInfo);
+ idx->auxIndexId = InvalidOid;
idx->indexId = cellOid;
/* other fields set later */
@@ -3608,6 +3625,7 @@
oldcontext = MemoryContextSwitchTo(private_context);
idx = palloc_object(ReindexIndexInfo);
+ idx->auxIndexId = InvalidOid;
idx->indexId = cellOid;
indexIds = lappend(indexIds, idx);
/* other fields set later */
@@ -3689,6 +3707,7 @@
* that invalid indexes are allowed here.
*/
idx = palloc_object(ReindexIndexInfo);
+ idx->auxIndexId = InvalidOid;
idx->indexId = relationOid;
indexIds = lappend(indexIds, idx);
/* other fields set later */
@@ -3754,15 +3773,18 @@
foreach(lc, indexIds)
{
char *concurrentName;
+ char *auxConcurrentName;
ReindexIndexInfo *idx = lfirst(lc);
ReindexIndexInfo *newidx;
Oid newIndexId;
+ Oid auxIndexId;
Relation indexRel;
Relation heapRel;
Oid save_userid;
int save_sec_context;
int save_nestlevel;
Relation newIndexRel;
+ Relation auxIndexRel;
LockRelId *lockrelid;
Oid tablespaceid;
@@ -3781,9 +3803,6 @@
save_nestlevel = NewGUCNestLevel();
RestrictSearchPath();
- /* determine safety of this index for set_indexsafe_procflags */
- idx->safe = (indexRel->rd_indexprs == NIL &&
- indexRel->rd_indpred == NIL);
idx->tableId = RelationGetRelid(heapRel);
idx->amId = indexRel->rd_rel->relam;
@@ -3805,6 +3824,11 @@
"ccnew",
get_rel_namespace(indexRel->rd_index->indrelid),
false);
+ auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+ NULL,
+ "ccaux",
+ get_rel_namespace(indexRel->rd_index->indrelid),
+ false);
/* Choose the new tablespace, indexes of toast tables are not moved */
if (OidIsValid(params->tablespaceOid) &&
@@ -3819,11 +3843,17 @@
tablespaceid,
concurrentName);
+ auxIndexId = index_concurrently_create_aux(heapRel,
+ idx->indexId,
+ tablespaceid,
+ auxConcurrentName);
+
/*
* Now open the relation of the new index, a session-level lock is
* also needed on it.
*/
newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+ auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
/*
* Save the list of OIDs and locks in private context
@@ -3831,8 +3861,8 @@
oldcontext = MemoryContextSwitchTo(private_context);
newidx = palloc_object(ReindexIndexInfo);
+ newidx->auxIndexId = auxIndexId;
newidx->indexId = newIndexId;
- newidx->safe = idx->safe;
newidx->tableId = idx->tableId;
newidx->amId = idx->amId;
@@ -3850,10 +3880,14 @@
lockrelid = palloc_object(LockRelId);
*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
relationLocks = lappend(relationLocks, lockrelid);
+ lockrelid = palloc_object(LockRelId);
+ *lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+ relationLocks = lappend(relationLocks, lockrelid);
MemoryContextSwitchTo(oldcontext);
index_close(indexRel, NoLock);
+ index_close(auxIndexRel, NoLock);
index_close(newIndexRel, NoLock);
/* Roll back any GUC changes executed by index functions */
@@ -3919,6 +3953,27 @@
PopActiveSnapshot();
CommitTransactionCommand();
+
+ {
+ StartTransactionCommand();
+ WaitForLockersMultiple(lockTags, ShareLock, true);
+ CommitTransactionCommand();
+ }
+
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+
+ StartTransactionCommand();
+
+ CHECK_FOR_INTERRUPTS();
+
+		/* Build the auxiliary index; this is fast, since no heap scan is needed and it starts out empty. */
+ index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+ CommitTransactionCommand();
+ }
+
StartTransactionCommand();
/*
@@ -3955,13 +4010,6 @@
*/
CHECK_FOR_INTERRUPTS();
- /* Tell concurrent indexing to ignore us, if index qualifies */
- if (newidx->safe)
- set_indexsafe_procflags();
-
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -3976,7 +4024,6 @@
/* Perform concurrent build of new index */
index_concurrently_build(newidx->tableId, newidx->indexId);
- PopActiveSnapshot();
CommitTransactionCommand();
}
@@ -3999,12 +4046,21 @@
PROGRESS_CREATEIDX_PHASE_WAIT_2);
WaitForLockersMultiple(lockTags, ShareLock, true);
CommitTransactionCommand();
+
+ StartTransactionCommand();
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+ CHECK_FOR_INTERRUPTS();
+
+ index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+ }
+ CommitTransactionCommand();
foreach(lc, newIndexIds)
{
ReindexIndexInfo *newidx = lfirst(lc);
TransactionId limitXmin;
- Snapshot snapshot;
StartTransactionCommand();
@@ -4015,17 +4071,6 @@
*/
CHECK_FOR_INTERRUPTS();
- /* Tell concurrent indexing to ignore us, if index qualifies */
- if (newidx->safe)
- set_indexsafe_procflags();
-
- /*
- * Take the "reference snapshot" that will be used by validate_index()
- * to filter candidate tuples.
- */
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4037,16 +4082,9 @@
progress_vals[3] = newidx->amId;
pgstat_progress_update_multi_param(4, progress_index, progress_vals);
- validate_index(newidx->tableId, newidx->indexId, snapshot);
+ limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
- /*
- * We can now do away with our active snapshot, we still need to save
- * the xmin limit to wait for older snapshots.
- */
- limitXmin = snapshot->xmin;
-
- PopActiveSnapshot();
- UnregisterSnapshot(snapshot);
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/*
* To ensure no deadlocks, we must commit and start yet another
@@ -4085,13 +4123,6 @@
StartTransactionCommand();
- /*
- * Because this transaction only does catalog manipulations and doesn't do
- * any index operations, we can set the PROC_IN_SAFE_IC flag here
- * unconditionally.
- */
- set_indexsafe_procflags();
-
forboth(lc, indexIds, lc2, newIndexIds)
{
ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4171,6 +4202,16 @@
index_concurrently_set_dead(oldidx->tableId, oldidx->indexId);
}
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+ }
+
/* Commit this transaction to make the updates visible. */
CommitTransactionCommand();
StartTransactionCommand();
@@ -4204,6 +4245,18 @@
object.classId = RelationRelationId;
object.objectId = idx->indexId;
object.objectSubId = 0;
+
+ add_exact_object_address(&object, objects);
+ }
+
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *idx = lfirst(lc);
+ ObjectAddress object;
+
+ object.classId = RelationRelationId;
+ object.objectId = idx->auxIndexId;
+ object.objectSubId = 0;
add_exact_object_address(&object, objects);
}
@@ -4424,37 +4477,3 @@
heap_freetuple(tup);
table_close(classRel, RowExclusiveLock);
}
-
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots. On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial. Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
- /*
- * This should only be called before installing xid or xmin in MyProc;
- * otherwise, concurrent processes could see an Xmin that moves backwards.
- */
- Assert(MyProc->xid == InvalidTransactionId &&
- MyProc->xmin == InvalidTransactionId);
-
- LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
- MyProc->statusFlags |= PROC_IN_SAFE_IC;
- ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
- LWLockRelease(ProcArrayLock);
-}
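To make the revised phase 3 easier to follow: validate_index() now returns limitXmin itself, and the wait that used to be driven by the reference snapshot is driven by that value instead. A toy model of the wait-list computation, not PostgreSQL code - real comparisons go through the wraparound-aware TransactionIdPrecedesOrEquals()/TransactionIdOlder(), plain uint32 compares are used here only to keep the sketch short:

#include <stdio.h>
#include <stdint.h>

#define InvalidTransactionId 0

/* stand-in for TransactionIdOlder(): invalid loses to any valid xid */
static uint32_t
older(uint32_t a, uint32_t b)
{
    if (a == InvalidTransactionId)
        return b;
    if (b == InvalidTransactionId)
        return a;
    return a < b ? a : b;
}

int
main(void)
{
    uint32_t limitXmin = 1200;
    /* hypothetical backends: {xmin, catalogXmin} */
    uint32_t procs[][2] = {{0, 0}, {1100, 0}, {0, 1200}, {1300, 1250}};

    for (int i = 0; i < 4; i++)
    {
        uint32_t pxmin = older(procs[i][0], procs[i][1]);

        if (pxmin == InvalidTransactionId)
            continue;               /* cf. excludeXmin0: no snapshot at all */
        if (pxmin <= limitXmin)
            printf("must wait for backend %d (effective xmin %u)\n", i, pxmin);
    }
    return 0;
}

Note that with PROC_IN_SAFE_IC gone, other concurrent index builds are no longer filtered out of this list; that seems to be the trade-off the patch makes, since their xmin now advances often enough (snapshot resets) that waiting for them stays cheap.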
Index: src/bin/pg_amcheck/t/006_concurrently.pl
===================================================================
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
--- /dev/null (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -0,0 +1,307 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($@ || $windows_os)
+{
+ plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key, $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = int(rand(1000000));
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+ c1 money default 0, c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+ # fork returned undef, so unsuccessful
+ die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+ $node->pgbench(
+ '--no-vacuum --client=10 --transactions=10000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent INSERTs, UPDATES and RC',
+ {
+ '001_pgbench_concurrent_transaction_inserts' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ ),
+ '002_pgbench_concurrent_transaction_inserts' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ ),
+ # Ensure some HOT updates happen
+ '003_pgbench_concurrent_transaction_updates' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ )
+ });
+
+ if ($child->is_passing()) {
+ shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+ } else {
+ shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+ }
+
+ my $pg_bench_fork_flag;
+ while (1) {
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+		usleep(100_000); # core sleep() truncates 0.1 to 0 and busy-loops
+ last if $pg_bench_fork_flag eq "stop";
+ }
+} else {
+ my $pg_bench_fork_flag;
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+ subtest 'reindex run subtest' => sub {
+ is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+ my %psql = (stdin => '', stdout => '', stderr => '');
+ $psql{run} = IPC::Run::start(
+ [ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+ '<',
+ \$psql{stdin},
+ '>',
+ \$psql{stdout},
+ '2>',
+ \$psql{stderr},
+ $psql_timeout);
+
+ my ($result, $stdout, $stderr, $n, $stderr_saved);
+ $n = 0;
+
+ $node->psql('postgres', q(CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN true;
+ END; $$;));
+
+ $node->psql('postgres', q(CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ RETURN MOD($1, 2) = 0;
+ END; $$;));
+ while (1)
+ {
+
+ if (int(rand(2)) == 0) {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=1);));
+ } else {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+ }
+ is($result, '0', 'ALTER TABLE is correct');
+
+ if (1)
+ {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+ is($result, '0', 'REINDEX is correct');
+
+ if ($result) {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true, rootdescend => true, checkunique => true);));
+ is($result, '0', 'bt_index_check is correct');
+ if ($result)
+ {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ } else {
+ diag('reindex:)' . $n++);
+ }
+ }
+
+ if (1)
+ {
+ my $variant = int(rand(7));
+ my $sql;
+ if ($variant == 0) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at););
+ } elsif ($variant == 1) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable(););
+ } elsif ($variant == 2) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;);
+ } elsif ($variant == 3) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i););
+ } elsif ($variant == 4) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i)););
+ } elsif ($variant == 5) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i););
+ } elsif ($variant == 6) {
+ $sql = q(CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+ } else { diag("wrong variant"); }
+
+ diag($sql);
+ ($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+ is($result, '0', 'CREATE INDEX is correct');
+ $stderr_saved = $stderr;
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+ is($result, '0', 'bt_index_check for new index is correct');
+ if ($result)
+ {
+ diag($stderr);
+ diag($stderr_saved);
+ BAIL_OUT($stderr);
+ } else {
+ diag('create:)' . $n++);
+ }
+
+ if (1)
+ {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'REINDEX 2 is correct');
+ if ($result) {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+ is($result, '0', 'bt_index_check 2 is correct');
+ if ($result)
+ {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ } else {
+ diag('reindex2:)' . $n++);
+ }
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'DROP INDEX is correct');
+ }
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ last if $pg_bench_fork_flag ne "wait";
+ }
+
+ # explicitly shut down psql instances gracefully
+ $psql{stdin} .= "\\q\n";
+ $psql{run}->finish;
+
+ is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+ };
+
+ $child->finalize();
+ $child->summary();
+ $node->stop;
+ done_testing();
+
+ shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+ my ($psql, $query, $untl) = @_;
+ my $ret;
+
+ # For each query we run, we'll restart the timeout. Otherwise the timeout
+ # would apply to the whole test script, and would need to be set very high
+ # to survive when running under Valgrind.
+ $psql_timeout->reset();
+ $psql_timeout->start();
+
+ # send query
+ $$psql{stdin} .= $query;
+ $$psql{stdin} .= "\n";
+
+ # wait for query results
+ $$psql{run}->pump_nb();
+ while (1)
+ {
+ last if $$psql{stdout} =~ /$untl/;
+ if ($psql_timeout->is_expired)
+ {
+ diag("aborting wait: program timed out\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ if (not $$psql{run}->pumpable())
+ {
+ diag("aborting wait: program died\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ $$psql{run}->pump();
+ }
+
+ $$psql{stdout} = '';
+
+ return 1;
+}
Index: src/include/access/tableam.h
===================================================================
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
--- a/src/include/access/tableam.h (revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/include/access/tableam.h (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -24,6 +24,7 @@
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
+#include "utils/injection_point.h"
#define DEFAULT_TABLE_ACCESS_METHOD "heap"
@@ -70,6 +71,7 @@
* needed. If table data may be needed, set SO_NEED_TUPLES.
*/
SO_NEED_TUPLES = 1 << 10,
+ SO_RESET_SNAPSHOT = 1 << 11,
} ScanOptions;
/*
@@ -703,11 +705,14 @@
TableScanDesc scan);
/* see table_index_validate_scan for reference about parameters */
- void (*index_validate_scan) (Relation table_rel,
+ TransactionId (*index_validate_scan) (Relation table_rel,
Relation index_rel,
+ Relation aux_index_rel,
struct IndexInfo *index_info,
+ struct IndexInfo *aux_index_info,
Snapshot snapshot,
- struct ValidateIndexState *state);
+ struct ValidateIndexState *state,
+ struct ValidateIndexState *aux_state);
/* ------------------------------------------------------------------------
@@ -931,7 +936,8 @@
static inline TableScanDesc
table_beginscan_strat(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
- bool allow_strat, bool allow_sync)
+ bool allow_strat, bool allow_sync,
+ bool reset_snapshot)
{
uint32 flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
@@ -939,6 +945,11 @@
flags |= SO_ALLOW_STRAT;
if (allow_sync)
flags |= SO_ALLOW_SYNC;
+ if (reset_snapshot)
+ {
+ INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+ flags |= (SO_RESET_SNAPSHOT | SO_TEMP_SNAPSHOT);
+ }
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1835,19 +1846,26 @@
*
* See validate_index() for an explanation.
*/
-static inline void
+static inline TransactionId
table_index_validate_scan(Relation table_rel,
- Relation index_rel,
- struct IndexInfo *index_info,
- Snapshot snapshot,
- struct ValidateIndexState *state)
+ Relation index_rel,
+ Relation aux_index_rel,
+ struct IndexInfo *index_info,
+ struct IndexInfo *aux_index_info,
+ Snapshot snapshot,
+ struct ValidateIndexState *state,
+ struct ValidateIndexState *auxstate)
{
- table_rel->rd_tableam->index_validate_scan(table_rel,
- index_rel,
- index_info,
- snapshot,
- state);
+ return table_rel->rd_tableam->index_validate_scan(table_rel,
+ index_rel,
+ aux_index_rel,
+ index_info,
+ aux_index_info,
+ snapshot,
+ state,
+ auxstate);
}
+
/* ----------------------------------------------------------------------------
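The SO_RESET_SNAPSHOT flag above is the mechanism behind the "swap the snapshot between pages" idea discussed earlier in the thread. A toy illustration, not PostgreSQL code - all names are stand-ins for the real snapmgr calls, and a page counter stands in for the 50 ms VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL timer:

#include <stdio.h>
#include <stdint.h>

static uint32_t current_xid = 1000;     /* pretend-global xid counter */

/* stand-in for dropping the old snapshot and taking GetLatestSnapshot() */
static uint32_t
take_snapshot(void)
{
    return current_xid;
}

int
main(void)
{
    uint32_t xmin = take_snapshot();

    for (int page = 0; page < 100; page++)
    {
        /* ... scan every tuple on this page under the current snapshot ... */
        current_xid += 7;               /* other backends keep committing */

        if (page % 10 == 9)             /* only ever reset between pages */
        {
            xmin = take_snapshot();
            printf("page %d: advertised xmin advanced to %u\n", page, xmin);
        }
    }
    return 0;
}

The point is simply that the backend's advertised xmin keeps advancing during a long validation scan instead of pinning the global horizon for the scan's whole duration.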
Index: src/include/catalog/index.h
===================================================================
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
--- a/src/include/catalog/index.h (revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/include/catalog/index.h (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -26,6 +26,7 @@
INDEX_CREATE_SET_READY,
INDEX_CREATE_SET_VALID,
INDEX_DROP_CLEAR_VALID,
+ INDEX_DROP_CLEAR_READY,
INDEX_DROP_SET_DEAD,
} IndexStateFlagsAction;
@@ -43,6 +44,8 @@
#define REINDEXOPT_MISSING_OK 0x04 /* skip missing relations */
#define REINDEXOPT_CONCURRENTLY 0x08 /* concurrent mode */
+#define VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL 50 /* 50 ms */
+
/* state info for validate_index bulkdelete callback */
typedef struct ValidateIndexState
{
@@ -86,7 +89,8 @@
bits16 constr_flags,
bool allow_system_table_mods,
bool is_internal,
- Oid *constraintId);
+ Oid *constraintId,
+ char relpersistence);
#define INDEX_CONSTR_CREATE_MARK_AS_PRIMARY (1 << 0)
#define INDEX_CONSTR_CREATE_DEFERRABLE (1 << 1)
@@ -98,6 +102,11 @@
Oid oldIndexId,
Oid tablespaceOid,
const char *newName);
+
+extern Oid index_concurrently_create_aux(Relation heapRelation,
+ Oid mainIndexId,
+ Oid tablespaceOid,
+ const char *newName);
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
@@ -144,7 +153,7 @@
bool isreindex,
bool parallel);
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
Index: src/include/commands/progress.h
===================================================================
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
--- a/src/include/commands/progress.h (revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/include/commands/progress.h (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
@@ -79,6 +79,7 @@
/* Progress parameters for CREATE INDEX */
/* 3, 4 and 5 reserved for "waitfor" metrics */
+/* TODO: new phase names */
#define PROGRESS_CREATEIDX_COMMAND 0
#define PROGRESS_CREATEIDX_INDEX_OID 6
#define PROGRESS_CREATEIDX_ACCESS_METHOD_OID 8
@@ -91,6 +92,7 @@
/* 15 and 16 reserved for "block number" metrics */
/* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
+/* TODO: new phase names */
#define PROGRESS_CREATEIDX_PHASE_WAIT_1 1
#define PROGRESS_CREATEIDX_PHASE_BUILD 2
#define PROGRESS_CREATEIDX_PHASE_WAIT_2 3
Index: src/test/regress/expected/create_index.out
===================================================================
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
--- a/src/test/regress/expected/create_index.out (revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/test/regress/expected/create_index.out (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -1405,6 +1405,7 @@
CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
ERROR: could not create unique index "concur_index3"
DETAIL: Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
-- test that expression indexes and partial indexes work concurrently
CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -2705,6 +2706,7 @@
CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
ERROR: could not create unique index "concur_reindex_ind5"
DETAIL: Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
-- Reindexing concurrently this index fails with the same failure.
-- The extra index created is itself invalid, and can be dropped.
REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -2717,8 +2719,10 @@
c1 | integer | | |
Indexes:
"concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+ "concur_reindex_ind5_ccaux" stir (c1 record_ops) INVALID
"concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
+DROP INDEX concur_reindex_ind5_ccaux;
DROP INDEX concur_reindex_ind5_ccnew;
-- This makes the previous failure go away, so the index can become valid.
DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
Index: src/test/regress/expected/indexing.out
===================================================================
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
--- a/src/test/regress/expected/indexing.out (revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/test/regress/expected/indexing.out (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
@@ -1571,10 +1571,11 @@
--------------------------------+------------+-----------------------+-------------------------------
parted_isvalid_idx | f | parted_isvalid_tab |
parted_isvalid_idx_11 | f | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux | f | parted_isvalid_tab_11 |
parted_isvalid_tab_12_expr_idx | t | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
parted_isvalid_tab_1_expr_idx | f | parted_isvalid_tab_1 | parted_isvalid_idx
parted_isvalid_tab_2_expr_idx | t | parted_isvalid_tab_2 | parted_isvalid_idx
-(5 rows)
+(6 rows)
drop table parted_isvalid_tab;
-- Check state of replica indexes when attaching a partition.
Index: src/test/regress/sql/create_index.sql
===================================================================
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
--- a/src/test/regress/sql/create_index.sql (revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/test/regress/sql/create_index.sql (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
@@ -493,6 +493,7 @@
INSERT INTO concur_heap VALUES ('b','x');
-- check if constraint is enforced properly at build time
CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
-- test that expression indexes and partial indexes work concurrently
CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1147,10 +1148,12 @@
INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
-- This trick creates an invalid index.
CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
-- Reindexing concurrently this index fails with the same failure.
-- The extra index created is itself invalid, and can be dropped.
REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
\d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
DROP INDEX concur_reindex_ind5_ccnew;
-- This makes the previous failure go away, so the index can become valid.
DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
Index: src/backend/access/transam/twophase.c
===================================================================
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
--- a/src/backend/access/transam/twophase.c (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/access/transam/twophase.c (revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -459,7 +459,7 @@
proc->vxid.procNumber = INVALID_PROC_NUMBER;
}
proc->xid = xid;
- Assert(proc->xmin == InvalidTransactionId);
+ Assert(proc->xmin == InvalidTransactionId && proc->catalogXmin == InvalidTransactionId);
proc->delayChkptFlags = 0;
proc->statusFlags = 0;
proc->pid = 0;
Index: src/backend/replication/logical/reorderbuffer.c
===================================================================
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
--- a/src/backend/replication/logical/reorderbuffer.c (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/replication/logical/reorderbuffer.c (revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -1844,6 +1844,7 @@
snap->active_count = 1; /* mark as active so nobody frees it */
snap->regd_count = 0;
snap->xip = (TransactionId *) (snap + 1);
+ snap->catalog = orig_snap->catalog;
memcpy(snap->xip, orig_snap->xip, sizeof(TransactionId) * snap->xcnt);
Index: src/backend/replication/logical/snapbuild.c
===================================================================
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
--- a/src/backend/replication/logical/snapbuild.c (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/replication/logical/snapbuild.c (revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -564,6 +564,7 @@
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+	snapshot->catalog = false; /* TODO: or true? */
return snapshot;
}
@@ -600,8 +601,8 @@
elog(ERROR, "cannot build an initial slot snapshot, not all transactions are monitored anymore");
/* so we don't overwrite the existing value */
- if (TransactionIdIsValid(MyProc->xmin))
- elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
+ if (TransactionIdIsValid(MyProc->xmin) || TransactionIdIsValid(MyProc->catalogXmin))
+ elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin or MyProc->catalogXmin already is valid");
snap = SnapBuildBuildSnapshot(builder);
@@ -622,7 +623,7 @@
elog(ERROR, "cannot build an initial slot snapshot as oldest safe xid %u follows snapshot's xmin %u",
safeXid, snap->xmin);
- MyProc->xmin = snap->xmin;
+ MyProc->xmin = MyProc->catalogXmin = snap->xmin;
/* allocate in transaction context */
newxip = (TransactionId *)
Index: src/backend/replication/walsender.c
===================================================================
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
--- a/src/backend/replication/walsender.c (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/replication/walsender.c (revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -305,7 +305,7 @@
*/
if (MyDatabaseId == InvalidOid)
{
- Assert(MyProc->xmin == InvalidTransactionId);
+ Assert(MyProc->xmin == InvalidTransactionId && MyProc->catalogXmin == InvalidTransactionId);
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
MyProc->statusFlags |= PROC_AFFECTS_ALL_HORIZONS;
ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
@@ -2498,7 +2498,7 @@
ReplicationSlot *slot = MyReplicationSlot;
SpinLockAcquire(&slot->mutex);
- MyProc->xmin = InvalidTransactionId;
+ MyProc->xmin = MyProc->catalogXmin = InvalidTransactionId;
/*
* For physical replication we don't need the interlock provided by xmin
@@ -2627,7 +2627,7 @@
if (!TransactionIdIsNormal(feedbackXmin)
&& !TransactionIdIsNormal(feedbackCatalogXmin))
{
- MyProc->xmin = InvalidTransactionId;
+ MyProc->xmin = MyProc->catalogXmin = InvalidTransactionId;
if (MyReplicationSlot != NULL)
PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
return;
@@ -2680,11 +2680,8 @@
PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
else
{
- if (TransactionIdIsNormal(feedbackCatalogXmin)
- && TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
- MyProc->xmin = feedbackCatalogXmin;
- else
- MyProc->xmin = feedbackXmin;
+ MyProc->catalogXmin = feedbackCatalogXmin;
+ MyProc->xmin = feedbackXmin;
}
}
Index: src/backend/storage/ipc/procarray.c
===================================================================
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
--- a/src/backend/storage/ipc/procarray.c (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/storage/ipc/procarray.c (revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -701,7 +701,7 @@
Assert(!proc->subxidStatus.overflowed);
proc->vxid.lxid = InvalidLocalTransactionId;
- proc->xmin = InvalidTransactionId;
+ proc->xmin = proc->catalogXmin = InvalidTransactionId;
/* be sure this is cleared in abort */
proc->delayChkptFlags = 0;
@@ -743,7 +743,7 @@
ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
proc->xid = InvalidTransactionId;
proc->vxid.lxid = InvalidLocalTransactionId;
- proc->xmin = InvalidTransactionId;
+ proc->xmin = proc->catalogXmin = InvalidTransactionId;
/* be sure this is cleared in abort */
proc->delayChkptFlags = 0;
@@ -930,7 +930,7 @@
proc->xid = InvalidTransactionId;
proc->vxid.lxid = InvalidLocalTransactionId;
- proc->xmin = InvalidTransactionId;
+ proc->xmin = proc->catalogXmin = InvalidTransactionId;
proc->recoveryConflictPending = false;
Assert(!(proc->statusFlags & PROC_VACUUM_STATE_MASK));
@@ -1739,8 +1739,6 @@
bool in_recovery = RecoveryInProgress();
TransactionId *other_xids = ProcGlobal->xids;
- /* inferred after ProcArrayLock is released */
- h->catalog_oldest_nonremovable = InvalidTransactionId;
LWLockAcquire(ProcArrayLock, LW_SHARED);
@@ -1761,6 +1759,7 @@
h->oldest_considered_running = initial;
h->shared_oldest_nonremovable = initial;
+ h->catalog_oldest_nonremovable = initial;
h->data_oldest_nonremovable = initial;
/*
@@ -1796,10 +1795,13 @@
int8 statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
TransactionId xmin;
+ TransactionId catalogXmin;
+ TransactionId olderXmin;
/* Fetch xid just once - see GetNewTransactionId */
xid = UINT32_ACCESS_ONCE(other_xids[index]);
xmin = UINT32_ACCESS_ONCE(proc->xmin);
+ catalogXmin = UINT32_ACCESS_ONCE(proc->catalogXmin);
/*
* Consider both the transaction's Xmin, and its Xid.
@@ -1809,11 +1811,14 @@
* some not-yet-set Xmin.
*/
xmin = TransactionIdOlder(xmin, xid);
+ catalogXmin = TransactionIdOlder(catalogXmin, xid);
/* if neither is set, this proc doesn't influence the horizon */
- if (!TransactionIdIsValid(xmin))
+ if (!TransactionIdIsValid(xmin) && !TransactionIdIsValid(catalogXmin))
continue;
+ olderXmin = TransactionIdOlder(xmin, catalogXmin);
+
/*
* Don't ignore any procs when determining which transactions might be
* considered running. While slots should ensure logical decoding
@@ -1821,7 +1826,7 @@
* include them here as well.
*/
h->oldest_considered_running =
- TransactionIdOlder(h->oldest_considered_running, xmin);
+ TransactionIdOlder(h->oldest_considered_running, olderXmin);
/*
* Skip over backends either vacuuming (which is ok with rows being
@@ -1833,7 +1838,7 @@
/* shared tables need to take backends in all databases into account */
h->shared_oldest_nonremovable =
- TransactionIdOlder(h->shared_oldest_nonremovable, xmin);
+ TransactionIdOlder(h->shared_oldest_nonremovable, olderXmin);
/*
* Normally sessions in other databases are ignored for anything but
@@ -1859,8 +1864,12 @@
(statusFlags & PROC_AFFECTS_ALL_HORIZONS) ||
in_recovery)
{
- h->data_oldest_nonremovable =
- TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+ if (TransactionIdIsValid(xmin))
+ h->data_oldest_nonremovable =
+ TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+ if (TransactionIdIsValid(olderXmin))
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable, olderXmin);
}
}
@@ -1885,6 +1894,8 @@
TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+ h->catalog_oldest_nonremovable =
+ TransactionIdOlder(h->catalog_oldest_nonremovable, kaxmin);
/* temp relations cannot be accessed in recovery */
}
@@ -1912,7 +1923,6 @@
h->shared_oldest_nonremovable =
TransactionIdOlder(h->shared_oldest_nonremovable,
h->slot_catalog_xmin);
- h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
h->catalog_oldest_nonremovable =
TransactionIdOlder(h->catalog_oldest_nonremovable,
h->slot_catalog_xmin);
@@ -2092,7 +2102,7 @@
* least in the case we already hold a snapshot), but that's for another day.
*/
static bool
-GetSnapshotDataReuse(Snapshot snapshot)
+GetSnapshotDataReuse(Snapshot snapshot, bool catalog)
{
uint64 curXactCompletionCount;
@@ -2101,6 +2111,9 @@
if (unlikely(snapshot->snapXactCompletionCount == 0))
return false;
+ if (unlikely(snapshot->catalog != catalog))
+ return false;
+
curXactCompletionCount = TransamVariables->xactCompletionCount;
if (curXactCompletionCount != snapshot->snapXactCompletionCount)
return false;
@@ -2125,8 +2138,19 @@
* requirement that concurrent GetSnapshotData() calls yield the same
* xmin.
*/
- if (!TransactionIdIsValid(MyProc->xmin))
- MyProc->xmin = TransactionXmin = snapshot->xmin;
+ if (!catalog)
+ {
+ if (!TransactionIdIsValid(MyProc->xmin))
+ MyProc->xmin = snapshot->xmin;
+ }
+ else
+ {
+ if (!TransactionIdIsValid(MyProc->catalogXmin))
+ MyProc->catalogXmin = snapshot->xmin;
+ }
+
+ if (!TransactionIdIsValid(TransactionXmin))
+ TransactionXmin = snapshot->xmin;
RecentXmin = snapshot->xmin;
Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
@@ -2173,8 +2197,8 @@
* Note: this function should probably not be called with an argument that's
* not statically allocated (see xip allocation below).
*/
-Snapshot
-GetSnapshotData(Snapshot snapshot)
+static Snapshot
+GetSnapshotDataImpl(Snapshot snapshot, bool catalog)
{
ProcArrayStruct *arrayP = procArray;
TransactionId *other_xids = ProcGlobal->xids;
@@ -2232,7 +2256,7 @@
*/
LWLockAcquire(ProcArrayLock, LW_SHARED);
- if (GetSnapshotDataReuse(snapshot))
+ if (GetSnapshotDataReuse(snapshot, catalog))
{
LWLockRelease(ProcArrayLock);
return snapshot;
@@ -2412,8 +2436,18 @@
replication_slot_xmin = procArray->replication_slot_xmin;
replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
- if (!TransactionIdIsValid(MyProc->xmin))
- MyProc->xmin = TransactionXmin = xmin;
+ if (!catalog)
+ {
+ if (!TransactionIdIsValid(MyProc->xmin))
+ MyProc->xmin = xmin;
+ }
+ else
+ {
+ if (!TransactionIdIsValid(MyProc->catalogXmin))
+ MyProc->catalogXmin = xmin;
+ }
+ if (!TransactionIdIsValid(TransactionXmin))
+ TransactionXmin = xmin;
LWLockRelease(ProcArrayLock);
@@ -2506,6 +2540,7 @@
snapshot->subxcnt = subcount;
snapshot->suboverflowed = suboverflowed;
snapshot->snapXactCompletionCount = curXactCompletionCount;
+ snapshot->catalog = catalog;
snapshot->curcid = GetCurrentCommandId(false);
@@ -2522,6 +2557,19 @@
return snapshot;
}
+Snapshot
+GetSnapshotData(Snapshot snapshot)
+{
+ return GetSnapshotDataImpl(snapshot, false);
+}
+
+Snapshot
+GetCatalogSnapshotData(Snapshot snapshot)
+{
+ return GetSnapshotDataImpl(snapshot, true);
+}
+
/*
* ProcArrayInstallImportedXmin -- install imported xmin into MyProc->xmin
*
@@ -2592,7 +2640,7 @@
* GetSnapshotData first, we'll be overwriting a valid xmin here, so
* we don't check that.)
*/
- MyProc->xmin = TransactionXmin = xmin;
+ MyProc->xmin = MyProc->catalogXmin = TransactionXmin = xmin;
result = true;
break;
@@ -2645,7 +2693,7 @@
* Install xmin and propagate the statusFlags that affect how the
* value is interpreted by vacuum.
*/
- MyProc->xmin = TransactionXmin = xmin;
+ MyProc->xmin = MyProc->catalogXmin = TransactionXmin = xmin;
MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) |
(proc->statusFlags & PROC_XMIN_FLAGS);
ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
@@ -3162,7 +3210,8 @@
*/
void
ProcNumberGetTransactionIds(ProcNumber procNumber, TransactionId *xid,
- TransactionId *xmin, int *nsubxid, bool *overflowed)
+ TransactionId *xmin, TransactionId *catalogXmin,
+ int *nsubxid, bool *overflowed)
{
PGPROC *proc;
@@ -3182,6 +3231,7 @@
{
*xid = proc->xid;
*xmin = proc->xmin;
+ *catalogXmin = proc->catalogXmin;
*nsubxid = proc->subxidStatus.count;
*overflowed = proc->subxidStatus.overflowed;
}
@@ -3356,8 +3406,10 @@
{
/* Fetch xmin just once - might change on us */
TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
+ TransactionId pcatalogXmin = UINT32_ACCESS_ONCE(proc->catalogXmin);
+ TransactionId olderpXmin = TransactionIdOlder(pxmin, pcatalogXmin);
- if (excludeXmin0 && !TransactionIdIsValid(pxmin))
+ if (excludeXmin0 && !TransactionIdIsValid(olderpXmin))
continue;
/*
@@ -3365,7 +3417,7 @@
* hasn't set xmin yet will not be rejected by this test.
*/
if (!TransactionIdIsValid(limitXmin) ||
- TransactionIdPrecedesOrEquals(pxmin, limitXmin))
+ TransactionIdPrecedesOrEquals(olderpXmin, limitXmin))
{
VirtualTransactionId vxid;
@@ -3456,6 +3508,8 @@
{
/* Fetch xmin just once - can't change on us, but good coding */
TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
+ TransactionId catalogpXmin = UINT32_ACCESS_ONCE(proc->catalogXmin);
+ TransactionId oldestpXmin = TransactionIdOlder(pxmin, catalogpXmin);
/*
* We ignore an invalid pxmin because this means that backend has
@@ -3466,7 +3520,7 @@
* test here.
*/
if (!TransactionIdIsValid(limitXmin) ||
- (TransactionIdIsValid(pxmin) && !TransactionIdFollows(pxmin, limitXmin)))
+ (TransactionIdIsValid(oldestpXmin) && !TransactionIdFollows(oldestpXmin, limitXmin)))
{
VirtualTransactionId vxid;
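To make the horizon arithmetic above concrete: each backend now advertises two values, and only the catalog horizon has to respect the older of the two, while the data horizon keeps following proc->xmin alone. Below is a minimal standalone sketch of that rule (my own illustration, not part of the patch; TransactionId is modeled as a plain uint32 and the wraparound-aware comparison of the real comparator is deliberately ignored):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId ((TransactionId) 0)
    #define TransactionIdIsValid(x) ((x) != InvalidTransactionId)

    /* simplified: no epoch/wraparound handling, unlike the real comparator */
    static TransactionId
    TransactionIdOlder(TransactionId a, TransactionId b)
    {
        if (!TransactionIdIsValid(a))
            return b;
        if (!TransactionIdIsValid(b))
            return a;
        return (a < b) ? a : b;
    }

    int
    main(void)
    {
        /* one backend: ordinary snapshot at xid 100, catalog snapshot at 90 */
        TransactionId xmin = 100;
        TransactionId catalogXmin = 90;

        /* data pruning keeps following xmin alone */
        TransactionId data_horizon = xmin;
        /* catalog pruning must respect the older of the two */
        TransactionId catalog_horizon = TransactionIdOlder(xmin, catalogXmin);

        printf("data horizon: %u, catalog horizon: %u\n",
               data_horizon, catalog_horizon);      /* prints 100, 90 */
        return 0;
    }

So a backend sitting on an old catalog snapshot only holds back catalog_oldest_nonremovable, and ordinary heap pruning can proceed past it.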
Index: src/backend/storage/lmgr/proc.c
===================================================================
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
--- a/src/backend/storage/lmgr/proc.c (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/storage/lmgr/proc.c (revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -382,7 +382,7 @@
MyProc->fpVXIDLock = false;
MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
MyProc->xid = InvalidTransactionId;
- MyProc->xmin = InvalidTransactionId;
+ MyProc->xmin = MyProc->catalogXmin = InvalidTransactionId;
MyProc->pid = MyProcPid;
MyProc->vxid.procNumber = MyProcNumber;
MyProc->vxid.lxid = InvalidLocalTransactionId;
@@ -580,7 +580,7 @@
MyProc->fpVXIDLock = false;
MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
MyProc->xid = InvalidTransactionId;
- MyProc->xmin = InvalidTransactionId;
+ MyProc->xmin = MyProc->catalogXmin = InvalidTransactionId;
MyProc->vxid.procNumber = INVALID_PROC_NUMBER;
MyProc->vxid.lxid = InvalidLocalTransactionId;
MyProc->databaseId = InvalidOid;
Index: src/backend/utils/time/snapmgr.c
===================================================================
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
--- a/src/backend/utils/time/snapmgr.c (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/utils/time/snapmgr.c (revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -290,14 +290,6 @@
Snapshot
GetLatestSnapshot(void)
{
- /*
- * We might be able to relax this, but nothing that could otherwise work
- * needs it.
- */
- if (IsInParallelMode())
- elog(ERROR,
- "cannot update SecondarySnapshot during a parallel operation");
-
/*
* So far there are no cases requiring support for GetLatestSnapshot()
* during logical decoding, but it wouldn't be hard to add if required.
@@ -332,6 +324,16 @@
RegisteredLSN = OldestRegisteredSnapshot->lsn;
}
+ if (CatalogSnapshot != NULL)
+ {
+ if (OldestRegisteredSnapshot == NULL ||
+ TransactionIdPrecedes(CatalogSnapshot->xmin, OldestRegisteredSnapshot->xmin))
+ {
+ OldestRegisteredSnapshot = CatalogSnapshot;
+ RegisteredLSN = CatalogSnapshot->lsn;
+ }
+ }
+
if (OldestActiveSnapshot != NULL)
{
XLogRecPtr ActiveLSN = OldestActiveSnapshot->as_snap->lsn;
@@ -388,7 +390,7 @@
if (CatalogSnapshot == NULL)
{
/* Get new snapshot. */
- CatalogSnapshot = GetSnapshotData(&CatalogSnapshotData);
+ CatalogSnapshot = GetCatalogSnapshotData(&CatalogSnapshotData);
/*
* Make sure the catalog snapshot will be accounted for in decisions
@@ -402,7 +404,7 @@
* NB: it had better be impossible for this to throw error, since the
* CatalogSnapshot pointer is already valid.
*/
- pairingheap_add(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
+ Assert(TransactionIdIsValid(MyProc->catalogXmin));
}
return CatalogSnapshot;
@@ -423,9 +425,8 @@
{
if (CatalogSnapshot)
{
- pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
CatalogSnapshot = NULL;
- SnapshotResetXmin();
+ MyProc->catalogXmin = InvalidTransactionId;
}
}
@@ -444,7 +445,7 @@
{
if (CatalogSnapshot &&
ActiveSnapshot == NULL &&
- pairingheap_is_singular(&RegisteredSnapshots))
+ pairingheap_is_empty(&RegisteredSnapshots))
InvalidateCatalogSnapshot();
}
@@ -1081,7 +1082,7 @@
if (resetXmin)
SnapshotResetXmin();
- Assert(resetXmin || MyProc->xmin == 0);
+ Assert(resetXmin || (MyProc->xmin == InvalidTransactionId && MyProc->catalogXmin == InvalidTransactionId));
}
@@ -1626,19 +1627,15 @@
if (ActiveSnapshot != NULL)
return true;
- /*
- * The catalog snapshot is in RegisteredSnapshots when valid, but can be
- * removed at any time due to invalidation processing. If explicitly
- * registered more than one snapshot has to be in RegisteredSnapshots.
- */
- if (CatalogSnapshot != NULL &&
- pairingheap_is_singular(&RegisteredSnapshots))
- return false;
+ return HaveRegisteredSnapshot();
+}
+
+bool
+HaveRegisteredSnapshot(void)
+{
return !pairingheap_is_empty(&RegisteredSnapshots);
}
-
/*
* Setup a snapshot that replaces normal catalog snapshots that allows catalog
* access to behave just like it did at a certain point in the past.
@@ -1804,6 +1801,7 @@
snapshot->whenTaken = serialized_snapshot.whenTaken;
snapshot->lsn = serialized_snapshot.lsn;
snapshot->snapXactCompletionCount = 0;
+ snapshot->catalog = false;
/* Copy XIDs, if present. */
if (serialized_snapshot.xcnt > 0)
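The snapmgr.c side of this is that the catalog snapshot no longer lives in RegisteredSnapshots at all: its xmin is advertised through MyProc->catalogXmin, and InvalidateCatalogSnapshot() simply clears that field. A reduced standalone model of the lifecycle (again only an illustration; PGPROC and the snapshot machinery are collapsed to a couple of fields):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId ((TransactionId) 0)

    /* reduced model of the relevant PGPROC fields */
    static struct { TransactionId xmin, catalogXmin; } MyProc;

    static TransactionId next_xid = 90;  /* stand-in for the xid counter */
    static bool have_catalog_snapshot = false;

    static void
    GetCatalogSnapshot(void)
    {
        if (!have_catalog_snapshot)
        {
            have_catalog_snapshot = true;
            /* advertise through catalogXmin instead of RegisteredSnapshots */
            if (MyProc.catalogXmin == InvalidTransactionId)
                MyProc.catalogXmin = next_xid++;
        }
    }

    static void
    InvalidateCatalogSnapshot(void)
    {
        if (have_catalog_snapshot)
        {
            have_catalog_snapshot = false;
            /* direct reset; no pairingheap_remove()/SnapshotResetXmin() */
            MyProc.catalogXmin = InvalidTransactionId;
        }
    }

    int
    main(void)
    {
        GetCatalogSnapshot();
        printf("catalogXmin = %u\n", MyProc.catalogXmin);   /* 90 */
        InvalidateCatalogSnapshot();
        printf("catalogXmin = %u\n", MyProc.catalogXmin);   /* 0 */
        return 0;
    }

Since only catalogXmin is held back by catalog access, a long-lived catalog snapshot no longer pins the data horizon.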
Index: src/include/storage/proc.h
===================================================================
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
--- a/src/include/storage/proc.h (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/include/storage/proc.h (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -56,10 +56,6 @@
*/
#define PROC_IS_AUTOVACUUM 0x01 /* is it an autovac worker? */
#define PROC_IN_VACUUM 0x02 /* currently running lazy vacuum */
-#define PROC_IN_SAFE_IC 0x04 /* currently running CREATE INDEX
- * CONCURRENTLY or REINDEX
- * CONCURRENTLY on non-expressional,
- * non-partial index */
#define PROC_VACUUM_FOR_WRAPAROUND 0x08 /* set by autovac only */
#define PROC_IN_LOGICAL_DECODING 0x10 /* currently doing logical
* decoding outside xact */
@@ -69,13 +65,13 @@
/* flags reset at EOXact */
#define PROC_VACUUM_STATE_MASK \
- (PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+ (PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
/*
* Xmin-related flags. Make sure any flags that affect how the process' Xmin
* value is interpreted by VACUUM are included here.
*/
-#define PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define PROC_XMIN_FLAGS (PROC_IN_VACUUM)
/*
* We allow a small number of "weak" relation locks (AccessShareLock,
@@ -179,6 +175,7 @@
* starting our xact, excluding LAZY VACUUM:
* vacuum must not remove tuples deleted by
* xid >= xmin ! */
+ TransactionId catalogXmin;
int pid; /* Backend's process ID; 0 if prepared xact */
Index: src/include/storage/procarray.h
===================================================================
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
--- a/src/include/storage/procarray.h (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/include/storage/procarray.h (revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -45,6 +45,7 @@
extern int GetMaxSnapshotSubxidCount(void);
extern Snapshot GetSnapshotData(Snapshot snapshot);
+extern Snapshot GetCatalogSnapshotData(Snapshot snapshot);
extern bool ProcArrayInstallImportedXmin(TransactionId xmin,
VirtualTransactionId *sourcevxid);
@@ -66,8 +67,8 @@
extern PGPROC *ProcNumberGetProc(int procNumber);
extern void ProcNumberGetTransactionIds(int procNumber, TransactionId *xid,
- TransactionId *xmin, int *nsubxid,
- bool *overflowed);
+ TransactionId *xmin, TransactionId *catalogXmin,
+ int *nsubxid, bool *overflowed);
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
extern int BackendXidGetPid(TransactionId xid);
Index: src/include/utils/snapshot.h
===================================================================
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
--- a/src/include/utils/snapshot.h (revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/include/utils/snapshot.h (revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -183,6 +183,7 @@
bool takenDuringRecovery; /* recovery-shaped snapshot? */
bool copied; /* false if it's a static snapshot */
+ bool catalog; /* snapshot used to access catalog */
CommandId curcid; /* in my xact, CID < curcid are visible */
Index: contrib/amcheck/verify_nbtree.c
===================================================================
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
--- a/contrib/amcheck/verify_nbtree.c (revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/contrib/amcheck/verify_nbtree.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -691,7 +691,8 @@
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK? */
+ true, /* syncscan OK? */
+ false);
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
Index: src/backend/access/brin/brin.c
===================================================================
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
--- a/src/backend/access/brin/brin.c (revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/brin/brin.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -2369,16 +2369,7 @@
leaderparticipates = false;
#endif
- /*
- * Enter parallel mode, and create context for parallel build of brin
- * index
- */
- EnterParallelMode();
- Assert(request > 0);
- pcxt = CreateParallelContext("postgres", "_brin_parallel_build_main",
- request);
-
- scantuplesortstates = leaderparticipates ? request + 1 : request;
+ Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
/*
* Prepare for scan of the base relation. In a normal index build, we use
@@ -2390,7 +2381,21 @@
if (!isconcurrent)
snapshot = SnapshotAny;
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ }
+
+ /*
+ * Enter parallel mode, and create context for parallel build of brin
+ * index
+ */
+ EnterParallelMode();
+ Assert(request > 0);
+ pcxt = CreateParallelContext("postgres", "_brin_parallel_build_main",
+ request);
+
+ scantuplesortstates = leaderparticipates ? request + 1 : request;
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2429,6 +2434,8 @@
/* Everyone's had a chance to ask for space, so now create the DSM */
InitializeParallelDSM(pcxt);
+ if (IsMVCCSnapshot(snapshot))
+ PopActiveSnapshot();
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
@@ -2458,7 +2465,7 @@
table_parallelscan_initialize(heap,
ParallelTableScanFromBrinShared(brinshared),
- snapshot);
+ isconcurrent ? InvalidSnapshot : snapshot);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -2504,7 +2511,7 @@
brinleader->nparticipanttuplesorts++;
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
- brinleader->snapshot = snapshot;
+ brinleader->snapshot = isconcurrent ? InvalidSnapshot : snapshot;
brinleader->walusage = walusage;
brinleader->bufferusage = bufferusage;
@@ -2518,6 +2525,12 @@
/* Save leader state now that it's clear build will be parallel */
buildstate->bs_leader = brinleader;
+ if (isconcurrent)
+ {
+ WaitForParallelWorkersToAttach(pcxt, true);
+ UnregisterSnapshot(snapshot);
+ }
+
/* Join heap scan ourselves */
if (leaderparticipates)
_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2526,7 +2539,8 @@
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!isconcurrent)
+ WaitForParallelWorkersToAttach(pcxt, false);
}
/*
@@ -2536,6 +2550,7 @@
_brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
{
int i;
+ Snapshot snapshot = brinleader->snapshot;
/* Shutdown worker processes */
WaitForParallelWorkersToFinish(brinleader->pcxt);
@@ -2548,8 +2563,10 @@
InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(brinleader->snapshot))
- UnregisterSnapshot(brinleader->snapshot);
+ Assert(!brinleader->brinshared->isconcurrent || snapshot == InvalidSnapshot);
+ Assert(brinleader->brinshared->isconcurrent || snapshot != InvalidSnapshot);
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
+ UnregisterSnapshot(snapshot);
DestroyParallelContext(brinleader->pcxt);
ExitParallelMode();
}
@@ -2800,6 +2817,7 @@
TableScanDesc scan;
double reltuples;
IndexInfo *indexInfo;
+ Snapshot snapshot;
/* Initialize local tuplesort coordination state */
coordinate = palloc0(sizeof(SortCoordinateData));
@@ -2811,8 +2829,21 @@
state->bs_sortstate = tuplesort_begin_index_brin(sortmem, coordinate,
TUPLESORT_NONE);
+ Assert(!brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xmin));
+ /* Register a snapshot so that BuildIndexInfo() can access catalogs */
+ if (brinshared->isconcurrent)
+ {
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ }
/* Join parallel scan */
indexInfo = BuildIndexInfo(index);
+ if (brinshared->isconcurrent)
+ {
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+ }
+ Assert(!brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xmin));
indexInfo->ii_Concurrent = brinshared->isconcurrent;
scan = table_beginscan_parallel(heap,
@@ -2866,8 +2897,7 @@
- * The only possible status flag that can be set to the parallel worker is
- * PROC_IN_SAFE_IC.
+ * With PROC_IN_SAFE_IC removed, parallel workers must not have any status
+ * flags set by this point.
*/
- Assert((MyProc->statusFlags == 0) ||
- (MyProc->statusFlags == PROC_IN_SAFE_IC));
+ Assert(MyProc->statusFlags == 0);
/* Set debug_query_string for individual workers first */
sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
@@ -2913,8 +2943,12 @@
*/
sortmem = maintenance_work_mem / brinshared->scantuplesortstates;
+ if (brinshared->isconcurrent)
+ PopActiveSnapshot();
_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
heapRel, indexRel, sortmem, false);
+ if (brinshared->isconcurrent)
+ PushActiveSnapshot(GetLatestSnapshot());
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
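If I read the reordering in _brin_begin_parallel correctly, the invariant is: the leader must keep a registered, active snapshot for as long as workers may still be taking theirs, and may drop it (so its xmin stops pinning the horizon) only after WaitForParallelWorkersToAttach(pcxt, true) has confirmed every worker holds its own snapshot. A toy model of that hand-off (all names below are mine, not server code):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NWORKERS 2

    static bool leader_holds_snapshot;
    static bool worker_has_snapshot[NWORKERS];

    static void
    worker_attach(int i)
    {
        /* a worker may only take its snapshot while the leader still holds
         * one, otherwise the rows it needs could be pruned in between */
        assert(leader_holds_snapshot);
        worker_has_snapshot[i] = true;
    }

    int
    main(void)
    {
        leader_holds_snapshot = true;    /* RegisterSnapshot + PushActiveSnapshot */

        for (int i = 0; i < NWORKERS; i++)   /* WaitForParallelWorkersToAttach(.., true) */
            worker_attach(i);

        leader_holds_snapshot = false;   /* UnregisterSnapshot: xmin released */

        for (int i = 0; i < NWORKERS; i++)
            assert(worker_has_snapshot[i]);  /* scan proceeds on worker snapshots */

        printf("leader released its snapshot after all %d workers attached\n",
               NWORKERS);
        return 0;
    }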
Index: src/backend/access/gin/gininsert.c
===================================================================
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
--- a/src/backend/access/gin/gininsert.c (revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/gin/gininsert.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -17,6 +17,7 @@
#include "access/gin_private.h"
#include "access/tableam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
#include "miscadmin.h"
#include "nodes/execnodes.h"
#include "storage/bufmgr.h"
Index: src/backend/access/gist/gistbuild.c
===================================================================
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
--- a/src/backend/access/gist/gistbuild.c (revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/gist/gistbuild.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -38,6 +38,7 @@
#include "access/gist_private.h"
#include "access/tableam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
#include "miscadmin.h"
#include "nodes/execnodes.h"
#include "optimizer/optimizer.h"
Index: src/backend/access/hash/hash.c
===================================================================
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
--- a/src/backend/access/hash/hash.c (revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/hash/hash.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -23,6 +23,7 @@
#include "access/relscan.h"
#include "access/tableam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
Index: src/backend/access/heap/heapam.c
===================================================================
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
--- a/src/backend/access/heap/heapam.c (revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/heap/heapam.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -575,6 +575,24 @@
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+ Assert(ActiveSnapshotSet());
+ PopActiveSnapshot();
+ UnregisterSnapshot(sscan->rs_snapshot);
+ sscan->rs_snapshot = InvalidSnapshot;
+
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+#ifdef USE_INJECTION_POINTS
+ if (!TransactionIdIsValid(MyProc->xid))
+ INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+ sscan->rs_snapshot = RegisterSnapshot(GetLatestSnapshot());
+ PushActiveSnapshot(sscan->rs_snapshot);
+}
+
/*
* heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
*
@@ -593,6 +611,11 @@
scan->rs_cbuf = InvalidBuffer;
}
+ if (unlikely(scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) && likely(scan->rs_inited))
+ {
+ heap_reset_scan_snapshot((TableScanDesc) scan);
+ }
+
/*
* Be sure to check for interrupts at least once per page. Checks at
* higher code levels won't be able to stop a seqscan that encounters many
@@ -1242,6 +1265,13 @@
if (scan->rs_parallelworkerdata != NULL)
pfree(scan->rs_parallelworkerdata);
+ if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+ {
+ Assert(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT);
+ Assert(ActiveSnapshotSet());
+ PopActiveSnapshot();
+ }
+
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
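The heapam.c changes implement the per-page snapshot swap discussed upthread: with SO_RESET_SNAPSHOT set, the scan drops its snapshot and takes GetLatestSnapshot() again between pages, so the xmin it advertises keeps advancing instead of pinning one snapshot for the whole scan. A toy standalone model of the effect (my illustration only; "committing" transactions are simulated by a counter):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    static TransactionId latest_completed = 100;

    static TransactionId
    get_latest_snapshot_xmin(void)
    {
        /* pretend other transactions keep committing under us */
        return ++latest_completed;
    }

    int
    main(void)
    {
        TransactionId advertised_xmin = get_latest_snapshot_xmin();

        for (int page = 0; page < 5; page++)
        {
            printf("scanning page %d with xmin %u\n", page, advertised_xmin);
            /* SO_RESET_SNAPSHOT: between pages, pop/unregister the old
             * snapshot and take a fresh one (heap_reset_scan_snapshot) */
            advertised_xmin = get_latest_snapshot_xmin();
        }
        return 0;
    }

The swap happens only at page boundaries, which matches the observation upthread that the lock level taken by C/RIC prevents vacuum from removing dead line items under the scan.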
Index: src/backend/access/nbtree/nbtsort.c
===================================================================
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
--- a/src/backend/access/nbtree/nbtsort.c (revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/nbtree/nbtsort.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -84,6 +84,7 @@
Relation index;
bool isunique;
bool nulls_not_distinct;
+ bool unique_dead_ignored;
} BTSpool;
/*
@@ -377,6 +378,7 @@
btspool->index = index;
btspool->isunique = indexInfo->ii_Unique;
btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+ btspool->unique_dead_ignored = indexInfo->ii_Concurrent;
/* Save as primary spool */
buildstate->spool = btspool;
@@ -425,8 +427,9 @@
* the use of parallelism or any other factor.
*/
buildstate->spool->sortstate =
- tuplesort_begin_index_btree(heap, index, buildstate->isunique,
- buildstate->nulls_not_distinct,
+ tuplesort_begin_index_btree(heap, index, btspool->isunique,
+ btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
maintenance_work_mem, coordinate,
TUPLESORT_NONE);
@@ -435,7 +438,7 @@
* them out of the uniqueness check. We expect that the second spool (for
* dead tuples) won't get very full, so we give it only work_mem.
*/
- if (indexInfo->ii_Unique)
+ if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
{
BTSpool *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
SortCoordinate coordinate2 = NULL;
@@ -443,7 +446,7 @@
/* Initialize secondary spool */
btspool2->heap = heap;
btspool2->index = index;
- btspool2->isunique = false;
+ btspool2->isunique = btspool2->unique_dead_ignored = false;
/* Save as secondary spool */
buildstate->spool2 = btspool2;
@@ -466,7 +469,7 @@
* full, so we give it only work_mem
*/
buildstate->spool2->sortstate =
- tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+ tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
coordinate2, TUPLESORT_NONE);
}
@@ -1145,11 +1148,13 @@
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
+ bool fail_on_duplicate;
wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
BTGetDeduplicateItems(wstate->index);
+ fail_on_duplicate = (btspool->unique_dead_ignored && btspool->isunique && btspool2 == NULL);
if (merge)
{
@@ -1353,6 +1358,80 @@
pfree(dstate);
}
+ else if (fail_on_duplicate)
+ {
+ bool was_valid = false,
+ prev_checked = false,
+ was_null;
+ IndexTuple prev = NULL;
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+ &TTSOpsBufferHeapTuple);
+ IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (prev != NULL &&
+ ((wstate->inskey->allequalimage &&
+ _bt_keep_natts_fast_wasnull(wstate->index, prev, itup, &was_null) > keysz) ||
+ (_bt_keep_natts_wasnull(wstate->index, prev, itup, wstate->inskey, &was_null) > keysz)
+ ) &&
+ (!was_null || btspool->nulls_not_distinct))
+ {
+ bool call_again, ignored, now_valid;
+ ItemPointerData tid;
+ if (!prev_checked)
+ {
+ call_again = false;
+ tid = prev->t_tid;
+ was_valid = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+ prev_checked = true;
+ }
+
+ call_again = false;
+ tid = itup->t_tid;
+ now_valid = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+ if (was_valid && now_valid)
+ {
+ char *key_desc;
+ TupleDesc tupDes = RelationGetDescr(wstate->index);
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values[INDEX_MAX_KEYS];
+
+ index_deform_tuple(itup, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(wstate->index)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(wstate->heap,
+ RelationGetRelationName(wstate->index))));
+ }
+ was_valid |= now_valid;
+ }
+ else
+ {
+ was_valid = false;
+ prev_checked = false;
+ }
+ _bt_buildadd(wstate, state, itup, 0);
+ if (prev)
+ pfree(prev);
+ prev = CopyIndexTuple(itup);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
else
{
/* merging and deduplication are both unnecessary */
@@ -1414,17 +1493,7 @@
leaderparticipates = false;
#endif
- /*
- * Enter parallel mode, and create context for parallel build of btree
- * index
- */
- EnterParallelMode();
- Assert(request > 0);
- pcxt = CreateParallelContext("postgres", "_bt_parallel_build_main",
- request);
-
- scantuplesortstates = leaderparticipates ? request + 1 : request;
-
+ Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
@@ -1435,7 +1504,20 @@
if (!isconcurrent)
snapshot = SnapshotAny;
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ }
+ /*
+ * Enter parallel mode, and create context for parallel build of btree
+ * index
+ */
+ EnterParallelMode();
+ Assert(request > 0);
+ pcxt = CreateParallelContext("postgres", "_bt_parallel_build_main",
+ request);
+
+ scantuplesortstates = leaderparticipates ? request + 1 : request;
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1450,7 +1532,7 @@
* Unique case requires a second spool, and so we may have to account for
* another shared workspace for that -- PARALLEL_KEY_TUPLESORT_SPOOL2
*/
- if (!btspool->isunique)
+ if (!btspool->isunique || isconcurrent)
shm_toc_estimate_keys(&pcxt->estimator, 2);
else
{
@@ -1485,6 +1567,8 @@
/* Everyone's had a chance to ask for space, so now create the DSM */
InitializeParallelDSM(pcxt);
+ if (IsMVCCSnapshot(snapshot))
+ PopActiveSnapshot();
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
@@ -1515,7 +1599,7 @@
btshared->brokenhotchain = false;
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
- snapshot);
+ isconcurrent ? InvalidSnapshot : snapshot);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1529,7 +1613,7 @@
shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
/* Unique case requires a second spool, and associated shared state */
- if (!btspool->isunique)
+ if (!btspool->isunique || isconcurrent)
sharedsort2 = NULL;
else
{
@@ -1575,7 +1659,7 @@
btleader->btshared = btshared;
btleader->sharedsort = sharedsort;
btleader->sharedsort2 = sharedsort2;
- btleader->snapshot = snapshot;
+ btleader->snapshot = isconcurrent ? InvalidSnapshot : snapshot;
btleader->walusage = walusage;
btleader->bufferusage = bufferusage;
@@ -1589,15 +1673,25 @@
/* Save leader state now that it's clear build will be parallel */
buildstate->btleader = btleader;
+ if (isconcurrent)
+ {
+ WaitForParallelWorkersToAttach(pcxt, true);
+ UnregisterSnapshot(snapshot);
+ }
+
/* Join heap scan ourselves */
if (leaderparticipates)
+ {
+ INJECTION_POINT("_bt_leader_participate_as_worker");
_bt_leader_participate_as_worker(buildstate);
+ }
/*
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!isconcurrent)
+ WaitForParallelWorkersToAttach(pcxt, false);
}
/*
@@ -1607,6 +1701,7 @@
_bt_end_parallel(BTLeader *btleader)
{
int i;
+ Snapshot snapshot = btleader->snapshot;
/* Shutdown worker processes */
WaitForParallelWorkersToFinish(btleader->pcxt);
@@ -1619,8 +1714,10 @@
InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(btleader->snapshot))
- UnregisterSnapshot(btleader->snapshot);
+ Assert(!btleader->btshared->isconcurrent || snapshot == InvalidSnapshot);
+ Assert(btleader->btshared->isconcurrent || snapshot != InvalidSnapshot);
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
+ UnregisterSnapshot(snapshot);
DestroyParallelContext(btleader->pcxt);
ExitParallelMode();
}
@@ -1697,9 +1794,10 @@
leaderworker->index = buildstate->spool->index;
leaderworker->isunique = buildstate->spool->isunique;
leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+ leaderworker->unique_dead_ignored = btleader->btshared->isconcurrent;
/* Initialize second spool, if required */
- if (!btleader->btshared->isunique)
+ if (!btleader->btshared->isunique || btleader->btshared->isconcurrent)
leaderworker2 = NULL;
else
{
@@ -1709,7 +1807,7 @@
/* Initialize worker's own secondary spool */
leaderworker2->heap = leaderworker->heap;
leaderworker2->index = leaderworker->index;
- leaderworker2->isunique = false;
+ leaderworker2->isunique = leaderworker2->unique_dead_ignored = false;
}
/*
@@ -1758,12 +1856,7 @@
ResetUsage();
#endif /* BTREE_BUILD_STATS */
- /*
- * The only possible status flag that can be set to the parallel worker is
- * PROC_IN_SAFE_IC.
- */
- Assert((MyProc->statusFlags == 0) ||
- (MyProc->statusFlags == PROC_IN_SAFE_IC));
+ Assert(MyProc->statusFlags == 0);
/* Set debug_query_string for individual workers first */
sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
@@ -1796,12 +1889,13 @@
btspool->heap = heapRel;
btspool->index = indexRel;
btspool->isunique = btshared->isunique;
+ btspool->unique_dead_ignored = btshared->isconcurrent;
btspool->nulls_not_distinct = btshared->nulls_not_distinct;
/* Look up shared state private to tuplesort.c */
sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
tuplesort_attach_shared(sharedsort, seg);
- if (!btshared->isunique)
+ if (!btshared->isunique || btshared->isconcurrent)
{
btspool2 = NULL;
sharedsort2 = NULL;
@@ -1814,7 +1908,7 @@
/* Initialize worker's own secondary spool */
btspool2->heap = btspool->heap;
btspool2->index = btspool->index;
- btspool2->isunique = false;
+ btspool2->isunique = btspool2->unique_dead_ignored = false;
/* Look up shared state private to tuplesort.c */
sharedsort2 = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT_SPOOL2, false);
tuplesort_attach_shared(sharedsort2, seg);
@@ -1825,8 +1919,12 @@
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
+ if (btshared->isconcurrent)
+ PopActiveSnapshot();
_bt_parallel_scan_and_sort(btspool, btspool2, btshared, sharedsort,
sharedsort2, sortmem, false);
+ if (btshared->isconcurrent)
+ PushActiveSnapshot(GetLatestSnapshot());
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
@@ -1868,6 +1966,7 @@
TableScanDesc scan;
double reltuples;
IndexInfo *indexInfo;
+ Snapshot snapshot;
/* Initialize local tuplesort coordination state */
coordinate = palloc0(sizeof(SortCoordinateData));
@@ -1880,6 +1979,7 @@
btspool->index,
btspool->isunique,
btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
sortmem, coordinate,
TUPLESORT_NONE);
@@ -1902,7 +2002,8 @@
coordinate2->nParticipants = -1;
coordinate2->sharedsort = sharedsort2;
btspool2->sortstate =
- tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+ tuplesort_begin_index_btree(btspool->heap, btspool->index,
+ false, false, false,
Min(sortmem, work_mem), coordinate2,
false);
}
@@ -1917,13 +2018,27 @@
buildstate.indtuples = 0;
buildstate.btleader = NULL;
+ Assert(!btshared->isconcurrent || !TransactionIdIsValid(MyProc->xmin));
/* Join parallel scan */
+ if (btshared->isconcurrent)
+ {
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ }
indexInfo = BuildIndexInfo(btspool->index);
+ if (btshared->isconcurrent)
+ {
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+ }
+ Assert(!btshared->isconcurrent || !TransactionIdIsValid(MyProc->xmin));
+
indexInfo->ii_Concurrent = btshared->isconcurrent;
scan = table_beginscan_parallel(btspool->heap,
ParallelTableScanFromBTShared(btshared));
reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
- true, progress, _bt_build_callback,
+ true, progress,
+ _bt_build_callback,
(void *) &buildstate, scan);
/* Execute this worker's part of the sort */
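The new fail_on_duplicate path deserves a word: because the concurrent build cannot use the spool2 mechanism, uniqueness is checked directly on the sorted stream, and a run of equal keys is an error only if more than one tuple in the run is still visible (the real code probes the heap with SnapshotSelf through table_index_fetch_tuple). A simplified standalone model of that rule (illustration only; visibility is reduced to a boolean):

    #include <stdbool.h>
    #include <stdio.h>

    /* tuples arrive sorted by key; 'live' stands for visibility under
     * SnapshotSelf */
    struct tup { int key; bool live; };

    int
    main(void)
    {
        struct tup sorted[] = {
            {1, true}, {2, false}, {2, true},   /* dead + live duplicate: fine */
            {3, true}, {3, true},               /* two live duplicates: error */
        };
        int ntup = (int) (sizeof(sorted) / sizeof(sorted[0]));
        int prev_key = -1;
        bool run_has_live = false;

        for (int i = 0; i < ntup; i++)
        {
            if (sorted[i].key != prev_key)
            {
                prev_key = sorted[i].key;       /* new run of equal keys */
                run_has_live = false;
            }
            if (sorted[i].live)
            {
                if (run_has_live)
                {
                    fprintf(stderr,
                            "could not create unique index: key %d is duplicated\n",
                            sorted[i].key);
                    return 1;
                }
                run_has_live = true;
            }
        }
        return 0;
    }

Tolerating dead duplicates is what lets the check work while concurrent updates are still producing dead versions of the same key.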
Index: src/backend/access/spgist/spginsert.c
===================================================================
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
--- a/src/backend/access/spgist/spginsert.c (revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/spgist/spginsert.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -20,6 +20,7 @@
#include "access/spgist_private.h"
#include "access/tableam.h"
#include "access/xloginsert.h"
+#include "catalog/index.h"
#include "miscadmin.h"
#include "nodes/execnodes.h"
#include "storage/bufmgr.h"
Index: src/backend/optimizer/plan/planner.c
===================================================================
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
--- a/src/backend/optimizer/plan/planner.c (revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/optimizer/plan/planner.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -61,6 +61,7 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
/* GUC parameters */
double cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6791,6 +6792,7 @@
BlockNumber heap_blocks;
double reltuples;
double allvisfrac;
+ Snapshot snapshot = InvalidSnapshot;
/*
* We don't allow performing parallel operation in standalone backend or
@@ -6842,6 +6844,10 @@
heap = table_open(tableOid, NoLock);
index = index_open(indexOid, NoLock);
+ if (!ActiveSnapshotSet())
+ {
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ }
/*
* Determine if it's safe to proceed.
*
@@ -6899,6 +6905,12 @@
parallel_workers--;
done:
+ if (snapshot != InvalidSnapshot)
+ {
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+ }
+
index_close(index, NoLock);
table_close(heap, NoLock);
Index: src/backend/access/table/tableam.c
===================================================================
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
--- a/src/backend/access/table/tableam.c (revision 103bbb703f974c65be6e238ca2c181f1470ceb25)
+++ b/src/backend/access/table/tableam.c (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -29,6 +29,7 @@
#include "storage/bufmgr.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
+#include "storage/proc.h"
/*
* Constants to control the behavior of block allocation to parallel workers
@@ -149,15 +150,23 @@
pscan->phs_snapshot_off = snapshot_off;
- if (IsMVCCSnapshot(snapshot))
+
+ if (snapshot == InvalidSnapshot)
+ {
+ pscan->phs_snapshot_any = false;
+ pscan->phs_snapshot_reset = true;
+ }
+ else if (IsMVCCSnapshot(snapshot))
{
SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
pscan->phs_snapshot_any = false;
+ pscan->phs_snapshot_reset = false;
}
else
{
Assert(snapshot == SnapshotAny);
pscan->phs_snapshot_any = true;
+ pscan->phs_snapshot_reset = false;
}
}
@@ -170,7 +179,16 @@
Assert(RelationGetRelid(relation) == pscan->phs_relid);
- if (!pscan->phs_snapshot_any)
+ if (pscan->phs_snapshot_reset)
+ {
+ Assert(!ActiveSnapshotSet());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+
+ snapshot = RegisterSnapshot(GetLatestSnapshot());
+ PushActiveSnapshot(snapshot);
+ flags |= (SO_RESET_SNAPSHOT | SO_TEMP_SNAPSHOT);
+ }
+ else if (!pscan->phs_snapshot_any)
{
/* Snapshot was serialized -- restore it */
snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
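So table_parallelscan_initialize()/table_beginscan_parallel() now distinguish three snapshot modes rather than two. A small sketch of the classification (the enum and names are mine, for illustration):

    #include <stdio.h>

    enum scan_snapshot_mode
    {
        SCAN_SNAPSHOT_SERIALIZED,  /* MVCC snapshot copied into the DSM */
        SCAN_SNAPSHOT_ANY,         /* SnapshotAny: no visibility filtering */
        SCAN_SNAPSHOT_RESET        /* InvalidSnapshot: each participant takes
                                    * GetLatestSnapshot() and keeps swapping it */
    };

    static enum scan_snapshot_mode
    classify(int is_invalid, int is_mvcc)
    {
        if (is_invalid)
            return SCAN_SNAPSHOT_RESET;
        return is_mvcc ? SCAN_SNAPSHOT_SERIALIZED : SCAN_SNAPSHOT_ANY;
    }

    int
    main(void)
    {
        printf("%d %d %d\n",
               classify(1, 0),   /* concurrent build: reset mode */
               classify(0, 1),   /* ordinary MVCC scan: serialize */
               classify(0, 0));  /* plain index build: SnapshotAny */
        return 0;
    }

In reset mode nothing is serialized at all; every participant builds its own snapshot, which is exactly why the leader-side snapshot hand-off above is needed.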
Index: src/backend/access/transam/parallel.c
===================================================================
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
--- a/src/backend/access/transam/parallel.c (revision 103bbb703f974c65be6e238ca2c181f1470ceb25)
+++ b/src/backend/access/transam/parallel.c (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
@@ -76,6 +76,7 @@
#define PARALLEL_KEY_RELMAPPER_STATE UINT64CONST(0xFFFFFFFFFFFF000D)
#define PARALLEL_KEY_UNCOMMITTEDENUMS UINT64CONST(0xFFFFFFFFFFFF000E)
#define PARALLEL_KEY_CLIENTCONNINFO UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_SET_FLAG UINT64CONST(0xFFFFFFFFFFFF0010)
/* Fixed-size parallel state. */
typedef struct FixedParallelState
@@ -289,6 +290,9 @@
mul_size(PARALLEL_ERROR_QUEUE_SIZE,
pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool), pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Estimate how much we'll need for the entrypoint info. */
shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
@@ -359,6 +363,7 @@
char *entrypointstate;
char *uncommittedenumsspace;
char *clientconninfospace;
+ bool *snapshot_set_flag_space;
Size lnamelen;
/* Serialize shared libraries we have loaded. */
@@ -474,6 +479,15 @@
strcpy(entrypointstate, pcxt->library_name);
strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+ snapshot_set_flag_space =
+ shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ pcxt->worker[i].snapshot_set_flag = snapshot_set_flag_space + i;
+ *pcxt->worker[i].snapshot_set_flag = false;
+ }
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_SET_FLAG, snapshot_set_flag_space);
}
/* Restore previous memory context. */
@@ -511,6 +525,7 @@
if (pcxt->nworkers > 0)
{
char *error_queue_space;
+ bool *snapshot_set_flag_space;
int i;
error_queue_space =
@@ -525,6 +540,11 @@
shm_mq_set_receiver(mq, MyProc);
pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
}
+
+ snapshot_set_flag_space =
+ shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_SET_FLAG, false);
+ for (i = 0; i < pcxt->nworkers; ++i)
+ snapshot_set_flag_space[i] = false;
}
}
@@ -669,7 +689,7 @@
* call this function at all.
*/
void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
{
int i;
@@ -713,9 +733,12 @@
mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
if (shm_mq_get_sender(mq) != NULL)
{
- /* Yes, so it is known to be attached. */
- pcxt->known_attached_workers[i] = true;
- ++pcxt->nknown_attached_workers;
+ if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_set_flag))
+ {
+ /* Yes, so it is known to be attached. */
+ pcxt->known_attached_workers[i] = true;
+ ++pcxt->nknown_attached_workers;
+ }
}
}
else if (status == BGWH_STOPPED)
@@ -1274,6 +1297,7 @@
shm_toc *toc;
FixedParallelState *fps;
char *error_queue_space;
+ bool *snapshot_set_flag_space;
shm_mq *mq;
shm_mq_handle *mqh;
char *libraryspace;
@@ -1449,6 +1473,9 @@
fps->parallel_leader_pgproc);
PushActiveSnapshot(asnapshot);
+ snapshot_set_flag_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_SET_FLAG, false);
+ snapshot_set_flag_space[ParallelWorkerNumber] = true;
+
/*
* We've changed which tuples we can see, and must therefore invalidate
* system caches.
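The per-worker flag closes a race: without it, WaitForParallelWorkersToAttach() could return while a worker had attached to its error queue but not yet pushed its snapshot, and the leader would drop its own snapshot too early. A toy model of the strengthened check (illustration only):

    #include <stdbool.h>
    #include <stdio.h>

    #define NWORKERS 3

    static bool attached[NWORKERS];
    static bool snapshot_set[NWORKERS];

    /* with wait_for_snapshot the leader counts a worker only after it has
     * both attached to its error queue and set its snapshot flag */
    static int
    known_attached(bool wait_for_snapshot)
    {
        int n = 0;

        for (int i = 0; i < NWORKERS; i++)
            if (attached[i] && (!wait_for_snapshot || snapshot_set[i]))
                n++;
        return n;
    }

    int
    main(void)
    {
        attached[0] = attached[1] = attached[2] = true;
        snapshot_set[0] = true;  /* only worker 0 has pushed its snapshot */

        printf("plain attach: %d, with snapshot: %d\n",
               known_attached(false), known_attached(true));   /* 3, 1 */
        return 0;
    }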
Index: src/include/access/parallel.h
===================================================================
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
--- a/src/include/access/parallel.h (revision 103bbb703f974c65be6e238ca2c181f1470ceb25)
+++ b/src/include/access/parallel.h (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
@@ -26,6 +26,7 @@
{
BackgroundWorkerHandle *bgwhandle;
shm_mq_handle *error_mqh;
+ bool *snapshot_set_flag;
} ParallelWorkerInfo;
typedef struct ParallelContext
@@ -65,7 +66,7 @@
extern void ReinitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
extern void DestroyParallelContext(ParallelContext *pcxt);
extern bool ParallelContextActive(void);
Index: src/include/access/relscan.h
===================================================================
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
--- a/src/include/access/relscan.h (revision 103bbb703f974c65be6e238ca2c181f1470ceb25)
+++ b/src/include/access/relscan.h (revision ea1fcacc7cead3e2fccf581d20e51244a7107435)
@@ -64,6 +64,7 @@
{
Oid phs_relid; /* OID of relation to scan */
bool phs_syncscan; /* report location to syncscan logic? */
+ bool phs_snapshot_reset;
bool phs_snapshot_any; /* SnapshotAny, not phs_snapshot_data? */
Size phs_snapshot_off; /* data for snapshot */
} ParallelTableScanDescData;
Index: src/include/utils/snapmgr.h
===================================================================
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
--- a/src/include/utils/snapmgr.h (revision 103bbb703f974c65be6e238ca2c181f1470ceb25)
+++ b/src/include/utils/snapmgr.h (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
@@ -96,6 +96,7 @@
extern void WaitForOlderSnapshots(TransactionId limitXmin, bool progress);
extern bool ThereAreNoPriorRegisteredSnapshots(void);
extern bool HaveRegisteredOrActiveSnapshot(void);
+extern bool HaveRegisteredSnapshot(void);
extern char *ExportSnapshot(Snapshot snapshot);
Index: contrib/pgstattuple/pgstattuple.c
===================================================================
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
--- a/contrib/pgstattuple/pgstattuple.c (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/contrib/pgstattuple/pgstattuple.c (revision ea1fcacc7cead3e2fccf581d20e51244a7107435)
@@ -286,6 +286,9 @@
case BRIN_AM_OID:
err = "brin index";
break;
+ case STIR_AM_OID:
+ err = "stir index";
+ break;
default:
err = "unknown index";
break;
@@ -329,7 +332,7 @@
errmsg("only heap AM is supported")));
/* Disable syncscan because we assume we scan from block zero upwards */
- scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+ scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
hscan = (HeapScanDesc) scan;
InitDirtySnapshot(SnapshotDirty);
Index: src/backend/access/Makefile
===================================================================
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
--- a/src/backend/access/Makefile (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/backend/access/Makefile (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -9,6 +9,6 @@
include $(top_builddir)/src/Makefile.global
SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \
- sequence table tablesample transam
+ sequence stir table tablesample transam
include $(top_srcdir)/src/backend/common.mk
Index: src/backend/access/meson.build
===================================================================
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
--- a/src/backend/access/meson.build (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/backend/access/meson.build (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -14,3 +14,4 @@
+subdir('stir')
subdir('table')
subdir('tablesample')
subdir('transam')
Index: src/backend/commands/analyze.c
===================================================================
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
--- a/src/backend/commands/analyze.c (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/backend/commands/analyze.c (revision 75cd94daf4b0b6147e7f3a386ad1a93fb086653b)
@@ -719,6 +719,7 @@
ivinfo.message_level = elevel;
ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
ivinfo.strategy = vac_strategy;
+ ivinfo.validate_index = false;
stats = index_vacuum_cleanup(&ivinfo, NULL);
Index: src/backend/commands/vacuumparallel.c
===================================================================
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
--- a/src/backend/commands/vacuumparallel.c (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/backend/commands/vacuumparallel.c (revision 75cd94daf4b0b6147e7f3a386ad1a93fb086653b)
@@ -883,6 +883,7 @@
ivinfo.estimated_count = pvs->shared->estimated_count;
ivinfo.num_heap_tuples = pvs->shared->reltuples;
ivinfo.strategy = pvs->bstrategy;
+ ivinfo.validate_index = false;
/* Update error traceback information */
pvs->indname = pstrdup(RelationGetRelationName(indrel));
Index: src/include/access/genam.h
===================================================================
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
--- a/src/include/access/genam.h (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/access/genam.h (revision 75cd94daf4b0b6147e7f3a386ad1a93fb086653b)
@@ -48,6 +48,7 @@
bool analyze_only; /* ANALYZE (without any actual vacuum) */
bool report_progress; /* emit progress.h status reports */
bool estimated_count; /* num_heap_tuples is an estimate */
+ bool validate_index; /* index validation pass, not an actual vacuum */
int message_level; /* ereport level for progress messages */
double num_heap_tuples; /* tuples remaining in heap */
BufferAccessStrategy strategy; /* access strategy for reads */
Index: src/include/access/reloptions.h
===================================================================
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
--- a/src/include/access/reloptions.h (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/access/reloptions.h (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -51,8 +51,9 @@
RELOPT_KIND_VIEW = (1 << 9),
RELOPT_KIND_BRIN = (1 << 10),
RELOPT_KIND_PARTITIONED = (1 << 11),
+ RELOPT_KIND_STIR = (1 << 12),
/* if you add a new kind, make sure you update "last_default" too */
- RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+ RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
Index: src/include/catalog/pg_am.dat
===================================================================
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
--- a/src/include/catalog/pg_am.dat (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/catalog/pg_am.dat (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -33,5 +33,7 @@
{ oid => '3580', oid_symbol => 'BRIN_AM_OID',
descr => 'block range index (BRIN) access method',
amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
-
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+ descr => 'short term index replacement',
+ amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
]
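For reviewers unfamiliar with STIR ("short term index replacement"): judging from the amutils.out expectations further down, it is a deliberately minimal AM. A reduced model of the capability flags it appears to advertise (the struct and field names below are mine, shaped after IndexAmRoutine; the amgettuple = false line is my assumption that a throwaway build-time AM offers no scan support):

    #include <stdbool.h>
    #include <stdio.h>

    struct am_caps
    {
        bool amcanorder;
        bool amcanunique;
        bool amcanmulticol;
        bool amcaninclude;
        bool amgettuple;     /* assumption: no index scans at all */
    };

    static const struct am_caps stir_caps = {
        .amcanorder = false,     /* can_order = f */
        .amcanunique = false,    /* can_unique = f */
        .amcanmulticol = true,   /* can_multi_col = t */
        .amcaninclude = true,    /* can_include = t; can_exclude is also f */
        .amgettuple = false,
    };

    int
    main(void)
    {
        printf("stir: order=%d unique=%d multicol=%d include=%d\n",
               stir_caps.amcanorder, stir_caps.amcanunique,
               stir_caps.amcanmulticol, stir_caps.amcaninclude);
        return 0;
    }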
Index: src/include/catalog/pg_amop.dat
===================================================================
diff --git a/src/include/catalog/pg_amop.dat b/src/include/catalog/pg_amop.dat
--- a/src/include/catalog/pg_amop.dat (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/catalog/pg_amop.dat (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -3227,4 +3227,8 @@
amoprighttype => 'point', amopstrategy => '7', amopopr => '@>(box,point)',
amopmethod => 'brin' },
+{ amopfamily => 'stir/record_ops', amoplefttype => 'record',
+ amoprighttype => 'record', amopstrategy => '1', amopopr => '=(record,record)',
+ amopmethod => 'stir' },
+
]
Index: src/include/catalog/pg_opclass.dat
===================================================================
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
--- a/src/include/catalog/pg_opclass.dat (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/catalog/pg_opclass.dat (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -488,4 +488,8 @@
# no brin opclass for the geometric types except box
+{ oid => '5557', oid_symbol => 'RECORD_STIR_OPS_OID',
+ opcmethod => 'stir', opcname => 'record_ops', opcfamily => 'stir/record_ops',
+ opcintype => 'record' },
+
]
Index: src/include/catalog/pg_opfamily.dat
===================================================================
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
--- a/src/include/catalog/pg_opfamily.dat (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/catalog/pg_opfamily.dat (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -302,6 +302,8 @@
opfmethod => 'btree', opfname => 'multirange_ops' },
{ oid => '4225',
opfmethod => 'hash', opfname => 'multirange_ops' },
+{ oid => '5558',
+ opfmethod => 'stir', opfname => 'record_ops' },
{ oid => '6158',
opfmethod => 'gist', opfname => 'multirange_ops' },
Index: src/include/catalog/pg_proc.dat
===================================================================
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
--- a/src/include/catalog/pg_proc.dat (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/catalog/pg_proc.dat (revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -935,6 +935,10 @@
proname => 'brinhandler', provolatile => 'v',
prorettype => 'index_am_handler', proargtypes => 'internal',
prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'stir index access method handler',
+ proname => 'stirhandler', provolatile => 'v',
+ prorettype => 'index_am_handler', proargtypes => 'internal',
+ prosrc => 'stirhandler' },
{ oid => '3952', descr => 'brin: standalone scan new table pages',
proname => 'brin_summarize_new_values', provolatile => 'v',
proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
@@ -5487,9 +5491,9 @@
proname => 'pg_stat_get_activity', prorows => '100', proisstrict => 'f',
proretset => 't', provolatile => 's', proparallel => 'r',
prorettype => 'record', proargtypes => 'int4',
- proallargtypes => '{int4,oid,int4,oid,text,text,text,text,text,timestamptz,timestamptz,timestamptz,timestamptz,inet,text,int4,xid,xid,text,bool,text,text,int4,text,numeric,text,bool,text,bool,bool,int4,int8}',
- proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
- proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc,gss_delegation,leader_pid,query_id}',
+ proallargtypes => '{int4,oid,int4,oid,text,text,text,text,text,timestamptz,timestamptz,timestamptz,timestamptz,inet,text,int4,xid,xid,text,bool,text,text,int4,text,numeric,text,bool,text,bool,bool,int4,int8,xid}',
+ proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc,gss_delegation,leader_pid,query_id,backend_catalog_xmin}',
prosrc => 'pg_stat_get_activity' },
{ oid => '6318', descr => 'describe wait events',
proname => 'pg_get_wait_events', procost => '10', prorows => '250',
Index: src/include/utils/index_selfuncs.h
===================================================================
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
--- a/src/include/utils/index_selfuncs.h (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/utils/index_selfuncs.h (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -70,5 +70,13 @@
Selectivity *indexSelectivity,
double *indexCorrelation,
double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+ struct IndexPath *path,
+ double loop_count,
+ Cost *indexStartupCost,
+ Cost *indexTotalCost,
+ Selectivity *indexSelectivity,
+ double *indexCorrelation,
+ double *indexPages);
#endif /* INDEX_SELFUNCS_H */
Index: src/test/regress/expected/amutils.out
===================================================================
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
--- a/src/test/regress/expected/amutils.out (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/test/regress/expected/amutils.out (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -173,7 +173,13 @@
spgist | can_exclude | t
spgist | can_include | t
spgist | bogus |
-(36 rows)
+ stir | can_order | f
+ stir | can_unique | f
+ stir | can_multi_col | t
+ stir | can_exclude | f
+ stir | can_include | t
+ stir | bogus |
+(42 rows)
--
-- additional checks for pg_index_column_has_property
Index: src/test/regress/expected/opr_sanity.out
===================================================================
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
--- a/src/test/regress/expected/opr_sanity.out (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/test/regress/expected/opr_sanity.out (revision 75cd94daf4b0b6147e7f3a386ad1a93fb086653b)
@@ -2092,7 +2092,8 @@
4000 | 28 | ^@
4000 | 29 | <^
4000 | 30 | >^
-(124 rows)
+ 5555 | 1 | =
+(125 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
Index: src/test/regress/expected/psql.out
===================================================================
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
--- a/src/test/regress/expected/psql.out (revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/test/regress/expected/psql.out (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -5027,7 +5027,8 @@
heap | Table
heap2 | Table
spgist | Index
-(8 rows)
+ stir | Index
+(9 rows)
\dA *
List of access methods
@@ -5041,7 +5042,8 @@
heap | Table
heap2 | Table
spgist | Index
-(8 rows)
+ stir | Index
+(9 rows)
\dA h*
List of access methods
@@ -5077,7 +5079,8 @@
heap | Table | heap_tableam_handler | heap table access method
heap2 | Table | heap_tableam_handler |
spgist | Index | spghandler | SP-GiST index access method
-(8 rows)
+ stir | Index | stirhandler | short term index replacement
+(9 rows)
\dA+ *
List of access methods
@@ -5091,7 +5094,8 @@
heap | Table | heap_tableam_handler | heap table access method
heap2 | Table | heap_tableam_handler |
spgist | Index | spghandler | SP-GiST index access method
-(8 rows)
+ stir | Index | stirhandler | short term index replacement
+(9 rows)
\dA+ h*
List of access methods
Index: src/backend/catalog/system_views.sql
===================================================================
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
--- a/src/backend/catalog/system_views.sql (revision b24132f98f93d14c64dfe41973337e13d5e7636b)
+++ b/src/backend/catalog/system_views.sql (revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -879,6 +879,7 @@
S.state,
S.backend_xid,
s.backend_xmin,
+ s.backend_catalog_xmin,
S.query_id,
S.query,
S.backend_type
Index: src/backend/utils/activity/backend_status.c
===================================================================
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
--- a/src/backend/utils/activity/backend_status.c (revision b24132f98f93d14c64dfe41973337e13d5e7636b)
+++ b/src/backend/utils/activity/backend_status.c (revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -838,6 +838,7 @@
ProcNumberGetTransactionIds(procNumber,
&localentry->backend_xid,
&localentry->backend_xmin,
+ &localentry->backend_catalog_xmin,
&localentry->backend_subxact_count,
&localentry->backend_subxact_overflowed);
Index: src/backend/utils/adt/pgstatfuncs.c
===================================================================
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
--- a/src/backend/utils/adt/pgstatfuncs.c (revision b24132f98f93d14c64dfe41973337e13d5e7636b)
+++ b/src/backend/utils/adt/pgstatfuncs.c (revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -302,7 +302,7 @@
Datum
pg_stat_get_activity(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_ACTIVITY_COLS 31
+#define PG_STAT_GET_ACTIVITY_COLS 32
int num_backends = pgstat_fetch_stat_numbackends();
int curr_backend;
int pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
@@ -353,6 +353,11 @@
else
nulls[15] = true;
+ if (TransactionIdIsValid(local_beentry->backend_catalog_xmin))
+ values[31] = TransactionIdGetDatum(local_beentry->backend_catalog_xmin);
+ else
+ nulls[31] = true;
+
if (TransactionIdIsValid(local_beentry->backend_xmin))
values[16] = TransactionIdGetDatum(local_beentry->backend_xmin);
else
Index: src/include/utils/backend_status.h
===================================================================
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
--- a/src/include/utils/backend_status.h (revision b24132f98f93d14c64dfe41973337e13d5e7636b)
+++ b/src/include/utils/backend_status.h (revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -266,6 +266,8 @@
*/
TransactionId backend_xmin;
+ TransactionId backend_catalog_xmin;
+
/*
* Number of cached subtransactions in the current session.
*/
Index: src/test/regress/expected/rules.out
===================================================================
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
--- a/src/test/regress/expected/rules.out (revision b24132f98f93d14c64dfe41973337e13d5e7636b)
+++ b/src/test/regress/expected/rules.out (revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -1759,10 +1759,11 @@
s.state,
s.backend_xid,
s.backend_xmin,
+ s.backend_catalog_xmin,
s.query_id,
s.query,
s.backend_type
- FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id)
+ FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id, backend_catalog_xmin)
LEFT JOIN pg_database d ON ((s.datid = d.oid)))
LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
pg_stat_all_indexes| SELECT c.oid AS relid,
@@ -1882,7 +1883,7 @@
gss_princ AS principal,
gss_enc AS encrypted,
gss_delegation AS credentials_delegated
- FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id)
+ FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id, backend_catalog_xmin)
WHERE (client_port IS NOT NULL);
pg_stat_io| SELECT backend_type,
object,
@@ -2086,7 +2087,7 @@
w.sync_priority,
w.sync_state,
w.reply_time
- FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id)
+ FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id, backend_catalog_xmin)
JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
pg_stat_replication_slots| SELECT s.slot_name,
@@ -2120,7 +2121,7 @@
ssl_client_dn AS client_dn,
ssl_client_serial AS client_serial,
ssl_issuer_dn AS issuer_dn
- FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id)
+ FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id, backend_catalog_xmin)
WHERE (client_port IS NOT NULL);
pg_stat_subscription| SELECT su.oid AS subid,
su.subname,
Index: src/backend/access/stir/Makefile
===================================================================
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
--- /dev/null (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
+++ b/src/backend/access/stir/Makefile (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/stir
+#
+# IDENTIFICATION
+# src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ stir.o
+
+include $(top_srcdir)/src/backend/common.mk
Index: src/backend/access/stir/meson.build
===================================================================
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
--- /dev/null (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
+++ b/src/backend/access/stir/meson.build (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -0,0 +1,5 @@
+# Copyright (c) 2024-2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+ 'stir.c',
+)
Index: src/backend/access/stir/stir.c
===================================================================
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
--- /dev/null (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
+++ b/src/backend/access/stir/stir.c (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -0,0 +1,517 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ * Implementation of Short-Term Index Replacement.
+ *
+ * Portions Copyright (c) 2024-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "commands/vacuum.h"
+#include "utils/index_selfuncs.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "utils/catcache.h"
+#include "access/amvalidate.h"
+#include "utils/syscache.h"
+#include "access/htup_details.h"
+#include "catalog/pg_amproc.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "utils/regproc.h"
+#include "storage/bufmgr.h"
+#include "access/tableam.h"
+#include "access/reloptions.h"
+#include "utils/memutils.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+ IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+ amroutine->amstrategies = STIR_NSTRATEGIES;
+ amroutine->amsupport = STIR_NPROC;
+ amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+ amroutine->amcanorder = false;
+ amroutine->amcanorderbyop = false;
+ amroutine->amcanbackward = false;
+ amroutine->amcanunique = false;
+ amroutine->amcanmulticol = true;
+ amroutine->amoptionalkey = true;
+ amroutine->amsearcharray = false;
+ amroutine->amsearchnulls = false;
+ amroutine->amstorage = false;
+ amroutine->amclusterable = false;
+ amroutine->ampredlocks = false;
+ amroutine->amcanparallel = false;
+ amroutine->amcanbuildparallel = false;
+ amroutine->amcaninclude = true;
+ amroutine->amusemaintenanceworkmem = false;
+ amroutine->amparallelvacuumoptions =
+ VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+ amroutine->amkeytype = InvalidOid;
+
+ amroutine->ambuild = stirbuild;
+ amroutine->ambuildempty = stirbuildempty;
+ amroutine->aminsert = stirinsert;
+ amroutine->aminsertcleanup = NULL;
+ amroutine->ambulkdelete = stirbulkdelete;
+ amroutine->amvacuumcleanup = stirvacuumcleanup;
+ amroutine->amcanreturn = NULL;
+ amroutine->amcostestimate = stircostestimate;
+ amroutine->amoptions = stiroptions;
+ amroutine->amproperty = NULL;
+ amroutine->ambuildphasename = NULL;
+ amroutine->amvalidate = stirvalidate;
+ amroutine->amadjustmembers = NULL;
+ amroutine->ambeginscan = stirbeginscan;
+ amroutine->amrescan = stirrescan;
+ amroutine->amgettuple = NULL;
+ amroutine->amgetbitmap = NULL;
+ amroutine->amendscan = stirendscan;
+ amroutine->ammarkpos = NULL;
+ amroutine->amrestrpos = NULL;
+ amroutine->amestimateparallelscan = NULL;
+ amroutine->aminitparallelscan = NULL;
+ amroutine->amparallelrescan = NULL;
+
+ PG_RETURN_POINTER(amroutine);
+}
+
+bool
+stirvalidate(Oid opclassoid)
+{
+ bool result = true;
+ HeapTuple classtup;
+ Form_pg_opclass classform;
+ Oid opfamilyoid;
+ HeapTuple familytup;
+ Form_pg_opfamily familyform;
+ char *opfamilyname;
+ CatCList *proclist,
+ *oprlist;
+ int i;
+
+ /* Fetch opclass information */
+ classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+ if (!HeapTupleIsValid(classtup))
+ elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+ classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+ opfamilyoid = classform->opcfamily;
+
+
+ /* Fetch opfamily information */
+ familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+ if (!HeapTupleIsValid(familytup))
+ elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+ familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+ opfamilyname = NameStr(familyform->opfname);
+
+ /* Fetch all operators and support functions of the opfamily */
+ oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+ proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+ /* Check individual operators */
+ for (i = 0; i < oprlist->n_members; i++)
+ {
+ HeapTuple oprtup = &oprlist->members[i]->tuple;
+ Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it is an allowed strategy number for stir */
+ if (oprform->amopstrategy < 1 ||
+ oprform->amopstrategy > STIR_NSTRATEGIES)
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+ opfamilyname,
+ format_operator(oprform->amopopr),
+ oprform->amopstrategy)));
+ result = false;
+ }
+
+ /* stir doesn't support ORDER BY operators */
+ if (oprform->amoppurpose != AMOP_SEARCH ||
+ OidIsValid(oprform->amopsortfamily))
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+ opfamilyname,
+ format_operator(oprform->amopopr))));
+ result = false;
+ }
+
+ /* Check operator signature --- same for all stir strategies */
+ if (!check_amop_signature(oprform->amopopr, BOOLOID,
+ oprform->amoplefttype,
+ oprform->amoprighttype))
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains operator %s with wrong signature",
+ opfamilyname,
+ format_operator(oprform->amopopr))));
+ result = false;
+ }
+ }
+
+
+ ReleaseCatCacheList(proclist);
+ ReleaseCatCacheList(oprlist);
+ ReleaseSysCache(familytup);
+ ReleaseSysCache(classtup);
+
+ return result;
+}
+
+
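+/*
+ * Initialize the contents of a stir metapage in the given buffer page.
+ */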
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+ StirMetaPageData *metadata;
+
+ StirInitPage(metaPage, STIR_META);
+ metadata = StirPageGetMeta(metaPage);
+ memset(metadata, 0, sizeof(StirMetaPageData));
+ metadata->magickNumber = STIR_MAGICK_NUMBER;
+ metadata->skipInserts = skipInserts;
+ ((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+ Buffer metaBuffer;
+ Page metaPage;
+ GenericXLogState *state;
+
+ /*
+ * Make a new page; since it is first page it should be associated with
+ * block number 0 (STIR_METAPAGE_BLKNO). No need to hold the extension
+ * lock because there cannot be concurrent inserters yet.
+ */
+ metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+ Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+ /* Initialize contents of meta page */
+ state = GenericXLogStart(index);
+ metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+ GENERIC_XLOG_FULL_IMAGE);
+ StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+ GenericXLogFinish(state);
+
+ UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+ StirPageOpaque opaque;
+
+ PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+ opaque = StirPageGetOpaque(page);
+ opaque->flags = flags;
+ opaque->stir_page_id = STIR_PAGE_ID;
+}
+
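+/*
+ * Add a tuple to the end of the given page, if it fits. Returns false
+ * without modifying the page when there is not enough free space.
+ */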
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+ StirTuple *itup;
+ StirPageOpaque opaque;
+ Pointer ptr;
+
+ /* We shouldn't be pointed to an invalid page */
+ Assert(!PageIsNew(page));
+
+ /* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(page) < sizeof(StirTuple))
+ return false;
+
+ /* Copy new tuple to the end of page */
+ opaque = StirPageGetOpaque(page);
+ itup = StirPageGetTuple(page, opaque->maxoff + 1);
+ memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+ /* Adjust maxoff and pd_lower */
+ opaque->maxoff++;
+ ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+ ((PageHeader) page)->pd_lower = ptr - page;
+
+ /* Assert we didn't overrun available space */
+ Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+ return true;
+}
+
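+/*
+ * Insert a new tuple into the stir index. We only ever append: try the
+ * last data page recorded in the metapage; if the tuple does not fit (or
+ * no data page exists yet), extend the relation under the metapage lock,
+ * retrying if someone else extended the index first. Inserts are silently
+ * skipped once the metapage is marked with skipInserts.
+ */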
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo)
+{
+ StirTuple *itup;
+ MemoryContext oldCtx;
+ MemoryContext insertCtx;
+ StirMetaPageData *metaData;
+ Buffer buffer,
+ metaBuffer;
+ Page page;
+ GenericXLogState *state;
+	BlockNumber blkNo;
+
+ insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Stir insert temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+
+ oldCtx = MemoryContextSwitchTo(insertCtx);
+
+ itup = (StirTuple *) palloc0(sizeof(StirTuple));
+ itup->heapPtr = *ht_ctid;
+
+ metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+ for (;;)
+ {
+ LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+ metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+ if (metaData->skipInserts)
+ {
+ UnlockReleaseBuffer(metaBuffer);
+ return false;
+ }
+ blkNo = metaData->lastBlkNo;
+ /* Don't hold metabuffer lock while doing insert */
+ LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+ if (blkNo > 0)
+ {
+ buffer = ReadBuffer(index, blkNo);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ page = GenericXLogRegisterBuffer(state, buffer, 0);
+
+ Assert(!PageIsNew(page));
+
+ if (StirPageAddItem(page, itup))
+ {
+ /* Success! Apply the change, clean up, and exit */
+ GenericXLogFinish(state);
+ UnlockReleaseBuffer(buffer);
+ ReleaseBuffer(metaBuffer);
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextDelete(insertCtx);
+ return false;
+ }
+
+ /* Didn't fit, must try other pages */
+ GenericXLogAbort(state);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ metaData = StirPageGetMeta(GenericXLogRegisterBuffer(state, metaBuffer, GENERIC_XLOG_FULL_IMAGE));
+ if (blkNo != metaData->lastBlkNo)
+ {
+ Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else added a new page to the index; retry the insert */
+ GenericXLogAbort(state);
+ LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+ else
+ {
+ /* Must extend the file */
+ buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
+
+ page = GenericXLogRegisterBuffer(state, buffer, GENERIC_XLOG_FULL_IMAGE);
+ StirInitPage(page, 0);
+
+ if (!StirPageAddItem(page, itup))
+ {
+				/* We shouldn't get here since we're inserting into an empty page */
+ elog(ERROR, "could not add new stir tuple to empty page");
+ }
+ metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+ GenericXLogFinish(state);
+
+ UnlockReleaseBuffer(buffer);
+ UnlockReleaseBuffer(metaBuffer);
+
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextDelete(insertCtx);
+
+ return false;
+ }
+ }
+}
+
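+/*
+ * STIR indexes are never scanned: amgettuple and amgetbitmap are NULL, so
+ * the following callbacks exist only to satisfy the index AM API and
+ * simply raise an error if ever reached.
+ */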
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+ ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
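+/*
+ * Build a new stir index. Only the metapage is initialized; no heap scan
+ * is performed, so the index starts out empty.
+ */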
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+ struct IndexInfo *indexInfo)
+{
+ IndexBuildResult *result;
+
+ StirInitMetapage(index, MAIN_FORKNUM);
+
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+ result->heap_tuples = 0;
+ result->index_tuples = 0;
+ return result;
+}
+
+void stirbuildempty(Relation index)
+{
+ StirInitMetapage(index, INIT_FORKNUM);
+}
+
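+/*
+ * Bulk-delete callback. STIR supports this only as part of index
+ * validation (info->validate_index); a plain VACUUM marks the index as
+ * skip-inserts and warns that it should be dropped. Tuples are never
+ * actually deleted here.
+ */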
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats,
+ IndexBulkDeleteCallback callback,
+ void *callback_state)
+{
+ Relation index = info->index;
+ BlockNumber blkno, npages;
+ Buffer buffer;
+ Page page;
+
+ if (!info->validate_index)
+ {
+ StirMarkAsSkipInserts(index);
+
+ ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("\"%s\" is not implemented, it seems this index needs to be dropped", __func__)));
+ return NULL;
+ }
+
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ /*
+ * Iterate over the pages. We don't care about concurrently added pages,
+ * because TODO
+ */
+ npages = RelationGetNumberOfBlocks(index);
+ for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+ {
+ StirTuple *itup, *itupEnd;
+
+ vacuum_delay_point();
+
+ buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+ RBM_NORMAL, info->strategy);
+
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buffer);
+
+ if (PageIsNew(page))
+ {
+ UnlockReleaseBuffer(buffer);
+ continue;
+ }
+
+ itup = StirPageGetTuple(page, FirstOffsetNumber);
+ itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+ while (itup < itupEnd)
+ {
+ /* Do we have to delete this tuple? */
+ if (callback(&itup->heapPtr, callback_state))
+ {
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+ }
+
+ itup = StirPageGetNextTuple(itup);
+ }
+
+ UnlockReleaseBuffer(buffer);
+ }
+
+ return stats;
+}
+
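+/*
+ * Mark the metapage so that all future stirinsert calls become no-ops.
+ * The flag is only ever set, never cleared.
+ */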
+void StirMarkAsSkipInserts(Relation index)
+{
+ StirMetaPageData *metaData;
+ Buffer metaBuffer;
+ Page metaPage;
+ GenericXLogState *state;
+
+ metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+ GENERIC_XLOG_FULL_IMAGE);
+ metaData = StirPageGetMeta(metaPage);
+ if (!metaData->skipInserts)
+ {
+ metaData->skipInserts = true;
+ GenericXLogFinish(state);
+ }
+ else
+ {
+ GenericXLogAbort(state);
+ }
+ UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats)
+{
+ StirMarkAsSkipInserts(info->index);
+ ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, it seems this index needs to be dropped", __func__)));
+ return NULL;
+}
+
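+/* stir indexes have no reloptions */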
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+ return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+ double loop_count, Cost *indexStartupCost,
+ Cost *indexTotalCost, Selectivity *indexSelectivity,
+ double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
Index: src/include/access/stir.h
===================================================================
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
--- /dev/null (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
+++ b/src/include/access/stir.h (revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ * header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2024-2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC 0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES 1
+
+#define STIR_OPTIONS_PROC 0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page) ((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+ ((StirTuple *)(PageGetContents(page) \
+ + sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+ ((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO (0)
+#define STIR_HEAD_BLKNO (1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+ OffsetNumber maxoff; /* number of index tuples on page */
+ uint16 flags; /* see bit definitions below */
+ uint16 unused; /* placeholder to force maxaligning of size of
+ * StirPageOpaqueData and to place
+ * stir_page_id exactly at the end of page */
+ uint16 stir_page_id; /* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META (1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID 0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+ uint32 magickNumber;
+	BlockNumber	lastBlkNo;
+ bool skipInserts;
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page) ((StirMetaPageData *) PageGetContents(page))
+
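+/* A stir index tuple stores only the TID of the corresponding heap tuple */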
+typedef struct StirTuple
+{
+ ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(page) \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+ - StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+ - MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+ ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+ struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+ void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
Index: src/backend/utils/sort/tuplesortvariants.c
===================================================================
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
--- a/src/backend/utils/sort/tuplesortvariants.c (revision 35f233300cd190b0a17e66f2b4bffa2481e62af9)
+++ b/src/backend/utils/sort/tuplesortvariants.c (revision bc1fe05f38fbdda049075b9b1dc238bf0d9c240e)
@@ -123,6 +123,7 @@
bool enforceUnique; /* complain if we find duplicate tuples */
bool uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore duplicates involving dead tuples */
} TuplesortIndexBTreeArg;
/*
@@ -349,6 +350,7 @@
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem,
SortCoordinate coordinate,
int sortopt)
@@ -391,6 +393,7 @@
arg->index.indexRel = indexRel;
arg->enforceUnique = enforceUnique;
arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+ arg->uniqueDeadIgnored = uniqueDeadIgnored;
indexScanKey = _bt_mkscankey(indexRel, NULL);
@@ -514,6 +517,7 @@
arg->index.indexRel = indexRel;
arg->enforceUnique = false;
arg->uniqueNullsNotDistinct = false;
+ arg->uniqueDeadIgnored = false;
/* Prepare SortSupport data for each column */
base->sortKeys = (SortSupport) palloc0(base->nKeys *
@@ -1520,6 +1524,7 @@
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
char *key_desc;
+	bool		uniqueCheckFail = true; /* set false if a dead tuple excuses the duplicate */
/*
* Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1534,56 @@
*/
Assert(tuple1 != tuple2);
- index_deform_tuple(tuple1, tupDes, values, isnull);
+	/* This is a fail-fast check; see _bt_load for details. */
+ if (arg->uniqueDeadIgnored)
+ {
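+		/*
+		 * Re-check both heap tuples with SnapshotSelf: if either one is dead,
+		 * the apparent duplicate cannot be a live uniqueness violation, so the
+		 * error can be skipped.
+		 */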
+ bool any_tuple_dead,
+ call_again = false,
+ ignored;
+
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+ &TTSOpsBufferHeapTuple);
+ ItemPointerData tid = tuple1->t_tid;
+
+ IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+ any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+ if (!any_tuple_dead)
+ {
+ call_again = false;
+ tid = tuple2->t_tid;
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot,
+													  &call_again, &ignored);
+ }
+
+ if (any_tuple_dead)
+ {
+ elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+ ItemPointerGetBlockNumber(&tuple1->t_tid),
+ ItemPointerGetOffsetNumber(&tuple1->t_tid),
+ ItemPointerGetBlockNumber(&tuple2->t_tid),
+ ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+ uniqueCheckFail = false;
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
+ if (uniqueCheckFail)
+ {
+ index_deform_tuple(tuple1, tupDes, values, isnull);
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
}
/*
Index: src/bin/pg_amcheck/t/007_concurrently_unique.pl
===================================================================
diff --git a/src/bin/pg_amcheck/t/007_concurrently_unique.pl b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
new file mode 100644
--- /dev/null (revision ea1fcacc7cead3e2fccf581d20e51244a7107435)
+++ b/src/bin/pg_amcheck/t/007_concurrently_unique.pl (revision ea1fcacc7cead3e2fccf581d20e51244a7107435)
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($windows_os)
+{
+ plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
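+# The forked pgbench child and the parent communicate through a small SysV
+# shared memory segment: the parent polls it ("wait") while running its
+# REINDEX loop, the child writes "done" or "fail" when pgbench finishes,
+# and the parent finally writes "stop" to let the child exit.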
+my ($pid, $shmem_id, $shmem_key, $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = int(rand(1000000));
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 128MB');
+$node->append_conf('postgresql.conf', 'shared_buffers = 256MB');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+ c1 money default 0, c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i, updated_at)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+ # fork returned undef, so unsuccessful
+ die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+ # $node->psql('postgres', q(INSERT INTO tbl SELECT i,0,0,0,now() FROM generate_series(1, 1000) s(i);));
+ # while [ $? -eq 0 ]; do make -C src/bin/pg_amcheck/ check PROVE_TESTS='t/007_*' ; done
+
+ $node->pgbench(
+ '--no-vacuum --client=40 --exit-on-abort --transactions=10000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent INSERTs, UPDATES and RC',
+ {
+ # Ensure some HOT updates happen
+ '001_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ '002_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*100,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ '003_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ '004_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ });
+
+ if ($child->is_passing()) {
+ shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+ } else {
+ shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+ }
+
+ my $pg_bench_fork_flag;
+ while (1) {
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ sleep(0.1);
+ last if $pg_bench_fork_flag eq "stop";
+ }
+} else {
+ my $pg_bench_fork_flag;
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+ subtest 'reindex run subtest' => sub {
+ is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+ my %psql = (stdin => '', stdout => '', stderr => '');
+ $psql{run} = IPC::Run::start(
+ [ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+ '<',
+ \$psql{stdin},
+ '>',
+ \$psql{stdout},
+ '2>',
+ \$psql{stderr},
+ $psql_timeout);
+
+ my ($result, $stdout, $stderr, $n, $stderr_saved);
+
+# ok(send_query_and_wait(\%psql, q[SELECT pg_sleep(10);], qr/^.*$/m), 'SELECT');
+
+ while (1)
+ {
+
+ if (int(rand(2)) == 0) {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+ } else {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=0);));
+ }
+ is($result, '0', 'ALTER TABLE is correct');
+
+
+ if (1)
+ {
+ my $sql = q(select pg_sleep(0); CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+ is($result, '0', 'CREATE INDEX is correct');
+ $stderr_saved = $stderr;
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+ is($result, '0', 'bt_index_check for new index is correct');
+ if ($result)
+ {
+ diag($stderr);
+ diag($stderr_saved);
+ BAIL_OUT($stderr);
+ } else {
+ diag('create:)' . $n++);
+ }
+
+ if (1)
+ {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'REINDEX 2 is correct');
+ if ($result) {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+ is($result, '0', 'bt_index_check 2 is correct');
+ if ($result)
+ {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ } else {
+ diag('reindex2:)' . $n++);
+ }
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'DROP INDEX is correct');
+ }
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ last if $pg_bench_fork_flag ne "wait";
+ }
+
+ # explicitly shut down psql instances gracefully
+ $psql{stdin} .= "\\q\n";
+ $psql{run}->finish;
+
+ is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+ };
+
+ $child->finalize();
+ $child->summary();
+ $node->stop;
+ done_testing();
+
+ shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+ my ($psql, $query, $untl) = @_;
+ my $ret;
+
+ # For each query we run, we'll restart the timeout. Otherwise the timeout
+ # would apply to the whole test script, and would need to be set very high
+ # to survive when running under Valgrind.
+ $psql_timeout->reset();
+ $psql_timeout->start();
+
+ # send query
+ $$psql{stdin} .= $query;
+ $$psql{stdin} .= "\n";
+
+ # wait for query results
+ $$psql{run}->pump_nb();
+ while (1)
+ {
+ last if $$psql{stdout} =~ /$untl/;
+ if ($psql_timeout->is_expired)
+ {
+ diag("aborting wait: program timed out\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ if (not $$psql{run}->pumpable())
+ {
+ diag("aborting wait: program died\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ $$psql{run}->pump();
+ }
+
+ $$psql{stdout} = '';
+
+ return 1;
+}
Index: src/include/access/transam.h
===================================================================
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
--- a/src/include/access/transam.h (revision 35f233300cd190b0a17e66f2b4bffa2481e62af9)
+++ b/src/include/access/transam.h (revision 3a0fa65e328d51b6c97b44a72778b6ee21fe4478)
@@ -344,6 +344,21 @@
return b;
}
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+ if (!TransactionIdIsValid(a))
+ return b;
+
+ if (!TransactionIdIsValid(b))
+ return a;
+
+ if (TransactionIdFollows(a, b))
+ return a;
+ return b;
+}
+
/* return the older of the two IDs, assuming they're both normal */
static inline TransactionId
NormalTransactionIdOlder(TransactionId a, TransactionId b)
Index: src/include/utils/tuplesort.h
===================================================================
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
--- a/src/include/utils/tuplesort.h (revision 35f233300cd190b0a17e66f2b4bffa2481e62af9)
+++ b/src/include/utils/tuplesort.h (revision 3a0fa65e328d51b6c97b44a72778b6ee21fe4478)
@@ -428,6 +428,7 @@
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem, SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
Index: src/backend/access/nbtree/nbtutils.c
===================================================================
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
--- a/src/backend/access/nbtree/nbtutils.c (revision 3a0fa65e328d51b6c97b44a72778b6ee21fe4478)
+++ b/src/backend/access/nbtree/nbtutils.c (revision bc1fe05f38fbdda049075b9b1dc238bf0d9c240e)
@@ -100,8 +100,6 @@
ScanDirection dir, bool *continuescan);
static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
int tupnatts, TupleDesc tupdesc);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
/*
@@ -4775,6 +4773,14 @@
return tidpivot;
}
+int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   BTScanInsert itup_key)
+{
+	bool		ignored;
+
+	return _bt_keep_natts_wasnull(rel, lastleft, firstright, itup_key, &ignored);
+}
+
+
/*
* _bt_keep_natts - how many key attributes to keep when truncating.
*
@@ -4786,9 +4792,10 @@
* number of key attributes for the index relation. This indicates that the
* caller must use a heap TID as a unique-ifier in new pivot tuple.
*/
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
+int
+_bt_keep_natts_wasnull(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key,
+ bool *wasnull)
{
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
TupleDesc itupdesc = RelationGetDescr(rel);
@@ -4814,6 +4821,7 @@
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ (*wasnull) |= (isNull1 || isNull2);
if (isNull1 != isNull2)
break;
@@ -4838,6 +4846,13 @@
return keepnatts;
}
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+ bool ignored;
+ return _bt_keep_natts_fast_wasnull(rel, lastleft, firstright, &ignored);
+}
+
/*
* _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
*
@@ -4861,7 +4876,8 @@
* more balanced split point.
*/
int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast_wasnull(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ bool *wasnull)
{
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4878,6 +4894,7 @@
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		*wasnull |= (isNull1 || isNull2);
att = TupleDescAttr(itupdesc, attnum - 1);
if (isNull1 != isNull2)
Index: src/include/access/nbtree.h
===================================================================
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
--- a/src/include/access/nbtree.h (revision 3a0fa65e328d51b6c97b44a72778b6ee21fe4478)
+++ b/src/include/access/nbtree.h (revision bc1fe05f38fbdda049075b9b1dc238bf0d9c240e)
@@ -1302,8 +1302,15 @@
extern char *btbuildphasename(int64 phasenum);
extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_wasnull(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, BTScanInsert itup_key,
+ bool *wasnull);
extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
IndexTuple firstright);
+extern int _bt_keep_natts_fast_wasnull(Relation rel, IndexTuple lastleft,
+ IndexTuple firstright, bool *wasnull);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
Index: src/backend/optimizer/util/plancat.c
===================================================================
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
--- a/src/backend/optimizer/util/plancat.c (revision bc1fe05f38fbdda049075b9b1dc238bf0d9c240e)
+++ b/src/backend/optimizer/util/plancat.c (revision 94aa5d7dab7e8ebd77004b50ba96b1f82a04c249)
@@ -720,6 +720,7 @@
/* Results */
List *results = NIL;
+ bool foundValid = false;
/*
* Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -813,7 +814,13 @@
idxRel = index_open(indexoid, rte->rellockmode);
idxForm = idxRel->rd_index;
- if (!idxForm->indisvalid)
+ /*
+	 * Consider both indisvalid and indisready indexes, because an
+	 * indisready index may become indisvalid before the execution phase.
+	 * The set of indexes used as arbiters must stay the same for all
+	 * concurrent transactions.
+ */
+ if (!idxForm->indisready)
goto next;
/*
@@ -835,10 +842,9 @@
errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
results = lappend_oid(results, idxForm->indexrelid);
- list_free(indexList);
+ foundValid |= idxForm->indisvalid;
index_close(idxRel, NoLock);
- table_close(relation, NoLock);
- return results;
+ break;
}
else if (indexOidFromConstraint != InvalidOid)
{
@@ -932,6 +938,7 @@
goto next;
results = lappend_oid(results, idxForm->indexrelid);
+ foundValid |= idxForm->indisvalid;
next:
index_close(idxRel, NoLock);
}
@@ -939,7 +946,8 @@
list_free(indexList);
table_close(relation, NoLock);
- if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+ if (results == NIL || !foundValid)
ereport(ERROR,
(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
Index: src/backend/access/index/genam.c
===================================================================
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
--- a/src/backend/access/index/genam.c (revision 94aa5d7dab7e8ebd77004b50ba96b1f82a04c249)
+++ b/src/backend/access/index/genam.c (revision ea1fcacc7cead3e2fccf581d20e51244a7107435)
@@ -454,7 +454,7 @@
*/
sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
nkeys, key,
- true, false);
+ true, false, false);
sysscan->iscan = NULL;
}
Index: src/test/modules/injection_points/Makefile
===================================================================
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
--- a/src/test/modules/injection_points/Makefile (revision 56c9d3f4842baa53d7ab13d0764eae7f305aba0f)
+++ b/src/test/modules/injection_points/Makefile (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -13,7 +13,8 @@
REGRESS = injection_points
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = inplace
+ISOLATION = inplace \
+ reset_snapshots
TAP_TESTS = 1
Index: src/test/modules/injection_points/expected/reset_snapshots.out
===================================================================
diff --git a/src/test/modules/injection_points/expected/reset_snapshots.out b/src/test/modules/injection_points/expected/reset_snapshots.out
new file mode 100644
--- /dev/null (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
+++ b/src/test/modules/injection_points/expected/reset_snapshots.out (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -0,0 +1,318 @@
+unused step name: sleep
+Parsed test spec with 2 sessions
+
+starting permutation: set_parallel_workers_1 create_index_concurrently_simple reindex_index_concurrently drop_index detach
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step set_parallel_workers_1: ALTER TABLE test.tbl SET (parallel_workers=0);
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_index_concurrently_simple: CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j);
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx;
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach:
+ SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: set_parallel_workers_1 create_unique_index_concurrently_simple reindex_index_concurrently drop_index detach
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step set_parallel_workers_1: ALTER TABLE test.tbl SET (parallel_workers=0);
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_unique_index_concurrently_simple: CREATE UNIQUE INDEX CONCURRENTLY idx ON test.tbl(i);
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx;
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach:
+ SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: set_parallel_workers_1 create_index_concurrently_predicate_expression_mod reindex_index_concurrently drop_index detach
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step set_parallel_workers_1: ALTER TABLE test.tbl SET (parallel_workers=0);
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_index_concurrently_predicate_expression_mod: CREATE INDEX CONCURRENTLY idx ON test.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx;
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach:
+ SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: set_parallel_workers_1 create_index_concurrently_predicate_set_xid_no_param reindex_index_concurrently drop_index detach
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step set_parallel_workers_1: ALTER TABLE test.tbl SET (parallel_workers=0);
+step create_index_concurrently_predicate_set_xid_no_param: CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j) WHERE test.predicate_stable_no_param();
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx;
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach:
+ SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: set_parallel_workers_1 create_index_concurrently_predicate_set_xid reindex_index_concurrently drop_index detach
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step set_parallel_workers_1: ALTER TABLE test.tbl SET (parallel_workers=0);
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_index_concurrently_predicate_set_xid: CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j) WHERE test.predicate_stable(i);
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx;
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach:
+ SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: set_parallel_workers_2 create_index_concurrently_simple wakeup reindex_index_concurrently wakeup drop_index detach
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step set_parallel_workers_2: ALTER TABLE test.tbl SET (parallel_workers=2);
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+step create_index_concurrently_simple: CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j); <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_index_concurrently_simple: <... completed>
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx; <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: <... completed>
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach:
+ SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: set_parallel_workers_2 create_unique_index_concurrently_simple wakeup reindex_index_concurrently wakeup drop_index detach
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step set_parallel_workers_2: ALTER TABLE test.tbl SET (parallel_workers=2);
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+step create_unique_index_concurrently_simple: CREATE UNIQUE INDEX CONCURRENTLY idx ON test.tbl(i); <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_unique_index_concurrently_simple: <... completed>
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx; <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: <... completed>
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach:
+ SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: set_parallel_workers_2 create_index_concurrently_predicate_expression_mod wakeup reindex_index_concurrently wakeup drop_index detach
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step set_parallel_workers_2: ALTER TABLE test.tbl SET (parallel_workers=2);
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+step create_index_concurrently_predicate_expression_mod: CREATE INDEX CONCURRENTLY idx ON test.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0; <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_index_concurrently_predicate_expression_mod: <... completed>
+test: NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx; <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+test: NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: <... completed>
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach:
+ SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
Index: src/test/modules/injection_points/meson.build
===================================================================
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
--- a/src/test/modules/injection_points/meson.build (revision 56c9d3f4842baa53d7ab13d0764eae7f305aba0f)
+++ b/src/test/modules/injection_points/meson.build (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -42,6 +42,7 @@
'isolation': {
'specs': [
'inplace',
+ 'reset_snapshots',
],
},
'tap': {
Index: src/test/modules/injection_points/specs/reset_snapshots.spec
===================================================================
diff --git a/src/test/modules/injection_points/specs/reset_snapshots.spec b/src/test/modules/injection_points/specs/reset_snapshots.spec
new file mode 100644
--- /dev/null (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
+++ b/src/test/modules/injection_points/specs/reset_snapshots.spec (revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -0,0 +1,114 @@
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE TABLE test.tbl(i int primary key, j int);
+ INSERT INTO test.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+
+ CREATE FUNCTION test.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+ END; $$;
+
+ CREATE FUNCTION test.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+ END; $$;
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session test
+setup {
+ SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ SELECT injection_points_attach('_bt_leader_participate_as_worker', 'wait');
+}
+step sleep { SELECT pg_sleep(10); }
+step drop_index { DROP INDEX CONCURRENTLY test.idx; }
+step create_index_concurrently_simple { CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j); }
+step create_unique_index_concurrently_simple { CREATE UNIQUE INDEX CONCURRENTLY idx ON test.tbl(i); }
+step create_index_concurrently_predicate_expression_mod { CREATE INDEX CONCURRENTLY idx ON test.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0; }
+step create_index_concurrently_predicate_set_xid { CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j) WHERE test.predicate_stable(i); }
+step create_index_concurrently_predicate_set_xid_no_param { CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j) WHERE test.predicate_stable_no_param(); }
+step reindex_index_concurrently { REINDEX INDEX CONCURRENTLY test.idx; }
+step set_parallel_workers_1 { ALTER TABLE test.tbl SET (parallel_workers=0); }
+step set_parallel_workers_2 { ALTER TABLE test.tbl SET (parallel_workers=2); }
+step detach {
+ SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ SELECT injection_points_detach('_bt_leader_participate_as_worker');
+}
+
+session wakeup_session
+step wakeup { SELECT injection_points_wakeup('_bt_leader_participate_as_worker'); }
+
+permutation
+ set_parallel_workers_1
+ create_index_concurrently_simple
+ reindex_index_concurrently
+ drop_index
+ detach
+
+permutation
+ set_parallel_workers_1
+ create_unique_index_concurrently_simple
+ reindex_index_concurrently
+ drop_index
+ detach
+
+permutation
+ set_parallel_workers_1
+ create_index_concurrently_predicate_expression_mod
+ reindex_index_concurrently
+ drop_index
+ detach
+
+permutation
+ set_parallel_workers_1
+ create_index_concurrently_predicate_set_xid_no_param
+ reindex_index_concurrently
+ drop_index
+ detach
+
+permutation
+ set_parallel_workers_1
+ create_index_concurrently_predicate_set_xid
+ reindex_index_concurrently
+ drop_index
+ detach
+
+permutation
+ set_parallel_workers_2
+ create_index_concurrently_simple
+ wakeup
+ reindex_index_concurrently
+ wakeup
+ drop_index
+ detach
+
+permutation
+ set_parallel_workers_2
+ create_unique_index_concurrently_simple
+ wakeup
+ reindex_index_concurrently
+ wakeup
+ drop_index
+ detach
+
+permutation
+ set_parallel_workers_2
+ create_index_concurrently_predicate_expression_mod
+ wakeup
+ reindex_index_concurrently
+ wakeup
+ drop_index
+ detach
\ No newline at end of file
Hello, everyone!
With winter approaching, it’s the perfect time to dive back into work on
this patch! :)
The first attached patch implements Matthias's idea of periodically
resetting the snapshot during the initial heap scan. The next step will be
to add support for parallel builds.
Additionally, here are a few comments on previous emails:
In heapam_index_build_range_scan, it seems like you're popping the
snapshot and registering a new one while holding a tuple from
heap_getnext(), thus while holding a page lock. I'm not so sure that's
OK, especially when catalogs are also involved (specifically for
expression indexes, where functions could potentially be updated or
dropped if we re-create the visibility snapshot)
Now, visibility snapshots are updated between pages; a condensed sketch of
the reset follows below.
As for the catalog snapshot:
* Dropping functions isn’t possible due to dependencies and locking
constraints.
* Updating functions is possible, but it offers the same level of isolation
as we have now:
1) Functions are already converted into an execution state and aren’t
re-read from the catalog during the scan.
2) During the validation phase, the latest version of a function will be
used.
3) Even in the initial phase, predicates and expressions could be read
using different catalog snapshots, as it’s possible to receive a cache
invalidation message before the first FormIndexDatum is created.
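For reference, this is the core of the between-pages reset as implemented in
the attached patch, condensed here for readability (the asserts and the
injection-point hook are omitted; see heap_reset_scan_snapshot in heapam.c
below for the full version). Both the scan snapshot and the catalog snapshot
are dropped, so nothing pins the backend's xmin while the fresh snapshot is
taken:

    static inline void
    heap_reset_scan_snapshot(TableScanDesc sscan)
    {
        /* The scan's snapshot is the active one; drop both references. */
        PopActiveSnapshot();
        UnregisterSnapshot(sscan->rs_snapshot);

        /* Release the catalog snapshot too, so it cannot hold xmin back. */
        InvalidateCatalogSnapshotConditionally();

        /* MyProc->xmin is now invalid, so the horizon may advance. */
        sscan->rs_snapshot = RegisterSnapshot(GetLatestSnapshot());
        PushActiveSnapshot(sscan->rs_snapshot);
    }

It is invoked from heap_fetch_next_buffer every SO_RESET_SNAPSHOT_EACH_N_PAGE
(64) pages, i.e. between page reads rather than while tuples from a page are
being returned.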
Best regards,
Mikhail.
Attachments:
v1-0001-Allow-advancing-xmin-during-non-unique-non-parall.patch
From f0ad209453b645728570a1f57b364517bcfdf734 Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Tue, 12 Nov 2024 13:09:29 +0100
Subject: [PATCH v1] Allow advancing xmin during non-unique, non-parallel
concurrent index builds by periodically resetting snapshots
Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.
This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.
Currently, this technique is applied to:
- Non-parallel index builds: parallel index builds are not yet supported and will be addressed in a future commit.
- Non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; support for them may be added in the future.
- Only during the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness.
To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.
Regression tests are added to verify the behavior.
Author: Michail Nikolaev
Reviewed-by: [Reviewers' Names]
Discussion: https://postgr.es/m/CANtu0oiLc-%2B7h9zfzOVy2cv2UuYk_5MUReVLnVbOay6OgD_KGg%40mail.gmail.com
---
contrib/amcheck/verify_nbtree.c | 3 +-
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/brin/brin.c | 4 +
src/backend/access/heap/heapam.c | 37 +++++++
src/backend/access/heap/heapam_handler.c | 45 ++++++--
src/backend/access/index/genam.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 4 +
src/backend/catalog/index.c | 30 +++++-
src/backend/commands/indexcmds.c | 10 --
src/backend/optimizer/plan/planner.c | 9 ++
src/include/access/tableam.h | 27 ++++-
src/test/modules/injection_points/Makefile | 2 +-
.../expected/cic_reset_snapshots.out | 102 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../sql/cic_reset_snapshots.sql | 82 ++++++++++++++
15 files changed, 332 insertions(+), 28 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 8b82797c10..23c138db0a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK? */
+ true, /* syncscan OK? */
+ false);
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4..ff7cc07df9 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
errmsg("only heap AM is supported")));
/* Disable syncscan because we assume we scan from block zero upwards */
- scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+ scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
hscan = (HeapScanDesc) scan;
InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c0b978119a..94c086073e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2430,8 +2430,12 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
else
querylen = 0; /* keep compiler quiet */
+ if (IsMVCCSnapshot(snapshot))
+ PushActiveSnapshot(snapshot);
/* Everyone's had a chance to ask for space, so now create the DSM */
InitializeParallelDSM(pcxt);
+ if (IsMVCCSnapshot(snapshot))
+ PopActiveSnapshot();
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dc..21a2515de3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
#include "utils/datum.h"
#include "utils/inval.h"
#include "utils/spccache.h"
+#include "utils/injection_point.h"
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,28 @@ heap_prepare_pagescan(TableScanDesc sscan)
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+ Assert(ActiveSnapshotSet());
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ PopActiveSnapshot();
+ UnregisterSnapshot(sscan->rs_snapshot);
+ sscan->rs_snapshot = InvalidSnapshot;
+ InvalidateCatalogSnapshotConditionally();
+
+ /* The goal of the snapshot reset is to allow the horizon to advance. */
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+ /* In some cases that is still not possible, due to xid assignment. */
+ if (!TransactionIdIsValid(MyProc->xid))
+ INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+ sscan->rs_snapshot = RegisterSnapshot(GetLatestSnapshot());
+ PushActiveSnapshot(sscan->rs_snapshot);
+}
+
/*
* heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
*
@@ -607,7 +630,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
if (BufferIsValid(scan->rs_cbuf))
+ {
scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+ if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+ (scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+ heap_reset_scan_snapshot((TableScanDesc) scan);
+ }
}
/*
@@ -1233,6 +1262,14 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_parallelworkerdata != NULL)
pfree(scan->rs_parallelworkerdata);
+ if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+ {
+ Assert(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT);
+ Assert(ActiveSnapshotSet());
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ PopActiveSnapshot();
+ }
+
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c..5a1d0a9d36 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
ExprContext *econtext;
Snapshot snapshot;
bool need_unregister_snapshot = false;
+ bool need_pop_active_snapshot = false;
+ bool reset_snapshots = false;
TransactionId OldestXmin;
BlockNumber previous_blkno = InvalidBlockNumber;
BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
- /* Set up execution state for predicate, if any. */
- predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
@@ -1244,24 +1243,40 @@ heapam_index_build_range_scan(Relation heapRelation,
{
/*
* Serial index build.
- *
- * Must begin our own heap scan in this case. We may also need to
- * register a snapshot whose lifetime is under our direct control.
*/
if (!TransactionIdIsValid(OldestXmin))
{
+ /*
+ * For a unique index we need a consistent snapshot for the whole scan.
+ * For a parallel scan, some additional infrastructure is required to
+ * perform the scan with SO_RESET_SNAPSHOT, and it is not yet ready.
+ */
+ reset_snapshots = indexInfo->ii_Concurrent &&
+ !indexInfo->ii_Unique &&
+ !is_system_catalog; /* just in case */
+ Assert(!ActiveSnapshotSet());
+ /*
+ * Must begin our own heap scan in this case. We may also need to
+ * register a snapshot whose lifetime is under our direct control.
+ */
snapshot = RegisterSnapshot(GetTransactionSnapshot());
- need_unregister_snapshot = true;
+ PushActiveSnapshot(snapshot);
+ /* In the SO_RESET_SNAPSHOT case, snapshots are released by table_endscan. */
+ need_unregister_snapshot = need_pop_active_snapshot = !reset_snapshots;
}
else
+ {
+ Assert(!indexInfo->ii_Concurrent);
snapshot = SnapshotAny;
+ }
scan = table_beginscan_strat(heapRelation, /* relation */
snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- allow_sync); /* syncscan OK? */
+ allow_sync, /* syncscan OK? */
+ reset_snapshots /* reset snapshots? */);
}
else
{
@@ -1275,6 +1290,8 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
}
hscan = (HeapScanDesc) scan;
@@ -1289,6 +1306,13 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
!TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);
+ Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+ /* Clear reference to snapshot since it may be changed by the scan itself. */
+ if (reset_snapshots)
+ snapshot = InvalidSnapshot;
/* Publish number of blocks to scan */
if (progress)
@@ -1724,6 +1748,8 @@ heapam_index_build_range_scan(Relation heapRelation,
table_endscan(scan);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
/* we can now forget our snapshot, if set and registered by us */
if (need_unregister_snapshot)
UnregisterSnapshot(snapshot);
@@ -1796,7 +1822,8 @@ heapam_index_validate_scan(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- false); /* syncscan not OK */
+ false, /* syncscan not OK */
+ false);
hscan = (HeapScanDesc) scan;
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 60c61039d6..777df91972 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -461,7 +461,7 @@ systable_beginscan(Relation heapRelation,
*/
sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
nkeys, key,
- true, false);
+ true, false, false);
sysscan->iscan = NULL;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index fb9a05f7af..e7ccefb133 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1485,8 +1485,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
else
querylen = 0; /* keep compiler quiet */
+ if (IsMVCCSnapshot(snapshot))
+ PushActiveSnapshot(snapshot);
/* Everyone's had a chance to ask for space, so now create the DSM */
InitializeParallelDSM(pcxt);
+ if (IsMVCCSnapshot(snapshot))
+ PopActiveSnapshot();
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f9bb721c5f..3aa500072c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+#include "storage/proc.h"
/* Potentially set by pg_upgrade_support functions */
Oid binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
Relation indexRelation;
IndexInfo *indexInfo;
- /* This had better make sure that a snapshot is active */
- Assert(ActiveSnapshotSet());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ Assert(!TransactionIdIsValid(MyProc->xid));
/* Open and lock the parent heap relation */
heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
indexRelation = index_open(indexRelationId, RowExclusiveLock);
+ /* BuildIndexInfo may require a snapshot for expressions and predicates */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* We have to re-build the IndexInfo struct, since it was lost in the
* commit of the transaction where this concurrent index was created at
* the catalog level.
*/
indexInfo = BuildIndexInfo(indexRelation);
+ /* Done with snapshot */
+ PopActiveSnapshot();
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/* Now build the index */
index_build(heapRel, indexRelation, indexInfo, false, true);
+ /* Invalidate the catalog snapshot just for the assertion below */
+ InvalidateCatalogSnapshot();
+ Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
table_close(heapRel, NoLock);
index_close(indexRelation, NoLock);
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
* insert new entries into the index for insertions and non-HOT updates.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+ /* we can do away with our snapshot */
+ PopActiveSnapshot();
}
/*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK */
+ true, /* syncscan OK */
+ false);
while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
* as of the start of the scan (see table_index_build_scan), whereas a normal
* build takes care to include recently-dead tuples. This is OK because
* we won't mark the index valid until all transactions that might be able
* to see those tuples are gone. One of the reasons for doing that is to avoid
+ * to see those tuples are gone. One of reasons for doing that is to avoid
* bogus unique-index failures due to concurrent UPDATEs (we might see
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a fresh snapshot to be registered every so often. The
+ * reason for that is to let the xmin horizon advance.
+ *
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
* transactions that could have been modifying the table to terminate. Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2f652463e3..df5873e124 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1671,15 +1671,9 @@ DefineIndex(Oid tableId,
* HOT-chain or the extension of the chain is HOT-safe for this index.
*/
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/* Perform concurrent build of index */
index_concurrently_build(tableId, indexRelationId);
- /* we can do away with our snapshot */
- PopActiveSnapshot();
-
/*
* Commit this transaction to make the indisready update visible.
*/
@@ -4076,9 +4070,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
if (newidx->safe)
set_indexsafe_procflags();
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4093,7 +4084,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/* Perform concurrent build of new index */
index_concurrently_build(newidx->tableId, newidx->indexId);
- PopActiveSnapshot();
CommitTransactionCommand();
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1f78dc3d53..6b75c14c69 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
/* GUC parameters */
double cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6890,6 +6891,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
Relation heap;
Relation index;
RelOptInfo *rel;
+ bool need_pop_active_snapshot = false;
int parallel_workers;
BlockNumber heap_blocks;
double reltuples;
@@ -6945,6 +6947,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
heap = table_open(tableOid, NoLock);
index = index_open(indexOid, NoLock);
+ /* Set ActiveSnapshot since functions in the indexes may need it */
+ if (!ActiveSnapshotSet()) {
+ PushActiveSnapshot(GetTransactionSnapshot());
+ need_pop_active_snapshot = true;
+ }
/*
* Determine if it's safe to proceed.
*
@@ -7002,6 +7009,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
parallel_workers--;
done:
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
index_close(index, NoLock);
table_close(heap, NoLock);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93c..dc7c766661 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
+#include "utils/injection_point.h"
#define DEFAULT_TABLE_ACCESS_METHOD "heap"
@@ -69,6 +70,18 @@ typedef enum ScanOptions
* needed. If table data may be needed, set SO_NEED_TUPLES.
*/
SO_NEED_TUPLES = 1 << 10,
+ /*
+ * Reset the scan and catalog snapshots periodically? If so, every
+ * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped and
+ * unregistered, the catalog snapshot is invalidated, and the latest
+ * snapshot is registered and pushed as active.
+ *
+ * At the end of the scan the snapshot is popped and unregistered too.
+ * The goal of this mode is to keep the xmin horizon moving forward.
+ *
+ * See heap_reset_scan_snapshot for details.
+ */
+ SO_RESET_SNAPSHOT = 1 << 11,
} ScanOptions;
/*
@@ -935,7 +948,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
static inline TableScanDesc
table_beginscan_strat(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
- bool allow_strat, bool allow_sync)
+ bool allow_strat, bool allow_sync,
+ bool reset_snapshot)
{
uint32 flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
@@ -943,6 +957,13 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
flags |= SO_ALLOW_STRAT;
if (allow_sync)
flags |= SO_ALLOW_SYNC;
+ if (reset_snapshot)
+ {
+ INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+ Assert(ActiveSnapshotSet());
+ Assert(GetActiveSnapshot() == snapshot);
+ flags |= (SO_RESET_SNAPSHOT | SO_TEMP_SNAPSHOT);
+ }
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1779,6 +1800,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* very hard to detect whether they're really incompatible with the chain tip.
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
+ *
+ * For a non-unique, non-parallel concurrent build, SO_RESET_SNAPSHOT is
+ * applied to the scan. That causes snapshots to be swapped on the fly so
+ * that the xmin horizon can advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58..2225cd0bf8 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
DATA = injection_points--1.0.sql
PGFILEDESC = "injection_points - facility for injection points"
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 0000000000..4cfbbb0592
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,102 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local
+----------------------------
+
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f1900115..44cc028e82 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
'sql': [
'injection_points',
'reindex_conc',
+ 'cic_reset_snapshots',
],
'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
# The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 0000000000..4fef5a4743
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,82 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
--
2.43.0
Hello!
Added support for parallel builds (snapshot resets during the first phase);
the next step is support for unique indexes. A condensed sketch of the
leader-side handshake is below.
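For reference, the leader-side handshake, condensed from _bt_begin_parallel()
in the attached patch (error paths omitted): when snapshots are going to be
reset, the leader must not swap its own active snapshot until every worker
has restored the initial one, so WaitForParallelWorkersToAttach() gains a
flag to also wait for snapshot restoration:

    wait_for_snapshot_attach = reset_snapshot && leaderparticipates;

    if (wait_for_snapshot_attach)
        WaitForParallelWorkersToAttach(pcxt, true); /* wait for snapshots too */

    /* Join heap scan ourselves */
    if (leaderparticipates)
        _bt_leader_participate_as_worker(buildstate);

    /*
     * The caller waits for all launched workers on return; make sure the
     * failure-to-start case cannot hang forever.
     */
    if (!wait_for_snapshot_attach)
        WaitForParallelWorkersToAttach(pcxt, false);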
Best regards,
Mikhail.
Attachments:
v2-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch
From fc79ec8084837e1792441b1dae1594986dba0caa Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v2 4/4] Allow snapshot resets during parallel concurrent index
builds
Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.
Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
proceeding with scan
- Add regression tests to verify behavior with various index types
The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.
This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
src/backend/access/brin/brin.c | 43 +++++++++-------
src/backend/access/heap/heapam_handler.c | 12 +++--
src/backend/access/nbtree/nbtsort.c | 38 ++++++++++++--
src/backend/access/table/tableam.c | 37 ++++++++++++--
src/backend/access/transam/parallel.c | 50 +++++++++++++++++--
src/backend/executor/nodeSeqscan.c | 3 +-
src/backend/utils/time/snapmgr.c | 8 ---
src/include/access/parallel.h | 3 +-
src/include/access/relscan.h | 1 +
src/include/access/tableam.h | 9 ++--
.../expected/cic_reset_snapshots.out | 23 ++++++++-
.../sql/cic_reset_snapshots.sql | 7 ++-
12 files changed, 178 insertions(+), 56 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d69859ac4df..0782bd64a6a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
*/
BrinShared *brinshared;
Sharedsort *sharedsort;
- Snapshot snapshot;
WalUsage *walusage;
BufferUsage *bufferusage;
} BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
bool isconcurrent, int request);
static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
static double _brin_parallel_heapscan(BrinBuildState *state);
static double _brin_parallel_merge(BrinBuildState *state);
static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
ParallelContext *pcxt;
int scantuplesortstates;
- Snapshot snapshot;
Size estbrinshared;
Size estsort;
BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
- * concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+ * concurrent build, we take a regular MVCC snapshot and push it as active.
+ * Later we index whatever's live according to that snapshot, while the
+ * snapshot itself is reset periodically.
*/
if (!isconcurrent)
{
Assert(ActiveSnapshotSet());
- snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
else
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ Assert(!ActiveSnapshotSet());
PushActiveSnapshot(GetTransactionSnapshot());
}
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
*/
- estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+ estbrinshared = _brin_parallel_estimate_shared(heap);
shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
estsort = tuplesort_estimate_shared(scantuplesortstates);
shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
- UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
table_parallelscan_initialize(heap,
ParallelTableScanFromBrinShared(brinshared),
- snapshot);
+ isconcurrent ? InvalidSnapshot : SnapshotAny,
+ isconcurrent);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->nparticipanttuplesorts++;
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
- brinleader->snapshot = snapshot;
brinleader->walusage = walusage;
brinleader->bufferusage = bufferusage;
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* Save leader state now that it's clear build will be parallel */
buildstate->bs_leader = brinleader;
+ /*
+ * In a concurrent build, snapshots are reset periodically. When the
+ * leader is going to reset its own active snapshot as well, we need to
+ * wait until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
- /* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(brinleader->snapshot))
- UnregisterSnapshot(brinleader->snapshot);
DestroyParallelContext(brinleader->pcxt);
ExitParallelMode();
}
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
/*
* Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
*/
static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
{
/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
return add_size(BUFFERALIGN(sizeof(BrinShared)),
- table_parallelscan_estimate(heap, snapshot));
+ table_parallelscan_estimate(heap, InvalidSnapshot));
}
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 980c51e32b9..2e5163609c1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, or during bootstrap, we take a regular MVCC snapshot
- * and index whatever's live according to that.
+ * and index whatever's live according to that, while the snapshot is reset
+ * every so often (for a non-unique index).
*/
OldestXmin = InvalidTransactionId;
/*
* For a unique index we need a consistent snapshot for the whole scan.
- * For a parallel scan, some additional infrastructure is required to
- * perform the scan with SO_RESET_SNAPSHOT, and it is not yet ready.
*/
reset_snapshots = indexInfo->ii_Concurrent &&
!indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
- PushActiveSnapshot(snapshot);
- need_pop_active_snapshot = true;
+ if (!reset_snapshots)
+ {
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
+ }
}
hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5c4581afb1a..2acbf121745 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool reset_snapshot;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
scantuplesortstates = leaderparticipates ? request + 1 : request;
+ /*
+ * For concurrent non-unique index builds, we can periodically reset snapshots
+ * to allow the xmin horizon to advance. This is safe since these builds don't
+ * require a consistent view across the entire scan. Unique indexes still need
+ * a stable snapshot to properly enforce uniqueness constraints.
+ */
+ reset_snapshot = isconcurrent && !btspool->isunique;
+
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+ * live according to that, while the snapshot may be reset periodically
+ * for a non-unique index.
*/
if (!isconcurrent)
{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
+ else if (reset_snapshot)
+ {
+ snapshot = InvalidSnapshot;
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
else
{
snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->brokenhotchain = false;
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
- snapshot);
+ snapshot,
+ reset_snapshot);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* Save leader state now that it's clear build will be parallel */
buildstate->btleader = btleader;
+ /*
+ * In a concurrent build, snapshots are reset periodically. When the
+ * leader is going to reset its own active snapshot as well, we need to
+ * wait until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(btleader->snapshot))
+ if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
UnregisterSnapshot(btleader->snapshot);
DestroyParallelContext(btleader->pcxt);
ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
{
Size sz = 0;
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
sz = add_size(sz, EstimateSnapshotSpace(snapshot));
else
- Assert(snapshot == SnapshotAny);
+ Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
void
table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
- Snapshot snapshot)
+ Snapshot snapshot, bool reset_snapshot)
{
Size snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
pscan->phs_snapshot_off = snapshot_off;
- if (IsMVCCSnapshot(snapshot))
+ /*
+ * Initialize parallel scan description. For normal scans with a regular
+ * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+ * snapshot resets, mark the scan accordingly.
+ */
+ if (reset_snapshot)
+ {
+ Assert(snapshot == InvalidSnapshot);
+ pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = true;
+ INJECTION_POINT("table_parallelscan_initialize");
+ }
+ else if (IsMVCCSnapshot(snapshot))
{
SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = false;
}
else
{
Assert(snapshot == SnapshotAny);
+ Assert(!reset_snapshot);
pscan->phs_snapshot_any = true;
+ pscan->phs_reset_snapshot = false;
}
}
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
- if (!pscan->phs_snapshot_any)
+ /*
+ * For scans that use periodic snapshot resets, mark the scan accordingly
+ * and use the active snapshot as the initial state.
+ */
+ if (pscan->phs_reset_snapshot)
+ {
+ Assert(ActiveSnapshotSet());
+ flags |= SO_RESET_SNAPSHOT;
+ /* Start with current active snapshot. */
+ snapshot = GetActiveSnapshot();
+ }
+ else if (!pscan->phs_snapshot_any)
{
/* Snapshot was serialized -- restore it */
snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
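
The practical effect is visible from another session: the xmin reported by the
building backend no longer stays pinned for the whole heap scan. A rough way
to watch it while a non-unique concurrent build runs (assuming the patch is
applied; the columns are standard pg_stat_activity ones):

    SELECT pid, backend_xmin, state
    FROM pg_stat_activity
    WHERE query LIKE 'CREATE INDEX CONCURRENTLY%';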
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
#define PARALLEL_KEY_RELMAPPER_STATE UINT64CONST(0xFFFFFFFFFFFF000D)
#define PARALLEL_KEY_UNCOMMITTEDENUMS UINT64CONST(0xFFFFFFFFFFFF000E)
#define PARALLEL_KEY_CLIENTCONNINFO UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED UINT64CONST(0xFFFFFFFFFFFF0010)
/* Fixed-size parallel state. */
typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+ pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate how much we'll need for the entrypoint info. */
shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
char *entrypointstate;
char *uncommittedenumsspace;
char *clientconninfospace;
+ bool *snapshot_set_flag_space;
Size lnamelen;
/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
strcpy(entrypointstate, pcxt->library_name);
strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+ /*
+ * Establish dynamic shared memory to report whether each worker has
+ * imported its snapshot.
+ */
+ snapshot_set_flag_space =
+ shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+ *pcxt->worker[i].snapshot_restored = false;
+ }
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
}
/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
}
}
+
+ /* Set snapshot restored flag to false. */
+ if (pcxt->nworkers > 0)
+ {
+ bool *snapshot_restored_space;
+ int i;
+ snapshot_restored_space =
+ shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ for (i = 0; i < pcxt->nworkers; ++i)
+ snapshot_restored_space[i] = false;
+ }
}
/*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* Wait for all workers to attach to their error queues, and throw an error if
* any worker fails to do this.
*
+ * If wait_for_snapshot is true, additionally wait until each parallel worker
+ * has restored its snapshot. This is needed with periodic snapshot resets to
+ * ensure all workers have a valid initial snapshot before the scan proceeds.
+ *
* Callers can assume that if this function returns successfully, then the
* number of workers given by pcxt->nworkers_launched have initialized and
* attached to their error queues. Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* call this function at all.
*/
void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
{
int i;
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
if (shm_mq_get_sender(mq) != NULL)
{
- /* Yes, so it is known to be attached. */
- pcxt->known_attached_workers[i] = true;
- ++pcxt->nknown_attached_workers;
+ if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+ {
+ /* Yes, so it is known to be attached. */
+ pcxt->known_attached_workers[i] = true;
+ ++pcxt->nknown_attached_workers;
+ }
}
}
else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
shm_toc *toc;
FixedParallelState *fps;
char *error_queue_space;
+ bool *snapshot_restored_space;
shm_mq *mq;
shm_mq_handle *mqh;
char *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
fps->parallel_leader_pgproc);
PushActiveSnapshot(asnapshot);
+ /* Snapshot is restored; set the flag to let the leader know. */
+ snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ snapshot_restored_space[ParallelWorkerNumber] = true;
+
/*
* We've changed which tuples we can see, and must therefore invalidate
* system caches.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
table_parallelscan_initialize(node->ss.ss_currentRelation,
pscan,
- estate->es_snapshot);
+ estate->es_snapshot,
+ false);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3a7357a050d..148e1982cad 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -291,14 +291,6 @@ GetTransactionSnapshot(void)
Snapshot
GetLatestSnapshot(void)
{
- /*
- * We might be able to relax this, but nothing that could otherwise work
- * needs it.
- */
- if (IsInParallelMode())
- elog(ERROR,
- "cannot update SecondarySnapshot during a parallel operation");
-
/*
* So far there are no cases requiring support for GetLatestSnapshot()
* during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
{
BackgroundWorkerHandle *bgwhandle;
shm_mq_handle *error_mqh;
+ bool *snapshot_restored;
} ParallelWorkerInfo;
typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
extern void DestroyParallelContext(ParallelContext *pcxt);
extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e1884acf493..a9603084aeb 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -88,6 +88,7 @@ typedef struct ParallelTableScanDescData
RelFileLocator phs_locator; /* physical relation to scan */
bool phs_syncscan; /* report location to syncscan logic? */
bool phs_snapshot_any; /* SnapshotAny, not phs_snapshot_data? */
+ bool phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
Size phs_snapshot_off; /* data for snapshot */
} ParallelTableScanDescData;
typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f4c7d2a92bf..9ee5ea15fd4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1184,7 +1184,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
*/
extern void table_parallelscan_initialize(Relation rel,
ParallelTableScanDesc pscan,
- Snapshot snapshot);
+ Snapshot snapshot,
+ bool reset_snapshot);
/*
* Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1802,9 +1803,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
*
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan. This swaps snapshots on the fly, allowing the xmin horizon to
+ * advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 4cfbbb05923..49ef68d9071 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
(1 row)
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,27 +78,40 @@ NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep the test stable, since a parallel worker may complete the scan before the leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach
+-------------------------
+
+(1 row)
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
NOTICE: drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 4fef5a47431..5d1c31493f0 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
SELECT injection_points_set_local();
SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep the test stable, since a parallel worker may complete the scan before the leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -79,4 +82,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
--
2.43.0
Attachment: v2-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch (text/plain, US-ASCII)
From 9432da61d7640457a67cc5ac8ecd0b1c6be132e1 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v2 1/4] this is https://commitfest.postgresql.org/50/5160/
merged into a single commit. It is required for the stability of the stress tests.
---
src/backend/commands/indexcmds.c | 4 +-
src/backend/executor/execIndexing.c | 3 +
src/backend/executor/execPartition.c | 119 ++++++++-
src/backend/executor/nodeModifyTable.c | 2 +
src/backend/optimizer/util/plancat.c | 135 +++++++---
src/backend/utils/time/snapmgr.c | 2 +
src/test/modules/injection_points/Makefile | 7 +-
.../expected/index_concurrently_upsert.out | 80 ++++++
.../index_concurrently_upsert_predicate.out | 80 ++++++
.../expected/reindex_concurrently_upsert.out | 238 ++++++++++++++++++
...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 11 +
.../specs/index_concurrently_upsert.spec | 68 +++++
.../index_concurrently_upsert_predicate.spec | 70 ++++++
.../specs/reindex_concurrently_upsert.spec | 86 +++++++
...dex_concurrently_upsert_on_constraint.spec | 86 +++++++
...index_concurrently_upsert_partitioned.spec | 88 +++++++
18 files changed, 1505 insertions(+), 50 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
* before the reference snap was taken, we have to wait out any
* transactions that might have older snapshots.
*/
+ INJECTION_POINT("define_index_before_set_valid");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_3);
WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* the same time to make sure we only get constraint violations from the
* indexes with the correct names.
*/
-
+ INJECTION_POINT("reindex_relation_concurrently_before_swap");
StartTransactionCommand();
/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* index_drop() for more details.
*/
+ INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_4);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
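
These injection points pause DefineIndex just before the index is marked valid
and ReindexRelationConcurrently just before the swap and set-dead phases; the
isolation specs below then drive concurrent upserts through exactly those
windows. Stripped of the injection-point choreography, the racing workload is
simply (schema and names as used in the specs):

    -- session 3
    REINDEX INDEX CONCURRENTLY test.tbl_pkey;
    -- sessions 1 and 2, concurrently
    INSERT INTO test.tbl VALUES (13, now())
        ON CONFLICT (i) DO UPDATE SET updated_at = now();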
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -936,6 +937,8 @@ retry:
econtext->ecxt_scantuple = save_scantuple;
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
return !conflict;
}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
return rri;
}
+/*
+ * IsIndexCompatibleAsArbiter
+ * Checks if the indexes are identical in terms of being used
+ * as arbiters for the INSERT ON CONFLICT operation by comparing
+ * them to the provided arbiter index.
+ *
+ * Returns the true if indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation arbiterIndexRelation,
+ IndexInfo *arbiterIndexInfo,
+ Relation indexRelation,
+ IndexInfo *indexInfo)
+{
+ int i;
+
+ if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+ return false;
+ /* Exclusion constraints are not supported as arbiters here. */
+ if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+ return false;
+ if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+ return false;
+
+ for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+ {
+ int arbiterAttNo = arbiterIndexRelation->rd_index->indkey.values[i];
+ int attNo = indexRelation->rd_index->indkey.values[i];
+ if (arbiterAttNo != attNo)
+ return false;
+ }
+
+ if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+ RelationGetIndexExpressions(indexRelation)) != NIL)
+ return false;
+
+ if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+ RelationGetIndexPredicate(indexRelation)) != NIL)
+ return false;
+ return true;
+}
+
/*
* ExecInitPartitionInfo
* Lock the partition and initialize ResultRelInfo. Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
{
List *childIdxs;
+ List *nonAncestorIdxs = NIL;
+ int i, j, additional_arbiters = 0;
childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
ListCell *lc2;
ancestors = get_partition_ancestors(childIdx);
- foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ if (ancestors)
{
- if (list_member_oid(ancestors, lfirst_oid(lc2)))
- arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ {
+ if (list_member_oid(ancestors, lfirst_oid(lc2)))
+ arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ }
}
+ else /* No ancestor was found for that index. Save it for rechecking later. */
+ nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
list_free(ancestors);
}
+
+ /*
+ * If any non-ancestor indexes are found, we need to compare them with other
+ * indexes of the relation that will be used as arbiters. This is necessary
+ * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+ * must be considered as arbiters to ensure that all concurrent transactions
+ * use the same set of arbiters.
+ */
+ if (nonAncestorIdxs)
+ {
+ for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+ {
+ if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+ {
+ Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+ IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+ Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+ /* It is too early to use non-ready indexes as arbiters */
+ if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+ continue;
+
+ for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+ {
+ if (list_member_oid(arbiterIndexes,
+ leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+ {
+ Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+ IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+ /* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+ if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+ nonAncestorIndexRelation, nonAncestorIndexInfo))
+ {
+ arbiterIndexes = lappend_oid(arbiterIndexes,
+ nonAncestorIndexRelation->rd_index->indexrelid);
+ additional_arbiters++;
+ }
+ }
+ }
+ }
+ }
+ }
+ list_free(nonAncestorIdxs);
+
+ /*
+ * If the resulting lists are of inequal length, something is wrong.
+ * (This shouldn't happen, since arbiter index selection should not
+ * pick up a non-ready index.)
+ *
+ * But we need to account for the additional arbiter indexes as well.
+ */
+ if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+ list_length(arbiterIndexes) - additional_arbiters)
+ elog(ERROR, "invalid arbiter index list");
}
-
- /*
- * If the resulting lists are of inequal length, something is wrong.
- * (This shouldn't happen, since arbiter index selection should not
- * pick up an invalid index.)
- */
- if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
- list_length(arbiterIndexes))
- elog(ERROR, "invalid arbiter index list");
leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
/*
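
For partitioned tables the extra arbiter matters because, while REINDEX
CONCURRENTLY rebuilds a leaf partition's index, the leaf briefly carries two
equivalent indexes and only one of them descends from the parent's arbiter
index. A sketch of the shape the reindex_concurrently_upsert_partitioned spec
exercises (the spec's actual setup may differ in details):

    CREATE SCHEMA test;
    CREATE TABLE test.tbl (i int PRIMARY KEY, updated_at timestamptz)
        PARTITION BY RANGE (i);
    CREATE TABLE test.tbl_partition PARTITION OF test.tbl
        FOR VALUES FROM (0) TO (10000);
    -- while this runs ...
    REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey;
    -- ... concurrent sessions keep upserting through the parent
    INSERT INTO test.tbl VALUES (13, now())
        ON CONFLICT (i) DO UPDATE SET updated_at = now();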
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1161520f76b..23cf4c6b540 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
#include "utils/datum.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
return NULL;
}
}
+ INJECTION_POINT("exec_insert_before_insert_speculative");
/*
* Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 37b0ca2e439..5ffef4595e2 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -713,12 +713,14 @@ infer_arbiter_indexes(PlannerInfo *root)
List *indexList;
ListCell *l;
- /* Normalized inference attributes and inference expressions: */
- Bitmapset *inferAttrs = NULL;
- List *inferElems = NIL;
+ /* Normalized required attributes and expressions: */
+ Bitmapset *requiredArbiterAttrs = NULL;
+ List *requiredArbiterElems = NIL;
+ List *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
/* Results */
List *results = NIL;
+ bool foundValid = false;
/*
* Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -753,8 +755,8 @@ infer_arbiter_indexes(PlannerInfo *root)
if (!IsA(elem->expr, Var))
{
- /* If not a plain Var, just shove it in inferElems for now */
- inferElems = lappend(inferElems, elem->expr);
+ /* If not a plain Var, just shove it in requiredArbiterElems for now */
+ requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
continue;
}
@@ -766,30 +768,76 @@ infer_arbiter_indexes(PlannerInfo *root)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("whole row unique index inference specifications are not supported")));
- inferAttrs = bms_add_member(inferAttrs,
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
attno - FirstLowInvalidHeapAttributeNumber);
}
+ indexList = RelationGetIndexList(relation);
+
/*
* Lookup named constraint's index. This is not immediately returned
- * because some additional sanity checks are required.
+ * because some additional sanity checks are required. Additionally, we
+ * need to process other indexes as potential arbiters to account for
+ * cases where REINDEX CONCURRENTLY is processing an index used as a
+ * named constraint.
*/
if (onconflict->constraint != InvalidOid)
{
indexOidFromConstraint = get_constraint_index(onconflict->constraint);
if (indexOidFromConstraint == InvalidOid)
+ {
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("constraint in ON CONFLICT clause has no associated index")));
+ errmsg("constraint in ON CONFLICT clause has no associated index")));
+ }
+
+ /*
+ * Find the named constraint index to extract its attributes and predicates.
+ * We open all indexes in the loop to avoid deadlocks from changed lock order.
+ */
+ foreach(l, indexList)
+ {
+ Oid indexoid = lfirst_oid(l);
+ Relation idxRel;
+ Form_pg_index idxForm;
+ AttrNumber natt;
+
+ idxRel = index_open(indexoid, rte->rellockmode);
+ idxForm = idxRel->rd_index;
+
+ if (idxForm->indisready)
+ {
+ if (indexOidFromConstraint == idxForm->indexrelid)
+ {
+ /*
+ * Prepare the requirements other indexes must satisfy to be used as
+ * arbiters together with indexOidFromConstraint. Both equivalent indexes
+ * must be involved in the REINDEX CONCURRENTLY case.
+ */
+ for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+ {
+ int attno = idxRel->rd_index->indkey.values[natt];
+
+ if (attno != 0)
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+ attno - FirstLowInvalidHeapAttributeNumber);
+ }
+ requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+ requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+ /* We are done, so quit the loop. */
+ index_close(idxRel, NoLock);
+ break;
+ }
+ }
+ index_close(idxRel, NoLock);
+ }
}
/*
* Using that representation, iterate through the list of indexes on the
* target relation to try and find a match
*/
- indexList = RelationGetIndexList(relation);
-
foreach(l, indexList)
{
Oid indexoid = lfirst_oid(l);
@@ -812,7 +860,13 @@ infer_arbiter_indexes(PlannerInfo *root)
idxRel = index_open(indexoid, rte->rellockmode);
idxForm = idxRel->rd_index;
- if (!idxForm->indisvalid)
+ /*
+ * We need to consider both indisvalid and indisready indexes, because
+ * the latter may become indisvalid before the execution phase. This
+ * keeps the set of indexes used as arbiters the same for all
+ * concurrent transactions.
+ */
+ if (!idxForm->indisready)
goto next;
/*
@@ -832,27 +886,23 @@ infer_arbiter_indexes(PlannerInfo *root)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
- results = lappend_oid(results, idxForm->indexrelid);
- list_free(indexList);
- index_close(idxRel, NoLock);
- table_close(relation, NoLock);
- return results;
+ goto found;
}
else if (indexOidFromConstraint != InvalidOid)
{
- /* No point in further work for index in named constraint case */
- goto next;
+ /* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+ if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+ goto next;
+ }
+ else
+ {
+ /*
+ * Only considering conventional inference at this point (not named
+ * constraints), so index under consideration can be immediately
+ * skipped if it's not unique
+ */
+ if (!idxForm->indisunique)
+ goto next;
}
- /*
- * Only considering conventional inference at this point (not named
- * constraints), so index under consideration can be immediately
- * skipped if it's not unique
- */
- if (!idxForm->indisunique)
- goto next;
-
/*
* So-called unique constraints with WITHOUT OVERLAPS are really
* exclusion constraints, so skip those too.
@@ -872,7 +922,7 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/* Non-expression attributes (if any) must match */
- if (!bms_equal(indexedAttrs, inferAttrs))
+ if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
goto next;
/* Expression attributes (if any) must match */
@@ -880,6 +930,10 @@ infer_arbiter_indexes(PlannerInfo *root)
if (idxExprs && varno != 1)
ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
+ /*
+ * If arbiterElems are present, check them. If a named constraint is
+ * present, arbiterElems == NIL.
+ */
foreach(el, onconflict->arbiterElems)
{
InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -917,27 +971,35 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/*
- * Now that all inference elements were matched, ensure that the
+ * In the case of conventional inference, ensure that the
+ * expression elements from the inference clause are not missing any
* cataloged expressions. This does the right thing when unique
* indexes redundantly repeat the same attribute, or if attributes
* redundantly appear multiple times within an inference clause.
+ *
+ * In the case of a named constraint, ensure the candidate has the same
+ * set of expressions as the named constraint index.
*/
- if (list_difference(idxExprs, inferElems) != NIL)
+ if (list_difference(idxExprs, requiredArbiterElems) != NIL)
goto next;
- /*
- * If it's a partial index, its predicate must be implied by the ON
- * CONFLICT's WHERE clause.
- */
predExprs = RelationGetIndexPredicate(idxRel);
if (predExprs && varno != 1)
ChangeVarNodes((Node *) predExprs, 1, varno, 0);
- if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+ /*
+ * Under conventional inference, a partial index's predicate must be
+ * implied by the ON CONFLICT's WHERE clause.
+ */
+ if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+ goto next;
+ /* For a partial index under a named constraint, the predicates must be equal. */
+ if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
goto next;
+found:
results = lappend_oid(results, idxForm->indexrelid);
+ foundValid |= idxForm->indisvalid;
next:
index_close(idxRel, NoLock);
}
@@ -945,7 +1007,8 @@ next:
list_free(indexList);
table_close(relation, NoLock);
- if (results == NIL)
+ /* At least one indisvalid index is required during planning. */
+ if (results == NIL || !foundValid)
ereport(ERROR,
(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f20..3a7357a050d 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
#include "utils/resowner.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/*
@@ -426,6 +427,7 @@ InvalidateCatalogSnapshot(void)
pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
CatalogSnapshot = NULL;
SnapshotResetXmin();
+ INJECTION_POINT("invalidate_catalog_snapshot_end");
}
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+ reindex_concurrently_upsert \
+ index_concurrently_upsert \
+ reindex_concurrently_upsert_partitioned \
+ reindex_concurrently_upsert_on_constraint \
+ index_concurrently_upsert_predicate
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'reindex_concurrently_upsert',
+ 'index_concurrently_upsert',
+ 'reindex_concurrently_upsert_partitioned',
+ 'reindex_concurrently_upsert_on_constraint',
+ 'index_concurrently_upsert_predicate',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
+  # We wait for all snapshots, so avoid parallel test execution
+ 'runningcheck-parallel': false,
},
'tap': {
'env': {
@@ -53,5 +62,7 @@ tests += {
'tests': [
't/001_stats.pl',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
},
}
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+ CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+ CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+ FOR VALUES FROM (0) TO (10000)
+ WITH (parallel_workers = 0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
--
2.43.0
Attachment: v2-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch (text/plain)
From c8e63c35e9ac09b71d53ddc4e5d4dd2b1ec31cb6 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v2 3/4] Allow advancing xmin during non-unique, non-parallel
concurrent index builds by periodically resetting snapshots
Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.
This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.
Currently, this technique is applied to:
- Non-parallel index builds: parallel index builds are not yet supported and will be addressed in a future commit.
- Non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness, so support for them may be added in the future.
- Only the first scan of the heap: the second scan, performed during index validation, still uses a single snapshot to ensure index correctness.
To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.
Regression tests are added to verify the behavior.
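In short, the scan drops and re-takes its snapshot between pages. A condensed
sketch of the mechanism, simplified from the heapam.c changes below (assertions,
error paths and injection points omitted):

    static inline void
    heap_reset_scan_snapshot(TableScanDesc sscan)
    {
        /* Drop the scan's active snapshot so it no longer pins our xmin. */
        PopActiveSnapshot();
        InvalidateCatalogSnapshot();

        /*
         * At this point the backend holds no snapshot, MyProc->xmin is
         * invalid, and the global horizon may advance past it.
         */

        /* Continue the scan under a fresh snapshot. */
        PushActiveSnapshot(GetLatestSnapshot());
        sscan->rs_snapshot = GetActiveSnapshot();
    }

    /* In the block-fetch path, every SO_RESET_SNAPSHOT_EACH_N_PAGE (64) pages: */
    if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
        scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0)
        heap_reset_scan_snapshot((TableScanDesc) scan);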
---
contrib/amcheck/verify_nbtree.c | 3 +-
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/brin/brin.c | 14 +++
src/backend/access/heap/heapam.c | 46 ++++++++
src/backend/access/heap/heapam_handler.c | 57 ++++++++--
src/backend/access/index/genam.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 14 +++
src/backend/catalog/index.c | 30 +++++-
src/backend/commands/indexcmds.c | 14 +--
src/backend/optimizer/plan/planner.c | 9 ++
src/include/access/tableam.h | 28 ++++-
src/test/modules/injection_points/Makefile | 2 +-
.../expected/cic_reset_snapshots.out | 102 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../sql/cic_reset_snapshots.sql | 82 ++++++++++++++
15 files changed, 375 insertions(+), 31 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK? */
+ true, /* syncscan OK? */
+ false); /* reset snapshot? */
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
errmsg("only heap AM is supported")));
/* Disable syncscan because we assume we scan from block zero upwards */
- scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+ scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
hscan = (HeapScanDesc) scan;
InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3aedec882cd..d69859ac4df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_brin_end_parallel(brinleader, NULL);
return;
}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..1fdfdf96482 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
#include "utils/datum.h"
#include "utils/inval.h"
#include "utils/spccache.h"
+#include "utils/injection_point.h"
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure active snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ PopActiveSnapshot();
+
+ sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+ Assert(!HaveRegisteredOrActiveSnapshot());
+ InvalidateCatalogSnapshot();
+
+ /* The goal of the snapshot reset is to allow the horizon to advance. */
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+ /* In some cases it is still not possible due to xid assignment. */
+ if (!TransactionIdIsValid(MyProc->xid))
+ INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+ PushActiveSnapshot(GetLatestSnapshot());
+ sscan->rs_snapshot = GetActiveSnapshot();
+}
+
/*
* heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
*
@@ -607,7 +638,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
if (BufferIsValid(scan->rs_cbuf))
+ {
scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+ if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+ (scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+ heap_reset_scan_snapshot((TableScanDesc) scan);
+ }
}
/*
@@ -1233,6 +1270,15 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_parallelworkerdata != NULL)
pfree(scan->rs_parallelworkerdata);
+ if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+ {
+ Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ }
+
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..980c51e32b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
ExprContext *econtext;
Snapshot snapshot;
bool need_unregister_snapshot = false;
+ bool need_pop_active_snapshot = false;
+ bool reset_snapshots = false;
TransactionId OldestXmin;
BlockNumber previous_blkno = InvalidBlockNumber;
BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
- /* Set up execution state for predicate, if any. */
- predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
*/
OldestXmin = InvalidTransactionId;
+ /*
+ * For a unique index we need a consistent snapshot for the whole scan.
+ * In case of a parallel scan, some additional infrastructure would be
+ * required to perform the scan with SO_RESET_SNAPSHOT; that is not yet ready.
+ */
+ reset_snapshots = indexInfo->ii_Concurrent &&
+ !indexInfo->ii_Unique &&
+ !is_system_catalog; /* just in case */
+
/* okay to ignore lazy VACUUMs here */
if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
{
/*
* Serial index build.
- *
- * Must begin our own heap scan in this case. We may also need to
- * register a snapshot whose lifetime is under our direct control.
*/
if (!TransactionIdIsValid(OldestXmin))
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- need_unregister_snapshot = true;
+ snapshot = GetTransactionSnapshot();
+ /*
+ * Must begin our own heap scan in this case. We may also need to
+ * When snapshots are reset during the scan, registration is not
+ * allowed, because the snapshot is going to be replaced every so
+ * often.
+ * often.
+ */
+ if (!reset_snapshots)
+ {
+ snapshot = RegisterSnapshot(snapshot);
+ need_unregister_snapshot = true;
+ }
+ Assert(!ActiveSnapshotSet());
+ PushActiveSnapshot(snapshot);
+ /* Store a link to the snapshot, because it may be copied. */
+ snapshot = GetActiveSnapshot();
+ need_pop_active_snapshot = true;
}
else
+ {
+ Assert(!indexInfo->ii_Concurrent);
snapshot = SnapshotAny;
+ }
scan = table_beginscan_strat(heapRelation, /* relation */
snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- allow_sync); /* syncscan OK? */
+ allow_sync, /* syncscan OK? */
+ reset_snapshots /* reset snapshots? */);
}
else
{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
}
hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
!TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);
+ Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+ /* Clear reference to snapshot since it may be changed by the scan itself. */
+ if (reset_snapshots)
+ snapshot = InvalidSnapshot;
/* Publish number of blocks to scan */
if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
table_endscan(scan);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
/* we can now forget our snapshot, if set and registered by us */
if (need_unregister_snapshot)
UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- false); /* syncscan not OK */
+ false, /* syncscan not OK */
+ false); /* reset snapshot? */
hscan = (HeapScanDesc) scan;
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 60c61039d66..777df91972e 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -461,7 +461,7 @@ systable_beginscan(Relation heapRelation,
*/
sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
nkeys, key,
- true, false);
+ true, false, false);
sysscan->iscan = NULL;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 17a352d040c..5c4581afb1a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ }
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_bt_end_parallel(btleader);
return;
}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1c3a9e06d37..f581a743aae 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+#include "storage/proc.h"
/* Potentially set by pg_upgrade_support functions */
Oid binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
Relation indexRelation;
IndexInfo *indexInfo;
- /* This had better make sure that a snapshot is active */
- Assert(ActiveSnapshotSet());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ Assert(!TransactionIdIsValid(MyProc->xid));
/* Open and lock the parent heap relation */
heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
indexRelation = index_open(indexRelationId, RowExclusiveLock);
+ /* BuildIndexInfo may require a snapshot for expressions and predicates */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* We have to re-build the IndexInfo struct, since it was lost in the
* commit of the transaction where this concurrent index was created at
* the catalog level.
*/
indexInfo = BuildIndexInfo(indexRelation);
+ /* Done with snapshot */
+ PopActiveSnapshot();
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/* Now build the index */
index_build(heapRel, indexRelation, indexInfo, false, true);
+ /* Invalidate the catalog snapshot just for the assert below */
+ InvalidateCatalogSnapshot();
+ Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
table_close(heapRel, NoLock);
index_close(indexRelation, NoLock);
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
* insert new entries into the index for insertions and non-HOT updates.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+ /* we can do away with our snapshot */
+ PopActiveSnapshot();
}
/*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK */
+ true, /* syncscan OK */
+ false); /* reset snapshot? */
while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
* as of the start of the scan (see table_index_build_scan), whereas a normal
* build takes care to include recently-dead tuples. This is OK because
* we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone. The reason for doing that is to avoid
+ * to see those tuples are gone. One of the reasons for doing that is to avoid
* bogus unique-index failures due to concurrent UPDATEs (we might see
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
+ * Furthermore, in case of a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to allow the xmin horizon to advance.
+ *
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
* transactions that could have been modifying the table to terminate. Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
- * We now take a new snapshot, and build the index using all tuples that
- * are visible in this snapshot. We can be sure that any HOT updates to
+ * We build the index using all tuples that are visible under a single
+ * snapshot or under periodically refreshed snapshots. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
* rolled back. Thus, each visible tuple is either the end of its
* HOT-chain or the extension of the chain is HOT-safe for this index.
*/
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/* Perform concurrent build of index */
index_concurrently_build(tableId, indexRelationId);
- /* we can do away with our snapshot */
- PopActiveSnapshot();
-
/*
* Commit this transaction to make the indisready update visible.
*/
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
if (newidx->safe)
set_indexsafe_procflags();
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/* Perform concurrent build of new index */
index_concurrently_build(newidx->tableId, newidx->indexId);
- PopActiveSnapshot();
CommitTransactionCommand();
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762ec..d9de16af81d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
/* GUC parameters */
double cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6942,6 +6943,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
Relation heap;
Relation index;
RelOptInfo *rel;
+ bool need_pop_active_snapshot = false;
int parallel_workers;
BlockNumber heap_blocks;
double reltuples;
@@ -6997,6 +6999,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
heap = table_open(tableOid, NoLock);
index = index_open(indexOid, NoLock);
+ /* Set ActiveSnapshot since functions in the indexes may need it */
+ if (!ActiveSnapshotSet())
+ {
+ PushActiveSnapshot(GetTransactionSnapshot());
+ need_pop_active_snapshot = true;
+ }
/*
* Determine if it's safe to proceed.
*
@@ -7054,6 +7061,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
parallel_workers--;
done:
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
index_close(index, NoLock);
table_close(heap, NoLock);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..f4c7d2a92bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
+#include "utils/injection_point.h"
#define DEFAULT_TABLE_ACCESS_METHOD "heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
* needed. If table data may be needed, set SO_NEED_TUPLES.
*/
SO_NEED_TUPLES = 1 << 10,
+ /*
+ * Reset the scan and catalog snapshots every so often? If so, every
+ * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped,
+ * the catalog snapshot is invalidated, and the latest snapshot is
+ * pushed as active.
+ *
+ * At the end of the scan the snapshot is not popped.
+ * The goal of this mode is to allow the xmin horizon to advance.
+ *
+ * See heap_reset_scan_snapshot for details.
+ */
+ SO_RESET_SNAPSHOT = 1 << 11,
} ScanOptions;
/*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
static inline TableScanDesc
table_beginscan_strat(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
- bool allow_strat, bool allow_sync)
+ bool allow_strat, bool allow_sync,
+ bool reset_snapshot)
{
uint32 flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
flags |= SO_ALLOW_STRAT;
if (allow_sync)
flags |= SO_ALLOW_SYNC;
+ if (reset_snapshot)
+ {
+ INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+ /* Active snapshot is required on start. */
+ Assert(GetActiveSnapshot() == snapshot);
+ /* The active snapshot must not be registered, so the xmin horizon can advance. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ flags |= (SO_RESET_SNAPSHOT);
+ }
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1779,6 +1801,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* very hard to detect whether they're really incompatible with the chain tip.
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
+ *
+ * In case of a non-unique, non-parallel concurrent build, SO_RESET_SNAPSHOT
+ * is applied to the scan. That causes snapshots to be replaced on the fly to
+ * allow the xmin horizon to advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
DATA = injection_points--1.0.sql
PGFILEDESC = "injection_points - facility for injection points"
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..4cfbbb05923
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,102 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local
+----------------------------
+
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
'sql': [
'injection_points',
'reindex_conc',
+ 'cic_reset_snapshots',
],
'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
# The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..4fef5a47431
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,82 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
--
2.43.0
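For reference, the visible API change for callers is the extra reset_snapshot
flag on table_beginscan_strat(). A condensed sketch of the index-build call
site, taken from the heapam_handler.c hunk above (snapshot-unregistration
bookkeeping and progress reporting omitted):

    /* Resetting is only safe for concurrent, non-unique, non-catalog builds. */
    reset_snapshots = indexInfo->ii_Concurrent &&
                      !indexInfo->ii_Unique &&
                      !is_system_catalog;

    snapshot = GetTransactionSnapshot();
    if (!reset_snapshots)
        snapshot = RegisterSnapshot(snapshot);  /* pins our xmin for the scan */
    PushActiveSnapshot(snapshot);
    snapshot = GetActiveSnapshot();             /* the push may have copied it */

    scan = table_beginscan_strat(heapRelation, snapshot,
                                 0, NULL,          /* no scan keys */
                                 true,             /* buffer access strategy OK */
                                 allow_sync,       /* syncscan OK? */
                                 reset_snapshots); /* reset snapshots? */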
Attachment: v2-0002-Add-stress-tests-for-concurrent-index-operations.patch (text/plain)
From 53cfcf3dc0effd2b1a41195d01207f46bac6df86 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v2 2/4] Add stress tests for concurrent index operations
Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck bt_index_parent_check
* Exercising parallel worker configurations
The tests perform intensive concurrent modifications via pgbench while
executing index operations to stress test index build infrastructure.
Test cases cover:
- Regular and unique indexes
- Indexes with stable and immutable predicates
- Multi-column indexes with various combinations
- Different parallel worker configurations
Two new test files added:
- t/006_concurrently.pl: General concurrent index operation tests
- t/007_concurrently_unique.pl: Focused testing of unique indexes
These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
src/bin/pg_amcheck/meson.build | 2 +
src/bin/pg_amcheck/t/006_concurrently.pl | 315 ++++++++++++++++++
.../pg_amcheck/t/007_concurrently_unique.pl | 239 +++++++++++++
3 files changed, 556 insertions(+)
create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl
create mode 100644 src/bin/pg_amcheck/t/007_concurrently_unique.pl
diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..b4e14a15ef3 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,8 @@ tests += {
't/003_check.pl',
't/004_verify_heapam.pl',
't/005_opclass_damage.pl',
+ 't/006_concurrently.pl',
+ 't/007_concurrently_unique.pl',
],
},
}
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 00000000000..c0f9e9557bf
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,315 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+
+use threads;
+use Test::More;
+use Test::Builder;
+
+
+eval {
+ require IPC::SysV;
+ IPC::SysV->import(qw(IPC_CREAT IPC_EXCL S_IRUSR S_IWUSR));
+};
+
+if ($@ || $windows_os)
+{
+ plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key, $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = int(rand(1000000));	# shmget() expects an integer key
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+ c1 money default 0, c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+ # fork returned undef, so unsuccessful
+ die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+ $node->pgbench(
+ '--no-vacuum --client=10 --transactions=1000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent INSERTs, UPDATES and RC',
+ {
+ '001_pgbench_concurrent_transaction_inserts' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ ),
+ '002_pgbench_concurrent_transaction_inserts' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ ),
+ # Ensure some HOT updates happen
+ '003_pgbench_concurrent_transaction_updates' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ )
+ });
+
+ if ($child->is_passing()) {
+ shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+ } else {
+ shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+ }
+
+ my $pg_bench_fork_flag;
+ while (1) {
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ usleep(100_000); # Time::HiRes::usleep; the builtin sleep() would truncate 0.1 to 0
+ last if $pg_bench_fork_flag eq "stop";
+ }
+} else {
+ my $pg_bench_fork_flag;
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+ subtest 'reindex run subtest' => sub {
+ is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+ my %psql = (stdin => '', stdout => '', stderr => '');
+ $psql{run} = IPC::Run::start(
+ [ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+ '<',
+ \$psql{stdin},
+ '>',
+ \$psql{stdout},
+ '2>',
+ \$psql{stderr},
+ $psql_timeout);
+
+ my ($result, $stdout, $stderr, $n, $stderr_saved);
+ $n = 0;
+
+ $node->psql('postgres', q(CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN true;
+ END; $$;));
+
+ $node->psql('postgres', q(CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ RETURN MOD($1, 2) = 0;
+ END; $$;));
+ while (1)
+ {
+
+ if (int(rand(2)) == 0) {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=1);));
+ } else {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+ }
+ is($result, '0', 'ALTER TABLE is correct');
+
+ if (1)
+ {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+ is($result, '0', 'REINDEX is correct');
+
+ if ($result) {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true, rootdescend => true, checkunique => true);));
+ is($result, '0', 'bt_index_check is correct');
+ if ($result)
+ {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ } else {
+ diag('#reindex:)' . $n++);
+ }
+ }
+
+ if (1)
+ {
+ my $variant = int(rand(7));
+ my $sql;
+ if ($variant == 0) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at););
+ } elsif ($variant == 1) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable(););
+ } elsif ($variant == 2) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;);
+ } elsif ($variant == 3) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i););
+ } elsif ($variant == 4) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i)););
+ } elsif ($variant == 5) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i););
+ } elsif ($variant == 6) {
+ $sql = q(CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+ } else { diag("#wrong variant"); }
+
+ diag('#' . $sql);
+ ($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+ is($result, '0', 'CREATE INDEX is correct');
+ $stderr_saved = $stderr;
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+ is($result, '0', 'bt_index_check for new index is correct');
+ if ($result)
+ {
+ diag($stderr);
+ diag($stderr_saved);
+ BAIL_OUT($stderr);
+ } else {
+ diag('#create:)' . $n++);
+ }
+
+ if (1)
+ {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'REINDEX 2 is correct');
+ if ($result) {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+ is($result, '0', 'bt_index_check 2 is correct');
+ if ($result)
+ {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ } else {
+ diag('#reindex2:)' . $n++);
+ }
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'DROP INDEX is correct');
+ }
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ last if $pg_bench_fork_flag ne "wait";
+ }
+
+ # explicitly shut down psql instances gracefully
+ $psql{stdin} .= "\\q\n";
+ $psql{run}->finish;
+
+ is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+ };
+
+ $child->finalize();
+ $child->summary();
+ $node->stop;
+ done_testing();
+
+ shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+ my ($psql, $query, $untl) = @_;
+ my $ret;
+
+ # For each query we run, we'll restart the timeout. Otherwise the timeout
+ # would apply to the whole test script, and would need to be set very high
+ # to survive when running under Valgrind.
+ $psql_timeout->reset();
+ $psql_timeout->start();
+
+ # send query
+ $$psql{stdin} .= $query;
+ $$psql{stdin} .= "\n";
+
+ # wait for query results
+ $$psql{run}->pump_nb();
+ while (1)
+ {
+ last if $$psql{stdout} =~ /$untl/;
+ if ($psql_timeout->is_expired)
+ {
+ diag("aborting wait: program timed out\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ if (not $$psql{run}->pumpable())
+ {
+ diag("aborting wait: program died\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ $$psql{run}->pump();
+ }
+
+ $$psql{stdout} = '';
+
+ return 1;
+}
diff --git a/src/bin/pg_amcheck/t/007_concurrently_unique.pl b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
new file mode 100644
index 00000000000..22cd3b4bf2b
--- /dev/null
+++ b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
@@ -0,0 +1,239 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use threads;
+use Test::More;
+use Test::Builder;
+
+eval {
+ require IPC::SysV;
+ IPC::SysV->import(qw(IPC_CREAT IPC_EXCL S_IRUSR S_IWUSR));
+};
+
+if ($@ || $windows_os)
+{
+ plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key, $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 128MB');
+$node->append_conf('postgresql.conf', 'shared_buffers = 256MB');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+ c1 money default 0, c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i, updated_at)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+ # fork returned undef, so unsuccessful
+ die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+ # $node->psql('postgres', q(INSERT INTO tbl SELECT i,0,0,0,now() FROM generate_series(1, 1000) s(i);));
+ # while [ $? -eq 0 ]; do make -C src/bin/pg_amcheck/ check PROVE_TESTS='t/007_*' ; done
+
+ $node->pgbench(
+ '--no-vacuum --client=40 --exit-on-abort --transactions=1000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent INSERTs, UPDATES and RC',
+ {
+ # Ensure some HOT updates happen
+ '001_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ '002_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*100,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ '003_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ '004_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ });
+
+ if ($child->is_passing()) {
+ shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+ } else {
+ shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+ }
+
+ my $pg_bench_fork_flag;
+ while (1) {
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ usleep(100_000); # Time::HiRes::usleep; the builtin sleep() would truncate 0.1 to 0
+ last if $pg_bench_fork_flag eq "stop";
+ }
+} else {
+ my $pg_bench_fork_flag;
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+ subtest 'reindex run subtest' => sub {
+ is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+ my %psql = (stdin => '', stdout => '', stderr => '');
+ $psql{run} = IPC::Run::start(
+ [ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+ '<',
+ \$psql{stdin},
+ '>',
+ \$psql{stdout},
+ '2>',
+ \$psql{stderr},
+ $psql_timeout);
+
+ my ($result, $stdout, $stderr, $stderr_saved); my $n = 0;
+
+# ok(send_query_and_wait(\%psql, q[SELECT pg_sleep(10);], qr/^.*$/m), 'SELECT');
+
+ while (1)
+ {
+
+ if (int(rand(2)) == 0) {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+ } else {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=0);));
+ }
+ is($result, '0', 'ALTER TABLE is correct');
+
+
+ if (1)
+ {
+ my $sql = q(select pg_sleep(0); CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+ is($result, '0', 'CREATE INDEX is correct');
+ $stderr_saved = $stderr;
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+ is($result, '0', 'bt_index_check for new index is correct');
+ if ($result)
+ {
+ diag($stderr);
+ diag($stderr_saved);
+ BAIL_OUT($stderr);
+ } else {
+ diag('#create:)' . $n++);
+ }
+
+ if (1)
+ {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'REINDEX 2 is correct');
+ if ($result) {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+ is($result, '0', 'bt_index_check 2 is correct');
+ if ($result)
+ {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ } else {
+ diag('#reindex2:)' . $n++);
+ }
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'DROP INDEX is correct');
+ }
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ last if $pg_bench_fork_flag ne "wait";
+ }
+
+ # explicitly shut down psql instances gracefully
+ $psql{stdin} .= "\\q\n";
+ $psql{run}->finish;
+
+ is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+ };
+
+ $child->finalize();
+ $child->summary();
+ $node->stop;
+ done_testing();
+
+ shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+ my ($psql, $query, $untl) = @_;
+ my $ret;
+
+ # For each query we run, we'll restart the timeout. Otherwise the timeout
+ # would apply to the whole test script, and would need to be set very high
+ # to survive when running under Valgrind.
+ $psql_timeout->reset();
+ $psql_timeout->start();
+
+ # send query
+ $$psql{stdin} .= $query;
+ $$psql{stdin} .= "\n";
+
+ # wait for query results
+ $$psql{run}->pump_nb();
+ while (1)
+ {
+ last if $$psql{stdout} =~ /$untl/;
+ if ($psql_timeout->is_expired)
+ {
+ diag("aborting wait: program timed out\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ if (not $$psql{run}->pumpable())
+ {
+ diag("aborting wait: program died\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ $$psql{run}->pump();
+ }
+
+ $$psql{stdout} = '';
+
+ return 1;
+}
--
2.43.0
Hello, Matthias!
Added support for unique indexes.
So, now your initial idea about resetting during the first phase appears to
be ready.
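With this version the cic_reset_snapshots injection-point test shows the
reset path firing for unique builds as well:

    CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
    NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
    NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective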
Next step: use a single scan and an auxiliary index for the concurrent index build.
Also, I have updated the stress tests according to [0].
[0]: /messages/by-id/CANtu0ojmVd27fEhfpST7RG2KZvwkX=dMyKUqg0KM87FkOSdz8Q@mail.gmail.com
Best regards,
Mikhail.
Attachments:
v5-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch
From e7d31801aac57f2e0bfc6bfc209be89eb90c75e9 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v5 5/5] Allow snapshot resets in concurrent unique index
builds
Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.
Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:
1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values
This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
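To illustrate the second point: after sorting, a group of equal keys may
contain alive tuples separated by dead ones (the "addda" case mentioned in
the _bt_load comments), so the adjacent-pair check inside the tuplesort alone
is not sufficient. A hypothetical sketch of a heap state that must still be
rejected:

    CREATE TABLE t(i int);
    INSERT INTO t SELECT 1 FROM generate_series(1, 2);  -- two alive duplicates
    UPDATE t SET i = 1;  -- leaves a dead version of each row behind
    -- a concurrent unique index build on t(i) must still fail with:
    --   ERROR:  could not create unique index ...
    --   DETAIL:  Key (i)=(1) is duplicated.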
---
src/backend/access/heap/heapam_handler.c | 6 +-
src/backend/access/nbtree/nbtdedup.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 173 ++++++++++++++----
src/backend/access/nbtree/nbtsplitloc.c | 12 +-
src/backend/access/nbtree/nbtutils.c | 29 ++-
src/backend/catalog/index.c | 6 +-
src/backend/utils/sort/tuplesortvariants.c | 67 +++++--
src/include/access/nbtree.h | 4 +-
src/include/access/tableam.h | 5 +-
src/include/utils/tuplesort.h | 1 +
.../expected/cic_reset_snapshots.out | 6 +
11 files changed, 242 insertions(+), 75 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2e5163609c1..921b806642a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, or during bootstrap, we take a regular MVCC snapshot
* and index whatever's live according to that while that snapshot is reset
- * every so often (in case of non-unique index).
+ * every so often.
*/
OldestXmin = InvalidTransactionId;
/*
- * For a unique index we need a consistent snapshot for the whole scan.
+ * For concurrent builds of non-system indexes, we may want to periodically
+ * reset snapshots to allow vacuum to clean up tuples.
*/
reset_snapshots = indexInfo->ii_Concurrent &&
- !indexInfo->ii_Unique &&
!is_system_catalog; /* just in case */
/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
_bt_dedup_start_pending(state, itup, offnum);
}
else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
- else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
itemid = PageGetItemId(page, minoff);
itup = (IndexTuple) PageGetItem(page, itemid);
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
{
itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
itup = (IndexTuple) PageGetItem(page, itemid);
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
return true;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2acbf121745..ac9e5acfc53 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
Relation index;
bool isunique;
bool nulls_not_distinct;
+ bool unique_dead_ignored;
} BTSpool;
/*
@@ -101,6 +102,7 @@ typedef struct BTShared
Oid indexrelid;
bool isunique;
bool nulls_not_distinct;
+ bool unique_dead_ignored;
bool isconcurrent;
int scantuplesortstates;
@@ -203,15 +205,13 @@ typedef struct BTLeader
*/
typedef struct BTBuildState
{
- bool isunique;
- bool nulls_not_distinct;
bool havedead;
Relation heap;
BTSpool *spool;
/*
- * spool2 is needed only when the index is a unique index. Dead tuples are
- * put into spool2 instead of spool in order to avoid uniqueness check.
+ * spool2 is needed only when the index is a unique index built non-concurrently.
+ * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
*/
BTSpool *spool2;
double indtuples;
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
ResetUsage();
#endif /* BTREE_BUILD_STATS */
- buildstate.isunique = indexInfo->ii_Unique;
- buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
buildstate.havedead = false;
buildstate.heap = heap;
buildstate.spool = NULL;
@@ -379,6 +377,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
btspool->index = index;
btspool->isunique = indexInfo->ii_Unique;
btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+ /*
+ * We need to ignore dead tuples for unique checks in case of a concurrent build.
+ * This is required because of the periodic snapshot resets.
+ */
+ btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
/* Save as primary spool */
buildstate->spool = btspool;
@@ -427,8 +430,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* the use of parallelism or any other factor.
*/
buildstate->spool->sortstate =
- tuplesort_begin_index_btree(heap, index, buildstate->isunique,
- buildstate->nulls_not_distinct,
+ tuplesort_begin_index_btree(heap, index, btspool->isunique,
+ btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
maintenance_work_mem, coordinate,
TUPLESORT_NONE);
@@ -436,8 +440,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* If building a unique index, put dead tuples in a second spool to keep
* them out of the uniqueness check. We expect that the second spool (for
* dead tuples) won't get very full, so we give it only work_mem.
+ *
+ * In case of a concurrent build, dead tuples need not be put into the index
+ * since we wait for all snapshots older than the reference snapshot during
+ * the validation phase.
*/
- if (indexInfo->ii_Unique)
+ if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
{
BTSpool *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
SortCoordinate coordinate2 = NULL;
@@ -468,7 +476,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* full, so we give it only work_mem
*/
buildstate->spool2->sortstate =
- tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+ tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
coordinate2, TUPLESORT_NONE);
}
@@ -1147,13 +1155,116 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
+ bool fail_on_alive_duplicate;
wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
BTGetDeduplicateItems(wstate->index);
+ /*
+ * unique_dead_ignored does not guarantee that no group of equal tuples in
+ * the spool contains more than one alive tuple. That may happen when alive
+ * tuples are separated by dead ones, like this: "addda" (a=alive, d=dead).
+ */
+ fail_on_alive_duplicate = btspool->unique_dead_ignored;
- if (merge)
+ if (fail_on_alive_duplicate)
+ {
+ bool seen_alive = false,
+ prev_tested = false;
+ IndexTuple prev = NULL;
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+ &TTSOpsBufferHeapTuple);
+ IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+ Assert(btspool->isunique);
+ Assert(!btspool2);
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+ {
+ bool tuples_equal = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (prev != NULL) /* not the first tuple */
+ {
+ bool has_nulls = false,
+ call_again, /* just to pass something */
+ ignored, /* just to pass something */
+ now_alive;
+ ItemPointerData tid;
+
+ /* is this tuple equal to the previous one? */
+ if (wstate->inskey->allequalimage)
+ tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+ else
+ tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+ /* handle null values correctly */
+ if (has_nulls && !btspool->nulls_not_distinct)
+ tuples_equal = false;
+
+ if (tuples_equal)
+ {
+ /* check the previous tuple if not already done */
+ if (!prev_tested)
+ {
+ call_again = false;
+ tid = prev->t_tid;
+ seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+ prev_tested = true;
+ }
+
+ call_again = false;
+ tid = itup->t_tid;
+ now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+ /* are multiple alive tuples detected in equal group? */
+ if (seen_alive && now_alive)
+ {
+ char *key_desc;
+ TupleDesc tupDes = RelationGetDescr(wstate->index);
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values[INDEX_MAX_KEYS];
+
+ index_deform_tuple(itup, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+ /* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(wstate->index)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(wstate->heap,
+ RelationGetRelationName(wstate->index))));
+ }
+ seen_alive |= now_alive;
+ }
+ }
+
+ if (!tuples_equal)
+ {
+ seen_alive = false;
+ prev_tested = false;
+ }
+
+ _bt_buildadd(wstate, state, itup, 0);
+ if (prev) pfree(prev);
+ prev = CopyIndexTuple(itup);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
+ else if (merge)
{
/*
* Another BTSpool for dead tuples exists. Now we have to merge
@@ -1314,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
InvalidOffsetNumber);
}
else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
+ itup, NULL) > keysz &&
_bt_dedup_save_htid(dstate, itup))
{
/*
@@ -1411,7 +1522,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
- bool reset_snapshot;
bool wait_for_snapshot_attach;
int querylen;
@@ -1430,21 +1540,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
scantuplesortstates = leaderparticipates ? request + 1 : request;
- /*
- * For concurrent non-unique index builds, we can periodically reset snapshots
- * to allow the xmin horizon to advance. This is safe since these builds don't
- * require a consistent view across the entire scan. Unique indexes still need
- * a stable snapshot to properly enforce uniqueness constraints.
- */
- reset_snapshot = isconcurrent && !btspool->isunique;
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that, while that snapshot may be reset periodically in
- * case of non-unique index.
+ * live according to that, while that snapshot may be reset periodically.
*/
if (!isconcurrent)
{
@@ -1452,16 +1553,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
- else if (reset_snapshot)
+ else
{
+ /*
+ * For concurrent index builds, we can periodically reset snapshots to allow
+ * the xmin horizon to advance. This is safe since these builds don't
+ * require a consistent view across the entire scan.
+ */
snapshot = InvalidSnapshot;
PushActiveSnapshot(GetTransactionSnapshot());
}
- else
- {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
- }
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1531,6 +1632,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->indexrelid = RelationGetRelid(btspool->index);
btshared->isunique = btspool->isunique;
btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+ btshared->unique_dead_ignored = btspool->unique_dead_ignored;
btshared->isconcurrent = isconcurrent;
btshared->scantuplesortstates = scantuplesortstates;
btshared->queryid = pgstat_get_my_query_id();
@@ -1545,7 +1647,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
snapshot,
- reset_snapshot);
+ isconcurrent);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1626,7 +1728,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* In case the leader is going to reset its own active snapshot as well, we
* need to wait until all workers have imported the initial snapshot.
*/
- wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+ wait_for_snapshot_attach = isconcurrent && leaderparticipates;
if (wait_for_snapshot_attach)
WaitForParallelWorkersToAttach(pcxt, true);
@@ -1742,6 +1844,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
leaderworker->index = buildstate->spool->index;
leaderworker->isunique = buildstate->spool->isunique;
leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+ leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
/* Initialize second spool, if required */
if (!btleader->btshared->isunique)
@@ -1845,11 +1948,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
btspool->index = indexRel;
btspool->isunique = btshared->isunique;
btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+ btspool->unique_dead_ignored = btshared->unique_dead_ignored;
/* Look up shared state private to tuplesort.c */
sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
tuplesort_attach_shared(sharedsort, seg);
- if (!btshared->isunique)
+ if (!btshared->isunique || btshared->isconcurrent)
{
btspool2 = NULL;
sharedsort2 = NULL;
@@ -1928,6 +2032,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
btspool->index,
btspool->isunique,
btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
sortmem, coordinate,
TUPLESORT_NONE);
@@ -1950,14 +2055,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
coordinate2->nParticipants = -1;
coordinate2->sharedsort = sharedsort2;
btspool2->sortstate =
- tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+ tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
Min(sortmem, work_mem), coordinate2,
false);
}
/* Fill in buildstate for _bt_build_callback() */
- buildstate.isunique = btshared->isunique;
- buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
buildstate.havedead = false;
buildstate.heap = btspool->heap;
buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
{
itemid = PageGetItemId(state->origpage, maxoff);
tup = (IndexTuple) PageGetItem(state->origpage, itemid);
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* avoid appending a heap TID in new high key, we're done. Finish split
* with default strategy and initial split interval.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
if (perfectpenalty <= indnkeyatts)
return perfectpenalty;
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* If page is entirely full of duplicates, a single value strategy split
* will be performed.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
if (perfectpenalty <= indnkeyatts)
{
*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
itemid = PageGetItemId(state->origpage, P_HIKEY);
hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
- state->newitem);
+ state->newitem, NULL);
if (perfectpenalty <= indnkeyatts)
*strategy = SPLIT_SINGLE_VALUE;
else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
lastleft = _bt_split_lastleft(state, split);
firstright = _bt_split_firstright(state, split);
- return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+ return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 50cbf06cb45..3d6dda4ace8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
ScanDirection dir, bool *continuescan);
static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
int tupnatts, TupleDesc tupdesc);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
/*
@@ -4672,7 +4670,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
/* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+ keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
#ifdef DEBUG_NO_TRUNCATE
/* Force truncation to be ineffective for testing purposes */
@@ -4790,17 +4788,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
/*
* _bt_keep_natts - how many key attributes to keep when truncating.
*
+ * This is exported to be used as a comparison function during concurrent
+ * unique index builds in case _bt_keep_natts_fast is not suitable because
+ * the collation is not "allequalimage"/deduplication-safe.
+ *
* Caller provides two tuples that enclose a split point. Caller's insertion
* scankey is used to compare the tuples; the scankey's argument values are
* not considered here.
*
+ * *hasnulls is set to true if any key column of either tuple is null.
+ *
* This can return a number of attributes that is one greater than the
* number of key attributes for the index relation. This indicates that the
* caller must use a heap TID as a unique-ifier in new pivot tuple.
*/
-static int
+int
_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
+ BTScanInsert itup_key,
+ bool *hasnulls)
{
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
TupleDesc itupdesc = RelationGetDescr(rel);
@@ -4826,6 +4831,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ if (hasnulls)
+ (*hasnulls) |= (isNull1 || isNull2);
if (isNull1 != isNull2)
break;
@@ -4845,7 +4852,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* expected in an allequalimage index.
*/
Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+ keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
return keepnatts;
}
@@ -4856,7 +4863,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* This is exported so that a candidate split point can have its effect on
* suffix truncation inexpensively evaluated ahead of time when finding a
* split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. It may also be used as a comparison function during
+ * concurrent builds of unique indexes.
*
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4865,6 +4873,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* "equal image" columns, routine is guaranteed to give the same result as
* _bt_keep_natts would.
*
+ * *hasnulls is set to true if any key column of either tuple is null.
+ *
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts, even when the index uses
* an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4873,7 +4883,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* more balanced split point.
*/
int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ bool *hasnulls)
{
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4890,6 +4901,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ if (hasnulls)
+ *hasnulls |= (isNull1 | isNull2);
att = TupleDescAttr(itupdesc, attnum - 1);
if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f581a743aae..6242b242940 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3292,9 +3292,9 @@ IndexCheckExclusion(Relation heapRelation,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
*
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..aa4fcaac9a0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -123,6 +123,7 @@ typedef struct
bool enforceUnique; /* complain if we find duplicate tuples */
bool uniqueNullsNotDistinct; /* unique constraint null treatment */
+ bool uniqueDeadIgnored; /* ignore dead tuples in unique check */
} TuplesortIndexBTreeArg;
/*
@@ -349,6 +350,7 @@ tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem,
SortCoordinate coordinate,
int sortopt)
@@ -391,6 +393,7 @@ tuplesort_begin_index_btree(Relation heapRel,
arg->index.indexRel = indexRel;
arg->enforceUnique = enforceUnique;
arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+ arg->uniqueDeadIgnored = uniqueDeadIgnored;
indexScanKey = _bt_mkscankey(indexRel, NULL);
@@ -1520,6 +1523,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
char *key_desc;
+ bool uniqueCheckFail = true;
/*
* Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1533,57 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
*/
Assert(tuple1 != tuple2);
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
+ /* This is a fail-fast check; see _bt_load for details. */
+ if (arg->uniqueDeadIgnored)
+ {
+ bool any_tuple_dead,
+ call_again = false,
+ ignored;
+
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+ &TTSOpsBufferHeapTuple);
+ ItemPointerData tid = tuple1->t_tid;
+
+ IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+ any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+ if (!any_tuple_dead)
+ {
+ call_again = false;
+ tid = tuple2->t_tid;
+ any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+ &ignored);
+ }
+
+ if (any_tuple_dead)
+ {
+ elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+ ItemPointerGetBlockNumber(&tuple1->t_tid),
+ ItemPointerGetOffsetNumber(&tuple1->t_tid),
+ ItemPointerGetBlockNumber(&tuple2->t_tid),
+ ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+ uniqueCheckFail = false;
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
+ if (uniqueCheckFail)
+ {
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ /* keep this error message in sync with the same in _bt_load */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
}
/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
extern char *btbuildphasename(int64 phasenum);
extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key, bool *hasnulls);
extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
+ IndexTuple firstright, bool *hasnulls);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9ee5ea15fd4..ec3769585c3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1803,9 +1803,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
*
- * In case of non-unique concurrent index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In case of a concurrent index build SO_RESET_SNAPSHOT is applied for the scan.
+ * That causes snapshots to be changed on the fly to let the xmin horizon advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem, SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 49ef68d9071..c8e4683ad6d 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
----------------
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
(1 row)
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_parallelscan_initialize
--
2.43.0
v5-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch
From 54e755b2d097753f65e14c4aafd5718e0cb457f8 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v5 3/5] Allow advancing xmin during non-unique, non-parallel
concurrent index builds by periodically resetting snapshots
Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.
This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.
Currently, this technique is applied to:
- Non-parallel index builds: parallel index builds are not yet supported and will be addressed in a future commit.
- Non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; support for them may be added in the future.
- Only during the first scan of the heap: the second scan during index validation still uses a single snapshot to ensure index correctness.
To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.
Regression tests are added to verify the behavior.
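A simple way to observe the effect (not part of the patch, just a sketch for
reviewers): run a long CREATE INDEX CONCURRENTLY in one session and watch the
builder's reported xmin in another; with SO_RESET_SNAPSHOT it should now
advance during the first heap scan instead of staying pinned:

    SELECT pid, backend_xmin, state, query
    FROM pg_stat_activity
    WHERE query LIKE '%CONCURRENTLY%';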
---
contrib/amcheck/verify_nbtree.c | 3 +-
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/brin/brin.c | 14 +++
src/backend/access/heap/heapam.c | 46 ++++++++
src/backend/access/heap/heapam_handler.c | 57 ++++++++--
src/backend/access/index/genam.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 14 +++
src/backend/catalog/index.c | 30 +++++-
src/backend/commands/indexcmds.c | 14 +--
src/backend/optimizer/plan/planner.c | 9 ++
src/include/access/tableam.h | 28 ++++-
src/test/modules/injection_points/Makefile | 2 +-
.../expected/cic_reset_snapshots.out | 102 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../sql/cic_reset_snapshots.sql | 82 ++++++++++++++
15 files changed, 375 insertions(+), 31 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK? */
+ true, /* syncscan OK? */
+ false); /* reset snapshots? */
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
errmsg("only heap AM is supported")));
/* Disable syncscan because we assume we scan from block zero upwards */
- scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+ scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
hscan = (HeapScanDesc) scan;
InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3aedec882cd..d69859ac4df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_brin_end_parallel(brinleader, NULL);
return;
}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..1fdfdf96482 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
#include "utils/datum.h"
#include "utils/inval.h"
#include "utils/spccache.h"
+#include "utils/injection_point.h"
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure active snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ PopActiveSnapshot();
+
+ sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+ Assert(!HaveRegisteredOrActiveSnapshot());
+ InvalidateCatalogSnapshot();
+
+ /* Goal of snapshot reset is to allow horizon to advance. */
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+ /* In some cases it is still not possible due to an assigned xid. */
+ if (!TransactionIdIsValid(MyProc->xid))
+ INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+ PushActiveSnapshot(GetLatestSnapshot());
+ sscan->rs_snapshot = GetActiveSnapshot();
+}
+
/*
* heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
*
@@ -607,7 +638,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
if (BufferIsValid(scan->rs_cbuf))
+ {
scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+ if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+ (scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+ heap_reset_scan_snapshot((TableScanDesc) scan);
+ }
}
/*
@@ -1233,6 +1270,15 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_parallelworkerdata != NULL)
pfree(scan->rs_parallelworkerdata);
+ if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+ {
+ Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ }
+
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..980c51e32b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
ExprContext *econtext;
Snapshot snapshot;
bool need_unregister_snapshot = false;
+ bool need_pop_active_snapshot = false;
+ bool reset_snapshots = false;
TransactionId OldestXmin;
BlockNumber previous_blkno = InvalidBlockNumber;
BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
- /* Set up execution state for predicate, if any. */
- predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
*/
OldestXmin = InvalidTransactionId;
+ /*
+ * For a unique index we need a consistent snapshot for the whole scan.
+ * A parallel scan would need some additional infrastructure to support
+ * SO_RESET_SNAPSHOT, which is not yet ready.
+ */
+ reset_snapshots = indexInfo->ii_Concurrent &&
+ !indexInfo->ii_Unique &&
+ !is_system_catalog; /* just in case */
+
/* okay to ignore lazy VACUUMs here */
if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
{
/*
* Serial index build.
- *
- * Must begin our own heap scan in this case. We may also need to
- * register a snapshot whose lifetime is under our direct control.
*/
if (!TransactionIdIsValid(OldestXmin))
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- need_unregister_snapshot = true;
+ snapshot = GetTransactionSnapshot();
+ /*
+ * Must begin our own heap scan in this case. We may also need to
+ * register a snapshot whose lifetime is under our direct control.
+ * If the snapshot is going to be reset during the scan, registration
+ * is not allowed, because the snapshot will be replaced every so
+ * often.
+ */
+ if (!reset_snapshots)
+ {
+ snapshot = RegisterSnapshot(snapshot);
+ need_unregister_snapshot = true;
+ }
+ Assert(!ActiveSnapshotSet());
+ PushActiveSnapshot(snapshot);
+ /* Store a link to the snapshot, because it may be copied. */
+ snapshot = GetActiveSnapshot();
+ need_pop_active_snapshot = true;
}
else
+ {
+ Assert(!indexInfo->ii_Concurrent);
snapshot = SnapshotAny;
+ }
scan = table_beginscan_strat(heapRelation, /* relation */
snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- allow_sync); /* syncscan OK? */
+ allow_sync, /* syncscan OK? */
+ reset_snapshots /* reset snapshots? */);
}
else
{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
}
hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
!TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);
+ Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+ /* Clear reference to snapshot since it may be changed by the scan itself. */
+ if (reset_snapshots)
+ snapshot = InvalidSnapshot;
/* Publish number of blocks to scan */
if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
table_endscan(scan);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
/* we can now forget our snapshot, if set and registered by us */
if (need_unregister_snapshot)
UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- false); /* syncscan not OK */
+ false, /* syncscan not OK */
+ false);
hscan = (HeapScanDesc) scan;
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 60c61039d66..777df91972e 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -461,7 +461,7 @@ systable_beginscan(Relation heapRelation,
*/
sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
nkeys, key,
- true, false);
+ true, false, false);
sysscan->iscan = NULL;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 17a352d040c..5c4581afb1a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ }
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_bt_end_parallel(btleader);
return;
}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1c3a9e06d37..f581a743aae 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+#include "storage/proc.h"
/* Potentially set by pg_upgrade_support functions */
Oid binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
Relation indexRelation;
IndexInfo *indexInfo;
- /* This had better make sure that a snapshot is active */
- Assert(ActiveSnapshotSet());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ Assert(!TransactionIdIsValid(MyProc->xid));
/* Open and lock the parent heap relation */
heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
indexRelation = index_open(indexRelationId, RowExclusiveLock);
+ /* BuildIndexInfo may require a snapshot for expressions and predicates */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* We have to re-build the IndexInfo struct, since it was lost in the
* commit of the transaction where this concurrent index was created at
* the catalog level.
*/
indexInfo = BuildIndexInfo(indexRelation);
+ /* Done with snapshot */
+ PopActiveSnapshot();
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/* Now build the index */
index_build(heapRel, indexRelation, indexInfo, false, true);
+ /* Invalidate the catalog snapshot just for the assertion below */
+ InvalidateCatalogSnapshot();
+ Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
table_close(heapRel, NoLock);
index_close(indexRelation, NoLock);
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
* insert new entries into the index for insertions and non-HOT updates.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+ /* we can do away with our snapshot */
+ PopActiveSnapshot();
}
/*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK */
+ true, /* syncscan OK */
+ false);
while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
* as of the start of the scan (see table_index_build_scan), whereas a normal
* build takes care to include recently-dead tuples. This is OK because
* we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone. The reason for doing that is to avoid
+ * to see those tuples are gone. One of the reasons for doing that is to avoid
* bogus unique-index failures due to concurrent UPDATEs (we might see
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
+ * Furthermore, in the case of a non-unique index we set SO_RESET_SNAPSHOT for
+ * the scan, which causes a fresh snapshot to be set as active every so often.
+ * The reason for that is to let the xmin horizon advance.
+ *
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
* transactions that could have been modifying the table to terminate. Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
- * We now take a new snapshot, and build the index using all tuples that
- * are visible in this snapshot. We can be sure that any HOT updates to
+ * We build the index using all tuples that are visible to a single
+ * snapshot, or to a series of periodically refreshed snapshots. We can be
+ * sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
* rolled back. Thus, each visible tuple is either the end of its
* HOT-chain or the extension of the chain is HOT-safe for this index.
*/
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/* Perform concurrent build of index */
index_concurrently_build(tableId, indexRelationId);
- /* we can do away with our snapshot */
- PopActiveSnapshot();
-
/*
* Commit this transaction to make the indisready update visible.
*/
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
if (newidx->safe)
set_indexsafe_procflags();
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/* Perform concurrent build of new index */
index_concurrently_build(newidx->tableId, newidx->indexId);
- PopActiveSnapshot();
CommitTransactionCommand();
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762ec..d9de16af81d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
/* GUC parameters */
double cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6942,6 +6943,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
Relation heap;
Relation index;
RelOptInfo *rel;
+ bool need_pop_active_snapshot = false;
int parallel_workers;
BlockNumber heap_blocks;
double reltuples;
@@ -6997,6 +6999,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
heap = table_open(tableOid, NoLock);
index = index_open(indexOid, NoLock);
+ /* Set ActiveSnapshot since functions in the indexes may need it */
+ if (!ActiveSnapshotSet())
+ {
+ PushActiveSnapshot(GetTransactionSnapshot());
+ need_pop_active_snapshot = true;
+ }
/*
* Determine if it's safe to proceed.
*
@@ -7054,6 +7061,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
parallel_workers--;
done:
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
index_close(index, NoLock);
table_close(heap, NoLock);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..f4c7d2a92bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
+#include "utils/injection_point.h"
#define DEFAULT_TABLE_ACCESS_METHOD "heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
* needed. If table data may be needed, set SO_NEED_TUPLES.
*/
SO_NEED_TUPLES = 1 << 10,
+ /*
+ * Reset the scan and catalog snapshots every so often? If so, every
+ * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped,
+ * the catalog snapshot is invalidated, and the latest snapshot is
+ * pushed as active.
+ *
+ * At the end of the scan the snapshot is not popped.
+ * The goal of this mode is to keep the xmin horizon moving forward.
+ *
+ * See heap_reset_scan_snapshot for details.
+ */
+ SO_RESET_SNAPSHOT = 1 << 11,
} ScanOptions;
/*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
static inline TableScanDesc
table_beginscan_strat(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
- bool allow_strat, bool allow_sync)
+ bool allow_strat, bool allow_sync,
+ bool reset_snapshot)
{
uint32 flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
flags |= SO_ALLOW_STRAT;
if (allow_sync)
flags |= SO_ALLOW_SYNC;
+ if (reset_snapshot)
+ {
+ INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+ /* An active snapshot is required at the start. */
+ Assert(GetActiveSnapshot() == snapshot);
+ /* The active snapshot must not be registered, so that xmin can keep advancing. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ flags |= SO_RESET_SNAPSHOT;
+ }
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1779,6 +1801,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* very hard to detect whether they're really incompatible with the chain tip.
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
+ *
+ * In the case of a non-unique, non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. That causes snapshots to be
+ * replaced on the fly so that the xmin horizon can advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
DATA = injection_points--1.0.sql
PGFILEDESC = "injection_points - facility for injection points"
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..4cfbbb05923
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,102 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local
+----------------------------
+
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
'sql': [
'injection_points',
'reindex_conc',
+ 'cic_reset_snapshots',
],
'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
# The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..4fef5a47431
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,82 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
--
2.43.0
Attachment: v5-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch (application/x-patch)
From 9432da61d7640457a67cc5ac8ecd0b1c6be132e1 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v5 1/5] this is https://commitfest.postgresql.org/50/5160/
merged into a single commit; it is required for the stability of the stress tests.
---
src/backend/commands/indexcmds.c | 4 +-
src/backend/executor/execIndexing.c | 3 +
src/backend/executor/execPartition.c | 119 ++++++++-
src/backend/executor/nodeModifyTable.c | 2 +
src/backend/optimizer/util/plancat.c | 135 +++++++---
src/backend/utils/time/snapmgr.c | 2 +
src/test/modules/injection_points/Makefile | 7 +-
.../expected/index_concurrently_upsert.out | 80 ++++++
.../index_concurrently_upsert_predicate.out | 80 ++++++
.../expected/reindex_concurrently_upsert.out | 238 ++++++++++++++++++
...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 11 +
.../specs/index_concurrently_upsert.spec | 68 +++++
.../index_concurrently_upsert_predicate.spec | 70 ++++++
.../specs/reindex_concurrently_upsert.spec | 86 +++++++
...dex_concurrently_upsert_on_constraint.spec | 86 +++++++
...index_concurrently_upsert_partitioned.spec | 88 +++++++
18 files changed, 1505 insertions(+), 50 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
* before the reference snap was taken, we have to wait out any
* transactions that might have older snapshots.
*/
+ INJECTION_POINT("define_index_before_set_valid");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_3);
WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* the same time to make sure we only get constraint violations from the
* indexes with the correct names.
*/
-
+ INJECTION_POINT("reindex_relation_concurrently_before_swap");
StartTransactionCommand();
/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* index_drop() for more details.
*/
+ INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_4);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -936,6 +937,8 @@ retry:
econtext->ecxt_scantuple = save_scantuple;
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
return !conflict;
}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
return rri;
}
+/*
+ * IsIndexCompatibleAsArbiter
+ * Checks whether an index is interchangeable with the provided
+ * arbiter index for the purposes of an INSERT ON CONFLICT
+ * operation.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation arbiterIndexRelation,
+ IndexInfo *arbiterIndexInfo,
+ Relation indexRelation,
+ IndexInfo *indexInfo)
+{
+ int i;
+
+ if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+ return false;
+ /* Exclusion constraints are not supported here. */
+ if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+ return false;
+ if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+ return false;
+
+ for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+ {
+ int arbiterAttno = arbiterIndexRelation->rd_index->indkey.values[i];
+ int attno = indexRelation->rd_index->indkey.values[i];
+ if (arbiterAttno != attno)
+ return false;
+ }
+
+ if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+ RelationGetIndexExpressions(indexRelation)) != NIL)
+ return false;
+
+ if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+ RelationGetIndexPredicate(indexRelation)) != NIL)
+ return false;
+ return true;
+}
+
/*
* ExecInitPartitionInfo
* Lock the partition and initialize ResultRelInfo. Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
{
List *childIdxs;
+ List *nonAncestorIdxs = NIL;
+ int i, j, additional_arbiters = 0;
childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
ListCell *lc2;
ancestors = get_partition_ancestors(childIdx);
- foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ if (ancestors)
{
- if (list_member_oid(ancestors, lfirst_oid(lc2)))
- arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ {
+ if (list_member_oid(ancestors, lfirst_oid(lc2)))
+ arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ }
}
+ else /* No ancestor was found for that index. Save it for rechecking later. */
+ nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
list_free(ancestors);
}
+
+ /*
+ * If any non-ancestor indexes are found, we need to compare them with other
+ * indexes of the relation that will be used as arbiters. This is necessary
+ * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+ * must be considered as arbiters to ensure that all concurrent transactions
+ * use the same set of arbiters.
+ */
+ if (nonAncestorIdxs)
+ {
+ for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+ {
+ if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+ {
+ Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+ IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+ Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+ /* It is too early to use non-ready indexes as arbiters */
+ if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+ continue;
+
+ for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+ {
+ if (list_member_oid(arbiterIndexes,
+ leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+ {
+ Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+ IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+ /* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+ if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+ nonAncestorIndexRelation, nonAncestorIndexInfo))
+ {
+ arbiterIndexes = lappend_oid(arbiterIndexes,
+ nonAncestorIndexRelation->rd_index->indexrelid);
+ additional_arbiters++;
+ }
+ }
+ }
+ }
+ }
+ }
+ list_free(nonAncestorIdxs);
+
+ /*
+ * If the resulting lists are of unequal length, something is wrong.
+ * (This shouldn't happen, since arbiter index selection should not
+ * pick up a non-ready index.)
+ *
+ * But any additional arbiters found above must be accounted for.
+ */
+ if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+ list_length(arbiterIndexes) - additional_arbiters)
+ elog(ERROR, "invalid arbiter index list");
}
-
- /*
- * If the resulting lists are of inequal length, something is wrong.
- * (This shouldn't happen, since arbiter index selection should not
- * pick up an invalid index.)
- */
- if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
- list_length(arbiterIndexes))
- elog(ERROR, "invalid arbiter index list");
leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1161520f76b..23cf4c6b540 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
#include "utils/datum.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
return NULL;
}
}
+ INJECTION_POINT("exec_insert_before_insert_speculative");
/*
* Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 37b0ca2e439..5ffef4595e2 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -713,12 +713,14 @@ infer_arbiter_indexes(PlannerInfo *root)
List *indexList;
ListCell *l;
- /* Normalized inference attributes and inference expressions: */
- Bitmapset *inferAttrs = NULL;
- List *inferElems = NIL;
+ /* Normalized required attributes and expressions: */
+ Bitmapset *requiredArbiterAttrs = NULL;
+ List *requiredArbiterElems = NIL;
+ List *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
/* Results */
List *results = NIL;
+ bool foundValid = false;
/*
* Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -753,8 +755,8 @@ infer_arbiter_indexes(PlannerInfo *root)
if (!IsA(elem->expr, Var))
{
- /* If not a plain Var, just shove it in inferElems for now */
- inferElems = lappend(inferElems, elem->expr);
+ /* If not a plain Var, just shove it in requiredArbiterElems for now */
+ requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
continue;
}
@@ -766,30 +768,76 @@ infer_arbiter_indexes(PlannerInfo *root)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("whole row unique index inference specifications are not supported")));
- inferAttrs = bms_add_member(inferAttrs,
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
attno - FirstLowInvalidHeapAttributeNumber);
}
+ indexList = RelationGetIndexList(relation);
+
/*
* Lookup named constraint's index. This is not immediately returned
- * because some additional sanity checks are required.
+ * because some additional sanity checks are required. Additionally, we
+ * need to process other indexes as potential arbiters to account for
+ * cases where REINDEX CONCURRENTLY is processing an index used as a
+ * named constraint.
*/
if (onconflict->constraint != InvalidOid)
{
indexOidFromConstraint = get_constraint_index(onconflict->constraint);
if (indexOidFromConstraint == InvalidOid)
+ {
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("constraint in ON CONFLICT clause has no associated index")));
+ errmsg("constraint in ON CONFLICT clause has no associated index")));
+ }
+
+ /*
+ * Find the named constraint index to extract its attributes and predicates.
+ * We open all indexes in the loop to avoid deadlocks caused by
+ * acquiring locks in a different order.
+ */
+ foreach(l, indexList)
+ {
+ Oid indexoid = lfirst_oid(l);
+ Relation idxRel;
+ Form_pg_index idxForm;
+ AttrNumber natt;
+
+ idxRel = index_open(indexoid, rte->rellockmode);
+ idxForm = idxRel->rd_index;
+
+ if (idxForm->indisready)
+ {
+ if (indexOidFromConstraint == idxForm->indexrelid)
+ {
+ /*
+ * Prepare the requirements other indexes must satisfy to be used as
+ * arbiters together with indexOidFromConstraint. Both equivalent
+ * indexes must be involved in the case of REINDEX CONCURRENTLY.
+ */
+ for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+ {
+ int attno = idxRel->rd_index->indkey.values[natt];
+
+ if (attno != 0)
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+ attno - FirstLowInvalidHeapAttributeNumber);
+ }
+ requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+ requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+ /* We are done, so quit the loop. */
+ index_close(idxRel, NoLock);
+ break;
+ }
+ }
+ index_close(idxRel, NoLock);
+ }
}
/*
* Using that representation, iterate through the list of indexes on the
* target relation to try and find a match
*/
- indexList = RelationGetIndexList(relation);
-
foreach(l, indexList)
{
Oid indexoid = lfirst_oid(l);
@@ -812,7 +860,13 @@ infer_arbiter_indexes(PlannerInfo *root)
idxRel = index_open(indexoid, rte->rellockmode);
idxForm = idxRel->rd_index;
- if (!idxForm->indisvalid)
+ /*
+ * We need to consider both indisvalid and indisready indexes, because
+ * the latter may become indisvalid before the execution phase. The
+ * set of indexes used as arbiters must be kept the same for all
+ * concurrent transactions.
+ */
+ if (!idxForm->indisready)
goto next;
/*
@@ -832,27 +886,23 @@ infer_arbiter_indexes(PlannerInfo *root)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
- results = lappend_oid(results, idxForm->indexrelid);
- list_free(indexList);
- index_close(idxRel, NoLock);
- table_close(relation, NoLock);
- return results;
+ goto found;
}
else if (indexOidFromConstraint != InvalidOid)
{
- /* No point in further work for index in named constraint case */
- goto next;
+ /* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+ if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+ goto next;
+ }
+ else
+ {
+ /*
+ * Only considering conventional inference at this point (not named
+ * constraints), so index under consideration can be immediately
+ * skipped if it's not unique
+ */
+ if (!idxForm->indisunique)
+ goto next;
}
- /*
- * Only considering conventional inference at this point (not named
- * constraints), so index under consideration can be immediately
- * skipped if it's not unique
- */
- if (!idxForm->indisunique)
- goto next;
-
/*
* So-called unique constraints with WITHOUT OVERLAPS are really
* exclusion constraints, so skip those too.
@@ -872,7 +922,7 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/* Non-expression attributes (if any) must match */
- if (!bms_equal(indexedAttrs, inferAttrs))
+ if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
goto next;
/* Expression attributes (if any) must match */
@@ -880,6 +930,10 @@ infer_arbiter_indexes(PlannerInfo *root)
if (idxExprs && varno != 1)
ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
+ /*
+ * If arbiterElems are present, check them. If a named constraint is
+ * present, arbiterElems == NIL.
+ */
foreach(el, onconflict->arbiterElems)
{
InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -917,27 +971,35 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/*
- * Now that all inference elements were matched, ensure that the
+ * In the case of conventional inference, ensure that the
* expression elements from inference clause are not missing any
* cataloged expressions. This does the right thing when unique
* indexes redundantly repeat the same attribute, or if attributes
* redundantly appear multiple times within an inference clause.
+ *
+ * In the case of a named constraint, ensure the candidate has the
+ * same set of expressions as the named constraint index.
*/
- if (list_difference(idxExprs, inferElems) != NIL)
+ if (list_difference(idxExprs, requiredArbiterElems) != NIL)
goto next;
- /*
- * If it's a partial index, its predicate must be implied by the ON
- * CONFLICT's WHERE clause.
- */
predExprs = RelationGetIndexPredicate(idxRel);
if (predExprs && varno != 1)
ChangeVarNodes((Node *) predExprs, 1, varno, 0);
- if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+ /*
+ * If it's a partial index and conventional inference is used, its predicate must be implied
+ * by the ON CONFLICT's WHERE clause.
+ */
+ if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+ goto next;
+ /* If it's a partial index and a named constraint is used, the predicates must be equal. */
+ if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
goto next;
+found:
results = lappend_oid(results, idxForm->indexrelid);
+ foundValid |= idxForm->indisvalid;
next:
index_close(idxRel, NoLock);
}
@@ -945,7 +1007,8 @@ next:
list_free(indexList);
table_close(relation, NoLock);
- if (results == NIL)
+ /* At least one indisvalid index is required during planning. */
+ if (results == NIL || !foundValid)
ereport(ERROR,
(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f20..3a7357a050d 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
#include "utils/resowner.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/*
@@ -426,6 +427,7 @@ InvalidateCatalogSnapshot(void)
pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
CatalogSnapshot = NULL;
SnapshotResetXmin();
+ INJECTION_POINT("invalidate_catalog_snapshot_end");
}
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+ reindex_concurrently_upsert \
+ index_concurrently_upsert \
+ reindex_concurrently_upsert_partitioned \
+ reindex_concurrently_upsert_on_constraint \
+ index_concurrently_upsert_predicate
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'reindex_concurrently_upsert',
+ 'index_concurrently_upsert',
+ 'reindex_concurrently_upsert_partitioned',
+ 'reindex_concurrently_upsert_on_constraint',
+ 'index_concurrently_upsert_predicate',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
+  # We wait for all snapshots, so avoid parallel test execution
+ 'runningcheck-parallel': false,
},
'tap': {
'env': {
@@ -53,5 +62,7 @@ tests += {
'tests': [
't/001_stats.pl',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
},
}
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
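+#
+# s3 pauses just before the new index is marked valid, s1 pauses after its
+# conflict check (and again after a catalog snapshot invalidation), and s2
+# pauses before its speculative insertion, so both upserts race against the
+# moment the index becomes usable.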
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
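+#
+# Same scenario as index_concurrently_upsert, but the conflict arbiter is a
+# partial expression index (abs(i) WHERE i < ...), exercising unique-index
+# inference through expressions and predicates during the concurrent build.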
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+ CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX INDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
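+#
+# s3 pauses before the index swap and again before the old index is marked
+# dead, while s1 (paused after its conflict check) and s2 (paused before its
+# speculative insertion) upsert the same key.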
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX INDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
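+#
+# Same scenario as reindex_concurrently_upsert, but the upserts name the
+# arbiter explicitly via ON CONFLICT ON CONSTRAINT instead of relying on
+# column inference.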
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX INDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
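+#
+# Same scenario as reindex_concurrently_upsert, but the REINDEX targets the
+# partition's primary key index while the upserts go through the partitioned
+# parent table.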
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+ CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+ FOR VALUES FROM (0) TO (10000)
+ WITH (parallel_workers = 0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
--
2.43.0
Attachment: v5-0002-Add-stress-tests-for-concurrent-index-operations.patch (application/x-patch)
From 836cb845682460d8967dfbf2826f4c237d6be4e1 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v5 2/5] Add stress tests for concurrent index operations
Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck's bt_index_check
* Exercising parallel worker configurations
The tests perform intensive concurrent modifications via pgbench while
executing index operations to stress-test the index build infrastructure.
Test cases cover:
- Regular and unique indexes
- Indexes with stable and immutable predicates
- Multi-column indexes with various combinations
- Different parallel worker configurations
Two new test files added:
- t/006_concurrently.pl: General concurrent index operation tests
- t/007_concurrently_unique.pl: Focused testing of unique indexes
These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
src/bin/pg_amcheck/meson.build | 2 +
src/bin/pg_amcheck/t/006_concurrently.pl | 315 ++++++++++++++++++
.../pg_amcheck/t/007_concurrently_unique.pl | 239 +++++++++++++
3 files changed, 556 insertions(+)
create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl
create mode 100644 src/bin/pg_amcheck/t/007_concurrently_unique.pl
diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..b4e14a15ef3 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,8 @@ tests += {
't/003_check.pl',
't/004_verify_heapam.pl',
't/005_opclass_damage.pl',
+ 't/006_concurrently.pl',
+ 't/007_concurrently_unique.pl',
],
},
}
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 00000000000..e13a340e777
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,315 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+
+use Test::More;
+use Test::Builder;
+
+
+eval {
+ require IPC::SysV;
+ IPC::SysV->import(qw(IPC_CREAT IPC_EXCL S_IRUSR S_IWUSR));
+};
+
+if ($@ || $windows_os)
+{
+ plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key, $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = int(rand(1_000_000));	# shmget() expects an integer key
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+ c1 money default 0, c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+ # fork returned undef, so unsuccessful
+ die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
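+	# Child process: drive a concurrent pgbench write load while the parent
+	# (below) loops index operations and verifies the results with amcheck.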
+
+ $node->pgbench(
+ '--no-vacuum --client=10 --transactions=1000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent INSERTs, UPDATES and RC',
+ {
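+			# The three scripts differ only in key range: narrower ranges
+			# produce more conflicts and therefore more (HOT) updates.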
+ '001_pgbench_concurrent_transaction_inserts' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ ),
+ '002_pgbench_concurrent_transaction_inserts' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ ),
+ # Ensure some HOT updates happen
+ '003_pgbench_concurrent_transaction_updates' => q(
+ BEGIN;
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+ on conflict(i) do update set updated_at = now();
+ COMMIT;
+ )
+ });
+
+ if ($child->is_passing()) {
+ shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+ } else {
+ shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+ }
+
+ my $pg_bench_fork_flag;
+ while (1) {
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+		usleep(100_000);	# core sleep() truncates 0.1 to 0; use Time::HiRes::usleep
+ last if $pg_bench_fork_flag eq "stop";
+ }
+} else {
+ my $pg_bench_fork_flag;
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+ subtest 'reindex run subtest' => sub {
+ is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+ my %psql = (stdin => '', stdout => '', stderr => '');
+ $psql{run} = IPC::Run::start(
+ [ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+ '<',
+ \$psql{stdin},
+ '>',
+ \$psql{stdout},
+ '2>',
+ \$psql{stderr},
+ $psql_timeout);
+
+ my ($result, $stdout, $stderr, $n, $stderr_saved);
+ $n = 0;
+
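+		# Note: predicate_stable() below is declared IMMUTABLE but actually
+		# executes txid_current(), deliberately breaking the volatility
+		# contract to stress predicate handling during concurrent builds.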
+ $node->psql('postgres', q(CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN true;
+ END; $$;));
+
+ $node->psql('postgres', q(CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ RETURN MOD($1, 2) = 0;
+ END; $$;));
+ while (1)
+ {
+
+ if (int(rand(2)) == 0) {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=0);));
+ } else {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+ }
+ is($result, '0', 'ALTER TABLE is correct');
+
+ if (1)
+ {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+ is($result, '0', 'REINDEX is correct');
+
+ if ($result) {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_check('idx', heapallindexed => true, checkunique => true);));
+ is($result, '0', 'bt_index_check is correct');
+ if ($result)
+ {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ } else {
+ diag('#reindex:)' . $n++);
+ }
+ }
+
+ if (1)
+ {
+ my $variant = int(rand(7));
+ my $sql;
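+			# Index shapes under test: plain multi-column, partial with a
+			# (falsely) immutable predicate, partial with a plain expression,
+			# an expression column, combinations thereof, and a plain unique
+			# index.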
+ if ($variant == 0) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at););
+ } elsif ($variant == 1) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable(););
+ } elsif ($variant == 2) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;);
+ } elsif ($variant == 3) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i););
+ } elsif ($variant == 4) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i)););
+ } elsif ($variant == 5) {
+ $sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i););
+ } elsif ($variant == 6) {
+ $sql = q(CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+ } else { diag("#wrong variant"); }
+
+ diag('#' . $sql);
+ ($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+ is($result, '0', 'CREATE INDEX is correct');
+ $stderr_saved = $stderr;
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);));
+ is($result, '0', 'bt_index_check for new index is correct');
+ if ($result)
+ {
+ diag($stderr);
+ diag($stderr_saved);
+ BAIL_OUT($stderr);
+ } else {
+ diag('#create:)' . $n++);
+ }
+
+ if (1)
+ {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'REINDEX 2 is correct');
+ if ($result) {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);));
+ is($result, '0', 'bt_index_check 2 is correct');
+ if ($result)
+ {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ } else {
+ diag('#reindex2:)' . $n++);
+ }
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'DROP INDEX is correct');
+ }
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ last if $pg_bench_fork_flag ne "wait";
+ }
+
+ # explicitly shut down psql instances gracefully
+ $psql{stdin} .= "\\q\n";
+ $psql{run}->finish;
+
+ is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+ };
+
+ $child->finalize();
+ $child->summary();
+
+ shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+ waitpid($pid,0);
+ done_testing();
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+ my ($psql, $query, $untl) = @_;
+ my $ret;
+
+ # For each query we run, we'll restart the timeout. Otherwise the timeout
+ # would apply to the whole test script, and would need to be set very high
+ # to survive when running under Valgrind.
+ $psql_timeout->reset();
+ $psql_timeout->start();
+
+ # send query
+ $$psql{stdin} .= $query;
+ $$psql{stdin} .= "\n";
+
+ # wait for query results
+ $$psql{run}->pump_nb();
+ while (1)
+ {
+ last if $$psql{stdout} =~ /$untl/;
+ if ($psql_timeout->is_expired)
+ {
+ diag("aborting wait: program timed out\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ if (not $$psql{run}->pumpable())
+ {
+ diag("aborting wait: program died\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ $$psql{run}->pump();
+ }
+
+ $$psql{stdout} = '';
+
+ return 1;
+}
diff --git a/src/bin/pg_amcheck/t/007_concurrently_unique.pl b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
new file mode 100644
index 00000000000..67e2be3e33f
--- /dev/null
+++ b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
@@ -0,0 +1,239 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use Test::More;
+use Test::Builder;
+
+eval {
+ require IPC::SysV;
+ IPC::SysV->import(qw(IPC_CREAT IPC_EXCL S_IRUSR S_IWUSR));
+};
+
+if ($@ || $windows_os)
+{
+ plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key, $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = int(rand(1_000_000));	# shmget() expects an integer key
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test_unique');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 128MB');
+$node->append_conf('postgresql.conf', 'shared_buffers = 256MB');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+ c1 money default 0, c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i, updated_at)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+ # fork returned undef, so unsuccessful
+ die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+ # $node->psql('postgres', q(INSERT INTO tbl SELECT i,0,0,0,now() FROM generate_series(1, 1000) s(i);));
+ # while [ $? -eq 0 ]; do make -C src/bin/pg_amcheck/ check PROVE_TESTS='t/007_*' ; done
+
+ $node->pgbench(
+ '--no-vacuum --client=40 --exit-on-abort --transactions=1000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent INSERTs, UPDATES and RC',
+ {
+ # Ensure some HOT updates happen
+ '001_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*1000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ '002_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*100,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ '003_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*10000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ '004_pgbench_concurrent_transaction_updates' => q(
+ INSERT INTO tbl VALUES(random()*100000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+ ),
+ });
+
+ if ($child->is_passing()) {
+ shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+ } else {
+ shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+ }
+
+ my $pg_bench_fork_flag;
+ while (1) {
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+		usleep(100_000);	# core sleep() truncates 0.1 to 0; use Time::HiRes::usleep
+ last if $pg_bench_fork_flag eq "stop";
+ }
+} else {
+ my $pg_bench_fork_flag;
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+ subtest 'reindex run subtest' => sub {
+ is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+ my %psql = (stdin => '', stdout => '', stderr => '');
+ $psql{run} = IPC::Run::start(
+ [ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+ '<',
+ \$psql{stdin},
+ '>',
+ \$psql{stdout},
+ '2>',
+ \$psql{stderr},
+ $psql_timeout);
+
+ my ($result, $stdout, $stderr, $n, $stderr_saved);
+
+# ok(send_query_and_wait(\%psql, q[SELECT pg_sleep(10);], qr/^.*$/m), 'SELECT');
+
+ while (1)
+ {
+
+ if (int(rand(2)) == 0) {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+ } else {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=0);));
+ }
+ is($result, '0', 'ALTER TABLE is correct');
+
+
+ if (1)
+ {
+ my $sql = q(select pg_sleep(0); CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+ is($result, '0', 'CREATE INDEX is correct');
+ $stderr_saved = $stderr;
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);));
+ is($result, '0', 'bt_index_check for new index is correct');
+ if ($result)
+ {
+ diag($stderr);
+ diag($stderr_saved);
+ BAIL_OUT($stderr);
+ } else {
+ diag('#create:)' . $n++);
+ }
+
+ if (1)
+ {
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'REINDEX 2 is correct');
+ if ($result) {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);));
+ is($result, '0', 'bt_index_check 2 is correct');
+ if ($result)
+ {
+ diag($stderr);
+ BAIL_OUT($stderr);
+ } else {
+ diag('#reindex2:)' . $n++);
+ }
+ }
+
+ ($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+ is($result, '0', 'DROP INDEX is correct');
+ }
+ shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+ last if $pg_bench_fork_flag ne "wait";
+ }
+
+ # explicitly shut down psql instances gracefully
+ $psql{stdin} .= "\\q\n";
+ $psql{run}->finish;
+
+ is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+ };
+
+ $child->finalize();
+ $child->summary();
+
+ shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+ waitpid($pid,0);
+ done_testing();
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+ my ($psql, $query, $untl) = @_;
+ my $ret;
+
+ # For each query we run, we'll restart the timeout. Otherwise the timeout
+ # would apply to the whole test script, and would need to be set very high
+ # to survive when running under Valgrind.
+ $psql_timeout->reset();
+ $psql_timeout->start();
+
+ # send query
+ $$psql{stdin} .= $query;
+ $$psql{stdin} .= "\n";
+
+ # wait for query results
+ $$psql{run}->pump_nb();
+ while (1)
+ {
+ last if $$psql{stdout} =~ /$untl/;
+ if ($psql_timeout->is_expired)
+ {
+ diag("aborting wait: program timed out\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ if (not $$psql{run}->pumpable())
+ {
+ diag("aborting wait: program died\n"
+ . "stream contents: >>$$psql{stdout}<<\n"
+ . "pattern searched for: $untl\n");
+ return 0;
+ }
+ $$psql{run}->pump();
+ }
+
+ $$psql{stdout} = '';
+
+ return 1;
+}
--
2.43.0
Attachment: v5-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch (application/x-patch)
From d435fe63303485e68e197b3dc6e571065eb6863b Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v5 4/5] Allow snapshot resets during parallel concurrent index
builds
Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.
Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
proceeding with scan
- Add regression tests to verify behavior with various index types
The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.
This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
src/backend/access/brin/brin.c | 43 +++++++++-------
src/backend/access/heap/heapam_handler.c | 12 +++--
src/backend/access/nbtree/nbtsort.c | 38 ++++++++++++--
src/backend/access/table/tableam.c | 37 ++++++++++++--
src/backend/access/transam/parallel.c | 50 +++++++++++++++++--
src/backend/executor/nodeSeqscan.c | 3 +-
src/backend/utils/time/snapmgr.c | 8 ---
src/include/access/parallel.h | 3 +-
src/include/access/relscan.h | 1 +
src/include/access/tableam.h | 9 ++--
.../expected/cic_reset_snapshots.out | 23 ++++++++-
.../sql/cic_reset_snapshots.sql | 7 ++-
12 files changed, 178 insertions(+), 56 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d69859ac4df..0782bd64a6a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
*/
BrinShared *brinshared;
Sharedsort *sharedsort;
- Snapshot snapshot;
WalUsage *walusage;
BufferUsage *bufferusage;
} BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
bool isconcurrent, int request);
static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
static double _brin_parallel_heapscan(BrinBuildState *state);
static double _brin_parallel_merge(BrinBuildState *state);
static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
ParallelContext *pcxt;
int scantuplesortstates;
- Snapshot snapshot;
Size estbrinshared;
Size estsort;
BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
- * concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+ * concurrent build, we take a regular MVCC snapshot and push it as active.
+ * Later we index whatever's live according to that snapshot while that
+ * snapshot is reset periodically.
*/
if (!isconcurrent)
{
Assert(ActiveSnapshotSet());
- snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
else
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ Assert(!ActiveSnapshotSet());
PushActiveSnapshot(GetTransactionSnapshot());
}
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
*/
- estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+ estbrinshared = _brin_parallel_estimate_shared(heap);
shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
estsort = tuplesort_estimate_shared(scantuplesortstates);
shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
- UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
table_parallelscan_initialize(heap,
ParallelTableScanFromBrinShared(brinshared),
- snapshot);
+ isconcurrent ? InvalidSnapshot : SnapshotAny,
+ isconcurrent);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->nparticipanttuplesorts++;
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
- brinleader->snapshot = snapshot;
brinleader->walusage = walusage;
brinleader->bufferusage = bufferusage;
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* Save leader state now that it's clear build will be parallel */
buildstate->bs_leader = brinleader;
+ /*
+	 * In a concurrent build, snapshots are reset periodically. When the
+	 * leader is going to reset its own active snapshot as well, we must
+	 * wait until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
- /* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(brinleader->snapshot))
- UnregisterSnapshot(brinleader->snapshot);
DestroyParallelContext(brinleader->pcxt);
ExitParallelMode();
}
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
/*
* Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
*/
static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
{
/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
return add_size(BUFFERALIGN(sizeof(BrinShared)),
- table_parallelscan_estimate(heap, snapshot));
+ table_parallelscan_estimate(heap, InvalidSnapshot));
}
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 980c51e32b9..2e5163609c1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, or during bootstrap, we take a regular MVCC snapshot
- * and index whatever's live according to that.
+	 * and index whatever's live according to that, while the snapshot is
+	 * reset every so often (for a non-unique index).
*/
OldestXmin = InvalidTransactionId;
/*
* For unique index we need consistent snapshot for the whole scan.
- * In case of parallel scan some additional infrastructure required
- * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
*/
reset_snapshots = indexInfo->ii_Concurrent &&
!indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
- PushActiveSnapshot(snapshot);
- need_pop_active_snapshot = true;
+ if (!reset_snapshots)
+ {
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
+ }
}
hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5c4581afb1a..2acbf121745 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool reset_snapshot;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
scantuplesortstates = leaderparticipates ? request + 1 : request;
+ /*
+ * For concurrent non-unique index builds, we can periodically reset snapshots
+ * to allow the xmin horizon to advance. This is safe since these builds don't
+ * require a consistent view across the entire scan. Unique indexes still need
+ * a stable snapshot to properly enforce uniqueness constraints.
+ */
+ reset_snapshot = isconcurrent && !btspool->isunique;
+
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+	 * live according to that, while the snapshot may be reset periodically
+	 * for a non-unique index.
*/
if (!isconcurrent)
{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
+ else if (reset_snapshot)
+ {
+ snapshot = InvalidSnapshot;
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
else
{
snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->brokenhotchain = false;
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
- snapshot);
+ snapshot,
+ reset_snapshot);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* Save leader state now that it's clear build will be parallel */
buildstate->btleader = btleader;
+ /*
+ * In a concurrent build, snapshots are going to be reset periodically. When
+ * the leader is going to reset its own active snapshot as well, we must wait
+ * until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(btleader->snapshot))
+ if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
UnregisterSnapshot(btleader->snapshot);
DestroyParallelContext(btleader->pcxt);
ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
{
Size sz = 0;
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
sz = add_size(sz, EstimateSnapshotSpace(snapshot));
else
- Assert(snapshot == SnapshotAny);
+ Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
void
table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
- Snapshot snapshot)
+ Snapshot snapshot, bool reset_snapshot)
{
Size snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
pscan->phs_snapshot_off = snapshot_off;
- if (IsMVCCSnapshot(snapshot))
+ /*
+ * Initialize parallel scan description. For normal scans with a regular
+ * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+ * snapshot resets, mark the scan accordingly.
+ */
+ if (reset_snapshot)
+ {
+ Assert(snapshot == InvalidSnapshot);
+ pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = true;
+ INJECTION_POINT("table_parallelscan_initialize");
+ }
+ else if (IsMVCCSnapshot(snapshot))
{
SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = false;
}
else
{
Assert(snapshot == SnapshotAny);
+ Assert(!reset_snapshot);
pscan->phs_snapshot_any = true;
+ pscan->phs_reset_snapshot = false;
}
}
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
- if (!pscan->phs_snapshot_any)
+ /*
+ * For scans that use periodic snapshot resets, mark the scan accordingly
+ * and use the active snapshot as the initial state.
+ */
+ if (pscan->phs_reset_snapshot)
+ {
+ Assert(ActiveSnapshotSet());
+ flags |= SO_RESET_SNAPSHOT;
+ /* Start with current active snapshot. */
+ snapshot = GetActiveSnapshot();
+ }
+ else if (!pscan->phs_snapshot_any)
{
/* Snapshot was serialized -- restore it */
snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
#define PARALLEL_KEY_RELMAPPER_STATE UINT64CONST(0xFFFFFFFFFFFF000D)
#define PARALLEL_KEY_UNCOMMITTEDENUMS UINT64CONST(0xFFFFFFFFFFFF000E)
#define PARALLEL_KEY_CLIENTCONNINFO UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED UINT64CONST(0xFFFFFFFFFFFF0010)
/* Fixed-size parallel state. */
typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+ pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate how much we'll need for the entrypoint info. */
shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
char *entrypointstate;
char *uncommittedenumsspace;
char *clientconninfospace;
+ bool *snapshot_set_flag_space;
Size lnamelen;
/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
strcpy(entrypointstate, pcxt->library_name);
strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+ /*
+ * Establish dynamic shared memory to report whether each worker has
+ * restored its initial snapshot.
+ */
+ snapshot_set_flag_space =
+ shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i * sizeof(bool);
+ *pcxt->worker[i].snapshot_restored = false;
+ }
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
}
/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
}
}
+
+ /* Set snapshot restored flag to false. */
+ if (pcxt->nworkers > 0)
+ {
+ bool *snapshot_restored_space;
+ int i;
+ snapshot_restored_space =
+ shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ for (i = 0; i < pcxt->nworkers; ++i)
+ snapshot_restored_space[i] = false;
+ }
}
/*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* Wait for all workers to attach to their error queues, and throw an error if
* any worker fails to do this.
*
+ * If wait_for_snapshot is true, additionally wait until every parallel worker
+ * has restored its snapshot. This is needed when periodic snapshot resets are
+ * used, to ensure all workers have a valid initial snapshot before the scan
+ * proceeds.
+ *
* Callers can assume that if this function returns successfully, then the
* number of workers given by pcxt->nworkers_launched have initialized and
* attached to their error queues. Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* call this function at all.
*/
void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
{
int i;
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
if (shm_mq_get_sender(mq) != NULL)
{
- /* Yes, so it is known to be attached. */
- pcxt->known_attached_workers[i] = true;
- ++pcxt->nknown_attached_workers;
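+ /* If requested, additionally require that the worker's snapshot is restored. */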
+ if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+ {
+ /* Yes, so it is known to be attached. */
+ pcxt->known_attached_workers[i] = true;
+ ++pcxt->nknown_attached_workers;
+ }
}
}
else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
shm_toc *toc;
FixedParallelState *fps;
char *error_queue_space;
+ bool *snapshot_restored_space;
shm_mq *mq;
shm_mq_handle *mqh;
char *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
fps->parallel_leader_pgproc);
PushActiveSnapshot(asnapshot);
+ /* Snapshot is restored; set the flag to let the leader know. */
+ snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ snapshot_restored_space[ParallelWorkerNumber] = true;
+
/*
* We've changed which tuples we can see, and must therefore invalidate
* system caches.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
table_parallelscan_initialize(node->ss.ss_currentRelation,
pscan,
- estate->es_snapshot);
+ estate->es_snapshot,
+ false);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3a7357a050d..148e1982cad 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -291,14 +291,6 @@ GetTransactionSnapshot(void)
Snapshot
GetLatestSnapshot(void)
{
- /*
- * We might be able to relax this, but nothing that could otherwise work
- * needs it.
- */
- if (IsInParallelMode())
- elog(ERROR,
- "cannot update SecondarySnapshot during a parallel operation");
-
/*
* So far there are no cases requiring support for GetLatestSnapshot()
* during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
{
BackgroundWorkerHandle *bgwhandle;
shm_mq_handle *error_mqh;
+ bool *snapshot_restored;
} ParallelWorkerInfo;
typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
extern void DestroyParallelContext(ParallelContext *pcxt);
extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e1884acf493..a9603084aeb 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -88,6 +88,7 @@ typedef struct ParallelTableScanDescData
RelFileLocator phs_locator; /* physical relation to scan */
bool phs_syncscan; /* report location to syncscan logic? */
bool phs_snapshot_any; /* SnapshotAny, not phs_snapshot_data? */
+ bool phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
Size phs_snapshot_off; /* data for snapshot */
} ParallelTableScanDescData;
typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f4c7d2a92bf..9ee5ea15fd4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1184,7 +1184,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
*/
extern void table_parallelscan_initialize(Relation rel,
ParallelTableScanDesc pscan,
- Snapshot snapshot);
+ Snapshot snapshot,
+ bool reset_snapshot);
/*
* Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1802,9 +1803,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
*
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan. That swaps snapshots on the fly, allowing the xmin horizon to
+ * advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 4cfbbb05923..49ef68d9071 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
(1 row)
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,27 +78,40 @@ NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep the test stable, since a parallel worker may complete the scan before the leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach
+-------------------------
+
+(1 row)
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
NOTICE: drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 4fef5a47431..5d1c31493f0 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
SELECT injection_points_set_local();
SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep the test stable, since a parallel worker may complete the scan before the leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -79,4 +82,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
--
2.43.0
Hello!
After the fix in [0], I simplified the stress tests to a single pgbench run without any forks.
[0]: https://commitfest.postgresql.org/51/5439/
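To make that concrete, the run looks roughly like this (a minimal sketch; the database name, scale factor, durations, and index definition here are illustrative assumptions, not the exact script):

createdb cic_test
pgbench -i -s 100 cic_test

# UPDATE-heavy workload in the background, no explicit vacuuming.
pgbench -n -c 8 -T 600 -b simple-update cic_test &

# Non-unique concurrent build, eligible for periodic snapshot resets.
psql cic_test -c "CREATE INDEX CONCURRENTLY idx_abalance ON pgbench_accounts (abalance);" &

# While the build runs: with the patch, backend_xmin of the CIC backend
# keeps advancing instead of staying pinned for the whole heap scan.
psql cic_test -c "SELECT pid, backend_xmin FROM pg_stat_activity WHERE query LIKE 'CREATE INDEX CONCURRENTLY%';"

wait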
Attachments:
v6-0005-Allow-snapshot-resets-during-parallel-concurrent-.patch
From 15d61bbb64e5f8e418594d1ea6b50ceb9c65d9d1 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v6 5/6] Allow snapshot resets during parallel concurrent index
builds
Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.
Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
proceeding with scan
- Add regression tests to verify behavior with various index types
The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.
This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
src/backend/access/brin/brin.c | 43 +++++++++-------
src/backend/access/heap/heapam_handler.c | 12 +++--
src/backend/access/nbtree/nbtsort.c | 38 ++++++++++++--
src/backend/access/table/tableam.c | 37 ++++++++++++--
src/backend/access/transam/parallel.c | 50 +++++++++++++++++--
src/backend/executor/nodeSeqscan.c | 3 +-
src/backend/utils/time/snapmgr.c | 8 ---
src/include/access/parallel.h | 3 +-
src/include/access/relscan.h | 1 +
src/include/access/tableam.h | 9 ++--
.../expected/cic_reset_snapshots.out | 23 ++++++++-
.../sql/cic_reset_snapshots.sql | 7 ++-
12 files changed, 178 insertions(+), 56 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d69859ac4df..0782bd64a6a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
*/
BrinShared *brinshared;
Sharedsort *sharedsort;
- Snapshot snapshot;
WalUsage *walusage;
BufferUsage *bufferusage;
} BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
bool isconcurrent, int request);
static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
static double _brin_parallel_heapscan(BrinBuildState *state);
static double _brin_parallel_merge(BrinBuildState *state);
static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
ParallelContext *pcxt;
int scantuplesortstates;
- Snapshot snapshot;
Size estbrinshared;
Size estsort;
BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
- * concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+ * concurrent build, we take a regular MVCC snapshot and push it as active.
+ * We then index whatever's live according to that snapshot, while the
+ * snapshot is reset periodically.
*/
if (!isconcurrent)
{
Assert(ActiveSnapshotSet());
- snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
else
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ Assert(!ActiveSnapshotSet());
PushActiveSnapshot(GetTransactionSnapshot());
}
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
*/
- estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+ estbrinshared = _brin_parallel_estimate_shared(heap);
shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
estsort = tuplesort_estimate_shared(scantuplesortstates);
shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
- UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
table_parallelscan_initialize(heap,
ParallelTableScanFromBrinShared(brinshared),
- snapshot);
+ isconcurrent ? InvalidSnapshot : SnapshotAny,
+ isconcurrent);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->nparticipanttuplesorts++;
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
- brinleader->snapshot = snapshot;
brinleader->walusage = walusage;
brinleader->bufferusage = bufferusage;
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* Save leader state now that it's clear build will be parallel */
buildstate->bs_leader = brinleader;
+ /*
+ * In a concurrent build, snapshots are going to be reset periodically. When
+ * the leader is going to reset its own active snapshot as well, we must wait
+ * until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
- /* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(brinleader->snapshot))
- UnregisterSnapshot(brinleader->snapshot);
DestroyParallelContext(brinleader->pcxt);
ExitParallelMode();
}
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
/*
* Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
*/
static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
{
/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
return add_size(BUFFERALIGN(sizeof(BrinShared)),
- table_parallelscan_estimate(heap, snapshot));
+ table_parallelscan_estimate(heap, InvalidSnapshot));
}
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 980c51e32b9..2e5163609c1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, or during bootstrap, we take a regular MVCC snapshot
- * and index whatever's live according to that.
+ * and index whatever's live according to that, while the snapshot is reset
+ * every so often (in the case of a non-unique index).
*/
OldestXmin = InvalidTransactionId;
/*
* For a unique index we need a consistent snapshot for the whole scan.
- * For a parallel scan, some additional infrastructure is required to
- * perform the scan with SO_RESET_SNAPSHOT, which is not yet ready.
*/
reset_snapshots = indexInfo->ii_Concurrent &&
!indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
- PushActiveSnapshot(snapshot);
- need_pop_active_snapshot = true;
+ if (!reset_snapshots)
+ {
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
+ }
}
hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5c4581afb1a..2acbf121745 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool reset_snapshot;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
scantuplesortstates = leaderparticipates ? request + 1 : request;
+ /*
+ * For concurrent non-unique index builds, we can periodically reset snapshots
+ * to allow the xmin horizon to advance. This is safe since these builds don't
+ * require a consistent view across the entire scan. Unique indexes still need
+ * a stable snapshot to properly enforce uniqueness constraints.
+ */
+ reset_snapshot = isconcurrent && !btspool->isunique;
+
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+ * live according to that, while the snapshot may be reset periodically in
+ * the case of a non-unique index.
*/
if (!isconcurrent)
{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
+ else if (reset_snapshot)
+ {
+ snapshot = InvalidSnapshot;
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
else
{
snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->brokenhotchain = false;
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
- snapshot);
+ snapshot,
+ reset_snapshot);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* Save leader state now that it's clear build will be parallel */
buildstate->btleader = btleader;
+ /*
+ * In a concurrent build, snapshots are going to be reset periodically. When
+ * the leader is going to reset its own active snapshot as well, we must wait
+ * until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(btleader->snapshot))
+ if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
UnregisterSnapshot(btleader->snapshot);
DestroyParallelContext(btleader->pcxt);
ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
{
Size sz = 0;
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
sz = add_size(sz, EstimateSnapshotSpace(snapshot));
else
- Assert(snapshot == SnapshotAny);
+ Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
void
table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
- Snapshot snapshot)
+ Snapshot snapshot, bool reset_snapshot)
{
Size snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
pscan->phs_snapshot_off = snapshot_off;
- if (IsMVCCSnapshot(snapshot))
+ /*
+ * Initialize parallel scan description. For normal scans with a regular
+ * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+ * snapshot resets, mark the scan accordingly.
+ */
+ if (reset_snapshot)
+ {
+ Assert(snapshot == InvalidSnapshot);
+ pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = true;
+ INJECTION_POINT("table_parallelscan_initialize");
+ }
+ else if (IsMVCCSnapshot(snapshot))
{
SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = false;
}
else
{
Assert(snapshot == SnapshotAny);
+ Assert(!reset_snapshot);
pscan->phs_snapshot_any = true;
+ pscan->phs_reset_snapshot = false;
}
}
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
- if (!pscan->phs_snapshot_any)
+ /*
+ * For scans that use periodic snapshot resets, mark the scan accordingly
+ * and use the active snapshot as the initial state.
+ */
+ if (pscan->phs_reset_snapshot)
+ {
+ Assert(ActiveSnapshotSet());
+ flags |= SO_RESET_SNAPSHOT;
+ /* Start with current active snapshot. */
+ snapshot = GetActiveSnapshot();
+ }
+ else if (!pscan->phs_snapshot_any)
{
/* Snapshot was serialized -- restore it */
snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
#define PARALLEL_KEY_RELMAPPER_STATE UINT64CONST(0xFFFFFFFFFFFF000D)
#define PARALLEL_KEY_UNCOMMITTEDENUMS UINT64CONST(0xFFFFFFFFFFFF000E)
#define PARALLEL_KEY_CLIENTCONNINFO UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED UINT64CONST(0xFFFFFFFFFFFF0010)
/* Fixed-size parallel state. */
typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+ pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate how much we'll need for the entrypoint info. */
shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
char *entrypointstate;
char *uncommittedenumsspace;
char *clientconninfospace;
+ bool *snapshot_set_flag_space;
Size lnamelen;
/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
strcpy(entrypointstate, pcxt->library_name);
strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+ /*
+ * Establish dynamic shared memory to report whether each worker has
+ * restored its initial snapshot.
+ */
+ snapshot_set_flag_space =
+ shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i * sizeof(bool);
+ *pcxt->worker[i].snapshot_restored = false;
+ }
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
}
/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
}
}
+
+ /* Set snapshot restored flag to false. */
+ if (pcxt->nworkers > 0)
+ {
+ bool *snapshot_restored_space;
+ int i;
+ snapshot_restored_space =
+ shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ for (i = 0; i < pcxt->nworkers; ++i)
+ snapshot_restored_space[i] = false;
+ }
}
/*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* Wait for all workers to attach to their error queues, and throw an error if
* any worker fails to do this.
*
+ * If wait_for_snapshot is true, additionally wait until every parallel worker
+ * has restored its snapshot. This is needed when periodic snapshot resets are
+ * used, to ensure all workers have a valid initial snapshot before the scan
+ * proceeds.
+ *
* Callers can assume that if this function returns successfully, then the
* number of workers given by pcxt->nworkers_launched have initialized and
* attached to their error queues. Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* call this function at all.
*/
void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
{
int i;
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
if (shm_mq_get_sender(mq) != NULL)
{
- /* Yes, so it is known to be attached. */
- pcxt->known_attached_workers[i] = true;
- ++pcxt->nknown_attached_workers;
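+ /* If requested, additionally require that the worker's snapshot is restored. */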
+ if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+ {
+ /* Yes, so it is known to be attached. */
+ pcxt->known_attached_workers[i] = true;
+ ++pcxt->nknown_attached_workers;
+ }
}
}
else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
shm_toc *toc;
FixedParallelState *fps;
char *error_queue_space;
+ bool *snapshot_restored_space;
shm_mq *mq;
shm_mq_handle *mqh;
char *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
fps->parallel_leader_pgproc);
PushActiveSnapshot(asnapshot);
+ /* Snapshot is restored; set the flag to let the leader know. */
+ snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ snapshot_restored_space[ParallelWorkerNumber] = true;
+
/*
* We've changed which tuples we can see, and must therefore invalidate
* system caches.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
table_parallelscan_initialize(node->ss.ss_currentRelation,
pscan,
- estate->es_snapshot);
+ estate->es_snapshot,
+ false);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 2189bf0d9ae..b3cc7a2c150 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -287,14 +287,6 @@ GetTransactionSnapshot(void)
Snapshot
GetLatestSnapshot(void)
{
- /*
- * We might be able to relax this, but nothing that could otherwise work
- * needs it.
- */
- if (IsInParallelMode())
- elog(ERROR,
- "cannot update SecondarySnapshot during a parallel operation");
-
/*
* So far there are no cases requiring support for GetLatestSnapshot()
* during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
{
BackgroundWorkerHandle *bgwhandle;
shm_mq_handle *error_mqh;
+ bool *snapshot_restored;
} ParallelWorkerInfo;
typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
extern void DestroyParallelContext(ParallelContext *pcxt);
extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e1884acf493..a9603084aeb 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -88,6 +88,7 @@ typedef struct ParallelTableScanDescData
RelFileLocator phs_locator; /* physical relation to scan */
bool phs_syncscan; /* report location to syncscan logic? */
bool phs_snapshot_any; /* SnapshotAny, not phs_snapshot_data? */
+ bool phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
Size phs_snapshot_off; /* data for snapshot */
} ParallelTableScanDescData;
typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f4c7d2a92bf..9ee5ea15fd4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1184,7 +1184,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
*/
extern void table_parallelscan_initialize(Relation rel,
ParallelTableScanDesc pscan,
- Snapshot snapshot);
+ Snapshot snapshot,
+ bool reset_snapshot);
/*
* Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1802,9 +1803,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
*
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan. That swaps snapshots on the fly, allowing the xmin horizon to
+ * advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 4cfbbb05923..49ef68d9071 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
(1 row)
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,27 +78,40 @@ NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep the test stable, since a parallel worker may complete the scan before the leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach
+-------------------------
+
+(1 row)
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
NOTICE: drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 4fef5a47431..5d1c31493f0 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
SELECT injection_points_set_local();
SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep the test stable, since a parallel worker may complete the scan before the leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -79,4 +82,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
--
2.43.0
v6-0004-Allow-advancing-xmin-during-non-unique-non-parall.patch
From e85b568a1a8d39ab24bd21bef90d546fce61a726 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v6 4/6] Allow advancing xmin during non-unique, non-parallel
concurrent index builds by periodically resetting snapshots
Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.
This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.
Currently, this technique is applied to:
- Non-parallel index builds: parallel index builds are not yet supported and will be addressed in a future commit.
- Non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; support for them may be added in the future.
- Only the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness.
To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.
Regression tests are added to verify the behavior.
---
contrib/amcheck/verify_nbtree.c | 3 +-
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/brin/brin.c | 14 +++
src/backend/access/heap/heapam.c | 46 ++++++++
src/backend/access/heap/heapam_handler.c | 57 ++++++++--
src/backend/access/index/genam.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 14 +++
src/backend/catalog/index.c | 30 +++++-
src/backend/commands/indexcmds.c | 14 +--
src/backend/optimizer/plan/planner.c | 9 ++
src/include/access/tableam.h | 28 ++++-
src/test/modules/injection_points/Makefile | 2 +-
.../expected/cic_reset_snapshots.out | 102 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../sql/cic_reset_snapshots.sql | 82 ++++++++++++++
15 files changed, 375 insertions(+), 31 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK? */
+ true, /* syncscan OK? */
+ false); /* reset snapshots? */
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
errmsg("only heap AM is supported")));
/* Disable syncscan because we assume we scan from block zero upwards */
- scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+ scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
hscan = (HeapScanDesc) scan;
InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3aedec882cd..d69859ac4df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_brin_end_parallel(brinleader, NULL);
return;
}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..1fdfdf96482 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
#include "utils/datum.h"
#include "utils/inval.h"
#include "utils/spccache.h"
+#include "utils/injection_point.h"
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure active snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ PopActiveSnapshot();
+
+ sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+ Assert(!HaveRegisteredOrActiveSnapshot());
+ InvalidateCatalogSnapshot();
+
+ /* The goal of the snapshot reset is to allow the horizon to advance. */
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+ /* In some cases it is still not possible due to xid assignment. */
+ if (!TransactionIdIsValid(MyProc->xid))
+ INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+ PushActiveSnapshot(GetLatestSnapshot());
+ sscan->rs_snapshot = GetActiveSnapshot();
+}
+
/*
* heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
*
@@ -607,7 +638,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
if (BufferIsValid(scan->rs_cbuf))
+ {
scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
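+ /*
+ * With SO_RESET_SNAPSHOT, swap in a fresh snapshot every so many pages
+ * so this backend's xmin does not pin the horizon for the whole scan.
+ */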
+ if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+ (scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+ heap_reset_scan_snapshot((TableScanDesc) scan);
+ }
}
/*
@@ -1233,6 +1270,15 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_parallelworkerdata != NULL)
pfree(scan->rs_parallelworkerdata);
+ if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+ {
+ Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ }
+
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..980c51e32b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
ExprContext *econtext;
Snapshot snapshot;
bool need_unregister_snapshot = false;
+ bool need_pop_active_snapshot = false;
+ bool reset_snapshots = false;
TransactionId OldestXmin;
BlockNumber previous_blkno = InvalidBlockNumber;
BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
- /* Set up execution state for predicate, if any. */
- predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
*/
OldestXmin = InvalidTransactionId;
+ /*
+ * For a unique index we need a consistent snapshot for the whole scan.
+ * A parallel scan would also need additional infrastructure to run with
+ * SO_RESET_SNAPSHOT, which is not yet in place.
+ */
+ reset_snapshots = indexInfo->ii_Concurrent &&
+ !indexInfo->ii_Unique &&
+ !is_system_catalog; /* just in case */
+
/* okay to ignore lazy VACUUMs here */
if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
{
/*
* Serial index build.
- *
- * Must begin our own heap scan in this case. We may also need to
- * register a snapshot whose lifetime is under our direct control.
*/
if (!TransactionIdIsValid(OldestXmin))
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- need_unregister_snapshot = true;
+ snapshot = GetTransactionSnapshot();
+ /*
+ * Must begin our own heap scan in this case. We may also need to
+ * register a snapshot whose lifetime is under our direct control.
+ * When the snapshot is reset during the scan, registration is not
+ * allowed because the snapshot is going to be replaced every so
+ * often.
+ */
+ if (!reset_snapshots)
+ {
+ snapshot = RegisterSnapshot(snapshot);
+ need_unregister_snapshot = true;
+ }
+ Assert(!ActiveSnapshotSet());
+ PushActiveSnapshot(snapshot);
+ /* store a link to the active snapshot because it may have been copied */
+ snapshot = GetActiveSnapshot();
+ need_pop_active_snapshot = true;
}
else
+ {
+ Assert(!indexInfo->ii_Concurrent);
snapshot = SnapshotAny;
+ }
scan = table_beginscan_strat(heapRelation, /* relation */
snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- allow_sync); /* syncscan OK? */
+ allow_sync, /* syncscan OK? */
+ reset_snapshots /* reset snapshots? */);
}
else
{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
}
hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
!TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);
+ Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+ /* Clear the snapshot reference, since it may be replaced by the scan itself. */
+ if (reset_snapshots)
+ snapshot = InvalidSnapshot;
/* Publish number of blocks to scan */
if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
table_endscan(scan);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
/* we can now forget our snapshot, if set and registered by us */
if (need_unregister_snapshot)
UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- false); /* syncscan not OK */
+ false, /* syncscan not OK */
+ false);
hscan = (HeapScanDesc) scan;
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4b4ebff6a17..a104ba9df74 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
*/
sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
nkeys, key,
- true, false);
+ true, false, false);
sysscan->iscan = NULL;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 17a352d040c..5c4581afb1a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ }
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_bt_end_parallel(btleader);
return;
}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 05dc6add7eb..e0ada5ce159 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+#include "storage/proc.h"
/* Potentially set by pg_upgrade_support functions */
Oid binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
Relation indexRelation;
IndexInfo *indexInfo;
- /* This had better make sure that a snapshot is active */
- Assert(ActiveSnapshotSet());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ Assert(!TransactionIdIsValid(MyProc->xid));
/* Open and lock the parent heap relation */
heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
indexRelation = index_open(indexRelationId, RowExclusiveLock);
+ /* BuildIndexInfo may require a snapshot for expressions and predicates */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* We have to re-build the IndexInfo struct, since it was lost in the
* commit of the transaction where this concurrent index was created at
* the catalog level.
*/
indexInfo = BuildIndexInfo(indexRelation);
+ /* Done with snapshot */
+ PopActiveSnapshot();
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/* Now build the index */
index_build(heapRel, indexRelation, indexInfo, false, true);
+ /* Invalidate the catalog snapshot just for the assert below */
+ InvalidateCatalogSnapshot();
+ Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
table_close(heapRel, NoLock);
index_close(indexRelation, NoLock);
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
* insert new entries into the index for insertions and non-HOT updates.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+ /* we can do away with our snapshot */
+ PopActiveSnapshot();
}
/*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK */
+ true, /* syncscan OK */
+ false);
while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
* as of the start of the scan (see table_index_build_scan), whereas a normal
* build takes care to include recently-dead tuples. This is OK because
* we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone. The reason for doing that is to avoid
+ * to see those tuples are gone. One of the reasons for doing that is to avoid
* bogus unique-index failures due to concurrent UPDATEs (we might see
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
+ * Furthermore, in the case of a non-unique index we set SO_RESET_SNAPSHOT
+ * for the scan, which causes a fresh snapshot to be set as active every so
+ * often. The reason for that is to let the xmin horizon advance.
+ *
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
* transactions that could have been modifying the table to terminate. Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
- * We now take a new snapshot, and build the index using all tuples that
- * are visible in this snapshot. We can be sure that any HOT updates to
+ * We build the index using all tuples that are visible under a single or
+ * periodically refreshed snapshot. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
* rolled back. Thus, each visible tuple is either the end of its
* HOT-chain or the extension of the chain is HOT-safe for this index.
*/
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/* Perform concurrent build of index */
index_concurrently_build(tableId, indexRelationId);
- /* we can do away with our snapshot */
- PopActiveSnapshot();
-
/*
* Commit this transaction to make the indisready update visible.
*/
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
if (newidx->safe)
set_indexsafe_procflags();
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/* Perform concurrent build of new index */
index_concurrently_build(newidx->tableId, newidx->indexId);
- PopActiveSnapshot();
CommitTransactionCommand();
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f3856c519f6..5c7514c96ac 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
/* GUC parameters */
double cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6779,6 +6780,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
Relation heap;
Relation index;
RelOptInfo *rel;
+ bool need_pop_active_snapshot = false;
int parallel_workers;
BlockNumber heap_blocks;
double reltuples;
@@ -6834,6 +6836,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
heap = table_open(tableOid, NoLock);
index = index_open(indexOid, NoLock);
+ /* Set ActiveSnapshot since functions in the indexes may need it */
+ if (!ActiveSnapshotSet())
+ {
+ PushActiveSnapshot(GetTransactionSnapshot());
+ need_pop_active_snapshot = true;
+ }
/*
* Determine if it's safe to proceed.
*
@@ -6891,6 +6898,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
parallel_workers--;
done:
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
index_close(index, NoLock);
table_close(heap, NoLock);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..f4c7d2a92bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
+#include "utils/injection_point.h"
#define DEFAULT_TABLE_ACCESS_METHOD "heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
* needed. If table data may be needed, set SO_NEED_TUPLES.
*/
SO_NEED_TUPLES = 1 << 10,
+ /*
+ * Reset the scan and catalog snapshots every so often? If so, every
+ * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped,
+ * the catalog snapshot is invalidated, and the latest snapshot is
+ * pushed as active.
+ *
+ * At the end of the scan the snapshot is not popped.
+ * The goal of this mode is to keep the xmin horizon advancing.
+ *
+ * See heap_reset_scan_snapshot for details.
+ */
+ SO_RESET_SNAPSHOT = 1 << 11,
} ScanOptions;
/*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
static inline TableScanDesc
table_beginscan_strat(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
- bool allow_strat, bool allow_sync)
+ bool allow_strat, bool allow_sync,
+ bool reset_snapshot)
{
uint32 flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
flags |= SO_ALLOW_STRAT;
if (allow_sync)
flags |= SO_ALLOW_SYNC;
+ if (reset_snapshot)
+ {
+ INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+ /* Active snapshot is required on start. */
+ Assert(GetActiveSnapshot() == snapshot);
+ /* The active snapshot must not be registered, so that xmin can keep advancing. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ flags |= (SO_RESET_SNAPSHOT);
+ }
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1779,6 +1801,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* very hard to detect whether they're really incompatible with the chain tip.
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
+ *
+ * In the case of a non-unique, non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. That changes snapshots
+ * on the fly to allow the xmin horizon to advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
DATA = injection_points--1.0.sql
PGFILEDESC = "injection_points - facility for injection points"
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..4cfbbb05923
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,102 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local
+----------------------------
+
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
'sql': [
'injection_points',
'reindex_conc',
+ 'cic_reset_snapshots',
],
'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
# The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..4fef5a47431
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,82 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
--
2.43.0
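As a side note for reviewers: below is a minimal sketch (not part of the
patch) of how a caller is expected to drive the new table_beginscan_strat()
parameter, mirroring the serial path in heapam_index_build_range_scan. The
relation handling around it is illustrative only; tableOid is assumed to
identify the heap being scanned.

    Relation      heapRel = table_open(tableOid, ShareUpdateExclusiveLock);
    TableScanDesc scan;
    Snapshot      snap;

    /*
     * Deliberately NOT registered: SO_RESET_SNAPSHOT relies on the snapshot
     * staying unregistered so that replacing it between pages can move the
     * backend's xmin.
     */
    snap = GetTransactionSnapshot();
    PushActiveSnapshot(snap);
    snap = GetActiveSnapshot();     /* PushActiveSnapshot may have copied it */

    scan = table_beginscan_strat(heapRel, snap,
                                 0, NULL,   /* no scan keys */
                                 true,      /* buffer access strategy OK */
                                 true,      /* syncscan OK */
                                 true);     /* reset snapshots? */

    /* ... fetch tuples with table_scan_getnextslot() ... */

    table_endscan(scan);
    PopActiveSnapshot();            /* the scan may have swapped the snapshot */
    table_close(heapRel, NoLock);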
v6-0002-this-is-https-commitfest.postgresql.org-50-5160-m.patch
From 12efb82206cee7843bf17ccabacc91435d0bac5a Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v6 2/6] this is https://commitfest.postgresql.org/50/5160/
merged into a single commit; it is required for the stability of the stress tests.
---
src/backend/commands/indexcmds.c | 4 +-
src/backend/executor/execIndexing.c | 3 +
src/backend/executor/execPartition.c | 119 ++++++++-
src/backend/executor/nodeModifyTable.c | 2 +
src/backend/optimizer/util/plancat.c | 135 +++++++---
src/backend/utils/time/snapmgr.c | 2 +
src/test/modules/injection_points/Makefile | 7 +-
.../expected/index_concurrently_upsert.out | 80 ++++++
.../index_concurrently_upsert_predicate.out | 80 ++++++
.../expected/reindex_concurrently_upsert.out | 238 ++++++++++++++++++
...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 11 +
.../specs/index_concurrently_upsert.spec | 68 +++++
.../index_concurrently_upsert_predicate.spec | 70 ++++++
.../specs/reindex_concurrently_upsert.spec | 86 +++++++
...dex_concurrently_upsert_on_constraint.spec | 86 +++++++
...index_concurrently_upsert_partitioned.spec | 88 +++++++
18 files changed, 1505 insertions(+), 50 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
* before the reference snap was taken, we have to wait out any
* transactions that might have older snapshots.
*/
+ INJECTION_POINT("define_index_before_set_valid");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_3);
WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* the same time to make sure we only get constraint violations from the
* indexes with the correct names.
*/
-
+ INJECTION_POINT("reindex_relation_concurrently_before_swap");
StartTransactionCommand();
/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* index_drop() for more details.
*/
+ INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_4);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -936,6 +937,8 @@ retry:
econtext->ecxt_scantuple = save_scantuple;
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
return !conflict;
}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
return rri;
}
+/*
+ * IsIndexCompatibleAsArbiter
+ * Checks whether an index is interchangeable with the provided arbiter
+ * index for the purposes of the INSERT ON CONFLICT operation.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation arbiterIndexRelation,
+ IndexInfo *arbiterIndexInfo,
+ Relation indexRelation,
+ IndexInfo *indexInfo)
+{
+ int i;
+
+ if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+ return false;
+ /* Exclusion constraints are not supported here. */
+ if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+ return false;
+ if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+ return false;
+
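+ /* Key columns must reference the same table attributes, in the same order. */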
+ for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+ {
+ int arbiterAttNo = arbiterIndexRelation->rd_index->indkey.values[i];
+ int attNo = indexRelation->rd_index->indkey.values[i];
+
+ if (arbiterAttNo != attNo)
+ return false;
+ }
+
+ if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+ RelationGetIndexExpressions(indexRelation)) != NIL)
+ return false;
+
+ if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+ RelationGetIndexPredicate(indexRelation)) != NIL)
+ return false;
+ return true;
+}
+
/*
* ExecInitPartitionInfo
* Lock the partition and initialize ResultRelInfo. Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
{
List *childIdxs;
+ List *nonAncestorIdxs = NIL;
+ int i, j, additional_arbiters = 0;
childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
ListCell *lc2;
ancestors = get_partition_ancestors(childIdx);
- foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ if (ancestors)
{
- if (list_member_oid(ancestors, lfirst_oid(lc2)))
- arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ {
+ if (list_member_oid(ancestors, lfirst_oid(lc2)))
+ arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ }
}
+ else /* No ancestor was found for that index. Save it for rechecking later. */
+ nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
list_free(ancestors);
}
+
+ /*
+ * If any non-ancestor indexes are found, we need to compare them with other
+ * indexes of the relation that will be used as arbiters. This is necessary
+ * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+ * must be considered as arbiters to ensure that all concurrent transactions
+ * use the same set of arbiters.
+ */
+ if (nonAncestorIdxs)
+ {
+ for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+ {
+ if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+ {
+ Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+ IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+ Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+ /* It is too early to use non-ready indexes as arbiters. */
+ if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+ continue;
+
+ for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+ {
+ if (list_member_oid(arbiterIndexes,
+ leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+ {
+ Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+ IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+ /* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+ if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+ nonAncestorIndexRelation, nonAncestorIndexInfo))
+ {
+ arbiterIndexes = lappend_oid(arbiterIndexes,
+ nonAncestorIndexRelation->rd_index->indexrelid);
+ additional_arbiters++;
+ }
+ }
+ }
+ }
+ }
+ }
+ list_free(nonAncestorIdxs);
+
+ /*
+ * If the resulting lists are of unequal length, something is wrong.
+ * (This shouldn't happen, since arbiter index selection should not
+ * pick up a non-ready index.)
+ *
+ * The additional arbiter indexes must be accounted for as well.
+ */
+ if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+ list_length(arbiterIndexes) - additional_arbiters)
+ elog(ERROR, "invalid arbiter index list");
}
-
- /*
- * If the resulting lists are of inequal length, something is wrong.
- * (This shouldn't happen, since arbiter index selection should not
- * pick up an invalid index.)
- */
- if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
- list_length(arbiterIndexes))
- elog(ERROR, "invalid arbiter index list");
leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1161520f76b..23cf4c6b540 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
#include "utils/datum.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
return NULL;
}
}
+ INJECTION_POINT("exec_insert_before_insert_speculative");
/*
* Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 153390f2dc9..56b58d1ed74 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
List *indexList;
ListCell *l;
- /* Normalized inference attributes and inference expressions: */
- Bitmapset *inferAttrs = NULL;
- List *inferElems = NIL;
+ /* Normalized required attributes and expressions: */
+ Bitmapset *requiredArbiterAttrs = NULL;
+ List *requiredArbiterElems = NIL;
+ List *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
/* Results */
List *results = NIL;
+ bool foundValid = false;
/*
* Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
if (!IsA(elem->expr, Var))
{
- /* If not a plain Var, just shove it in inferElems for now */
- inferElems = lappend(inferElems, elem->expr);
+ /* If not a plain Var, just shove it in requiredArbiterElems for now */
+ requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
continue;
}
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("whole row unique index inference specifications are not supported")));
- inferAttrs = bms_add_member(inferAttrs,
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
attno - FirstLowInvalidHeapAttributeNumber);
}
+ indexList = RelationGetIndexList(relation);
+
/*
* Lookup named constraint's index. This is not immediately returned
- * because some additional sanity checks are required.
+ * because some additional sanity checks are required. Additionally, we
+ * need to process other indexes as potential arbiters to account for
+ * cases where REINDEX CONCURRENTLY is processing an index used as a
+ * named constraint.
*/
if (onconflict->constraint != InvalidOid)
{
indexOidFromConstraint = get_constraint_index(onconflict->constraint);
if (indexOidFromConstraint == InvalidOid)
+ {
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("constraint in ON CONFLICT clause has no associated index")));
+ errmsg("constraint in ON CONFLICT clause has no associated index")));
+ }
+
+ /*
+ * Find the named constraint index to extract its attributes and predicates.
+ * We open the indexes in list order to avoid deadlocks caused by
+ * acquiring locks in a different order.
+ */
+ foreach(l, indexList)
+ {
+ Oid indexoid = lfirst_oid(l);
+ Relation idxRel;
+ Form_pg_index idxForm;
+ AttrNumber natt;
+
+ idxRel = index_open(indexoid, rte->rellockmode);
+ idxForm = idxRel->rd_index;
+
+ if (idxForm->indisready)
+ {
+ if (indexOidFromConstraint == idxForm->indexrelid)
+ {
+ /*
+ * Prepare the requirements other indexes must satisfy to be used as
+ * arbiters together with indexOidFromConstraint. Both equivalent
+ * indexes must be involved in the case of REINDEX CONCURRENTLY.
+ */
+ for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+ {
+ int attno = idxRel->rd_index->indkey.values[natt];
+
+ if (attno != 0)
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+ attno - FirstLowInvalidHeapAttributeNumber);
+ }
+ requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+ requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+ /* We are done, so quit the loop. */
+ index_close(idxRel, NoLock);
+ break;
+ }
+ }
+ index_close(idxRel, NoLock);
+ }
}
/*
* Using that representation, iterate through the list of indexes on the
* target relation to try and find a match
*/
- indexList = RelationGetIndexList(relation);
-
foreach(l, indexList)
{
Oid indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
idxRel = index_open(indexoid, rte->rellockmode);
idxForm = idxRel->rd_index;
- if (!idxForm->indisvalid)
+ /*
+ * We need to consider both indisvalid and merely indisready indexes,
+ * because the latter may become indisvalid before the execution phase.
+ * This keeps the set of indexes used as arbiters the same for all
+ * concurrent transactions.
+ */
+ if (!idxForm->indisready)
goto next;
/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
- results = lappend_oid(results, idxForm->indexrelid);
- list_free(indexList);
- index_close(idxRel, NoLock);
- table_close(relation, NoLock);
- return results;
+ goto found;
}
else if (indexOidFromConstraint != InvalidOid)
{
- /* No point in further work for index in named constraint case */
- goto next;
+ /* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+ if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+ goto next;
+ }
+ else
+ {
+ /*
+ * Only considering conventional inference at this point (not named
+ * constraints), so index under consideration can be immediately
+ * skipped if it's not unique
+ */
+ if (!idxForm->indisunique)
+ goto next;
}
- /*
- * Only considering conventional inference at this point (not named
- * constraints), so index under consideration can be immediately
- * skipped if it's not unique
- */
- if (!idxForm->indisunique)
- goto next;
-
/*
* So-called unique constraints with WITHOUT OVERLAPS are really
* exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/* Non-expression attributes (if any) must match */
- if (!bms_equal(indexedAttrs, inferAttrs))
+ if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
goto next;
/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
if (idxExprs && varno != 1)
ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
+ /*
+ * If arbiterElems are present, check them. If a named constraint is
+ * present, arbiterElems == NIL.
+ */
foreach(el, onconflict->arbiterElems)
{
InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/*
- * Now that all inference elements were matched, ensure that the
+ * When conventional inference is involved, ensure that the
* expression elements from inference clause are not missing any
* cataloged expressions. This does the right thing when unique
* indexes redundantly repeat the same attribute, or if attributes
* redundantly appear multiple times within an inference clause.
+ *
+ * In the named-constraint case, ensure the candidate has the same set
+ * of expressions as the named constraint index.
*/
- if (list_difference(idxExprs, inferElems) != NIL)
+ if (list_difference(idxExprs, requiredArbiterElems) != NIL)
goto next;
- /*
- * If it's a partial index, its predicate must be implied by the ON
- * CONFLICT's WHERE clause.
- */
predExprs = RelationGetIndexPredicate(idxRel);
if (predExprs && varno != 1)
ChangeVarNodes((Node *) predExprs, 1, varno, 0);
- if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+ /*
+ * If it's a partial index and we are doing conventional inference, its predicate must be implied
+ * by the ON CONFLICT's WHERE clause.
+ */
+ if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+ goto next;
+ /* If it's a partial index in the named-constraint case, the predicates must be equal. */
+ if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
goto next;
+found:
results = lappend_oid(results, idxForm->indexrelid);
+ foundValid |= idxForm->indisvalid;
next:
index_close(idxRel, NoLock);
}
@@ -946,7 +1008,8 @@ next:
list_free(indexList);
table_close(relation, NoLock);
- if (results == NIL)
+ /* At least one indisvalid index is required during planning. */
+ if (results == NIL || !foundValid)
ereport(ERROR,
(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index a1a0c2adeb6..2189bf0d9ae 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
#include "utils/resowner.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/*
@@ -392,6 +393,7 @@ InvalidateCatalogSnapshot(void)
pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
CatalogSnapshot = NULL;
SnapshotResetXmin();
+ INJECTION_POINT("invalidate_catalog_snapshot_end");
}
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+ reindex_concurrently_upsert \
+ index_concurrently_upsert \
+ reindex_concurrently_upsert_partitioned \
+ reindex_concurrently_upsert_on_constraint \
+ index_concurrently_upsert_predicate
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'reindex_concurrently_upsert',
+ 'index_concurrently_upsert',
+ 'reindex_concurrently_upsert_partitioned',
+ 'reindex_concurrently_upsert_on_constraint',
+ 'index_concurrently_upsert_predicate',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
+	# We wait for all snapshots, so avoid running the tests in parallel
+ 'runningcheck-parallel': false,
},
'tap': {
'env': {
@@ -53,5 +62,7 @@ tests += {
'tests': [
't/001_stats.pl',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
},
}
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: manage the injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: manage the injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+ CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: manage the injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: manage the injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: manage the injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+ CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+ FOR VALUES FROM (0) TO (10000)
+ WITH (parallel_workers = 0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
--
2.43.0
Attachment: v6-0003-Add-stress-tests-for-concurrent-index-operations.patch (application/octet-stream)
From 212a59c454c7584f1b020e9b847da5bd86e22f56 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v6 3/6] Add stress tests for concurrent index operations
Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations
These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
src/bin/pg_amcheck/meson.build | 1 +
src/bin/pg_amcheck/t/006_cic.pl | 144 ++++++++++++++++++++++++++++++++
2 files changed, 145 insertions(+)
create mode 100644 src/bin/pg_amcheck/t/006_cic.pl
diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..4a8f4fbc8b0 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
't/003_check.pl',
't/004_verify_heapam.pl',
't/005_opclass_damage.pl',
+ 't/006_cic.pl',
],
},
}
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..002348b8366
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,144 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+ if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+ c1 money default 0, c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+ CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN true;
+ END; $$;
+));
+
+$node->safe_psql('postgres', q(
+ CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ RETURN MOD($1, 2) = 0;
+ END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+ '--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+ {
+ 'concurrent_ops' => q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set variant random(0, 5)
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ \if :variant = 0
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+ \elif :variant = 1
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+ \elif :variant = 2
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+ \elif :variant = 3
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+ \elif :variant = 4
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+ \elif :variant = 5
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+ \endif
+ --\sleep 200 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ REINDEX INDEX CONCURRENTLY idx_2;
+ --\sleep 200 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ DROP INDEX CONCURRENTLY idx_2;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1000, 100000)
+ BEGIN;
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ SELECT setval('in_row_rebuild', 1);
+ COMMIT;
+ \endif
+ )
+ });
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for a unique index concurrently with upserts
+$node->pgbench(
+ '--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+ {
+ 'concurrent_ops_unique_idx' => q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+ --\sleep 200 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ REINDEX INDEX CONCURRENTLY idx_2;
+ --\sleep 200 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ DROP INDEX CONCURRENTLY idx_2;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1, power(10, random(1, 5)))
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ SELECT setval('in_row_rebuild', 1);
+ \endif
+ )
+ });
+
+$node->stop;
+done_testing();
\ No newline at end of file
--
2.43.0
Attachment: v6-0006-Allow-snapshot-resets-in-concurrent-unique-index-.patch (application/octet-stream)
From dc8447015383a3c38c71570749b697b25c7aceb7 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v6 6/6] Allow snapshot resets in concurrent unique index
builds
Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.
Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:
1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values
This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
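To make point 2 concrete: because the spool is sorted, equal keys arrive adjacently, so _bt_load only has to remember whether the current equal-key group has already yielded an alive tuple; a run like addda (a = alive, d = dead) must still fail with a unique violation. Below is a minimal, self-contained sketch of that group-wise check. The names here (SortedTuple, has_alive_duplicate) are hypothetical, and the sketch assumes the liveness of every tuple is known up front, whereas the patch determines it lazily through table_index_fetch_tuple() with SnapshotSelf, and only once a group actually forms:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for one entry of the sorted tuple stream:
 * the index key plus the result of a liveness check on its heap tuple. */
typedef struct
{
	int			key;
	bool		alive;
} SortedTuple;

/*
 * Return true if any equal-key group contains more than one alive tuple,
 * mirroring the per-group seen_alive tracking done in _bt_load.
 */
static bool
has_alive_duplicate(const SortedTuple *tups, int ntups)
{
	bool		seen_alive = false;

	for (int i = 0; i < ntups; i++)
	{
		/* a new key value starts a fresh equal-key group */
		if (i == 0 || tups[i].key != tups[i - 1].key)
			seen_alive = false;

		if (tups[i].alive)
		{
			if (seen_alive)
				return true;	/* second alive tuple in the same group */
			seen_alive = true;
		}
	}
	return false;
}

int
main(void)
{
	/* the "addda" case: alive, dead, dead, dead, alive -- must be rejected */
	SortedTuple addda[] = {{7, true}, {7, false}, {7, false}, {7, false}, {7, true}};
	/* at most one alive tuple per key -- must be accepted */
	SortedTuple ok[] = {{7, false}, {7, true}, {8, true}};

	printf("addda: %s\n", has_alive_duplicate(addda, 5) ? "duplicate" : "ok");
	printf("ok:    %s\n", has_alive_duplicate(ok, 3) ? "duplicate" : "ok");
	return 0;
}

The lazy liveness check in the patch (the prev_tested flag) refines this idea: it avoids fetching the heap tuple at all in the common case where a key value is not duplicated.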
---
src/backend/access/heap/heapam_handler.c | 6 +-
src/backend/access/nbtree/nbtdedup.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 173 ++++++++++++++----
src/backend/access/nbtree/nbtsplitloc.c | 12 +-
src/backend/access/nbtree/nbtutils.c | 29 ++-
src/backend/catalog/index.c | 6 +-
src/backend/utils/sort/tuplesortvariants.c | 67 +++++--
src/include/access/nbtree.h | 4 +-
src/include/access/tableam.h | 5 +-
src/include/utils/tuplesort.h | 1 +
.../expected/cic_reset_snapshots.out | 6 +
11 files changed, 242 insertions(+), 75 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2e5163609c1..921b806642a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, or during bootstrap, we take a regular MVCC snapshot
* and index whatever's live according to that while that snapshot is reset
- * every so often (in case of non-unique index).
+ * every so often.
*/
OldestXmin = InvalidTransactionId;
/*
- * For unique index we need consistent snapshot for the whole scan.
+ * For concurrent builds of non-system indexes, we may want to periodically
+ * reset snapshots to allow vacuum to clean up tuples.
*/
reset_snapshots = indexInfo->ii_Concurrent &&
- !indexInfo->ii_Unique &&
!is_system_catalog; /* just for the case */
/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
_bt_dedup_start_pending(state, itup, offnum);
}
else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
- else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
itemid = PageGetItemId(page, minoff);
itup = (IndexTuple) PageGetItem(page, itemid);
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
{
itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
itup = (IndexTuple) PageGetItem(page, itemid);
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
return true;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2acbf121745..ac9e5acfc53 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
Relation index;
bool isunique;
bool nulls_not_distinct;
+ bool unique_dead_ignored;
} BTSpool;
/*
@@ -101,6 +102,7 @@ typedef struct BTShared
Oid indexrelid;
bool isunique;
bool nulls_not_distinct;
+ bool unique_dead_ignored;
bool isconcurrent;
int scantuplesortstates;
@@ -203,15 +205,13 @@ typedef struct BTLeader
*/
typedef struct BTBuildState
{
- bool isunique;
- bool nulls_not_distinct;
bool havedead;
Relation heap;
BTSpool *spool;
/*
- * spool2 is needed only when the index is a unique index. Dead tuples are
- * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
*/
BTSpool *spool2;
double indtuples;
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
ResetUsage();
#endif /* BTREE_BUILD_STATS */
- buildstate.isunique = indexInfo->ii_Unique;
- buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
buildstate.havedead = false;
buildstate.heap = heap;
buildstate.spool = NULL;
@@ -379,6 +377,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
btspool->index = index;
btspool->isunique = indexInfo->ii_Unique;
btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+ /*
+	 * We need to ignore dead tuples for uniqueness checks in case of a concurrent
+	 * build. This is required because of the periodic snapshot resets.
+ */
+ btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
/* Save as primary spool */
buildstate->spool = btspool;
@@ -427,8 +430,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* the use of parallelism or any other factor.
*/
buildstate->spool->sortstate =
- tuplesort_begin_index_btree(heap, index, buildstate->isunique,
- buildstate->nulls_not_distinct,
+ tuplesort_begin_index_btree(heap, index, btspool->isunique,
+ btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
maintenance_work_mem, coordinate,
TUPLESORT_NONE);
@@ -436,8 +440,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* If building a unique index, put dead tuples in a second spool to keep
* them out of the uniqueness check. We expect that the second spool (for
* dead tuples) won't get very full, so we give it only work_mem.
+ *
+	 * In case of a concurrent build, dead tuples need not be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
*/
- if (indexInfo->ii_Unique)
+ if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
{
BTSpool *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
SortCoordinate coordinate2 = NULL;
@@ -468,7 +476,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* full, so we give it only work_mem
*/
buildstate->spool2->sortstate =
- tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+ tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
coordinate2, TUPLESORT_NONE);
}
@@ -1147,13 +1155,116 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
+ bool fail_on_alive_duplicate;
wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
BTGetDeduplicateItems(wstate->index);
+ /*
+	 * unique_dead_ignored does not guarantee that multiple alive tuples with
+	 * the same key values are absent from the spool. That can happen when
+	 * alive tuples are interleaved with dead ones, e.g. in the order addda.
+ */
+ fail_on_alive_duplicate = btspool->unique_dead_ignored;
- if (merge)
+ if (fail_on_alive_duplicate)
+ {
+ bool seen_alive = false,
+ prev_tested = false;
+ IndexTuple prev = NULL;
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+ &TTSOpsBufferHeapTuple);
+ IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+ Assert(btspool->isunique);
+ Assert(!btspool2);
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+ {
+ bool tuples_equal = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL)	/* not the first tuple */
+ {
+ bool has_nulls = false,
+						call_again, /* dummy, value not used */
+						ignored,	/* dummy, value not used */
+ now_alive;
+ ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+ if (wstate->inskey->allequalimage)
+ tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+ else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+ /* handle null values correctly */
+ if (has_nulls && !btspool->nulls_not_distinct)
+ tuples_equal = false;
+
+ if (tuples_equal)
+ {
+					/* check the previous tuple's liveness if not already done */
+ if (!prev_tested)
+ {
+ call_again = false;
+ tid = prev->t_tid;
+ seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+ prev_tested = true;
+ }
+
+ call_again = false;
+ tid = itup->t_tid;
+ now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* have we seen multiple alive tuples in this equal-key group? */
+ if (seen_alive && now_alive)
+ {
+ char *key_desc;
+ TupleDesc tupDes = RelationGetDescr(wstate->index);
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values[INDEX_MAX_KEYS];
+
+ index_deform_tuple(itup, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the one in comparetup_index_btree_tiebreak */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(wstate->index)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(wstate->heap,
+ RelationGetRelationName(wstate->index))));
+ }
+ seen_alive |= now_alive;
+ }
+ }
+
+ if (!tuples_equal)
+ {
+ seen_alive = false;
+ prev_tested = false;
+ }
+
+ _bt_buildadd(wstate, state, itup, 0);
+ if (prev) pfree(prev);
+ prev = CopyIndexTuple(itup);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
+ else if (merge)
{
/*
* Another BTSpool for dead tuples exists. Now we have to merge
@@ -1314,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
InvalidOffsetNumber);
}
else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
+ itup, NULL) > keysz &&
_bt_dedup_save_htid(dstate, itup))
{
/*
@@ -1411,7 +1522,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
- bool reset_snapshot;
bool wait_for_snapshot_attach;
int querylen;
@@ -1430,21 +1540,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
scantuplesortstates = leaderparticipates ? request + 1 : request;
- /*
- * For concurrent non-unique index builds, we can periodically reset snapshots
- * to allow the xmin horizon to advance. This is safe since these builds don't
- * require a consistent view across the entire scan. Unique indexes still need
- * a stable snapshot to properly enforce uniqueness constraints.
- */
- reset_snapshot = isconcurrent && !btspool->isunique;
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that, while that snapshot may be reset periodically in
- * case of non-unique index.
+ * live according to that, while that snapshot may be reset periodically.
*/
if (!isconcurrent)
{
@@ -1452,16 +1553,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
- else if (reset_snapshot)
+ else
{
+ /*
+ * For concurrent index builds, we can periodically reset snapshots to allow
+ * the xmin horizon to advance. This is safe since these builds don't
+ * require a consistent view across the entire scan.
+ */
snapshot = InvalidSnapshot;
PushActiveSnapshot(GetTransactionSnapshot());
}
- else
- {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
- }
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1531,6 +1632,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->indexrelid = RelationGetRelid(btspool->index);
btshared->isunique = btspool->isunique;
btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+ btshared->unique_dead_ignored = btspool->unique_dead_ignored;
btshared->isconcurrent = isconcurrent;
btshared->scantuplesortstates = scantuplesortstates;
btshared->queryid = pgstat_get_my_query_id();
@@ -1545,7 +1647,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
snapshot,
- reset_snapshot);
+ isconcurrent);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1626,7 +1728,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* In case when leader going to reset own active snapshot as well - we need to
* wait until all workers imported initial snapshot.
*/
- wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+ wait_for_snapshot_attach = isconcurrent && leaderparticipates;
if (wait_for_snapshot_attach)
WaitForParallelWorkersToAttach(pcxt, true);
@@ -1742,6 +1844,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
leaderworker->index = buildstate->spool->index;
leaderworker->isunique = buildstate->spool->isunique;
leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+ leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
/* Initialize second spool, if required */
if (!btleader->btshared->isunique)
@@ -1845,11 +1948,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
btspool->index = indexRel;
btspool->isunique = btshared->isunique;
btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+ btspool->unique_dead_ignored = btshared->unique_dead_ignored;
/* Look up shared state private to tuplesort.c */
sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
tuplesort_attach_shared(sharedsort, seg);
- if (!btshared->isunique)
+ if (!btshared->isunique || btshared->isconcurrent)
{
btspool2 = NULL;
sharedsort2 = NULL;
@@ -1928,6 +2032,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
btspool->index,
btspool->isunique,
btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
sortmem, coordinate,
TUPLESORT_NONE);
@@ -1950,14 +2055,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
coordinate2->nParticipants = -1;
coordinate2->sharedsort = sharedsort2;
btspool2->sortstate =
- tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+ tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
Min(sortmem, work_mem), coordinate2,
false);
}
/* Fill in buildstate for _bt_build_callback() */
- buildstate.isunique = btshared->isunique;
- buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
buildstate.havedead = false;
buildstate.heap = btspool->heap;
buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
{
itemid = PageGetItemId(state->origpage, maxoff);
tup = (IndexTuple) PageGetItem(state->origpage, itemid);
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* avoid appending a heap TID in new high key, we're done. Finish split
* with default strategy and initial split interval.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
if (perfectpenalty <= indnkeyatts)
return perfectpenalty;
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* If page is entirely full of duplicates, a single value strategy split
* will be performed.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
if (perfectpenalty <= indnkeyatts)
{
*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
itemid = PageGetItemId(state->origpage, P_HIKEY);
hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
- state->newitem);
+ state->newitem, NULL);
if (perfectpenalty <= indnkeyatts)
*strategy = SPLIT_SINGLE_VALUE;
else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
lastleft = _bt_split_lastleft(state, split);
firstright = _bt_split_firstright(state, split);
- return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+ return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 50cbf06cb45..3d6dda4ace8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
ScanDirection dir, bool *continuescan);
static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
int tupnatts, TupleDesc tupdesc);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
/*
@@ -4672,7 +4670,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
/* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+ keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
#ifdef DEBUG_NO_TRUNCATE
/* Force truncation to be ineffective for testing purposes */
@@ -4790,17 +4788,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
/*
* _bt_keep_natts - how many key attributes to keep when truncating.
*
+ * This is exported for use as a comparison function during a concurrent
+ * unique index build, when _bt_keep_natts_fast is not suitable because the
+ * collation is not "allequalimage"/deduplication-safe.
+ *
* Caller provides two tuples that enclose a split point. Caller's insertion
* scankey is used to compare the tuples; the scankey's argument values are
* not considered here.
*
+ * *hasnulls is set to true if any key column of either tuple is null.
+ *
* This can return a number of attributes that is one greater than the
* number of key attributes for the index relation. This indicates that the
* caller must use a heap TID as a unique-ifier in new pivot tuple.
*/
-static int
+int
_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
+ BTScanInsert itup_key,
+ bool *hasnulls)
{
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
TupleDesc itupdesc = RelationGetDescr(rel);
@@ -4826,6 +4831,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ if (hasnulls)
+ (*hasnulls) |= (isNull1 || isNull2);
if (isNull1 != isNull2)
break;
@@ -4845,7 +4852,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* expected in an allequalimage index.
*/
Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+ keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
return keepnatts;
}
@@ -4856,7 +4863,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* This is exported so that a candidate split point can have its effect on
* suffix truncation inexpensively evaluated ahead of time when finding a
* split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. It may also be used as a comparison function during a
+ * concurrent unique index build.
*
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4865,6 +4873,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* "equal image" columns, routine is guaranteed to give the same result as
* _bt_keep_natts would.
*
+ * *hasnulls is set to true if any key column of either tuple is null.
+ *
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts, even when the index uses
* an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4873,7 +4883,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* more balanced split point.
*/
int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ bool *hasnulls)
{
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4890,6 +4901,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ if (hasnulls)
+ *hasnulls |= (isNull1 | isNull2);
att = TupleDescAttr(itupdesc, attnum - 1);
if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e0ada5ce159..f6a1a2f3f90 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3292,9 +3292,9 @@ IndexCheckExclusion(Relation heapRelation,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as the active one every so often. The reason for that
+ * is to propagate the xmin horizon forward.
*
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..aa4fcaac9a0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -123,6 +123,7 @@ typedef struct
bool enforceUnique; /* complain if we find duplicate tuples */
bool uniqueNullsNotDistinct; /* unique constraint null treatment */
+ bool uniqueDeadIgnored; /* ignore dead tuples in unique check */
} TuplesortIndexBTreeArg;
/*
@@ -349,6 +350,7 @@ tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem,
SortCoordinate coordinate,
int sortopt)
@@ -391,6 +393,7 @@ tuplesort_begin_index_btree(Relation heapRel,
arg->index.indexRel = indexRel;
arg->enforceUnique = enforceUnique;
arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+ arg->uniqueDeadIgnored = uniqueDeadIgnored;
indexScanKey = _bt_mkscankey(indexRel, NULL);
@@ -1520,6 +1523,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
char *key_desc;
+ bool uniqueCheckFail = true;
/*
* Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1533,57 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
*/
Assert(tuple1 != tuple2);
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
+ /* This is a fail-fast check; see _bt_load for details. */
+ if (arg->uniqueDeadIgnored)
+ {
+ bool any_tuple_dead,
+ call_again = false,
+ ignored;
+
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+ &TTSOpsBufferHeapTuple);
+ ItemPointerData tid = tuple1->t_tid;
+
+ IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+ any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+ if (!any_tuple_dead)
+ {
+ call_again = false;
+ tid = tuple2->t_tid;
+ any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+ &ignored);
+ }
+
+ if (any_tuple_dead)
+ {
+ elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+ ItemPointerGetBlockNumber(&tuple1->t_tid),
+ ItemPointerGetOffsetNumber(&tuple1->t_tid),
+ ItemPointerGetBlockNumber(&tuple2->t_tid),
+ ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+ uniqueCheckFail = false;
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
+ if (uniqueCheckFail)
+ {
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ /* Keep this error message in sync with the one in _bt_load. */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
}
/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
extern char *btbuildphasename(int64 phasenum);
extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key, bool *hasnulls);
extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
+ IndexTuple firstright, bool *hasnulls);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9ee5ea15fd4..ec3769585c3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1803,9 +1803,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
*
- * In case of non-unique concurrent index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In case of a concurrent index build, SO_RESET_SNAPSHOT is applied to the
+ * scan. Snapshots are then replaced on the fly, allowing the xmin horizon
+ * to propagate.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem, SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 49ef68d9071..c8e4683ad6d 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
----------------
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
(1 row)
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_parallelscan_initialize
--
2.43.0
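A note on the fail-fast uniqueness check in the tuplesort hunk above: with
snapshot resets, the sort feeding _bt_load can contain two heap versions of
the same logical row, and the apparent duplicate is harmless when one of them
is dead. A schematic SQL illustration of how such pairs arise (a hypothetical
interleaving, not a deterministic reproducer):

    CREATE TABLE t(i int);
    INSERT INTO t VALUES (1);
    -- session A starts: CREATE UNIQUE INDEX CONCURRENTLY t_i ON t(i);
    -- session B, while A's heap scan is in progress:
    UPDATE t SET i = i;  -- leaves a dead heap version with the same key
    -- A's sort may now see both versions of the row; the tiebreak comparator
    -- probes each TID with SnapshotSelf and raises a unique-violation error
    -- only if both tuples are still alive.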
Hello!
Added the STIR access method; the next step is validating indexes using it.
Best regards,
Mikhail.
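For readers unfamiliar with the harness used by the attached isolation specs,
here is a minimal sketch of the injection_points API they rely on (assuming
the src/test/modules/injection_points extension is built and installed):

    CREATE EXTENSION injection_points;
    SELECT injection_points_set_local();  -- attached points affect only this backend
    SELECT injection_points_attach('define_index_before_set_valid', 'wait');
    -- CREATE INDEX CONCURRENTLY run in this session now blocks at that point;
    -- another session resumes it with:
    SELECT injection_points_detach('define_index_before_set_valid');
    SELECT injection_points_wakeup('define_index_before_set_valid');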
Attachments:
v7-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch (application/octet-stream)
From 12efb82206cee7843bf17ccabacc91435d0bac5a Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v7 1/6] this is https://commitfest.postgresql.org/50/5160/
merged into a single commit; it is required for the stability of the stress tests.
---
src/backend/commands/indexcmds.c | 4 +-
src/backend/executor/execIndexing.c | 3 +
src/backend/executor/execPartition.c | 119 ++++++++-
src/backend/executor/nodeModifyTable.c | 2 +
src/backend/optimizer/util/plancat.c | 135 +++++++---
src/backend/utils/time/snapmgr.c | 2 +
src/test/modules/injection_points/Makefile | 7 +-
.../expected/index_concurrently_upsert.out | 80 ++++++
.../index_concurrently_upsert_predicate.out | 80 ++++++
.../expected/reindex_concurrently_upsert.out | 238 ++++++++++++++++++
...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 11 +
.../specs/index_concurrently_upsert.spec | 68 +++++
.../index_concurrently_upsert_predicate.spec | 70 ++++++
.../specs/reindex_concurrently_upsert.spec | 86 +++++++
...dex_concurrently_upsert_on_constraint.spec | 86 +++++++
...index_concurrently_upsert_partitioned.spec | 88 +++++++
18 files changed, 1505 insertions(+), 50 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
* before the reference snap was taken, we have to wait out any
* transactions that might have older snapshots.
*/
+ INJECTION_POINT("define_index_before_set_valid");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_3);
WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* the same time to make sure we only get constraint violations from the
* indexes with the correct names.
*/
-
+ INJECTION_POINT("reindex_relation_concurrently_before_swap");
StartTransactionCommand();
/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* index_drop() for more details.
*/
+ INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_4);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -936,6 +937,8 @@ retry:
econtext->ecxt_scantuple = save_scantuple;
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
return !conflict;
}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
return rri;
}
+/*
+ * IsIndexCompatibleAsArbiter
+ * Checks whether the given index is interchangeable with the
+ * provided arbiter index when used as an arbiter for the
+ * INSERT ON CONFLICT operation.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation arbiterIndexRelation,
+ IndexInfo *arbiterIndexInfo,
+ Relation indexRelation,
+ IndexInfo *indexInfo)
+{
+ int i;
+
+ if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+ return false;
+ /* Exclusion constraints are not supported here. */
+ if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+ return false;
+ if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+ return false;
+
+ for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+ {
+ int arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+ int attoNo = indexRelation->rd_index->indkey.values[i];
+ if (arbiterAttoNo != attoNo)
+ return false;
+ }
+
+ if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+ RelationGetIndexExpressions(indexRelation)) != NIL)
+ return false;
+
+ if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+ RelationGetIndexPredicate(indexRelation)) != NIL)
+ return false;
+ return true;
+}
+
/*
* ExecInitPartitionInfo
* Lock the partition and initialize ResultRelInfo. Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
{
List *childIdxs;
+ List *nonAncestorIdxs = NIL;
+ int i, j, additional_arbiters = 0;
childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
ListCell *lc2;
ancestors = get_partition_ancestors(childIdx);
- foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ if (ancestors)
{
- if (list_member_oid(ancestors, lfirst_oid(lc2)))
- arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ {
+ if (list_member_oid(ancestors, lfirst_oid(lc2)))
+ arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ }
}
+ else /* No ancestor was found for that index. Save it for rechecking later. */
+ nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
list_free(ancestors);
}
+
+ /*
+ * If any non-ancestor indexes are found, we need to compare them with other
+ * indexes of the relation that will be used as arbiters. This is necessary
+ * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+ * must be considered as arbiters to ensure that all concurrent transactions
+ * use the same set of arbiters.
+ */
+ if (nonAncestorIdxs)
+ {
+ for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+ {
+ if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+ {
+ Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+ IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+ Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+ /* It is too early to use non-ready indexes as arbiters */
+ if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+ continue;
+
+ for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+ {
+ if (list_member_oid(arbiterIndexes,
+ leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+ {
+ Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+ IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+ /* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+ if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+ nonAncestorIndexRelation, nonAncestorIndexInfo))
+ {
+ arbiterIndexes = lappend_oid(arbiterIndexes,
+ nonAncestorIndexRelation->rd_index->indexrelid);
+ additional_arbiters++;
+ }
+ }
+ }
+ }
+ }
+ }
+ list_free(nonAncestorIdxs);
+
+ /*
+ * If the resulting lists are of inequal length, something is wrong.
+ * (This shouldn't happen, since arbiter index selection should not
+ * pick up a non-ready index.)
+ *
+ * But we also need to account for the additional arbiter indexes.
+ */
+ if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+ list_length(arbiterIndexes) - additional_arbiters)
+ elog(ERROR, "invalid arbiter index list");
}
-
- /*
- * If the resulting lists are of inequal length, something is wrong.
- * (This shouldn't happen, since arbiter index selection should not
- * pick up an invalid index.)
- */
- if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
- list_length(arbiterIndexes))
- elog(ERROR, "invalid arbiter index list");
leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1161520f76b..23cf4c6b540 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
#include "utils/datum.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
return NULL;
}
}
+ INJECTION_POINT("exec_insert_before_insert_speculative");
/*
* Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 153390f2dc9..56b58d1ed74 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
List *indexList;
ListCell *l;
- /* Normalized inference attributes and inference expressions: */
- Bitmapset *inferAttrs = NULL;
- List *inferElems = NIL;
+ /* Normalized required attributes and expressions: */
+ Bitmapset *requiredArbiterAttrs = NULL;
+ List *requiredArbiterElems = NIL;
+ List *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
/* Results */
List *results = NIL;
+ bool foundValid = false;
/*
* Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
if (!IsA(elem->expr, Var))
{
- /* If not a plain Var, just shove it in inferElems for now */
- inferElems = lappend(inferElems, elem->expr);
+ /* If not a plain Var, just shove it in requiredArbiterElems for now */
+ requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
continue;
}
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("whole row unique index inference specifications are not supported")));
- inferAttrs = bms_add_member(inferAttrs,
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
attno - FirstLowInvalidHeapAttributeNumber);
}
+ indexList = RelationGetIndexList(relation);
+
/*
* Lookup named constraint's index. This is not immediately returned
- * because some additional sanity checks are required.
+ * because some additional sanity checks are required. Additionally, we
+ * need to process other indexes as potential arbiters to account for
+ * cases where REINDEX CONCURRENTLY is processing an index used as a
+ * named constraint.
*/
if (onconflict->constraint != InvalidOid)
{
indexOidFromConstraint = get_constraint_index(onconflict->constraint);
if (indexOidFromConstraint == InvalidOid)
+ {
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("constraint in ON CONFLICT clause has no associated index")));
+ errmsg("constraint in ON CONFLICT clause has no associated index")));
+ }
+
+ /*
+ * Find the named constraint index to extract its attributes and predicates.
+ * We open all indexes inside this loop to avoid deadlocks caused by
+ * acquiring the locks in a different order.
+ */
+ foreach(l, indexList)
+ {
+ Oid indexoid = lfirst_oid(l);
+ Relation idxRel;
+ Form_pg_index idxForm;
+ AttrNumber natt;
+
+ idxRel = index_open(indexoid, rte->rellockmode);
+ idxForm = idxRel->rd_index;
+
+ if (idxForm->indisready)
+ {
+ if (indexOidFromConstraint == idxForm->indexrelid)
+ {
+ /*
+ * Prepare the requirements that other indexes must satisfy to be used
+ * as arbiters together with indexOidFromConstraint. Both equivalent
+ * indexes must be involved in the REINDEX CONCURRENTLY case.
+ */
+ for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+ {
+ int attno = idxRel->rd_index->indkey.values[natt];
+
+ if (attno != 0)
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+ attno - FirstLowInvalidHeapAttributeNumber);
+ }
+ requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+ requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+ /* We are done, so quit the loop. */
+ index_close(idxRel, NoLock);
+ break;
+ }
+ }
+ index_close(idxRel, NoLock);
+ }
}
/*
* Using that representation, iterate through the list of indexes on the
* target relation to try and find a match
*/
- indexList = RelationGetIndexList(relation);
-
foreach(l, indexList)
{
Oid indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
idxRel = index_open(indexoid, rte->rellockmode);
idxForm = idxRel->rd_index;
- if (!idxForm->indisvalid)
+ /*
+ * We need to consider both indisvalid and indisready indexes because
+ * the latter may become indisvalid before the execution phase. This is
+ * required to keep the set of arbiter indexes the same for all
+ * concurrent transactions.
+ */
+ if (!idxForm->indisready)
goto next;
/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
- results = lappend_oid(results, idxForm->indexrelid);
- list_free(indexList);
- index_close(idxRel, NoLock);
- table_close(relation, NoLock);
- return results;
+ goto found;
}
else if (indexOidFromConstraint != InvalidOid)
{
- /* No point in further work for index in named constraint case */
- goto next;
+ /* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+ if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+ goto next;
+ }
+ else
+ {
+ /*
+ * Only considering conventional inference at this point (not named
+ * constraints), so index under consideration can be immediately
+ * skipped if it's not unique
+ */
+ if (!idxForm->indisunique)
+ goto next;
}
- /*
- * Only considering conventional inference at this point (not named
- * constraints), so index under consideration can be immediately
- * skipped if it's not unique
- */
- if (!idxForm->indisunique)
- goto next;
-
/*
* So-called unique constraints with WITHOUT OVERLAPS are really
* exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/* Non-expression attributes (if any) must match */
- if (!bms_equal(indexedAttrs, inferAttrs))
+ if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
goto next;
/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
if (idxExprs && varno != 1)
ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
+ /*
+ * If arbiterElems are present, check them. If a named constraint is
+ * present, arbiterElems == NIL.
+ */
foreach(el, onconflict->arbiterElems)
{
InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/*
- * Now that all inference elements were matched, ensure that the
+ * In the case of conventional inference, ensure that the
* expression elements from inference clause are not missing any
* cataloged expressions. This does the right thing when unique
* indexes redundantly repeat the same attribute, or if attributes
* redundantly appear multiple times within an inference clause.
+ *
+ * In the case of a named constraint, ensure the candidate has the same
+ * set of expressions as the named constraint index.
*/
- if (list_difference(idxExprs, inferElems) != NIL)
+ if (list_difference(idxExprs, requiredArbiterElems) != NIL)
goto next;
- /*
- * If it's a partial index, its predicate must be implied by the ON
- * CONFLICT's WHERE clause.
- */
predExprs = RelationGetIndexPredicate(idxRel);
if (predExprs && varno != 1)
ChangeVarNodes((Node *) predExprs, 1, varno, 0);
- if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+ /*
+ * If it's a partial index and conventional inference is used, its
+ * predicate must be implied by the ON CONFLICT's WHERE clause.
+ */
+ if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+ goto next;
+ /* If it's a partial index and a named constraint is used, the predicates must be equal. */
+ if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
goto next;
+found:
results = lappend_oid(results, idxForm->indexrelid);
+ foundValid |= idxForm->indisvalid;
next:
index_close(idxRel, NoLock);
}
@@ -946,7 +1008,8 @@ next:
list_free(indexList);
table_close(relation, NoLock);
- if (results == NIL)
+ /* At least one indisvalid index is required during planning. */
+ if (results == NIL || !foundValid)
ereport(ERROR,
(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index a1a0c2adeb6..2189bf0d9ae 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
#include "utils/resowner.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/*
@@ -392,6 +393,7 @@ InvalidateCatalogSnapshot(void)
pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
CatalogSnapshot = NULL;
SnapshotResetXmin();
+ INJECTION_POINT("invalidate_catalog_snapshot_end");
}
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+ reindex_concurrently_upsert \
+ index_concurrently_upsert \
+ reindex_concurrently_upsert_partitioned \
+ reindex_concurrently_upsert_on_constraint \
+ index_concurrently_upsert_predicate
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'reindex_concurrently_upsert',
+ 'index_concurrently_upsert',
+ 'reindex_concurrently_upsert_partitioned',
+ 'reindex_concurrently_upsert_on_constraint',
+ 'index_concurrently_upsert_predicate',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
+ # We wait for all snapshots, so avoid parallel test execution
+ 'runningcheck-parallel': false,
},
'tap': {
'env': {
@@ -53,5 +62,7 @@ tests += {
'tests': [
't/001_stats.pl',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
},
}
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+ CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+ CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+ FOR VALUES FROM (0) TO (10000)
+ WITH (parallel_workers = 0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
--
2.43.0
Attachment: v7-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch (application/octet-stream)
From 452ef7089db779a08421a1084584c13c599d1320 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v7 3/6] Allow advancing xmin during non-unique, non-parallel
concurrent index builds by periodically resetting snapshots
Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.
This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.
Currently, this technique is applied to:
- Non-parallel index builds: parallel builds are not yet supported and will be addressed in a future commit.
- Non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; support for them may be added in the future.
- Only the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness.
To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.
Regression tests are added to verify the behavior.
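As a rough illustration of the intended effect (not part of the patch; the table and index names below are invented for the example), one can poll the builder's backend_xmin in pg_stat_activity while a non-unique index is built concurrently. Without the patch it stays pinned for the whole heap scan; with SO_RESET_SNAPSHOT it should keep moving forward:

    -- session 1: a long-running concurrent build of a non-unique index
    CREATE INDEX CONCURRENTLY big_tbl_j_idx ON big_tbl (j);

    -- session 2: the builder's reported xmin should advance roughly
    -- every SO_RESET_SNAPSHOT_EACH_N_PAGE (64) heap pages scanned
    SELECT pid, backend_xmin, query
    FROM pg_stat_activity
    WHERE query LIKE 'CREATE INDEX CONCURRENTLY%';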
---
contrib/amcheck/verify_nbtree.c | 3 +-
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/brin/brin.c | 14 +++
src/backend/access/heap/heapam.c | 46 ++++++++
src/backend/access/heap/heapam_handler.c | 57 ++++++++--
src/backend/access/index/genam.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 14 +++
src/backend/catalog/index.c | 30 ++++-
src/backend/commands/indexcmds.c | 14 +--
src/backend/optimizer/plan/planner.c | 9 ++
src/include/access/tableam.h | 28 ++++-
src/test/modules/injection_points/Makefile | 2 +-
.../expected/cic_reset_snapshots.out | 107 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../sql/cic_reset_snapshots.sql | 86 ++++++++++++++
15 files changed, 384 insertions(+), 31 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK? */
+ true, /* syncscan OK? */
+ false); /* don't reset snapshot */
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
errmsg("only heap AM is supported")));
/* Disable syncscan because we assume we scan from block zero upwards */
- scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+ scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
hscan = (HeapScanDesc) scan;
InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3aedec882cd..d69859ac4df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_brin_end_parallel(brinleader, NULL);
return;
}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..1fdfdf96482 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
#include "utils/datum.h"
#include "utils/inval.h"
#include "utils/spccache.h"
+#include "utils/injection_point.h"
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure active snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ PopActiveSnapshot();
+
+ sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+ Assert(!HaveRegisteredOrActiveSnapshot());
+ InvalidateCatalogSnapshot();
+
+ /* The goal of the snapshot reset is to let the horizon advance. */
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+ /* In some cases this is still not possible, because an xid was assigned. */
+ if (!TransactionIdIsValid(MyProc->xid))
+ INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+ PushActiveSnapshot(GetLatestSnapshot());
+ sscan->rs_snapshot = GetActiveSnapshot();
+}
+
/*
* heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
*
@@ -607,7 +638,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
if (BufferIsValid(scan->rs_cbuf))
+ {
scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+ if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+ (scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+ heap_reset_scan_snapshot((TableScanDesc) scan);
+ }
}
/*
@@ -1233,6 +1270,15 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_parallelworkerdata != NULL)
pfree(scan->rs_parallelworkerdata);
+ if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+ {
+ Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ }
+
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..980c51e32b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
ExprContext *econtext;
Snapshot snapshot;
bool need_unregister_snapshot = false;
+ bool need_pop_active_snapshot = false;
+ bool reset_snapshots = false;
TransactionId OldestXmin;
BlockNumber previous_blkno = InvalidBlockNumber;
BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
- /* Set up execution state for predicate, if any. */
- predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
*/
OldestXmin = InvalidTransactionId;
+ /*
+ * For a unique index we need a consistent snapshot for the whole scan.
+ * Parallel scans would need additional infrastructure to support
+ * SO_RESET_SNAPSHOT, which is not yet in place.
+ */
+ reset_snapshots = indexInfo->ii_Concurrent &&
+ !indexInfo->ii_Unique &&
+!is_system_catalog; /* just in case */
+
/* okay to ignore lazy VACUUMs here */
if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
{
/*
* Serial index build.
- *
- * Must begin our own heap scan in this case. We may also need to
- * register a snapshot whose lifetime is under our direct control.
*/
if (!TransactionIdIsValid(OldestXmin))
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- need_unregister_snapshot = true;
+ snapshot = GetTransactionSnapshot();
+ /*
+ * Must begin our own heap scan in this case. We may also need to
+ * register a snapshot whose lifetime is under our direct control.
+ * When the snapshot is reset during the scan, registering it is not
+ * allowed, because the snapshot is going to be replaced every so
+ * often.
+ */
+ if (!reset_snapshots)
+ {
+ snapshot = RegisterSnapshot(snapshot);
+ need_unregister_snapshot = true;
+ }
+ Assert(!ActiveSnapshotSet());
+ PushActiveSnapshot(snapshot);
+ /* Store a pointer to the active snapshot, since it may have been copied. */
+ snapshot = GetActiveSnapshot();
+ need_pop_active_snapshot = true;
}
else
+ {
+ Assert(!indexInfo->ii_Concurrent);
snapshot = SnapshotAny;
+ }
scan = table_beginscan_strat(heapRelation, /* relation */
snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- allow_sync); /* syncscan OK? */
+ allow_sync, /* syncscan OK? */
+ reset_snapshots /* reset snapshots? */);
}
else
{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
}
hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
!TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);
+ Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+ /* Clear the snapshot reference, since the scan itself may replace it. */
+ if (reset_snapshots)
+ snapshot = InvalidSnapshot;
/* Publish number of blocks to scan */
if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
table_endscan(scan);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
/* we can now forget our snapshot, if set and registered by us */
if (need_unregister_snapshot)
UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- false); /* syncscan not OK */
+ false, /* syncscan not OK */
+ false); /* don't reset snapshot */
hscan = (HeapScanDesc) scan;
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4b4ebff6a17..a104ba9df74 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
*/
sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
nkeys, key,
- true, false);
+ true, false, false);
sysscan->iscan = NULL;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 17a352d040c..5c4581afb1a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ }
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_bt_end_parallel(btleader);
return;
}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 05dc6add7eb..e0ada5ce159 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+#include "storage/proc.h"
/* Potentially set by pg_upgrade_support functions */
Oid binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
Relation indexRelation;
IndexInfo *indexInfo;
- /* This had better make sure that a snapshot is active */
- Assert(ActiveSnapshotSet());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ Assert(!TransactionIdIsValid(MyProc->xid));
/* Open and lock the parent heap relation */
heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
indexRelation = index_open(indexRelationId, RowExclusiveLock);
+ /* BuildIndexInfo may require a snapshot for expressions and predicates */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* We have to re-build the IndexInfo struct, since it was lost in the
* commit of the transaction where this concurrent index was created at
* the catalog level.
*/
indexInfo = BuildIndexInfo(indexRelation);
+ /* Done with snapshot */
+ PopActiveSnapshot();
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/* Now build the index */
index_build(heapRel, indexRelation, indexInfo, false, true);
+ /* Invalidate the catalog snapshot just for the assert below */
+ InvalidateCatalogSnapshot();
+ Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
table_close(heapRel, NoLock);
index_close(indexRelation, NoLock);
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
* insert new entries into the index for insertions and non-HOT updates.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+ /* we can do away with our snapshot */
+ PopActiveSnapshot();
}
/*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK */
+ true, /* syncscan OK */
+ false); /* don't reset snapshot */
while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
* as of the start of the scan (see table_index_build_scan), whereas a normal
* build takes care to include recently-dead tuples. This is OK because
* we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone. The reason for doing that is to avoid
+ * to see those tuples are gone. One of the reasons for doing that is to avoid
* bogus unique-index failures due to concurrent UPDATEs (we might see
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a fresh snapshot to be installed as active every so
+ * often. The point of that is to let the xmin horizon move forward.
+ *
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
* transactions that could have been modifying the table to terminate. Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
- * We now take a new snapshot, and build the index using all tuples that
- * are visible in this snapshot. We can be sure that any HOT updates to
+ * We build the index using all tuples that are visible under a single
+ * snapshot or under periodically refreshed snapshots. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
* rolled back. Thus, each visible tuple is either the end of its
* HOT-chain or the extension of the chain is HOT-safe for this index.
*/
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/* Perform concurrent build of index */
index_concurrently_build(tableId, indexRelationId);
- /* we can do away with our snapshot */
- PopActiveSnapshot();
-
/*
* Commit this transaction to make the indisready update visible.
*/
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
if (newidx->safe)
set_indexsafe_procflags();
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/* Perform concurrent build of new index */
index_concurrently_build(newidx->tableId, newidx->indexId);
- PopActiveSnapshot();
CommitTransactionCommand();
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f3856c519f6..5c7514c96ac 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
/* GUC parameters */
double cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6779,6 +6780,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
Relation heap;
Relation index;
RelOptInfo *rel;
+ bool need_pop_active_snapshot = false;
int parallel_workers;
BlockNumber heap_blocks;
double reltuples;
@@ -6834,6 +6836,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
heap = table_open(tableOid, NoLock);
index = index_open(indexOid, NoLock);
+ /* Set ActiveSnapshot since functions in the indexes may need it */
+ if (!ActiveSnapshotSet()) {
+ PushActiveSnapshot(GetTransactionSnapshot());
+ need_pop_active_snapshot = true;
+ }
/*
* Determine if it's safe to proceed.
*
@@ -6891,6 +6898,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
parallel_workers--;
done:
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
index_close(index, NoLock);
table_close(heap, NoLock);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..f4c7d2a92bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
+#include "utils/injection_point.h"
#define DEFAULT_TABLE_ACCESS_METHOD "heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
* needed. If table data may be needed, set SO_NEED_TUPLES.
*/
SO_NEED_TUPLES = 1 << 10,
+ /*
+ * Reset the scan and catalog snapshots every so often? If set, every
+ * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+ * catalog snapshot is invalidated, and the latest one is pushed as active.
+ *
+ * The snapshot is not popped at the end of the scan. The goal of this
+ * mode is to keep the xmin horizon moving forward.
+ *
+ * See heap_reset_scan_snapshot for details.
+ */
+ SO_RESET_SNAPSHOT = 1 << 11,
} ScanOptions;
/*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
static inline TableScanDesc
table_beginscan_strat(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
- bool allow_strat, bool allow_sync)
+ bool allow_strat, bool allow_sync,
+ bool reset_snapshot)
{
uint32 flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
flags |= SO_ALLOW_STRAT;
if (allow_sync)
flags |= SO_ALLOW_SYNC;
+ if (reset_snapshot)
+ {
+ INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+ /* An active snapshot is required at the start of the scan. */
+ Assert(GetActiveSnapshot() == snapshot);
+ /* The active snapshot must not be registered, so xmin can keep advancing. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ flags |= (SO_RESET_SNAPSHOT);
+ }
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1779,6 +1801,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* very hard to detect whether they're really incompatible with the chain tip.
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
+ *
+ * For a non-unique, non-parallel concurrent build, SO_RESET_SNAPSHOT is
+ * applied to the scan. Snapshots are then replaced on the fly, allowing
+ * the xmin horizon to advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
DATA = injection_points--1.0.sql
PGFILEDESC = "injection_points - facility for injection points"
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..5db54530f17
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,107 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local
+----------------------------
+
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
'sql': [
'injection_points',
'reindex_conc',
+ 'cic_reset_snapshots',
],
'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
# The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
--
2.43.0
Attachment: v7-0002-Add-stress-tests-for-concurrent-index-operations.patch (application/octet-stream)
From b4f22a1da4bbbff6a268c0f62196a264cb126896 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v7 2/6] Add stress tests for concurrent index operations
Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations
These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
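For reference, the core scenario these scripts race is an UPSERT loop against a concurrent rebuild of the index used for conflict resolution, verified afterwards with amcheck. A minimal manual sketch of one iteration (using the same objects the test creates) would be:

    -- session 1: repeated UPSERTs on the same key
    INSERT INTO tbl VALUES (13, 0, 0, 0, now())
    ON CONFLICT (i) DO UPDATE SET updated_at = now();

    -- session 2: rebuild the index under that load
    REINDEX INDEX CONCURRENTLY tbl_pkey;

    -- afterwards, verify the rebuilt index
    SELECT bt_index_check('tbl_pkey', heapallindexed => true,
                          checkunique => true);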
---
src/bin/pg_amcheck/meson.build | 1 +
src/bin/pg_amcheck/t/006_cic.pl | 144 ++++++++++++++++++++++++++++++++
2 files changed, 145 insertions(+)
create mode 100644 src/bin/pg_amcheck/t/006_cic.pl
diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..4a8f4fbc8b0 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
't/003_check.pl',
't/004_verify_heapam.pl',
't/005_opclass_damage.pl',
+ 't/006_cic.pl',
],
},
}
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..142e8fb845e
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,144 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+ if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+ c1 money default 0, c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# Create a sequence used to limit the number of rebuild cycles per lock holder
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+ CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN true;
+ END; $$;
+));
+
+$node->safe_psql('postgres', q(
+ CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ RETURN MOD($1, 2) = 0;
+ END; $$;
+));
+
+# Run CIC/RIC with various index definitions concurrently with upserts
+$node->pgbench(
+ '--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+ {
+ 'concurrent_ops' => q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set variant random(0, 5)
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ \if :variant = 0
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+ \elif :variant = 1
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+ \elif :variant = 2
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+ \elif :variant = 3
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+ \elif :variant = 4
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+ \elif :variant = 5
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+ \endif
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ REINDEX INDEX CONCURRENTLY idx_2;
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ DROP INDEX CONCURRENTLY idx_2;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1000, 100000)
+ BEGIN;
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ SELECT setval('in_row_rebuild', 1);
+ COMMIT;
+ \endif
+ )
+ });
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for a unique index concurrently with upserts
+$node->pgbench(
+ '--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE UNIQUE INDEX CONCURRENTLY',
+ {
+ 'concurrent_ops_unique_idx' => q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ REINDEX INDEX CONCURRENTLY idx_2;
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ DROP INDEX CONCURRENTLY idx_2;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1, power(10, random(1, 5)))
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ SELECT setval('in_row_rebuild', 1);
+ \endif
+ )
+ });
+
+$node->stop;
+done_testing();
\ No newline at end of file
--
2.43.0
Attachment: v7-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch (application/octet-stream)
From 1a2a8cc969011974913c22604d608a0d9c4ffa78 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v7 4/6] Allow snapshot resets during parallel concurrent index
builds
Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.
Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
proceeding with scan
- Add regression tests to verify behavior with various index types
The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.
This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
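To exercise the parallel path by hand (a sketch only; table and column names are arbitrary, and the regression tests below drive this through the parallel_workers reloption), force workers for the heap scan before building:

    -- request parallel workers for the table, then build concurrently;
    -- with this patch the workers' snapshots are reset periodically too
    ALTER TABLE tbl SET (parallel_workers = 2);
    CREATE INDEX CONCURRENTLY tbl_j_idx ON tbl (j);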
---
src/backend/access/brin/brin.c | 43 +++++++++-------
src/backend/access/heap/heapam_handler.c | 12 +++--
src/backend/access/nbtree/nbtsort.c | 38 ++++++++++++--
src/backend/access/table/tableam.c | 37 ++++++++++++--
src/backend/access/transam/parallel.c | 50 +++++++++++++++++--
src/backend/executor/nodeSeqscan.c | 3 +-
src/backend/utils/time/snapmgr.c | 8 ---
src/include/access/parallel.h | 3 +-
src/include/access/relscan.h | 1 +
src/include/access/tableam.h | 9 ++--
.../expected/cic_reset_snapshots.out | 23 ++++++++-
.../sql/cic_reset_snapshots.sql | 7 ++-
12 files changed, 178 insertions(+), 56 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d69859ac4df..0782bd64a6a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
*/
BrinShared *brinshared;
Sharedsort *sharedsort;
- Snapshot snapshot;
WalUsage *walusage;
BufferUsage *bufferusage;
} BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
bool isconcurrent, int request);
static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
static double _brin_parallel_heapscan(BrinBuildState *state);
static double _brin_parallel_merge(BrinBuildState *state);
static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
ParallelContext *pcxt;
int scantuplesortstates;
- Snapshot snapshot;
Size estbrinshared;
Size estsort;
BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
- * concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+ * concurrent build, we take a regular MVCC snapshot and push it as active.
+ * Later we index whatever's live according to that snapshot while that
+ * snapshot is reset periodically.
*/
if (!isconcurrent)
{
Assert(ActiveSnapshotSet());
- snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
else
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ Assert(!ActiveSnapshotSet());
PushActiveSnapshot(GetTransactionSnapshot());
}
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
*/
- estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+ estbrinshared = _brin_parallel_estimate_shared(heap);
shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
estsort = tuplesort_estimate_shared(scantuplesortstates);
shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
- UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
table_parallelscan_initialize(heap,
ParallelTableScanFromBrinShared(brinshared),
- snapshot);
+ isconcurrent ? InvalidSnapshot : SnapshotAny,
+ isconcurrent);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->nparticipanttuplesorts++;
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
- brinleader->snapshot = snapshot;
brinleader->walusage = walusage;
brinleader->bufferusage = bufferusage;
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* Save leader state now that it's clear build will be parallel */
buildstate->bs_leader = brinleader;
+ /*
+ * In a concurrent build, snapshots are reset periodically. When the
+ * leader is going to reset its own active snapshot as well, we need to
+ * wait until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
- /* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(brinleader->snapshot))
- UnregisterSnapshot(brinleader->snapshot);
DestroyParallelContext(brinleader->pcxt);
ExitParallelMode();
}
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
/*
* Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
*/
static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
{
/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
return add_size(BUFFERALIGN(sizeof(BrinShared)),
- table_parallelscan_estimate(heap, snapshot));
+ table_parallelscan_estimate(heap, InvalidSnapshot));
}
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 980c51e32b9..2e5163609c1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, or during bootstrap, we take a regular MVCC snapshot
- * and index whatever's live according to that.
+ * and index whatever's live according to that while that snapshot is reset
+ * every so often (in the case of a non-unique index).
*/
OldestXmin = InvalidTransactionId;
/*
* For unique index we need consistent snapshot for the whole scan.
- * In case of parallel scan some additional infrastructure required
- * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
*/
reset_snapshots = indexInfo->ii_Concurrent &&
!indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
- PushActiveSnapshot(snapshot);
- need_pop_active_snapshot = true;
+ if (!reset_snapshots)
+ {
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
+ }
}
hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5c4581afb1a..2acbf121745 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool reset_snapshot;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
scantuplesortstates = leaderparticipates ? request + 1 : request;
+ /*
+ * For concurrent non-unique index builds, we can periodically reset snapshots
+ * to allow the xmin horizon to advance. This is safe since these builds don't
+ * require a consistent view across the entire scan. Unique indexes still need
+ * a stable snapshot to properly enforce uniqueness constraints.
+ */
+ reset_snapshot = isconcurrent && !btspool->isunique;
+
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+ * live according to that, while that snapshot may be reset periodically in
+ * the case of a non-unique index.
*/
if (!isconcurrent)
{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
+ else if (reset_snapshot)
+ {
+ snapshot = InvalidSnapshot;
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
else
{
snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->brokenhotchain = false;
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
- snapshot);
+ snapshot,
+ reset_snapshot);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* Save leader state now that it's clear build will be parallel */
buildstate->btleader = btleader;
+ /*
+ * In a concurrent build, snapshots are reset periodically. When the
+ * leader is going to reset its own active snapshot as well, we need to
+ * wait until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(btleader->snapshot))
+ if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
UnregisterSnapshot(btleader->snapshot);
DestroyParallelContext(btleader->pcxt);
ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
{
Size sz = 0;
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
sz = add_size(sz, EstimateSnapshotSpace(snapshot));
else
- Assert(snapshot == SnapshotAny);
+ Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
void
table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
- Snapshot snapshot)
+ Snapshot snapshot, bool reset_snapshot)
{
Size snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
pscan->phs_snapshot_off = snapshot_off;
- if (IsMVCCSnapshot(snapshot))
+ /*
+ * Initialize parallel scan description. For normal scans with a regular
+ * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+ * snapshot resets, mark the scan accordingly.
+ */
+ if (reset_snapshot)
+ {
+ Assert(snapshot == InvalidSnapshot);
+ pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = true;
+ INJECTION_POINT("table_parallelscan_initialize");
+ }
+ else if (IsMVCCSnapshot(snapshot))
{
SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = false;
}
else
{
Assert(snapshot == SnapshotAny);
+ Assert(!reset_snapshot);
pscan->phs_snapshot_any = true;
+ pscan->phs_reset_snapshot = false;
}
}
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
- if (!pscan->phs_snapshot_any)
+ /*
+ * For scans that use periodic snapshot resets, mark the scan accordingly
+ * and use the active snapshot as the initial state.
+ */
+ if (pscan->phs_reset_snapshot)
+ {
+ Assert(ActiveSnapshotSet());
+ flags |= SO_RESET_SNAPSHOT;
+ /* Start with current active snapshot. */
+ snapshot = GetActiveSnapshot();
+ }
+ else if (!pscan->phs_snapshot_any)
{
/* Snapshot was serialized -- restore it */
snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
#define PARALLEL_KEY_RELMAPPER_STATE UINT64CONST(0xFFFFFFFFFFFF000D)
#define PARALLEL_KEY_UNCOMMITTEDENUMS UINT64CONST(0xFFFFFFFFFFFF000E)
#define PARALLEL_KEY_CLIENTCONNINFO UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED UINT64CONST(0xFFFFFFFFFFFF0010)
/* Fixed-size parallel state. */
typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+ pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate how much we'll need for the entrypoint info. */
shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
char *entrypointstate;
char *uncommittedenumsspace;
char *clientconninfospace;
+ bool *snapshot_set_flag_space;
Size lnamelen;
/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
strcpy(entrypointstate, pcxt->library_name);
strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+ /*
+ * Establish dynamic shared memory to pass information about snapshot
+ * import status.
+ */
+ snapshot_set_flag_space =
+ shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i * sizeof(bool);
+ *pcxt->worker[i].snapshot_restored = false;
+ }
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
}
/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
}
}
+
+ /* Reset each worker's snapshot-restored flag. */
+ if (pcxt->nworkers > 0)
+ {
+ bool *snapshot_restored_space;
+ int i;
+ snapshot_restored_space =
+ shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ for (i = 0; i < pcxt->nworkers; ++i)
+ snapshot_restored_space[i] = false;
+ }
}
/*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* Wait for all workers to attach to their error queues, and throw an error if
* any worker fails to do this.
*
+ * wait_for_snapshot: also wait until each parallel worker has restored its
+ * snapshot. This is needed when using periodic snapshot resets, to ensure
+ * all workers have a valid initial snapshot before proceeding with the scan.
+ *
* Callers can assume that if this function returns successfully, then the
* number of workers given by pcxt->nworkers_launched have initialized and
* attached to their error queues. Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* call this function at all.
*/
void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
{
int i;
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
if (shm_mq_get_sender(mq) != NULL)
{
- /* Yes, so it is known to be attached. */
- pcxt->known_attached_workers[i] = true;
- ++pcxt->nknown_attached_workers;
+ if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+ {
+ /* Yes, so it is known to be attached. */
+ pcxt->known_attached_workers[i] = true;
+ ++pcxt->nknown_attached_workers;
+ }
}
}
else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
shm_toc *toc;
FixedParallelState *fps;
char *error_queue_space;
+ bool *snapshot_restored_space;
shm_mq *mq;
shm_mq_handle *mqh;
char *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
fps->parallel_leader_pgproc);
PushActiveSnapshot(asnapshot);
+ /* Snapshot is restored; set the flag to let the leader know. */
+ snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ snapshot_restored_space[ParallelWorkerNumber] = true;
+
/*
* We've changed which tuples we can see, and must therefore invalidate
* system caches.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
table_parallelscan_initialize(node->ss.ss_currentRelation,
pscan,
- estate->es_snapshot);
+ estate->es_snapshot,
+ false);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 2189bf0d9ae..b3cc7a2c150 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -287,14 +287,6 @@ GetTransactionSnapshot(void)
Snapshot
GetLatestSnapshot(void)
{
- /*
- * We might be able to relax this, but nothing that could otherwise work
- * needs it.
- */
- if (IsInParallelMode())
- elog(ERROR,
- "cannot update SecondarySnapshot during a parallel operation");
-
/*
* So far there are no cases requiring support for GetLatestSnapshot()
* during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
{
BackgroundWorkerHandle *bgwhandle;
shm_mq_handle *error_mqh;
+ bool *snapshot_restored;
} ParallelWorkerInfo;
typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
extern void DestroyParallelContext(ParallelContext *pcxt);
extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e1884acf493..a9603084aeb 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -88,6 +88,7 @@ typedef struct ParallelTableScanDescData
RelFileLocator phs_locator; /* physical relation to scan */
bool phs_syncscan; /* report location to syncscan logic? */
bool phs_snapshot_any; /* SnapshotAny, not phs_snapshot_data? */
+ bool phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
Size phs_snapshot_off; /* data for snapshot */
} ParallelTableScanDescData;
typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f4c7d2a92bf..9ee5ea15fd4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1184,7 +1184,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
*/
extern void table_parallelscan_initialize(Relation rel,
ParallelTableScanDesc pscan,
- Snapshot snapshot);
+ Snapshot snapshot,
+ bool reset_snapshot);
/*
* Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1802,9 +1803,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
*
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of non-unique concurrent index build SO_RESET_SNAPSHOT is applied
+ * for the scan. That leads for changing snapshots on the fly to allow xmin
+ * horizon propagate.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 5db54530f17..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
(1 row)
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,24 +78,35 @@ NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep the test stable, since a parallel worker may complete the scan before the leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach
+-------------------------
+
+(1 row)
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
NOTICE: notice triggered for injection point table_parallelscan_initialize
@@ -97,7 +114,9 @@ REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
NOTICE: drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
SELECT injection_points_set_local();
SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep the test stable, since a parallel worker may complete the scan before the leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
--
2.43.0
Attachment: v7-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch (application/octet-stream)
From f48e59a4b33a4b05e2f08dedadfce8628a8ae094 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v7 5/6] Allow snapshot resets in concurrent unique index
builds
Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.
Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:
1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values
This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
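To make the "addda"-style case concrete, below is a minimal standalone
illustration (plain C, not the patched _bt_load; liveness is precomputed
here as an assumption, whereas the patch probes the heap lazily with
SnapshotSelf and only inside equal-key groups). A run of equal keys is a
violation only if it contains two alive tuples; dead tuples in between are
ignored:

    #include <stdbool.h>
    #include <stdio.h>

    struct tup
    {
        int  key;    /* indexed key value */
        bool alive;  /* result of the visibility check */
    };

    int
    main(void)
    {
        /* sorted by key; key 1 follows the "addda" pattern */
        struct tup spool[] = {
            {1, true}, {1, false}, {1, false}, {1, false}, {1, true},
            {2, false}, {2, true},  /* dead + one alive: no violation */
        };
        int  ntup = sizeof(spool) / sizeof(spool[0]);
        bool seen_alive = false;

        for (int i = 0; i < ntup; i++)
        {
            if (i > 0 && spool[i].key != spool[i - 1].key)
                seen_alive = false;  /* new equal-key group */

            if (spool[i].alive)
            {
                if (seen_alive)
                {
                    printf("unique violation: two alive tuples with key %d\n",
                           spool[i].key);
                    return 1;
                }
                seen_alive = true;
            }
        }
        printf("no alive duplicates\n");
        return 0;
    }

With the key-1 group laid out alive-dead-dead-dead-alive this reports a
violation; remove one of its alive entries and the scan completes cleanly.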
---
src/backend/access/heap/heapam_handler.c | 6 +-
src/backend/access/nbtree/nbtdedup.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 173 ++++++++++++++----
src/backend/access/nbtree/nbtsplitloc.c | 12 +-
src/backend/access/nbtree/nbtutils.c | 29 ++-
src/backend/catalog/index.c | 6 +-
src/backend/utils/sort/tuplesortvariants.c | 67 +++++--
src/include/access/nbtree.h | 4 +-
src/include/access/tableam.h | 5 +-
src/include/utils/tuplesort.h | 1 +
.../expected/cic_reset_snapshots.out | 6 +
11 files changed, 242 insertions(+), 75 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2e5163609c1..921b806642a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, or during bootstrap, we take a regular MVCC snapshot
* and index whatever's live according to that while that snapshot is reset
- * every so often (in the case of a non-unique index).
+ * every so often.
*/
OldestXmin = InvalidTransactionId;
/*
- * For unique index we need consistent snapshot for the whole scan.
+ * For concurrent builds of non-system indexes, we may want to periodically
+ * reset snapshots to allow vacuum to clean up tuples.
*/
reset_snapshots = indexInfo->ii_Concurrent &&
- !indexInfo->ii_Unique &&
!is_system_catalog; /* just for the case */
/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
_bt_dedup_start_pending(state, itup, offnum);
}
else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
- else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
itemid = PageGetItemId(page, minoff);
itup = (IndexTuple) PageGetItem(page, itemid);
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
{
itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
itup = (IndexTuple) PageGetItem(page, itemid);
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
return true;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2acbf121745..ac9e5acfc53 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
Relation index;
bool isunique;
bool nulls_not_distinct;
+ bool unique_dead_ignored;
} BTSpool;
/*
@@ -101,6 +102,7 @@ typedef struct BTShared
Oid indexrelid;
bool isunique;
bool nulls_not_distinct;
+ bool unique_dead_ignored;
bool isconcurrent;
int scantuplesortstates;
@@ -203,15 +205,13 @@ typedef struct BTLeader
*/
typedef struct BTBuildState
{
- bool isunique;
- bool nulls_not_distinct;
bool havedead;
Relation heap;
BTSpool *spool;
/*
- * spool2 is needed only when the index is a unique index. Dead tuples are
- * put into spool2 instead of spool in order to avoid uniqueness check.
+ * spool2 is needed only when the index is unique and built non-concurrently.
+ * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
*/
BTSpool *spool2;
double indtuples;
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
ResetUsage();
#endif /* BTREE_BUILD_STATS */
- buildstate.isunique = indexInfo->ii_Unique;
- buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
buildstate.havedead = false;
buildstate.heap = heap;
buildstate.spool = NULL;
@@ -379,6 +377,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
btspool->index = index;
btspool->isunique = indexInfo->ii_Unique;
btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+ /*
+ * We need to ignore dead tuples for unique checks in case of a concurrent
+ * build. It is required because of the periodic snapshot resets.
+ */
+ btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
/* Save as primary spool */
buildstate->spool = btspool;
@@ -427,8 +430,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* the use of parallelism or any other factor.
*/
buildstate->spool->sortstate =
- tuplesort_begin_index_btree(heap, index, buildstate->isunique,
- buildstate->nulls_not_distinct,
+ tuplesort_begin_index_btree(heap, index, btspool->isunique,
+ btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
maintenance_work_mem, coordinate,
TUPLESORT_NONE);
@@ -436,8 +440,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* If building a unique index, put dead tuples in a second spool to keep
* them out of the uniqueness check. We expect that the second spool (for
* dead tuples) won't get very full, so we give it only work_mem.
+ *
+ * In case of a concurrent build, dead tuples need not be put into the
+ * index, since we wait for all snapshots older than the reference snapshot
+ * during the validation phase.
*/
- if (indexInfo->ii_Unique)
+ if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
{
BTSpool *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
SortCoordinate coordinate2 = NULL;
@@ -468,7 +476,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* full, so we give it only work_mem
*/
buildstate->spool2->sortstate =
- tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+ tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
coordinate2, TUPLESORT_NONE);
}
@@ -1147,13 +1155,116 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
+ bool fail_on_alive_duplicate;
wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
BTGetDeduplicateItems(wstate->index);
+ /*
+ * unique_dead_ignored does not guarantee that the spool contains no
+ * multiple alive tuples with the same values. That may happen if alive
+ * tuples are interleaved with dead ones, e.g. in the order "addda"
+ * (a = alive, d = dead).
+ */
+ fail_on_alive_duplicate = btspool->unique_dead_ignored;
- if (merge)
+ if (fail_on_alive_duplicate)
+ {
+ bool seen_alive = false,
+ prev_tested = false;
+ IndexTuple prev = NULL;
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+ &TTSOpsBufferHeapTuple);
+ IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+ Assert(btspool->isunique);
+ Assert(!btspool2);
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+ {
+ bool tuples_equal = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (prev != NULL) /* not the first tuple */
+ {
+ bool has_nulls = false,
+ call_again, /* just to pass something */
+ ignored, /* just to pass something */
+ now_alive;
+ ItemPointerData tid;
+
+ /* is this tuple equal to the previous one? */
+ if (wstate->inskey->allequalimage)
+ tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+ else
+ tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+ /* handle null values correctly */
+ if (has_nulls && !btspool->nulls_not_distinct)
+ tuples_equal = false;
+
+ if (tuples_equal)
+ {
+ /* check previous tuple if not yet */
+ if (!prev_tested)
+ {
+ call_again = false;
+ tid = prev->t_tid;
+ seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+ prev_tested = true;
+ }
+
+ call_again = false;
+ tid = itup->t_tid;
+ now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+ /* multiple alive tuples detected in an equal-key group? */
+ if (seen_alive && now_alive)
+ {
+ char *key_desc;
+ TupleDesc tupDes = RelationGetDescr(wstate->index);
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values[INDEX_MAX_KEYS];
+
+ index_deform_tuple(itup, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+ /* keep this message in sync with the one in comparetup_index_btree_tiebreak */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(wstate->index)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(wstate->heap,
+ RelationGetRelationName(wstate->index))));
+ }
+ seen_alive |= now_alive;
+ }
+ }
+
+ if (!tuples_equal)
+ {
+ seen_alive = false;
+ prev_tested = false;
+ }
+
+ _bt_buildadd(wstate, state, itup, 0);
+ if (prev)
+ pfree(prev);
+ prev = CopyIndexTuple(itup);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
+ else if (merge)
{
/*
* Another BTSpool for dead tuples exists. Now we have to merge
@@ -1314,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
InvalidOffsetNumber);
}
else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
+ itup, NULL) > keysz &&
_bt_dedup_save_htid(dstate, itup))
{
/*
@@ -1411,7 +1522,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
- bool reset_snapshot;
bool wait_for_snapshot_attach;
int querylen;
@@ -1430,21 +1540,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
scantuplesortstates = leaderparticipates ? request + 1 : request;
- /*
- * For concurrent non-unique index builds, we can periodically reset snapshots
- * to allow the xmin horizon to advance. This is safe since these builds don't
- * require a consistent view across the entire scan. Unique indexes still need
- * a stable snapshot to properly enforce uniqueness constraints.
- */
- reset_snapshot = isconcurrent && !btspool->isunique;
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that, while that snapshot may be reset periodically in
- * the case of a non-unique index.
+ * live according to that, while that snapshot may be reset periodically.
*/
if (!isconcurrent)
{
@@ -1452,16 +1553,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
- else if (reset_snapshot)
+ else
{
+ /*
+ * For concurrent index builds, we can periodically reset snapshots to allow
+ * the xmin horizon to advance. This is safe since these builds don't
+ * require a consistent view across the entire scan.
+ */
snapshot = InvalidSnapshot;
PushActiveSnapshot(GetTransactionSnapshot());
}
- else
- {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
- }
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1531,6 +1632,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->indexrelid = RelationGetRelid(btspool->index);
btshared->isunique = btspool->isunique;
btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+ btshared->unique_dead_ignored = btspool->unique_dead_ignored;
btshared->isconcurrent = isconcurrent;
btshared->scantuplesortstates = scantuplesortstates;
btshared->queryid = pgstat_get_my_query_id();
@@ -1545,7 +1647,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
snapshot,
- reset_snapshot);
+ isconcurrent);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1626,7 +1728,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* leader is going to reset its own active snapshot as well, we need to
* wait until all workers have imported the initial snapshot.
*/
- wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+ wait_for_snapshot_attach = isconcurrent && leaderparticipates;
if (wait_for_snapshot_attach)
WaitForParallelWorkersToAttach(pcxt, true);
@@ -1742,6 +1844,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
leaderworker->index = buildstate->spool->index;
leaderworker->isunique = buildstate->spool->isunique;
leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+ leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
/* Initialize second spool, if required */
if (!btleader->btshared->isunique)
@@ -1845,11 +1948,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
btspool->index = indexRel;
btspool->isunique = btshared->isunique;
btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+ btspool->unique_dead_ignored = btshared->unique_dead_ignored;
/* Look up shared state private to tuplesort.c */
sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
tuplesort_attach_shared(sharedsort, seg);
- if (!btshared->isunique)
+ if (!btshared->isunique || btshared->isconcurrent)
{
btspool2 = NULL;
sharedsort2 = NULL;
@@ -1928,6 +2032,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
btspool->index,
btspool->isunique,
btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
sortmem, coordinate,
TUPLESORT_NONE);
@@ -1950,14 +2055,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
coordinate2->nParticipants = -1;
coordinate2->sharedsort = sharedsort2;
btspool2->sortstate =
- tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+ tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
Min(sortmem, work_mem), coordinate2,
false);
}
/* Fill in buildstate for _bt_build_callback() */
- buildstate.isunique = btshared->isunique;
- buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
buildstate.havedead = false;
buildstate.heap = btspool->heap;
buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
{
itemid = PageGetItemId(state->origpage, maxoff);
tup = (IndexTuple) PageGetItem(state->origpage, itemid);
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* avoid appending a heap TID in new high key, we're done. Finish split
* with default strategy and initial split interval.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
if (perfectpenalty <= indnkeyatts)
return perfectpenalty;
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* If page is entirely full of duplicates, a single value strategy split
* will be performed.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
if (perfectpenalty <= indnkeyatts)
{
*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
itemid = PageGetItemId(state->origpage, P_HIKEY);
hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
- state->newitem);
+ state->newitem, NULL);
if (perfectpenalty <= indnkeyatts)
*strategy = SPLIT_SINGLE_VALUE;
else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
lastleft = _bt_split_lastleft(state, split);
firstright = _bt_split_firstright(state, split);
- return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+ return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 50cbf06cb45..3d6dda4ace8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
ScanDirection dir, bool *continuescan);
static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
int tupnatts, TupleDesc tupdesc);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
/*
@@ -4672,7 +4670,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
/* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+ keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
#ifdef DEBUG_NO_TRUNCATE
/* Force truncation to be ineffective for testing purposes */
@@ -4790,17 +4788,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
/*
* _bt_keep_natts - how many key attributes to keep when truncating.
*
+ * This is exported to be used as a comparison function during a concurrent
+ * unique index build, in case _bt_keep_natts_fast is not suitable because
+ * the collation is not "allequalimage"/deduplication-safe.
+ *
* Caller provides two tuples that enclose a split point. Caller's insertion
* scankey is used to compare the tuples; the scankey's argument values are
* not considered here.
*
+ * *hasnulls is set to true if any key column in either tuple is null.
+ *
* This can return a number of attributes that is one greater than the
* number of key attributes for the index relation. This indicates that the
* caller must use a heap TID as a unique-ifier in new pivot tuple.
*/
-static int
+int
_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
+ BTScanInsert itup_key,
+ bool *hasnulls)
{
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
TupleDesc itupdesc = RelationGetDescr(rel);
@@ -4826,6 +4831,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ if (hasnulls)
+ (*hasnulls) |= (isNull1 || isNull2);
if (isNull1 != isNull2)
break;
@@ -4845,7 +4852,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* expected in an allequalimage index.
*/
Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+ keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
return keepnatts;
}
@@ -4856,7 +4863,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* This is exported so that a candidate split point can have its effect on
* suffix truncation inexpensively evaluated ahead of time when finding a
* split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. Also, it may be used as a comparison function during a
+ * concurrent unique index build.
*
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4865,6 +4873,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* "equal image" columns, routine is guaranteed to give the same result as
* _bt_keep_natts would.
*
+ * *hasnulls is set to true if any key column in either tuple is null.
+ *
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts, even when the index uses
* an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4873,7 +4883,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* more balanced split point.
*/
int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ bool *hasnulls)
{
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4890,6 +4901,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ if (hasnulls)
+ *hasnulls |= (isNull1 | isNull2);
att = TupleDescAttr(itupdesc, attnum - 1);
if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e0ada5ce159..f6a1a2f3f90 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3292,9 +3292,9 @@ IndexCheckExclusion(Relation heapRelation,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
*
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..aa4fcaac9a0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -123,6 +123,7 @@ typedef struct
bool enforceUnique; /* complain if we find duplicate tuples */
bool uniqueNullsNotDistinct; /* unique constraint null treatment */
+ bool uniqueDeadIgnored; /* ignore dead tuples in unique check */
} TuplesortIndexBTreeArg;
/*
@@ -349,6 +350,7 @@ tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem,
SortCoordinate coordinate,
int sortopt)
@@ -391,6 +393,7 @@ tuplesort_begin_index_btree(Relation heapRel,
arg->index.indexRel = indexRel;
arg->enforceUnique = enforceUnique;
arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+ arg->uniqueDeadIgnored = uniqueDeadIgnored;
indexScanKey = _bt_mkscankey(indexRel, NULL);
@@ -1520,6 +1523,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
char *key_desc;
+ bool uniqueCheckFail = true;
/*
* Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1533,57 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
*/
Assert(tuple1 != tuple2);
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
+ /* This is a fail-fast check; see _bt_load for details. */
+ if (arg->uniqueDeadIgnored)
+ {
+ bool any_tuple_dead,
+ call_again = false,
+ ignored;
+
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+ &TTSOpsBufferHeapTuple);
+ ItemPointerData tid = tuple1->t_tid;
+
+ IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+ any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+ if (!any_tuple_dead)
+ {
+ call_again = false;
+ tid = tuple2->t_tid;
+ any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+ &ignored);
+ }
+
+ if (any_tuple_dead)
+ {
+ elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+ ItemPointerGetBlockNumber(&tuple1->t_tid),
+ ItemPointerGetOffsetNumber(&tuple1->t_tid),
+ ItemPointerGetBlockNumber(&tuple2->t_tid),
+ ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+ uniqueCheckFail = false;
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
+ if (uniqueCheckFail)
+ {
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ /* keep this error message in sync with the one in _bt_load */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
}
/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
extern char *btbuildphasename(int64 phasenum);
extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key, bool *hasnulls);
extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
+ IndexTuple firstright, bool *hasnulls);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9ee5ea15fd4..ec3769585c3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1803,9 +1803,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
*
- * In case of non-unique concurrent index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In case of concurrent index build SO_RESET_SNAPSHOT is applied for the scan.
+ * That leads for changing snapshots on the fly to allow xmin horizon propagate.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem, SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
----------------
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
(1 row)
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_parallelscan_initialize
--
2.43.0
Attachment: v7-0006-Add-STIR-Short-Term-Index-Replacement-access-meth.patch (application/octet-stream)
From ccad95c4c080d0a73d7e5c1458fde825b559f9fe Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v7 6/6] Add STIR (Short-Term Index Replacement) access method
This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:
- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
operation is validating a newly built index (e.g., during concurrent build).
Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.
These changes lay essential groundwork for further improvements to concurrent
index builds.
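As a rough SQL-level illustration of these guard rails (a sketch for this
message only, not part of the patch; table and index names are invented),
STIR is deliberately unusable as a regular index:

  CREATE TABLE stir_demo (i int);

  -- Expected to fail: stirbuild() errors out unless ii_Auxiliary is set,
  -- and only the concurrent-build machinery sets that flag internally.
  CREATE INDEX stir_demo_idx ON stir_demo USING stir (i);

Scans are off the table as well: amgettuple and amgetbitmap are NULL, and
ambeginscan raises a "not implemented" error.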
---
contrib/pgstattuple/pgstattuple.c | 3 +
src/backend/access/Makefile | 2 +-
src/backend/access/heap/vacuumlazy.c | 2 +
src/backend/access/meson.build | 1 +
src/backend/access/stir/Makefile | 18 +
src/backend/access/stir/meson.build | 5 +
src/backend/access/stir/stir.c | 576 +++++++++++++++++++++++
src/backend/catalog/index.c | 1 +
src/backend/commands/analyze.c | 1 +
src/backend/commands/vacuumparallel.c | 1 +
src/backend/nodes/makefuncs.c | 1 +
src/include/access/genam.h | 1 +
src/include/access/reloptions.h | 3 +-
src/include/access/stir.h | 117 +++++
src/include/catalog/pg_am.dat | 3 +
src/include/catalog/pg_opclass.dat | 3 +
src/include/catalog/pg_opfamily.dat | 2 +
src/include/catalog/pg_proc.dat | 4 +
src/include/nodes/execnodes.h | 6 +-
src/include/utils/index_selfuncs.h | 8 +
src/test/regress/expected/amutils.out | 8 +-
src/test/regress/expected/opr_sanity.out | 7 +-
src/test/regress/expected/psql.out | 24 +-
23 files changed, 779 insertions(+), 18 deletions(-)
create mode 100644 src/backend/access/stir/Makefile
create mode 100644 src/backend/access/stir/meson.build
create mode 100644 src/backend/access/stir/stir.c
create mode 100644 src/include/access/stir.h
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
case SPGIST_AM_OID:
err = "spgist index";
break;
+ case STIR_AM_OID:
+ err = "stir index";
+ break;
case BRIN_AM_OID:
err = "brin index";
break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \
- sequence table tablesample transam
+ stir sequence table tablesample transam
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f2ca9430581..bec79b48cb2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
ivinfo.message_level = DEBUG2;
ivinfo.num_heap_tuples = reltuples;
ivinfo.strategy = vacrel->bstrategy;
+ ivinfo.validate_index = false;
/*
* Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
ivinfo.num_heap_tuples = reltuples;
ivinfo.strategy = vacrel->bstrategy;
+ ivinfo.validate_index = false;
/*
* Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 62a371db7f7..63ee0ef134d 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
subdir('rmgrdesc')
subdir('sequence')
subdir('spgist')
+subdir('stir')
subdir('table')
subdir('tablesample')
subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/stir
+#
+# IDENTIFICATION
+# src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+ 'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..83aa255176f
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,576 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ * Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "commands/vacuum.h"
+#include "utils/index_selfuncs.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "utils/catcache.h"
+#include "access/amvalidate.h"
+#include "utils/syscache.h"
+#include "access/htup_details.h"
+#include "catalog/pg_amproc.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "utils/regproc.h"
+#include "storage/bufmgr.h"
+#include "access/tableam.h"
+#include "access/reloptions.h"
+#include "utils/memutils.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+ IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+ /* Set STIR-specific strategy and procedure numbers */
+ amroutine->amstrategies = STIR_NSTRATEGIES;
+ amroutine->amsupport = STIR_NPROC;
+ amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+ /* STIR doesn't support most index operations */
+ amroutine->amcanorder = false;
+ amroutine->amcanorderbyop = false;
+ amroutine->amcanbackward = false;
+ amroutine->amcanunique = false;
+ amroutine->amcanmulticol = true;
+ amroutine->amoptionalkey = true;
+ amroutine->amsearcharray = false;
+ amroutine->amsearchnulls = false;
+ amroutine->amstorage = false;
+ amroutine->amclusterable = false;
+ amroutine->ampredlocks = false;
+ amroutine->amcanparallel = false;
+ amroutine->amcanbuildparallel = false;
+ amroutine->amcaninclude = true;
+ amroutine->amusemaintenanceworkmem = false;
+ amroutine->amparallelvacuumoptions =
+ VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+ amroutine->amkeytype = InvalidOid;
+
+ /* Set up function callbacks */
+ amroutine->ambuild = stirbuild;
+ amroutine->ambuildempty = stirbuildempty;
+ amroutine->aminsert = stirinsert;
+ amroutine->aminsertcleanup = NULL;
+ amroutine->ambulkdelete = stirbulkdelete;
+ amroutine->amvacuumcleanup = stirvacuumcleanup;
+ amroutine->amcanreturn = NULL;
+ amroutine->amcostestimate = stircostestimate;
+ amroutine->amoptions = stiroptions;
+ amroutine->amproperty = NULL;
+ amroutine->ambuildphasename = NULL;
+ amroutine->amvalidate = stirvalidate;
+ amroutine->amadjustmembers = NULL;
+ amroutine->ambeginscan = stirbeginscan;
+ amroutine->amrescan = stirrescan;
+ amroutine->amgettuple = NULL;
+ amroutine->amgetbitmap = NULL;
+ amroutine->amendscan = stirendscan;
+ amroutine->ammarkpos = NULL;
+ amroutine->amrestrpos = NULL;
+ amroutine->amestimateparallelscan = NULL;
+ amroutine->aminitparallelscan = NULL;
+ amroutine->amparallelrescan = NULL;
+
+ PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+ bool result = true;
+ HeapTuple classtup;
+ Form_pg_opclass classform;
+ Oid opfamilyoid;
+ HeapTuple familytup;
+ Form_pg_opfamily familyform;
+ char *opfamilyname;
+ CatCList *proclist,
+ *oprlist;
+ int i;
+
+ /* Fetch opclass information */
+ classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+ if (!HeapTupleIsValid(classtup))
+ elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+ classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+ opfamilyoid = classform->opcfamily;
+
+
+ /* Fetch opfamily information */
+ familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+ if (!HeapTupleIsValid(familytup))
+ elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+ familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+ opfamilyname = NameStr(familyform->opfname);
+
+ /* Fetch all operators and support functions of the opfamily */
+ oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+ proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+ /* Check individual operators */
+ for (i = 0; i < oprlist->n_members; i++)
+ {
+ HeapTuple oprtup = &oprlist->members[i]->tuple;
+ Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+ /* Check that it is an allowed strategy for stir */
+ if (oprform->amopstrategy < 1 ||
+ oprform->amopstrategy > STIR_NSTRATEGIES)
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+ opfamilyname,
+ format_operator(oprform->amopopr),
+ oprform->amopstrategy)));
+ result = false;
+ }
+
+ /* stir doesn't support ORDER BY operators */
+ if (oprform->amoppurpose != AMOP_SEARCH ||
+ OidIsValid(oprform->amopsortfamily))
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+ opfamilyname,
+ format_operator(oprform->amopopr))));
+ result = false;
+ }
+
+ /* Check operator signature --- same for all stir strategies */
+ if (!check_amop_signature(oprform->amopopr, BOOLOID,
+ oprform->amoplefttype,
+ oprform->amoprighttype))
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains operator %s with wrong signature",
+ opfamilyname,
+ format_operator(oprform->amopopr))));
+ result = false;
+ }
+ }
+
+
+ ReleaseCatCacheList(proclist);
+ ReleaseCatCacheList(oprlist);
+ ReleaseSysCache(familytup);
+ ReleaseSysCache(classtup);
+
+ return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+ StirMetaPageData *metadata;
+
+ StirInitPage(metaPage, STIR_META);
+ metadata = StirPageGetMeta(metaPage);
+ memset(metadata, 0, sizeof(StirMetaPageData));
+ metadata->magickNumber = STIR_MAGICK_NUMBER;
+ metadata->skipInserts = skipInserts;
+ ((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+ Buffer metaBuffer;
+ Page metaPage;
+ GenericXLogState *state;
+
+ /*
+ * Make a new page; since it is first page it should be associated with
+ * block number 0 (STIR_METAPAGE_BLKNO). No need to hold the extension
+ * lock because there cannot be concurrent inserters yet.
+ */
+ metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+ Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+ /* Initialize contents of meta page */
+ state = GenericXLogStart(index);
+ metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+ GENERIC_XLOG_FULL_IMAGE);
+ StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+ GenericXLogFinish(state);
+
+ UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+ StirPageOpaque opaque;
+
+ PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+ opaque = StirPageGetOpaque(page);
+ opaque->flags = flags;
+ opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+ StirTuple *itup;
+ StirPageOpaque opaque;
+ Pointer ptr;
+
+ /* We shouldn't be pointed to an invalid page */
+ Assert(!PageIsNew(page));
+
+ /* Does new tuple fit on the page? */
+ if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+ return false;
+
+ /* Copy new tuple to the end of page */
+ opaque = StirPageGetOpaque(page);
+ itup = StirPageGetTuple(page, opaque->maxoff + 1);
+ memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+ /* Adjust maxoff and pd_lower */
+ opaque->maxoff++;
+ ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+ ((PageHeader) page)->pd_lower = ptr - page;
+
+ /* Assert we didn't overrun available space */
+ Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+ return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo)
+{
+ StirTuple *itup;
+ MemoryContext oldCtx;
+ MemoryContext insertCtx;
+ StirMetaPageData *metaData;
+ Buffer buffer,
+ metaBuffer;
+ Page page;
+ GenericXLogState *state;
+ uint16 blkNo;
+
+ /* Create temporary context for insert operation */
+ insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Stir insert temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+
+ oldCtx = MemoryContextSwitchTo(insertCtx);
+
+ /* Create new tuple with heap pointer */
+ itup = (StirTuple *) palloc0(sizeof(StirTuple));
+ itup->heapPtr = *ht_ctid;
+
+ metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+ for (;;)
+ {
+ LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+ metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+ /* Check if inserts are allowed */
+ if (metaData->skipInserts)
+ {
+ UnlockReleaseBuffer(metaBuffer);
+ return false;
+ }
+ blkNo = metaData->lastBlkNo;
+ /* Don't hold metabuffer lock while doing insert */
+ LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+ if (blkNo > 0)
+ {
+ buffer = ReadBuffer(index, blkNo);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ page = GenericXLogRegisterBuffer(state, buffer, 0);
+
+ Assert(!PageIsNew(page));
+
+ /* Try to add tuple to existing page */
+ if (StirPageAddItem(page, itup))
+ {
+ /* Success! Apply the change, clean up, and exit */
+ GenericXLogFinish(state);
+ UnlockReleaseBuffer(buffer);
+ ReleaseBuffer(metaBuffer);
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextDelete(insertCtx);
+ return false;
+ }
+
+ /* Didn't fit, must try other pages */
+ GenericXLogAbort(state);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ /* Need to add new page - get exclusive lock on meta page */
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ metaData = StirPageGetMeta(GenericXLogRegisterBuffer(state, metaBuffer, GENERIC_XLOG_FULL_IMAGE));
+ /* Check if another backend already extended the index */
+
+ if (blkNo != metaData->lastBlkNo)
+ {
+ Assert(blkNo < metaData->lastBlkNo);
+ /* Someone else inserted a new page into the index; let's try again */
+ GenericXLogAbort(state);
+ LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+ else
+ {
+ /* Must extend the file */
+ buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
+
+ page = GenericXLogRegisterBuffer(state, buffer, GENERIC_XLOG_FULL_IMAGE);
+ StirInitPage(page, 0);
+
+ if (!StirPageAddItem(page, itup))
+ {
+ /* We shouldn't be here since we're inserting into an empty page */
+ elog(ERROR, "could not add new stir tuple to empty page");
+ }
+
+ /* Update meta page with new last block number */
+ metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+ GenericXLogFinish(state);
+
+ UnlockReleaseBuffer(buffer);
+ UnlockReleaseBuffer(metaBuffer);
+
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextDelete(insertCtx);
+
+ return false;
+ }
+ }
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+ ScanKey orderbys, int norderbys)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+ struct IndexInfo *indexInfo)
+{
+ IndexBuildResult *result;
+
+ if (!indexInfo->ii_Auxiliary)
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can only be built as auxiliary indexes")));
+
+ StirInitMetapage(index, MAIN_FORKNUM);
+
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+ result->heap_tuples = 0;
+ result->index_tuples = 0;
+ return result;
+}
+
+void stirbuildempty(Relation index)
+{
+ StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats,
+ IndexBulkDeleteCallback callback,
+ void *callback_state)
+{
+ Relation index = info->index;
+ BlockNumber blkno, npages;
+ Buffer buffer;
+ Page page;
+
+ /* For a normal VACUUM, mark the index to skip inserts and warn that it needs to be dropped */
+ if (!info->validate_index)
+ {
+ StirMarkAsSkipInserts(index);
+
+ ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("\"%s\" is not a not implemented, seems like this index need to be dropped", __func__)));
+ return NULL;
+ }
+
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ /*
+ * Iterate over the pages. We don't care about concurrently added pages,
+ * because the index is marked as not-ready at that moment and is not
+ * used for inserts.
+ */
+ npages = RelationGetNumberOfBlocks(index);
+ for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+ {
+ StirTuple *itup, *itupEnd;
+
+ vacuum_delay_point();
+
+ buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+ RBM_NORMAL, info->strategy);
+
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buffer);
+
+ if (PageIsNew(page))
+ {
+ UnlockReleaseBuffer(buffer);
+ continue;
+ }
+
+ itup = StirPageGetTuple(page, FirstOffsetNumber);
+ itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+ while (itup < itupEnd)
+ {
+ /* Do we have to delete this tuple? */
+ if (callback(&itup->heapPtr, callback_state))
+ {
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+ }
+
+ itup = StirPageGetNextTuple(itup);
+ }
+
+ UnlockReleaseBuffer(buffer);
+ }
+
+ return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+ StirMetaPageData *metaData;
+ Buffer metaBuffer;
+ Page metaPage;
+ GenericXLogState *state;
+
+ metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+ GENERIC_XLOG_FULL_IMAGE);
+ metaData = StirPageGetMeta(metaPage);
+ if (!metaData->skipInserts)
+ {
+ metaData->skipInserts = true;
+ GenericXLogFinish(state);
+ }
+ else
+ {
+ GenericXLogAbort(state);
+ }
+ UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats)
+{
+ StirMarkAsSkipInserts(info->index);
+ ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("\"%s\" is not a not implemented, seems like this index need to be dropped", __func__)));
+ return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+ return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+ double loop_count, Cost *indexStartupCost,
+ Cost *indexTotalCost, Selectivity *indexSelectivity,
+ double *indexCorrelation, double *indexPages)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f6a1a2f3f90..82816580e3c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3402,6 +3402,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
ivinfo.message_level = DEBUG2;
ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
ivinfo.strategy = NULL;
+ ivinfo.validate_index = true;
/*
* Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282f..d54d310ba43 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
ivinfo.message_level = elevel;
ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
ivinfo.strategy = vac_strategy;
+ ivinfo.validate_index = false;
stats = index_vacuum_cleanup(&ivinfo, NULL);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a564..e4327b4f7dc 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
ivinfo.estimated_count = pvs->shared->estimated_count;
ivinfo.num_heap_tuples = pvs->shared->reltuples;
ivinfo.strategy = pvs->bstrategy;
+ ivinfo.validate_index = false;
/* Update error traceback information */
pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 7e5df7bea4d..44a8a1f2875 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
/* initialize index-build state to default */
n->ii_BrokenHotChain = false;
n->ii_ParallelWorkers = 0;
+ n->ii_Auxiliary = false;
/* set up for possible use by index AM */
n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 81653febc18..194dbbe1d0e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
bool estimated_count; /* num_heap_tuples is an estimate */
int message_level; /* ereport level for progress messages */
double num_heap_tuples; /* tuples remaining in heap */
+ bool validate_index; /* validating concurrently built index? */
BufferAccessStrategy strategy; /* access strategy for reads */
} IndexVacuumInfo;
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index df6923c9d50..0966397d344 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
RELOPT_KIND_VIEW = (1 << 9),
RELOPT_KIND_BRIN = (1 << 10),
RELOPT_KIND_PARTITIONED = (1 << 11),
+ RELOPT_KIND_STIR = (1 << 12),
/* if you add a new kind, make sure you update "last_default" too */
- RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+ RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ * header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC 0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES 1
+
+#define STIR_OPTIONS_PROC 0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+ ((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page) ((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+ ((StirTuple *)(PageGetContents(page) \
+ + sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+ ((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO (0)
+#define STIR_HEAD_BLKNO (1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+ OffsetNumber maxoff; /* number of index tuples on page */
+ uint16 flags; /* see bit definitions below */
+ uint16 unused; /* placeholder to force maxaligning of size of
+ * StirPageOpaqueData and to place
+ * stir_page_id exactly at the end of page */
+ uint16 stir_page_id; /* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META (1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID 0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+ uint32 magickNumber;
+ uint16 lastBlkNo;
+ bool skipInserts; /* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page) ((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+ ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+ - StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+ - MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+ ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+ struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+ void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index db874902820..51350df0bf0 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
{ oid => '3580', oid_symbol => 'BRIN_AM_OID',
descr => 'block range index (BRIN) access method',
amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+ descr => 'short term index replacement access method',
+ amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index f503c652ebc..7067452a035 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,7 @@
# no brin opclass for the geometric types except box
+# allow any types for STIR
+{ opcmethod => 'stir', opcname => 'stir_ops', opcfamily => 'stir/any_ops',
+ opcintype => 'any' },
]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index c8ac8c73def..41ea0c3ca50 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
opfmethod => 'hash', opfname => 'multirange_ops' },
{ oid => '6158',
opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+ opfmethod => 'stir', opfname => 'any_ops' },
]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0f22c217235..59f50e2b027 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
proname => 'brinhandler', provolatile => 'v',
prorettype => 'index_am_handler', proargtypes => 'internal',
prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+ proname => 'stirhandler', provolatile => 'v',
+ prorettype => 'index_am_handler', proargtypes => 'internal',
+ prosrc => 'stirhandler' },
{ oid => '3952', descr => 'brin: standalone scan new table pages',
proname => 'brin_summarize_new_values', provolatile => 'v',
proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7f71b7625df..748655fd0cf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
* BrokenHotChain did we detect any broken HOT chains?
* Summarizing is it a summarizing index?
* ParallelWorkers # of workers requested (excludes leader)
+ * Auxiliary is it an auxiliary index for a concurrent build?
* Am Oid of index AM
* AmCache private cache area for index AM
* Context memory context holding this IndexInfo
*
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
* ----------------
*/
typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
bool ii_Summarizing;
bool ii_WithoutOverlaps;
int ii_ParallelWorkers;
+ bool ii_Auxiliary;
Oid ii_Am;
void *ii_AmCache;
MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index a41cd2b7fd9..61f3d3dea0c 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
Selectivity *indexSelectivity,
double *indexCorrelation,
double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+ struct IndexPath *path,
+ double loop_count,
+ Cost *indexStartupCost,
+ Cost *indexTotalCost,
+ Selectivity *indexSelectivity,
+ double *indexCorrelation,
+ double *indexPages);
extern void gincostestimate(struct PlannerInfo *root,
struct IndexPath *path,
double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
spgist | can_exclude | t
spgist | can_include | t
spgist | bogus |
-(36 rows)
+ stir | can_order | f
+ stir | can_unique | f
+ stir | can_multi_col | t
+ stir | can_exclude | f
+ stir | can_include | t
+ stir | bogus |
+(42 rows)
--
-- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
WHERE a1.amopfamily = c1.opcfamily
AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily
----------+-----------
-(0 rows)
+ opcname | opcfamily
+----------+-----------
+ stir_ops | 5558
+(1 row)
-- Check that each operator listed in pg_amop has an associated opclass,
-- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
heap | Table
heap2 | Table
spgist | Index
-(8 rows)
+ stir | Index
+(9 rows)
\dA *
List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
heap | Table
heap2 | Table
spgist | Index
-(8 rows)
+ stir | Index
+(9 rows)
\dA h*
List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
\dA: extra argument "bar" ignored
\dA+
- List of access methods
- Name | Type | Handler | Description
---------+-------+----------------------+----------------------------------------
+ List of access methods
+ Name | Type | Handler | Description
+--------+-------+----------------------+--------------------------------------------
brin | Index | brinhandler | block range index (BRIN) access method
btree | Index | bthandler | b-tree index access method
gin | Index | ginhandler | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
heap | Table | heap_tableam_handler | heap table access method
heap2 | Table | heap_tableam_handler |
spgist | Index | spghandler | SP-GiST index access method
-(8 rows)
+ stir | Index | stirhandler | short term index replacement access method
+(9 rows)
\dA+ *
- List of access methods
- Name | Type | Handler | Description
---------+-------+----------------------+----------------------------------------
+ List of access methods
+ Name | Type | Handler | Description
+--------+-------+----------------------+--------------------------------------------
brin | Index | brinhandler | block range index (BRIN) access method
btree | Index | bthandler | b-tree index access method
gin | Index | ginhandler | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
heap | Table | heap_tableam_handler | heap table access method
heap2 | Table | heap_tableam_handler |
spgist | Index | spghandler | SP-GiST index access method
-(8 rows)
+ stir | Index | stirhandler | short term index replacement access method
+(9 rows)
\dA+ h*
List of access methods
--
2.43.0
Hello!
Now STIR is used for validation (though without snapshot resets during
that phase, for now).
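As a quick way to observe the auxiliary index from psql while a CREATE INDEX
CONCURRENTLY is in flight (a sketch only; the auxiliary relation's name is
chosen internally, so filtering by access method is simplest):

  SELECT c.relname, a.amname
  FROM pg_class c
  JOIN pg_am a ON a.oid = c.relam
  WHERE a.amname = 'stir';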
Best regards,
Mikhail.
Attachments:
v8-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch
From 3c82e0404db908491bd0ebaf1d177f9741c6c6ab Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v8 5/7] Allow snapshot resets in concurrent unique index
builds
Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.
Now snapshots are reset periodically during concurrent unique index builds,
while uniqueness is still maintained by:
1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values
This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
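To make the interleaving hazard concrete, here is a sketch (invented table
and values) of how dead and alive versions of the same key can sit next to
each other in the sort input of a concurrent build:

  CREATE TABLE t (id int);
  INSERT INTO t VALUES (1);
  UPDATE t SET id = 1;  -- old row version is dead, new one alive, same key
  UPDATE t SET id = 1;  -- further dead versions of key 1 accumulate

  -- The spool may now contain several tuples with key 1; dead ones are
  -- ignored by the tuplesort uniqueness check, and _bt_load rechecks that
  -- each group of equal keys contains at most one alive tuple.
  CREATE UNIQUE INDEX CONCURRENTLY t_id_uidx ON t (id);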
---
src/backend/access/heap/heapam_handler.c | 6 +-
src/backend/access/nbtree/nbtdedup.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 173 ++++++++++++++----
src/backend/access/nbtree/nbtsplitloc.c | 12 +-
src/backend/access/nbtree/nbtutils.c | 29 ++-
src/backend/catalog/index.c | 8 +-
src/backend/commands/indexcmds.c | 4 +-
src/backend/utils/sort/tuplesortvariants.c | 67 +++++--
src/include/access/nbtree.h | 4 +-
src/include/access/tableam.h | 5 +-
src/include/utils/tuplesort.h | 1 +
.../expected/cic_reset_snapshots.out | 6 +
12 files changed, 245 insertions(+), 78 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2e5163609c1..921b806642a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, or during bootstrap, we take a regular MVCC snapshot
* and index whatever's live according to that while that snapshot is reset
- * every so often (in case of non-unique index).
+ * every so often.
*/
OldestXmin = InvalidTransactionId;
/*
- * For unique index we need consistent snapshot for the whole scan.
+ * For concurrent builds of non-system indexes, we may want to periodically
+ * reset snapshots to allow vacuum to clean up tuples.
*/
reset_snapshots = indexInfo->ii_Concurrent &&
- !indexInfo->ii_Unique &&
!is_system_catalog; /* just for the case */
/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
_bt_dedup_start_pending(state, itup, offnum);
}
else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
- else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
itemid = PageGetItemId(page, minoff);
itup = (IndexTuple) PageGetItem(page, itemid);
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
{
itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
itup = (IndexTuple) PageGetItem(page, itemid);
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
return true;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2acbf121745..ac9e5acfc53 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
Relation index;
bool isunique;
bool nulls_not_distinct;
+ bool unique_dead_ignored;
} BTSpool;
/*
@@ -101,6 +102,7 @@ typedef struct BTShared
Oid indexrelid;
bool isunique;
bool nulls_not_distinct;
+ bool unique_dead_ignored;
bool isconcurrent;
int scantuplesortstates;
@@ -203,15 +205,13 @@ typedef struct BTLeader
*/
typedef struct BTBuildState
{
- bool isunique;
- bool nulls_not_distinct;
bool havedead;
Relation heap;
BTSpool *spool;
/*
- * spool2 is needed only when the index is a unique index. Dead tuples are
- * put into spool2 instead of spool in order to avoid uniqueness check.
+ * spool2 is needed only when the index is a unique index built non-concurrently.
+ * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
*/
BTSpool *spool2;
double indtuples;
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
ResetUsage();
#endif /* BTREE_BUILD_STATS */
- buildstate.isunique = indexInfo->ii_Unique;
- buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
buildstate.havedead = false;
buildstate.heap = heap;
buildstate.spool = NULL;
@@ -379,6 +377,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
btspool->index = index;
btspool->isunique = indexInfo->ii_Unique;
btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+ /*
+ * We need to ignore dead tuples for unique checks in case of a concurrent build.
+ * It is required because of the periodic snapshot resets.
+ */
+ btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
/* Save as primary spool */
buildstate->spool = btspool;
@@ -427,8 +430,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* the use of parallelism or any other factor.
*/
buildstate->spool->sortstate =
- tuplesort_begin_index_btree(heap, index, buildstate->isunique,
- buildstate->nulls_not_distinct,
+ tuplesort_begin_index_btree(heap, index, btspool->isunique,
+ btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
maintenance_work_mem, coordinate,
TUPLESORT_NONE);
@@ -436,8 +440,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* If building a unique index, put dead tuples in a second spool to keep
* them out of the uniqueness check. We expect that the second spool (for
* dead tuples) won't get very full, so we give it only work_mem.
+ *
+ * In case of a concurrent build, dead tuples do not need to be put into the
+ * index since we wait for all snapshots older than the reference snapshot
+ * during the validation phase.
*/
- if (indexInfo->ii_Unique)
+ if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
{
BTSpool *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
SortCoordinate coordinate2 = NULL;
@@ -468,7 +476,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* full, so we give it only work_mem
*/
buildstate->spool2->sortstate =
- tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+ tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
coordinate2, TUPLESORT_NONE);
}
@@ -1147,13 +1155,116 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
+ bool fail_on_alive_duplicate;
wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
BTGetDeduplicateItems(wstate->index);
+ /*
+ * unique_dead_ignored does not guarantee that the spool is free of multiple
+ * alive tuples with the same values. That may happen if alive tuples are
+ * interleaved with dead ones, like this: addda.
+ */
+ fail_on_alive_duplicate = btspool->unique_dead_ignored;
- if (merge)
+ if (fail_on_alive_duplicate)
+ {
+ bool seen_alive = false,
+ prev_tested = false;
+ IndexTuple prev = NULL;
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+ &TTSOpsBufferHeapTuple);
+ IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+ Assert(btspool->isunique);
+ Assert(!btspool2);
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+ {
+ bool tuples_equal = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (prev != NULL) /* not the first tuple */
+ {
+ bool has_nulls = false,
+ call_again, /* just to pass something */
+ ignored, /* just to pass something */
+ now_alive;
+ ItemPointerData tid;
+
+ /* is this tuple equal to the previous one? */
+ if (wstate->inskey->allequalimage)
+ tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+ else
+ tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+ /* handle null values correctly */
+ if (has_nulls && !btspool->nulls_not_distinct)
+ tuples_equal = false;
+
+ if (tuples_equal)
+ {
+ /* check the previous tuple if we have not yet */
+ if (!prev_tested)
+ {
+ call_again = false;
+ tid = prev->t_tid;
+ seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+ prev_tested = true;
+ }
+
+ call_again = false;
+ tid = itup->t_tid;
+ now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+ /* have we detected multiple alive tuples within an equal group? */
+ if (seen_alive && now_alive)
+ {
+ char *key_desc;
+ TupleDesc tupDes = RelationGetDescr(wstate->index);
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values[INDEX_MAX_KEYS];
+
+ index_deform_tuple(itup, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+ /* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(wstate->index)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(wstate->heap,
+ RelationGetRelationName(wstate->index))));
+ }
+ seen_alive |= now_alive;
+ }
+ }
+
+ if (!tuples_equal)
+ {
+ seen_alive = false;
+ prev_tested = false;
+ }
+
+ _bt_buildadd(wstate, state, itup, 0);
+ if (prev) pfree(prev);
+ prev = CopyIndexTuple(itup);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
+ else if (merge)
{
/*
* Another BTSpool for dead tuples exists. Now we have to merge
@@ -1314,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
InvalidOffsetNumber);
}
else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
+ itup, NULL) > keysz &&
_bt_dedup_save_htid(dstate, itup))
{
/*
@@ -1411,7 +1522,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
- bool reset_snapshot;
bool wait_for_snapshot_attach;
int querylen;
@@ -1430,21 +1540,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
scantuplesortstates = leaderparticipates ? request + 1 : request;
- /*
- * For concurrent non-unique index builds, we can periodically reset snapshots
- * to allow the xmin horizon to advance. This is safe since these builds don't
- * require a consistent view across the entire scan. Unique indexes still need
- * a stable snapshot to properly enforce uniqueness constraints.
- */
- reset_snapshot = isconcurrent && !btspool->isunique;
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that, while that snapshot may be reset periodically in
- * case of non-unique index.
+ * live according to that, while that snapshot may be reset periodically.
*/
if (!isconcurrent)
{
@@ -1452,16 +1553,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
- else if (reset_snapshot)
+ else
{
+ /*
+ * For concurrent index builds, we can periodically reset snapshots to allow
+ * the xmin horizon to advance. This is safe since these builds don't
+ * require a consistent view across the entire scan.
+ */
snapshot = InvalidSnapshot;
PushActiveSnapshot(GetTransactionSnapshot());
}
- else
- {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
- }
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1531,6 +1632,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->indexrelid = RelationGetRelid(btspool->index);
btshared->isunique = btspool->isunique;
btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+ btshared->unique_dead_ignored = btspool->unique_dead_ignored;
btshared->isconcurrent = isconcurrent;
btshared->scantuplesortstates = scantuplesortstates;
btshared->queryid = pgstat_get_my_query_id();
@@ -1545,7 +1647,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
snapshot,
- reset_snapshot);
+ isconcurrent);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1626,7 +1728,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* In case when leader going to reset own active snapshot as well - we need to
* wait until all workers imported initial snapshot.
*/
- wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+ wait_for_snapshot_attach = isconcurrent && leaderparticipates;
if (wait_for_snapshot_attach)
WaitForParallelWorkersToAttach(pcxt, true);
@@ -1742,6 +1844,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
leaderworker->index = buildstate->spool->index;
leaderworker->isunique = buildstate->spool->isunique;
leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+ leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
/* Initialize second spool, if required */
if (!btleader->btshared->isunique)
@@ -1845,11 +1948,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
btspool->index = indexRel;
btspool->isunique = btshared->isunique;
btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+ btspool->unique_dead_ignored = btshared->unique_dead_ignored;
/* Look up shared state private to tuplesort.c */
sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
tuplesort_attach_shared(sharedsort, seg);
- if (!btshared->isunique)
+ if (!btshared->isunique || btshared->isconcurrent)
{
btspool2 = NULL;
sharedsort2 = NULL;
@@ -1928,6 +2032,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
btspool->index,
btspool->isunique,
btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
sortmem, coordinate,
TUPLESORT_NONE);
@@ -1950,14 +2055,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
coordinate2->nParticipants = -1;
coordinate2->sharedsort = sharedsort2;
btspool2->sortstate =
- tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+ tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
Min(sortmem, work_mem), coordinate2,
false);
}
/* Fill in buildstate for _bt_build_callback() */
- buildstate.isunique = btshared->isunique;
- buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
buildstate.havedead = false;
buildstate.heap = btspool->heap;
buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
{
itemid = PageGetItemId(state->origpage, maxoff);
tup = (IndexTuple) PageGetItem(state->origpage, itemid);
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* avoid appending a heap TID in new high key, we're done. Finish split
* with default strategy and initial split interval.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
if (perfectpenalty <= indnkeyatts)
return perfectpenalty;
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* If page is entirely full of duplicates, a single value strategy split
* will be performed.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
if (perfectpenalty <= indnkeyatts)
{
*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
itemid = PageGetItemId(state->origpage, P_HIKEY);
hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
- state->newitem);
+ state->newitem, NULL);
if (perfectpenalty <= indnkeyatts)
*strategy = SPLIT_SINGLE_VALUE;
else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
lastleft = _bt_split_lastleft(state, split);
firstright = _bt_split_firstright(state, split);
- return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+ return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 50cbf06cb45..3d6dda4ace8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
ScanDirection dir, bool *continuescan);
static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
int tupnatts, TupleDesc tupdesc);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
/*
@@ -4672,7 +4670,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
/* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+ keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
#ifdef DEBUG_NO_TRUNCATE
/* Force truncation to be ineffective for testing purposes */
@@ -4790,17 +4788,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
/*
* _bt_keep_natts - how many key attributes to keep when truncating.
*
+ * This is exported to be used as a comparison function during concurrent
+ * unique index builds in case _bt_keep_natts_fast is not suitable because
+ * collation is not "allequalimage"/deduplication-safe.
+ *
* Caller provides two tuples that enclose a split point. Caller's insertion
* scankey is used to compare the tuples; the scankey's argument values are
* not considered here.
*
+ * *hasnulls is set to true if any key column is null in either tuple.
+ *
* This can return a number of attributes that is one greater than the
* number of key attributes for the index relation. This indicates that the
* caller must use a heap TID as a unique-ifier in new pivot tuple.
*/
-static int
+int
_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
+ BTScanInsert itup_key,
+ bool *hasnulls)
{
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
TupleDesc itupdesc = RelationGetDescr(rel);
@@ -4826,6 +4831,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ if (hasnulls)
+ (*hasnulls) |= (isNull1 || isNull2);
if (isNull1 != isNull2)
break;
@@ -4845,7 +4852,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* expected in an allequalimage index.
*/
Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+ keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
return keepnatts;
}
@@ -4856,7 +4863,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* This is exported so that a candidate split point can have its effect on
* suffix truncation inexpensively evaluated ahead of time when finding a
* split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. It may also be used as a comparison function during a
+ * concurrent unique index build.
*
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4865,6 +4873,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* "equal image" columns, routine is guaranteed to give the same result as
* _bt_keep_natts would.
*
+ * *hasnulls is set to true if any key column is null in either tuple.
+ *
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts, even when the index uses
* an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4873,7 +4883,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* more balanced split point.
*/
int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ bool *hasnulls)
{
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4890,6 +4901,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ if (hasnulls)
+ *hasnulls |= (isNull1 || isNull2);
att = TupleDescAttr(itupdesc, attnum - 1);
if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f4464f64789..4eec5525993 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1530,7 +1530,7 @@ index_concurrently_build(Oid heapRelationId,
/* Invalidate catalog snapshot just for assert */
InvalidateCatalogSnapshot();
- Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -3292,9 +3292,9 @@ IndexCheckExclusion(Relation heapRelation,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
- * Furthermore, in case of a non-unique index we set SO_RESET_SNAPSHOT for
- * the scan, which causes a new snapshot to be set as active every so often.
- * The reason for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
*
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 6c1fce8ed25..a02729911fe 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,8 +1670,8 @@ DefineIndex(Oid tableId,
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
- * We build the index using all tuples that are visible under one or
- * more refreshed snapshots. We can be sure that any HOT updates to
+ * We build the index using all tuples that are visible under multiple
+ * refreshed snapshots. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
* rolled back. Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..aa4fcaac9a0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -123,6 +123,7 @@ typedef struct
bool enforceUnique; /* complain if we find duplicate tuples */
bool uniqueNullsNotDistinct; /* unique constraint null treatment */
+ bool uniqueDeadIgnored; /* ignore dead tuples in unique check */
} TuplesortIndexBTreeArg;
/*
@@ -349,6 +350,7 @@ tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem,
SortCoordinate coordinate,
int sortopt)
@@ -391,6 +393,7 @@ tuplesort_begin_index_btree(Relation heapRel,
arg->index.indexRel = indexRel;
arg->enforceUnique = enforceUnique;
arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+ arg->uniqueDeadIgnored = uniqueDeadIgnored;
indexScanKey = _bt_mkscankey(indexRel, NULL);
@@ -1520,6 +1523,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
char *key_desc;
+ bool uniqueCheckFail = true;
/*
* Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1533,57 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
*/
Assert(tuple1 != tuple2);
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
+ /* This is a fail-fast check; see _bt_load for details. */
+ if (arg->uniqueDeadIgnored)
+ {
+ bool any_tuple_dead,
+ call_again = false,
+ ignored;
+
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+ &TTSOpsBufferHeapTuple);
+ ItemPointerData tid = tuple1->t_tid;
+
+ IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+ any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+ if (!any_tuple_dead)
+ {
+ call_again = false;
+ tid = tuple2->t_tid;
+ any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+ &ignored);
+ }
+
+ if (any_tuple_dead)
+ {
+ elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+ ItemPointerGetBlockNumber(&tuple1->t_tid),
+ ItemPointerGetOffsetNumber(&tuple1->t_tid),
+ ItemPointerGetBlockNumber(&tuple2->t_tid),
+ ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+ uniqueCheckFail = false;
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
+ if (uniqueCheckFail)
+ {
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ /* keep this error message in sync with the one in _bt_load */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
}
/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
extern char *btbuildphasename(int64 phasenum);
extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key, bool *hasnulls);
extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
+ IndexTuple firstright, bool *hasnulls);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9ee5ea15fd4..ec3769585c3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1803,9 +1803,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
*
- * In case of non-unique concurrent index build SO_RESET_SNAPSHOT is applied
- * for the scan. That changes snapshots on the fly, allowing the xmin
- * horizon to advance.
+ * In case of concurrent index build SO_RESET_SNAPSHOT is applied for the scan.
+ * That changes snapshots on the fly, allowing the xmin horizon to advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem, SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
----------------
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
(1 row)
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_parallelscan_initialize
--
2.43.0
Attachment: v8-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch (application/octet-stream)
From 452ef7089db779a08421a1084584c13c599d1320 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v8 3/7] Allow advancing xmin during non-unique, non-parallel
concurrent index builds by periodically resetting snapshots
Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to mitigate this by allowing VACUUM to ignore transactions running CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY. However, it was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during index creation, leading to index corruption.
This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.
Currently, this technique is applied to:
* Non-parallel index builds: parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; support for them may be added in the future.
* Only the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness.
To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.
Regression tests are added to verify the behavior.
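For reviewers, here is a minimal sketch of the reset cadence, condensed from the heapam changes below (assertions and injection points omitted; the names are the ones used by the patch):

    /* in heap_fetch_next_buffer(), after pinning the next block */
    if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
        (scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
    {
        /* Drop the old, unregistered snapshot so it stops pinning xmin. */
        PopActiveSnapshot();
        InvalidateCatalogSnapshot();
        /* Continue the scan under a fresh snapshot. */
        PushActiveSnapshot(GetLatestSnapshot());
        scan->rs_snapshot = GetActiveSnapshot();
    }

Since the reset happens only between pages, each heap page is processed under a single snapshot, while xmin is allowed to move forward as the scan progresses.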
---
contrib/amcheck/verify_nbtree.c | 3 +-
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/brin/brin.c | 14 +++
src/backend/access/heap/heapam.c | 46 ++++++++
src/backend/access/heap/heapam_handler.c | 57 ++++++++--
src/backend/access/index/genam.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 14 +++
src/backend/catalog/index.c | 30 ++++-
src/backend/commands/indexcmds.c | 14 +--
src/backend/optimizer/plan/planner.c | 9 ++
src/include/access/tableam.h | 28 ++++-
src/test/modules/injection_points/Makefile | 2 +-
.../expected/cic_reset_snapshots.out | 107 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../sql/cic_reset_snapshots.sql | 86 ++++++++++++++
15 files changed, 384 insertions(+), 31 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK? */
+ true, /* syncscan OK? */
+ false); /* don't reset snapshot */
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
errmsg("only heap AM is supported")));
/* Disable syncscan because we assume we scan from block zero upwards */
- scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+ scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
hscan = (HeapScanDesc) scan;
InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3aedec882cd..d69859ac4df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_brin_end_parallel(brinleader, NULL);
return;
}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..1fdfdf96482 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
#include "utils/datum.h"
#include "utils/inval.h"
#include "utils/spccache.h"
+#include "utils/injection_point.h"
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
+/*
+ * Reset the active snapshot during a scan.
+ * This allows the xmin horizon to advance while the scan proceeds under fresh snapshots.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure active snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ PopActiveSnapshot();
+
+ sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+ Assert(!HaveRegisteredOrActiveSnapshot());
+ InvalidateCatalogSnapshot();
+
+ /* Goal of snapshot reset is to allow horizon to advance. */
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+ /* In some cases this is still not possible because an xid has been assigned. */
+ if (!TransactionIdIsValid(MyProc->xid))
+ INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+ PushActiveSnapshot(GetLatestSnapshot());
+ sscan->rs_snapshot = GetActiveSnapshot();
+}
+
/*
* heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
*
@@ -607,7 +638,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
if (BufferIsValid(scan->rs_cbuf))
+ {
scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+ if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+ (scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+ heap_reset_scan_snapshot((TableScanDesc) scan);
+ }
}
/*
@@ -1233,6 +1270,15 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_parallelworkerdata != NULL)
pfree(scan->rs_parallelworkerdata);
+ if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+ {
+ Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ }
+
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..980c51e32b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
ExprContext *econtext;
Snapshot snapshot;
bool need_unregister_snapshot = false;
+ bool need_pop_active_snapshot = false;
+ bool reset_snapshots = false;
TransactionId OldestXmin;
BlockNumber previous_blkno = InvalidBlockNumber;
BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
- /* Set up execution state for predicate, if any. */
- predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
*/
OldestXmin = InvalidTransactionId;
+ /*
+ * For a unique index we need a consistent snapshot for the whole scan.
+ * For a parallel scan, additional infrastructure (not yet implemented)
+ * would be required to perform the scan with SO_RESET_SNAPSHOT.
+ */
+ reset_snapshots = indexInfo->ii_Concurrent &&
+ !indexInfo->ii_Unique &&
+ !is_system_catalog; /* just in case */
+
/* okay to ignore lazy VACUUMs here */
if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
{
/*
* Serial index build.
- *
- * Must begin our own heap scan in this case. We may also need to
- * register a snapshot whose lifetime is under our direct control.
*/
if (!TransactionIdIsValid(OldestXmin))
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- need_unregister_snapshot = true;
+ snapshot = GetTransactionSnapshot();
+ /*
+ * Must begin our own heap scan in this case. We may also need to
+ * register a snapshot whose lifetime is under our direct control.
+ * When the snapshot is reset during the scan, registration is not
+ * allowed, because the snapshot is going to be replaced every so
+ * often.
+ */
+ if (!reset_snapshots)
+ {
+ snapshot = RegisterSnapshot(snapshot);
+ need_unregister_snapshot = true;
+ }
+ Assert(!ActiveSnapshotSet());
+ PushActiveSnapshot(snapshot);
+ /* re-fetch the active snapshot, since PushActiveSnapshot may have copied it */
+ snapshot = GetActiveSnapshot();
+ need_pop_active_snapshot = true;
}
else
+ {
+ Assert(!indexInfo->ii_Concurrent);
snapshot = SnapshotAny;
+ }
scan = table_beginscan_strat(heapRelation, /* relation */
snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- allow_sync); /* syncscan OK? */
+ allow_sync, /* syncscan OK? */
+ reset_snapshots /* reset snapshots? */);
}
else
{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
}
hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
!TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);
+ Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+ /* Clear the snapshot reference, since the scan itself may replace it. */
+ if (reset_snapshots)
+ snapshot = InvalidSnapshot;
/* Publish number of blocks to scan */
if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
table_endscan(scan);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
/* we can now forget our snapshot, if set and registered by us */
if (need_unregister_snapshot)
UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- false); /* syncscan not OK */
+ false, /* syncscan not OK */
+ false); /* don't reset snapshot */
hscan = (HeapScanDesc) scan;
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4b4ebff6a17..a104ba9df74 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
*/
sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
nkeys, key,
- true, false);
+ true, false, false);
sysscan->iscan = NULL;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 17a352d040c..5c4581afb1a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ }
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_bt_end_parallel(btleader);
return;
}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 05dc6add7eb..e0ada5ce159 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+#include "storage/proc.h"
/* Potentially set by pg_upgrade_support functions */
Oid binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
Relation indexRelation;
IndexInfo *indexInfo;
- /* This had better make sure that a snapshot is active */
- Assert(ActiveSnapshotSet());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ Assert(!TransactionIdIsValid(MyProc->xid));
/* Open and lock the parent heap relation */
heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
indexRelation = index_open(indexRelationId, RowExclusiveLock);
+ /* BuildIndexInfo may require a snapshot for expressions and predicates */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* We have to re-build the IndexInfo struct, since it was lost in the
* commit of the transaction where this concurrent index was created at
* the catalog level.
*/
indexInfo = BuildIndexInfo(indexRelation);
+ /* Done with snapshot */
+ PopActiveSnapshot();
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/* Now build the index */
index_build(heapRel, indexRelation, indexInfo, false, true);
+ /* Invalidate catalog snapshot just for assert */
+ InvalidateCatalogSnapshot();
+ Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
table_close(heapRel, NoLock);
index_close(indexRelation, NoLock);
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
* insert new entries into the index for insertions and non-HOT updates.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+ /* we can do away with our snapshot */
+ PopActiveSnapshot();
}
/*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK */
+ true, /* syncscan OK */
+ false); /* don't reset snapshot */
while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
* as of the start of the scan (see table_index_build_scan), whereas a normal
* build takes care to include recently-dead tuples. This is OK because
* we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone. The reason for doing that is to avoid
+ * to see those tuples are gone. One of the reasons for doing that is to avoid
* bogus unique-index failures due to concurrent UPDATEs (we might see
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
+ * Furthermore, in case of a non-unique index we set SO_RESET_SNAPSHOT for
+ * the scan, which causes a new snapshot to be set as active every so often.
+ * The reason for that is to propagate the xmin horizon forward.
+ *
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
* transactions that could have been modifying the table to terminate. Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
- * We now take a new snapshot, and build the index using all tuples that
- * are visible in this snapshot. We can be sure that any HOT updates to
+ * We build the index using all tuples that are visible under one or
+ * more refreshed snapshots. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
* rolled back. Thus, each visible tuple is either the end of its
* HOT-chain or the extension of the chain is HOT-safe for this index.
*/
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/* Perform concurrent build of index */
index_concurrently_build(tableId, indexRelationId);
- /* we can do away with our snapshot */
- PopActiveSnapshot();
-
/*
* Commit this transaction to make the indisready update visible.
*/
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
if (newidx->safe)
set_indexsafe_procflags();
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/* Perform concurrent build of new index */
index_concurrently_build(newidx->tableId, newidx->indexId);
- PopActiveSnapshot();
CommitTransactionCommand();
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f3856c519f6..5c7514c96ac 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
/* GUC parameters */
double cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6779,6 +6780,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
Relation heap;
Relation index;
RelOptInfo *rel;
+ bool need_pop_active_snapshot = false;
int parallel_workers;
BlockNumber heap_blocks;
double reltuples;
@@ -6834,6 +6836,12 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
heap = table_open(tableOid, NoLock);
index = index_open(indexOid, NoLock);
+ /* Set ActiveSnapshot since functions in the indexes may need it */
+ if (!ActiveSnapshotSet())
+ {
+ PushActiveSnapshot(GetTransactionSnapshot());
+ need_pop_active_snapshot = true;
+ }
/*
* Determine if it's safe to proceed.
*
@@ -6891,6 +6899,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
parallel_workers--;
done:
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
index_close(index, NoLock);
table_close(heap, NoLock);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..f4c7d2a92bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
+#include "utils/injection_point.h"
#define DEFAULT_TABLE_ACCESS_METHOD "heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
* needed. If table data may be needed, set SO_NEED_TUPLES.
*/
SO_NEED_TUPLES = 1 << 10,
+ /*
+ * Reset the scan and catalog snapshots every so often? If so, every
+ * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped,
+ * the catalog snapshot is invalidated, and the latest snapshot is pushed as active.
+ *
+ * At the end of the scan the final snapshot is not popped.
+ * The goal of this mode is to keep the xmin horizon moving forward.
+ *
+ * See heap_reset_scan_snapshot for details.
+ */
+ SO_RESET_SNAPSHOT = 1 << 11,
} ScanOptions;
/*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
static inline TableScanDesc
table_beginscan_strat(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
- bool allow_strat, bool allow_sync)
+ bool allow_strat, bool allow_sync,
+ bool reset_snapshot)
{
uint32 flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
flags |= SO_ALLOW_STRAT;
if (allow_sync)
flags |= SO_ALLOW_SYNC;
+ if (reset_snapshot)
+ {
+ INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+ /* Active snapshot is required on start. */
+ Assert(GetActiveSnapshot() == snapshot);
+ /* Active snapshot should not be registered to keep xmin propagating. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ flags |= (SO_RESET_SNAPSHOT);
+ }
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1779,6 +1801,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* very hard to detect whether they're really incompatible with the chain tip.
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
+ *
+ * In case of non-unique index and non-parallel concurrent build
+ * SO_RESET_SNAPSHOT is applied for the scan. That changes snapshots on the
+ * fly, allowing the xmin horizon to advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
DATA = injection_points--1.0.sql
PGFILEDESC = "injection_points - facility for injection points"
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..5db54530f17
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,107 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local
+----------------------------
+
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
'sql': [
'injection_points',
'reindex_conc',
+ 'cic_reset_snapshots',
],
'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
# The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
--
2.43.0
Attachment: v8-0002-Add-stress-tests-for-concurrent-index-operations.patch (application/octet-stream)
From b4f22a1da4bbbff6a268c0f62196a264cb126896 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v8 2/7] Add stress tests for concurrent index operations
Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations
These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
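The test can be exercised on its own via the usual TAP recipes; something along these lines should work (the exact meson test name may differ per build setup):

    # autoconf/make build
    make -C src/bin/pg_amcheck check PROVE_TESTS=t/006_cic.pl

    # meson build
    meson test -C build pg_amcheck/006_cic

It is intentionally heavy (30 clients, 2500 transactions each), so expect it to take a while.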
---
src/bin/pg_amcheck/meson.build | 1 +
src/bin/pg_amcheck/t/006_cic.pl | 144 ++++++++++++++++++++++++++++++++
2 files changed, 145 insertions(+)
create mode 100644 src/bin/pg_amcheck/t/006_cic.pl
diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..4a8f4fbc8b0 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
't/003_check.pl',
't/004_verify_heapam.pl',
't/005_opclass_damage.pl',
+ 't/006_cic.pl',
],
},
}
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..142e8fb845e
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,144 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+ if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+ c1 money default 0, c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# Create an unlogged sequence used to coordinate index rebuild attempts
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+ CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN true;
+ END; $$;
+));
+
+$node->safe_psql('postgres', q(
+ CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ RETURN MOD($1, 2) = 0;
+ END; $$;
+));
+
+# Run CIC/RIC in different options concurrently with upserts
+$node->pgbench(
+ '--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+ {
+ 'concurrent_ops' => q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set variant random(0, 5)
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ \if :variant = 0
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+ \elif :variant = 1
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+ \elif :variant = 2
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+ \elif :variant = 3
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+ \elif :variant = 4
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+ \elif :variant = 5
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+ \endif
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ REINDEX INDEX CONCURRENTLY idx_2;
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ DROP INDEX CONCURRENTLY idx_2;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1000, 100000)
+ BEGIN;
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ SELECT setval('in_row_rebuild', 1);
+ COMMIT;
+ \endif
+ )
+ });
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+ '--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY on unique index',
+ {
+ 'concurrent_ops_unique_idx' => q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ REINDEX INDEX CONCURRENTLY idx_2;
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ DROP INDEX CONCURRENTLY idx_2;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1, power(10, random(1, 5)))
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ SELECT setval('in_row_rebuild', 1);
+ \endif
+ )
+ });
+
+$node->stop;
+done_testing();
\ No newline at end of file
--
2.43.0
Attachment: v8-0006-Add-STIR-Short-Term-Index-Replacement-access-meth.patch (application/octet-stream)
From 6f2d3ce069d5ccc738b3bacaa94759c13531030a Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v8 6/7] Add STIR (Short-Term Index Replacement) access method
This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:
- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
operation is validating a newly built index (e.g., during concurrent build).
Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.
These changes lay essential groundwork for further improvements to concurrent
index builds.
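As a quick smoke test, once this patch is applied the new AM shows up in the catalogs like any other access method (the opclass entries added in pg_opclass.dat should be visible the same way):

    -- expects one row: stir | i
    SELECT amname, amtype FROM pg_am WHERE amname = 'stir';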
---
contrib/pgstattuple/pgstattuple.c | 3 +
src/backend/access/Makefile | 2 +-
src/backend/access/heap/vacuumlazy.c | 2 +
src/backend/access/meson.build | 1 +
src/backend/access/stir/Makefile | 18 +
src/backend/access/stir/meson.build | 5 +
src/backend/access/stir/stir.c | 576 +++++++++++++++++++++++
src/backend/catalog/index.c | 1 +
src/backend/commands/analyze.c | 1 +
src/backend/commands/vacuumparallel.c | 1 +
src/backend/nodes/makefuncs.c | 1 +
src/include/access/genam.h | 1 +
src/include/access/reloptions.h | 3 +-
src/include/access/stir.h | 117 +++++
src/include/catalog/pg_am.dat | 3 +
src/include/catalog/pg_opclass.dat | 4 +
src/include/catalog/pg_opfamily.dat | 2 +
src/include/catalog/pg_proc.dat | 4 +
src/include/nodes/execnodes.h | 6 +-
src/include/utils/index_selfuncs.h | 8 +
src/test/regress/expected/amutils.out | 8 +-
src/test/regress/expected/opr_sanity.out | 7 +-
src/test/regress/expected/psql.out | 24 +-
23 files changed, 780 insertions(+), 18 deletions(-)
create mode 100644 src/backend/access/stir/Makefile
create mode 100644 src/backend/access/stir/meson.build
create mode 100644 src/backend/access/stir/stir.c
create mode 100644 src/include/access/stir.h
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
case SPGIST_AM_OID:
err = "spgist index";
break;
+ case STIR_AM_OID:
+ err = "stir index";
+ break;
case BRIN_AM_OID:
err = "brin index";
break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \
- sequence table tablesample transam
+ stir sequence table tablesample transam
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f2ca9430581..bec79b48cb2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
ivinfo.message_level = DEBUG2;
ivinfo.num_heap_tuples = reltuples;
ivinfo.strategy = vacrel->bstrategy;
+ ivinfo.validate_index = false;
/*
* Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
ivinfo.num_heap_tuples = reltuples;
ivinfo.strategy = vacrel->bstrategy;
+ ivinfo.validate_index = false;
/*
* Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 62a371db7f7..63ee0ef134d 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
subdir('rmgrdesc')
subdir('sequence')
subdir('spgist')
+subdir('stir')
subdir('table')
subdir('tablesample')
subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/stir
+#
+# IDENTIFICATION
+# src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+ 'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..83aa255176f
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,576 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ * Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "commands/vacuum.h"
+#include "utils/index_selfuncs.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "utils/catcache.h"
+#include "access/amvalidate.h"
+#include "utils/syscache.h"
+#include "access/htup_details.h"
+#include "catalog/pg_amproc.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "utils/regproc.h"
+#include "storage/bufmgr.h"
+#include "access/tableam.h"
+#include "access/reloptions.h"
+#include "utils/memutils.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+ IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+ /* Set STIR-specific strategy and procedure numbers */
+ amroutine->amstrategies = STIR_NSTRATEGIES;
+ amroutine->amsupport = STIR_NPROC;
+ amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+ /* STIR doesn't support most index operations */
+ amroutine->amcanorder = false;
+ amroutine->amcanorderbyop = false;
+ amroutine->amcanbackward = false;
+ amroutine->amcanunique = false;
+ amroutine->amcanmulticol = true;
+ amroutine->amoptionalkey = true;
+ amroutine->amsearcharray = false;
+ amroutine->amsearchnulls = false;
+ amroutine->amstorage = false;
+ amroutine->amclusterable = false;
+ amroutine->ampredlocks = false;
+ amroutine->amcanparallel = false;
+ amroutine->amcanbuildparallel = false;
+ amroutine->amcaninclude = true;
+ amroutine->amusemaintenanceworkmem = false;
+ amroutine->amparallelvacuumoptions =
+ VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+ amroutine->amkeytype = InvalidOid;
+
+ /* Set up function callbacks */
+ amroutine->ambuild = stirbuild;
+ amroutine->ambuildempty = stirbuildempty;
+ amroutine->aminsert = stirinsert;
+ amroutine->aminsertcleanup = NULL;
+ amroutine->ambulkdelete = stirbulkdelete;
+ amroutine->amvacuumcleanup = stirvacuumcleanup;
+ amroutine->amcanreturn = NULL;
+ amroutine->amcostestimate = stircostestimate;
+ amroutine->amoptions = stiroptions;
+ amroutine->amproperty = NULL;
+ amroutine->ambuildphasename = NULL;
+ amroutine->amvalidate = stirvalidate;
+ amroutine->amadjustmembers = NULL;
+ amroutine->ambeginscan = stirbeginscan;
+ amroutine->amrescan = stirrescan;
+ amroutine->amgettuple = NULL;
+ amroutine->amgetbitmap = NULL;
+ amroutine->amendscan = stirendscan;
+ amroutine->ammarkpos = NULL;
+ amroutine->amrestrpos = NULL;
+ amroutine->amestimateparallelscan = NULL;
+ amroutine->aminitparallelscan = NULL;
+ amroutine->amparallelrescan = NULL;
+
+ PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+ bool result = true;
+ HeapTuple classtup;
+ Form_pg_opclass classform;
+ Oid opfamilyoid;
+ HeapTuple familytup;
+ Form_pg_opfamily familyform;
+ char *opfamilyname;
+ CatCList *proclist,
+ *oprlist;
+ int i;
+
+ /* Fetch opclass information */
+ classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+ if (!HeapTupleIsValid(classtup))
+ elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+ classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+ opfamilyoid = classform->opcfamily;
+
+
+ /* Fetch opfamily information */
+ familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+ if (!HeapTupleIsValid(familytup))
+ elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+ familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+ opfamilyname = NameStr(familyform->opfname);
+
+ /* Fetch all operators and support functions of the opfamily */
+ oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+ proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+ /* Check individual operators */
+ for (i = 0; i < oprlist->n_members; i++)
+ {
+ HeapTuple oprtup = &oprlist->members[i]->tuple;
+ Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+ /* Check that it is an allowed strategy for stir */
+ if (oprform->amopstrategy < 1 ||
+ oprform->amopstrategy > STIR_NSTRATEGIES)
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+ opfamilyname,
+ format_operator(oprform->amopopr),
+ oprform->amopstrategy)));
+ result = false;
+ }
+
+ /* stir doesn't support ORDER BY operators */
+ if (oprform->amoppurpose != AMOP_SEARCH ||
+ OidIsValid(oprform->amopsortfamily))
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+ opfamilyname,
+ format_operator(oprform->amopopr))));
+ result = false;
+ }
+
+ /* Check operator signature --- same for all stir strategies */
+ if (!check_amop_signature(oprform->amopopr, BOOLOID,
+ oprform->amoplefttype,
+ oprform->amoprighttype))
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains operator %s with wrong signature",
+ opfamilyname,
+ format_operator(oprform->amopopr))));
+ result = false;
+ }
+ }
+
+
+ ReleaseCatCacheList(proclist);
+ ReleaseCatCacheList(oprlist);
+ ReleaseSysCache(familytup);
+ ReleaseSysCache(classtup);
+
+ return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+ StirMetaPageData *metadata;
+
+ StirInitPage(metaPage, STIR_META);
+ metadata = StirPageGetMeta(metaPage);
+ memset(metadata, 0, sizeof(StirMetaPageData));
+ metadata->magickNumber = STIR_MAGICK_NUMBER;
+ metadata->skipInserts = skipInserts;
+ ((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+ Buffer metaBuffer;
+ Page metaPage;
+ GenericXLogState *state;
+
+ /*
+ * Make a new page; since it is first page it should be associated with
+ * block number 0 (STIR_METAPAGE_BLKNO). No need to hold the extension
+ * lock because there cannot be concurrent inserters yet.
+ */
+ metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+ Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+ /* Initialize contents of meta page */
+ state = GenericXLogStart(index);
+ metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+ GENERIC_XLOG_FULL_IMAGE);
+ StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+ GenericXLogFinish(state);
+
+ UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+ StirPageOpaque opaque;
+
+ PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+ opaque = StirPageGetOpaque(page);
+ opaque->flags = flags;
+ opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+ StirTuple *itup;
+ StirPageOpaque opaque;
+ Pointer ptr;
+
+ /* We shouldn't be pointed to an invalid page */
+ Assert(!PageIsNew(page));
+
+ /* Does new tuple fit on the page? */
+ if (StirPageGetFreeSpace(page) < sizeof(StirTuple))
+ return false;
+
+ /* Copy new tuple to the end of page */
+ opaque = StirPageGetOpaque(page);
+ itup = StirPageGetTuple(page, opaque->maxoff + 1);
+ memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+ /* Adjust maxoff and pd_lower */
+ opaque->maxoff++;
+ ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+ ((PageHeader) page)->pd_lower = ptr - page;
+
+ /* Assert we didn't overrun available space */
+ Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+ return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo)
+{
+ StirTuple *itup;
+ MemoryContext oldCtx;
+ MemoryContext insertCtx;
+ StirMetaPageData *metaData;
+ Buffer buffer,
+ metaBuffer;
+ Page page;
+ GenericXLogState *state;
+ BlockNumber blkNo;
+
+ /* Create temporary context for insert operation */
+ insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Stir insert temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+
+ oldCtx = MemoryContextSwitchTo(insertCtx);
+
+ /* Create new tuple with heap pointer */
+ itup = (StirTuple *) palloc0(sizeof(StirTuple));
+ itup->heapPtr = *ht_ctid;
+
+ metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+ for (;;)
+ {
+ LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+ metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+ /* Check if inserts are allowed */
+ if (metaData->skipInserts)
+ {
+ UnlockReleaseBuffer(metaBuffer);
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextDelete(insertCtx);
+ return false;
+ }
+ blkNo = metaData->lastBlkNo;
+ /* Don't hold metabuffer lock while doing insert */
+ LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+ if (blkNo > 0)
+ {
+ buffer = ReadBuffer(index, blkNo);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ page = GenericXLogRegisterBuffer(state, buffer, 0);
+
+ Assert(!PageIsNew(page));
+
+ /* Try to add tuple to existing page */
+ if (StirPageAddItem(page, itup))
+ {
+ /* Success! Apply the change, clean up, and exit */
+ GenericXLogFinish(state);
+ UnlockReleaseBuffer(buffer);
+ ReleaseBuffer(metaBuffer);
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextDelete(insertCtx);
+ return false;
+ }
+
+ /* Didn't fit, must try other pages */
+ GenericXLogAbort(state);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ /* Need to add new page - get exclusive lock on meta page */
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ metaData = StirPageGetMeta(GenericXLogRegisterBuffer(state, metaBuffer, GENERIC_XLOG_FULL_IMAGE));
+ /* Check if another backend already extended the index */
+
+ if (blkNo != metaData->lastBlkNo)
+ {
+ Assert(blkNo < metaData->lastBlkNo);
+ /* Someone else inserted a new page into the index, let's try again */
+ GenericXLogAbort(state);
+ LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+ else
+ {
+ /* Must extend the file */
+ buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
+
+ page = GenericXLogRegisterBuffer(state, buffer, GENERIC_XLOG_FULL_IMAGE);
+ StirInitPage(page, 0);
+
+ if (!StirPageAddItem(page, itup))
+ {
+ /* We shouldn't be here since we're inserting to an empty page */
+ elog(ERROR, "could not add new stir tuple to empty page");
+ }
+
+ /* Update meta page with new last block number */
+ metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+ GenericXLogFinish(state);
+
+ UnlockReleaseBuffer(buffer);
+ UnlockReleaseBuffer(metaBuffer);
+
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextDelete(insertCtx);
+
+ return false;
+ }
+ }
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+ ScanKey orderbys, int norderbys)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+ struct IndexInfo *indexInfo)
+{
+ IndexBuildResult *result;
+
+ if (!indexInfo->ii_Auxiliary)
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can only be built as auxiliary indexes")));
+
+ StirInitMetapage(index, MAIN_FORKNUM);
+
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+ result->heap_tuples = 0;
+ result->index_tuples = 0;
+ return result;
+}
+
+void stirbuildempty(Relation index)
+{
+ StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats,
+ IndexBulkDeleteCallback callback,
+ void *callback_state)
+{
+ Relation index = info->index;
+ BlockNumber blkno, npages;
+ Buffer buffer;
+ Page page;
+
+ /* For a regular VACUUM, mark the index to skip inserts and warn that it should be dropped */
+ if (!info->validate_index)
+ {
+ StirMarkAsSkipInserts(index);
+
+ ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("\"%s\" is not implemented, this index probably needs to be dropped", __func__)));
+ return NULL;
+ }
+
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ /*
+ * Iterate over the pages. We don't care about concurrently added pages,
+ * because the index is marked as not-ready at that moment and is not
+ * used for inserts.
+ */
+ npages = RelationGetNumberOfBlocks(index);
+ for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+ {
+ StirTuple *itup, *itupEnd;
+
+ vacuum_delay_point();
+
+ buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+ RBM_NORMAL, info->strategy);
+
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buffer);
+
+ if (PageIsNew(page))
+ {
+ UnlockReleaseBuffer(buffer);
+ continue;
+ }
+
+ itup = StirPageGetTuple(page, FirstOffsetNumber);
+ itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+ while (itup < itupEnd)
+ {
+ /* Do we have to delete this tuple? */
+ if (callback(&itup->heapPtr, callback_state))
+ {
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+ }
+
+ itup = StirPageGetNextTuple(itup);
+ }
+
+ UnlockReleaseBuffer(buffer);
+ }
+
+ return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+ StirMetaPageData *metaData;
+ Buffer metaBuffer;
+ Page metaPage;
+ GenericXLogState *state;
+
+ metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+ GENERIC_XLOG_FULL_IMAGE);
+ metaData = StirPageGetMeta(metaPage);
+ if (!metaData->skipInserts)
+ {
+ metaData->skipInserts = true;
+ GenericXLogFinish(state);
+ }
+ else
+ {
+ GenericXLogAbort(state);
+ }
+ UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats)
+{
+ StirMarkAsSkipInserts(info->index);
+ ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("\"%s\" is not implemented, this index probably needs to be dropped", __func__)));
+ return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+ return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+ double loop_count, Cost *indexStartupCost,
+ Cost *indexTotalCost, Selectivity *indexSelectivity,
+ double *indexCorrelation, double *indexPages)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4eec5525993..92d5f3ac009 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3402,6 +3402,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
ivinfo.message_level = DEBUG2;
ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
ivinfo.strategy = NULL;
+ ivinfo.validate_index = true;
/*
* Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282f..d54d310ba43 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
ivinfo.message_level = elevel;
ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
ivinfo.strategy = vac_strategy;
+ ivinfo.validate_index = false;
stats = index_vacuum_cleanup(&ivinfo, NULL);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a564..e4327b4f7dc 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
ivinfo.estimated_count = pvs->shared->estimated_count;
ivinfo.num_heap_tuples = pvs->shared->reltuples;
ivinfo.strategy = pvs->bstrategy;
+ ivinfo.validate_index = false;
/* Update error traceback information */
pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 7e5df7bea4d..44a8a1f2875 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
/* initialize index-build state to default */
n->ii_BrokenHotChain = false;
n->ii_ParallelWorkers = 0;
+ n->ii_Auxiliary = false;
/* set up for possible use by index AM */
n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 81653febc18..194dbbe1d0e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
bool estimated_count; /* num_heap_tuples is an estimate */
int message_level; /* ereport level for progress messages */
double num_heap_tuples; /* tuples remaining in heap */
+ bool validate_index; /* validating concurrently built index? */
BufferAccessStrategy strategy; /* access strategy for reads */
} IndexVacuumInfo;
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index df6923c9d50..0966397d344 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
RELOPT_KIND_VIEW = (1 << 9),
RELOPT_KIND_BRIN = (1 << 10),
RELOPT_KIND_PARTITIONED = (1 << 11),
+ RELOPT_KIND_STIR = (1 << 12),
/* if you add a new kind, make sure you update "last_default" too */
- RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+ RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ * header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC 0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES 1
+
+#define STIR_OPTIONS_PROC 0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+ ((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page) ((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+ ((StirTuple *)(PageGetContents(page) \
+ + sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+ ((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO (0)
+#define STIR_HEAD_BLKNO (1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+ OffsetNumber maxoff; /* number of index tuples on page */
+ uint16 flags; /* see bit definitions below */
+ uint16 unused; /* placeholder to force maxaligning of size of
+ * StirPageOpaqueData and to place
+ * stir_page_id exactly at the end of page */
+ uint16 stir_page_id; /* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META (1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID 0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+ uint32 magickNumber;
+ BlockNumber lastBlkNo;
+ bool skipInserts; /* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page) ((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+ ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(page) \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+ - StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+ - MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+ ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+ struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+ void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
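For a sense of capacity, a rough back-of-the-envelope based on the definitions
above (a sketch; the byte sizes are assumptions for a default 8 kB BLCKSZ build):

    /* Standalone sketch: approximate number of TIDs per STIR data page. */
    #include <stdio.h>

    int
    main(void)
    {
        const int blcksz = 8192;     /* default BLCKSZ */
        const int page_header = 24;  /* MAXALIGN(SizeOfPageHeaderData) */
        const int opaque = 8;        /* sizeof(StirPageOpaqueData): 4 x uint16 */
        const int tuple = 6;         /* sizeof(StirTuple) == sizeof(ItemPointerData) */

        printf("%d tuples per page\n", (blcksz - page_header - opaque) / tuple);
        return 0;                    /* prints: 1360 tuples per page */
    }

So STIR pages stay dense: about 1360 TIDs per 8 kB page, with no line pointers
or per-tuple headers.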
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index db874902820..51350df0bf0 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
{ oid => '3580', oid_symbol => 'BRIN_AM_OID',
descr => 'block range index (BRIN) access method',
amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+ descr => 'short term index replacement access method',
+ amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index f503c652ebc..a8f0e66d15b 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
# no brin opclass for the geometric types except box
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+ opcfamily => 'stir/any_ops', opcintype => 'any'},
+
]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index c8ac8c73def..41ea0c3ca50 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
opfmethod => 'hash', opfname => 'multirange_ops' },
{ oid => '6158',
opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+ opfmethod => 'stir', opfname => 'any_ops' },
]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0f22c217235..59f50e2b027 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
proname => 'brinhandler', provolatile => 'v',
prorettype => 'index_am_handler', proargtypes => 'internal',
prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+ proname => 'stirhandler', provolatile => 'v',
+ prorettype => 'index_am_handler', proargtypes => 'internal',
+ prosrc => 'stirhandler' },
{ oid => '3952', descr => 'brin: standalone scan new table pages',
proname => 'brin_summarize_new_values', provolatile => 'v',
proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7f71b7625df..748655fd0cf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
* BrokenHotChain did we detect any broken HOT chains?
* Summarizing is it a summarizing index?
* ParallelWorkers # of workers requested (excludes leader)
+ * Auxiliary # index-helper for concurrent build?
* Am Oid of index AM
* AmCache private cache area for index AM
* Context memory context holding this IndexInfo
*
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
* ----------------
*/
typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
bool ii_Summarizing;
bool ii_WithoutOverlaps;
int ii_ParallelWorkers;
+ bool ii_Auxiliary;
Oid ii_Am;
void *ii_AmCache;
MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index a41cd2b7fd9..61f3d3dea0c 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
Selectivity *indexSelectivity,
double *indexCorrelation,
double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+ struct IndexPath *path,
+ double loop_count,
+ Cost *indexStartupCost,
+ Cost *indexTotalCost,
+ Selectivity *indexSelectivity,
+ double *indexCorrelation,
+ double *indexPages);
extern void gincostestimate(struct PlannerInfo *root,
struct IndexPath *path,
double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
spgist | can_exclude | t
spgist | can_include | t
spgist | bogus |
-(36 rows)
+ stir | can_order | f
+ stir | can_unique | f
+ stir | can_multi_col | t
+ stir | can_exclude | f
+ stir | can_include | t
+ stir | bogus |
+(42 rows)
--
-- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
WHERE a1.amopfamily = c1.opcfamily
AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily
----------+-----------
-(0 rows)
+ opcname | opcfamily
+----------+-----------
+ stir_ops | 5558
+(1 row)
-- Check that each operator listed in pg_amop has an associated opclass,
-- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
heap | Table
heap2 | Table
spgist | Index
-(8 rows)
+ stir | Index
+(9 rows)
\dA *
List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
heap | Table
heap2 | Table
spgist | Index
-(8 rows)
+ stir | Index
+(9 rows)
\dA h*
List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
\dA: extra argument "bar" ignored
\dA+
- List of access methods
- Name | Type | Handler | Description
---------+-------+----------------------+----------------------------------------
+ List of access methods
+ Name | Type | Handler | Description
+--------+-------+----------------------+--------------------------------------------
brin | Index | brinhandler | block range index (BRIN) access method
btree | Index | bthandler | b-tree index access method
gin | Index | ginhandler | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
heap | Table | heap_tableam_handler | heap table access method
heap2 | Table | heap_tableam_handler |
spgist | Index | spghandler | SP-GiST index access method
-(8 rows)
+ stir | Index | stirhandler | short term index replacement access method
+(9 rows)
\dA+ *
- List of access methods
- Name | Type | Handler | Description
---------+-------+----------------------+----------------------------------------
+ List of access methods
+ Name | Type | Handler | Description
+--------+-------+----------------------+--------------------------------------------
brin | Index | brinhandler | block range index (BRIN) access method
btree | Index | bthandler | b-tree index access method
gin | Index | ginhandler | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
heap | Table | heap_tableam_handler | heap table access method
heap2 | Table | heap_tableam_handler |
spgist | Index | spghandler | SP-GiST index access method
-(8 rows)
+ stir | Index | stirhandler | short term index replacement access method
+(9 rows)
\dA+ h*
List of access methods
--
2.43.0
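The validate_index flag added to IndexVacuumInfo above is what lets an
ambulkdelete implementation tell the validation scan apart from a regular
VACUUM. A sketch of how an AM can branch on it (hypothetical examplebulkdelete;
only the flag and the callback protocol come from the patch):

    /* Sketch: ambulkdelete distinguishing validate_index() from VACUUM. */
    #include "postgres.h"
    #include "access/genam.h"

    static IndexBulkDeleteResult *
    examplebulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
                      IndexBulkDeleteCallback callback, void *callback_state)
    {
        if (stats == NULL)
            stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));

        if (info->validate_index)
        {
            /* Called from validate_index(): pass every stored TID to the
             * callback, which collects them for the validation sort. */
        }
        else
        {
            /* Regular VACUUM: remove entries the callback reports as dead. */
        }
        return stats;
    }

This is the same split stirbulkdelete() makes, except that STIR warns and bails
out on the VACUUM path instead of deleting anything.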
Attachment: v8-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch
From 31b28f4a458da9486d7d851ee6a31f0241df074e Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v8 4/7] Allow snapshot resets during parallel concurrent index
builds
Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.
Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
proceeding with scan
- Add regression tests to verify behavior with various index types
The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.
This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
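The ordering on the leader side is the subtle part: once snapshots may be reset,
the leader must not join the scan (and start resetting its own active snapshot)
until every worker has restored the initial one, or a worker could try to import
a snapshot the leader has already replaced. A condensed sketch of the flow this
patch sets up in _bt_begin_parallel() (simplified, error paths omitted):

    /* Sketch of leader-side coordination for reset_snapshot builds. */
    if (reset_snapshot)
        PushActiveSnapshot(GetTransactionSnapshot()); /* initial snapshot */

    LaunchParallelWorkers(pcxt);

    if (reset_snapshot && leaderparticipates)
        WaitForParallelWorkersToAttach(pcxt, true);   /* also wait for snapshot import */

    if (leaderparticipates)
        _bt_leader_participate_as_worker(buildstate); /* may reset the snapshot */

    if (!(reset_snapshot && leaderparticipates))
        WaitForParallelWorkersToAttach(pcxt, false);

    PopActiveSnapshot();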
---
src/backend/access/brin/brin.c | 43 +++++++++-------
src/backend/access/heap/heapam_handler.c | 12 +++--
src/backend/access/nbtree/nbtsort.c | 38 ++++++++++++--
src/backend/access/table/tableam.c | 37 ++++++++++++--
src/backend/access/transam/parallel.c | 50 +++++++++++++++++--
src/backend/catalog/index.c | 2 +-
src/backend/executor/nodeSeqscan.c | 3 +-
src/backend/utils/time/snapmgr.c | 8 ---
src/include/access/parallel.h | 3 +-
src/include/access/relscan.h | 1 +
src/include/access/tableam.h | 9 ++--
.../expected/cic_reset_snapshots.out | 23 ++++++++-
.../sql/cic_reset_snapshots.sql | 7 ++-
13 files changed, 179 insertions(+), 57 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d69859ac4df..0782bd64a6a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
*/
BrinShared *brinshared;
Sharedsort *sharedsort;
- Snapshot snapshot;
WalUsage *walusage;
BufferUsage *bufferusage;
} BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
bool isconcurrent, int request);
static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
static double _brin_parallel_heapscan(BrinBuildState *state);
static double _brin_parallel_merge(BrinBuildState *state);
static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
ParallelContext *pcxt;
int scantuplesortstates;
- Snapshot snapshot;
Size estbrinshared;
Size estsort;
BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
- * concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+ * concurrent build, we take a regular MVCC snapshot and push it as active.
+ * Later we index whatever's live according to that snapshot while that
+ * snapshot is reset periodically.
*/
if (!isconcurrent)
{
Assert(ActiveSnapshotSet());
- snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
else
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ Assert(!ActiveSnapshotSet());
PushActiveSnapshot(GetTransactionSnapshot());
}
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
*/
- estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+ estbrinshared = _brin_parallel_estimate_shared(heap);
shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
estsort = tuplesort_estimate_shared(scantuplesortstates);
shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
- UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
table_parallelscan_initialize(heap,
ParallelTableScanFromBrinShared(brinshared),
- snapshot);
+ isconcurrent ? InvalidSnapshot : SnapshotAny,
+ isconcurrent);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->nparticipanttuplesorts++;
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
- brinleader->snapshot = snapshot;
brinleader->walusage = walusage;
brinleader->bufferusage = bufferusage;
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* Save leader state now that it's clear build will be parallel */
buildstate->bs_leader = brinleader;
+ /*
+ * In the case of a concurrent build, snapshots are reset periodically.
+ * When the leader is going to reset its own active snapshot as well, we
+ * need to wait until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
- /* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(brinleader->snapshot))
- UnregisterSnapshot(brinleader->snapshot);
DestroyParallelContext(brinleader->pcxt);
ExitParallelMode();
}
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
/*
* Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
*/
static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
{
/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
return add_size(BUFFERALIGN(sizeof(BrinShared)),
- table_parallelscan_estimate(heap, snapshot));
+ table_parallelscan_estimate(heap, InvalidSnapshot));
}
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 980c51e32b9..2e5163609c1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, or during bootstrap, we take a regular MVCC snapshot
- * and index whatever's live according to that.
+ * and index whatever's live according to that, while the snapshot is reset
+ * every so often (in the case of a non-unique index).
*/
OldestXmin = InvalidTransactionId;
/*
* For unique indexes we need a consistent snapshot for the whole scan.
- * In case of parallel scan some additional infrastructure required
- * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
*/
reset_snapshots = indexInfo->ii_Concurrent &&
!indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
- PushActiveSnapshot(snapshot);
- need_pop_active_snapshot = true;
+ if (!reset_snapshots)
+ {
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
+ }
}
hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5c4581afb1a..2acbf121745 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool reset_snapshot;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
scantuplesortstates = leaderparticipates ? request + 1 : request;
+ /*
+ * For concurrent non-unique index builds, we can periodically reset snapshots
+ * to allow the xmin horizon to advance. This is safe since these builds don't
+ * require a consistent view across the entire scan. Unique indexes still need
+ * a stable snapshot to properly enforce uniqueness constraints.
+ */
+ reset_snapshot = isconcurrent && !btspool->isunique;
+
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+ * live according to that, while the snapshot may be reset periodically in
+ * the case of a non-unique index.
*/
if (!isconcurrent)
{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
+ else if (reset_snapshot)
+ {
+ snapshot = InvalidSnapshot;
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
else
{
snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->brokenhotchain = false;
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
- snapshot);
+ snapshot,
+ reset_snapshot);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* Save leader state now that it's clear build will be parallel */
buildstate->btleader = btleader;
+ /*
+ * In the case of a concurrent build, snapshots are reset periodically.
+ * When the leader is going to reset its own active snapshot as well, we
+ * need to wait until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(btleader->snapshot))
+ if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
UnregisterSnapshot(btleader->snapshot);
DestroyParallelContext(btleader->pcxt);
ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
{
Size sz = 0;
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
sz = add_size(sz, EstimateSnapshotSpace(snapshot));
else
- Assert(snapshot == SnapshotAny);
+ Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
void
table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
- Snapshot snapshot)
+ Snapshot snapshot, bool reset_snapshot)
{
Size snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
pscan->phs_snapshot_off = snapshot_off;
- if (IsMVCCSnapshot(snapshot))
+ /*
+ * Initialize parallel scan description. For normal scans with a regular
+ * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+ * snapshot resets, mark the scan accordingly.
+ */
+ if (reset_snapshot)
+ {
+ Assert(snapshot == InvalidSnapshot);
+ pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = true;
+ INJECTION_POINT("table_parallelscan_initialize");
+ }
+ else if (IsMVCCSnapshot(snapshot))
{
SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = false;
}
else
{
Assert(snapshot == SnapshotAny);
+ Assert(!reset_snapshot);
pscan->phs_snapshot_any = true;
+ pscan->phs_reset_snapshot = false;
}
}
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
- if (!pscan->phs_snapshot_any)
+ /*
+ * For scans that use periodic snapshot resets, mark the scan accordingly
+ * and use the active snapshot as the initial state.
+ */
+ if (pscan->phs_reset_snapshot)
+ {
+ Assert(ActiveSnapshotSet());
+ flags |= SO_RESET_SNAPSHOT;
+ /* Start with current active snapshot. */
+ snapshot = GetActiveSnapshot();
+ }
+ else if (!pscan->phs_snapshot_any)
{
/* Snapshot was serialized -- restore it */
snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
#define PARALLEL_KEY_RELMAPPER_STATE UINT64CONST(0xFFFFFFFFFFFF000D)
#define PARALLEL_KEY_UNCOMMITTEDENUMS UINT64CONST(0xFFFFFFFFFFFF000E)
#define PARALLEL_KEY_CLIENTCONNINFO UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED UINT64CONST(0xFFFFFFFFFFFF0010)
/* Fixed-size parallel state. */
typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+ pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate how much we'll need for the entrypoint info. */
shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
char *entrypointstate;
char *uncommittedenumsspace;
char *clientconninfospace;
+ bool *snapshot_set_flag_space;
Size lnamelen;
/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
strcpy(entrypointstate, pcxt->library_name);
strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+ /*
+ * Establish dynamic shared memory to pass information about importing
+ * of snapshot.
+ */
+ snapshot_set_flag_space =
+ shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i * sizeof(bool);
+ *pcxt->worker[i].snapshot_restored = false;
+ }
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
}
/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
}
}
+
+ /* Set snapshot restored flag to false. */
+ if (pcxt->nworkers > 0)
+ {
+ bool *snapshot_restored_space;
+ int i;
+ snapshot_restored_space =
+ shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ for (i = 0; i < pcxt->nworkers; ++i)
+ snapshot_restored_space[i] = false;
+ }
}
/*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* Wait for all workers to attach to their error queues, and throw an error if
* any worker fails to do this.
*
+ * wait_for_snapshot: if true, also wait until each parallel worker has
+ * restored its snapshot. This is needed when using periodic snapshot resets
+ * to ensure all workers have a valid initial snapshot before proceeding
+ * with the scan.
+ *
* Callers can assume that if this function returns successfully, then the
* number of workers given by pcxt->nworkers_launched have initialized and
* attached to their error queues. Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* call this function at all.
*/
void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
{
int i;
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
if (shm_mq_get_sender(mq) != NULL)
{
- /* Yes, so it is known to be attached. */
- pcxt->known_attached_workers[i] = true;
- ++pcxt->nknown_attached_workers;
+ if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+ {
+ /* Yes, so it is known to be attached. */
+ pcxt->known_attached_workers[i] = true;
+ ++pcxt->nknown_attached_workers;
+ }
}
}
else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
shm_toc *toc;
FixedParallelState *fps;
char *error_queue_space;
+ bool *snapshot_restored_space;
shm_mq *mq;
shm_mq_handle *mqh;
char *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
fps->parallel_leader_pgproc);
PushActiveSnapshot(asnapshot);
+ /* Snapshot is restored; set the flag to let the leader know. */
+ snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ snapshot_restored_space[ParallelWorkerNumber] = true;
+
/*
* We've changed which tuples we can see, and must therefore invalidate
* system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e0ada5ce159..f4464f64789 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1530,7 +1530,7 @@ index_concurrently_build(Oid heapRelationId,
/* Invalidate catalog snapshot just for assert */
InvalidateCatalogSnapshot();
- Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+ Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
table_parallelscan_initialize(node->ss.ss_currentRelation,
pscan,
- estate->es_snapshot);
+ estate->es_snapshot,
+ false);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 2189bf0d9ae..b3cc7a2c150 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -287,14 +287,6 @@ GetTransactionSnapshot(void)
Snapshot
GetLatestSnapshot(void)
{
- /*
- * We might be able to relax this, but nothing that could otherwise work
- * needs it.
- */
- if (IsInParallelMode())
- elog(ERROR,
- "cannot update SecondarySnapshot during a parallel operation");
-
/*
* So far there are no cases requiring support for GetLatestSnapshot()
* during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
{
BackgroundWorkerHandle *bgwhandle;
shm_mq_handle *error_mqh;
+ bool *snapshot_restored;
} ParallelWorkerInfo;
typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
extern void DestroyParallelContext(ParallelContext *pcxt);
extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e1884acf493..a9603084aeb 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -88,6 +88,7 @@ typedef struct ParallelTableScanDescData
RelFileLocator phs_locator; /* physical relation to scan */
bool phs_syncscan; /* report location to syncscan logic? */
bool phs_snapshot_any; /* SnapshotAny, not phs_snapshot_data? */
+ bool phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
Size phs_snapshot_off; /* data for snapshot */
} ParallelTableScanDescData;
typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f4c7d2a92bf..9ee5ea15fd4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1184,7 +1184,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
*/
extern void table_parallelscan_initialize(Relation rel,
ParallelTableScanDesc pscan,
- Snapshot snapshot);
+ Snapshot snapshot,
+ bool reset_snapshot);
/*
* Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1802,9 +1803,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
*
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In the case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is
+ * applied to the scan. That causes snapshots to be changed on the fly to let
+ * the xmin horizon propagate.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 5db54530f17..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
(1 row)
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,24 +78,35 @@ NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep the test stable, since a parallel worker may complete the scan before the leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach
+-------------------------
+
+(1 row)
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
NOTICE: notice triggered for injection point table_parallelscan_initialize
@@ -97,7 +114,9 @@ REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
NOTICE: drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
SELECT injection_points_set_local();
SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep the test stable, since a parallel worker may complete the scan before the leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
--
2.43.0
Attachment: v8-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch (application/octet-stream)
From 12efb82206cee7843bf17ccabacc91435d0bac5a Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v8 1/7] this is https://commitfest.postgresql.org/50/5160/
 merged into a single commit; it is required for the stability of the stress tests.
---
src/backend/commands/indexcmds.c | 4 +-
src/backend/executor/execIndexing.c | 3 +
src/backend/executor/execPartition.c | 119 ++++++++-
src/backend/executor/nodeModifyTable.c | 2 +
src/backend/optimizer/util/plancat.c | 135 +++++++---
src/backend/utils/time/snapmgr.c | 2 +
src/test/modules/injection_points/Makefile | 7 +-
.../expected/index_concurrently_upsert.out | 80 ++++++
.../index_concurrently_upsert_predicate.out | 80 ++++++
.../expected/reindex_concurrently_upsert.out | 238 ++++++++++++++++++
...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 11 +
.../specs/index_concurrently_upsert.spec | 68 +++++
.../index_concurrently_upsert_predicate.spec | 70 ++++++
.../specs/reindex_concurrently_upsert.spec | 86 +++++++
...dex_concurrently_upsert_on_constraint.spec | 86 +++++++
...index_concurrently_upsert_partitioned.spec | 88 +++++++
18 files changed, 1505 insertions(+), 50 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
* before the reference snap was taken, we have to wait out any
* transactions that might have older snapshots.
*/
+ INJECTION_POINT("define_index_before_set_valid");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_3);
WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* the same time to make sure we only get constraint violations from the
* indexes with the correct names.
*/
-
+ INJECTION_POINT("reindex_relation_concurrently_before_swap");
StartTransactionCommand();
/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* index_drop() for more details.
*/
+ INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_4);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -936,6 +937,8 @@ retry:
econtext->ecxt_scantuple = save_scantuple;
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
return !conflict;
}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
return rri;
}
+/*
+ * IsIndexCompatibleAsArbiter
+ * Checks whether the given index can be used as an arbiter for the
+ * INSERT ON CONFLICT operation by comparing it to the provided
+ * arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation arbiterIndexRelation,
+ IndexInfo *arbiterIndexInfo,
+ Relation indexRelation,
+ IndexInfo *indexInfo)
+{
+ int i;
+
+ if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+ return false;
+ /* Not supported for exclusion constraints. */
+ if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+ return false;
+ if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+ return false;
+
+ for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+ {
+ int arbiterAttno = arbiterIndexRelation->rd_index->indkey.values[i];
+ int attno = indexRelation->rd_index->indkey.values[i];
+ if (arbiterAttno != attno)
+ return false;
+ }
+
+ if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+ RelationGetIndexExpressions(indexRelation)) != NIL)
+ return false;
+
+ if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+ RelationGetIndexPredicate(indexRelation)) != NIL)
+ return false;
+ return true;
+}
+
/*
* ExecInitPartitionInfo
* Lock the partition and initialize ResultRelInfo. Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
{
List *childIdxs;
+ List *nonAncestorIdxs = NIL;
+ int i, j, additional_arbiters = 0;
childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
ListCell *lc2;
ancestors = get_partition_ancestors(childIdx);
- foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ if (ancestors)
{
- if (list_member_oid(ancestors, lfirst_oid(lc2)))
- arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ {
+ if (list_member_oid(ancestors, lfirst_oid(lc2)))
+ arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ }
}
+ else /* No ancestor was found for that index. Save it for rechecking later. */
+ nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
list_free(ancestors);
}
+
+ /*
+ * If any non-ancestor indexes are found, we need to compare them with other
+ * indexes of the relation that will be used as arbiters. This is necessary
+ * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+ * must be considered as arbiters to ensure that all concurrent transactions
+ * use the same set of arbiters.
+ */
+ if (nonAncestorIdxs)
+ {
+ for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+ {
+ if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+ {
+ Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+ IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+ Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+ /* It is too early to use non-ready indexes as arbiters */
+ if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+ continue;
+
+ for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+ {
+ if (list_member_oid(arbiterIndexes,
+ leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+ {
+ Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+ IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+ /* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+ if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+ nonAncestorIndexRelation, nonAncestorIndexInfo))
+ {
+ arbiterIndexes = lappend_oid(arbiterIndexes,
+ nonAncestorIndexRelation->rd_index->indexrelid);
+ additional_arbiters++;
+ }
+ }
+ }
+ }
+ }
+ }
+ list_free(nonAncestorIdxs);
+
+ /*
+ * If the resulting lists are of inequal length, something is wrong.
+ * (This shouldn't happen, since arbiter index selection should not
+ * pick up a non-ready index.)
+ *
+ * But we also need to account for the additional arbiter indexes.
+ */
+ if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+ list_length(arbiterIndexes) - additional_arbiters)
+ elog(ERROR, "invalid arbiter index list");
}
-
- /*
- * If the resulting lists are of inequal length, something is wrong.
- * (This shouldn't happen, since arbiter index selection should not
- * pick up an invalid index.)
- */
- if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
- list_length(arbiterIndexes))
- elog(ERROR, "invalid arbiter index list");
leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1161520f76b..23cf4c6b540 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
#include "utils/datum.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
return NULL;
}
}
+ INJECTION_POINT("exec_insert_before_insert_speculative");
/*
* Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 153390f2dc9..56b58d1ed74 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
List *indexList;
ListCell *l;
- /* Normalized inference attributes and inference expressions: */
- Bitmapset *inferAttrs = NULL;
- List *inferElems = NIL;
+ /* Normalized required attributes and expressions: */
+ Bitmapset *requiredArbiterAttrs = NULL;
+ List *requiredArbiterElems = NIL;
+ List *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
/* Results */
List *results = NIL;
+ bool foundValid = false;
/*
* Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
if (!IsA(elem->expr, Var))
{
- /* If not a plain Var, just shove it in inferElems for now */
- inferElems = lappend(inferElems, elem->expr);
+ /* If not a plain Var, just shove it in requiredArbiterElems for now */
+ requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
continue;
}
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("whole row unique index inference specifications are not supported")));
- inferAttrs = bms_add_member(inferAttrs,
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
attno - FirstLowInvalidHeapAttributeNumber);
}
+ indexList = RelationGetIndexList(relation);
+
/*
* Lookup named constraint's index. This is not immediately returned
- * because some additional sanity checks are required.
+ * because some additional sanity checks are required. Additionally, we
+ * need to process other indexes as potential arbiters to account for
+ * cases where REINDEX CONCURRENTLY is processing an index used as a
+ * named constraint.
*/
if (onconflict->constraint != InvalidOid)
{
indexOidFromConstraint = get_constraint_index(onconflict->constraint);
if (indexOidFromConstraint == InvalidOid)
+ {
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("constraint in ON CONFLICT clause has no associated index")));
+ errmsg("constraint in ON CONFLICT clause has no associated index")));
+ }
+
+ /*
+ * Find the named constraint index to extract its attributes and predicates.
+ * We open all indexes in the loop to avoid deadlocks from inconsistent lock ordering.
+ */
+ foreach(l, indexList)
+ {
+ Oid indexoid = lfirst_oid(l);
+ Relation idxRel;
+ Form_pg_index idxForm;
+ AttrNumber natt;
+
+ idxRel = index_open(indexoid, rte->rellockmode);
+ idxForm = idxRel->rd_index;
+
+ if (idxForm->indisready)
+ {
+ if (indexOidFromConstraint == idxForm->indexrelid)
+ {
+ /*
+ * Prepare requirements for other indexes to be used as arbiters together
+ * with indexOidFromConstraint. This is required to involve both equal
+ * indexes in the case of REINDEX CONCURRENTLY.
+ */
+ for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+ {
+ int attno = idxRel->rd_index->indkey.values[natt];
+
+ if (attno != 0)
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+ attno - FirstLowInvalidHeapAttributeNumber);
+ }
+ requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+ requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+ /* We are done, so quit the loop. */
+ index_close(idxRel, NoLock);
+ break;
+ }
+ }
+ index_close(idxRel, NoLock);
+ }
}
/*
* Using that representation, iterate through the list of indexes on the
* target relation to try and find a match
*/
- indexList = RelationGetIndexList(relation);
-
foreach(l, indexList)
{
Oid indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
idxRel = index_open(indexoid, rte->rellockmode);
idxForm = idxRel->rd_index;
- if (!idxForm->indisvalid)
+ /*
+ * We need to consider both indisvalid and indisready indexes because
+ * they may become indisvalid before the execution phase. This is
+ * required to keep the set of indexes used as arbiters the same for
+ * all concurrent transactions.
+ */
+ if (!idxForm->indisready)
goto next;
/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
- results = lappend_oid(results, idxForm->indexrelid);
- list_free(indexList);
- index_close(idxRel, NoLock);
- table_close(relation, NoLock);
- return results;
+ goto found;
}
else if (indexOidFromConstraint != InvalidOid)
{
- /* No point in further work for index in named constraint case */
- goto next;
+ /* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+ if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+ goto next;
+ } else {
+ /*
+ * Only considering conventional inference at this point (not named
+ * constraints), so index under consideration can be immediately
+ * skipped if it's not unique
+ */
+ if (!idxForm->indisunique)
+ goto next;
}
- /*
- * Only considering conventional inference at this point (not named
- * constraints), so index under consideration can be immediately
- * skipped if it's not unique
- */
- if (!idxForm->indisunique)
- goto next;
-
/*
* So-called unique constraints with WITHOUT OVERLAPS are really
* exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/* Non-expression attributes (if any) must match */
- if (!bms_equal(indexedAttrs, inferAttrs))
+ if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
goto next;
/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
if (idxExprs && varno != 1)
ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
+ /*
+ * If arbiterElems are present, check them. If a named constraint is
+ * present, arbiterElems == NIL.
+ */
foreach(el, onconflict->arbiterElems)
{
InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/*
- * Now that all inference elements were matched, ensure that the
+ * In the case of conventional inference, ensure that the
* expression elements from inference clause are not missing any
* cataloged expressions. This does the right thing when unique
* indexes redundantly repeat the same attribute, or if attributes
* redundantly appear multiple times within an inference clause.
+ *
+ * In the case of a named constraint, ensure the candidate has the same
+ * set of expressions as the named constraint index.
*/
- if (list_difference(idxExprs, inferElems) != NIL)
+ if (list_difference(idxExprs, requiredArbiterElems) != NIL)
goto next;
- /*
- * If it's a partial index, its predicate must be implied by the ON
- * CONFLICT's WHERE clause.
- */
predExprs = RelationGetIndexPredicate(idxRel);
if (predExprs && varno != 1)
ChangeVarNodes((Node *) predExprs, 1, varno, 0);
- if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+ /*
+ * If it's a partial index under conventional inference, its predicate must be implied
+ * by the ON CONFLICT's WHERE clause.
+ */
+ if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+ goto next;
+ /* If it's a partial index and a named constraint, the predicates must be equal. */
+ if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
goto next;
+found:
results = lappend_oid(results, idxForm->indexrelid);
+ foundValid |= idxForm->indisvalid;
next:
index_close(idxRel, NoLock);
}
@@ -946,7 +1008,8 @@ next:
list_free(indexList);
table_close(relation, NoLock);
- if (results == NIL)
+ /* At least one indisvalid index is required during planning. */
+ if (results == NIL || !foundValid)
ereport(ERROR,
(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index a1a0c2adeb6..2189bf0d9ae 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
#include "utils/resowner.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/*
@@ -392,6 +393,7 @@ InvalidateCatalogSnapshot(void)
pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
CatalogSnapshot = NULL;
SnapshotResetXmin();
+ INJECTION_POINT("invalidate_catalog_snapshot_end");
}
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+ reindex_concurrently_upsert \
+ index_concurrently_upsert \
+ reindex_concurrently_upsert_partitioned \
+ reindex_concurrently_upsert_on_constraint \
+ index_concurrently_upsert_predicate
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'reindex_concurrently_upsert',
+ 'index_concurrently_upsert',
+ 'reindex_concurrently_upsert_partitioned',
+ 'reindex_concurrently_upsert_on_constraint',
+ 'index_concurrently_upsert_predicate',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
+ # We wait for all snapshots, so avoid running the tests in parallel
+ 'runningcheck-parallel': false,
},
'tap': {
'env': {
@@ -53,5 +62,7 @@ tests += {
'tests': [
't/001_stats.pl',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
},
}
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+ CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
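+# The permutations below vary when each UPSERT starts relative to the index
+# swap, exercising the window between the "swap" and "set dead" steps of
+# REINDEX INDEX CONCURRENTLY.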
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
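+# Same permutations as in reindex_concurrently_upsert.spec, except that the
+# UPSERTs name the constraint explicitly (ON CONFLICT ON CONSTRAINT tbl_pkey).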
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+ CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+ FOR VALUES FROM (0) TO (10000)
+ WITH (parallel_workers = 0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
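+# Same permutations again, but the reindexed index belongs to a partition
+# while the UPSERTs go through the partitioned parent table.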
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
--
2.43.0
Attachment: v8-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-a.patch (application/octet-stream)
From b6bb0dcc3598b51203ab89940f593f6cfbf6fe7a Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Tue, 24 Dec 2024 13:40:45 +0100
Subject: [PATCH v8 7/7] Improve CREATE/REINDEX INDEX CONCURRENTLY using
auxiliary index
Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves the efficiency of concurrent
index operations by:
- Creating an auxiliary STIR (Short Term Index Replacement) index to track
new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase
instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready
This approach eliminates the need for a second full table scan during index
validation, making the process more efficient, especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.
This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
---
src/backend/access/heap/heapam_handler.c | 384 +++++++++---------
src/backend/catalog/index.c | 280 +++++++++++--
src/backend/catalog/toasting.c | 3 +-
src/backend/commands/indexcmds.c | 362 +++++++++++++----
src/include/access/tableam.h | 28 +-
src/include/catalog/index.h | 15 +-
src/include/commands/progress.h | 4 +-
.../expected/cic_reset_snapshots.out | 28 ++
.../sql/cic_reset_snapshots.sql | 1 +
src/test/regress/expected/create_index.out | 4 +
src/test/regress/expected/indexing.out | 3 +-
src/test/regress/sql/create_index.sql | 3 +
12 files changed, 792 insertions(+), 323 deletions(-)
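For orientation, the DefineIndex() sequence implemented by the diff below boils
down to roughly the following. This is a condensed, non-compilable sketch, not
the literal code: the names match the patch, but error handling, progress
reporting, snapshot pushes and the PROC_IN_SAFE_IC handling are omitted:

    /* catalog entries for both indexes, then let everyone see them */
    auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
                                                       tablespaceId,
                                                       auxIndexRelationName);
    CommitTransactionCommand();
    StartTransactionCommand();
    WaitForLockers(heaplocktag, ShareLock, true);

    /* empty STIR index becomes ready for inserts, then the real build */
    index_concurrently_build(tableId, auxIndexRelationId, true);
    CommitTransactionCommand();
    StartTransactionCommand();
    WaitForLockers(heaplocktag, ShareLock, true);
    index_concurrently_build(tableId, indexRelationId, false);
    CommitTransactionCommand();
    StartTransactionCommand();
    WaitForLockers(heaplocktag, ShareLock, true);

    /* merge the auxiliary index into the target one */
    index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
    limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);

    /* kill and drop the auxiliary index, then mark the target valid */
    index_concurrently_set_dead(tableId, auxIndexRelationId);
    WaitForLockers(heaplocktag, AccessExclusiveLock, true);
    performDeletion(&auxAddress, DROP_RESTRICT,
                    PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
    WaitForOlderSnapshots(limitXmin, true);
    index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);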
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 921b806642a..d575083962b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
#include "storage/bufpage.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
@@ -1777,246 +1778,267 @@ heapam_index_build_range_scan(Relation heapRelation,
return reltuples;
}
-static void
+static TransactionId
heapam_index_validate_scan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
- Snapshot snapshot,
- ValidateIndexState *state)
+ ValidateIndexState *state,
+ ValidateIndexState *auxState)
{
- TableScanDesc scan;
- HeapScanDesc hscan;
- HeapTuple heapTuple;
+ IndexFetchTableData *fetch;
+ TransactionId limitXmin;
+
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
- ExprState *predicate;
- TupleTableSlot *slot;
- EState *estate;
- ExprContext *econtext;
- BlockNumber root_blkno = InvalidBlockNumber;
- OffsetNumber root_offsets[MaxHeapTuplesPerPage];
- bool in_index[MaxHeapTuplesPerPage];
- BlockNumber previous_blkno = InvalidBlockNumber;
+
+ Snapshot snapshot;
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
/* state variables for the merge */
- ItemPointer indexcursor = NULL;
- ItemPointerData decoded;
- bool tuplesort_empty = false;
+ ItemPointer indexcursor = NULL,
+ auxindexcursor = NULL,
+ prev_indexcursor = NULL;
+ ItemPointerData decoded,
+ auxdecoded,
+ prev_decoded,
+ fetched;
+ bool tuplesort_empty = false,
+ auxtuplesort_empty = false;
+
+ Assert(!HaveRegisteredOrActiveSnapshot());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+
+ /*
+ * Now take the "reference snapshot" that will be used by to filter candidate
+ * tuples. Beware! There might still be snapshots in
+ * use that treat some transaction as in-progress that our reference
+ * snapshot treats as committed. If such a recently-committed transaction
+ * deleted tuples in the table, we will not include them in the index; yet
+ * those transactions which see the deleting one as still-in-progress will
+ * expect such tuples to be there once we mark the index as valid.
+ *
+ * We solve this by waiting for all endangered transactions to exit before
+ * we mark the index as valid.
+ *
+ * We also set ActiveSnapshot to this snap, since functions in indexes may
+ * need a snapshot.
+ */
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ limitXmin = snapshot->xmin;
/*
* sanity checks
*/
Assert(OidIsValid(indexRelation->rd_rel->relam));
- /*
- * Need an EState for evaluation of index expressions and partial-index
- * predicates. Also a slot to hold the current tuple.
- */
estate = CreateExecutorState();
econtext = GetPerTupleExprContext(estate);
slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
- &TTSOpsHeapTuple);
+ &TTSOpsBufferHeapTuple);
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
- /* Set up execution state for predicate, if any. */
- predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
/*
- * Prepare for scan of the base relation. We need just those tuples
- * satisfying the passed-in reference snapshot. We must disable syncscan
- * here, because it's critical that we read from block zero forward to
- * match the sorted TIDs.
+	 * Prepare to fetch heap tuples the way an index scan does. This lets us
+	 * reconstruct a tuple from the heap when we only have an ItemPointer.
*/
- scan = table_beginscan_strat(heapRelation, /* relation */
- snapshot, /* snapshot */
- 0, /* number of keys */
- NULL, /* scan key */
- true, /* buffer access strategy OK */
- false, /* syncscan not OK */
- false);
- hscan = (HeapScanDesc) scan;
+ fetch = heapam_index_fetch_begin(heapRelation);
+
+ /* Initialize pointers. */
+ ItemPointerSetInvalid(&decoded);
+ ItemPointerSetInvalid(&prev_decoded);
+ ItemPointerSetInvalid(&auxdecoded);
+ ItemPointerSetInvalid(&fetched);
- pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
- hscan->rs_nblocks);
+ /* We'll track the last "main" index position in prev_indexcursor. */
+ prev_indexcursor = &prev_decoded;
/*
- * Scan all tuples matching the snapshot.
+ * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+ * which holds TIDs that must be merged with or compared to those from
+ * the "main" sort (state->tuplesort).
*/
- while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ while (!auxtuplesort_empty)
{
- ItemPointer heapcursor = &heapTuple->t_self;
- ItemPointerData rootTuple;
- OffsetNumber root_offnum;
-
+ Datum ts_val;
+ bool ts_isnull;
CHECK_FOR_INTERRUPTS();
- state->htups += 1;
-
- if ((previous_blkno == InvalidBlockNumber) ||
- (hscan->rs_cblock != previous_blkno))
- {
- pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- hscan->rs_cblock);
- previous_blkno = hscan->rs_cblock;
- }
-
/*
- * As commented in table_index_build_scan, we should index heap-only
- * tuples under the TIDs of their root tuples; so when we advance onto
- * a new heap page, build a map of root item offsets on the page.
- *
- * This complicates merging against the tuplesort output: we will
- * visit the live tuples in order by their offsets, but the root
- * offsets that we need to compare against the index contents might be
- * ordered differently. So we might have to "look back" within the
- * tuplesort output, but only within the current page. We handle that
- * by keeping a bool array in_index[] showing all the
- * already-passed-over tuplesort output TIDs of the current page. We
- * clear that array here, when advancing onto a new heap page.
- */
- if (hscan->rs_cblock != root_blkno)
+ * Attempt to fetch the next TID from the auxiliary sort. If it's
+ * empty, we set auxindexcursor to NULL.
+ */
+ auxtuplesort_empty = !tuplesort_getdatum(auxState->tuplesort, true,
+ false, &ts_val, &ts_isnull,
+ NULL);
+ Assert(auxtuplesort_empty || !ts_isnull);
+ if (!auxtuplesort_empty)
{
- Page page = BufferGetPage(hscan->rs_cbuf);
-
- LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
- heap_get_root_tuples(page, root_offsets);
- LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
- memset(in_index, 0, sizeof(in_index));
-
- root_blkno = hscan->rs_cblock;
+ itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+ auxindexcursor = &auxdecoded;
}
-
- /* Convert actual tuple TID to root TID */
- rootTuple = *heapcursor;
- root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
- if (HeapTupleIsHeapOnly(heapTuple))
+ else
{
- root_offnum = root_offsets[root_offnum - 1];
- if (!OffsetNumberIsValid(root_offnum))
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
- ItemPointerGetBlockNumber(heapcursor),
- ItemPointerGetOffsetNumber(heapcursor),
- RelationGetRelationName(heapRelation))));
- ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
+ auxindexcursor = NULL;
}
/*
- * "merge" by skipping through the index tuples until we find or pass
- * the current root tuple.
- */
- while (!tuplesort_empty &&
- (!indexcursor ||
- ItemPointerCompare(indexcursor, &rootTuple) < 0))
+ * If the auxiliary sort is not yet empty, we now try to synchronize
+ * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+ * the main sort cursor until we've reached or passed the auxiliary TID.
+ */
+ if (!auxtuplesort_empty)
{
- Datum ts_val;
- bool ts_isnull;
-
- if (indexcursor)
+ /*
+ * Move the main sort forward while:
+ * (1) It's not exhausted (tuplesort_empty == false), and
+ * (2) Either indexcursor is NULL (first iteration) or
+ * indexcursor < auxindexcursor in TID order.
+ */
+ while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+ ItemPointerCompare(indexcursor, auxindexcursor) < 0))
{
+ /* Keep track of the previous TID in prev_decoded. */
+ prev_decoded = decoded;
/*
- * Remember index items seen earlier on the current heap page
+ * Get the next TID from the main sort. If it's empty,
+ * we set indexcursor to NULL.
*/
- if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
- in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
- }
-
- tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
- false, &ts_val, &ts_isnull,
- NULL);
- Assert(tuplesort_empty || !ts_isnull);
- if (!tuplesort_empty)
- {
- itemptr_decode(&decoded, DatumGetInt64(ts_val));
- indexcursor = &decoded;
- }
- else
- {
- /* Be tidy */
- indexcursor = NULL;
+ tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+ false, &ts_val, &ts_isnull,
+ NULL);
+ Assert(tuplesort_empty || !ts_isnull);
+ if (!tuplesort_empty)
+ {
+ itemptr_decode(&decoded, DatumGetInt64(ts_val));
+ indexcursor = &decoded;
+
+ /*
+ * If the current TID in the main sort is a duplicate of the
+ * previous one (prev_indexcursor), skip it to avoid
+				 * double-inserting the same TID. Such a situation is possible
+				 * due to concurrent page splits in btree (and probably other
+				 * index AMs as well).
+ */
+ if (ItemPointerCompare(prev_indexcursor, indexcursor) == 0)
+ {
+ elog(DEBUG5, "skipping duplicate tid in target index snapshot: (%u,%u)",
+ ItemPointerGetBlockNumber(indexcursor),
+ ItemPointerGetOffsetNumber(indexcursor));
+ }
+ }
+ else
+ {
+ indexcursor = NULL;
+ }
+
+ CHECK_FOR_INTERRUPTS();
}
- }
-
- /*
- * If the tuplesort has overshot *and* we didn't see a match earlier,
- * then this tuple is missing from the index, so insert it.
- */
- if ((tuplesort_empty ||
- ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
- !in_index[root_offnum - 1])
- {
- MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
- /* Set up for predicate or expression evaluation */
- ExecStoreHeapTuple(heapTuple, slot, false);
/*
- * In a partial index, discard tuples that don't satisfy the
- * predicate.
+ * Now, if either:
+ * - the main sort is empty, or
+ * - indexcursor > auxindexcursor,
+ *
+ * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We then need to insert it into the target
+			 * index, provided it's visible in the heap.
*/
- if (predicate != NULL)
+ if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
{
- if (!ExecQual(predicate, econtext))
- continue;
- }
+ bool call_again = false;
+ bool all_dead = false;
+ ItemPointer tid;
- /*
- * For the current heap tuple, extract all the attributes we use
- * in this index, and note which are null. This also performs
- * evaluation of any expressions needed.
- */
- FormIndexDatum(indexInfo,
- slot,
- estate,
- values,
- isnull);
+ /* Copy the auxindexcursor TID into fetched. */
+ fetched = *auxindexcursor;
+ tid = &fetched;
- /*
- * You'd think we should go ahead and build the index tuple here,
- * but some index AMs want to do further processing on the data
- * first. So pass the values[] and isnull[] arrays, instead.
- */
-
- /*
- * If the tuple is already committed dead, you might think we
- * could suppress uniqueness checking, but this is no longer true
- * in the presence of HOT, because the insert is actually a proxy
- * for a uniqueness check on the whole HOT-chain. That is, the
- * tuple we have here could be dead because it was already
- * HOT-updated, and if so the updating transaction will not have
- * thought it should insert index entries. The index AM will
- * check the whole HOT-chain and correctly detect a conflict if
- * there is one.
- */
+ /* Reset the per-tuple memory context for the next fetch. */
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+ state->htups += 1;
- index_insert(indexRelation,
- values,
- isnull,
- &rootTuple,
- heapRelation,
- indexInfo->ii_Unique ?
- UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- false,
- indexInfo);
-
- state->tups_inserted += 1;
+ /*
+ * Fetch the tuple from the heap to see if it's visible
+ * under our snapshot. If it is, form the index key values
+ * and insert a new entry into the target index.
+ */
+ if (heapam_index_fetch_tuple(fetch, tid, snapshot, slot, &call_again, &all_dead))
+ {
+
+ /* Compute the key values and null flags for this tuple. */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ /*
+ * Insert the tuple into the target index.
+ */
+ index_insert(indexRelation,
+ values,
+ isnull,
+ auxindexcursor, /* insert root tuple */
+ heapRelation,
+ indexInfo->ii_Unique ?
+ UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+ false,
+ indexInfo);
+
+ state->tups_inserted += 1;
+
+ elog(DEBUG5, "inserted tid: (%u,%u), root: (%u, %u)",
+ ItemPointerGetBlockNumber(auxindexcursor),
+ ItemPointerGetOffsetNumber(auxindexcursor),
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
+ }
+ else
+ {
+ /*
+ * The tuple wasn't visible under our snapshot. We
+ * skip inserting it into the target index because
+ * from our perspective, it doesn't exist.
+ */
+ elog(DEBUG5, "skipping insert to target index because tid not visible: (%u,%u)",
+ ItemPointerGetBlockNumber(auxindexcursor),
+ ItemPointerGetOffsetNumber(auxindexcursor));
+ }
+ }
}
}
- table_endscan(scan);
-
ExecDropSingleTupleTableSlot(slot);
FreeExecutorState(estate);
+ heapam_index_fetch_end(fetch);
+
+ /*
+ * Drop the reference snapshot. We must do this before waiting out other
+ * snapshot holders, else we will deadlock against other processes also
+ * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+ * they must wait for. But first, save the snapshot's xmin to use as
+ * limitXmin for GetCurrentVirtualXIDs().
+ */
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+ InvalidateCatalogSnapshot();
+ Assert(MyProc->xmin == InvalidTransactionId);
+#if USE_INJECTION_POINTS
+ if (MyProc->xid == InvalidTransactionId)
+ INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
/* These may have been pointing to the now-gone estate */
indexInfo->ii_ExpressionsState = NIL;
indexInfo->ii_PredicateState = NULL;
+
+ return limitXmin;
}
/*
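The validate-scan merge implemented above is, at its core, a plain two-cursor
walk over two sorted TID streams. A minimal, self-contained sketch of the
control flow follows; it is a simplification, not the patch's code: plain
int64 arrays stand in for the tuplesorts, the snapshot visibility test is
stubbed out, and all names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef int64_t tid64;      /* TIDs encoded as int64, as in the tuplesorts */

    /* Stand-in for the heapam_index_fetch_tuple() visibility check. */
    static bool
    tid_visible(tid64 tid)
    {
        (void) tid;
        return true;
    }

    /*
     * For every TID present in the sorted auxiliary stream but absent from
     * the sorted target stream, insert it into the target index if it is
     * still visible.
     */
    static void
    merge_validate(const tid64 *auxtids, int naux,
                   const tid64 *maintids, int nmain)
    {
        int     m = 0;

        for (int a = 0; a < naux; a++)
        {
            /*
             * Advance the target cursor until it reaches or passes
             * auxtids[a]. Adjacent duplicates in the target stream (possible
             * after concurrent page splits) are simply stepped over here.
             */
            while (m < nmain && maintids[m] < auxtids[a])
                m++;

            /* Target exhausted or overshot: this TID is missing. */
            if ((m == nmain || maintids[m] > auxtids[a]) &&
                tid_visible(auxtids[a]))
                printf("would insert TID %lld\n", (long long) auxtids[a]);
        }
    }

    int
    main(void)
    {
        tid64   aux[] = {3, 5, 9};
        tid64   tgt[] = {1, 3, 7};

        merge_validate(aux, 3, tgt, 3); /* reports 5 and 9 as missing */
        return 0;
    }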
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 92d5f3ac009..f0389ef8583 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -718,6 +718,9 @@ UpdateIndexRelation(Oid indexoid,
* allow_system_table_mods: allow table to be a system catalog
* is_internal: if true, post creation hook for new index
* constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most
+ * cases it should be equal to the persistence level of the table;
+ * auxiliary indexes are the only exception here.
*
* Returns the OID of the created index.
*/
@@ -742,7 +745,8 @@ index_create(Relation heapRelation,
bits16 constr_flags,
bool allow_system_table_mods,
bool is_internal,
- Oid *constraintId)
+ Oid *constraintId,
+ char relpersistence)
{
Oid heapRelationId = RelationGetRelid(heapRelation);
Relation pg_class;
@@ -753,11 +757,11 @@ index_create(Relation heapRelation,
bool is_exclusion;
Oid namespaceId;
int i;
- char relpersistence;
bool isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
bool invalid = (flags & INDEX_CREATE_INVALID) != 0;
bool concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
bool partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+ bool auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
char relkind;
TransactionId relfrozenxid;
MultiXactId relminmxid;
@@ -783,7 +787,6 @@ index_create(Relation heapRelation,
namespaceId = RelationGetNamespace(heapRelation);
shared_relation = heapRelation->rd_rel->relisshared;
mapped_relation = RelationIsMapped(heapRelation);
- relpersistence = heapRelation->rd_rel->relpersistence;
/*
* check parameters
@@ -791,6 +794,11 @@ index_create(Relation heapRelation,
if (indexInfo->ii_NumIndexAttrs < 1)
elog(ERROR, "must index at least one column");
+ if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("user-defined indexes with STIR access method are not supported")));
+
if (!allow_system_table_mods &&
IsSystemRelation(heapRelation) &&
IsNormalProcessingMode())
@@ -1461,7 +1469,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
0,
true, /* allow table to be a system catalog? */
false, /* is_internal? */
- NULL);
+ NULL,
+ heapRelation->rd_rel->relpersistence);
/* Close the relations used and clean up */
index_close(indexRelation, NoLock);
@@ -1471,6 +1480,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller. The index is inserted into the catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+ Oid tablespaceOid, const char *newName)
+{
+ Relation indexRelation;
+ IndexInfo *oldInfo,
+ *newInfo;
+ Oid newIndexId = InvalidOid;
+ HeapTuple indexTuple;
+
+ List *indexColNames = NIL;
+ List *indexExprs = NIL;
+ List *indexPreds = NIL;
+
+ Oid *auxOpclassIds;
+ int16 *auxColoptions;
+
+ indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+ /* The new index needs some information from the old index */
+ oldInfo = BuildIndexInfo(indexRelation);
+
+ /*
+ * Build of an auxiliary index with exclusion constraints is not
+ * supported.
+ */
+ if (oldInfo->ii_ExclusionOps != NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+ /* Get the array of class and column options IDs from index info */
+ indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+ if (!HeapTupleIsValid(indexTuple))
+ elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+ /*
+ * Fetch the list of expressions and predicates directly from the
+ * catalogs. This cannot rely on the information from IndexInfo of the
+ * old index as these have been flattened for the planner.
+ */
+ if (oldInfo->ii_Expressions != NIL)
+ {
+ Datum exprDatum;
+ char *exprString;
+
+ exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+ Anum_pg_index_indexprs);
+ exprString = TextDatumGetCString(exprDatum);
+ indexExprs = (List *) stringToNode(exprString);
+ pfree(exprString);
+ }
+ if (oldInfo->ii_Predicate != NIL)
+ {
+ Datum predDatum;
+ char *predString;
+
+ predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+ Anum_pg_index_indpred);
+ predString = TextDatumGetCString(predDatum);
+ indexPreds = (List *) stringToNode(predString);
+
+ /* Also convert to implicit-AND format */
+ indexPreds = make_ands_implicit((Expr *) indexPreds);
+ pfree(predString);
+ }
+
+ /*
+ * Build the index information for the new index. Note that rebuild of
+ * indexes with exclusion constraints is not supported, hence there is no
+ * need to fill all the ii_Exclusion* fields.
+ */
+ newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+ oldInfo->ii_NumIndexKeyAttrs,
+ STIR_AM_OID, /* special AM for aux indexes */
+ indexExprs,
+ indexPreds,
+								false,	/* aux indexes are not unique */
+ oldInfo->ii_NullsNotDistinct,
+ false, /* not ready for inserts */
+ true,
+								false,	/* aux indexes are not summarizing */
+ oldInfo->ii_WithoutOverlaps);
+
+ /*
+ * Extract the list of column names and the column numbers for the new
+ * index information. All this information will be used for the index
+ * creation.
+ */
+ for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+ {
+ TupleDesc indexTupDesc = RelationGetDescr(indexRelation);
+ Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+ indexColNames = lappend(indexColNames, NameStr(att->attname));
+ newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+ }
+
+ auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+ auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+ /* Fill with "any ops" */
+ for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+ {
+ auxOpclassIds[i] = ANY_STIR_OPS_OID;
+ auxColoptions[i] = 0;
+ }
+
+ newIndexId = index_create(heapRelation,
+ newName,
+ InvalidOid, /* indexRelationId */
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidRelFileNumber, /* relFileNumber */
+ newInfo,
+ indexColNames,
+ STIR_AM_OID,
+ tablespaceOid,
+ indexRelation->rd_indcollation,
+ auxOpclassIds,
+ NULL,
+ auxColoptions,
+ NULL,
+ (Datum) 0,
+ INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+ 0,
+ true, /* allow table to be a system catalog? */
+ false, /* is_internal? */
+ NULL,
+ RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+ /* Close the relations used and clean up */
+ index_close(indexRelation, NoLock);
+ ReleaseSysCache(indexTuple);
+
+ return newIndexId;
+}
+
/*
* index_concurrently_build
*
@@ -1482,7 +1639,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
*/
void
index_concurrently_build(Oid heapRelationId,
- Oid indexRelationId)
+ Oid indexRelationId,
+ bool auxiliary)
{
Relation heapRel;
Oid save_userid;
@@ -1523,6 +1681,7 @@ index_concurrently_build(Oid heapRelationId,
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
+ indexInfo->ii_Auxiliary = auxiliary;
Assert(!TransactionIdIsValid(MyProc->xmin));
/* Now build the index */
@@ -3275,12 +3434,20 @@ IndexCheckExclusion(Relation heapRelation,
*
* We do a concurrent index build by first inserting the catalog entry for the
* index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index in the same way; it is based on the STIR AM.
* Then we commit our transaction and start a new one, then we wait for all
* transactions that could have been modifying the table to terminate. Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
* honor its constraints on HOT updates; so while existing HOT-chains might
* be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it. We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index. This is a fast operation without any
+ * actual table scan, leaving us with an empty STIR index. We wait again for
+ * all transactions that could have been modifying the table to terminate. From
+ * that moment on, all new tuples are going to be inserted into the auxiliary index.
+ *
+ * We now build the index normally via
* index_build(), while holding a weak lock that allows concurrent
* insert/update/delete. Also, we index only tuples that are valid
* as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3291,6 +3458,7 @@ IndexCheckExclusion(Relation heapRelation,
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
+ * Those tuples, however, are contained in the auxiliary index.
*
* Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
* snapshot to be set as active every so often. The reason for that is to
@@ -3300,8 +3468,10 @@ IndexCheckExclusion(Relation heapRelation,
* commit the second transaction and start a third. Again we wait for all
* transactions that could have been modifying the table to terminate. Now
* we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it. We then take a new reference snapshot
- * which is passed to validate_index(). Any tuples that are valid according
+ * insert their new tuples into it. At that moment we clear "indisready" for
+ * the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
* to this snap, but are not in the index, must be added to the index.
* (Any tuples committed live after the snap will be inserted into the
* index by their originating transaction. Any tuples committed dead before
@@ -3309,12 +3479,14 @@ IndexCheckExclusion(Relation heapRelation,
* that might care about them before we mark the index valid.)
*
* validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
* ever say "delete it". (This should be faster than a plain indexscan;
* also, not all index AMs support full-index indexscan.) Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index. Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and the target index, and do a "merge join"
+ * against the TID lists to see which tuples from the auxiliary index are
+ * missing from the target index. Thus we ensure that all tuples valid
+ * according to the reference snapshot are in the index. Note that we must
+ * perform the bulkdelete calls in a particular order: auxiliary first, target last.
*
* Building a unique index this way is tricky: we might try to insert a
* tuple that is already dead or is in process of being deleted, and we
@@ -3330,24 +3502,25 @@ IndexCheckExclusion(Relation heapRelation,
* necessary to be sure there are none left with a transaction snapshot
* older than the reference (and hence possibly able to see tuples we did
* not index). Then we mark the index "indisvalid" and commit. Subsequent
- * transactions will be able to use it for queries.
- *
- * Doing two full table scans is a brute-force strategy. We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm). However that would
- * add yet more locking issues.
+ * transactions will be able to use it for queries. Finally, the auxiliary
+ * index is dropped.
*/
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
{
Relation heapRelation,
- indexRelation;
+ indexRelation,
+ auxIndexRelation;
IndexInfo *indexInfo;
- IndexVacuumInfo ivinfo;
- ValidateIndexState state;
+ TransactionId limitXmin;
+ IndexVacuumInfo ivinfo, auxivinfo;
+ ValidateIndexState state, auxState;
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+	/* Use 80% of maintenance_work_mem for the target index sort and the
+	 * rest for the auxiliary index sort */
+ int main_work_mem_part = (maintenance_work_mem * 8) / 10;
{
const int progress_index[] = {
@@ -3380,13 +3553,18 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
RestrictSearchPath();
indexRelation = index_open(indexId, RowExclusiveLock);
+ auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
/*
* Fetch info needed for index_insert. (You might think this should be
* passed in from DefineIndex, but its copy is long gone due to having
* been built in a previous transaction.)
+ *
+	 * We might need a snapshot for index expressions or predicates.
*/
+ PushActiveSnapshot(GetTransactionSnapshot());
indexInfo = BuildIndexInfo(indexRelation);
+ PopActiveSnapshot();
/* mark build is concurrent just for consistency */
indexInfo->ii_Concurrent = true;
@@ -3404,15 +3582,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
ivinfo.strategy = NULL;
ivinfo.validate_index = true;
+ /*
+	 * Copy all the info into the auxiliary info, changing only the relation.
+ */
+ auxivinfo = ivinfo;
+ auxivinfo.index = auxIndexRelation;
+
/*
* Encode TIDs as int8 values for the sort, rather than directly sorting
* item pointers. This can be significantly faster, primarily because TID
* is a pass-by-reference type on all platforms, whereas int8 is
* pass-by-value on most platforms.
*/
+ auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+ InvalidOid, false,
+ maintenance_work_mem - main_work_mem_part,
+ NULL, TUPLESORT_NONE);
+ auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+ (void) index_bulk_delete(&auxivinfo, NULL,
+ validate_index_callback, &auxState);
+
state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
InvalidOid, false,
- maintenance_work_mem,
+ main_work_mem_part,
NULL, TUPLESORT_NONE);
state.htups = state.itups = state.tups_inserted = 0;
@@ -3435,27 +3628,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
pgstat_progress_update_multi_param(3, progress_index, progress_vals);
}
tuplesort_performsort(state.tuplesort);
+ tuplesort_performsort(auxState.tuplesort);
+
+ InvalidateCatalogSnapshot();
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/*
- * Now scan the heap and "merge" it with the index
+	 * Now merge the auxiliary index into the target index
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
- table_index_validate_scan(heapRelation,
- indexRelation,
- indexInfo,
- snapshot,
- &state);
+ PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+ limitXmin = table_index_validate_scan(heapRelation,
+ indexRelation,
+ indexInfo,
+ &state,
+ &auxState);
- /* Done with tuplesort object */
+ /* Done with tuplesort objects */
tuplesort_end(state.tuplesort);
+ tuplesort_end(auxState.tuplesort);
/* Make sure to release resources cached in indexInfo (if needed). */
index_insert_cleanup(indexRelation, indexInfo);
elog(DEBUG2,
- "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
- state.htups, state.itups, state.tups_inserted);
+ "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+ " %.0f aux index tuples; inserted %.0f missing tuples",
+ state.htups, state.itups, auxState.itups, state.tups_inserted);
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -3464,8 +3663,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
SetUserIdAndSecContext(save_userid, save_sec_context);
/* Close rels, but keep locks */
+ index_close(auxIndexRelation, NoLock);
index_close(indexRelation, NoLock);
table_close(heapRelation, NoLock);
+
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ return limitXmin;
}
/*
@@ -3524,6 +3727,13 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
Assert(!indexForm->indisvalid);
indexForm->indisvalid = true;
break;
+ case INDEX_DROP_CLEAR_READY:
+ /* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+ Assert(indexForm->indislive);
+ Assert(indexForm->indisready);
+ Assert(!indexForm->indisvalid);
+ indexForm->indisready = false;
+ break;
case INDEX_DROP_CLEAR_VALID:
/*
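Two details of validate_index() above are worth calling out: the memory budget
is split between the two sorts, and the order of the two TID-collecting
bulkdelete passes is significant. A trimmed sketch of that part, reusing the
patch's variable names (aux_work_mem_part is added purely for illustration,
and the comment gives one plausible reading of the ordering constraint):

    /* 80% of maintenance_work_mem to the target index sort, the rest to the
     * (normally much smaller) auxiliary sort */
    int     main_work_mem_part = (maintenance_work_mem * 8) / 10;
    int     aux_work_mem_part = maintenance_work_mem - main_work_mem_part;

    /* Collect the auxiliary TIDs first and the target TIDs last: a tuple
     * inserted concurrently into both indexes then shows up in the target
     * scan whenever it showed up in the auxiliary one, so the merge cannot
     * re-insert an entry the target index already has. */
    (void) index_bulk_delete(&auxivinfo, NULL, validate_index_callback, &auxState);
    (void) index_bulk_delete(&ivinfo, NULL, validate_index_callback, &state);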
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index ad3082c62ac..fbbcd7d00dd 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
BTREE_AM_OID,
rel->rd_rel->reltablespace,
collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
- INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+ INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+ toast_rel->rd_rel->relpersistence);
table_close(toast_rel, NoLock);
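With the extra index_create() parameter, persistence becomes an explicit
caller decision instead of something derived from the heap relation. A
schematic contrast of the two call patterns seen in this patch, with the many
unrelated arguments elided as "...":

    /* ordinary index: inherit the table's persistence */
    index_create(heapRelation, ..., heapRelation->rd_rel->relpersistence);

    /* auxiliary STIR index: always unlogged, since it only lives for the
     * duration of the concurrent build and never has to survive a crash */
    index_create(heapRelation, ..., RELPERSISTENCE_UNLOGGED);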
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a02729911fe..02b636a0050 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -554,6 +554,7 @@ DefineIndex(Oid tableId,
{
bool concurrent;
char *indexRelationName;
+ char *auxIndexRelationName = NULL;
char *accessMethodName;
Oid *typeIds;
Oid *collationIds;
@@ -563,6 +564,7 @@ DefineIndex(Oid tableId,
Oid namespaceId;
Oid tablespaceId;
Oid createdConstraintId = InvalidOid;
+ Oid auxIndexRelationId = InvalidOid;
List *indexColNames;
List *allIndexParams;
Relation rel;
@@ -584,10 +586,10 @@ DefineIndex(Oid tableId,
int numberOfKeyAttributes;
TransactionId limitXmin;
ObjectAddress address;
+ ObjectAddress auxAddress;
LockRelId heaprelid;
LOCKTAG heaplocktag;
LOCKMODE lockmode;
- Snapshot snapshot;
Oid root_save_userid;
int root_save_sec_context;
int root_save_nestlevel;
@@ -834,6 +836,15 @@ DefineIndex(Oid tableId,
stmt->excludeOpNames,
stmt->primary,
stmt->isconstraint);
+ /*
+	 * Select a name for the auxiliary index
+ */
+ if (concurrent)
+ auxIndexRelationName = ChooseRelationName(indexRelationName,
+ NULL,
+ "ccaux",
+ namespaceId,
+ false);
/*
* look up the access method, verify it can handle the requested features
@@ -1227,7 +1238,8 @@ DefineIndex(Oid tableId,
coloptions, NULL, reloptions,
flags, constr_flags,
allowSystemTableMods, !check_rights,
- &createdConstraintId);
+ &createdConstraintId,
+ rel->rd_rel->relpersistence);
ObjectAddressSet(address, RelationRelationId, indexRelationId);
@@ -1569,6 +1581,16 @@ DefineIndex(Oid tableId,
return address;
}
+ /*
+ * In case of concurrent build - create auxiliary index record.
+ */
+ if (concurrent)
+ {
+ auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+ tablespaceId, auxIndexRelationName);
+ ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+ }
+
AtEOXact_GUC(false, root_save_nestlevel);
SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
@@ -1597,11 +1619,11 @@ DefineIndex(Oid tableId,
/*
* For a concurrent build, it's important to make the catalog entries
* visible to other transactions before we start to build the index. That
- * will prevent them from making incompatible HOT updates. The new index
- * will be marked not indisready and not indisvalid, so that no one else
- * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to either insert into them or use them for queries.
*
- * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
* visible; then start another. Note that all the data structures we just
* built are lost in the commit. The only data we keep past here are the
* relation IDs.
@@ -1611,7 +1633,7 @@ DefineIndex(Oid tableId,
* cannot block, even if someone else is waiting for access, because we
* already have the same lock within our transaction.
*
- * Note: we don't currently bother with a session lock on the index,
+ * Note: we don't currently bother with a session lock on the indexes,
* because there are no operations that could change its state while we
* hold lock on the parent table. This might need to change later.
*/
@@ -1632,14 +1654,16 @@ DefineIndex(Oid tableId,
{
const int progress_cols[] = {
PROGRESS_CREATEIDX_INDEX_OID,
+ PROGRESS_CREATEIDX_AUX_INDEX_OID,
PROGRESS_CREATEIDX_PHASE
};
const int64 progress_vals[] = {
indexRelationId,
+ auxIndexRelationId,
PROGRESS_CREATEIDX_PHASE_WAIT_1
};
- pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
+ pgstat_progress_update_multi_param(3, progress_cols, progress_vals);
}
/*
@@ -1650,7 +1674,7 @@ DefineIndex(Oid tableId,
* with the old list of indexes. Use ShareLock to consider running
* transactions that hold locks that permit writing to the table. Note we
* do not need to worry about xacts that open the table for writing after
- * this point; they will see the new index when they open it.
+ * this point; they will see the new indexes when they open it.
*
* Note: the reason we use actual lock acquisition here, rather than just
* checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1662,15 +1686,39 @@ DefineIndex(Oid tableId,
/*
* At this moment we are sure that there are no transactions with the
- * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
* indexes. We have waited out all the existing transactions and any new
- * transaction will have the new index in its list, but the index is still
- * marked as "not-ready-for-inserts". The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts". The indexes are consulted while
* deciding HOT-safety though. This arrangement ensures that no new HOT
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
- * We build the index using all tuples that are visible using multiple
+	 * Now build the auxiliary index. It is created empty, without any actual
+	 * heap scan, but marked as "ready-for-inserts". The goal of that index is
+	 * to accumulate new tuples while the main index is being built.
+ */
+ index_concurrently_build(tableId, auxIndexRelationId, true);
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /* Tell concurrent index builds to ignore us, if index qualifies */
+ if (safe_index)
+ set_indexsafe_procflags();
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_2);
+ /*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+ */
+ WaitForLockers(heaplocktag, ShareLock, true);
+
+ /*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+ *
+ * We build that index using all tuples that are visible using multiple
* refreshing snapshots. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
@@ -1679,7 +1727,7 @@ DefineIndex(Oid tableId,
*/
/* Perform concurrent build of index */
- index_concurrently_build(tableId, indexRelationId);
+ index_concurrently_build(tableId, indexRelationId, false);
/*
* Commit this transaction to make the indisready update visible.
@@ -1698,43 +1746,28 @@ DefineIndex(Oid tableId,
* the index marked as read-only for updates.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_2);
+ PROGRESS_CREATEIDX_PHASE_WAIT_3);
WaitForLockers(heaplocktag, ShareLock, true);
/*
- * Now take the "reference snapshot" that will be used by validate_index()
- * to filter candidate tuples. Beware! There might still be snapshots in
- * use that treat some transaction as in-progress that our reference
- * snapshot treats as committed. If such a recently-committed transaction
- * deleted tuples in the table, we will not include them in the index; yet
- * those transactions which see the deleting one as still-in-progress will
- * expect such tuples to be there once we mark the index as valid.
- *
- * We solve this by waiting for all endangered transactions to exit before
- * we mark the index as valid.
- *
- * We also set ActiveSnapshot to this snap, since functions in indexes may
- * need a snapshot.
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
*/
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
-
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
- * Scan the index and the heap, insert any missing index entries.
+ * Now target index is marked as "ready" for all transaction. So, auxiliary
+ * index is not more needed. So, start removing process by reverting "ready"
+ * flag.
*/
- validate_index(tableId, indexRelationId, snapshot);
-
- /*
- * Drop the reference snapshot. We must do this before waiting out other
- * snapshot holders, else we will deadlock against other processes also
- * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
- * they must wait for. But first, save the snapshot's xmin to use as
- * limitXmin for GetCurrentVirtualXIDs().
- */
- limitXmin = snapshot->xmin;
-
+ index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
PopActiveSnapshot();
- UnregisterSnapshot(snapshot);
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+	 * Merge the contents of the auxiliary and target indexes, inserting any missing index entries.
+ */
+ limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
/*
* The snapshot subsystem could still contain registered snapshots that
@@ -1747,6 +1780,49 @@ DefineIndex(Oid tableId,
CommitTransactionCommand();
StartTransactionCommand();
+ /* Tell concurrent index builds to ignore us, if index qualifies */
+ if (safe_index)
+ set_indexsafe_procflags();
+
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+ index_concurrently_set_dead(tableId, auxIndexRelationId);
+ PopActiveSnapshot();
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+ * Because we don't take a snapshot in this transaction, there's no need
+ * to set the PROC_IN_SAFE_IC flag here.
+ */
+
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_4);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+ WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /*
+ * Drop auxiliary index.
+ *
+ * Because we don't take a snapshot in this transaction, there's no need
+ * to set the PROC_IN_SAFE_IC flag here.
+ *
+ * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+ * right lock level.
+ */
+ performDeletion(&auxAddress, DROP_RESTRICT,
+ PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
/* Tell concurrent index builds to ignore us, if index qualifies */
if (safe_index)
set_indexsafe_procflags();
@@ -1757,12 +1833,12 @@ DefineIndex(Oid tableId,
/*
* The index is now valid in the sense that it contains all currently
* interesting tuples. But since it might not contain tuples deleted just
- * before the reference snap was taken, we have to wait out any
- * transactions that might have older snapshots.
+ * before the last snapshot during validating was taken, we have to wait
+ * out any transactions that might have older snapshots.
*/
INJECTION_POINT("define_index_before_set_valid");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_3);
+ PROGRESS_CREATEIDX_PHASE_WAIT_5);
WaitForOlderSnapshots(limitXmin, true);
/*
@@ -3542,6 +3618,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
typedef struct ReindexIndexInfo
{
Oid indexId;
+ Oid auxIndexId;
Oid tableId;
Oid amId;
bool safe; /* for set_indexsafe_procflags */
@@ -3563,9 +3640,10 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
PROGRESS_CREATEIDX_COMMAND,
PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_INDEX_OID,
+ PROGRESS_CREATEIDX_AUX_INDEX_OID,
PROGRESS_CREATEIDX_ACCESS_METHOD_OID
};
- int64 progress_vals[4];
+ int64 progress_vals[5];
/*
* Create a memory context that will survive forced transaction commits we
@@ -3865,15 +3943,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
foreach(lc, indexIds)
{
char *concurrentName;
+ char *auxConcurrentName;
ReindexIndexInfo *idx = lfirst(lc);
ReindexIndexInfo *newidx;
Oid newIndexId;
+ Oid auxIndexId;
Relation indexRel;
Relation heapRel;
Oid save_userid;
int save_sec_context;
int save_nestlevel;
Relation newIndexRel;
+ Relation auxIndexRel;
LockRelId *lockrelid;
Oid tablespaceid;
@@ -3915,8 +3996,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
progress_vals[1] = 0; /* initializing */
progress_vals[2] = idx->indexId;
- progress_vals[3] = idx->amId;
- pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+ progress_vals[3] = InvalidOid;
+ progress_vals[4] = idx->amId;
+ pgstat_progress_update_multi_param(5, progress_index, progress_vals);
/* Choose a temporary relation name for the new index */
concurrentName = ChooseRelationName(get_rel_name(idx->indexId),
@@ -3924,6 +4006,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
"ccnew",
get_rel_namespace(indexRel->rd_index->indrelid),
false);
+ auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+ NULL,
+ "ccaux",
+ get_rel_namespace(indexRel->rd_index->indrelid),
+ false);
/* Choose the new tablespace, indexes of toast tables are not moved */
if (OidIsValid(params->tablespaceOid) &&
@@ -3937,12 +4024,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
idx->indexId,
tablespaceid,
concurrentName);
+ auxIndexId = index_concurrently_create_aux(heapRel,
+ idx->indexId,
+ tablespaceid,
+ auxConcurrentName);
/*
* Now open the relation of the new index, a session-level lock is
* also needed on it.
*/
newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+ auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
/*
* Save the list of OIDs and locks in private context
@@ -3951,6 +4043,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
newidx = palloc_object(ReindexIndexInfo);
newidx->indexId = newIndexId;
+ newidx->auxIndexId = auxIndexId;
newidx->safe = idx->safe;
newidx->tableId = idx->tableId;
newidx->amId = idx->amId;
@@ -3969,10 +4062,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
lockrelid = palloc_object(LockRelId);
*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
relationLocks = lappend(relationLocks, lockrelid);
+ lockrelid = palloc_object(LockRelId);
+ *lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+ relationLocks = lappend(relationLocks, lockrelid);
MemoryContextSwitchTo(oldcontext);
index_close(indexRel, NoLock);
+ index_close(auxIndexRel, NoLock);
index_close(newIndexRel, NoLock);
/* Roll back any GUC changes executed by index functions */
@@ -4053,13 +4150,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* doing that, wait until no running transactions could have the table of
* the index open with the old list of indexes. See "phase 2" in
* DefineIndex() for more details.
+ */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_1);
+ WaitForLockersMultiple(lockTags, ShareLock, true);
+ CommitTransactionCommand();
+
+ /*
+ * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+ */
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+
+ StartTransactionCommand();
+
+ /*
+ * Check for user-requested abort. This is inside a transaction so as
+ * xact.c does not issue a useless WARNING, and ensures that
+ * session-level locks are cleaned up on abort.
+ */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Tell concurrent indexing to ignore us, if index qualifies */
+ if (newidx->safe)
+ set_indexsafe_procflags();
+
+ /*
+ * Build the auxiliary index. This is fast since no actual heap scan is
+ * performed, resulting in an empty index.
+ */
+ index_concurrently_build(newidx->tableId, newidx->auxIndexId, true);
+
+ CommitTransactionCommand();
+ }
+
+ StartTransactionCommand();
+
+ /*
+ * Because we don't take a snapshot in this transaction, there's no need
+ * to set the PROC_IN_SAFE_IC flag here.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_1);
+ PROGRESS_CREATEIDX_PHASE_WAIT_2);
+ /*
+ * Wait until the auxiliary indexes are taken into account by all
+ * running transactions.
+ */
WaitForLockersMultiple(lockTags, ShareLock, true);
CommitTransactionCommand();
+ /* Now it is time to build the target indexes. */
foreach(lc, newIndexIds)
{
ReindexIndexInfo *newidx = lfirst(lc);
@@ -4086,11 +4225,12 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
progress_vals[1] = PROGRESS_CREATEIDX_PHASE_BUILD;
progress_vals[2] = newidx->indexId;
- progress_vals[3] = newidx->amId;
- pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+ progress_vals[3] = newidx->auxIndexId;
+ progress_vals[4] = newidx->amId;
+ pgstat_progress_update_multi_param(5, progress_index, progress_vals);
/* Perform concurrent build of new index */
- index_concurrently_build(newidx->tableId, newidx->indexId);
+ index_concurrently_build(newidx->tableId, newidx->indexId, false);
CommitTransactionCommand();
}
@@ -4102,24 +4242,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* need to set the PROC_IN_SAFE_IC flag here.
*/
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_3);
+ WaitForLockersMultiple(lockTags, ShareLock, true);
+ CommitTransactionCommand();
+
+ /*
+ * At this point all target indexes are marked as "ready for inserts", so
+ * we are free to start dropping the auxiliary indexes.
+ */
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+ StartTransactionCommand();
+ /*
+ * Check for user-requested abort. This is inside a transaction so as
+ * xact.c does not issue a useless WARNING, and ensures that
+ * session-level locks are cleaned up on abort.
+ */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Tell concurrent indexing to ignore us, if index qualifies */
+ if (newidx->safe)
+ set_indexsafe_procflags();
+
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
+ index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+ PopActiveSnapshot();
+
+ CommitTransactionCommand();
+ }
+
/*
* Phase 3 of REINDEX CONCURRENTLY
*
- * During this phase the old indexes catch up with any new tuples that
+ * During this phase the new indexes catch up with any new tuples that
* were created during the previous phase. See "phase 3" in DefineIndex()
* for more details.
*/
-
- pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_2);
- WaitForLockersMultiple(lockTags, ShareLock, true);
- CommitTransactionCommand();
-
foreach(lc, newIndexIds)
{
ReindexIndexInfo *newidx = lfirst(lc);
TransactionId limitXmin;
- Snapshot snapshot;
StartTransactionCommand();
@@ -4134,13 +4302,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
if (newidx->safe)
set_indexsafe_procflags();
- /*
- * Take the "reference snapshot" that will be used by validate_index()
- * to filter candidate tuples.
- */
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4149,19 +4310,12 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
progress_vals[1] = PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN;
progress_vals[2] = newidx->indexId;
- progress_vals[3] = newidx->amId;
- pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+ progress_vals[3] = newidx->auxIndexId;
+ progress_vals[4] = newidx->amId;
+ pgstat_progress_update_multi_param(5, progress_index, progress_vals);
- validate_index(newidx->tableId, newidx->indexId, snapshot);
-
- /*
- * We can now do away with our active snapshot, we still need to save
- * the xmin limit to wait for older snapshots.
- */
- limitXmin = snapshot->xmin;
-
- PopActiveSnapshot();
- UnregisterSnapshot(snapshot);
+ limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/*
* To ensure no deadlocks, we must commit and start yet another
@@ -4181,7 +4335,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* there's no need to set the PROC_IN_SAFE_IC flag here.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_3);
+ PROGRESS_CREATEIDX_PHASE_WAIT_4);
WaitForOlderSnapshots(limitXmin, true);
CommitTransactionCommand();
@@ -4271,14 +4425,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/*
* Phase 5 of REINDEX CONCURRENTLY
*
- * Mark the old indexes as dead. First we must wait until no running
- * transaction could be using the index for a query. See also
+ * Mark the old and auxiliary indexes as dead. First we must wait until no
+ * running transaction could be using the index for a query. See also
* index_drop() for more details.
*/
INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_4);
+ PROGRESS_CREATEIDX_PHASE_WAIT_5);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
foreach(lc, indexIds)
@@ -4303,6 +4457,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
PopActiveSnapshot();
}
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+
+ /*
+ * Check for user-requested abort. This is inside a transaction so as
+ * xact.c does not issue a useless WARNING, and ensures that
+ * session-level locks are cleaned up on abort.
+ */
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
+
+ index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+ PopActiveSnapshot();
+ }
+
/* Commit this transaction to make the updates visible. */
CommitTransactionCommand();
StartTransactionCommand();
@@ -4316,11 +4492,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/*
* Phase 6 of REINDEX CONCURRENTLY
*
- * Drop the old indexes.
+ * Drop the old and auxiliary indexes.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_5);
+ PROGRESS_CREATEIDX_PHASE_WAIT_6);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
PushActiveSnapshot(GetTransactionSnapshot());
@@ -4340,6 +4516,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
add_exact_object_address(&object, objects);
}
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *idx = lfirst(lc);
+ ObjectAddress object;
+
+ object.classId = RelationRelationId;
+ object.objectId = idx->auxIndexId;
+ object.objectSubId = 0;
+
+ add_exact_object_address(&object, objects);
+ }
+
/*
* Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
* right lock level.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index ec3769585c3..d881241f837 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,11 +714,11 @@ typedef struct TableAmRoutine
TableScanDesc scan);
/* see table_index_validate_scan for reference about parameters */
- void (*index_validate_scan) (Relation table_rel,
- Relation index_rel,
- struct IndexInfo *index_info,
- Snapshot snapshot,
- struct ValidateIndexState *state);
+ TransactionId (*index_validate_scan) (Relation table_rel,
+ Relation index_rel,
+ struct IndexInfo *index_info,
+ struct ValidateIndexState *state,
+ struct ValidateIndexState *aux_state);
/* ------------------------------------------------------------------------
@@ -1866,22 +1866,22 @@ table_index_build_range_scan(Relation table_rel,
}
/*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
*
* See validate_index() for an explanation.
*/
-static inline void
+static inline TransactionId
table_index_validate_scan(Relation table_rel,
Relation index_rel,
struct IndexInfo *index_info,
- Snapshot snapshot,
- struct ValidateIndexState *state)
+ struct ValidateIndexState *state,
+ struct ValidateIndexState *auxstate)
{
- table_rel->rd_tableam->index_validate_scan(table_rel,
- index_rel,
- index_info,
- snapshot,
- state);
+ return table_rel->rd_tableam->index_validate_scan(table_rel,
+ index_rel,
+ index_info,
+ state,
+ auxstate);
}
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c3..82d0d6b46d3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
{
INDEX_CREATE_SET_READY,
INDEX_CREATE_SET_VALID,
+ INDEX_DROP_CLEAR_READY,
INDEX_DROP_CLEAR_VALID,
INDEX_DROP_SET_DEAD,
} IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
#define INDEX_CREATE_IF_NOT_EXISTS (1 << 4)
#define INDEX_CREATE_PARTITIONED (1 << 5)
#define INDEX_CREATE_INVALID (1 << 6)
+#define INDEX_CREATE_AUXILIARY (1 << 7)
extern Oid index_create(Relation heapRelation,
const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid index_create(Relation heapRelation,
bits16 constr_flags,
bool allow_system_table_mods,
bool is_internal,
- Oid *constraintId);
+ Oid *constraintId,
+ char relpersistence);
#define INDEX_CONSTR_CREATE_MARK_AS_PRIMARY (1 << 0)
#define INDEX_CONSTR_CREATE_DEFERRABLE (1 << 1)
@@ -100,8 +103,14 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern Oid index_concurrently_create_aux(Relation heapRelation,
+ Oid mainIndexId,
+ Oid tablespaceOid,
+ const char *newName);
+
extern void index_concurrently_build(Oid heapRelationId,
- Oid indexRelationId);
+ Oid indexRelationId,
+ bool auxiliary);
extern void index_concurrently_swap(Oid newIndexId,
Oid oldIndexId,
@@ -145,7 +154,7 @@ extern void index_build(Relation heapRelation,
bool isreindex,
bool parallel);
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d645230..89f8d02fdc3 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -88,6 +88,7 @@
#define PROGRESS_CREATEIDX_TUPLES_DONE 12
#define PROGRESS_CREATEIDX_PARTITIONS_TOTAL 13
#define PROGRESS_CREATEIDX_PARTITIONS_DONE 14
+#define PROGRESS_CREATEIDX_AUX_INDEX_OID 15
/* 15 and 16 reserved for "block number" metrics */
/* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
@@ -96,10 +97,11 @@
#define PROGRESS_CREATEIDX_PHASE_WAIT_2 3
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN 4
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT 5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN 6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE 6
#define PROGRESS_CREATEIDX_PHASE_WAIT_3 7
#define PROGRESS_CREATEIDX_PHASE_WAIT_4 8
#define PROGRESS_CREATEIDX_PHASE_WAIT_5 9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6 10
/*
* Subphases of CREATE INDEX, for index_build.
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
(1 row)
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
NOTICE: notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
NOTICE: drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 1904eb65bb9..7e008b1cbd9 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL: Key (f1)=(b) already exists.
CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
ERROR: could not create unique index "concur_index3"
DETAIL: Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
-- test that expression indexes and partial indexes work concurrently
CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3015,6 +3016,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
ERROR: could not create unique index "concur_reindex_ind5"
DETAIL: Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
-- Reindexing concurrently this index fails with the same failure.
-- The extra index created is itself invalid, and can be dropped.
REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3027,8 +3029,10 @@ DETAIL: Key (c1)=(1) is duplicated.
c1 | integer | | |
Indexes:
"concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+ "concur_reindex_ind5_ccaux" stir (c1) INVALID
"concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
+DROP INDEX concur_reindex_ind5_ccaux;
DROP INDEX concur_reindex_ind5_ccnew;
-- This makes the previous failure go away, so the index can become valid.
DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
--------------------------------+------------+-----------------------+-------------------------------
parted_isvalid_idx | f | parted_isvalid_tab |
parted_isvalid_idx_11 | f | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux | f | parted_isvalid_tab_11 |
parted_isvalid_tab_12_expr_idx | t | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
parted_isvalid_tab_1_expr_idx | f | parted_isvalid_tab_1 | parted_isvalid_idx
parted_isvalid_tab_2_expr_idx | t | parted_isvalid_tab_2 | parted_isvalid_idx
-(5 rows)
+(6 rows)
drop table parted_isvalid_tab;
-- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c085e05f052..c44e460b0d3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
INSERT INTO concur_heap VALUES ('b','x');
-- check if constraint is enforced properly at build time
CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
-- test that expression indexes and partial indexes work concurrently
CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1239,10 +1240,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
-- This trick creates an invalid index.
CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
-- Reindexing concurrently this index fails with the same failure.
-- The extra index created is itself invalid, and can be dropped.
REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
\d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
DROP INDEX concur_reindex_ind5_ccnew;
-- This makes the previous failure go away, so the index can become valid.
DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
--
2.43.0
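As the regression-test changes above show, a failed concurrent unique build now leaves an invalid auxiliary index ("ccaux" suffix) next to the usual invalid leftover, and both need to be dropped by hand. A minimal sketch of that cleanup (table and index names invented for illustration):

CREATE TABLE demo (c1 int);
INSERT INTO demo VALUES (1), (1);
CREATE UNIQUE INDEX CONCURRENTLY demo_idx ON demo (c1);
-- ERROR:  could not create unique index "demo_idx"
\d demo                     -- lists demo_idx and demo_idx_ccaux, both INVALID
DROP INDEX demo_idx_ccaux;
DROP INDEX demo_idx;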
Hello!
Rebased + snapshot resetting during validation + removed PROC_IN_SAFE_IC.
Going to do some benchmarks soon.
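In the meantime, one rough way to eyeball the effect from another session (stock pg_stat_activity columns, nothing patch-specific): with the periodic snapshot resets, the building backend's backend_xmin should keep advancing during a long scan instead of staying pinned.

SELECT pid, backend_xmin, state
FROM pg_stat_activity
WHERE query ILIKE '%INDEX%CONCURRENTLY%';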
Best regards,
Mikhail.
Attachments:
v9-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch
From 86d498d18c232a62c4da4e5849258c1ab09f69b3 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v9 5/9] Allow snapshot resets in concurrent unique index
builds
Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.
Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:
1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values
This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
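The user-visible contract is unchanged: a genuine pair of live duplicates must still abort the build, whichever of the two checks above trips first. A trivial sanity sketch (names invented):

CREATE TABLE t (k int);
INSERT INTO t VALUES (1), (1);
CREATE UNIQUE INDEX CONCURRENTLY t_uk ON t (k);
-- ERROR:  could not create unique index "t_uk"
-- DETAIL:  Key (k)=(1) is duplicated.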
---
src/backend/access/heap/README.HOT | 12 +-
src/backend/access/heap/heapam_handler.c | 6 +-
src/backend/access/nbtree/nbtdedup.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 173 ++++++++++++++----
src/backend/access/nbtree/nbtsplitloc.c | 12 +-
src/backend/access/nbtree/nbtutils.c | 29 ++-
src/backend/catalog/index.c | 8 +-
src/backend/commands/indexcmds.c | 4 +-
src/backend/utils/sort/tuplesortvariants.c | 67 +++++--
src/include/access/nbtree.h | 4 +-
src/include/access/tableam.h | 5 +-
src/include/utils/tuplesort.h | 1 +
.../expected/cic_reset_snapshots.out | 6 +
13 files changed, 251 insertions(+), 84 deletions(-)
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
index.
After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot. Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains. (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.) As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains. (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
We mark the index open for inserts (but still not ready for reads) then
we again wait for transactions which have the table open. Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8144743c338..0f706553605 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, or during bootstrap, we take a regular MVCC snapshot
* and index whatever's live according to that while that snapshot is reset
- * every so often (in case of non-unique index).
+ * every so often.
*/
OldestXmin = InvalidTransactionId;
/*
- * For unique index we need consistent snapshot for the whole scan.
+ * For concurrent builds of non-system indexes, we may want to periodically
+ * reset snapshots to allow vacuum to clean up tuples.
*/
reset_snapshots = indexInfo->ii_Concurrent &&
- !indexInfo->ii_Unique &&
!is_system_catalog; /* just for the case */
/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
_bt_dedup_start_pending(state, itup, offnum);
}
else if (state->deduplicate &&
- _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
/* itup starts first pending interval */
_bt_dedup_start_pending(state, itup, offnum);
}
- else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+ else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
_bt_dedup_save_htid(state, itup))
{
/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
itemid = PageGetItemId(page, minoff);
itup = (IndexTuple) PageGetItem(page, itemid);
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
{
itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
itup = (IndexTuple) PageGetItem(page, itemid);
- if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+ if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
return true;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 783489600fc..38355601421 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
Relation index;
bool isunique;
bool nulls_not_distinct;
+ bool unique_dead_ignored;
} BTSpool;
/*
@@ -101,6 +102,7 @@ typedef struct BTShared
Oid indexrelid;
bool isunique;
bool nulls_not_distinct;
+ bool unique_dead_ignored;
bool isconcurrent;
int scantuplesortstates;
@@ -203,15 +205,13 @@ typedef struct BTLeader
*/
typedef struct BTBuildState
{
- bool isunique;
- bool nulls_not_distinct;
bool havedead;
Relation heap;
BTSpool *spool;
/*
- * spool2 is needed only when the index is a unique index. Dead tuples are
- * put into spool2 instead of spool in order to avoid uniqueness check.
+ * spool2 is needed only when the index is a unique index built
+ * non-concurrently. Dead tuples are put into spool2 instead of spool in
+ * order to avoid the uniqueness check.
*/
BTSpool *spool2;
double indtuples;
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
ResetUsage();
#endif /* BTREE_BUILD_STATS */
- buildstate.isunique = indexInfo->ii_Unique;
- buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
buildstate.havedead = false;
buildstate.heap = heap;
buildstate.spool = NULL;
@@ -379,6 +377,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
btspool->index = index;
btspool->isunique = indexInfo->ii_Unique;
btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+ /*
+ * We need to ignore dead tuples during uniqueness checks in the case of a
+ * concurrent build. This is required because of the periodic snapshot
+ * resets.
+ */
+ btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
/* Save as primary spool */
buildstate->spool = btspool;
@@ -427,8 +430,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* the use of parallelism or any other factor.
*/
buildstate->spool->sortstate =
- tuplesort_begin_index_btree(heap, index, buildstate->isunique,
- buildstate->nulls_not_distinct,
+ tuplesort_begin_index_btree(heap, index, btspool->isunique,
+ btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
maintenance_work_mem, coordinate,
TUPLESORT_NONE);
@@ -436,8 +440,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* If building a unique index, put dead tuples in a second spool to keep
* them out of the uniqueness check. We expect that the second spool (for
* dead tuples) won't get very full, so we give it only work_mem.
+ *
+ * In the case of a concurrent build, dead tuples do not need to be put
+ * into the index, since we wait for all snapshots older than the reference
+ * snapshot during the validation phase.
*/
- if (indexInfo->ii_Unique)
+ if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
{
BTSpool *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
SortCoordinate coordinate2 = NULL;
@@ -468,7 +476,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* full, so we give it only work_mem
*/
buildstate->spool2->sortstate =
- tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+ tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
coordinate2, TUPLESORT_NONE);
}
@@ -1147,13 +1155,116 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
SortSupport sortKeys;
int64 tuples_done = 0;
bool deduplicate;
+ bool fail_on_alive_duplicate;
wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
BTGetDeduplicateItems(wstate->index);
+ /*
+ * unique_dead_ignored does not guarantee the absence of multiple alive
+ * tuples with the same values in the spool. That can happen when alive
+ * tuples are separated by dead ones, like this: addda.
+ */
+ fail_on_alive_duplicate = btspool->unique_dead_ignored;
- if (merge)
+ if (fail_on_alive_duplicate)
+ {
+ bool seen_alive = false,
+ prev_tested = false;
+ IndexTuple prev = NULL;
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+ &TTSOpsBufferHeapTuple);
+ IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+ Assert(btspool->isunique);
+ Assert(!btspool2);
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+ {
+ bool tuples_equal = false;
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ if (prev != NULL) /* not the first tuple */
+ {
+ bool has_nulls = false,
+ call_again, /* just to pass something */
+ ignored, /* just to pass something */
+ now_alive;
+ ItemPointerData tid;
+
+ /* is this tuple equal to the previous one? */
+ if (wstate->inskey->allequalimage)
+ tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+ else
+ tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+ /* handle null values correctly */
+ if (has_nulls && !btspool->nulls_not_distinct)
+ tuples_equal = false;
+
+ if (tuples_equal)
+ {
+ /* check the previous tuple's liveness if not already done */
+ if (!prev_tested)
+ {
+ call_again = false;
+ tid = prev->t_tid;
+ seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+ prev_tested = true;
+ }
+
+ call_again = false;
+ tid = itup->t_tid;
+ now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+ /* multiple alive tuples detected in the equal group? */
+ if (seen_alive && now_alive)
+ {
+ char *key_desc;
+ TupleDesc tupDes = RelationGetDescr(wstate->index);
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values[INDEX_MAX_KEYS];
+
+ index_deform_tuple(itup, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+ /* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(wstate->index)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(wstate->heap,
+ RelationGetRelationName(wstate->index))));
+ }
+ seen_alive |= now_alive;
+ }
+ }
+
+ if (!tuples_equal)
+ {
+ seen_alive = false;
+ prev_tested = false;
+ }
+
+ _bt_buildadd(wstate, state, itup, 0);
+ if (prev) pfree(prev);
+ prev = CopyIndexTuple(itup);
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
+ else if (merge)
{
/*
* Another BTSpool for dead tuples exists. Now we have to merge
@@ -1314,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
InvalidOffsetNumber);
}
else if (_bt_keep_natts_fast(wstate->index, dstate->base,
- itup) > keysz &&
+ itup, NULL) > keysz &&
_bt_dedup_save_htid(dstate, itup))
{
/*
@@ -1411,7 +1522,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
- bool reset_snapshot;
bool wait_for_snapshot_attach;
int querylen;
@@ -1430,21 +1540,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
scantuplesortstates = leaderparticipates ? request + 1 : request;
- /*
- * For concurrent non-unique index builds, we can periodically reset snapshots
- * to allow the xmin horizon to advance. This is safe since these builds don't
- * require a consistent view across the entire scan. Unique indexes still need
- * a stable snapshot to properly enforce uniqueness constraints.
- */
- reset_snapshot = isconcurrent && !btspool->isunique;
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that, while that snapshot may be reset periodically in
- * case of non-unique index.
+ * live according to that, while that snapshot may be reset periodically.
*/
if (!isconcurrent)
{
@@ -1452,16 +1553,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
- else if (reset_snapshot)
+ else
{
+ /*
+ * For concurrent index builds, we can periodically reset snapshots to allow
+ * the xmin horizon to advance. This is safe since these builds don't
+ * require a consistent view across the entire scan.
+ */
snapshot = InvalidSnapshot;
PushActiveSnapshot(GetTransactionSnapshot());
}
- else
- {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
- }
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1531,6 +1632,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->indexrelid = RelationGetRelid(btspool->index);
btshared->isunique = btspool->isunique;
btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+ btshared->unique_dead_ignored = btspool->unique_dead_ignored;
btshared->isconcurrent = isconcurrent;
btshared->scantuplesortstates = scantuplesortstates;
btshared->queryid = pgstat_get_my_query_id();
@@ -1545,7 +1647,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
snapshot,
- reset_snapshot);
+ isconcurrent);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1626,7 +1728,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* In case when leader going to reset own active snapshot as well - we need to
* wait until all workers imported initial snapshot.
*/
- wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+ wait_for_snapshot_attach = isconcurrent && leaderparticipates;
if (wait_for_snapshot_attach)
WaitForParallelWorkersToAttach(pcxt, true);
@@ -1742,6 +1844,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
leaderworker->index = buildstate->spool->index;
leaderworker->isunique = buildstate->spool->isunique;
leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+ leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
/* Initialize second spool, if required */
if (!btleader->btshared->isunique)
@@ -1845,11 +1948,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
btspool->index = indexRel;
btspool->isunique = btshared->isunique;
btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+ btspool->unique_dead_ignored = btshared->unique_dead_ignored;
/* Look up shared state private to tuplesort.c */
sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
tuplesort_attach_shared(sharedsort, seg);
- if (!btshared->isunique)
+ if (!btshared->isunique || btshared->isconcurrent)
{
btspool2 = NULL;
sharedsort2 = NULL;
@@ -1928,6 +2032,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
btspool->index,
btspool->isunique,
btspool->nulls_not_distinct,
+ btspool->unique_dead_ignored,
sortmem, coordinate,
TUPLESORT_NONE);
@@ -1950,14 +2055,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
coordinate2->nParticipants = -1;
coordinate2->sharedsort = sharedsort2;
btspool2->sortstate =
- tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+ tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
Min(sortmem, work_mem), coordinate2,
false);
}
/* Fill in buildstate for _bt_build_callback() */
- buildstate.isunique = btshared->isunique;
- buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
buildstate.havedead = false;
buildstate.heap = btspool->heap;
buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
{
itemid = PageGetItemId(state->origpage, maxoff);
tup = (IndexTuple) PageGetItem(state->origpage, itemid);
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
- keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+ keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
if (keepnatts > 1 && keepnatts <= nkeyatts)
{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* avoid appending a heap TID in new high key, we're done. Finish split
* with default strategy and initial split interval.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
if (perfectpenalty <= indnkeyatts)
return perfectpenalty;
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
* If page is entirely full of duplicates, a single value strategy split
* will be performed.
*/
- perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+ perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
if (perfectpenalty <= indnkeyatts)
{
*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
itemid = PageGetItemId(state->origpage, P_HIKEY);
hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
- state->newitem);
+ state->newitem, NULL);
if (perfectpenalty <= indnkeyatts)
*strategy = SPLIT_SINGLE_VALUE;
else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
lastleft = _bt_split_lastleft(state, split);
firstright = _bt_split_firstright(state, split);
- return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+ return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
}
/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index a531d37908a..e729b4a4d7c 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
ScanDirection dir, bool *continuescan);
static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
int tupnatts, TupleDesc tupdesc);
-static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
- IndexTuple firstright, BTScanInsert itup_key);
/*
@@ -4676,7 +4674,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
/* Determine how many attributes must be kept in truncated tuple */
- keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+ keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
#ifdef DEBUG_NO_TRUNCATE
/* Force truncation to be ineffective for testing purposes */
@@ -4794,17 +4792,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
/*
* _bt_keep_natts - how many key attributes to keep when truncating.
*
+ * This is exported to be used as a comparison function during concurrent
+ * unique index builds in case _bt_keep_natts_fast is not suitable because
+ * the collation is not "allequalimage"/deduplication-safe.
+ *
* Caller provides two tuples that enclose a split point. Caller's insertion
* scankey is used to compare the tuples; the scankey's argument values are
* not considered here.
*
+ * *hasnulls is set to true if any key column of either tuple is null.
+ *
* This can return a number of attributes that is one greater than the
* number of key attributes for the index relation. This indicates that the
* caller must use a heap TID as a unique-ifier in new pivot tuple.
*/
-static int
+int
_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
- BTScanInsert itup_key)
+ BTScanInsert itup_key,
+ bool *hasnulls)
{
int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
TupleDesc itupdesc = RelationGetDescr(rel);
@@ -4830,6 +4835,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ if (hasnulls)
+ (*hasnulls) |= (isNull1 || isNull2);
if (isNull1 != isNull2)
break;
@@ -4849,7 +4856,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* expected in an allequalimage index.
*/
Assert(!itup_key->allequalimage ||
- keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+ keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
return keepnatts;
}
@@ -4860,7 +4867,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* This is exported so that a candidate split point can have its effect on
* suffix truncation inexpensively evaluated ahead of time when finding a
* split location. A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. It may also be used as a comparison function during a
+ * concurrent unique index build.
*
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4869,6 +4877,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* "equal image" columns, routine is guaranteed to give the same result as
* _bt_keep_natts would.
*
+ * *hasnulls is set to true if any key column of either tuple is null.
+ *
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts, even when the index uses
* an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4877,7 +4887,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* more balanced split point.
*/
int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ bool *hasnulls)
{
TupleDesc itupdesc = RelationGetDescr(rel);
int keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4894,6 +4905,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+ if (hasnulls)
+ *hasnulls |= (isNull1 | isNull2);
att = TupleDescCompactAttr(itupdesc, attnum - 1);
if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index fcb6e940ff2..73454accf61 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
/* Invalidate catalog snapshot just for assert */
InvalidateCatalogSnapshot();
- Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -3293,9 +3293,9 @@ IndexCheckExclusion(Relation heapRelation,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
*
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 6c1fce8ed25..a02729911fe 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,8 +1670,8 @@ DefineIndex(Oid tableId,
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
- * We build the index using all tuples that are visible using single or
- * multiple refreshing snapshots. We can be sure that any HOT updates to
+ * We build the index using all tuples that are visible under a series of
+ * periodically refreshed snapshots. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
* rolled back. Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..aa4fcaac9a0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -123,6 +123,7 @@ typedef struct
bool enforceUnique; /* complain if we find duplicate tuples */
bool uniqueNullsNotDistinct; /* unique constraint null treatment */
+ bool uniqueDeadIgnored; /* ignore dead tuples in unique check */
} TuplesortIndexBTreeArg;
/*
@@ -349,6 +350,7 @@ tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem,
SortCoordinate coordinate,
int sortopt)
@@ -391,6 +393,7 @@ tuplesort_begin_index_btree(Relation heapRel,
arg->index.indexRel = indexRel;
arg->enforceUnique = enforceUnique;
arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+ arg->uniqueDeadIgnored = uniqueDeadIgnored;
indexScanKey = _bt_mkscankey(indexRel, NULL);
@@ -1520,6 +1523,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
char *key_desc;
+ bool uniqueCheckFail = true;
/*
* Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1533,57 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
*/
Assert(tuple1 != tuple2);
- index_deform_tuple(tuple1, tupDes, values, isnull);
-
- key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNIQUE_VIOLATION),
- errmsg("could not create unique index \"%s\"",
- RelationGetRelationName(arg->index.indexRel)),
- key_desc ? errdetail("Key %s is duplicated.", key_desc) :
- errdetail("Duplicate keys exist."),
- errtableconstraint(arg->index.heapRel,
- RelationGetRelationName(arg->index.indexRel))));
+ /* This is a fail-fast check; see _bt_load for details. */
+ if (arg->uniqueDeadIgnored)
+ {
+ bool any_tuple_dead,
+ call_again = false,
+ ignored;
+
+ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+ &TTSOpsBufferHeapTuple);
+ ItemPointerData tid = tuple1->t_tid;
+
+ IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+ any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+ if (!any_tuple_dead)
+ {
+ call_again = false;
+ tid = tuple2->t_tid;
+ any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+ &ignored);
+ }
+
+ if (any_tuple_dead)
+ {
+ elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+ ItemPointerGetBlockNumber(&tuple1->t_tid),
+ ItemPointerGetOffsetNumber(&tuple1->t_tid),
+ ItemPointerGetBlockNumber(&tuple2->t_tid),
+ ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+ uniqueCheckFail = false;
+ }
+ ExecDropSingleTupleTableSlot(slot);
+ table_index_fetch_end(fetch);
+ }
+ if (uniqueCheckFail)
+ {
+ index_deform_tuple(tuple1, tupDes, values, isnull);
+
+ key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+ /* keep this error message in sync with the same in _bt_load */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNIQUE_VIOLATION),
+ errmsg("could not create unique index \"%s\"",
+ RelationGetRelationName(arg->index.indexRel)),
+ key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+ errdetail("Duplicate keys exist."),
+ errtableconstraint(arg->index.heapRel,
+ RelationGetRelationName(arg->index.indexRel))));
+ }
}
/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
extern char *btbuildphasename(int64 phasenum);
extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+ BTScanInsert itup_key, bool *hasnulls);
extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
- IndexTuple firstright);
+ IndexTuple firstright, bool *hasnulls);
extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 66e1ad83f1a..0ecc3147bbd 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1799,9 +1799,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
*
- * In case of non-unique concurrent index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In the case of a concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan. That leads to changing snapshots on the fly to allow the xmin
+ * horizon to propagate.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
bool uniqueNullsNotDistinct,
+ bool uniqueDeadIgnored,
int workMem, SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
----------------
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
(1 row)
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_parallelscan_initialize
--
2.43.0
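The comparator-level recheck above is easier to follow outside the diff. A
minimal sketch of the same idea, assuming ordinary backend context (the helper
name both_duplicates_alive is mine, not part of the patch): before reporting a
unique violation for two equal index tuples, refetch both heap tuples under
SnapshotSelf and skip the pair if either version is already dead.

#include "postgres.h"

#include "access/itup.h"
#include "access/tableam.h"
#include "executor/tuptable.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/*
 * Sketch only: return true when both heap tuples behind a pair of equal
 * index tuples are still visible to SnapshotSelf.  If either one is dead,
 * the duplicate can be ignored by the unique check.
 */
static bool
both_duplicates_alive(Relation heapRel, IndexTuple tuple1, IndexTuple tuple2)
{
	IndexFetchTableData *fetch = table_index_fetch_begin(heapRel);
	TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRel),
													&TTSOpsBufferHeapTuple);
	bool		alive = true;

	for (int i = 0; i < 2 && alive; i++)
	{
		/* work on a copy: the fetch may advance the TID along a HOT chain */
		ItemPointerData tid = (i == 0) ? tuple1->t_tid : tuple2->t_tid;
		bool		call_again = false;
		bool		all_dead;

		alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot,
										&call_again, &all_dead);
	}

	ExecDropSingleTupleTableSlot(slot);
	table_index_fetch_end(fetch);
	return alive;
}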
Attachment: v9-0002-Add-stress-tests-for-concurrent-index-operations.patch (application/octet-stream)
From 23c3c9f06ca446f1b2840c18e511a11c827cbc14 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v9 2/9] Add stress tests for concurrent index operations
Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations
These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
src/bin/pg_amcheck/meson.build | 1 +
src/bin/pg_amcheck/t/006_cic.pl | 144 ++++++++++++++++++++++++++++++++
2 files changed, 145 insertions(+)
create mode 100644 src/bin/pg_amcheck/t/006_cic.pl
diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..4a8f4fbc8b0 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
't/003_check.pl',
't/004_verify_heapam.pl',
't/005_opclass_damage.pl',
+ 't/006_cic.pl',
],
},
}
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..142e8fb845e
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,144 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+ if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+ c1 money default 0, c2 money default 0,
+ c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+ CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN true;
+ END; $$;
+));
+
+$node->safe_psql('postgres', q(
+ CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ RETURN MOD($1, 2) = 0;
+ END; $$;
+));
+
+# Run CIC/RIC in different options concurrently with upserts
+$node->pgbench(
+ '--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+ {
+ 'concurrent_ops' => q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set variant random(0, 5)
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ \if :variant = 0
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+ \elif :variant = 1
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+ \elif :variant = 2
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+ \elif :variant = 3
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+ \elif :variant = 4
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+ \elif :variant = 5
+ CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+ \endif
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ REINDEX INDEX CONCURRENTLY idx_2;
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ DROP INDEX CONCURRENTLY idx_2;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1000, 100000)
+ BEGIN;
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ SELECT setval('in_row_rebuild', 1);
+ COMMIT;
+ \endif
+ )
+ });
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+ '--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+ {
+ 'concurrent_ops_unique_idx' => q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ REINDEX INDEX CONCURRENTLY idx_2;
+ \sleep 10 ms
+ SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+ DROP INDEX CONCURRENTLY idx_2;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1, power(10, random(1, 5)))
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+ ON CONFLICT(i) DO UPDATE SET updated_at = now();
+ SELECT setval('in_row_rebuild', 1);
+ \endif
+ )
+ });
+
+$node->stop;
+done_testing();
\ No newline at end of file
--
2.43.0
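Before the two implementation patches below, here is the mechanic they share,
condensed from heap_fetch_next_buffer() and heap_reset_scan_snapshot() in
v9-0003 (assertions and injection points elided). Under SO_RESET_SNAPSHOT the
heap scan simply swaps its MVCC snapshot every SO_RESET_SNAPSHOT_EACH_N_PAGE
(64) pages:

	/* inside the scan loop, once per newly pinned block */
	if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
		scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0)
	{
		PopActiveSnapshot();			/* drop the scan's current snapshot */
		InvalidateCatalogSnapshot();	/* so our xmin no longer pins the horizon */
		PushActiveSnapshot(GetLatestSnapshot());
		scan->rs_base.rs_snapshot = GetActiveSnapshot();	/* continue under it */
	}

Because the scan's snapshot is the active, unregistered one, dropping it leaves
the backend with no snapshot at all for a moment, which is exactly what lets
MyProc->xmin go invalid and the horizon advance.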
Attachment: v9-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch (application/octet-stream)
From 43662a22363ddab775ec4373711be0cf39bcc1be Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v9 4/9] Allow snapshot resets during parallel concurrent index
builds
Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.
Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
proceeding with scan
- Add regression tests to verify behavior with various index types
The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.
This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
src/backend/access/brin/brin.c | 43 +++++++++-------
src/backend/access/heap/heapam_handler.c | 12 +++--
src/backend/access/nbtree/nbtsort.c | 38 ++++++++++++--
src/backend/access/table/tableam.c | 37 ++++++++++++--
src/backend/access/transam/parallel.c | 50 +++++++++++++++++--
src/backend/catalog/index.c | 2 +-
src/backend/executor/nodeSeqscan.c | 3 +-
src/backend/utils/time/snapmgr.c | 8 ---
src/include/access/parallel.h | 3 +-
src/include/access/relscan.h | 1 +
src/include/access/tableam.h | 9 ++--
.../expected/cic_reset_snapshots.out | 23 ++++++++-
.../sql/cic_reset_snapshots.sql | 7 ++-
13 files changed, 179 insertions(+), 57 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d80394766d5..f076cedcc2c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
*/
BrinShared *brinshared;
Sharedsort *sharedsort;
- Snapshot snapshot;
WalUsage *walusage;
BufferUsage *bufferusage;
} BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
bool isconcurrent, int request);
static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
static double _brin_parallel_heapscan(BrinBuildState *state);
static double _brin_parallel_merge(BrinBuildState *state);
static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
ParallelContext *pcxt;
int scantuplesortstates;
- Snapshot snapshot;
Size estbrinshared;
Size estsort;
BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
- * concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+ * concurrent build, we take a regular MVCC snapshot and push it as the
+ * active one. Later we index whatever's live according to that snapshot,
+ * which is periodically reset during the scan.
*/
if (!isconcurrent)
{
Assert(ActiveSnapshotSet());
- snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
else
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ Assert(!ActiveSnapshotSet());
PushActiveSnapshot(GetTransactionSnapshot());
}
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
*/
- estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+ estbrinshared = _brin_parallel_estimate_shared(heap);
shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
estsort = tuplesort_estimate_shared(scantuplesortstates);
shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
- UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
table_parallelscan_initialize(heap,
ParallelTableScanFromBrinShared(brinshared),
- snapshot);
+ isconcurrent ? InvalidSnapshot : SnapshotAny,
+ isconcurrent);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->nparticipanttuplesorts++;
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
- brinleader->snapshot = snapshot;
brinleader->walusage = walusage;
brinleader->bufferusage = bufferusage;
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* Save leader state now that it's clear build will be parallel */
buildstate->bs_leader = brinleader;
+ /*
+ * In case of a concurrent build, snapshots are going to be reset periodically.
+ * If the leader is going to reset its own active snapshot as well, we need
+ * to wait until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
- /* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(brinleader->snapshot))
- UnregisterSnapshot(brinleader->snapshot);
DestroyParallelContext(brinleader->pcxt);
ExitParallelMode();
}
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
/*
* Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
*/
static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
{
/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
return add_size(BUFFERALIGN(sizeof(BrinShared)),
- table_parallelscan_estimate(heap, snapshot));
+ table_parallelscan_estimate(heap, InvalidSnapshot));
}
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d9fce07e8ad..8144743c338 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, or during bootstrap, we take a regular MVCC snapshot
- * and index whatever's live according to that.
+ * and index whatever's live according to that, resetting the snapshot
+ * every so often (in the case of a non-unique index).
*/
OldestXmin = InvalidTransactionId;
/*
* For a unique index we need a consistent snapshot for the whole scan.
- * In case of a parallel scan, some additional infrastructure is required
- * to perform the scan with SO_RESET_SNAPSHOT, and that is not yet ready.
*/
reset_snapshots = indexInfo->ii_Concurrent &&
!indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
- PushActiveSnapshot(snapshot);
- need_pop_active_snapshot = true;
+ if (!reset_snapshots)
+ {
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
+ }
}
hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8647422ed05..783489600fc 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
BufferUsage *bufferusage;
bool leaderparticipates = true;
bool need_pop_active_snapshot = true;
+ bool reset_snapshot;
+ bool wait_for_snapshot_attach;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
scantuplesortstates = leaderparticipates ? request + 1 : request;
+ /*
+ * For concurrent non-unique index builds, we can periodically reset snapshots
+ * to allow the xmin horizon to advance. This is safe since these builds don't
+ * require a consistent view across the entire scan. Unique indexes still need
+ * a stable snapshot to properly enforce uniqueness constraints.
+ */
+ reset_snapshot = isconcurrent && !btspool->isunique;
+
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
* concurrent build, we take a regular MVCC snapshot and index whatever's
- * live according to that.
+ * live according to that, though the snapshot may be reset periodically in
+ * the case of a non-unique index.
*/
if (!isconcurrent)
{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
snapshot = SnapshotAny;
need_pop_active_snapshot = false;
}
+ else if (reset_snapshot)
+ {
+ snapshot = InvalidSnapshot;
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
else
{
snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
{
if (need_pop_active_snapshot)
PopActiveSnapshot();
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->brokenhotchain = false;
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
- snapshot);
+ snapshot,
+ reset_snapshot);
/*
* Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* Save leader state now that it's clear build will be parallel */
buildstate->btleader = btleader;
+ /*
+ * In case of a concurrent build, snapshots are going to be reset periodically.
+ * If the leader is going to reset its own active snapshot as well, we need
+ * to wait until all workers have imported the initial snapshot.
+ */
+ wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+ if (wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, true);
+
/* Join heap scan ourselves */
if (leaderparticipates)
_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* Caller needs to wait for all launched workers when we return. Make
* sure that the failure-to-start case will not hang forever.
*/
- WaitForParallelWorkersToAttach(pcxt);
+ if (!wait_for_snapshot_attach)
+ WaitForParallelWorkersToAttach(pcxt, false);
if (need_pop_active_snapshot)
PopActiveSnapshot();
}
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
- if (IsMVCCSnapshot(btleader->snapshot))
+ if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
UnregisterSnapshot(btleader->snapshot);
DestroyParallelContext(btleader->pcxt);
ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
{
Size sz = 0;
- if (IsMVCCSnapshot(snapshot))
+ if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
sz = add_size(sz, EstimateSnapshotSpace(snapshot));
else
- Assert(snapshot == SnapshotAny);
+ Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
void
table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
- Snapshot snapshot)
+ Snapshot snapshot, bool reset_snapshot)
{
Size snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
pscan->phs_snapshot_off = snapshot_off;
- if (IsMVCCSnapshot(snapshot))
+ /*
+ * Initialize parallel scan description. For normal scans with a regular
+ * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+ * snapshot resets, mark the scan accordingly.
+ */
+ if (reset_snapshot)
+ {
+ Assert(snapshot == InvalidSnapshot);
+ pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = true;
+ INJECTION_POINT("table_parallelscan_initialize");
+ }
+ else if (IsMVCCSnapshot(snapshot))
{
SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
pscan->phs_snapshot_any = false;
+ pscan->phs_reset_snapshot = false;
}
else
{
Assert(snapshot == SnapshotAny);
+ Assert(!reset_snapshot);
pscan->phs_snapshot_any = true;
+ pscan->phs_reset_snapshot = false;
}
}
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
- if (!pscan->phs_snapshot_any)
+ /*
+ * For scans that use periodic snapshot resets, mark the scan accordingly
+ * and use the currently active snapshot as the initial state of the
+ * scan.
+ */
+ if (pscan->phs_reset_snapshot)
+ {
+ Assert(ActiveSnapshotSet());
+ flags |= SO_RESET_SNAPSHOT;
+ /* Start with current active snapshot. */
+ snapshot = GetActiveSnapshot();
+ }
+ else if (!pscan->phs_snapshot_any)
{
/* Snapshot was serialized -- restore it */
snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
#define PARALLEL_KEY_RELMAPPER_STATE UINT64CONST(0xFFFFFFFFFFFF000D)
#define PARALLEL_KEY_UNCOMMITTEDENUMS UINT64CONST(0xFFFFFFFFFFFF000E)
#define PARALLEL_KEY_CLIENTCONNINFO UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED UINT64CONST(0xFFFFFFFFFFFF0010)
/* Fixed-size parallel state. */
typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+ pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate how much we'll need for the entrypoint info. */
shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
char *entrypointstate;
char *uncommittedenumsspace;
char *clientconninfospace;
+ bool *snapshot_set_flag_space;
Size lnamelen;
/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
strcpy(entrypointstate, pcxt->library_name);
strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+ /*
+ * Establish dynamic shared memory to pass information about importing
+ * of snapshot.
+ */
+ snapshot_set_flag_space =
+ shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+ *pcxt->worker[i].snapshot_restored = false;
+ }
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
}
/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
}
}
+
+ /* Set snapshot restored flag to false. */
+ if (pcxt->nworkers > 0)
+ {
+ bool *snapshot_restored_space;
+ int i;
+ snapshot_restored_space =
+ shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ for (i = 0; i < pcxt->nworkers; ++i)
+ snapshot_restored_space[i] = false;
+ }
}
/*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* Wait for all workers to attach to their error queues, and throw an error if
* any worker fails to do this.
*
+ * If wait_for_snapshot is true, additionally wait until each parallel worker
+ * has restored its snapshot. This is needed with periodic snapshot resets to
+ * ensure all workers hold a valid initial snapshot before the scan proceeds.
+ *
* Callers can assume that if this function returns successfully, then the
* number of workers given by pcxt->nworkers_launched have initialized and
* attached to their error queues. Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
* call this function at all.
*/
void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
{
int i;
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
if (shm_mq_get_sender(mq) != NULL)
{
- /* Yes, so it is known to be attached. */
- pcxt->known_attached_workers[i] = true;
- ++pcxt->nknown_attached_workers;
+ if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+ {
+ /* Yes, so it is known to be attached. */
+ pcxt->known_attached_workers[i] = true;
+ ++pcxt->nknown_attached_workers;
+ }
}
}
else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
shm_toc *toc;
FixedParallelState *fps;
char *error_queue_space;
+ bool *snapshot_restored_space;
shm_mq *mq;
shm_mq_handle *mqh;
char *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
fps->parallel_leader_pgproc);
PushActiveSnapshot(asnapshot);
+ /* Snapshot is restored; set the flag to let the leader know. */
+ snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+ snapshot_restored_space[ParallelWorkerNumber] = true;
+
/*
* We've changed which tuples we can see, and must therefore invalidate
* system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c5a900f1b29..fcb6e940ff2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
/* Invalidate catalog snapshot just for assert */
InvalidateCatalogSnapshot();
- Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+ Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
table_parallelscan_initialize(node->ss.ss_currentRelation,
pscan,
- estate->es_snapshot);
+ estate->es_snapshot,
+ false);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 101a02c5b60..153ac28db3e 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -283,14 +283,6 @@ GetTransactionSnapshot(void)
Snapshot
GetLatestSnapshot(void)
{
- /*
- * We might be able to relax this, but nothing that could otherwise work
- * needs it.
- */
- if (IsInParallelMode())
- elog(ERROR,
- "cannot update SecondarySnapshot during a parallel operation");
-
/*
* So far there are no cases requiring support for GetLatestSnapshot()
* during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
{
BackgroundWorkerHandle *bgwhandle;
shm_mq_handle *error_mqh;
+ bool *snapshot_restored;
} ParallelWorkerInfo;
typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelDSM(ParallelContext *pcxt);
extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
extern void DestroyParallelContext(ParallelContext *pcxt);
extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8ca8f789617..d801aca82a5 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
RelFileLocator phs_locator; /* physical relation to scan */
bool phs_syncscan; /* report location to syncscan logic? */
bool phs_snapshot_any; /* SnapshotAny, not phs_snapshot_data? */
+ bool phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
Size phs_snapshot_off; /* data for snapshot */
} ParallelTableScanDescData;
typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a328f3aea6b..66e1ad83f1a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1180,7 +1180,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
*/
extern void table_parallelscan_initialize(Relation rel,
ParallelTableScanDesc pscan,
- Snapshot snapshot);
+ Snapshot snapshot,
+ bool reset_snapshot);
/*
* Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1798,9 +1799,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
*
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is
+ * applied to the scan, changing snapshots on the fly so the xmin horizon
+ * can advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 5db54530f17..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
(1 row)
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,24 +78,35 @@ NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach
+-------------------------
+
+(1 row)
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
NOTICE: notice triggered for injection point table_parallelscan_initialize
@@ -97,7 +114,9 @@ REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
NOTICE: drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
SELECT injection_points_set_local();
SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
--
2.43.0
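The coordination that v9-0004 adds boils down to a published-flag barrier: each
worker sets a per-worker bool in the DSM segment once it has restored and
pushed its snapshot, and WaitForParallelWorkersToAttach(pcxt, true) makes the
leader poll those flags before it joins the scan and starts resetting its own
snapshot. A standalone pthreads analogue of the barrier, with nothing
PostgreSQL-specific assumed:

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NWORKERS 4

static atomic_bool snapshot_restored[NWORKERS];

static void *
worker_main(void *arg)
{
	int			id = (int) (intptr_t) arg;

	/* ... restore the serialized snapshot here ... */
	atomic_store(&snapshot_restored[id], true);	/* publish: copy taken */
	/* ... scan, resetting the local snapshot periodically ... */
	return NULL;
}

int
main(void)
{
	pthread_t	workers[NWORKERS];

	for (intptr_t i = 0; i < NWORKERS; i++)
		pthread_create(&workers[i], NULL, worker_main, (void *) i);

	/* leader: poll until every worker has published its flag */
	for (int i = 0; i < NWORKERS; i++)
		while (!atomic_load(&snapshot_restored[i]))
			usleep(1000);

	printf("all workers hold a snapshot; leader may reset its own\n");

	for (int i = 0; i < NWORKERS; i++)
		pthread_join(workers[i], NULL);
	return 0;
}

The ordering is the point: the leader's active snapshot is what keeps the
serialized copy in the DSM segment valid to import, so the leader must keep it
(and thus its xmin) in place until every worker holds its own.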
Attachment: v9-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch (application/octet-stream)
From 4ee802bb929b4d401a3c69b879275fde06591866 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v9 3/9] Allow advancing xmin during non-unique, non-parallel
concurrent index builds by periodically resetting snapshots
Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.
This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.
Currently, this technique is applied to:
- Non-parallel index builds: parallel index builds are not yet supported and will be addressed in a future commit.
- Non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
- Only during the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness.
To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.
Regression tests are added to verify the behavior.
---
contrib/amcheck/verify_nbtree.c | 3 +-
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/brin/brin.c | 14 +++
src/backend/access/heap/heapam.c | 46 ++++++++
src/backend/access/heap/heapam_handler.c | 57 ++++++++--
src/backend/access/index/genam.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 14 +++
src/backend/catalog/index.c | 30 ++++-
src/backend/commands/indexcmds.c | 14 +--
src/backend/optimizer/plan/planner.c | 9 ++
src/include/access/tableam.h | 28 ++++-
src/test/modules/injection_points/Makefile | 2 +-
.../expected/cic_reset_snapshots.out | 107 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../sql/cic_reset_snapshots.sql | 86 ++++++++++++++
15 files changed, 384 insertions(+), 31 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK? */
+ true, /* syncscan OK? */
+ false);
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
errmsg("only heap AM is supported")));
/* Disable syncscan because we assume we scan from block zero upwards */
- scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+ scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
hscan = (HeapScanDesc) scan;
InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 9af445cdcdd..d80394766d5 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(GetTransactionSnapshot());
+ }
/*
* Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_brin_end_parallel(brinleader, NULL);
return;
}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 329e727f80d..c2860ebbf32 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
#include "utils/datum.h"
#include "utils/inval.h"
#include "utils/spccache.h"
+#include "utils/injection_point.h"
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -568,6 +569,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure active snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ PopActiveSnapshot();
+
+ sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+ Assert(!HaveRegisteredOrActiveSnapshot());
+ InvalidateCatalogSnapshot();
+
+ /* Goal of snapshot reset is to allow horizon to advance. */
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+ /* In some cases it is still not possible due to xid assignment. */
+ if (!TransactionIdIsValid(MyProc->xid))
+ INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+ PushActiveSnapshot(GetLatestSnapshot());
+ sscan->rs_snapshot = GetActiveSnapshot();
+}
+
/*
* heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
*
@@ -609,7 +640,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
if (BufferIsValid(scan->rs_cbuf))
+ {
scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+ if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+ (scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+ heap_reset_scan_snapshot((TableScanDesc) scan);
+ }
}
/*
@@ -1236,6 +1273,15 @@ heap_endscan(TableScanDesc sscan)
if (scan->rs_parallelworkerdata != NULL)
pfree(scan->rs_parallelworkerdata);
+ if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+ {
+ Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+ /* Make sure no other snapshot was set as active. */
+ Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+ /* And make sure snapshot is not registered. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ }
+
if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
UnregisterSnapshot(scan->rs_base.rs_snapshot);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 53f572f384b..d9fce07e8ad 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
ExprContext *econtext;
Snapshot snapshot;
bool need_unregister_snapshot = false;
+ bool need_pop_active_snapshot = false;
+ bool reset_snapshots = false;
TransactionId OldestXmin;
BlockNumber previous_blkno = InvalidBlockNumber;
BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
- /* Set up execution state for predicate, if any. */
- predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
/*
* Prepare for scan of the base relation. In a normal index build, we use
* SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
*/
OldestXmin = InvalidTransactionId;
+ /*
+ * For a unique index we need a consistent snapshot for the whole scan.
+ * In case of a parallel scan, some additional infrastructure is required
+ * to perform the scan with SO_RESET_SNAPSHOT, and that is not yet ready.
+ */
+ reset_snapshots = indexInfo->ii_Concurrent &&
+ !indexInfo->ii_Unique &&
+ !is_system_catalog; /* just in case */
+
/* okay to ignore lazy VACUUMs here */
if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
{
/*
* Serial index build.
- *
- * Must begin our own heap scan in this case. We may also need to
- * register a snapshot whose lifetime is under our direct control.
*/
if (!TransactionIdIsValid(OldestXmin))
{
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- need_unregister_snapshot = true;
+ snapshot = GetTransactionSnapshot();
+ /*
+ * Must begin our own heap scan in this case. We may also need to
+ * register a snapshot whose lifetime is under our direct control.
+ * If the snapshot is reset during the scan, registration is not
+ * allowed, because the snapshot is going to be replaced every so
+ * often.
+ */
+ if (!reset_snapshots)
+ {
+ snapshot = RegisterSnapshot(snapshot);
+ need_unregister_snapshot = true;
+ }
+ Assert(!ActiveSnapshotSet());
+ PushActiveSnapshot(snapshot);
+ /* store a pointer to the active snapshot, because it may have been copied */
+ snapshot = GetActiveSnapshot();
+ need_pop_active_snapshot = true;
}
else
+ {
+ Assert(!indexInfo->ii_Concurrent);
snapshot = SnapshotAny;
+ }
scan = table_beginscan_strat(heapRelation, /* relation */
snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- allow_sync); /* syncscan OK? */
+ allow_sync, /* syncscan OK? */
+ reset_snapshots /* reset snapshots? */);
}
else
{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(!IsBootstrapProcessingMode());
Assert(allow_sync);
snapshot = scan->rs_snapshot;
+ PushActiveSnapshot(snapshot);
+ need_pop_active_snapshot = true;
}
hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
!TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);
+ Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+ /* Clear reference to snapshot since it may be changed by the scan itself. */
+ if (reset_snapshots)
+ snapshot = InvalidSnapshot;
/* Publish number of blocks to scan */
if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
table_endscan(scan);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
/* we can now forget our snapshot, if set and registered by us */
if (need_unregister_snapshot)
UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- false); /* syncscan not OK */
+ false, /* syncscan not OK */
+ false);
hscan = (HeapScanDesc) scan;
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4b4ebff6a17..a104ba9df74 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
*/
sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
nkeys, key,
- true, false);
+ true, false, false);
sysscan->iscan = NULL;
}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 28522c0ac1c..8647422ed05 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
WalUsage *walusage;
BufferUsage *bufferusage;
bool leaderparticipates = true;
+ bool need_pop_active_snapshot = true;
int querylen;
#ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* live according to that.
*/
if (!isconcurrent)
+ {
+ Assert(ActiveSnapshotSet());
snapshot = SnapshotAny;
+ need_pop_active_snapshot = false;
+ }
else
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ }
/*
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no DSM segment was available, back out (do serial build) */
if (pcxt->seg == NULL)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
if (IsMVCCSnapshot(snapshot))
UnregisterSnapshot(snapshot);
DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
{
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
_bt_end_parallel(btleader);
return;
}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* sure that the failure-to-start case will not hang forever.
*/
WaitForParallelWorkersToAttach(pcxt);
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6976249e9e9..c5a900f1b29 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+#include "storage/proc.h"
/* Potentially set by pg_upgrade_support functions */
Oid binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1491,8 +1492,8 @@ index_concurrently_build(Oid heapRelationId,
Relation indexRelation;
IndexInfo *indexInfo;
- /* This had better make sure that a snapshot is active */
- Assert(ActiveSnapshotSet());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ Assert(!TransactionIdIsValid(MyProc->xid));
/* Open and lock the parent heap relation */
heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1510,19 +1511,28 @@ index_concurrently_build(Oid heapRelationId,
indexRelation = index_open(indexRelationId, RowExclusiveLock);
+ /* BuildIndexInfo may require a snapshot for expressions and predicates */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* We have to re-build the IndexInfo struct, since it was lost in the
* commit of the transaction where this concurrent index was created at
* the catalog level.
*/
indexInfo = BuildIndexInfo(indexRelation);
+ /* Done with snapshot */
+ PopActiveSnapshot();
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/* Now build the index */
index_build(heapRel, indexRelation, indexInfo, false, true);
+ /* Invalidate catalog snapshot just for assert */
+ InvalidateCatalogSnapshot();
+ Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -1533,12 +1543,19 @@ index_concurrently_build(Oid heapRelationId,
table_close(heapRel, NoLock);
index_close(indexRelation, NoLock);
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
* insert new entries into the index for insertions and non-HOT updates.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+ /* we can do away with our snapshot */
+ PopActiveSnapshot();
}
/*
@@ -3206,7 +3223,8 @@ IndexCheckExclusion(Relation heapRelation,
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
- true); /* syncscan OK */
+ true, /* syncscan OK */
+ false);
while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
{
@@ -3269,12 +3287,16 @@ IndexCheckExclusion(Relation heapRelation,
* as of the start of the scan (see table_index_build_scan), whereas a normal
* build takes care to include recently-dead tuples. This is OK because
* we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone. The reason for doing that is to avoid
+ * to see those tuples are gone. One of the reasons for doing that is to avoid
* bogus unique-index failures due to concurrent UPDATEs (we might see
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
*
+ * Furthermore, in the case of a non-unique index we set SO_RESET_SNAPSHOT for
+ * the scan, which causes a fresh snapshot to be set as active every so often.
+ * The reason for that is to allow the xmin horizon to move forward.
+ *
* Next, we mark the index "indisready" (but still not "indisvalid") and
* commit the second transaction and start a third. Again we wait for all
* transactions that could have been modifying the table to terminate. Now
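To make the SO_RESET_SNAPSHOT behaviour described above concrete, here is a
minimal sketch of what the periodic reset amounts to, assuming a helper named
heap_reset_scan_snapshot as referenced in the comments; treat the body as an
illustration of the pop/invalidate/push sequence, not the exact patch code:

    static void
    heap_reset_scan_snapshot(TableScanDesc sscan)
    {
        HeapScanDesc scan = (HeapScanDesc) sscan;

        /* Swap the active snapshot for a fresh one to let xmin advance. */
        PopActiveSnapshot();
        InvalidateCatalogSnapshot();
        PushActiveSnapshot(GetLatestSnapshot());

        /* The scan then presumably continues under the new snapshot. */
        scan->rs_base.rs_snapshot = GetActiveSnapshot();
    }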
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
- * We now take a new snapshot, and build the index using all tuples that
- * are visible in this snapshot. We can be sure that any HOT updates to
+ * We build the index using all tuples that are visible to a single snapshot
+ * or to a series of refreshed snapshots. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
* rolled back. Thus, each visible tuple is either the end of its
* HOT-chain or the extension of the chain is HOT-safe for this index.
*/
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/* Perform concurrent build of index */
index_concurrently_build(tableId, indexRelationId);
- /* we can do away with our snapshot */
- PopActiveSnapshot();
-
/*
* Commit this transaction to make the indisready update visible.
*/
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
if (newidx->safe)
set_indexsafe_procflags();
- /* Set ActiveSnapshot since functions in the indexes may need it */
- PushActiveSnapshot(GetTransactionSnapshot());
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/* Perform concurrent build of new index */
index_concurrently_build(newidx->tableId, newidx->indexId);
- PopActiveSnapshot();
CommitTransactionCommand();
}
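With the two hunks above, index_concurrently_build now manages its own
snapshots, so the caller's side reduces to roughly the following (a sketch of
the new contract, mirroring the MyProc->xmin assertion inside the function,
not literal patch code):

    /* Caller must not hold an active snapshot, so xmin stays invalid. */
    Assert(!ActiveSnapshotSet());
    index_concurrently_build(tableId, indexRelationId);
    CommitTransactionCommand();     /* make the indisready update visible */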
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7468961b017..1ef6c7216f4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
/* GUC parameters */
double cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6778,6 +6779,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
Relation heap;
Relation index;
RelOptInfo *rel;
+ bool need_pop_active_snapshot = false;
int parallel_workers;
BlockNumber heap_blocks;
double reltuples;
@@ -6833,6 +6835,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
heap = table_open(tableOid, NoLock);
index = index_open(indexOid, NoLock);
+ /* Set ActiveSnapshot since functions in the indexes may need it */
+ if (!ActiveSnapshotSet()) {
+ PushActiveSnapshot(GetTransactionSnapshot());
+ need_pop_active_snapshot = true;
+ }
/*
* Determine if it's safe to proceed.
*
@@ -6890,6 +6897,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
parallel_workers--;
done:
+ if (need_pop_active_snapshot)
+ PopActiveSnapshot();
index_close(index, NoLock);
table_close(heap, NoLock);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index bb32de11ea0..a328f3aea6b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
+#include "utils/injection_point.h"
#define DEFAULT_TABLE_ACCESS_METHOD "heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
* needed. If table data may be needed, set SO_NEED_TUPLES.
*/
SO_NEED_TUPLES = 1 << 10,
+ /*
+ * Reset the scan and catalog snapshots every so often? If so, every
+ * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped,
+ * the catalog snapshot is invalidated, and the latest snapshot is
+ * pushed as active. The snapshot is not popped at the end of the scan.
+ *
+ * The goal of this mode is to keep the xmin horizon moving forward.
+ *
+ * See heap_reset_scan_snapshot for details.
+ */
+ SO_RESET_SNAPSHOT = 1 << 11,
} ScanOptions;
/*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
static inline TableScanDesc
table_beginscan_strat(Relation rel, Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
- bool allow_strat, bool allow_sync)
+ bool allow_strat, bool allow_sync,
+ bool reset_snapshot)
{
uint32 flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
flags |= SO_ALLOW_STRAT;
if (allow_sync)
flags |= SO_ALLOW_SYNC;
+ if (reset_snapshot)
+ {
+ INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+ /* An active snapshot is required at the start. */
+ Assert(GetActiveSnapshot() == snapshot);
+ /* The active snapshot must not be registered, so xmin can keep advancing. */
+ Assert(GetActiveSnapshot()->regd_count == 0);
+ flags |= (SO_RESET_SNAPSHOT);
+ }
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
@@ -1775,6 +1797,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* very hard to detect whether they're really incompatible with the chain tip.
* This only really makes sense for heap AM, it might need to be generalized
* for other AMs later.
+ *
+ * In the case of a non-unique index and a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. This swaps in fresh snapshots
+ * on the fly, allowing the xmin horizon to advance.
*/
static inline double
table_index_build_scan(Relation table_rel,
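For reviewers, here is how a caller is expected to use the new reset_snapshot
parameter, given the constraints the Asserts above enforce (the passed
snapshot must be the active one and must not be registered). heapRel is an
assumed already-opened relation, and this is a sketch rather than code from
the patch:

    TableScanDesc scan;

    PushActiveSnapshot(GetTransactionSnapshot());
    scan = table_beginscan_strat(heapRel, GetActiveSnapshot(),
                                 0, NULL,   /* no scan keys */
                                 true,      /* allow_strat */
                                 true,      /* allow_sync */
                                 true);     /* reset_snapshot */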
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
DATA = injection_points--1.0.sql
PGFILEDESC = "injection_points - facility for injection points"
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..5db54530f17
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,107 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local
+----------------------------
+
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE: notice triggered for injection point table_parallelscan_initialize
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE: notice triggered for injection point table_parallelscan_initialize
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
'sql': [
'injection_points',
'reindex_conc',
+ 'cic_reset_snapshots',
],
'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
# The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
--
2.43.0
Attachment: v9-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch
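The new isolation suites below hook into the module's standard test targets
(see the Makefile and meson.build hunks), so something like
make -C src/test/modules/injection_points check should run them; the exact
invocation depends on your build setup.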
From d694020bb8c9b8fa6e346029bba2500c0a0f06cc Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v9 1/9] this is https://commitfest.postgresql.org/50/5160/
merged into a single commit; it is required for the stability of the stress tests.
---
src/backend/commands/indexcmds.c | 4 +-
src/backend/executor/execIndexing.c | 3 +
src/backend/executor/execPartition.c | 119 ++++++++-
src/backend/executor/nodeModifyTable.c | 2 +
src/backend/optimizer/util/plancat.c | 135 +++++++---
src/backend/utils/time/snapmgr.c | 2 +
src/test/modules/injection_points/Makefile | 7 +-
.../expected/index_concurrently_upsert.out | 80 ++++++
.../index_concurrently_upsert_predicate.out | 80 ++++++
.../expected/reindex_concurrently_upsert.out | 238 ++++++++++++++++++
...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
src/test/modules/injection_points/meson.build | 11 +
.../specs/index_concurrently_upsert.spec | 68 +++++
.../index_concurrently_upsert_predicate.spec | 70 ++++++
.../specs/reindex_concurrently_upsert.spec | 86 +++++++
...dex_concurrently_upsert_on_constraint.spec | 86 +++++++
...index_concurrently_upsert_partitioned.spec | 88 +++++++
18 files changed, 1505 insertions(+), 50 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
* before the reference snap was taken, we have to wait out any
* transactions that might have older snapshots.
*/
+ INJECTION_POINT("define_index_before_set_valid");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_3);
WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* the same time to make sure we only get constraint violations from the
* indexes with the correct names.
*/
-
+ INJECTION_POINT("reindex_relation_concurrently_before_swap");
StartTransactionCommand();
/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* index_drop() for more details.
*/
+ INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_4);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -936,6 +937,8 @@ retry:
econtext->ecxt_scantuple = save_scantuple;
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
return !conflict;
}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
return rri;
}
+/*
+ * IsIndexCompatibleAsArbiter
+ * Checks whether the given index is interchangeable with the provided
+ * arbiter index for the purposes of the INSERT ON CONFLICT operation,
+ * comparing key attributes, expressions, and predicates.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation arbiterIndexRelation,
+ IndexInfo *arbiterIndexInfo,
+ Relation indexRelation,
+ IndexInfo *indexInfo)
+{
+ int i;
+
+ if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+ return false;
+ /* Exclusion constraints are not supported here. */
+ if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+ return false;
+ if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+ return false;
+
+ for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+ {
+ int arbiterAttNo = arbiterIndexRelation->rd_index->indkey.values[i];
+ int attNo = indexRelation->rd_index->indkey.values[i];
+ if (arbiterAttNo != attNo)
+ return false;
+ }
+
+ if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+ RelationGetIndexExpressions(indexRelation)) != NIL)
+ return false;
+
+ if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+ RelationGetIndexPredicate(indexRelation)) != NIL)
+ return false;
+ return true;
+}
+
/*
* ExecInitPartitionInfo
* Lock the partition and initialize ResultRelInfo. Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
{
List *childIdxs;
+ List *nonAncestorIdxs = NIL;
+ int i, j, additional_arbiters = 0;
childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
ListCell *lc2;
ancestors = get_partition_ancestors(childIdx);
- foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ if (ancestors)
{
- if (list_member_oid(ancestors, lfirst_oid(lc2)))
- arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+ {
+ if (list_member_oid(ancestors, lfirst_oid(lc2)))
+ arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+ }
}
+ else /* No ancestor was found for that index. Save it for rechecking later. */
+ nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
list_free(ancestors);
}
+
+ /*
+ * If any non-ancestor indexes are found, we need to compare them with the
+ * relation's other indexes that will be used as arbiters. This is necessary
+ * when a partitioned index is processed by REINDEX CONCURRENTLY: both the old
+ * and the new index must be considered as arbiters to ensure that all
+ * concurrent transactions use the same set of arbiters.
+ */
+ if (nonAncestorIdxs)
+ {
+ for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+ {
+ if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+ {
+ Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+ IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+ Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+ /* It is too early to use non-ready indexes as arbiters */
+ if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+ continue;
+
+ for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+ {
+ if (list_member_oid(arbiterIndexes,
+ leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+ {
+ Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+ IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+ /* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+ if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+ nonAncestorIndexRelation, nonAncestorIndexInfo))
+ {
+ arbiterIndexes = lappend_oid(arbiterIndexes,
+ nonAncestorIndexRelation->rd_index->indexrelid);
+ additional_arbiters++;
+ }
+ }
+ }
+ }
+ }
+ }
+ list_free(nonAncestorIdxs);
+
+ /*
+ * If the resulting lists are of inequal length, something is wrong.
+ * (This shouldn't happen, since arbiter index selection should not
+ * pick up a non-ready index.)
+ *
+ * But we also need to account for the additional arbiter indexes.
+ */
+ if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+ list_length(arbiterIndexes) - additional_arbiters)
+ elog(ERROR, "invalid arbiter index list");
}
-
- /*
- * If the resulting lists are of inequal length, something is wrong.
- * (This shouldn't happen, since arbiter index selection should not
- * pick up an invalid index.)
- */
- if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
- list_length(arbiterIndexes))
- elog(ERROR, "invalid arbiter index list");
leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c445c433df4..67befb6cba6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
#include "utils/datum.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
return NULL;
}
}
+ INJECTION_POINT("exec_insert_before_insert_speculative");
/*
* Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index c31cc3ee69f..b4f9641e588 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
List *indexList;
ListCell *l;
- /* Normalized inference attributes and inference expressions: */
- Bitmapset *inferAttrs = NULL;
- List *inferElems = NIL;
+ /* Normalized required attributes and expressions: */
+ Bitmapset *requiredArbiterAttrs = NULL;
+ List *requiredArbiterElems = NIL;
+ List *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
/* Results */
List *results = NIL;
+ bool foundValid = false;
/*
* Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
if (!IsA(elem->expr, Var))
{
- /* If not a plain Var, just shove it in inferElems for now */
- inferElems = lappend(inferElems, elem->expr);
+ /* If not a plain Var, just shove it in requiredArbiterElems for now */
+ requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
continue;
}
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("whole row unique index inference specifications are not supported")));
- inferAttrs = bms_add_member(inferAttrs,
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
attno - FirstLowInvalidHeapAttributeNumber);
}
+ indexList = RelationGetIndexList(relation);
+
/*
* Lookup named constraint's index. This is not immediately returned
- * because some additional sanity checks are required.
+ * because some additional sanity checks are required. We also need to
+ * process other indexes as potential arbiters, to account for cases
+ * where REINDEX CONCURRENTLY is processing an index backing a named
+ * constraint.
*/
if (onconflict->constraint != InvalidOid)
{
indexOidFromConstraint = get_constraint_index(onconflict->constraint);
if (indexOidFromConstraint == InvalidOid)
+ {
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("constraint in ON CONFLICT clause has no associated index")));
+ errmsg("constraint in ON CONFLICT clause has no associated index")));
+ }
+
+ /*
+ * Find the named constraint's index to extract its attributes and predicates.
+ * We open all indexes inside the loop so that locks are taken in a consistent order.
+ */
+ foreach(l, indexList)
+ {
+ Oid indexoid = lfirst_oid(l);
+ Relation idxRel;
+ Form_pg_index idxForm;
+ AttrNumber natt;
+
+ idxRel = index_open(indexoid, rte->rellockmode);
+ idxForm = idxRel->rd_index;
+
+ if (idxForm->indisready)
+ {
+ if (indexOidFromConstraint == idxForm->indexrelid)
+ {
+ /*
+ * Record the requirements other indexes must meet to be used as arbiters
+ * together with indexOidFromConstraint. Both equivalent indexes must be
+ * involved in the case of REINDEX CONCURRENTLY.
+ */
+ for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+ {
+ int attno = idxRel->rd_index->indkey.values[natt];
+
+ if (attno != 0)
+ requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+ attno - FirstLowInvalidHeapAttributeNumber);
+ }
+ requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+ requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+ /* We are done, so quit the loop. */
+ index_close(idxRel, NoLock);
+ break;
+ }
+ }
+ index_close(idxRel, NoLock);
+ }
}
/*
* Using that representation, iterate through the list of indexes on the
* target relation to try and find a match
*/
- indexList = RelationGetIndexList(relation);
-
foreach(l, indexList)
{
Oid indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
idxRel = index_open(indexoid, rte->rellockmode);
idxForm = idxRel->rd_index;
- if (!idxForm->indisvalid)
+ /*
+ * We need to consider both indisvalid and indisready indexes, because
+ * the latter may become indisvalid before the execution phase. This is
+ * required to keep the set of indexes used as arbiters the same for
+ * all concurrent transactions.
+ */
+ if (!idxForm->indisready)
goto next;
/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
- results = lappend_oid(results, idxForm->indexrelid);
- list_free(indexList);
- index_close(idxRel, NoLock);
- table_close(relation, NoLock);
- return results;
+ goto found;
}
else if (indexOidFromConstraint != InvalidOid)
{
- /* No point in further work for index in named constraint case */
- goto next;
+ /* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+ if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+ goto next;
+ } else {
+ /*
+ * Only considering conventional inference at this point (not named
+ * constraints), so index under consideration can be immediately
+ * skipped if it's not unique
+ */
+ if (!idxForm->indisunique)
+ goto next;
}
- /*
- * Only considering conventional inference at this point (not named
- * constraints), so index under consideration can be immediately
- * skipped if it's not unique
- */
- if (!idxForm->indisunique)
- goto next;
-
/*
* So-called unique constraints with WITHOUT OVERLAPS are really
* exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/* Non-expression attributes (if any) must match */
- if (!bms_equal(indexedAttrs, inferAttrs))
+ if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
goto next;
/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
if (idxExprs && varno != 1)
ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
+ /*
+ * If arbiterElems are present, check them. If a named constraint is
+ * present, arbiterElems == NIL.
+ */
foreach(el, onconflict->arbiterElems)
{
InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
}
/*
- * Now that all inference elements were matched, ensure that the
+ * In the case of conventional inference, ensure that the
* expression elements from inference clause are not missing any
* cataloged expressions. This does the right thing when unique
* indexes redundantly repeat the same attribute, or if attributes
* redundantly appear multiple times within an inference clause.
+ *
+ * In the case of a named constraint, ensure that the candidate has the
+ * same set of expressions as the named constraint's index.
*/
- if (list_difference(idxExprs, inferElems) != NIL)
+ if (list_difference(idxExprs, requiredArbiterElems) != NIL)
goto next;
- /*
- * If it's a partial index, its predicate must be implied by the ON
- * CONFLICT's WHERE clause.
- */
predExprs = RelationGetIndexPredicate(idxRel);
if (predExprs && varno != 1)
ChangeVarNodes((Node *) predExprs, 1, varno, 0);
- if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+ /*
+ * If it's a partial index under conventional inference, its predicate must be implied
+ * by the ON CONFLICT's WHERE clause.
+ */
+ if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+ goto next;
+ /* If it's a partial index under a named constraint, the predicates must be equal. */
+ if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
goto next;
+found:
results = lappend_oid(results, idxForm->indexrelid);
+ foundValid |= idxForm->indisvalid;
next:
index_close(idxRel, NoLock);
}
@@ -946,7 +1008,8 @@ next:
list_free(indexList);
table_close(relation, NoLock);
- if (results == NIL)
+ /* At least one indisvalid index is required during planning. */
+ if (results == NIL || !foundValid)
ereport(ERROR,
(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6eb29b99735..101a02c5b60 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
#include "utils/resowner.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/*
@@ -388,6 +389,7 @@ InvalidateCatalogSnapshot(void)
pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
CatalogSnapshot = NULL;
SnapshotResetXmin();
+ INJECTION_POINT("invalidate_catalog_snapshot_end");
}
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+ reindex_concurrently_upsert \
+ index_concurrently_upsert \
+ reindex_concurrently_upsert_partitioned \
+ reindex_concurrently_upsert_on_constraint \
+ index_concurrently_upsert_predicate
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid:
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot:
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead:
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s4_wakeup_s2:
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'reindex_concurrently_upsert',
+ 'index_concurrently_upsert',
+ 'reindex_concurrently_upsert_partitioned',
+ 'reindex_concurrently_upsert_on_constraint',
+ 'index_concurrently_upsert_predicate',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
+  # We wait for all snapshots, so avoid parallel test execution
+ 'runningcheck-parallel': false,
},
'tap': {
'env': {
@@ -53,5 +62,7 @@ tests += {
'tests': [
't/001_stats.pl',
],
+ # The injection points are cluster-wide, so disable installcheck
+ 'runningcheck': false,
},
}
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
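For readers new to the injection_points module: all of these specs rely on the same attach/wait/wakeup pattern. A minimal sketch in SQL, using the point names from the spec above (illustration only, not part of the patch):

    -- Scope subsequently attached points to this backend only.
    SELECT injection_points_set_local();
    -- Any code path in this backend reaching the point will block.
    SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
    -- ... the session running INSERT ... ON CONFLICT now waits there ...
    -- Detach first so the point cannot be hit again, then release the waiter.
    SELECT injection_points_detach('exec_insert_before_insert_speculative');
    SELECT injection_points_wakeup('exec_insert_before_insert_speculative');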
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+ CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+ SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index { CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot {
+ SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+ SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid {
+ SELECT injection_points_detach('define_index_before_set_valid');
+ SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+ s3_start_create_index
+ s1_start_upsert
+ s4_wakeup_define_index_before_set_valid
+ s2_start_upsert
+ s4_wakeup_s1_from_invalidate_catalog_snapshot
+ s4_wakeup_s2
+ s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+ ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX the primary key index concurrently
+# - s4: operations with injection points
+
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+ CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+ FOR VALUES FROM (0) TO (10000)
+ WITH (parallel_workers = 0);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert { INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+ SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex { REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2 {
+ SELECT injection_points_detach('exec_insert_before_insert_speculative');
+ SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead {
+ SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+ SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+ s3_start_reindex
+ s1_start_upsert
+ s4_wakeup_to_swap
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s2_start_upsert
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_s2
+ s4_wakeup_to_set_dead
+
+permutation
+ s3_start_reindex
+ s4_wakeup_to_swap
+ s1_start_upsert
+ s2_start_upsert
+ s4_wakeup_s1
+ s4_wakeup_to_set_dead
+ s4_wakeup_s2
\ No newline at end of file
--
2.43.0
v9-0006-Add-STIR-Short-Term-Index-Replacement-access-meth.patch
From 2976d46c4c65c844c1fe5c369c6b9942ccaf14cb Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v9 6/9] Add STIR (Short-Term Index Replacement) access method
This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:
- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
operation is validating a newly built index (e.g., during concurrent build).
Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.
These changes lay essential groundwork for further improvements to concurrent
index builds.
---
contrib/pgstattuple/pgstattuple.c | 3 +
src/backend/access/Makefile | 2 +-
src/backend/access/heap/vacuumlazy.c | 2 +
src/backend/access/meson.build | 1 +
src/backend/access/stir/Makefile | 18 +
src/backend/access/stir/meson.build | 5 +
src/backend/access/stir/stir.c | 576 +++++++++++++++++++++++
src/backend/catalog/index.c | 1 +
src/backend/commands/analyze.c | 1 +
src/backend/commands/vacuumparallel.c | 1 +
src/backend/nodes/makefuncs.c | 1 +
src/include/access/genam.h | 1 +
src/include/access/reloptions.h | 3 +-
src/include/access/stir.h | 117 +++++
src/include/catalog/pg_am.dat | 3 +
src/include/catalog/pg_opclass.dat | 4 +
src/include/catalog/pg_opfamily.dat | 2 +
src/include/catalog/pg_proc.dat | 4 +
src/include/nodes/execnodes.h | 6 +-
src/include/utils/index_selfuncs.h | 8 +
src/test/regress/expected/amutils.out | 8 +-
src/test/regress/expected/opr_sanity.out | 7 +-
src/test/regress/expected/psql.out | 24 +-
23 files changed, 780 insertions(+), 18 deletions(-)
create mode 100644 src/backend/access/stir/Makefile
create mode 100644 src/backend/access/stir/meson.build
create mode 100644 src/backend/access/stir/stir.c
create mode 100644 src/include/access/stir.h
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
case SPGIST_AM_OID:
err = "spgist index";
break;
+ case STIR_AM_OID:
+ err = "stir index";
+ break;
case BRIN_AM_OID:
err = "brin index";
break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \
- sequence table tablesample transam
+ stir sequence table tablesample transam
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f2ca9430581..bec79b48cb2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
ivinfo.message_level = DEBUG2;
ivinfo.num_heap_tuples = reltuples;
ivinfo.strategy = vacrel->bstrategy;
+ ivinfo.validate_index = false;
/*
* Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
ivinfo.num_heap_tuples = reltuples;
ivinfo.strategy = vacrel->bstrategy;
+ ivinfo.validate_index = false;
/*
* Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 62a371db7f7..63ee0ef134d 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
subdir('rmgrdesc')
subdir('sequence')
subdir('spgist')
+subdir('stir')
subdir('table')
subdir('tablesample')
subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/stir
+#
+# IDENTIFICATION
+# src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+ 'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..83aa255176f
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,576 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ * Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 5. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "commands/vacuum.h"
+#include "utils/index_selfuncs.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "utils/catcache.h"
+#include "access/amvalidate.h"
+#include "utils/syscache.h"
+#include "access/htup_details.h"
+#include "catalog/pg_amproc.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "utils/regproc.h"
+#include "storage/bufmgr.h"
+#include "access/tableam.h"
+#include "access/reloptions.h"
+#include "utils/memutils.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+ IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+ /* Set STIR-specific strategy and procedure numbers */
+ amroutine->amstrategies = STIR_NSTRATEGIES;
+ amroutine->amsupport = STIR_NPROC;
+ amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+ /* STIR doesn't support most index operations */
+ amroutine->amcanorder = false;
+ amroutine->amcanorderbyop = false;
+ amroutine->amcanbackward = false;
+ amroutine->amcanunique = false;
+ amroutine->amcanmulticol = true;
+ amroutine->amoptionalkey = true;
+ amroutine->amsearcharray = false;
+ amroutine->amsearchnulls = false;
+ amroutine->amstorage = false;
+ amroutine->amclusterable = false;
+ amroutine->ampredlocks = false;
+ amroutine->amcanparallel = false;
+ amroutine->amcanbuildparallel = false;
+ amroutine->amcaninclude = true;
+ amroutine->amusemaintenanceworkmem = false;
+ amroutine->amparallelvacuumoptions =
+ VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+ amroutine->amkeytype = InvalidOid;
+
+ /* Set up function callbacks */
+ amroutine->ambuild = stirbuild;
+ amroutine->ambuildempty = stirbuildempty;
+ amroutine->aminsert = stirinsert;
+ amroutine->aminsertcleanup = NULL;
+ amroutine->ambulkdelete = stirbulkdelete;
+ amroutine->amvacuumcleanup = stirvacuumcleanup;
+ amroutine->amcanreturn = NULL;
+ amroutine->amcostestimate = stircostestimate;
+ amroutine->amoptions = stiroptions;
+ amroutine->amproperty = NULL;
+ amroutine->ambuildphasename = NULL;
+ amroutine->amvalidate = stirvalidate;
+ amroutine->amadjustmembers = NULL;
+ amroutine->ambeginscan = stirbeginscan;
+ amroutine->amrescan = stirrescan;
+ amroutine->amgettuple = NULL;
+ amroutine->amgetbitmap = NULL;
+ amroutine->amendscan = stirendscan;
+ amroutine->ammarkpos = NULL;
+ amroutine->amrestrpos = NULL;
+ amroutine->amestimateparallelscan = NULL;
+ amroutine->aminitparallelscan = NULL;
+ amroutine->amparallelrescan = NULL;
+
+ PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped.
+ * But we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+ bool result = true;
+ HeapTuple classtup;
+ Form_pg_opclass classform;
+ Oid opfamilyoid;
+ HeapTuple familytup;
+ Form_pg_opfamily familyform;
+ char *opfamilyname;
+ CatCList *proclist,
+ *oprlist;
+ int i;
+
+ /* Fetch opclass information */
+ classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+ if (!HeapTupleIsValid(classtup))
+ elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+ classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+ opfamilyoid = classform->opcfamily;
+
+
+ /* Fetch opfamily information */
+ familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+ if (!HeapTupleIsValid(familytup))
+ elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+ familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+ opfamilyname = NameStr(familyform->opfname);
+
+ /* Fetch all operators and support functions of the opfamily */
+ oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+ proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+ /* Check individual operators */
+ for (i = 0; i < oprlist->n_members; i++)
+ {
+ HeapTuple oprtup = &oprlist->members[i]->tuple;
+ Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+ /* Check that it's an allowed strategy for stir */
+ if (oprform->amopstrategy < 1 ||
+ oprform->amopstrategy > STIR_NSTRATEGIES)
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+ opfamilyname,
+ format_operator(oprform->amopopr),
+ oprform->amopstrategy)));
+ result = false;
+ }
+
+ /* stir doesn't support ORDER BY operators */
+ if (oprform->amoppurpose != AMOP_SEARCH ||
+ OidIsValid(oprform->amopsortfamily))
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+ opfamilyname,
+ format_operator(oprform->amopopr))));
+ result = false;
+ }
+
+ /* Check operator signature --- same for all stir strategies */
+ if (!check_amop_signature(oprform->amopopr, BOOLOID,
+ oprform->amoplefttype,
+ oprform->amoprighttype))
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains operator %s with wrong signature",
+ opfamilyname,
+ format_operator(oprform->amopopr))));
+ result = false;
+ }
+ }
+
+
+ ReleaseCatCacheList(proclist);
+ ReleaseCatCacheList(oprlist);
+ ReleaseSysCache(familytup);
+ ReleaseSysCache(classtup);
+
+ return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+ StirMetaPageData *metadata;
+
+ StirInitPage(metaPage, STIR_META);
+ metadata = StirPageGetMeta(metaPage);
+ memset(metadata, 0, sizeof(StirMetaPageData));
+ metadata->magickNumber = STIR_MAGICK_NUMBER;
+ metadata->skipInserts = skipInserts;
+ ((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+ Buffer metaBuffer;
+ Page metaPage;
+ GenericXLogState *state;
+
+ /*
+ * Make a new page; since it is the first page, it should be associated with
+ * block number 0 (STIR_METAPAGE_BLKNO). No need to hold the extension
+ * lock because there cannot be concurrent inserters yet.
+ */
+ metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+ Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+ /* Initialize contents of meta page */
+ state = GenericXLogStart(index);
+ metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+ GENERIC_XLOG_FULL_IMAGE);
+ StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+ GenericXLogFinish(state);
+
+ UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+ StirPageOpaque opaque;
+
+ PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+ opaque = StirPageGetOpaque(page);
+ opaque->flags = flags;
+ opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+ StirTuple *itup;
+ StirPageOpaque opaque;
+ Pointer ptr;
+
+ /* We shouldn't be pointed to an invalid page */
+ Assert(!PageIsNew(page));
+
+ /* Does new tuple fit on the page? */
+ if (StirPageGetFreeSpace(page) < sizeof(StirTuple))
+ return false;
+
+ /* Copy new tuple to the end of page */
+ opaque = StirPageGetOpaque(page);
+ itup = StirPageGetTuple(page, opaque->maxoff + 1);
+ memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+ /* Adjust maxoff and pd_lower */
+ opaque->maxoff++;
+ ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+ ((PageHeader) page)->pd_lower = ptr - page;
+
+ /* Assert we didn't overrun available space */
+ Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+ return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo)
+{
+ StirTuple *itup;
+ MemoryContext oldCtx;
+ MemoryContext insertCtx;
+ StirMetaPageData *metaData;
+ Buffer buffer,
+ metaBuffer;
+ Page page;
+ GenericXLogState *state;
+ uint16 blkNo;
+
+ /* Create temporary context for insert operation */
+ insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Stir insert temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+
+ oldCtx = MemoryContextSwitchTo(insertCtx);
+
+ /* Create new tuple with heap pointer */
+ itup = (StirTuple *) palloc0(sizeof(StirTuple));
+ itup->heapPtr = *ht_ctid;
+
+ metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+ for (;;)
+ {
+ LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+ metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+ /* Check if inserts are allowed */
+ if (metaData->skipInserts)
+ {
+ UnlockReleaseBuffer(metaBuffer);
+ return false;
+ }
+ blkNo = metaData->lastBlkNo;
+ /* Don't hold metabuffer lock while doing insert */
+ LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+ if (blkNo > 0)
+ {
+ buffer = ReadBuffer(index, blkNo);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ page = GenericXLogRegisterBuffer(state, buffer, 0);
+
+ Assert(!PageIsNew(page));
+
+ /* Try to add tuple to existing page */
+ if (StirPageAddItem(page, itup))
+ {
+ /* Success! Apply the change, clean up, and exit */
+ GenericXLogFinish(state);
+ UnlockReleaseBuffer(buffer);
+ ReleaseBuffer(metaBuffer);
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextDelete(insertCtx);
+ return false;
+ }
+
+ /* Didn't fit, must try other pages */
+ GenericXLogAbort(state);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ /* Need to add new page - get exclusive lock on meta page */
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ metaData = StirPageGetMeta(GenericXLogRegisterBuffer(state, metaBuffer, GENERIC_XLOG_FULL_IMAGE));
+ /* Check if another backend already extended the index */
+
+ if (blkNo != metaData->lastBlkNo)
+ {
+ Assert(blkNo < metaData->lastBlkNo);
+ /* Someone else inserted a new page into the index; let's try again */
+ GenericXLogAbort(state);
+ LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+ else
+ {
+ /* Must extend the file */
+ buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
+
+ page = GenericXLogRegisterBuffer(state, buffer, GENERIC_XLOG_FULL_IMAGE);
+ StirInitPage(page, 0);
+
+ if (!StirPageAddItem(page, itup))
+ {
+ /* We shouldn't be here since we're inserting into an empty page */
+ elog(ERROR, "could not add new stir tuple to empty page");
+ }
+
+ /* Update meta page with new last block number */
+ metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+ GenericXLogFinish(state);
+
+ UnlockReleaseBuffer(buffer);
+ UnlockReleaseBuffer(metaBuffer);
+
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextDelete(insertCtx);
+
+ return false;
+ }
+ }
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+ ScanKey orderbys, int norderbys)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+ struct IndexInfo *indexInfo)
+{
+ IndexBuildResult *result;
+
+ if (!indexInfo->ii_Auxiliary)
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes cannot be built directly")));
+
+ StirInitMetapage(index, MAIN_FORKNUM);
+
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+ result->heap_tuples = 0;
+ result->index_tuples = 0;
+ return result;
+}
+
+void stirbuildempty(Relation index)
+{
+ StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats,
+ IndexBulkDeleteCallback callback,
+ void *callback_state)
+{
+ Relation index = info->index;
+ BlockNumber blkno, npages;
+ Buffer buffer;
+ Page page;
+
+ /* For a normal VACUUM, mark the index to skip inserts and warn that it should be dropped */
+ if (!info->validate_index)
+ {
+ StirMarkAsSkipInserts(index);
+
+ ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("\"%s\" is not implemented; this index likely needs to be dropped", __func__)));
+ return NULL;
+ }
+
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ /*
+ * Iterate over the pages. We don't care about concurrently added pages,
+ * because the index is marked as not-ready at that moment and is not
+ * used for inserts.
+ */
+ npages = RelationGetNumberOfBlocks(index);
+ for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+ {
+ StirTuple *itup, *itupEnd;
+
+ vacuum_delay_point();
+
+ buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+ RBM_NORMAL, info->strategy);
+
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buffer);
+
+ if (PageIsNew(page))
+ {
+ UnlockReleaseBuffer(buffer);
+ continue;
+ }
+
+ itup = StirPageGetTuple(page, FirstOffsetNumber);
+ itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+ while (itup < itupEnd)
+ {
+ /* Do we have to delete this tuple? */
+ if (callback(&itup->heapPtr, callback_state))
+ {
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("tuples are never deleted from a stir index")));
+ }
+
+ itup = StirPageGetNextTuple(itup);
+ }
+
+ UnlockReleaseBuffer(buffer);
+ }
+
+ return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+ StirMetaPageData *metaData;
+ Buffer metaBuffer;
+ Page metaPage;
+ GenericXLogState *state;
+
+ metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ state = GenericXLogStart(index);
+ metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+ GENERIC_XLOG_FULL_IMAGE);
+ metaData = StirPageGetMeta(metaPage);
+ if (!metaData->skipInserts)
+ {
+ metaData->skipInserts = true;
+ GenericXLogFinish(state);
+ }
+ else
+ {
+ GenericXLogAbort(state);
+ }
+ UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats)
+{
+ StirMarkAsSkipInserts(info->index);
+ ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("\"%s\" is not implemented; this index likely needs to be dropped", __func__)));
+ return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+ return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+ double loop_count, Cost *indexStartupCost,
+ Cost *indexTotalCost, Selectivity *indexSelectivity,
+ double *indexCorrelation, double *indexPages)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
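Since stirbuild() rejects any non-auxiliary build and all scan callbacks error out, a STIR index is unusable outside the concurrent-build path. A sketch of what a direct attempt would produce (hypothetical session, not part of the patch; error text per stirbuild above):

    CREATE TABLE t(i int);
    CREATE INDEX t_stir ON t USING stir (i);
    -- ERROR:  STIR indexes cannot be built directly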
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 73454accf61..7ff7ab6c72a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3403,6 +3403,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
ivinfo.message_level = DEBUG2;
ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
ivinfo.strategy = NULL;
+ ivinfo.validate_index = true;
/*
* Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282f..d54d310ba43 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
ivinfo.message_level = elevel;
ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
ivinfo.strategy = vac_strategy;
+ ivinfo.validate_index = false;
stats = index_vacuum_cleanup(&ivinfo, NULL);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a564..e4327b4f7dc 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
ivinfo.estimated_count = pvs->shared->estimated_count;
ivinfo.num_heap_tuples = pvs->shared->reltuples;
ivinfo.strategy = pvs->bstrategy;
+ ivinfo.validate_index = false;
/* Update error traceback information */
pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 7e5df7bea4d..44a8a1f2875 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
/* initialize index-build state to default */
n->ii_BrokenHotChain = false;
n->ii_ParallelWorkers = 0;
+ n->ii_Auxiliary = false;
/* set up for possible use by index AM */
n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 81653febc18..194dbbe1d0e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
bool estimated_count; /* num_heap_tuples is an estimate */
int message_level; /* ereport level for progress messages */
double num_heap_tuples; /* tuples remaining in heap */
+ bool validate_index; /* validating concurrently built index? */
BufferAccessStrategy strategy; /* access strategy for reads */
} IndexVacuumInfo;
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index df6923c9d50..0966397d344 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
RELOPT_KIND_VIEW = (1 << 9),
RELOPT_KIND_BRIN = (1 << 10),
RELOPT_KIND_PARTITIONED = (1 << 11),
+ RELOPT_KIND_STIR = (1 << 12),
/* if you add a new kind, make sure you update "last_default" too */
- RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+ RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ * header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC 0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES 1
+
+#define STIR_OPTIONS_PROC 0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+ ((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page) ((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+ ((StirTuple *)(PageGetContents(page) \
+ + sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+ ((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO (0)
+#define STIR_HEAD_BLKNO (1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+ OffsetNumber maxoff; /* number of index tuples on page */
+ uint16 flags; /* see bit definitions below */
+ uint16 unused; /* placeholder to force maxaligning of size of
+ * StirPageOpaqueData and to place
+ * stir_page_id exactly at the end of page */
+ uint16 stir_page_id; /* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META (1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID 0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+ uint32 magickNumber;
+ uint16 lastBlkNo;
+ bool skipInserts; /* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page) ((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+ ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(page) \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+ - StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+ - MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+ ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+ struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+ void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index db874902820..51350df0bf0 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
{ oid => '3580', oid_symbol => 'BRIN_AM_OID',
descr => 'block range index (BRIN) access method',
amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+ descr => 'short term index replacement access method',
+ amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index f503c652ebc..a8f0e66d15b 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
# no brin opclass for the geometric types except box
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+ opcfamily => 'stir/any_ops', opcintype => 'any'},
+
]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index c8ac8c73def..41ea0c3ca50 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
opfmethod => 'hash', opfname => 'multirange_ops' },
{ oid => '6158',
opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+ opfmethod => 'stir', opfname => 'any_ops' },
]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2dcc2d42dac..34564109e50 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
proname => 'brinhandler', provolatile => 'v',
prorettype => 'index_am_handler', proargtypes => 'internal',
prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+ proname => 'stirhandler', provolatile => 'v',
+ prorettype => 'index_am_handler', proargtypes => 'internal',
+ prosrc => 'stirhandler' },
{ oid => '3952', descr => 'brin: standalone scan new table pages',
proname => 'brin_summarize_new_values', provolatile => 'v',
proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1590b643920..7d4e43148e6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
* BrokenHotChain did we detect any broken HOT chains?
* Summarizing is it a summarizing index?
* ParallelWorkers # of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
* Am Oid of index AM
* AmCache private cache area for index AM
* Context memory context holding this IndexInfo
*
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary, and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
* ----------------
*/
typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
bool ii_Summarizing;
bool ii_WithoutOverlaps;
int ii_ParallelWorkers;
+ bool ii_Auxiliary;
Oid ii_Am;
void *ii_AmCache;
MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index a41cd2b7fd9..61f3d3dea0c 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
Selectivity *indexSelectivity,
double *indexCorrelation,
double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+ struct IndexPath *path,
+ double loop_count,
+ Cost *indexStartupCost,
+ Cost *indexTotalCost,
+ Selectivity *indexSelectivity,
+ double *indexCorrelation,
+ double *indexPages);
extern void gincostestimate(struct PlannerInfo *root,
struct IndexPath *path,
double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
spgist | can_exclude | t
spgist | can_include | t
spgist | bogus |
-(36 rows)
+ stir | can_order | f
+ stir | can_unique | f
+ stir | can_multi_col | t
+ stir | can_exclude | f
+ stir | can_include | t
+ stir | bogus |
+(42 rows)
--
-- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
WHERE a1.amopfamily = c1.opcfamily
AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily
----------+-----------
-(0 rows)
+ opcname | opcfamily
+----------+-----------
+ stir_ops | 5558
+(1 row)
-- Check that each operator listed in pg_amop has an associated opclass,
-- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
heap | Table
heap2 | Table
spgist | Index
-(8 rows)
+ stir | Index
+(9 rows)
\dA *
List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
heap | Table
heap2 | Table
spgist | Index
-(8 rows)
+ stir | Index
+(9 rows)
\dA h*
List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
\dA: extra argument "bar" ignored
\dA+
- List of access methods
- Name | Type | Handler | Description
---------+-------+----------------------+----------------------------------------
+ List of access methods
+ Name | Type | Handler | Description
+--------+-------+----------------------+--------------------------------------------
brin | Index | brinhandler | block range index (BRIN) access method
btree | Index | bthandler | b-tree index access method
gin | Index | ginhandler | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
heap | Table | heap_tableam_handler | heap table access method
heap2 | Table | heap_tableam_handler |
spgist | Index | spghandler | SP-GiST index access method
-(8 rows)
+ stir | Index | stirhandler | short term index replacement access method
+(9 rows)
\dA+ *
- List of access methods
- Name | Type | Handler | Description
---------+-------+----------------------+----------------------------------------
+ List of access methods
+ Name | Type | Handler | Description
+--------+-------+----------------------+--------------------------------------------
brin | Index | brinhandler | block range index (BRIN) access method
btree | Index | bthandler | b-tree index access method
gin | Index | ginhandler | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
heap | Table | heap_tableam_handler | heap table access method
heap2 | Table | heap_tableam_handler |
spgist | Index | spghandler | SP-GiST index access method
-(8 rows)
+ stir | Index | stirhandler | short term index replacement access method
+(9 rows)
\dA+ h*
List of access methods
--
2.43.0
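As a quick sanity check after applying this patch, the new access method should be visible in the catalogs (a sketch; output abridged):

    SELECT amname, amtype FROM pg_am WHERE amname = 'stir';
     amname | amtype
    --------+--------
     stir   | i
    (1 row)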
v9-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-a.patch
From 6e38968bc529c4c72d3473d19405f5e3b79d1ff2 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Tue, 24 Dec 2024 13:40:45 +0100
Subject: [PATCH v9 7/9] Improve CREATE/REINDEX INDEX CONCURRENTLY using
auxiliary index
Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves the efficiency of concurrent
index operations by:
- Creating an auxiliary STIR (Short Term Index Replacement) index to track
new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase
instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready
This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.
This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
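To illustrate the intended flow (a sketch; the name and visibility of the transient auxiliary index are patch-internal details, so treat the exact output as an assumption):

    -- Session 1: build the index without blocking writers.
    CREATE INDEX CONCURRENTLY tbl_i_idx ON test.tbl(i);

    -- Session 2: while the build is in progress, the auxiliary STIR index
    -- may be observed alongside the target index:
    SELECT indexrelid::regclass, indisvalid, indisready
    FROM pg_index
    WHERE indrelid = 'test.tbl'::regclass;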
---
src/backend/access/heap/heapam_handler.c | 383 +++++++++---------
src/backend/catalog/index.c | 280 +++++++++++--
src/backend/catalog/toasting.c | 3 +-
src/backend/commands/indexcmds.c | 362 +++++++++++++----
src/include/access/tableam.h | 28 +-
src/include/catalog/index.h | 15 +-
src/include/commands/progress.h | 4 +-
.../expected/cic_reset_snapshots.out | 28 ++
.../sql/cic_reset_snapshots.sql | 1 +
src/test/regress/expected/create_index.out | 4 +
src/test/regress/expected/indexing.out | 3 +-
src/test/regress/sql/create_index.sql | 3 +
12 files changed, 791 insertions(+), 323 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0f706553605..ecec3c1c080 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
#include "storage/bufpage.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
@@ -1777,246 +1778,266 @@ heapam_index_build_range_scan(Relation heapRelation,
return reltuples;
}
-static void
+static TransactionId
heapam_index_validate_scan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
- Snapshot snapshot,
- ValidateIndexState *state)
+ ValidateIndexState *state,
+ ValidateIndexState *auxState)
{
- TableScanDesc scan;
- HeapScanDesc hscan;
- HeapTuple heapTuple;
+ IndexFetchTableData *fetch;
+ TransactionId limitXmin;
+
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
- ExprState *predicate;
- TupleTableSlot *slot;
- EState *estate;
- ExprContext *econtext;
- BlockNumber root_blkno = InvalidBlockNumber;
- OffsetNumber root_offsets[MaxHeapTuplesPerPage];
- bool in_index[MaxHeapTuplesPerPage];
- BlockNumber previous_blkno = InvalidBlockNumber;
+
+ Snapshot snapshot;
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
/* state variables for the merge */
- ItemPointer indexcursor = NULL;
- ItemPointerData decoded;
- bool tuplesort_empty = false;
+ ItemPointer indexcursor = NULL,
+ auxindexcursor = NULL,
+ prev_indexcursor = NULL;
+ ItemPointerData decoded,
+ auxdecoded,
+ prev_decoded,
+ fetched;
+ bool tuplesort_empty = false,
+ auxtuplesort_empty = false;
+
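+	/*
+	 * The caller must enter here without a registered snapshot or an
+	 * advertised xmin: the reference snapshot taken below should be the
+	 * only thing holding back the xmin horizon during validation.
+	 */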
+ Assert(!HaveRegisteredOrActiveSnapshot());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+
+ /*
+	 * Now take the "reference snapshot" that will be used to filter candidate
+ * tuples. Beware! There might still be snapshots in
+ * use that treat some transaction as in-progress that our reference
+ * snapshot treats as committed. If such a recently-committed transaction
+ * deleted tuples in the table, we will not include them in the index; yet
+ * those transactions which see the deleting one as still-in-progress will
+ * expect such tuples to be there once we mark the index as valid.
+ *
+ * We solve this by waiting for all endangered transactions to exit before
+ * we mark the index as valid.
+ *
+ * We also set ActiveSnapshot to this snap, since functions in indexes may
+ * need a snapshot.
+ */
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ PushActiveSnapshot(snapshot);
+ limitXmin = snapshot->xmin;
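+
+	/*
+	 * limitXmin is returned to the caller, which must wait out transactions
+	 * with snapshots older than it (see WaitForOlderSnapshots()) before
+	 * marking the index valid.
+	 */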
/*
* sanity checks
*/
Assert(OidIsValid(indexRelation->rd_rel->relam));
- /*
- * Need an EState for evaluation of index expressions and partial-index
- * predicates. Also a slot to hold the current tuple.
- */
estate = CreateExecutorState();
econtext = GetPerTupleExprContext(estate);
slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
- &TTSOpsHeapTuple);
+ &TTSOpsBufferHeapTuple);
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
- /* Set up execution state for predicate, if any. */
- predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
/*
- * Prepare for scan of the base relation. We need just those tuples
- * satisfying the passed-in reference snapshot. We must disable syncscan
- * here, because it's critical that we read from block zero forward to
- * match the sorted TIDs.
+	 * Prepare to fetch heap tuples the way an index scan does. This lets us
+	 * reconstruct a tuple from the heap when we only have an ItemPointer.
*/
- scan = table_beginscan_strat(heapRelation, /* relation */
- snapshot, /* snapshot */
- 0, /* number of keys */
- NULL, /* scan key */
- true, /* buffer access strategy OK */
- false, /* syncscan not OK */
- false);
- hscan = (HeapScanDesc) scan;
+ fetch = heapam_index_fetch_begin(heapRelation);
+
+ /* Initialize pointers. */
+ ItemPointerSetInvalid(&decoded);
+ ItemPointerSetInvalid(&prev_decoded);
+ ItemPointerSetInvalid(&auxdecoded);
+ ItemPointerSetInvalid(&fetched);
- pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
- hscan->rs_nblocks);
+ /* We'll track the last "main" index position in prev_indexcursor. */
+ prev_indexcursor = &prev_decoded;
/*
- * Scan all tuples matching the snapshot.
+ * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+ * which holds TIDs that must be merged with or compared to those from
+ * the "main" sort (state->tuplesort).
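+	 *
+	 * For example, with auxiliary TIDs {(0,3), (1,2)} and main-index TIDs
+	 * {(0,1), (0,3)}: (0,3) is found in both sorts and skipped, while (1,2)
+	 * exhausts the main sort without a match, so it is re-checked against
+	 * the heap and inserted into the target index if it is visible.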
*/
- while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ while (!auxtuplesort_empty)
{
- ItemPointer heapcursor = &heapTuple->t_self;
- ItemPointerData rootTuple;
- OffsetNumber root_offnum;
-
+ Datum ts_val;
+ bool ts_isnull;
CHECK_FOR_INTERRUPTS();
- state->htups += 1;
-
- if ((previous_blkno == InvalidBlockNumber) ||
- (hscan->rs_cblock != previous_blkno))
- {
- pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- hscan->rs_cblock);
- previous_blkno = hscan->rs_cblock;
- }
-
/*
- * As commented in table_index_build_scan, we should index heap-only
- * tuples under the TIDs of their root tuples; so when we advance onto
- * a new heap page, build a map of root item offsets on the page.
- *
- * This complicates merging against the tuplesort output: we will
- * visit the live tuples in order by their offsets, but the root
- * offsets that we need to compare against the index contents might be
- * ordered differently. So we might have to "look back" within the
- * tuplesort output, but only within the current page. We handle that
- * by keeping a bool array in_index[] showing all the
- * already-passed-over tuplesort output TIDs of the current page. We
- * clear that array here, when advancing onto a new heap page.
- */
- if (hscan->rs_cblock != root_blkno)
+ * Attempt to fetch the next TID from the auxiliary sort. If it's
+ * empty, we set auxindexcursor to NULL.
+ */
+ auxtuplesort_empty = !tuplesort_getdatum(auxState->tuplesort, true,
+ false, &ts_val, &ts_isnull,
+ NULL);
+ Assert(auxtuplesort_empty || !ts_isnull);
+ if (!auxtuplesort_empty)
{
- Page page = BufferGetPage(hscan->rs_cbuf);
-
- LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
- heap_get_root_tuples(page, root_offsets);
- LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
- memset(in_index, 0, sizeof(in_index));
-
- root_blkno = hscan->rs_cblock;
+ itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+ auxindexcursor = &auxdecoded;
}
-
- /* Convert actual tuple TID to root TID */
- rootTuple = *heapcursor;
- root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
- if (HeapTupleIsHeapOnly(heapTuple))
+ else
{
- root_offnum = root_offsets[root_offnum - 1];
- if (!OffsetNumberIsValid(root_offnum))
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
- ItemPointerGetBlockNumber(heapcursor),
- ItemPointerGetOffsetNumber(heapcursor),
- RelationGetRelationName(heapRelation))));
- ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
+ auxindexcursor = NULL;
}
/*
- * "merge" by skipping through the index tuples until we find or pass
- * the current root tuple.
- */
- while (!tuplesort_empty &&
- (!indexcursor ||
- ItemPointerCompare(indexcursor, &rootTuple) < 0))
+ * If the auxiliary sort is not yet empty, we now try to synchronize
+ * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+ * the main sort cursor until we've reached or passed the auxiliary TID.
+ */
+ if (!auxtuplesort_empty)
{
- Datum ts_val;
- bool ts_isnull;
-
- if (indexcursor)
+ /*
+ * Move the main sort forward while:
+ * (1) It's not exhausted (tuplesort_empty == false), and
+ * (2) Either indexcursor is NULL (first iteration) or
+ * indexcursor < auxindexcursor in TID order.
+ */
+ while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+ ItemPointerCompare(indexcursor, auxindexcursor) < 0))
{
+ /* Keep track of the previous TID in prev_decoded. */
+ prev_decoded = decoded;
/*
- * Remember index items seen earlier on the current heap page
+ * Get the next TID from the main sort. If it's empty,
+ * we set indexcursor to NULL.
*/
- if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
- in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
- }
-
- tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
- false, &ts_val, &ts_isnull,
- NULL);
- Assert(tuplesort_empty || !ts_isnull);
- if (!tuplesort_empty)
- {
- itemptr_decode(&decoded, DatumGetInt64(ts_val));
- indexcursor = &decoded;
- }
- else
- {
- /* Be tidy */
- indexcursor = NULL;
+ tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+ false, &ts_val, &ts_isnull,
+ NULL);
+ Assert(tuplesort_empty || !ts_isnull);
+ if (!tuplesort_empty)
+ {
+ itemptr_decode(&decoded, DatumGetInt64(ts_val));
+ indexcursor = &decoded;
+
+ /*
+ * If the current TID in the main sort is a duplicate of the
+ * previous one (prev_indexcursor), skip it to avoid
+					 * double-inserting the same TID. Such a situation is
+					 * possible due to concurrent page splits in btree (and,
+					 * probably, in other index AMs as well).
+ */
+ if (ItemPointerCompare(prev_indexcursor, indexcursor) == 0)
+ {
+ elog(DEBUG5, "skipping duplicate tid in target index snapshot: (%u,%u)",
+ ItemPointerGetBlockNumber(indexcursor),
+ ItemPointerGetOffsetNumber(indexcursor));
+ }
+ }
+ else
+ {
+ indexcursor = NULL;
+ }
+
+ CHECK_FOR_INTERRUPTS();
}
- }
-
- /*
- * If the tuplesort has overshot *and* we didn't see a match earlier,
- * then this tuple is missing from the index, so insert it.
- */
- if ((tuplesort_empty ||
- ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
- !in_index[root_offnum - 1])
- {
- MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
- /* Set up for predicate or expression evaluation */
- ExecStoreHeapTuple(heapTuple, slot, false);
/*
- * In a partial index, discard tuples that don't satisfy the
- * predicate.
+ * Now, if either:
+ * - the main sort is empty, or
+ * - indexcursor > auxindexcursor,
+ *
+ * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it into the target
+			 * index if it's visible in the heap.
*/
- if (predicate != NULL)
+ if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
{
- if (!ExecQual(predicate, econtext))
- continue;
- }
+ bool call_again = false;
+ bool all_dead = false;
+ ItemPointer tid;
- /*
- * For the current heap tuple, extract all the attributes we use
- * in this index, and note which are null. This also performs
- * evaluation of any expressions needed.
- */
- FormIndexDatum(indexInfo,
- slot,
- estate,
- values,
- isnull);
+ /* Copy the auxindexcursor TID into fetched. */
+ fetched = *auxindexcursor;
+ tid = &fetched;
- /*
- * You'd think we should go ahead and build the index tuple here,
- * but some index AMs want to do further processing on the data
- * first. So pass the values[] and isnull[] arrays, instead.
- */
-
- /*
- * If the tuple is already committed dead, you might think we
- * could suppress uniqueness checking, but this is no longer true
- * in the presence of HOT, because the insert is actually a proxy
- * for a uniqueness check on the whole HOT-chain. That is, the
- * tuple we have here could be dead because it was already
- * HOT-updated, and if so the updating transaction will not have
- * thought it should insert index entries. The index AM will
- * check the whole HOT-chain and correctly detect a conflict if
- * there is one.
- */
+ /* Reset the per-tuple memory context for the next fetch. */
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+ state->htups += 1;
- index_insert(indexRelation,
- values,
- isnull,
- &rootTuple,
- heapRelation,
- indexInfo->ii_Unique ?
- UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- false,
- indexInfo);
-
- state->tups_inserted += 1;
+ /*
+ * Fetch the tuple from the heap to see if it's visible
+ * under our snapshot. If it is, form the index key values
+ * and insert a new entry into the target index.
+ */
+ if (heapam_index_fetch_tuple(fetch, tid, snapshot, slot, &call_again, &all_dead))
+ {
+
+ /* Compute the key values and null flags for this tuple. */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ /*
+ * Insert the tuple into the target index.
+ */
+ index_insert(indexRelation,
+ values,
+ isnull,
+ auxindexcursor, /* insert root tuple */
+ heapRelation,
+ indexInfo->ii_Unique ?
+ UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+ false,
+ indexInfo);
+
+ state->tups_inserted += 1;
+
+ elog(DEBUG5, "inserted tid: (%u,%u), root: (%u, %u)",
+ ItemPointerGetBlockNumber(auxindexcursor),
+ ItemPointerGetOffsetNumber(auxindexcursor),
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
+ }
+ else
+ {
+ /*
+ * The tuple wasn't visible under our snapshot. We
+ * skip inserting it into the target index because
+ * from our perspective, it doesn't exist.
+ */
+ elog(DEBUG5, "skipping insert to target index because tid not visible: (%u,%u)",
+ ItemPointerGetBlockNumber(auxindexcursor),
+ ItemPointerGetOffsetNumber(auxindexcursor));
+ }
+ }
}
}
- table_endscan(scan);
-
ExecDropSingleTupleTableSlot(slot);
FreeExecutorState(estate);
+ heapam_index_fetch_end(fetch);
+
+ /*
+ * Drop the reference snapshot. We must do this before waiting out other
+ * snapshot holders, else we will deadlock against other processes also
+ * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+ * they must wait for.
+ */
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+ InvalidateCatalogSnapshot();
+ Assert(MyProc->xmin == InvalidTransactionId);
+#if USE_INJECTION_POINTS
+ if (MyProc->xid == InvalidTransactionId)
+ INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
/* These may have been pointing to the now-gone estate */
indexInfo->ii_ExpressionsState = NIL;
indexInfo->ii_PredicateState = NULL;
+
+ return limitXmin;
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7ff7ab6c72a..8b14f66affc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -719,6 +719,9 @@ UpdateIndexRelation(Oid indexoid,
* allow_system_table_mods: allow table to be a system catalog
* is_internal: if true, post creation hook for new index
* constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases
+ *		it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception here.
*
* Returns the OID of the created index.
*/
@@ -743,7 +746,8 @@ index_create(Relation heapRelation,
bits16 constr_flags,
bool allow_system_table_mods,
bool is_internal,
- Oid *constraintId)
+ Oid *constraintId,
+ char relpersistence)
{
Oid heapRelationId = RelationGetRelid(heapRelation);
Relation pg_class;
@@ -754,11 +758,11 @@ index_create(Relation heapRelation,
bool is_exclusion;
Oid namespaceId;
int i;
- char relpersistence;
bool isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
bool invalid = (flags & INDEX_CREATE_INVALID) != 0;
bool concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
bool partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+ bool auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
char relkind;
TransactionId relfrozenxid;
MultiXactId relminmxid;
@@ -784,7 +788,6 @@ index_create(Relation heapRelation,
namespaceId = RelationGetNamespace(heapRelation);
shared_relation = heapRelation->rd_rel->relisshared;
mapped_relation = RelationIsMapped(heapRelation);
- relpersistence = heapRelation->rd_rel->relpersistence;
/*
* check parameters
@@ -792,6 +795,11 @@ index_create(Relation heapRelation,
if (indexInfo->ii_NumIndexAttrs < 1)
elog(ERROR, "must index at least one column");
+ if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("user-defined indexes with STIR access method are not supported")));
+
if (!allow_system_table_mods &&
IsSystemRelation(heapRelation) &&
IsNormalProcessingMode())
@@ -1462,7 +1470,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
0,
true, /* allow table to be a system catalog? */
false, /* is_internal? */
- NULL);
+ NULL,
+ heapRelation->rd_rel->relpersistence);
/* Close the relations used and clean up */
index_close(indexRelation, NoLock);
@@ -1472,6 +1481,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller. The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+ Oid tablespaceOid, const char *newName)
+{
+ Relation indexRelation;
+ IndexInfo *oldInfo,
+ *newInfo;
+ Oid newIndexId = InvalidOid;
+ HeapTuple indexTuple;
+
+ List *indexColNames = NIL;
+ List *indexExprs = NIL;
+ List *indexPreds = NIL;
+
+ Oid *auxOpclassIds;
+ int16 *auxColoptions;
+
+ indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+ /* The new index needs some information from the old index */
+ oldInfo = BuildIndexInfo(indexRelation);
+
+ /*
+ * Build of an auxiliary index with exclusion constraints is not
+ * supported.
+ */
+ if (oldInfo->ii_ExclusionOps != NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+ /* Get the array of class and column options IDs from index info */
+ indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+ if (!HeapTupleIsValid(indexTuple))
+ elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+ /*
+ * Fetch the list of expressions and predicates directly from the
+ * catalogs. This cannot rely on the information from IndexInfo of the
+ * old index as these have been flattened for the planner.
+ */
+ if (oldInfo->ii_Expressions != NIL)
+ {
+ Datum exprDatum;
+ char *exprString;
+
+ exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+ Anum_pg_index_indexprs);
+ exprString = TextDatumGetCString(exprDatum);
+ indexExprs = (List *) stringToNode(exprString);
+ pfree(exprString);
+ }
+ if (oldInfo->ii_Predicate != NIL)
+ {
+ Datum predDatum;
+ char *predString;
+
+ predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+ Anum_pg_index_indpred);
+ predString = TextDatumGetCString(predDatum);
+ indexPreds = (List *) stringToNode(predString);
+
+ /* Also convert to implicit-AND format */
+ indexPreds = make_ands_implicit((Expr *) indexPreds);
+ pfree(predString);
+ }
+
+ /*
+ * Build the index information for the new index. Note that rebuild of
+ * indexes with exclusion constraints is not supported, hence there is no
+ * need to fill all the ii_Exclusion* fields.
+ */
+ newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+ oldInfo->ii_NumIndexKeyAttrs,
+ STIR_AM_OID, /* special AM for aux indexes */
+ indexExprs,
+ indexPreds,
+							false,	/* aux indexes are not unique */
+ oldInfo->ii_NullsNotDistinct,
+ false, /* not ready for inserts */
+ true,
+							false,	/* aux indexes are not summarizing */
+ oldInfo->ii_WithoutOverlaps);
+
+ /*
+ * Extract the list of column names and the column numbers for the new
+ * index information. All this information will be used for the index
+ * creation.
+ */
+ for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+ {
+ TupleDesc indexTupDesc = RelationGetDescr(indexRelation);
+ Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+ indexColNames = lappend(indexColNames, NameStr(att->attname));
+ newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+ }
+
+ auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+ auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+ /* Fill with "any ops" */
+ for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+ {
+ auxOpclassIds[i] = ANY_STIR_OPS_OID;
+ auxColoptions[i] = 0;
+ }
+
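+	/*
+	 * The auxiliary index is created unlogged: its contents are discarded
+	 * once the main index becomes valid, and a crash during the build
+	 * leaves the main index invalid anyway, so WAL-logging it would be
+	 * wasted effort.
+	 */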
+ newIndexId = index_create(heapRelation,
+ newName,
+ InvalidOid, /* indexRelationId */
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidRelFileNumber, /* relFileNumber */
+ newInfo,
+ indexColNames,
+ STIR_AM_OID,
+ tablespaceOid,
+ indexRelation->rd_indcollation,
+ auxOpclassIds,
+ NULL,
+ auxColoptions,
+ NULL,
+ (Datum) 0,
+ INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+ 0,
+ true, /* allow table to be a system catalog? */
+ false, /* is_internal? */
+ NULL,
+ RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+ /* Close the relations used and clean up */
+ index_close(indexRelation, NoLock);
+ ReleaseSysCache(indexTuple);
+
+ return newIndexId;
+}
+
/*
* index_concurrently_build
*
@@ -1483,7 +1640,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
*/
void
index_concurrently_build(Oid heapRelationId,
- Oid indexRelationId)
+ Oid indexRelationId,
+ bool auxiliary)
{
Relation heapRel;
Oid save_userid;
@@ -1524,6 +1682,7 @@ index_concurrently_build(Oid heapRelationId,
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
+ indexInfo->ii_Auxiliary = auxiliary;
Assert(!TransactionIdIsValid(MyProc->xmin));
/* Now build the index */
@@ -3276,12 +3435,20 @@ IndexCheckExclusion(Relation heapRelation,
*
* We do a concurrent index build by first inserting the catalog entry for the
* index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index in the same way, based on the STIR AM.
* Then we commit our transaction and start a new one, then we wait for all
* transactions that could have been modifying the table to terminate. Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
- * honor its constraints on HOT updates; so while existing HOT-chains might
+ * honor their constraints on HOT updates; so while existing HOT-chains might
* be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it. We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index. This is a fast operation without any
+ * actual table scan, resulting in an empty STIR index. We wait again for
+ * all transactions that could have been modifying the table to terminate.
+ * From that moment on, all new tuples are inserted into the auxiliary index.
+ *
+ * We now build the index normally via
* index_build(), while holding a weak lock that allows concurrent
* insert/update/delete. Also, we index only tuples that are valid
* as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3292,6 +3459,7 @@ IndexCheckExclusion(Relation heapRelation,
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
* does not contain any tuples added to the table while we built the index.
+ * But these tuples are contained in the auxiliary index.
*
* Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
* snapshot to be set as active every so often. The reason for that is to
@@ -3301,8 +3469,10 @@ IndexCheckExclusion(Relation heapRelation,
* commit the second transaction and start a third. Again we wait for all
* transactions that could have been modifying the table to terminate. Now
* we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it. We then take a new reference snapshot
- * which is passed to validate_index(). Any tuples that are valid according
+ * insert their new tuples into it. At that moment we clear "indisready" for
+ * the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
* to this snap, but are not in the index, must be added to the index.
* (Any tuples committed live after the snap will be inserted into the
* index by their originating transaction. Any tuples committed dead before
@@ -3310,12 +3480,14 @@ IndexCheckExclusion(Relation heapRelation,
* that might care about them before we mark the index valid.)
*
* validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
* ever say "delete it". (This should be faster than a plain indexscan;
* also, not all index AMs support full-index indexscan.) Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index. Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join"
+ * against the TID lists to see which tuples from the auxiliary index are
+ * missing from the target index. Thus we ensure that all tuples valid
+ * according to the reference snapshot are in the index. Notice we need to
+ * do the bulkdeletes in a particular order: auxiliary first, target last.
*
* Building a unique index this way is tricky: we might try to insert a
* tuple that is already dead or is in process of being deleted, and we
@@ -3331,24 +3503,25 @@ IndexCheckExclusion(Relation heapRelation,
* necessary to be sure there are none left with a transaction snapshot
* older than the reference (and hence possibly able to see tuples we did
* not index). Then we mark the index "indisvalid" and commit. Subsequent
- * transactions will be able to use it for queries.
- *
- * Doing two full table scans is a brute-force strategy. We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm). However that would
- * add yet more locking issues.
+ * transactions will be able to use it for queries. The auxiliary index
+ * is then dropped.
*/
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
{
Relation heapRelation,
- indexRelation;
+ indexRelation,
+ auxIndexRelation;
IndexInfo *indexInfo;
- IndexVacuumInfo ivinfo;
- ValidateIndexState state;
+ TransactionId limitXmin;
+ IndexVacuumInfo ivinfo, auxivinfo;
+ ValidateIndexState state, auxState;
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+	/*
+	 * Use 80% of maintenance_work_mem for the target index sort and the
+	 * rest for the auxiliary one.
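+	 *
+	 * For example, with maintenance_work_mem set to 64MB, the target sort
+	 * gets roughly 51MB and the auxiliary sort the remaining ~13MB.
+	 */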
+ int main_work_mem_part = (maintenance_work_mem * 8) / 10;
{
const int progress_index[] = {
@@ -3381,13 +3554,18 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
RestrictSearchPath();
indexRelation = index_open(indexId, RowExclusiveLock);
+ auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
/*
* Fetch info needed for index_insert. (You might think this should be
* passed in from DefineIndex, but its copy is long gone due to having
* been built in a previous transaction.)
+ *
+	 * We might need a snapshot for index expressions or predicates.
*/
+ PushActiveSnapshot(GetTransactionSnapshot());
indexInfo = BuildIndexInfo(indexRelation);
+ PopActiveSnapshot();
/* mark build is concurrent just for consistency */
indexInfo->ii_Concurrent = true;
@@ -3405,15 +3583,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
ivinfo.strategy = NULL;
ivinfo.validate_index = true;
+ /*
+	 * Copy all the info into the auxiliary counterpart, changing only the relation.
+ */
+ auxivinfo = ivinfo;
+ auxivinfo.index = auxIndexRelation;
+
/*
* Encode TIDs as int8 values for the sort, rather than directly sorting
* item pointers. This can be significantly faster, primarily because TID
* is a pass-by-reference type on all platforms, whereas int8 is
* pass-by-value on most platforms.
*/
+ auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+ InvalidOid, false,
+ maintenance_work_mem - main_work_mem_part,
+ NULL, TUPLESORT_NONE);
+ auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+ (void) index_bulk_delete(&auxivinfo, NULL,
+ validate_index_callback, &auxState);
+
state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
InvalidOid, false,
- maintenance_work_mem,
+ main_work_mem_part,
NULL, TUPLESORT_NONE);
state.htups = state.itups = state.tups_inserted = 0;
@@ -3436,27 +3629,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
pgstat_progress_update_multi_param(3, progress_index, progress_vals);
}
tuplesort_performsort(state.tuplesort);
+ tuplesort_performsort(auxState.tuplesort);
+
+ InvalidateCatalogSnapshot();
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/*
- * Now scan the heap and "merge" it with the index
+	 * Now merge the contents of both indexes
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
- table_index_validate_scan(heapRelation,
- indexRelation,
- indexInfo,
- snapshot,
- &state);
+ PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+ limitXmin = table_index_validate_scan(heapRelation,
+ indexRelation,
+ indexInfo,
+ &state,
+ &auxState);
- /* Done with tuplesort object */
+ /* Done with tuplesort objects */
tuplesort_end(state.tuplesort);
+ tuplesort_end(auxState.tuplesort);
/* Make sure to release resources cached in indexInfo (if needed). */
index_insert_cleanup(indexRelation, indexInfo);
elog(DEBUG2,
- "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
- state.htups, state.itups, state.tups_inserted);
+ "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+ " %.0f aux index tuples; inserted %.0f missing tuples",
+ state.htups, state.itups, auxState.itups, state.tups_inserted);
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -3465,8 +3664,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
SetUserIdAndSecContext(save_userid, save_sec_context);
/* Close rels, but keep locks */
+ index_close(auxIndexRelation, NoLock);
index_close(indexRelation, NoLock);
table_close(heapRelation, NoLock);
+
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ return limitXmin;
}
/*
@@ -3525,6 +3728,13 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
Assert(!indexForm->indisvalid);
indexForm->indisvalid = true;
break;
+ case INDEX_DROP_CLEAR_READY:
+ /* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+ Assert(indexForm->indislive);
+ Assert(indexForm->indisready);
+ Assert(!indexForm->indisvalid);
+ indexForm->indisready = false;
+ break;
case INDEX_DROP_CLEAR_VALID:
/*
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index ad3082c62ac..fbbcd7d00dd 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
BTREE_AM_OID,
rel->rd_rel->reltablespace,
collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
- INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+ INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+ toast_rel->rd_rel->relpersistence);
table_close(toast_rel, NoLock);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a02729911fe..02b636a0050 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -554,6 +554,7 @@ DefineIndex(Oid tableId,
{
bool concurrent;
char *indexRelationName;
+ char *auxIndexRelationName = NULL;
char *accessMethodName;
Oid *typeIds;
Oid *collationIds;
@@ -563,6 +564,7 @@ DefineIndex(Oid tableId,
Oid namespaceId;
Oid tablespaceId;
Oid createdConstraintId = InvalidOid;
+ Oid auxIndexRelationId = InvalidOid;
List *indexColNames;
List *allIndexParams;
Relation rel;
@@ -584,10 +586,10 @@ DefineIndex(Oid tableId,
int numberOfKeyAttributes;
TransactionId limitXmin;
ObjectAddress address;
+ ObjectAddress auxAddress;
LockRelId heaprelid;
LOCKTAG heaplocktag;
LOCKMODE lockmode;
- Snapshot snapshot;
Oid root_save_userid;
int root_save_sec_context;
int root_save_nestlevel;
@@ -834,6 +836,15 @@ DefineIndex(Oid tableId,
stmt->excludeOpNames,
stmt->primary,
stmt->isconstraint);
+ /*
+	 * Select a name for the auxiliary index (e.g. "idx" becomes "idx_ccaux")
+ */
+ if (concurrent)
+ auxIndexRelationName = ChooseRelationName(indexRelationName,
+ NULL,
+ "ccaux",
+ namespaceId,
+ false);
/*
* look up the access method, verify it can handle the requested features
@@ -1227,7 +1238,8 @@ DefineIndex(Oid tableId,
coloptions, NULL, reloptions,
flags, constr_flags,
allowSystemTableMods, !check_rights,
- &createdConstraintId);
+ &createdConstraintId,
+ rel->rd_rel->relpersistence);
ObjectAddressSet(address, RelationRelationId, indexRelationId);
@@ -1569,6 +1581,16 @@ DefineIndex(Oid tableId,
return address;
}
+ /*
+ * In case of concurrent build - create auxiliary index record.
+ */
+ if (concurrent)
+ {
+ auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+ tablespaceId, auxIndexRelationName);
+ ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+ }
+
AtEOXact_GUC(false, root_save_nestlevel);
SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
@@ -1597,11 +1619,11 @@ DefineIndex(Oid tableId,
/*
* For a concurrent build, it's important to make the catalog entries
* visible to other transactions before we start to build the index. That
- * will prevent them from making incompatible HOT updates. The new index
- * will be marked not indisready and not indisvalid, so that no one else
- * tries to either insert into it or use it for queries.
+ * will prevent them from making incompatible HOT updates. New indexes
+ * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to either insert into them or use them for queries.
*
- * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
* visible; then start another. Note that all the data structures we just
* built are lost in the commit. The only data we keep past here are the
* relation IDs.
@@ -1611,7 +1633,7 @@ DefineIndex(Oid tableId,
* cannot block, even if someone else is waiting for access, because we
* already have the same lock within our transaction.
*
- * Note: we don't currently bother with a session lock on the index,
+ * Note: we don't currently bother with a session lock on the indexes,
-	 * because there are no operations that could change its state while we
+	 * because there are no operations that could change their state while we
* hold lock on the parent table. This might need to change later.
*/
@@ -1632,14 +1654,16 @@ DefineIndex(Oid tableId,
{
const int progress_cols[] = {
PROGRESS_CREATEIDX_INDEX_OID,
+ PROGRESS_CREATEIDX_AUX_INDEX_OID,
PROGRESS_CREATEIDX_PHASE
};
const int64 progress_vals[] = {
indexRelationId,
+ auxIndexRelationId,
PROGRESS_CREATEIDX_PHASE_WAIT_1
};
- pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
+ pgstat_progress_update_multi_param(3, progress_cols, progress_vals);
}
/*
@@ -1650,7 +1674,7 @@ DefineIndex(Oid tableId,
* with the old list of indexes. Use ShareLock to consider running
* transactions that hold locks that permit writing to the table. Note we
* do not need to worry about xacts that open the table for writing after
- * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open the table.
*
* Note: the reason we use actual lock acquisition here, rather than just
* checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1662,15 +1686,39 @@ DefineIndex(Oid tableId,
/*
* At this moment we are sure that there are no transactions with the
- * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
* indexes. We have waited out all the existing transactions and any new
- * transaction will have the new index in its list, but the index is still
- * marked as "not-ready-for-inserts". The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are still
+ * marked as "not-ready-for-inserts". The indexes are consulted while
* deciding HOT-safety though. This arrangement ensures that no new HOT
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
- * We build the index using all tuples that are visible using multiple
+	 * Now call build on the auxiliary index. The index is created empty,
+	 * without any actual heap scan, but marked as "ready-for-inserts". Its
+	 * goal is to accumulate new tuples while the main index is being built.
+ */
+ index_concurrently_build(tableId, auxIndexRelationId, true);
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /* Tell concurrent index builds to ignore us, if index qualifies */
+ if (safe_index)
+ set_indexsafe_procflags();
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_2);
+ /*
+	 * Now we need to ensure there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+ */
+ WaitForLockers(heaplocktag, ShareLock, true);
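+
+	/*
+	 * ShareLock conflicts with the RowExclusiveLock taken by writers, so
+	 * once this wait finishes, every remaining writer started after the
+	 * auxiliary index became ready-for-inserts.
+	 */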
+
+ /*
+	 * At this moment we are sure that all new tuples in the table are
+	 * inserted into the auxiliary index. Now it is time to build the target
+	 * index itself.
+ *
+ * We build that index using all tuples that are visible using multiple
* refreshing snapshots. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
@@ -1679,7 +1727,7 @@ DefineIndex(Oid tableId,
*/
/* Perform concurrent build of index */
- index_concurrently_build(tableId, indexRelationId);
+ index_concurrently_build(tableId, indexRelationId, false);
/*
* Commit this transaction to make the indisready update visible.
@@ -1698,43 +1746,28 @@ DefineIndex(Oid tableId,
* the index marked as read-only for updates.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_2);
+ PROGRESS_CREATEIDX_PHASE_WAIT_3);
WaitForLockers(heaplocktag, ShareLock, true);
/*
- * Now take the "reference snapshot" that will be used by validate_index()
- * to filter candidate tuples. Beware! There might still be snapshots in
- * use that treat some transaction as in-progress that our reference
- * snapshot treats as committed. If such a recently-committed transaction
- * deleted tuples in the table, we will not include them in the index; yet
- * those transactions which see the deleting one as still-in-progress will
- * expect such tuples to be there once we mark the index as valid.
- *
- * We solve this by waiting for all endangered transactions to exit before
- * we mark the index as valid.
- *
- * We also set ActiveSnapshot to this snap, since functions in indexes may
- * need a snapshot.
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
*/
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
-
+ PushActiveSnapshot(GetTransactionSnapshot());
/*
- * Scan the index and the heap, insert any missing index entries.
+	 * Now the target index is marked as "ready" for all transactions, so
+	 * the auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
*/
- validate_index(tableId, indexRelationId, snapshot);
-
- /*
- * Drop the reference snapshot. We must do this before waiting out other
- * snapshot holders, else we will deadlock against other processes also
- * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
- * they must wait for. But first, save the snapshot's xmin to use as
- * limitXmin for GetCurrentVirtualXIDs().
- */
- limitXmin = snapshot->xmin;
-
+ index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
PopActiveSnapshot();
- UnregisterSnapshot(snapshot);
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+	 * Merge the contents of the auxiliary and target indexes - insert any
+	 * missing index entries into the target.
+ */
+ limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
/*
* The snapshot subsystem could still contain registered snapshots that
@@ -1747,6 +1780,49 @@ DefineIndex(Oid tableId,
CommitTransactionCommand();
StartTransactionCommand();
+ /* Tell concurrent index builds to ignore us, if index qualifies */
+ if (safe_index)
+ set_indexsafe_procflags();
+
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+ index_concurrently_set_dead(tableId, auxIndexRelationId);
+ PopActiveSnapshot();
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+ * Because we don't take a snapshot in this transaction, there's no need
+ * to set the PROC_IN_SAFE_IC flag here.
+ */
+
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_4);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+ WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /*
+	 * Drop the auxiliary index.
+ *
+ * Because we don't take a snapshot in this transaction, there's no need
+ * to set the PROC_IN_SAFE_IC flag here.
+ *
+ * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+ * right lock level.
+ */
+ performDeletion(&auxAddress, DROP_RESTRICT,
+ PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
/* Tell concurrent index builds to ignore us, if index qualifies */
if (safe_index)
set_indexsafe_procflags();
@@ -1757,12 +1833,12 @@ DefineIndex(Oid tableId,
/*
* The index is now valid in the sense that it contains all currently
* interesting tuples. But since it might not contain tuples deleted just
- * before the reference snap was taken, we have to wait out any
- * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to wait
+ * out any transactions that might have older snapshots.
*/
INJECTION_POINT("define_index_before_set_valid");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_3);
+ PROGRESS_CREATEIDX_PHASE_WAIT_5);
WaitForOlderSnapshots(limitXmin, true);
/*
@@ -3542,6 +3618,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
typedef struct ReindexIndexInfo
{
Oid indexId;
+ Oid auxIndexId;
Oid tableId;
Oid amId;
bool safe; /* for set_indexsafe_procflags */
@@ -3563,9 +3640,10 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
PROGRESS_CREATEIDX_COMMAND,
PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_INDEX_OID,
+ PROGRESS_CREATEIDX_AUX_INDEX_OID,
PROGRESS_CREATEIDX_ACCESS_METHOD_OID
};
- int64 progress_vals[4];
+ int64 progress_vals[5];
/*
* Create a memory context that will survive forced transaction commits we
@@ -3865,15 +3943,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
foreach(lc, indexIds)
{
char *concurrentName;
+ char *auxConcurrentName;
ReindexIndexInfo *idx = lfirst(lc);
ReindexIndexInfo *newidx;
Oid newIndexId;
+ Oid auxIndexId;
Relation indexRel;
Relation heapRel;
Oid save_userid;
int save_sec_context;
int save_nestlevel;
Relation newIndexRel;
+ Relation auxIndexRel;
LockRelId *lockrelid;
Oid tablespaceid;
@@ -3915,8 +3996,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
progress_vals[1] = 0; /* initializing */
progress_vals[2] = idx->indexId;
- progress_vals[3] = idx->amId;
- pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+ progress_vals[3] = InvalidOid;
+ progress_vals[4] = idx->amId;
+ pgstat_progress_update_multi_param(5, progress_index, progress_vals);
/* Choose a temporary relation name for the new index */
concurrentName = ChooseRelationName(get_rel_name(idx->indexId),
@@ -3924,6 +4006,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
"ccnew",
get_rel_namespace(indexRel->rd_index->indrelid),
false);
+ auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+ NULL,
+ "ccaux",
+ get_rel_namespace(indexRel->rd_index->indrelid),
+ false);
/* Choose the new tablespace, indexes of toast tables are not moved */
if (OidIsValid(params->tablespaceOid) &&
@@ -3937,12 +4024,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
idx->indexId,
tablespaceid,
concurrentName);
+ auxIndexId = index_concurrently_create_aux(heapRel,
+ idx->indexId,
+ tablespaceid,
+ auxConcurrentName);
/*
* Now open the relation of the new index, a session-level lock is
* also needed on it.
*/
newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+ auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
/*
* Save the list of OIDs and locks in private context
@@ -3951,6 +4043,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
newidx = palloc_object(ReindexIndexInfo);
newidx->indexId = newIndexId;
+ newidx->auxIndexId = auxIndexId;
newidx->safe = idx->safe;
newidx->tableId = idx->tableId;
newidx->amId = idx->amId;
@@ -3969,10 +4062,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
lockrelid = palloc_object(LockRelId);
*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
relationLocks = lappend(relationLocks, lockrelid);
+ lockrelid = palloc_object(LockRelId);
+ *lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+ relationLocks = lappend(relationLocks, lockrelid);
MemoryContextSwitchTo(oldcontext);
index_close(indexRel, NoLock);
+ index_close(auxIndexRel, NoLock);
index_close(newIndexRel, NoLock);
/* Roll back any GUC changes executed by index functions */
@@ -4053,13 +4150,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* doing that, wait until no running transactions could have the table of
* the index open with the old list of indexes. See "phase 2" in
* DefineIndex() for more details.
+ */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_1);
+ WaitForLockersMultiple(lockTags, ShareLock, true);
+ CommitTransactionCommand();
+
+ /*
+ * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+ */
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+
+ StartTransactionCommand();
+
+ /*
+ * Check for user-requested abort. This is inside a transaction so as
+ * xact.c does not issue a useless WARNING, and ensures that
+ * session-level locks are cleaned up on abort.
+ */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Tell concurrent indexing to ignore us, if index qualifies */
+ if (newidx->safe)
+ set_indexsafe_procflags();
+
+		/*
+		 * Build the auxiliary index. This is fast - no actual heap scan is
+		 * involved, just the creation of an empty index.
+		 */
+ index_concurrently_build(newidx->tableId, newidx->auxIndexId, true);
+
+ CommitTransactionCommand();
+ }
+
+ StartTransactionCommand();
+
+ /*
+ * Because we don't take a snapshot in this transaction, there's no need
+ * to set the PROC_IN_SAFE_IC flag here.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_1);
+ PROGRESS_CREATEIDX_PHASE_WAIT_2);
+ /*
+ * Wait until all auxiliary indexes are taken into account by all
+ * transactions.
+ */
WaitForLockersMultiple(lockTags, ShareLock, true);
CommitTransactionCommand();
+	/* Now it is time to perform the target index builds. */
foreach(lc, newIndexIds)
{
ReindexIndexInfo *newidx = lfirst(lc);
@@ -4086,11 +4225,12 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
progress_vals[1] = PROGRESS_CREATEIDX_PHASE_BUILD;
progress_vals[2] = newidx->indexId;
- progress_vals[3] = newidx->amId;
- pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+ progress_vals[3] = newidx->auxIndexId;
+ progress_vals[4] = newidx->amId;
+ pgstat_progress_update_multi_param(5, progress_index, progress_vals);
/* Perform concurrent build of new index */
- index_concurrently_build(newidx->tableId, newidx->indexId);
+ index_concurrently_build(newidx->tableId, newidx->indexId, false);
CommitTransactionCommand();
}
@@ -4102,24 +4242,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* need to set the PROC_IN_SAFE_IC flag here.
*/
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_3);
+ WaitForLockersMultiple(lockTags, ShareLock, true);
+ CommitTransactionCommand();
+
+ /*
+	 * At this moment all target indexes are marked as "ready-for-inserts",
+	 * so we are free to start the process of dropping the auxiliary indexes.
+ */
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+ StartTransactionCommand();
+ /*
+ * Check for user-requested abort. This is inside a transaction so as
+ * xact.c does not issue a useless WARNING, and ensures that
+ * session-level locks are cleaned up on abort.
+ */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Tell concurrent indexing to ignore us, if index qualifies */
+ if (newidx->safe)
+ set_indexsafe_procflags();
+
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
+ index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+ PopActiveSnapshot();
+
+ CommitTransactionCommand();
+ }
+
/*
* Phase 3 of REINDEX CONCURRENTLY
*
- * During this phase the old indexes catch up with any new tuples that
+ * During this phase the new indexes catch up with any new tuples that
* were created during the previous phase. See "phase 3" in DefineIndex()
* for more details.
*/
-
- pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_2);
- WaitForLockersMultiple(lockTags, ShareLock, true);
- CommitTransactionCommand();
-
foreach(lc, newIndexIds)
{
ReindexIndexInfo *newidx = lfirst(lc);
TransactionId limitXmin;
- Snapshot snapshot;
StartTransactionCommand();
@@ -4134,13 +4302,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
if (newidx->safe)
set_indexsafe_procflags();
- /*
- * Take the "reference snapshot" that will be used by validate_index()
- * to filter candidate tuples.
- */
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4149,19 +4310,12 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
progress_vals[1] = PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN;
progress_vals[2] = newidx->indexId;
- progress_vals[3] = newidx->amId;
- pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+ progress_vals[3] = newidx->auxIndexId;
+ progress_vals[4] = newidx->amId;
+ pgstat_progress_update_multi_param(5, progress_index, progress_vals);
- validate_index(newidx->tableId, newidx->indexId, snapshot);
-
- /*
- * We can now do away with our active snapshot, we still need to save
- * the xmin limit to wait for older snapshots.
- */
- limitXmin = snapshot->xmin;
-
- PopActiveSnapshot();
- UnregisterSnapshot(snapshot);
+ limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/*
* To ensure no deadlocks, we must commit and start yet another
@@ -4181,7 +4335,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* there's no need to set the PROC_IN_SAFE_IC flag here.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_3);
+ PROGRESS_CREATEIDX_PHASE_WAIT_4);
WaitForOlderSnapshots(limitXmin, true);
CommitTransactionCommand();
@@ -4271,14 +4425,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/*
* Phase 5 of REINDEX CONCURRENTLY
*
- * Mark the old indexes as dead. First we must wait until no running
- * transaction could be using the index for a query. See also
+ * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using them for a query. See also
* index_drop() for more details.
*/
INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_4);
+ PROGRESS_CREATEIDX_PHASE_WAIT_5);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
foreach(lc, indexIds)
@@ -4303,6 +4457,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
PopActiveSnapshot();
}
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+
+ /*
+ * Check for user-requested abort. This is inside a transaction so as
+ * xact.c does not issue a useless WARNING, and ensures that
+ * session-level locks are cleaned up on abort.
+ */
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
+
+ index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+ PopActiveSnapshot();
+ }
+
/* Commit this transaction to make the updates visible. */
CommitTransactionCommand();
StartTransactionCommand();
@@ -4316,11 +4492,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/*
* Phase 6 of REINDEX CONCURRENTLY
*
- * Drop the old indexes.
+ * Drop the old and auxiliary indexes.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_5);
+ PROGRESS_CREATEIDX_PHASE_WAIT_6);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
PushActiveSnapshot(GetTransactionSnapshot());
@@ -4340,6 +4516,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
add_exact_object_address(&object, objects);
}
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *idx = lfirst(lc);
+ ObjectAddress object;
+
+ object.classId = RelationRelationId;
+ object.objectId = idx->auxIndexId;
+ object.objectSubId = 0;
+
+ add_exact_object_address(&object, objects);
+ }
+
/*
* Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
* right lock level.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0ecc3147bbd..fa1bdca7e2b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,11 +714,11 @@ typedef struct TableAmRoutine
TableScanDesc scan);
/* see table_index_validate_scan for reference about parameters */
- void (*index_validate_scan) (Relation table_rel,
- Relation index_rel,
- struct IndexInfo *index_info,
- Snapshot snapshot,
- struct ValidateIndexState *state);
+ TransactionId (*index_validate_scan) (Relation table_rel,
+ Relation index_rel,
+ struct IndexInfo *index_info,
+ struct ValidateIndexState *state,
+ struct ValidateIndexState *aux_state);
/* ------------------------------------------------------------------------
@@ -1862,22 +1862,22 @@ table_index_build_range_scan(Relation table_rel,
}
/*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
*
* See validate_index() for an explanation.
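+ *
+ * Returns the xmin of the snapshot used during validation; the caller
+ * must wait out transactions with older snapshots before marking the
+ * index valid.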
*/
-static inline void
+static inline TransactionId
table_index_validate_scan(Relation table_rel,
Relation index_rel,
struct IndexInfo *index_info,
- Snapshot snapshot,
- struct ValidateIndexState *state)
+ struct ValidateIndexState *state,
+ struct ValidateIndexState *auxstate)
{
- table_rel->rd_tableam->index_validate_scan(table_rel,
- index_rel,
- index_info,
- snapshot,
- state);
+ return table_rel->rd_tableam->index_validate_scan(table_rel,
+ index_rel,
+ index_info,
+ state,
+ auxstate);
}
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c3..82d0d6b46d3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
{
INDEX_CREATE_SET_READY,
INDEX_CREATE_SET_VALID,
+ INDEX_DROP_CLEAR_READY,
INDEX_DROP_CLEAR_VALID,
INDEX_DROP_SET_DEAD,
} IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
#define INDEX_CREATE_IF_NOT_EXISTS (1 << 4)
#define INDEX_CREATE_PARTITIONED (1 << 5)
#define INDEX_CREATE_INVALID (1 << 6)
+#define INDEX_CREATE_AUXILIARY (1 << 7)
extern Oid index_create(Relation heapRelation,
const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid index_create(Relation heapRelation,
bits16 constr_flags,
bool allow_system_table_mods,
bool is_internal,
- Oid *constraintId);
+ Oid *constraintId,
+ char relpersistence);
#define INDEX_CONSTR_CREATE_MARK_AS_PRIMARY (1 << 0)
#define INDEX_CONSTR_CREATE_DEFERRABLE (1 << 1)
@@ -100,8 +103,14 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern Oid index_concurrently_create_aux(Relation heapRelation,
+ Oid mainIndexId,
+ Oid tablespaceOid,
+ const char *newName);
+
extern void index_concurrently_build(Oid heapRelationId,
- Oid indexRelationId);
+ Oid indexRelationId,
+ bool auxiliary);
extern void index_concurrently_swap(Oid newIndexId,
Oid oldIndexId,
@@ -145,7 +154,7 @@ extern void index_build(Relation heapRelation,
bool isreindex,
bool parallel);
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d645230..89f8d02fdc3 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -88,6 +88,7 @@
#define PROGRESS_CREATEIDX_TUPLES_DONE 12
#define PROGRESS_CREATEIDX_PARTITIONS_TOTAL 13
#define PROGRESS_CREATEIDX_PARTITIONS_DONE 14
+#define PROGRESS_CREATEIDX_AUX_INDEX_OID 15
/* 15 and 16 reserved for "block number" metrics */
/* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
@@ -96,10 +97,11 @@
#define PROGRESS_CREATEIDX_PHASE_WAIT_2 3
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN 4
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT 5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN 6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE 6
#define PROGRESS_CREATEIDX_PHASE_WAIT_3 7
#define PROGRESS_CREATEIDX_PHASE_WAIT_4 8
#define PROGRESS_CREATEIDX_PHASE_WAIT_5 9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6 10
/*
* Subphases of CREATE INDEX, for index_build.
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
(1 row)
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach
+-------------------------
+
+(1 row)
+
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
NOTICE: notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
-- The same in parallel mode
ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
NOTICE: notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE: notice triggered for injection point table_parallelscan_initialize
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
NOTICE: notice triggered for injection point table_parallelscan_initialize
+NOTICE: notice triggered for injection point heapam_index_validate_scan_no_xid
DROP INDEX CONCURRENTLY cic_reset_snap.idx;
DROP SCHEMA cic_reset_snap CASCADE;
NOTICE: drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
CREATE SCHEMA cic_reset_snap;
CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 1904eb65bb9..7e008b1cbd9 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL: Key (f1)=(b) already exists.
CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
ERROR: could not create unique index "concur_index3"
DETAIL: Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
-- test that expression indexes and partial indexes work concurrently
CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3015,6 +3016,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
ERROR: could not create unique index "concur_reindex_ind5"
DETAIL: Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
-- Reindexing concurrently this index fails with the same failure.
-- The extra index created is itself invalid, and can be dropped.
REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3027,8 +3029,10 @@ DETAIL: Key (c1)=(1) is duplicated.
c1 | integer | | |
Indexes:
"concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+ "concur_reindex_ind5_ccaux" stir (c1) INVALID
"concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
+DROP INDEX concur_reindex_ind5_ccaux;
DROP INDEX concur_reindex_ind5_ccnew;
-- This makes the previous failure go away, so the index can become valid.
DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
--------------------------------+------------+-----------------------+-------------------------------
parted_isvalid_idx | f | parted_isvalid_tab |
parted_isvalid_idx_11 | f | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux | f | parted_isvalid_tab_11 |
parted_isvalid_tab_12_expr_idx | t | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
parted_isvalid_tab_1_expr_idx | f | parted_isvalid_tab_1 | parted_isvalid_idx
parted_isvalid_tab_2_expr_idx | t | parted_isvalid_tab_2 | parted_isvalid_idx
-(5 rows)
+(6 rows)
drop table parted_isvalid_tab;
-- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c085e05f052..c44e460b0d3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
INSERT INTO concur_heap VALUES ('b','x');
-- check if constraint is enforced properly at build time
CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
-- test that expression indexes and partial indexes work concurrently
CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1239,10 +1240,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
-- This trick creates an invalid index.
CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
-- Reindexing concurrently this index fails with the same failure.
-- The extra index created is itself invalid, and can be dropped.
REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
\d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
DROP INDEX concur_reindex_ind5_ccnew;
-- This makes the previous failure go away, so the index can become valid.
DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
--
2.43.0
Attachment: v9-0008-Concurrently-built-index-validation-uses-fresh-sn.patch
From 103989dcbe91603da753b7e9647ad12df888cfb4 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Tue, 24 Dec 2024 19:17:25 +0100
Subject: [PATCH v9 8/9] Concurrently built index validation uses fresh
snapshots
This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.
The previous approach of using a single reference snapshot could hold back the xmin horizon. Specifically, if validation took a long time, the reference snapshot's xmin became increasingly stale, preventing cleanup of dead tuples across the whole cluster for the duration of the build.
To address this, the validation process now periodically replaces its snapshot with a newer one. This keeps the backend's advertised xmin moving forward while still ensuring that all relevant tuples end up in the index.
The interval for replacing the snapshot is controlled by the `VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL` constant, which is currently set to 1000 milliseconds.
---
src/backend/access/heap/README.HOT | 15 +++++---
src/backend/access/heap/heapam_handler.c | 45 ++++++++++++++++++------
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/catalog/index.c | 7 ++--
src/backend/commands/indexcmds.c | 2 +-
src/include/access/transam.h | 15 ++++++++
6 files changed, 66 insertions(+), 20 deletions(-)
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT. Other transactions must include
such an index when determining HOT-safety of updates, even though they
must ignore it for both insertion and searching purposes.
+Also, a special auxiliary index is created the same way. It is marked as
+"ready for inserts" without any actual table scan. Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
We must do this to avoid making incorrect index entries. For example,
suppose we are building an index on column X and we make an index entry for
a non-HOT tuple with X=1. Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
use the key value from the live tuple.
We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open. Then we take
-a second reference snapshot and validate the index. This searches for
-tuples missing from the index, and inserts any missing ones. Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open. Then we validate
+the index. This uses the auxiliary index to search for tuples missing from
+the target index, and inserts any missing ones that are visible to a fresh
+snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
the value to be inserted is the one from the live tuple.
Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished. This ensures that nobody is
+the latest used snapshot is finished. This ensures that nobody is
alive any longer who could need to see any tuples that might be missing
from the index, as well as ensuring that no one can see any inconsistent
rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ecec3c1c080..1a041c5a77b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1806,27 +1806,35 @@ heapam_index_validate_scan(Relation heapRelation,
fetched;
bool tuplesort_empty = false,
auxtuplesort_empty = false;
+ instr_time snapshotTime,
+ currentTime;
Assert(!HaveRegisteredOrActiveSnapshot());
Assert(!TransactionIdIsValid(MyProc->xmin));
+#define VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL 1000
/*
- * Now take the "reference snapshot" that will be used by to filter candidate
- * tuples. Beware! There might still be snapshots in
- * use that treat some transaction as in-progress that our reference
- * snapshot treats as committed. If such a recently-committed transaction
- * deleted tuples in the table, we will not include them in the index; yet
- * those transactions which see the deleting one as still-in-progress will
- * expect such tuples to be there once we mark the index as valid.
+ * Now take the first snapshot that will be used to filter candidate
+ * tuples. We are going to replace it with a newer snapshot every so often
+ * to let the xmin horizon advance.
+ *
+ * Beware! There might still be snapshots in use that treat some transaction
+ * as in-progress that our temporary snapshot treats as committed.
+ *
+ * If such a recently-committed transaction deleted tuples in the table,
+ * we will not include them in the index; yet those transactions which
+ * see the deleting one as still-in-progress will expect such tuples to
+ * be there once we mark the index as valid.
*
* We solve this by waiting for all endangered transactions to exit before
- * we mark the index as valid.
+ * we mark the index as valid; for that reason limitXmin is maintained.
*
* We also set ActiveSnapshot to this snap, since functions in indexes may
* need a snapshot.
*/
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ snapshot = RegisterSnapshot(GetLatestSnapshot());
PushActiveSnapshot(snapshot);
+ INSTR_TIME_SET_CURRENT(snapshotTime);
limitXmin = snapshot->xmin;
/*
@@ -1868,6 +1876,23 @@ heapam_index_validate_scan(Relation heapRelation,
bool ts_isnull;
CHECK_FOR_INTERRUPTS();
+ INSTR_TIME_SET_CURRENT(currentTime);
+ INSTR_TIME_SUBTRACT(currentTime, snapshotTime);
+ if (INSTR_TIME_GET_MILLISEC(currentTime) >= VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL)
+ {
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+ /* to make sure we propagate xmin */
+ InvalidateCatalogSnapshot();
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+
+ snapshot = RegisterSnapshot(GetLatestSnapshot());
+ PushActiveSnapshot(snapshot);
+ /* xmin should not go backwards, but just in case */
+ limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+ INSTR_TIME_SET_CURRENT(snapshotTime);
+ }
+
/*
* Attempt to fetch the next TID from the auxiliary sort. If it's
* empty, we set auxindexcursor to NULL.
@@ -2020,7 +2045,7 @@ heapam_index_validate_scan(Relation heapRelation,
heapam_index_fetch_end(fetch);
/*
- * Drop the reference snapshot. We must do this before waiting out other
+ * Drop the latest snapshot. We must do this before waiting out other
* snapshot holders, else we will deadlock against other processes also
* doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
* they must wait for.
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 38355601421..60551f82bfa 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -442,7 +442,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
* dead tuples) won't get very full, so we give it only work_mem.
*
* In case of concurrent build dead tuples are not need to be put into index
- * since we wait for all snapshots older than reference snapshot during the
+ * since we wait for all snapshots older than the latest snapshot during the
* validation phase.
*/
if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8b14f66affc..b4df2b1eee6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3472,8 +3472,9 @@ IndexCheckExclusion(Relation heapRelation,
* insert their new tuples into it. At that moment we clear "indisready" for
* auxiliary index, since it is no longer required.
*
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to let xmin advance we reset that snapshot every so often.
* (Any tuples committed live after the snap will be inserted into the
* index by their originating transaction. Any tuples committed dead before
* the snap need not be indexed, because we will wait out all transactions
@@ -3486,7 +3487,7 @@ IndexCheckExclusion(Relation heapRelation,
* TIDs of both auxiliary and target indexes, and doing a "merge join" against
* the TID lists to see which tuples from auxiliary index are missing from the
* target index. Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
* particular order: auxiliary first, target last.
*
* Building a unique index this way is tricky: we might try to insert a
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 02b636a0050..71baeced508 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4328,7 +4328,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/*
* The index is now valid in the sense that it contains all currently
* interesting tuples. But since it might not contain tuples deleted
- * just before the reference snap was taken, we have to wait out any
+ * just before the latest snap was taken, we have to wait out any
* transactions that might have older snapshots.
*
* Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd5..90d358804e4 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
return b;
}
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+ if (!TransactionIdIsValid(a))
+ return b;
+
+ if (!TransactionIdIsValid(b))
+ return a;
+
+ if (TransactionIdFollows(a, b))
+ return a;
+ return b;
+}
+
/* return the newer of the two IDs */
static inline FullTransactionId
FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
--
2.43.0
Attachment: v9-0009-concurrent-index-build-Remove-PROC_IN_SAFE_IC-opt.patch
From f4c00ab0c12b2af59e801d66d689d2378730a707 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Tue, 24 Dec 2024 19:36:25 +0100
Subject: [PATCH v9 9/9] concurrent index build: Remove PROC_IN_SAFE_IC
optimization
Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.
The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points
This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
src/backend/access/brin/brin.c | 6 +-
src/backend/access/nbtree/nbtsort.c | 6 +-
src/backend/commands/indexcmds.c | 142 +-----------------
src/include/storage/proc.h | 8 +-
src/test/modules/injection_points/Makefile | 2 +-
.../expected/reindex_conc.out | 51 -------
src/test/modules/injection_points/meson.build | 1 -
.../injection_points/sql/reindex_conc.sql | 28 ----
8 files changed, 10 insertions(+), 234 deletions(-)
delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index f076cedcc2c..048c7d7995b 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2886,11 +2886,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
int sortmem;
/*
- * The only possible status flag that can be set to the parallel worker is
- * PROC_IN_SAFE_IC.
+ * There are no status flags that can be set for the parallel worker.
*/
- Assert((MyProc->statusFlags == 0) ||
- (MyProc->statusFlags == PROC_IN_SAFE_IC));
+ Assert(MyProc->statusFlags == 0);
/* Set debug_query_string for individual workers first */
sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 60551f82bfa..c6f7e527b65 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1907,11 +1907,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
#endif /* BTREE_BUILD_STATS */
/*
- * The only possible status flag that can be set to the parallel worker is
- * PROC_IN_SAFE_IC.
+ * There are no status flags that can be set for the parallel worker.
*/
- Assert((MyProc->statusFlags == 0) ||
- (MyProc->statusFlags == PROC_IN_SAFE_IC));
+ Assert(MyProc->statusFlags == 0);
/* Set debug_query_string for individual workers first */
sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 71baeced508..ae058dc701b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -116,7 +116,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
Oid relationOid,
const ReindexParams *params);
static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
/*
* callback argument type for RangeVarCallbackForReindexIndex()
@@ -416,10 +415,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
* lazy VACUUMs, because they won't be fazed by missing index entries
* either. (Manual ANALYZEs, however, can't be excluded because they
* might be within transactions that are going to do arbitrary operations
- * later.) Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
*
* Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
* check for that.
@@ -440,8 +436,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
VirtualTransactionId *old_snapshots;
old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
- PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
- | PROC_IN_SAFE_IC,
+ PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
&n_old_snapshots);
if (progress)
pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -461,8 +456,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
true, false,
- PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
- | PROC_IN_SAFE_IC,
+ PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
&n_newer_snapshots);
for (j = i; j < n_old_snapshots; j++)
{
@@ -576,7 +570,6 @@ DefineIndex(Oid tableId,
amoptions_function amoptions;
bool exclusion;
bool partitioned;
- bool safe_index;
Datum reloptions;
int16 *coloptions;
IndexInfo *indexInfo;
@@ -1153,10 +1146,6 @@ DefineIndex(Oid tableId,
}
}
- /* Is index safe for others to ignore? See set_indexsafe_procflags() */
- safe_index = indexInfo->ii_Expressions == NIL &&
- indexInfo->ii_Predicate == NIL;
-
/*
* Report index creation if appropriate (delay this till after most of the
* error checks)
@@ -1643,10 +1632,6 @@ DefineIndex(Oid tableId,
CommitTransactionCommand();
StartTransactionCommand();
- /* Tell concurrent index builds to ignore us, if index qualifies */
- if (safe_index)
- set_indexsafe_procflags();
-
/*
* The index is now visible, so we can report the OID. While on it,
* include the report for the beginning of phase 2.
@@ -1703,9 +1688,6 @@ DefineIndex(Oid tableId,
CommitTransactionCommand();
StartTransactionCommand();
- /* Tell concurrent index builds to ignore us, if index qualifies */
- if (safe_index)
- set_indexsafe_procflags();
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_2);
/*
@@ -1735,10 +1717,6 @@ DefineIndex(Oid tableId,
CommitTransactionCommand();
StartTransactionCommand();
- /* Tell concurrent index builds to ignore us, if index qualifies */
- if (safe_index)
- set_indexsafe_procflags();
-
/*
* Phase 3 of concurrent index build
*
@@ -1780,10 +1758,6 @@ DefineIndex(Oid tableId,
CommitTransactionCommand();
StartTransactionCommand();
- /* Tell concurrent index builds to ignore us, if index qualifies */
- if (safe_index)
- set_indexsafe_procflags();
-
/*
* Updating pg_index might involve TOAST table access, so ensure we
* have a valid snapshot.
@@ -1795,10 +1769,6 @@ DefineIndex(Oid tableId,
CommitTransactionCommand();
StartTransactionCommand();
- /*
- * Because we don't take a snapshot in this transaction, there's no need
- * to set the PROC_IN_SAFE_IC flag here.
- */
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -1811,9 +1781,6 @@ DefineIndex(Oid tableId,
/*
* Drop auxiliary index.
*
- * Because we don't take a snapshot in this transaction, there's no need
- * to set the PROC_IN_SAFE_IC flag here.
- *
* Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
* right lock level.
*/
@@ -1823,10 +1790,6 @@ DefineIndex(Oid tableId,
CommitTransactionCommand();
StartTransactionCommand();
- /* Tell concurrent index builds to ignore us, if index qualifies */
- if (safe_index)
- set_indexsafe_procflags();
-
/* We should now definitely not be advertising any xmin. */
Assert(MyProc->xmin == InvalidTransactionId);
@@ -3621,7 +3584,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
Oid auxIndexId;
Oid tableId;
Oid amId;
- bool safe; /* for set_indexsafe_procflags */
} ReindexIndexInfo;
List *heapRelationIds = NIL;
List *indexIds = NIL;
@@ -3973,17 +3935,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
save_nestlevel = NewGUCNestLevel();
RestrictSearchPath();
- /* determine safety of this index for set_indexsafe_procflags */
- idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
- RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
- if (idx->safe)
- INJECTION_POINT("reindex-conc-index-safe");
- else
- INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
idx->tableId = RelationGetRelid(heapRel);
idx->amId = indexRel->rd_rel->relam;
@@ -4044,7 +3995,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
newidx = palloc_object(ReindexIndexInfo);
newidx->indexId = newIndexId;
newidx->auxIndexId = auxIndexId;
- newidx->safe = idx->safe;
newidx->tableId = idx->tableId;
newidx->amId = idx->amId;
@@ -4137,11 +4087,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
CommitTransactionCommand();
StartTransactionCommand();
- /*
- * Because we don't take a snapshot in this transaction, there's no need
- * to set the PROC_IN_SAFE_IC flag here.
- */
-
/*
* Phase 2 of REINDEX CONCURRENTLY
*
@@ -4172,10 +4117,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
*/
CHECK_FOR_INTERRUPTS();
- /* Tell concurrent indexing to ignore us, if index qualifies */
- if (newidx->safe)
- set_indexsafe_procflags();
-
/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
index_concurrently_build(newidx->tableId, newidx->auxIndexId, true);
@@ -4184,11 +4125,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
StartTransactionCommand();
- /*
- * Because we don't take a snapshot in this transaction, there's no need
- * to set the PROC_IN_SAFE_IC flag here.
- */
-
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_2);
/*
@@ -4213,10 +4149,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
*/
CHECK_FOR_INTERRUPTS();
- /* Tell concurrent indexing to ignore us, if index qualifies */
- if (newidx->safe)
- set_indexsafe_procflags();
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4237,11 +4169,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
StartTransactionCommand();
- /*
- * Because we don't take a snapshot or Xid in this transaction, there's no
- * need to set the PROC_IN_SAFE_IC flag here.
- */
-
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_3);
WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4262,10 +4189,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
*/
CHECK_FOR_INTERRUPTS();
- /* Tell concurrent indexing to ignore us, if index qualifies */
- if (newidx->safe)
- set_indexsafe_procflags();
-
/*
* Updating pg_index might involve TOAST table access, so ensure we
* have a valid snapshot.
@@ -4298,10 +4221,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
*/
CHECK_FOR_INTERRUPTS();
- /* Tell concurrent indexing to ignore us, if index qualifies */
- if (newidx->safe)
- set_indexsafe_procflags();
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4330,9 +4249,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* interesting tuples. But since it might not contain tuples deleted
* just before the latest snap was taken, we have to wait out any
* transactions that might have older snapshots.
- *
- * Because we don't take a snapshot or Xid in this transaction,
- * there's no need to set the PROC_IN_SAFE_IC flag here.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4354,13 +4270,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
INJECTION_POINT("reindex_relation_concurrently_before_swap");
StartTransactionCommand();
- /*
- * Because this transaction only does catalog manipulations and doesn't do
- * any index operations, we can set the PROC_IN_SAFE_IC flag here
- * unconditionally.
- */
- set_indexsafe_procflags();
-
forboth(lc, indexIds, lc2, newIndexIds)
{
ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4416,12 +4325,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
CommitTransactionCommand();
StartTransactionCommand();
- /*
- * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
- * real need for that, because we only acquire an Xid after the wait is
- * done, and that lasts for a very short period.
- */
-
/*
* Phase 5 of REINDEX CONCURRENTLY
*
@@ -4483,12 +4386,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
CommitTransactionCommand();
StartTransactionCommand();
- /*
- * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
- * real need for that, because we only acquire an Xid after the wait is
- * done, and that lasts for a very short period.
- */
-
/*
* Phase 6 of REINDEX CONCURRENTLY
*
@@ -4748,36 +4645,3 @@ update_relispartition(Oid relationId, bool newval)
table_close(classRel, RowExclusiveLock);
}
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots. On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial. Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
- /*
- * This should only be called before installing xid or xmin in MyProc;
- * otherwise, concurrent processes could see an Xmin that moves backwards.
- */
- Assert(MyProc->xid == InvalidTransactionId &&
- MyProc->xmin == InvalidTransactionId);
-
- LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
- MyProc->statusFlags |= PROC_IN_SAFE_IC;
- ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
- LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5a3dd5d2d40..a8ee412397a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
*/
#define PROC_IS_AUTOVACUUM 0x01 /* is it an autovac worker? */
#define PROC_IN_VACUUM 0x02 /* currently running lazy vacuum */
-#define PROC_IN_SAFE_IC 0x04 /* currently running CREATE INDEX
- * CONCURRENTLY or REINDEX
- * CONCURRENTLY on non-expressional,
- * non-partial index */
#define PROC_VACUUM_FOR_WRAPAROUND 0x08 /* set by autovac only */
#define PROC_IN_LOGICAL_DECODING 0x10 /* currently doing logical
* decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
/* flags reset at EOXact */
#define PROC_VACUUM_STATE_MASK \
- (PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+ (PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
/*
* Xmin-related flags. Make sure any flags that affect how the process' Xmin
* value is interpreted by VACUUM are included here.
*/
-#define PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define PROC_XMIN_FLAGS (PROC_IN_VACUUM)
/*
* We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 73893d351bb..bc0a06a1274 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
DATA = injection_points--1.0.sql
PGFILEDESC = "injection_points - facility for injection points"
-REGRESS = injection_points reindex_conc cic_reset_snapshots
+REGRESS = injection_points cic_reset_snapshots
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local
-----------------------------
-
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach
--------------------------
-
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach
--------------------------
-
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE: notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE: notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE: notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE: notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach
--------------------------
-
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach
--------------------------
-
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index f288633da4f..73cb5e92fdc 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -34,7 +34,6 @@ tests += {
'regress': {
'sql': [
'injection_points',
- 'reindex_conc',
'cic_reset_snapshots',
],
'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
--
2.43.0
On Tue, Dec 24, 2024 at 02:06:26PM +0100, Michail Nikolaev wrote:
Now STIR is used for validation (but without resetting of the snapshot
during that phase for now).
Perhaps I am the only one, but what you are doing here is confusing.
There is a dependency between one patch and the follow-up ones, but
while the first patch is clear regarding its goal of improving the
interactions between REINDEX CONCURRENTLY and INSERT ON CONFLICT
regarding the selection of arbiter index in the executor in 0001 in
the scope of the other thread you have created about this problem, it
is unclear what's the goal of what you are trying to do with 0003~, if
any of the follow-up patches help with that, and even why they have a
need to be posted on this thread. So perhaps you should split things
and explain what your goals are for each patch, or articulate better
why things are done this way? It looks like more things just keep
piling up each time a new patch series is sent to the lists. Posting
300kB worth of patches every 3 days is not going to help potential
reviewers, just confuse them.
Note that 0002, which attempts to introduce new tests, is costly. This
is not acceptable for integration. I'd suggest replacing that with
tests that have controlled and successive steps, as these lead to
predictable results, rather than something that runs for an arbitrary
amount of time to stress the friction of concurrent activity (this is
still useful to prove your point, though). That's something related
to the other thread, but in passing..
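A minimal sketch of such a deterministic test, reusing the existing
reindex_relation_concurrently_before_swap injection point visible in the
diffs above (this assumes a build with injection points enabled and the
injection_points test extension installed):

CREATE EXTENSION injection_points;
SELECT injection_points_set_local();
SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'notice');
CREATE TABLE conc_tab(i int PRIMARY KEY);
-- The NOTICE fires at a fixed point just before the index swap, so the
-- test output is deterministic rather than timing-dependent:
REINDEX INDEX CONCURRENTLY conc_tab_pkey;
SELECT injection_points_detach('reindex_relation_concurrently_before_swap');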
--
Michael
Hello, Michael!
Thank you for your comments and feedback!
Yes, this patch set contains a significant amount of code, which makes it
challenging to review. Some details are explained in the commit messages,
but I’m doing my best to structure the patch set in a way that is as
committable as possible. Once all the parts are ready, I plan to write a
detailed letter explaining everything, including benchmark results and
other relevant information.
Meanwhile, here’s a quick overview of the patch structure. If you have
suggestions for an alternative decomposition approach, I'd be happy to hear them.
The primary goals of the patch set are to:
* Enable the xmin horizon to propagate freely during concurrent index
builds
* Build concurrent indexes with a single heap scan
The patch set is split into the following parts. Technically, each part
could be committed separately, but all of them are required to achieve the
goals.
Part 1: Stress tests
- 0001: Yes, this patch is from another thread and not directly required;
it’s included here as a single commit because it’s necessary for stress
testing this patch set. Without it, issues with concurrent reindexing and
upserts cause failures.
- 0002: Yes, I agree these tests need to be refactored or moved into a
separate task. I’ll address this later.
Part 2: During the first phase of concurrently building an index, reset the
snapshot used for heap scans between pages, allowing xmin to go forward (a
way to observe this from another session is sketched after this list).
- 0003: Implement such snapshot resetting for non-parallel and non-unique
cases
- 0004: Extends snapshot resetting to parallel builds
- 0005: Extends snapshot resetting to unique indexes
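As a sketch, the effect should be observable from a second session while a
concurrent build is running (standard pg_stat_activity columns; the advancing
xmin is the intended effect of 0003-0005, not something shown in this thread):

SELECT pid, backend_xmin, query
FROM pg_stat_activity
WHERE query ILIKE '%CREATE INDEX CONCURRENTLY%'
  AND pid <> pg_backend_pid();
-- On master, backend_xmin stays pinned for the whole heap scan; with the
-- snapshot resets it should keep advancing as the scan moves between pages.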
Part 3: Build concurrent indexes in a single heap scan
- 0006: Introduces the STIR (Short-Term Index Replacement) access method, a
specialized method for auxiliary indexes during concurrent builds
- 0007: Implements the auxiliary index approach, enabling concurrent index
builds to use a single heap scan.
In a few words, it works like this: create an empty auxiliary
STIR index to track new tuples, scan the heap and build the new index, merge
the STIR tuples into the new index, then drop the auxiliary index (a short
SQL sketch of the user-visible side follows after this list).
- 0008: Enhances the auxiliary index approach by resetting snapshots during
the merge phase, allowing xmin to propagate
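A short SQL sketch of the user-visible side, modelled on the regress output
in the 0007 diff above (the table and index names here are illustrative):

CREATE TABLE demo(c1 int);
INSERT INTO demo VALUES (1), (1);
-- The build fails on the duplicate, leaving both the target index and
-- the STIR auxiliary index behind as INVALID:
CREATE UNIQUE INDEX CONCURRENTLY demo_c1_idx ON demo(c1);
-- ERROR:  could not create unique index "demo_c1_idx"
-- DETAIL:  Key (c1)=(1) is duplicated.
-- \d demo then lists:
--   "demo_c1_idx" UNIQUE, btree (c1) INVALID
--   "demo_c1_idx_ccaux" stir (c1) INVALID
DROP INDEX demo_c1_idx_ccaux;
DROP INDEX demo_c1_idx;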
Part 4: This part only makes sense once all three previous parts are
committed (the other parts can be applied separately).
- 0009: Remove the PROC_IN_SAFE_IC logic, as it is no longer required
I plan to add a few more small optimizations and then
do some large-scale stress testing and benchmarking. I think that without it, no
one is going to spend their time on such an amount of code :)
Merry Christmas,
Mikhail.
Hello, everyone!
I’ve added several updates to the patch set:
* Automatic auxiliary index removal where applicable.
* Documentation updates to reflect recent changes.
* Optimization for STIR indexes: skipping datum setup, as they store only
TIDs.
* Numerous assertions to ensure that MyProc->xmin is invalid where
necessary.
I’d like to share some initial benchmark results (see attached graphs).
This involves building a B-tree index on (aid, abalance) in a pgbench setup
with scale 2000 (with WAL), while running a concurrent pgbench workload.
The patched version built the index in 68 seconds, compared to 117 seconds
with the master branch (mostly because only a single heap scan is needed).
There appears to be no effect on the throughput of the concurrent pgbench.
The maximum snapshot age remains near zero.
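For anyone reproducing this, the build can also be watched from a second
session via the progress view; a sketch (standard
pg_stat_progress_create_index columns; with 0007 applied, the validation
phase reports the merge against the auxiliary index):

SELECT pid, index_relid::regclass, phase, tuples_done, tuples_total
FROM pg_stat_progress_create_index;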
I am going to continue benchmarking with different options: different HOT
setups, unique indexes, different index types, and DB sizes (100+ GB).
If anyone has ideas about possible benchmark scenarios, please share.
Best regards,
Mikhail.
[image: image.png]
[image: image.png]
��:��|��++]8Y�Gn2��$L��Q�w������G��m�H�����@\$���[J�!�Bi*��^f���r�x�W�O�ob0�~�d�C��S��[4A� ��NS�!�Bi$.������^����b�^/�s�k��;���Sp����n���"Zq �B!��e8 @�����b����� �[��x�
p����&[���*B!���� m���oR�kW� ��39�}(;v�����x�HX}qa�O&�8S�!�Bi(p����F�!�*wIR����>C��B��X�
��A������Kh�!�_�x��n&�B
�j)�]W)���������0}�-[��Z`�|��0 @�m�����
�&�4R�_B�� (�2����!��K ���A�;���3 ����f� Hj�=�gJh������i �{�W5�Z�>���<���B�%��������# ��oIw ZP�!����>��B���B��~s����;nB���`�n��!��K�I��c`t:x��k3w�1�P }�f9�h�6� P���@�g?���|���:�����?�$�"�B.M8�'�)
�� X~ ����"�{hi����K�h�t��M��7"f���g k~�B!�2
.�v�X(Z�����o>�'vV�B%��m���B��������`�xv��w��!�\�(p���3 ������/8+�S�h�!���/���A� ��T�� ��m�t7!�r��eggC��s�h��F^/,�#n��3����,�l<��x�p���)
.�+��� ��=
���c(���X����V�p$�#B� ��,)) ����}S!�4���_p��.no��+QK��u���W���������E�6P�L��)�v�"�
�%Epo�W���������q�;�^����nl��b�������r��B!!V�T�Z���?0
`�r��]HW:M�����}�y'M�#uG,��S�����f �����'�������eN���An)�H��=��� ����l�'�������B.&8���� ��=�61p��J���i�m��E������M�b].��� P&�]�0����O��J��/+YE~��@��2�q�M�b�c:<>B�'�Wc��Z�o+ ,���]��e7�� !�bA�C-a�j(Z� x�(U�������b�Zu ��.���B�w�t���T���O�.��] ������1�v������/+En���
L�Bjb�WP�h^�E�w��A��2��Y����w~uG!�aP�P�|KF���R� �������u�T��x���N�H ��,X?~�����7��<
Q!\m�Z%'�M���h)�H�2OW*�kb�E���h)����xh�
�f��.sR�!�40
j��}* �s$8p�.�����d�sK��IJ���o�sp�twH��y����8�S�X�;X���"�wmG�S���[�X�X���IP�n �����.!�6+���iJ��) T=�B_�ix�����lY��!�X}� �W��F������(�/B�8�"eG~��4��+k��<����u�/���w6e�����) ���W�������N�_|��VKw����>