BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

Started by PG Bug reporting form · about 4 years ago · 25 messages · pgsql-bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 17401
Logged by: Ben Chobot
Email address: bench@silentmedia.com
PostgreSQL version: 12.9
Operating system: Linux (Ubuntu)
Description:

This bug is almost identical to BUG #17389, which I filed blaming
pg_repack; however, further testing shows the same symptoms using vanilla
REINDEX TABLE CONCURRENTLY.

1. Put some data in a table with a single btree index:
create table public.simple_test (id int primary key);
insert into public.simple_test(id) (select generate_series(1,1000));

2. Set up streaming replication to a secondary db.

3. In a loop on the primary, concurrently REINDEX that table:
while true; do
    psql -tAc "select now(),relfilenode from pg_class where relname='simple_test_pkey'" >> log
    psql -tAc "reindex table concurrently public.simple_test"
done

4. In a loop on the secondary, have psql query the secondary db for an
indexed value of that table:
while true; do
    psql -tAc "select count(*) from simple_test where id='3';
               select relfilenode from pg_class where relname='simple_test_pkey'" || break
done; date

With those 4 steps, the client on the replica will reliably fail to open the
OID of the index within 30 minutes of looping. ("ERROR: could not open
relation with OID 6715827") When we run the same client loop on the primary
instead of the replica, or if we reindex without the CONCURRENTLY clause,
then the client loop will run for hours without fail, but neither of those
workarounds are options for us in production.
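The failure mode can be modeled abstractly, outside PostgreSQL: the client first resolves the index name to an OID, REINDEX CONCURRENTLY then swaps in a new index under that name and drops the old one, and the client finally tries to open the already-dropped OID. A minimal single-process sketch (all names and helpers here are illustrative stand-ins, not PostgreSQL APIs):

```python
# Toy model of the reported race; dicts stand in for the system catalogs.
catalog = {"simple_test_pkey": 16401}   # index name -> OID
relations = {16401: "relfilenode-A"}    # OID -> storage

def resolve(name):
    # Client step 1: resolve name to OID, without taking any lock.
    return catalog[name]

def open_relation(oid):
    # Client step 2: open the relation by OID; fails if it is gone.
    if oid not in relations:
        raise LookupError(f"could not open relation with OID {oid}")
    return relations[oid]

def reindex_concurrently(name, new_oid):
    # Swap a freshly built index in under the same name, then drop the
    # old one without waiting for sessions that already hold the old OID.
    old_oid = catalog[name]
    relations[new_oid] = "relfilenode-B"
    catalog[name] = new_oid
    del relations[old_oid]

oid = resolve("simple_test_pkey")                 # client resolves the name...
reindex_concurrently("simple_test_pkey", 16402)   # ...REINDEX runs in between...
try:
    open_relation(oid)                            # ...and the open fails
except LookupError as e:
    print(e)  # prints: could not open relation with OID 16401
```

The sketch only captures the ordering; the rest of the thread is about which WAL-logged information should have kept a standby from reaching the failing open.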

Like I said before, this isn't a new problem - we've seen it since at least
9.5 - but pre-12 we saw it using pg_repack, which is an easy (and
reasonable) scapegoat. But now that we've upgraded to 12 and are still
seeing it using vanilla concurrent reindexing, it seems more clear this is
an actual postgres bug?

#2 Peter Geoghegan
pg@bowt.ie
In reply to: PG Bug reporting form (#1)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On Tue, Feb 8, 2022 at 1:01 PM PG Bug reporting form
<noreply@postgresql.org> wrote:

With those 4 steps, the client on the replica will reliably fail to open the
OID of the index within 30 minutes of looping. ("ERROR: could not open
relation with OID 6715827") When we run the same client loop on the primary
instead of the replica, or if we reindex without the CONCURRENTLY clause,
then the client loop will run for hours without fail, but neither of those
workarounds are options for us in production.

I find the idea that we'd fail to WAL-log information that is needed
during Hot Standby (to prevent this race condition) plausible.
Michael?

--
Peter Geoghegan

#3 Michael Paquier
michael@paquier.xyz
In reply to: Peter Geoghegan (#2)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On Tue, Feb 08, 2022 at 01:23:34PM -0800, Peter Geoghegan wrote:

I find the idea that we'd fail to WAL-log information that is needed
during Hot Standby (to prevent this race condition) plausible.
Michael?

Yeah, REINDEX relies on some existing index definition, so it feels
like we are missing a piece related to invalid indexes in all that. A
main difference is the lock level, as exclusive locks are getting
logged so the standby can react and wait on that. The 30-minute mark
is interesting. Ben, did you change any replication-related GUCs that
could influence that? Say, wal_receiver_timeout, hot_standby_feedback
or max_standby_streaming_delay?
--
Michael

#4 Ben
bench@silentmedia.com
In reply to: Michael Paquier (#3)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

Michael Paquier wrote on 2/8/22 5:35 PM:

On Tue, Feb 08, 2022 at 01:23:34PM -0800, Peter Geoghegan wrote:

I find the idea that we'd fail to WAL-log information that is needed
during Hot Standby (to prevent this race condition) plausible.
Michael?

Yeah, REINDEX relies on some existing index definition, so it feels
like we are missing a piece related to invalid indexes in all that. A
main difference is the lock level, as exclusive locks are getting
logged so the standby can react and wait on that. The 30-minute mark
is interesting. Ben, did you change any replication-related GUCs that
could influence that? Say, wal_receiver_timeout, hot_standby_feedback
or max_standby_streaming_delay?

Oh, to be clear, the 30 minute mark is more "the loop has always failed
this far into it" and sometimes that's 5 minutes and sometimes it's
more, but I've never seen it take more than somewhere in the 20s. I was
thinking it was just because of the race condition, but, to answer your
question, yes, we have tuned some replication parameters. Here are the
ones you asked about; did you want to see the value of any others?

=# show wal_receiver_timeout ;
 wal_receiver_timeout
──────────────────────
 1min
(1 row)

04:26:27 db: postgres@postgres, pid:29507
=# show hot_standby_feedback ;
 hot_standby_feedback
──────────────────────
 on
(1 row)

04:26:40 db: postgres@postgres, pid:29507
=# show max_standby_streaming_delay ;
 max_standby_streaming_delay
─────────────────────────────
 10min
(1 row)

#5 Andrey Borodin
amborodin@acm.org
In reply to: PG Bug reporting form (#1)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On Feb 9, 2022, at 01:41, PG Bug reporting form <noreply@postgresql.org> wrote:

PostgreSQL version: 12.9

I thought we fixed this in 12.9. But apparently not.
Gitlab was observing a similar problem on 12.7 here [0]. And I was convinced that it's a manifestation of the relcache bug fixed in fdd965d07 [1]. I think we could adapt current amcheck tests for replication checks too. It cannot do heapallindexed on standby, but probably can catch relcache bugs.

Best regards, Andrey Borodin.

[0]: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6120#note_796846035
[1]: https://github.com/postgres/postgres/commit/fdd965d074d46765c295223b119ca437dbcac973

#6 Peter Geoghegan
pg@bowt.ie
In reply to: Andrey Borodin (#5)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On Tue, Feb 8, 2022 at 9:25 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

It cannot do heapallindexed on standby, but probably can catch relcache bugs.

Why not? In general heapallindexed verification works just fine on a standby.

You may be thinking of the extra verification options that are applied
to the index structure itself when the user opts to call
bt_index_parent_check() (rather than bt_index_check). But that's
unrelated.

--
Peter Geoghegan

#7 Andrey Borodin
amborodin@acm.org
In reply to: Peter Geoghegan (#6)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On Feb 9, 2022, at 10:30, Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Feb 8, 2022 at 9:25 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

It cannot do heapallindexed on standby, but probably can catch relcache bugs.

Why not? In general heapallindexed verification works just fine on a standby.

Yes, but it acquires a big ShareLock on the relation. The probability of unusual interference with startup redo sequences becomes much smaller.

Best regards, Andrey Borodin.

#8 Peter Geoghegan
pg@bowt.ie
In reply to: Andrey Borodin (#7)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On Tue, Feb 8, 2022 at 10:52 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Yes, but it acquires a big ShareLock on the relation. The probability of unusual interference with startup redo sequences becomes much smaller.

Actually, calling bt_index_check() + heapallindexed acquires only an
AccessShareLock on both the table and the index.

Again, it sounds like you're talking about bt_index_parent_check()
(with or without heapallindexed). That does acquire a ShareLock, which
will just throw an error on a standby.

--
Peter Geoghegan

#9 Andrey Borodin
amborodin@acm.org
In reply to: Peter Geoghegan (#8)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On 9 Feb 2022, at 12:13, Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Feb 8, 2022 at 10:52 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Yes, but it acquires a big ShareLock on the relation. The probability of unusual interference with startup redo sequences becomes much smaller.

Actually, calling bt_index_check() + heapallindexed acquires only an
AccessShareLock on both the table and the index.

Again, it sounds like you're talking about bt_index_parent_check()
(with or without heapallindexed). That does acquire a ShareLock, which
will just throw an error on a standby.

You are right.

So, I’ve composed a dirty test by reverting 7f580aa5d [0] to resurrect pgbench_background().
Now I observe in the standby log:

2022-02-10 21:46:38.909 +05 [84272] 004_cic_standby.pl LOG: statement: SELECT bt_index_check('idx',false);
2022-02-10 21:46:38.910 +05 [84272] 004_cic_standby.pl ERROR: cannot check index "idx_ccold"
2022-02-10 21:46:38.910 +05 [84272] 004_cic_standby.pl DETAIL: Index is not valid.
2022-02-10 21:46:38.910 +05 [84272] 004_cic_standby.pl STATEMENT: SELECT bt_index_check('idx',false);

I think it’s somewhat unexpected. But is it a real problem?

Thanks!

[0]: https://github.com/postgres/postgres/commit/7f580aa5d88a9b03d66fcb9a1d7c4fcd69d9e126

Attachments:

0001-Try-to-check-indexes-on-standby.patch (application/octet-stream, +108/-42)
#10 Ben
bench@silentmedia.com
In reply to: Andrey Borodin (#9)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

Andrey Borodin wrote on 2/10/22 8:52 AM:

I think it’s somewhat unexpected. But is it a real problem?

If you're asking me, then I can't speak to your standby log, but I'd say
the bug I raised is a problem, yes. :)

#11 Andrey Borodin
amborodin@acm.org
In reply to: Ben (#10)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On 10 Feb 2022, at 22:18, Ben Chobot <bench@silentmedia.com> wrote:

Andrey Borodin wrote on 2/10/22 8:52 AM:

I think it’s somewhat unexpected. But is it a real problem?

If you're asking me, then I can't speak to your standby log, but I'd say the bug I raised is a problem, yes. :)

Well, not exactly. I had attached a patch with a test that reproduces some problem with a dropped index on a standby.
In this test we have a standby, and at some point a check query on it finds out that it's working with an invalid index. I'm not sure that this is a reproduction of your exact problem.

I see that we are swapping indexes on the primary in the following way:

index_concurrently_swap(newidx->indexId, oldidx->indexId, oldName); // WAL-logged
CacheInvalidateRelcacheByRelid(oldidx->tableId); // Not WAL-logged
CommitTransactionCommand(); // WAL-logged

But I do not know whether it is expected to work on standby, since the relcache invalidation is not WAL-logged. Maybe Peter or someone more experienced than me can explain how this is expected to work.

Thanks!

Best regards, Andrey Borodin.

#12 Andres Freund
andres@anarazel.de
In reply to: Andrey Borodin (#11)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

Hi,

On 2022-02-10 22:41:00 +0500, Andrey Borodin wrote:

index_concurrently_swap(newidx->indexId, oldidx->indexId, oldName); // WAL-logged
CacheInvalidateRelcacheByRelid(oldidx->tableId); // Not WAL-logged
CommitTransactionCommand(); // WAL-logged

But I do not know whether it is expected to work on standby, since the
relcache invalidation is not WAL-logged. Maybe Peter or someone more
experienced than me can explain how this is expected to work.

What makes you say that CacheInvalidateRelcacheByRelid isn't WAL logged? It
should be emitted at the commit, like other invalidation messages?

Here's the WAL for a REINDEX CONCURRENTLY, leaving out the actual catalog
changes (i.e. heap/heap2/btree rmgr):
rmgr: Storage len (rec/tot): 42/ 42, tx: 0, lsn: 0/0155F3F8, prev 0/0155F3B8, desc: CREATE base/5/16399
rmgr: Standby len (rec/tot): 42/ 42, tx: 731, lsn: 0/0155F428, prev 0/0155F3F8, desc: LOCK xid 731 db 5 rel 16399
rmgr: Transaction len (rec/tot): 194/ 194, tx: 731, lsn: 0/0155F960, prev 0/0155F918, desc: COMMIT 2022-02-10 15:53:56.173778 PST; inval msgs: catcache 53 catcache 52 catcache 7 catcache 6 catcache 32 relcache 16399 relcache 16392 snapshot 2608 relcache 16392
rmgr: Transaction len (rec/tot): 146/ 146, tx: 732, lsn: 0/01561BB0, prev 0/01561B70, desc: COMMIT 2022-02-10 15:53:56.192747 PST; inval msgs: catcache 53 catcache 52 catcache 32 relcache 16392 relcache 16399 relcache 16399
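Andres's point can be sketched as a toy model: CacheInvalidateRelcacheByRelid() only queues a message and writes no WAL itself; the queued messages are attached to the transaction's COMMIT record, which is what the standby replays. The function names below mirror the C code's roles, but the structures are invented:

```python
# Toy model: invalidations are queued during the transaction and attached
# to the COMMIT record; only the commit record reaches the WAL.
queued_invals = []

def cache_invalidate_relcache_by_relid(relid):
    # Mirrors the C function's role: queue a message, write no WAL here.
    queued_invals.append(("relcache", relid))

def commit_transaction():
    # The queued messages are emitted as part of the commit record, like
    # the "inval msgs: ... relcache 16392 ..." entries in the dump above.
    record = {"type": "COMMIT", "inval_msgs": list(queued_invals)}
    queued_invals.clear()
    return record

cache_invalidate_relcache_by_relid(16392)
cache_invalidate_relcache_by_relid(16399)
print(commit_transaction())
# prints: {'type': 'COMMIT', 'inval_msgs': [('relcache', 16392), ('relcache', 16399)]}
```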

It might be worth writing a test that replays WAL records one-by-one, and that
checks at which record things start to break. If the problem is not
reproducible that way, we have a problem "within" a single WAL record
replay. If it is reproducible, it's likely a problem in the way the primary
generates WAL.

That would be generally very useful tooling for detailed testing of sequences
of WAL records.

I'm pretty sure the problem is on the primary. Looking through
ReindexRelationConcurrently() I think I found at least two problems:

1) We don't WAL log snapshot conflicts, afaict (i.e. the
WaitForOlderSnapshots() in phase 3). Which I think means that the new index
can end up being used "too soon" in pre-existing snapshots on the standby.

I don't think this is the problem this thread is about, but it's definitely a
problem.

2) WaitForLockersMultiple() in phase 5 / 6 isn't WAL logged. Without waiting
for sessions to see the results of Phase 4, 5, we can mark the index as dead
(phase 5) and drop it (phase 6), while there are ongoing accesses.

I think this is the likely cause of the reported bug.

Both seem like a significant issue for several of the CONCURRENTLY commands,
not just REINDEX.
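The suspected phase 5/6 race can be written down as an explicit interleaving, with each session reduced to atomic steps (a toy model, not PostgreSQL code): the standby reader resolves the index OID, replay of phases 5 and 6 runs in between because nothing in WAL tells it to wait for the reader, and the reader's subsequent open fails.

```python
# Toy interleaving of a standby reader vs. replay of phases 5/6; the dict
# stands in for the catalog, generators make the step boundaries explicit.
relations = {16401: {"valid": True}}

def reader_steps(log):
    oid = 16401                      # step 1: name -> OID lookup
    yield
    if oid in relations:             # step 2: open by OID
        log.append("reader: ok")
    else:
        log.append(f"reader: could not open relation with OID {oid}")

def replay_steps(log):
    relations[16401]["valid"] = False    # phase 5: mark the old index dead
    yield
    del relations[16401]                 # phase 6: drop it
    log.append("replay: dropped 16401")

log = []
r, w = reader_steps(log), replay_steps(log)
next(r)           # reader resolves the OID
next(w)           # replay marks the old index dead...
for _ in w: pass  # ...and drops it -- nothing made it wait for the reader
for _ in r: pass  # reader now opens the vanished OID
print(log[-1])    # prints: reader: could not open relation with OID 16401
```

On the primary, WaitForLockersMultiple() would forbid this interleaving by forcing the drop to run after the reader; the point here is that the standby has no replayed equivalent of that wait.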

Greetings,

Andres Freund

#13 Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#12)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On Thu, Feb 10, 2022 at 04:12:40PM -0800, Andres Freund wrote:

I'm pretty sure the problem is on the primary. Looking through
ReindexRelationConcurrently() I think I found at least two problems:

1) We don't WAL log snapshot conflicts, afaict (i.e. the
WaitForOlderSnapshots() in phase 3). Which I think means that the new index
can end up being used "too soon" in pre-existing snapshots on the standby.

I don't think this is the problem this thread is about, but it's definitely a
problem.

2) WaitForLockersMultiple() in phase 5 / 6 isn't WAL logged. Without waiting
for sessions to see the results of Phase 4, 5, we can mark the index as dead
(phase 5) and drop it (phase 6), while there are ongoing accesses.

I think this is the likely cause of the reported bug.

Yep, I was planning to play with this problem from next week, FWIW,
just lacked time/energy to do so. And the release was shipping
anyway, so there is plenty of time.

My impression is that we don't really need to change the WAL format
based on the existing APIs we already have, or that in the worst case
it would be possible to make things backward-compatible enough that it
would not be a worry as long as the standbys are updated before the
primaries.
--
Michael

#14 Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#13)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

Hi,

On 2022-02-11 10:38:34 +0900, Michael Paquier wrote:

My impression is that we don't really need to change the WAL format
based on the existing APIs we already have, or that in the worst case
it would be possible to make things backward-compatible enough that it
would not be a worry as long as the standbys are updated before the
primaries.

I think it's a question of how "concurrently" we want concurrently to be on
hot standby nodes.

If we are OK with "not at all", we can just log an AEL at the end of
WaitForLockersMultiple(). It's a bit harder to cause the necessary snapshot
conflict, but we could e.g. extend xl_running_xacts and use the record length
to keep both older primary -> newer replica, and the reverse working.

If we want to actually make concurrently work concurrently on the replica,
we'd have to work harder. That doesn't strike me as feasible for backpatching.

The most important bit would be to make WaitForOlderSnapshots() wait for
the xmin of hot_standby_feedback walsenders and/or physical slot xmin
horizons. For phase 5 & 6 we would probably have to do something like waiting
for snapshot conflicts vs walsenders/slots in addition to the
WaitForLockersMultiple() for conflicts on the primary.

Greetings,

Andres Freund

#15 Andrey Borodin
amborodin@acm.org
In reply to: Andres Freund (#12)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On 11 Feb 2022, at 05:12, Andres Freund <andres@anarazel.de> wrote:

On 2022-02-10 22:41:00 +0500, Andrey Borodin wrote:

index_concurrently_swap(newidx->indexId, oldidx->indexId, oldName); // WAL-logged
CacheInvalidateRelcacheByRelid(oldidx->tableId); // Not WAL-logged
CommitTransactionCommand(); // WAL-logged

But I do not know whether it is expected to work on standby, since the
relcache invalidation is not WAL-logged. Maybe Peter or someone more
experienced than me can explain how this is expected to work.

What makes you say that CacheInvalidateRelcacheByRelid isn't WAL logged? It
should be emitted at the commit, like other invalidation messages?

Thank you for the explanation!
FWIW, here’s a cleaner TAP-test reproduction. Sometimes I see
2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl ERROR: could not open relation with OID 16401
2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl STATEMENT: SELECT bt_index_check('idx',true);

In tmp_check/log/004_cic_standby_standby_1.log after running amcheck tests. Exactly as per initial report.

Best regards, Andrey Borodin.

Attachments:

0001-Test-B-tree-index-on-standby-during-REINDEX-CONCURRE.patch (application/octet-stream, +150/-1)
#16 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#15)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On 12 Feb 2022, at 20:03, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Thank you for the explanation!
FWIW, here’s a cleaner TAP-test reproduction. Sometimes I see
2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl ERROR: could not open relation with OID 16401
2022-02-12 21:59:02.786 +05 [19629] 004_cic_standby.pl STATEMENT: SELECT bt_index_check('idx',true);

In tmp_check/log/004_cic_standby_standby_1.log after running amcheck tests. Exactly as per initial report.

I tried to dig into this again. And once again I simply cannot understand how it is supposed to work... I'd appreciate it if someone explained it to me.
When I do SELECT bt_index_check('idx',true), I'm in fact doing the following:
1. regclassin() looks up 'idx' and converts it into an OID without taking any lock
2. bt_index_check() gets this OID and looks up the index again.
Do we rely on the catalog snapshot here? Or how do we know that this OID still belongs to the index named 'idx' on the standby? When will the backend understand that this OID is no longer of interest and raise an error? I think it must happen during index_open().

BTW is anyone working on this bug?

Best regards, Andrey Borodin.

#17 Michael Paquier
michael@paquier.xyz
In reply to: Andrey Borodin (#16)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On Tue, Mar 08, 2022 at 06:44:31PM +0300, Andrey Borodin wrote:

I tried to dig into this again. And once again I simply cannot
understand how it is supposed to work... I'd appreciate it if someone
explained it to me.
When I do SELECT bt_index_check('idx',true), I'm in fact doing the following:
1. regclassin() looks up 'idx' and converts it into an OID without taking any lock
2. bt_index_check() gets this OID and looks up the index again.
Do we rely on the catalog snapshot here? Or how do we know that this
OID still belongs to the index named 'idx' on the standby? When will
the backend understand that this OID is no longer of interest and
raise an error? I think it must happen during index_open().

Yeah, I can see that, and I think that it is a second issue, rather
independent of the cache lookup errors that we are seeing on a
standby. As far as I have tested, it is possible to see this error as
well with a sequence of CREATE/DROP INDEX CONCURRENTLY run repeatedly
on a primary while the standby does a scan of pg_index, for example
in a tight loop:
SELECT bt_index_check(indexrelid) FROM pg_index WHERE indisvalid AND
indexrelid > 16000;

It is a matter of seconds to see bt_index_check() complain that one of
these indexes is not valid, even though the scan of pg_index was told
not to select such an index, so we are missing something with the
relation cache.
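This behavior can be modeled as snapshot skew: the pg_index scan materializes its WHERE indisvalid filter from one view of the catalog, while the per-row bt_index_check() call sees a newer view in which a concurrent REINDEX has already invalidated the index. A toy model (not PostgreSQL code):

```python
# Toy model: the scan's snapshot vs. the live catalog state.
pg_index = {16401: {"indisvalid": True}}

# The pg_index scan takes its view of the catalog up front.
snapshot = {oid: row.copy() for oid, row in pg_index.items()}

# A concurrent REINDEX CONCURRENTLY swaps, invalidating the old entry.
pg_index[16401]["indisvalid"] = False

errors = []
for oid, row in snapshot.items():
    if row["indisvalid"]:                    # the scan's WHERE indisvalid
        if not pg_index[oid]["indisvalid"]:  # the check sees the live row
            errors.append(f"cannot check index {oid}: not valid")
print(errors)  # prints: ['cannot check index 16401: not valid']
```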

BTW is anyone working on this bug?

It took me some time to come back to this thread, but I can do so now
that we are done with the commitfest business.

Speaking of the cache errors when working directly on the relations
rebuilt, I got a patch that seems to work properly, with a mix of
standby snapshots and AELs logged. I have been running a set of
REINDEX INDEX/TABLE CONCURRENTLY on the primary through more than 100k
relfilenodes without the standby complaining, and I should be able to
get this one fixed before the next minor release.
--
Michael

#18 Andrey Borodin
amborodin@acm.org
In reply to: Michael Paquier (#17)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On 20 Apr 2022, at 12:14, Michael Paquier <michael@paquier.xyz> wrote:

I got a patch that seems to work properly,

Cool! Can you please share it?

with a mix of
standby snapshots and AELs logged.

I did not succeed with my tests along these lines. But probably I was doing something wrong. Or, perhaps, these tests were encountering the second problem.

Best regards, Andrey Borodin.

#19 Michael Paquier
michael@paquier.xyz
In reply to: Andrey Borodin (#18)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

On Wed, Apr 20, 2022 at 11:23:43PM +0500, Andrey Borodin wrote:

Cool! Can you please share it?

Sure. I am finishing this part with the attached, which would be
backpatchable. Based on the analysis upthread, I have stuck with
logging standby snapshots once we are done in WaitForOlderSnapshots(),
so as to make sure that a standby does not attempt to look too early
at the data of any running transactions it should wait for, and with
logging AELs at the end of WaitForLockersMultiple() to prevent the
standby from accessing the data before the lockers are gone. I have
to admit that a direct use of LogAccessExclusiveLock() feels a bit
funky, but we cannot use LockAcquire() AFAIK, and that should be fine
with all the lock tags at hand.

I have stressed quite a lot a primary/standby setup with flavors of
the tests from Ben upthread, and I am not seeing the cache problems
anymore on the standby, with concurrent CREATE/DROP INDEX or REINDEX
sequences.

I did not succeed with my tests along these lines. But probably I was
doing something wrong. Or, perhaps, these tests were encountering the
second problem.

There is a second thing going on with amcheck here. REINDEX
CONCURRENTLY in isolation is working correctly when you run
bt_index_check() in parallel with it (aka attach gdb to the backend
process running REINDEX CONCURRENTLY and run bt_index_check()
repeatedly in parallel at the various steps with a \watch or a new
connection each time), but depending on your transaction order you may
finish with an invalid index given as input of
bt_index_check_internal(), roughly:
- REINDEX CONCURRENTLY begins, goes up to the transaction doing the
swap, where the old index is marked as invalid.
- Execute bt_index_check(), which refers to the old index OID, still
valid at the moment where amcheck begins checking this index, before
the Relation of the parent table is opened.
- The transaction doing the swap commits, making visible that
indisvalid is false for the old index.
- bt_index_check() continues, opens the parent Relation, and complains
about an invalid index.

amcheck could be made more robust here, by calling
RelationGetIndexList() after opening the Relation of the parent to
check if it is still listed in the returned list as it would look at
the relcache and discard any invalid indexes on the way. Returning an
error for an invalid index seems less helpful to me here than doing
nothing, particularly if you begin doing a full check of the indexes
based on the contents of pg_index.
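The suggested hardening can be sketched like this (a hypothetical Python stand-in; bt_index_check_sketch and parent_indexes are invented names, with parent_indexes playing the role of what RelationGetIndexList() returns after the parent is opened):

```python
# Toy model of the hardening: after "opening the parent", re-check that the
# target is still one of its valid indexes; hardened mode skips quietly
# instead of raising an error.
parent_indexes = {16402}  # stand-in for RelationGetIndexList(): valid indexes only

def bt_index_check_sketch(oid, hardened=False):
    if oid not in parent_indexes:
        if hardened:
            return f"index {oid} skipped: no longer in the parent's index list"
        raise ValueError(f"cannot check index {oid}: not valid")
    return f"index {oid} checked"

print(bt_index_check_sketch(16401, hardened=True))
# prints: index 16401 skipped: no longer in the parent's index list
```

The design point is that an index invalidated mid-check is simply no longer interesting, so doing nothing is more useful than erroring, especially when iterating over all of pg_index.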
--
Michael

Attachments:

reindex-standby-fix.patch (text/x-diff, charset=us-ascii, +29/-0)
#20 Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#19)
Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

Hi,

On 2022-04-21 12:54:00 +0900, Michael Paquier wrote:

Based on the analysis upthread, I have stuck with logging standby
snapshots once we are done in WaitForOlderSnapshots(), so as to make sure
that a standby does not attempt to look too early at the data of any
running transactions it should wait for, and with logging AELs at the end
of WaitForLockersMultiple() to prevent the standby from accessing the data
before the lockers are gone.

Why is this necessary? The comments in WaitForOlderSnapshots() don't really
explain it. This is pretty darn expensive.

amcheck could be made more robust here, by calling
RelationGetIndexList() after opening the Relation of the parent to
check if it is still listed in the returned list as it would look at
the relcache and discard any invalid indexes on the way.

That seems like a weird approach. Why can't the check just be done on the
relcache entry of the index itself? If that doesn't work, something is still
broken in cache invalidation.

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index cd30f15eba..704d8f3744 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -475,6 +475,16 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
if (progress)
pgstat_progress_update_param(PROGRESS_WAITFOR_DONE, i + 1);
}
+
+	/*
+	 * Take a snapshot of running transactions and write this to WAL.  With
+	 * this information, a standby is able to know that it should not access
+	 * too soon any information we expect to wait for with a concurrent build.
+	 *
+	 * Skip the log of this information if disabled.
+	 */
+	if (XLogStandbyInfoActive())
+		LogStandbySnapshot();
}

"should not access too soon" - no idea what that means / guarantees.

The "Skip the log" bit is redundant with the code.

Greetings,

Andres Freund

#21 Andrey Borodin
amborodin@acm.org
In reply to: Michael Paquier (#19)

#22 Noah Misch
noah@leadboat.com
In reply to: Andrey Borodin (#21)

#23 Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#20)

#24 Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#22)

#25 Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#24)