long-standing data loss bug in initial sync of logical replication

Started by Tomas Vondra over 2 years ago, 105 messages
#1Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

It seems there's a long-standing data loss issue related to the initial
sync of tables in the built-in logical replication (publications etc.).
I can reproduce it fairly reliably, but I haven't figured out all the
details yet and I'm a bit out of ideas, so I'm sharing what I know with
the hope someone takes a look and either spots the issue or has some
other insight ...

On the pgsql-bugs, Depesz reported [1] cases where tables are
added to a publication but end up missing rows on the subscriber. I
didn't know what might be the issue, but given his experience I decided
to do some blind attempts to reproduce the issue.

I'm not going to repeat all the details from the pgsql-bugs thread, but
I ended up writing a script that does randomized stress test tablesync
under concurrent load. Attached are two scripts, where crash-test.sh
does the main work, while run.sh drives the test - executes
crash-test.sh in a loop and generates random parameters for it.

The run.sh script generates the number of tables, the refresh interval
(after how many tables we refresh the subscription) and how long to sleep
between steps (to allow pgbench to do more work).

The crash-test.sh then does this:

1) initializes two clusters (expects $PATH to have pg_ctl etc.)

2) configures them for logical replication (wal_level, ...)

3) creates publication and subscription on the nodes

4) creates a bunch of tables

5) starts a pgbench that inserts data into the tables

6) adds the tables to the publication one by one, occasionally
refreshing the subscription

7) waits for tablesync of all the tables to complete (so that the
tables get into the 'r' state, thus replicating normally)

8) stops the pgbench

9) waits for the subscriber to fully catch up

10) checks that the tables match on the publisher/subscriber nodes

To run this, just make sure PATH includes the PostgreSQL binaries, and do e.g.

./run.sh 10

which does 10 runs of crash-test.sh with random parameters. Each run can
take a couple minutes, depending on the parameters, hardware etc.

Obviously, we expect the tables to match on the two nodes, but the
script regularly detects cases where the subscriber is missing some of
the rows. The script dumps those tables, and the rows contain timestamps
and LSNs to allow "rough correlation" (imperfect thanks to concurrency).

Depesz reported "gaps" in the data, i.e. missing a chunk of data, but
then following rows seemingly replicated. I did see such cases too, but
most of the time I see a missing chunk of rows at the end (but maybe if
the test continued a bit longer, it'd replicate some rows).

The report talks about replication between pg12->pg14, but I don't think
the cross-version part is necessary - I'm able to reproduce the issue on
individual versions (e.g. 12->12) since 12 (I haven't tried 11, but I'd
be surprised if it wasn't affected too).

The rows include `pg_current_wal_lsn()` to roughly track the LSN where
the row is inserted, and the "gap" of missing rows for each table seems
to match pg_subscription_rel.srsublsn, i.e. the LSN up to which tablesync
copied the data and after which the table should be replicated as usual.

Another interesting observation is that the issue only happens for "bulk
insert" transactions, i.e.

BEGIN;
... INSERT into all tables ...
COMMIT;

but not when each insert is a separate transaction. A bit strange.

After quite a bit of debugging, I came to the conclusion this happens
because we fail to invalidate caches on the publisher, so it does not
realize it should start sending rows for that table.

In particular, we initially build RelationSyncEntry when the table is
not yet included in the publication, so we end up with pubinsert=false,
thus not replicating the inserts. Which makes sense, but we then seem
to fail to invalidate the entry after it's added to the publication.

The other problem is that even if we do invalidate the entry, we then
call GetRelationPublications(). But even if that call happens long after
the table gets added to the publication (both in time and LSN terms), it
still returns NIL as if the table was in no publication. And we end up
with pubinsert=false, skipping the inserts again.
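
To illustrate (a simplified sketch, not the actual pgoutput.c code - the
real logic lives in get_rel_sync_entry() and the helper name below is made
up), this is roughly how the cached entry ends up with pubinsert=false when
it is built from stale catalog/syscache state:

    /* simplified sketch of what get_rel_sync_entry() effectively does */
    static void
    load_rel_sync_entry(PGOutputData *data, RelationSyncEntry *entry, Oid relid)
    {
        /* publications the relation belongs to, according to the syscache */
        List       *pubids = GetRelationPublications(relid);
        ListCell   *lc;

        entry->pubactions.pubinsert = false;
        entry->pubactions.pubupdate = false;
        entry->pubactions.pubdelete = false;

        foreach(lc, data->publications)
        {
            Publication *pub = (Publication *) lfirst(lc);

            /* stale syscache => the table doesn't appear to be published */
            if (!pub->alltables && !list_member_oid(pubids, pub->oid))
                continue;

            entry->pubactions.pubinsert |= pub->pubactions.pubinsert;
            entry->pubactions.pubupdate |= pub->pubactions.pubupdate;
            entry->pubactions.pubdelete |= pub->pubactions.pubdelete;
        }

        /* cached until some invalidation marks it invalid again */
        entry->replicate_valid = true;
    }

So with a stale GetRelationPublications() result and no later invalidation,
the entry keeps pubinsert=false and the inserts keep being skipped.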

Attached are three patches against master. 0001 adds some debug logging
that I found useful when investigating the issue. 0002 illustrates the
issue by forcefully invalidating the entry for each change, and
implementing a non-syscache variant of GetRelationPublications().
This makes the code unbearably slow, but with both changes in place I
can no longer reproduce the issue. Undoing either of the two changes
makes it reproducible again. (I'll talk about 0003 later.)

I suppose timing matters, so it's possible it gets "fixed" simply
because of that, but I find that unlikely given the number of runs I did
without observing any failure.

Overall, this looks, walks and quacks like a cache invalidation issue,
likely a missing invalidation somewhere in the ALTER PUBLICATION code.
If we fail to invalidate the pg_publication_rel syscache somewhere, that
would obviously explain why GetRelationPublications() returns stale data, but
it would also explain why the RelationSyncEntry is not invalidated, as
that happens in a syscache callback.
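
For reference, the callback I mean is roughly this (simplified from
pgoutput.c, from memory): it is registered for the pg_publication_rel
syscache and just marks all cached entries invalid, so if that syscache
invalidation never arrives, nothing ever resets the entries.

    /* registered via CacheRegisterSyscacheCallback(PUBLICATIONRELMAP, ...) */
    static void
    rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
    {
        HASH_SEQ_STATUS status;
        RelationSyncEntry *entry;

        if (RelationSyncCache == NULL)
            return;

        /* we can't tell which entry the hash value maps to, so reset them all */
        hash_seq_init(&status, RelationSyncCache);
        while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
            entry->replicate_valid = false;
    }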

But I tried to do various crazy things in the ALTER PUBLICATION code,
and none of that worked, so I'm a bit confused/lost.

However, while randomly poking at different things, I realized that if I
change the lock obtained on the relation in OpenTableList() from
ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away.
I don't know why it works, and I don't even recall what exactly led me
to the idea of changing it.

This is what 0003 does - it reverts 0002 and changes the lock level.

AFAIK the logical decoding code doesn't actually acquire locks on the
decoded tables, so why would this change matter? The only place that
does lock the relation is the tablesync, which gets RowExclusiveLock on
it. And it's interesting that RowExclusiveLock does not conflict with
ShareUpdateExclusiveLock, but does with ShareRowExclusiveLock. But why
would this even matter, when the tablesync can only touch the table
after it gets added to the publication?

regards

[1]: /messages/by-id/ZTu8GTDajCkZVjMs@depesz.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-debug-logging.patch (text/x-patch, +47/-2)
0002-experimental-fix.patch (text/x-patch, +53/-2)
0003-alternative-experimental-fix-lock.patch (text/x-patch, +2/-55)
crash-test.sh (application/x-shellscript)
run.sh (application/x-shellscript)
#2Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#1)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:

It seems there's a long-standing data loss issue related to the initial
sync of tables in the built-in logical replication (publications etc.).

:(

Overall, this looks, walks and quacks like a cache invalidation issue,
likely a missing invalidation somewhere in the ALTER PUBLICATION code.

It could also be that pgoutput doesn't have sufficient invalidation
handling.

One thing that looks bogus on the DDL side is how the invalidation handling
interacts with locking.

For tables etc the invalidation handling works because we hold a lock on the
relation before modifying the catalog and don't release that lock until
transaction end. That part is crucial: We queue shared invalidations at
transaction commit, *after* the transaction is marked as visible, but *before*
locks are released. That guarantees that any backend processing invalidations
will see the new contents. However, if the lock on the modified object is
released before transaction commit, other backends can build and use a cache
entry that hasn't processed invalidations (invalidations are processed when
acquiring locks).
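
For reference, the ordering relied on here looks roughly like this (heavily
simplified from CommitTransaction() in xact.c, not a verbatim excerpt):

    RecordTransactionCommit();                  /* commit record, clog update */
    ProcArrayEndTransaction(MyProc, latestXid); /* transaction becomes visible */
    ...
    AtEOXact_Inval(true);                       /* shared invalidations queued */
    ...
    ResourceOwnerRelease(TopTransactionResourceOwner,
                         RESOURCE_RELEASE_LOCKS, true, true);  /* locks go last */

Any backend that could build a cache entry from the new catalog contents has
to acquire the lock first, and acquiring the lock processes the pending
invalidations, so it cannot keep using a pre-change entry.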

While there is such a lock for publications, it seems to be acquired too
late to actually do much good in a number of paths. And not at all in others.

E.g.:

	pubform = (Form_pg_publication) GETSTRUCT(tup);

	/*
	 * If the publication doesn't publish changes via the root partitioned
	 * table, the partition's row filter and column list will be used. So
	 * disallow using WHERE clause and column lists on partitioned table in
	 * this case.
	 */
	if (!pubform->puballtables && publish_via_partition_root_given &&
		!publish_via_partition_root)
	{
		/*
		 * Lock the publication so nobody else can do anything with it. This
		 * prevents concurrent alter to add partitioned table(s) with WHERE
		 * clause(s) and/or column lists which we don't allow when not
		 * publishing via root.
		 */
		LockDatabaseObject(PublicationRelationId, pubform->oid, 0,
						   AccessShareLock);

a) Another session could have modified the publication and made puballtables out-of-date
b) The LockDatabaseObject() uses AccessShareLock, so others can get past this
point as well

b) seems like a copy-paste bug or such?
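
If it is indeed a copy-paste issue, presumably the intent was a
self-conflicting lock level on the publication object, something like the
sketch below (just illustrating the idea, not a reviewed fix):

    /* conflict with other sessions altering the same publication */
    LockDatabaseObject(PublicationRelationId, pubform->oid, 0,
                       ShareUpdateExclusiveLock);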

I don't see any locking of the publication around RemovePublicationRelById(),
for example.

I might just be misunderstanding things the way publication locking is
intended to work.

However, while randomly poking at different things, I realized that if I
change the lock obtained on the relation in OpenTableList() from
ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away.

That's odd. There's cases where changing the lock level can cause invalidation
processing to happen because there is no pre-existing lock for the "new" lock
level, but there was for the old. But OpenTableList() is used when altering
the publications, so I don't see how that connects.

Greetings,

Andres Freund

#3Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

On 2023-11-17 17:54:43 -0800, Andres Freund wrote:

On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:

Overall, this looks, walks and quacks like a cache invalidation issue,
likely a missing invalidation somewhere in the ALTER PUBLICATION code.

I can confirm that something is broken with invalidation handling.

To test this I just used pg_recvlogical to stdout. It's just interesting
whether something arrives, that's easy to discern even with binary output.

CREATE PUBLICATION pb;
src/bin/pg_basebackup/pg_recvlogical --plugin=pgoutput --start --slot test -d postgres -o proto_version=4 -o publication_names=pb -o messages=true -f -

S1: CREATE TABLE d(data text not null);
S1: INSERT INTO d VALUES('d1');
S2: BEGIN; INSERT INTO d VALUES('d2');
S1: ALTER PUBLICATION pb ADD TABLE d;
S2: COMMIT
S2: INSERT INTO d VALUES('d3');
S1: INSERT INTO d VALUES('d4');
RL: <nothing>

Without the 'd2' insert in an in-progress transaction, pgoutput *does* react
to the ALTER PUBLICATION.

I think the problem here is insufficient locking. The ALTER PUBLICATION pb ADD
TABLE d basically modifies the catalog state of 'd', without a lock preventing
other sessions from having a valid cache entry that they could continue to
use. Due to this, decoding S2's transactions that started before S2's commit,
will populate the cache entry with the state as of the time of S1's last
action, i.e. no need to output the change.

The reason this can happen is because OpenTableList() uses
ShareUpdateExclusiveLock. That allows the ALTER PUBLICATION to happen while
there's an ongoing INSERT.
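
For reference, the line in question currently looks like this (in
OpenTableList() in publicationcmds.c), while an INSERT only takes
RowExclusiveLock on the table, which does not conflict with it:

    /* in OpenTableList(), as of the affected branches */
    rel = table_openrv(t->relation, ShareUpdateExclusiveLock);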

I think this isn't just a logical decoding issue. S2's cache state just after
the ALTER PUBLICATION is going to be wrong - the table is already locked,
therefore further operations on the table don't trigger cache invalidation
processing - but the catalog state *has* changed. It's a bigger problem for
logical decoding though, as it's a bit more lazy about invalidation processing
than normal transactions, allowing the problem to persist for longer.

I guess it's not really feasible to just increase the lock level here though
:(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
would perhaps lead to new deadlocks and such? But it also seems quite wrong.

We could brute force this in the logical decoding infrastructure, by
distributing invalidations from catalog modifying transactions to all
concurrent in-progress transactions (like already done for historic catalog
snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd
be a fairly significant increase in overhead.
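
For concreteness, such distribution could look roughly like the sketch
below, modeled on SnapBuildDistributeNewCatalogSnapshot().
ReorderBufferAddInvalidations() does exist, but the function as a whole is
just an illustration of the idea (it ignores details such as transactions
without a base snapshot), not a tested patch:

    static void
    SnapBuildDistributeInvalidations(SnapBuild *builder, XLogRecPtr lsn,
                                     Size nmsgs, SharedInvalidationMessage *msgs)
    {
        dlist_iter  txn_i;

        /* queue the DDL's invalidations into every in-progress toplevel txn */
        dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
        {
            ReorderBufferTXN *txn;

            txn = dlist_container(ReorderBufferTXN, node, txn_i.cur);

            /* subxacts are handled via their toplevel transaction */
            if (rbtxn_is_known_subxact(txn))
                continue;

            ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
                                          nmsgs, msgs);
        }
    }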

Greetings,

Andres Freund

#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#2)
Re: long-standing data loss bug in initial sync of logical replication

On 11/18/23 02:54, Andres Freund wrote:

Hi,

On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:

It seems there's a long-standing data loss issue related to the initial
sync of tables in the built-in logical replication (publications etc.).

:(

Yeah :-(

Overall, this looks, walks and quacks like a cache invalidation issue,
likely a missing invalidation somewhere in the ALTER PUBLICATION code.

It could also be that pgoutput doesn't have sufficient invalidation
handling.

I'm not sure about the details, but it can't be just about pgoutput
failing to react to some syscache invalidation. As described, just
resetting the RelationSyncEntry doesn't fix the issue - it's the
syscache that's not invalidated, IMO. But maybe that's what you mean.

One thing that looks bogus on the DDL side is how the invalidation handling
interacts with locking.

For tables etc the invalidation handling works because we hold a lock on the
relation before modifying the catalog and don't release that lock until
transaction end. That part is crucial: We queue shared invalidations at
transaction commit, *after* the transaction is marked as visible, but *before*
locks are released. That guarantees that any backend processing invalidations
will see the new contents. However, if the lock on the modified object is
released before transaction commit, other backends can build and use a cache
entry that hasn't processed invalidations (invalidations are processed when
acquiring locks).

Right.

While there is such a lock for publications, it seems to be acquired too
late to actually do much good in a number of paths. And not at all in others.

E.g.:

	pubform = (Form_pg_publication) GETSTRUCT(tup);

	/*
	 * If the publication doesn't publish changes via the root partitioned
	 * table, the partition's row filter and column list will be used. So
	 * disallow using WHERE clause and column lists on partitioned table in
	 * this case.
	 */
	if (!pubform->puballtables && publish_via_partition_root_given &&
		!publish_via_partition_root)
	{
		/*
		 * Lock the publication so nobody else can do anything with it. This
		 * prevents concurrent alter to add partitioned table(s) with WHERE
		 * clause(s) and/or column lists which we don't allow when not
		 * publishing via root.
		 */
		LockDatabaseObject(PublicationRelationId, pubform->oid, 0,
						   AccessShareLock);

a) Another session could have modified the publication and made puballtables out-of-date
b) The LockDatabaseObject() uses AccessShareLock, so others can get past this
point as well

b) seems like a copy-paste bug or such?

I don't see any locking of the publication around RemovePublicationRelById(),
for example.

I might just be misunderstanding things the way publication locking is
intended to work.

I've been asking similar questions while investigating this, but the
interactions with logical decoding (which kinda happens concurrently in
terms of WAL, but not concurrently in terms of time), historical
snapshots etc. make my head spin.

However, while randomly poking at different things, I realized that if I
change the lock obtained on the relation in OpenTableList() from
ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away.

That's odd. There's cases where changing the lock level can cause invalidation
processing to happen because there is no pre-existing lock for the "new" lock
level, but there was for the old. But OpenTableList() is used when altering
the publications, so I don't see how that connects.

Yeah, I had the idea that maybe the transaction already holds the lock
on the table, and changing this to ShareRowExclusiveLock makes it
different, possibly triggering a new invalidation or something. But I
did check with gdb, and if I set a breakpoint at OpenTableList, there
are no locks on the table.

But the effect is hard to deny - if I run the test 100 times, with the
ShareUpdateExclusiveLock I get maybe 80 failures. After changing it to
ShareRowExclusiveLock I get 0. Sure, there's some randomness for cases
like this, but this is pretty unlikely.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#3)
Re: long-standing data loss bug in initial sync of logical replication

On 11/18/23 03:54, Andres Freund wrote:

Hi,

On 2023-11-17 17:54:43 -0800, Andres Freund wrote:

On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:

Overall, this looks, walks and quacks like a cache invalidation issue,
likely a missing invalidation somewhere in the ALTER PUBLICATION code.

I can confirm that something is broken with invalidation handling.

To test this I just used pg_recvlogical to stdout. It's just interesting
whether something arrives, that's easy to discern even with binary output.

CREATE PUBLICATION pb;
src/bin/pg_basebackup/pg_recvlogical --plugin=pgoutput --start --slot test -d postgres -o proto_version=4 -o publication_names=pb -o messages=true -f -

S1: CREATE TABLE d(data text not null);
S1: INSERT INTO d VALUES('d1');
S2: BEGIN; INSERT INTO d VALUES('d2');
S1: ALTER PUBLICATION pb ADD TABLE d;
S2: COMMIT
S2: INSERT INTO d VALUES('d3');
S1: INSERT INTO d VALUES('d4');
RL: <nothing>

Without the 'd2' insert in an in-progress transaction, pgoutput *does* react
to the ALTER PUBLICATION.

I think the problem here is insufficient locking. The ALTER PUBLICATION pb ADD
TABLE d basically modifies the catalog state of 'd', without a lock preventing
other sessions from having a valid cache entry that they could continue to
use. Due to this, decoding S2's transactions that started before S2's commit,
will populate the cache entry with the state as of the time of S1's last
action, i.e. no need to output the change.

The reason this can happen is because OpenTableList() uses
ShareUpdateExclusiveLock. That allows the ALTER PUBLICATION to happen while
there's an ongoing INSERT.

I guess this would also explain why changing the lock mode from
ShareUpdateExclusiveLock to ShareRowExclusiveLock changes the behavior.
INSERT acquires RowExclusiveLock, which conflicts only with the latter
of the two.

I think this isn't just a logical decoding issue. S2's cache state just after
the ALTER PUBLICATION is going to be wrong - the table is already locked,
therefore further operations on the table don't trigger cache invalidation
processing - but the catalog state *has* changed. It's a bigger problem for
logical decoding though, as it's a bit more lazy about invalidation processing
than normal transactions, allowing the problem to persist for longer.

Yeah. I'm wondering if there's some other operation acquiring a lock
weaker than RowExclusiveLock that might be affected by this. Because
then we'd need to get an even stronger lock ...

I guess it's not really feasible to just increase the lock level here though
:(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
would perhaps lead to new deadlocks and such? But it also seems quite wrong.

If this really is about the lock being too weak, then I don't see why
it would be wrong. If it's required for correctness, it's not really
wrong, IMO. Sure, stronger locks are not great ...

I'm not sure about the risk of deadlocks. If you do

ALTER PUBLICATION ... ADD TABLE

it's not holding many other locks. It essentially just gets a lock on
the pg_publication catalog, and then on the publication row. That's it.

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

So maybe that's fine? For me, a detected deadlock is better than
silently missing some of the data.

We could brute force this in the logical decoding infrastructure, by
distributing invalidations from catalog modifying transactions to all
concurrent in-progress transactions (like already done for historic catalog
snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd
be a fairly significant increase in overhead.

I have no idea what the overhead would be - perhaps not too bad,
considering catalog changes are not too common (I'm sure there are
extreme cases). And maybe we could even restrict this only to
"interesting" catalogs, or something like that? (However I hate those
weird differences in behavior, it can easily lead to bugs.)

But it feels more like a band-aid than actually fixing the issue.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#5)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote:

I guess it's not really feasible to just increase the lock level here though
:(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
would perhaps lead to new deadlocks and such? But it also seems quite wrong.

If this really is about the lock being too weak, then I don't see why
would it be wrong?

Sorry, that was badly formulated. The wrong bit is the use of
ShareUpdateExclusiveLock.

If it's required for correctness, it's not really wrong, IMO. Sure, stronger
locks are not great ...

I'm not sure about the risk of deadlocks. If you do

ALTER PUBLICATION ... ADD TABLE

it's not holding many other locks. It essentially gets a lock just a
lock on pg_publication catalog, and then the publication row. That's it.

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely
independent of logical decoding. The way the cache stays coherent is catalog
modifications conflicting with anything that builds cache entries. We have a
few cases where we do use lower level locks, but for those we have explicit
analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).

So maybe that's fine? For me, a detected deadlock is better than
silently missing some of the data.

That certainly is true.

We could brute force this in the logical decoding infrastructure, by
distributing invalidations from catalog modifying transactions to all
concurrent in-progress transactions (like already done for historic catalog
snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd
be a fairly significant increase in overhead.

I have no idea what the overhead would be - perhaps not too bad,
considering catalog changes are not too common (I'm sure there are
extreme cases). And maybe we could even restrict this only to
"interesting" catalogs, or something like that? (However I hate those
weird differences in behavior, it can easily lead to bugs.)

But it feels more like a band-aid than actually fixing the issue.

Agreed.

Greetings,

Andres Freund

#7Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#6)
Re: long-standing data loss bug in initial sync of logical replication

On 11/18/23 19:12, Andres Freund wrote:

Hi,

On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote:

I guess it's not really feasible to just increase the lock level here though
:(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
would perhaps lead to new deadlocks and such? But it also seems quite wrong.

If this really is about the lock being too weak, then I don't see why
would it be wrong?

Sorry, that was badly formulated. The wrong bit is the use of
ShareUpdateExclusiveLock.

Ah, you meant the current lock mode seems wrong, not that changing the
locks seems wrong. Yeah, true.

If it's required for correctness, it's not really wrong, IMO. Sure, stronger
locks are not great ...

I'm not sure about the risk of deadlocks. If you do

ALTER PUBLICATION ... ADD TABLE

it's not holding many other locks. It essentially gets a lock just a
lock on pg_publication catalog, and then the publication row. That's it.

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely
independent of logical decoding. The way the cache stays coherent is catalog
modifications conflicting with anything that builds cache entries. We have a
few cases where we do use lower level locks, but for those we have explicit
analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).

Yeah, I got too focused on the issue I triggered, which seems to be
fixed by using SRE (still don't understand why ...). But you're probably
right there may be other cases where SRE would not be sufficient, I
certainly can't prove it'd be safe.

So maybe that's fine? For me, a detected deadlock is better than
silently missing some of the data.

That certainly is true.

We could brute force this in the logical decoding infrastructure, by
distributing invalidations from catalog modifying transactions to all
concurrent in-progress transactions (like already done for historic catalog
snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd
be a fairly significant increase in overhead.

I have no idea what the overhead would be - perhaps not too bad,
considering catalog changes are not too common (I'm sure there are
extreme cases). And maybe we could even restrict this only to
"interesting" catalogs, or something like that? (However I hate those
weird differences in behavior, it can easily lead to bugs.)

But it feels more like a band-aid than actually fixing the issue.

Agreed.

... and it would not fix the other places outside logical decoding.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#7)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:

On 11/18/23 19:12, Andres Freund wrote:

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely
independent of logical decoding. The way the cache stays coherent is catalog
modifications conflicting with anything that builds cache entries. We have a
few cases where we do use lower level locks, but for those we have explicit
analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).

Yeah, I got too focused on the issue I triggered, which seems to be
fixed by using SRE (still don't understand why ...). But you're probably
right there may be other cases where SRE would not be sufficient, I
certainly can't prove it'd be safe.

I think it makes sense here: SRE prevents the problematic "scheduling" in your
test - with SRE, no DML that started before the ALTER PUB ... ADD can commit after it.

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

In a way, the logical decoding cache-invalidation situation is a lot more
atomic than the "normal" situation. During normal operation locking is
strictly required to prevent incoherent states when building a cache entry
after a transaction committed, but before the sinval entries have been
queued. But in the logical decoding case that window doesn't exist.

Greetings,

Andres Freund

#9Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#8)
Re: long-standing data loss bug in initial sync of logical replication

On 11/18/23 22:05, Andres Freund wrote:

Hi,

On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:

On 11/18/23 19:12, Andres Freund wrote:

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely
independent of logical decoding. The way the cache stays coherent is catalog
modifications conflicting with anything that builds cache entries. We have a
few cases where we do use lower level locks, but for those we have explicit
analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).

Yeah, I got too focused on the issue I triggered, which seems to be
fixed by using SRE (still don't understand why ...). But you're probably
right there may be other cases where SRE would not be sufficient, I
certainly can't prove it'd be safe.

I think it makes sense here: SRE prevents the problematic "scheduling" in your
test - with SRE no DML started before ALTER PUB ... ADD can commit after.

If I understand correctly, with the current code (which only gets
ShareUpdateExclusiveLock), we may end up in a situation like this
(sessions A and B):

A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock
A: writes the invalidation message(s) into WAL
B: inserts into table "t"
B: commit
A: commit

With the stronger SRE lock, the commits would have to happen in the
opposite order, because as you say it prevents the bad ordering.

But why would this matter for logical decoding? We accumulate the
invalidations and execute them at transaction commit, or did I miss
something?

So what I think should happen is we get to apply B first, which won't
see the table as part of the publication. It might even build the cache
entries (syscache+relsync), reflecting that. But then we get to execute
A, along with all the invalidations, and that should invalidate them.

I'm clearly missing something, because the SRE does change the behavior,
so there has to be a difference (and by my reasoning there shouldn't be one).

Or maybe it's the other way around? Won't B get the invalidation, but
use a historical snapshot that doesn't yet see the table in publication?

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

In a way, the logical decoding cache-invalidation situation is a lot more
atomic than the "normal" situation. During normal operation locking is
strictly required to prevent incoherent states when building a cache entry
after a transaction committed, but before the sinval entries have been
queued. But in the logical decoding case that window doesn't exist.

Because we apply the invalidations at commit time, so it happens as a
single operation that can't interleave with other sessions?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#9)
Re: long-standing data loss bug in initial sync of logical replication

On 2023-11-19 02:15:33 +0100, Tomas Vondra wrote:

On 11/18/23 22:05, Andres Freund wrote:

Hi,

On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:

On 11/18/23 19:12, Andres Freund wrote:

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely
independent of logical decoding. The way the cache stays coherent is catalog
modifications conflicting with anything that builds cache entries. We have a
few cases where we do use lower level locks, but for those we have explicit
analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).

Yeah, I got too focused on the issue I triggered, which seems to be
fixed by using SRE (still don't understand why ...). But you're probably
right there may be other cases where SRE would not be sufficient, I
certainly can't prove it'd be safe.

I think it makes sense here: SRE prevents the problematic "scheduling" in your
test - with SRE no DML started before ALTER PUB ... ADD can commit after.

If I understand correctly, with the current code (which only gets
ShareUpdateExclusiveLock), we may end up in a situation like this
(sessions A and B):

A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock
A: writes the invalidation message(s) into WAL
B: inserts into table "t"
B: commit
A: commit

I don't think this is the problematic sequence - at least it's not what I had
reproed in
/messages/by-id/20231118025445.crhaeeuvoe2g5dv6@awork3.anarazel.de

Adding line numbers:

1) S1: CREATE TABLE d(data text not null);
2) S1: INSERT INTO d VALUES('d1');
3) S2: BEGIN; INSERT INTO d VALUES('d2');
4) S1: ALTER PUBLICATION pb ADD TABLE d;
5) S2: COMMIT
6) S2: INSERT INTO d VALUES('d3');
7) S1: INSERT INTO d VALUES('d4');
8) RL: <nothing>

The problem with the sequence is that the insert from 3) is decoded *after* 4)
and that to decode the insert (which happened before the ALTER) the catalog
snapshot and cache state is from *before* the ALTER PUBLICATION. Because the
transaction started in 3) doesn't actually modify any catalogs, no
invalidations are executed after decoding it. The result is that the cache
looks like it did at 3), not like after 4). Undesirable timetravel...

It's worth noting that here the cache state is briefly correct after 4); it's
just that after 5) it reverts to the old state.

If 4) instead uses a SRE lock, then S1 will be blocked until S2 commits, and
everything is fine.
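
To spell out why nothing resets the caches in the bad ordering: during
decoding, invalidations travel with the transaction that generated them and
are executed when that transaction is replayed, roughly (simplified, from
memory of reorderbuffer.c):

    /* at the end of replaying a decoded transaction's changes */
    if (txn->ninvalidations > 0)
        ReorderBufferExecuteInvalidations(txn->ninvalidations,
                                          txn->invalidations);

S2's transaction from 3) carries no invalidations of its own, so decoding it
after the ALTER rebuilds the cache entries from its older historic snapshot,
and nothing invalidates them again afterwards.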

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

In a way, the logical decoding cache-invalidation situation is a lot more
atomic than the "normal" situation. During normal operation locking is
strictly required to prevent incoherent states when building a cache entry
after a transaction committed, but before the sinval entries have been
queued. But in the logical decoding case that window doesn't exist.

Because we apply the invalidations at commit time, so it happens as a
single operation that can't interleave with other sessions?

Yea, the situation is much simpler during logical decoding than "originally" -
there's no concurrency.

Greetings,

Andres Freund

#11Vadim Lakt
vadim.lakt@gmail.com
In reply to: Andres Freund (#10)
Re: long-standing data loss bug in initial sync of logical replication

Hi,

On 19.11.2023 09:18, Andres Freund wrote:

Yea, the situation is much simpler during logical decoding than "originally" -
there's no concurrency.

Greetings,

Andres Freund

We've encountered a similar error on our production server.

The case: after adding a table to logical replication, the table
initialization proceeds normally, but new data from the publisher's
table does not appear on the subscriber. Right after we added the
table, we checked and the data was present on the subscriber and
everything looked normal; we only discovered the error some time
later. I have attached scripts to the email.

The patch from the first message also solves this problem.

--
Best regards,
Vadim Lakt

Attachments:

err_logical_replication.zip (application/zip)
#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#10)
Re: long-standing data loss bug in initial sync of logical replication

On Sun, Nov 19, 2023 at 7:48 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-11-19 02:15:33 +0100, Tomas Vondra wrote:

If I understand correctly, with the current code (which only gets
ShareUpdateExclusiveLock), we may end up in a situation like this
(sessions A and B):

A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock
A: writes the invalidation message(s) into WAL
B: inserts into table "t"
B: commit
A: commit

I don't think this is the problematic sequence - at least it's not what I had
reproed in
/messages/by-id/20231118025445.crhaeeuvoe2g5dv6@awork3.anarazel.de

Adding line numbers:

1) S1: CREATE TABLE d(data text not null);
2) S1: INSERT INTO d VALUES('d1');
3) S2: BEGIN; INSERT INTO d VALUES('d2');
4) S1: ALTER PUBLICATION pb ADD TABLE d;
5) S2: COMMIT
6) S2: INSERT INTO d VALUES('d3');
7) S1: INSERT INTO d VALUES('d4');
8) RL: <nothing>

The problem with the sequence is that the insert from 3) is decoded *after* 4)
and that to decode the insert (which happened before the ALTER) the catalog
snapshot and cache state is from *before* the ALTER TABLE. Because the
transaction started in 3) doesn't actually modify any catalogs, no
invalidations are executed after decoding it. The result is that the cache
looks like it did at 3), not like after 4). Undesirable timetravel...

It's worth noting that here the cache state is briefly correct after 4); it's
just that after 5) it reverts to the old state.

If 4) instead uses a SRE lock, then S1 will be blocked until S2 commits, and
everything is fine.

I agree, your analysis looks right to me.

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

--
With Regards,
Amit Kapila.

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#12)
Re: long-standing data loss bug in initial sync of logical replication

On 6/24/24 12:54, Amit Kapila wrote:

...

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
don't think we 'lost track' of it, but it's true we haven't made much
progress recently.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#13)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/24/24 12:54, Amit Kapila wrote:

...

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
don't think we 'lost track' of it, but it's true we haven't done much
progress recently.

Okay, thanks for pointing to the CF entry. Would you like to take care
of this? Are you seeing anything more than the simple fix to use SRE
in OpenTableList()?

--
With Regards,
Amit Kapila.

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Kapila (#14)
Re: long-standing data loss bug in initial sync of logical replication

On 6/25/24 07:04, Amit Kapila wrote:

On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/24/24 12:54, Amit Kapila wrote:

...

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
don't think we 'lost track' of it, but it's true we haven't done much
progress recently.

Okay, thanks for pointing to the CF entry. Would you like to take care
of this? Are you seeing anything more than the simple fix to use SRE
in OpenTableList()?

I did not find a simpler fix than adding the SRE, and I think pretty
much any other fix is guaranteed to be more complex. I don't remember
all the details without relearning them, but IIRC the main
challenge for me was to convince myself it's a sufficient and reliable
fix (and not working simply by chance).

I won't have time to look into this anytime soon, so feel free to take
care of this and push the fix.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#15)
Re: long-standing data loss bug in initial sync of logical replication

On Wed, Jun 26, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/25/24 07:04, Amit Kapila wrote:

On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/24/24 12:54, Amit Kapila wrote:

...

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
don't think we 'lost track' of it, but it's true we haven't done much
progress recently.

Okay, thanks for pointing to the CF entry. Would you like to take care
of this? Are you seeing anything more than the simple fix to use SRE
in OpenTableList()?

I did not find a simpler fix than adding the SRE, and I think pretty
much any other fix is guaranteed to be more complex. I don't remember
all the details without relearning all the details, but IIRC the main
challenge for me was to convince myself it's a sufficient and reliable
fix (and not working simply by chance).

I won't have time to look into this anytime soon, so feel free to take
care of this and push the fix.

Okay, I'll take care of this.

--
With Regards,
Amit Kapila.

#17vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#16)
Re: long-standing data loss bug in initial sync of logical replication

On Thu, 27 Jun 2024 at 08:38, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 26, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/25/24 07:04, Amit Kapila wrote:

On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/24/24 12:54, Amit Kapila wrote:

...

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
don't think we 'lost track' of it, but it's true we haven't done much
progress recently.

Okay, thanks for pointing to the CF entry. Would you like to take care
of this? Are you seeing anything more than the simple fix to use SRE
in OpenTableList()?

I did not find a simpler fix than adding the SRE, and I think pretty
much any other fix is guaranteed to be more complex. I don't remember
all the details without relearning all the details, but IIRC the main
challenge for me was to convince myself it's a sufficient and reliable
fix (and not working simply by chance).

I won't have time to look into this anytime soon, so feel free to take
care of this and push the fix.

Okay, I'll take care of this.

This issue is present in all supported versions. I was able to
reproduce it using the steps recommended by Andres and Tomas's
scripts. I also wrote a small TAP test to verify the
problem. Attached is alternate_lock_HEAD.patch, which includes the
lock modification (Tomas's change) and the TAP test.
To reproduce the issue in the HEAD version, we cannot use the same
test as in the alternate_lock_HEAD patch, because the behavior changes
slightly after the fix: the command now waits for the lock until the
open transaction completes. The attached issue_reproduce_testcase_head.patch
can be used to reproduce the issue through a TAP test on HEAD.
The changes made in the HEAD version do not directly apply to older
branches. For the PG14, PG13, and PG12 branches, you can use
alternate_lock_PG14.patch.

Regards,
Vignesh

Attachments:

alternate_lock_HEAD.patch (text/x-patch, +73/-1)
issue_reproduce_testcase_head.patch (text/x-patch, +66/-0)
alternate_lock_PG14.patch (text/x-patch, +1/-1)
#18Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#17)
Re: long-standing data loss bug in initial sync of logical replication

On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21@gmail.com> wrote:

This issue is present in all supported versions. I was able to
reproduce it using the steps recommended by Andres and Tomas's
scripts. I also conducted a small test through TAP tests to verify the
problem. Attached is the alternate_lock_HEAD.patch, which includes the
lock modification(Tomas's change) and the TAP test.

@@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
/* Allow query cancel in case this takes a long time */
CHECK_FOR_INTERRUPTS();

- rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+ rel = table_openrv(t->relation, ShareRowExclusiveLock);

The comment just above this code ("Open, share-lock, and check all the
explicitly-specified relations") needs modification. It would be
better to explain the reason why we need the SRE lock here.
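
Something along these lines, perhaps (just a suggestion for the kind of
explanation meant here, not wording taken from the patch):

    /*
     * Open, share-lock, and check all the explicitly-specified relations.
     *
     * We take ShareRowExclusiveLock so that ALTER PUBLICATION ... ADD
     * conflicts with concurrent DML (which takes RowExclusiveLock): a
     * transaction that started modifying the table before the ALTER must
     * not be able to commit after it, otherwise logical decoding can keep
     * using a cache entry built from the pre-ALTER catalog state and
     * silently skip the table's changes.
     */
    rel = table_openrv(t->relation, ShareRowExclusiveLock);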

To reproduce the issue in the HEAD version, we cannot use the same
test as in the alternate_lock_HEAD patch because the behavior changes
slightly after the fix to wait for the lock until the open transaction
completes.

But won't the test that reproduces the problem in HEAD be successful
after the code change? If so, can't we use the same test instead of a
slight modification to verify the lock mode?

The attached issue_reproduce_testcase_head.patch can be
used to reproduce the issue through TAP test in HEAD.
The changes made in the HEAD version do not directly apply to older
branches. For PG14, PG13, and PG12 branches, you can use the
alternate_lock_PG14.patch.

Why didn't you include the test in the back branches? If it is due to
background psql stuff, then won't commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=187b8991f70fc3d2a13dc709edd408a8df0be055)
address it?

--
With Regards,
Amit Kapila.

#19vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#18)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, 9 Jul 2024 at 17:05, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21@gmail.com> wrote:

This issue is present in all supported versions. I was able to
reproduce it using the steps recommended by Andres and Tomas's
scripts. I also conducted a small test through TAP tests to verify the
problem. Attached is the alternate_lock_HEAD.patch, which includes the
lock modification(Tomas's change) and the TAP test.

@@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
/* Allow query cancel in case this takes a long time */
CHECK_FOR_INTERRUPTS();

- rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+ rel = table_openrv(t->relation, ShareRowExclusiveLock);

The comment just above this code ("Open, share-lock, and check all the
explicitly-specified relations") needs modification. It would be
better to explain the reason why we need the SRE lock here.

Updated comments for the same.

To reproduce the issue in the HEAD version, we cannot use the same
test as in the alternate_lock_HEAD patch because the behavior changes
slightly after the fix to wait for the lock until the open transaction
completes.

But won't the test that reproduces the problem in HEAD be successful
after the code change? If so, can't we use the same test instead of
slight modification to verify the lock mode?

Before the patch fix, the ALTER PUBLICATION command would succeed
immediately. Now, the ALTER PUBLICATION command waits until it
acquires the ShareRowExclusiveLock. This means that in the test,
previously we waited until the table was added to the
publication, whereas now, after applying the patch, we wait until the
ALTER PUBLICATION command is actively waiting for the
ShareRowExclusiveLock. This waiting step ensures consistent execution
and sequencing of the test each time.

The attached issue_reproduce_testcase_head.patch can be
used to reproduce the issue through TAP test in HEAD.
The changes made in the HEAD version do not directly apply to older
branches. For PG14, PG13, and PG12 branches, you can use the
alternate_lock_PG14.patch.

Why didn't you include the test in the back branches? If it is due to
background psql stuff, then won't commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=187b8991f70fc3d2a13dc709edd408a8df0be055)
address it?

Indeed, I initially believed it wasn't available. Currently, I haven't
incorporated the back branch patch, but I plan to include it in a
subsequent version once there are no review comments on the HEAD
patch.

The updated v2 version patch has the fix for the comments.

Regards,
Vignesh

Attachments:

v2-0001-Fix-random-data-loss-during-logical-replication.patch (text/x-patch, +81/-4)
#20Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#19)
Re: long-standing data loss bug in initial sync of logical replication

On Tue, Jul 9, 2024 at 8:14 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 9 Jul 2024 at 17:05, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 1, 2024 at 10:51 AM vignesh C <vignesh21@gmail.com> wrote:

This issue is present in all supported versions. I was able to
reproduce it using the steps recommended by Andres and Tomas's
scripts. I also conducted a small test through TAP tests to verify the
problem. Attached is the alternate_lock_HEAD.patch, which includes the
lock modification(Tomas's change) and the TAP test.

@@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
/* Allow query cancel in case this takes a long time */
CHECK_FOR_INTERRUPTS();

- rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+ rel = table_openrv(t->relation, ShareRowExclusiveLock);

The comment just above this code ("Open, share-lock, and check all the
explicitly-specified relations") needs modification. It would be
better to explain why we need the SRE lock here.

Updated comments for the same.

The patch missed using ShareRowExclusiveLock for the partitions; see
attached. I haven't tested it, but they should face the same problem.
Apart from that, I have changed the comments in a few places in the
patch.
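
As a rough sketch of the partition case (hypothetical names, and it
assumes the stronger lock is also taken on the children, e.g. via
find_all_inheritors):

CREATE TABLE part_tbl (id int) PARTITION BY RANGE (id);
CREATE TABLE part_tbl_1 PARTITION OF part_tbl
    FOR VALUES FROM (0) TO (100);
CREATE PUBLICATION pub_part;

-- Session 1: insert directly into the leaf partition and keep the
-- transaction open; this takes RowExclusiveLock only on part_tbl_1.
BEGIN;
INSERT INTO part_tbl_1 VALUES (1);

-- Session 2: adding the parent should also block until session 1
-- commits, because the leaf partition has to be locked in
-- ShareRowExclusiveLock mode as well; otherwise session 1's rows can
-- be missed just like for a regular table.
ALTER PUBLICATION pub_part ADD TABLE part_tbl;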

--
With Regards,
Amit Kapila.

Attachments:

v3-0001-Fix-data-loss-during-initial-sync-in-logical-repl.patch (application/octet-stream, +84/-6)