BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Started by PG Bug reporting form · over 5 years ago · 51 messages · pgsql-bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 16643
Logged by: Henry Hinze
Email address: henry.hinze@googlemail.com
PostgreSQL version: 13.0
Operating system: Ubuntu 20.04
Description:

The following setup of logical replication for one table works nicely with
PG 12.4. In PG 13.0 the initial startup never finishes. Instead it starts
over and over again.
All settings are default except for wal_level which was changed to
"logical".

To replicate:
-- Create 2 Databases
psql

CREATE DATABASE d1;
CREATE DATABASE d2;

-- In d1: Create table, fill with data, create publication, create
replication slot
psql d1

CREATE TABLE t (f int);
INSERT INTO t SELECT * FROM generate_series(1, 10000000);
ALTER TABLE t ADD PRIMARY KEY (f);

CREATE PUBLICATION p;
ALTER PUBLICATION p ADD TABLE t;
SELECT pg_create_logical_replication_slot('my_slot','pgoutput');

-- In d2: Create table, create subscription

psql d2

CREATE TABLE t (f int);
ALTER TABLE t ADD PRIMARY KEY (f);
CREATE SUBSCRIPTION s CONNECTION 'user=postgres dbname=d1' PUBLICATION p
WITH (create_slot = false, slot_name='my_slot');

Here are the Logs:
**********************
2020-09-29 21:27:13.267 CEST [17178] postgres@d1 LOG: logical decoding
found consistent point at 0/33CDDBB8
2020-09-29 21:27:13.267 CEST [17178] postgres@d1 DETAIL: There are no
running transactions.
2020-09-29 21:27:13.267 CEST [17178] postgres@d1 STATEMENT: select
pg_create_logical_replication_slot('my_slot','pgoutput');
2020-09-29 21:27:37.543 CEST [17229] LOG: logical replication apply worker
for subscription "s" has started
2020-09-29 21:27:37.548 CEST [17230] postgres@d1 LOG: starting logical
decoding for slot "my_slot"
2020-09-29 21:27:37.548 CEST [17230] postgres@d1 DETAIL: Streaming
transactions committing after 0/33CDDBF0, reading WAL from 0/33CDDBB8.
2020-09-29 21:27:37.548 CEST [17230] postgres@d1 LOG: logical decoding
found consistent point at 0/33CDDBB8
2020-09-29 21:27:37.548 CEST [17230] postgres@d1 DETAIL: There are no
running transactions.
2020-09-29 21:27:37.552 CEST [17231] LOG: logical replication table
synchronization worker for subscription "s", table "t" has started
2020-09-29 21:27:37.704 CEST [17232] postgres@d1 LOG: logical decoding
found consistent point at 0/33D49208
2020-09-29 21:27:37.704 CEST [17232] postgres@d1 DETAIL: There are no
running transactions.
2020-09-29 21:27:58.888 CEST [17232] postgres@d1 LOG: starting logical
decoding for slot "my_slot_16398_sync_16393"
2020-09-29 21:27:58.888 CEST [17232] postgres@d1 DETAIL: Streaming
transactions committing after 0/33D49240, reading WAL from 0/33D49208.
2020-09-29 21:27:58.888 CEST [17232] postgres@d1 LOG: logical decoding
found consistent point at 0/33D49208
2020-09-29 21:27:58.888 CEST [17232] postgres@d1 DETAIL: There are no
running transactions.
2020-09-29 21:28:05.138 CEST [17231] ERROR: error reading result of
streaming command:
2020-09-29 21:28:05.143 CEST [17130] LOG: background worker "logical
replication worker" (PID 17231) exited with exit code 1
2020-09-29 21:28:05.145 CEST [17256] LOG: logical replication table
synchronization worker for subscription "s", table "t" has started
2020-09-29 21:28:05.222 CEST [17257] postgres@d1 LOG: logical decoding
found consistent point at 0/62FBD4E8
2020-09-29 21:28:05.222 CEST [17257] postgres@d1 DETAIL: There are no
running transactions.
2020-09-29 21:28:49.187 CEST [17257] postgres@d1 LOG: starting logical
decoding for slot "my_slot_16398_sync_16393"
2020-09-29 21:28:49.187 CEST [17257] postgres@d1 DETAIL: Streaming
transactions committing after 0/62FBD520, reading WAL from 0/62FBD4E8.
2020-09-29 21:28:49.187 CEST [17257] postgres@d1 LOG: logical decoding
found consistent point at 0/62FBD4E8
2020-09-29 21:28:49.187 CEST [17257] postgres@d1 DETAIL: There are no
running transactions.
2020-09-29 21:28:55.672 CEST [17256] ERROR: error reading result of
streaming command:
2020-09-29 21:28:55.676 CEST [17130] LOG: background worker "logical
replication worker" (PID 17256) exited with exit code 1
2020-09-29 21:28:55.679 CEST [17308] LOG: logical replication table
synchronization worker for subscription "s", table "t" has started
2020-09-29 21:28:55.833 CEST [17309] postgres@d1 LOG: logical decoding
found consistent point at 0/9E9AEA38
2020-09-29 21:28:55.833 CEST [17309] postgres@d1 DETAIL: There are no
running transactions.

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

PG Bug reporting form <noreply@postgresql.org> writes:

The following setup of logical replication for one table works nicely with
PG 12.4. In PG 13.0 the initial startup never finishes. Instead it starts
over and over again.

It kinda looks like something is timing out before it's managed to
catch up with the (rather large) initial state of table t. What
non-default replication settings are you using?

regards, tom lane

#3 Henry Hinze
henry.hinze@gmail.com
In reply to: Tom Lane (#2)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Hi Tom,

thanks for your reply!

For my comparison test of PG 12 and 13 I did not change any default setting
except wal_level.

But I also did some tests and increased wal_keep_size
and max_slot_wal_keep_size to 1GB. And I set wal_sender_timeout to 1h but
without success. The setup works in PG 13 only with a small amount of data.

Best,
Henry

On Tue, 29 Sep 2020 at 22:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

PG Bug reporting form <noreply@postgresql.org> writes:

The following setup of logical replication for one table works nicely

with

PG 12.4. In PG 13.0 the initial startup never finishes. Instead it starts
over and over again.

It kinda looks like something is timing out before it's managed to
catch up with the (rather large) initial state of table t. What
non-default replication settings are you using?

regards, tom lane

--
Diplominformatiker Henry Hinze
Dietzgenstraße 75
13156 Berlin
Tel: +49 - 177 - 3160621
USt-ID: DE306639264

#4 Henry Hinze
henry.hinze@gmail.com
In reply to: Henry Hinze (#3)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

I've made an important observation!

Since I had the impression this setup was already working with RC1 of PG
13, I re-installed RC1 and did the same test. And it's working fine!

So it seems that something has changed from RC1 to the final version that
breaks logical replication in some cases.

Best,
Henry

On Wed, 30 Sep 2020 at 08:34, Henry Hinze <henry.hinze@gmail.com> wrote:


Hi Tom,

thanks for your reply!

For my comparison test of PG 12 and 13 I did not change any default
setting except wal_level.

But I also did some tests and increased wal_keep_size
and max_slot_wal_keep_size to 1GB. And I set wal_sender_timeout to 1h but
without success. The setup works in PG 13 only with a small amount of data.

Best,
Henry

On Tue, 29 Sep 2020 at 22:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

PG Bug reporting form <noreply@postgresql.org> writes:

The following setup of logical replication for one table works nicely

with

PG 12.4. In PG 13.0 the initial startup never finishes. Instead it

starts

over and over again.

It kinda looks like something is timing out before it's managed to
catch up with the (rather large) initial state of table t. What
non-default replication settings are you using?

regards, tom lane

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Henry Hinze (#4)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Henry Hinze <henry.hinze@gmail.com> writes:

I've made an important observation!
Since I had the impression this setup was already working with RC1 of PG
13, I re-installed RC1 and did the same test. And it's working fine!

Ugh. So that points the finger at commits 07082b08c/bfb12cd2b,
which are the only nearby change between rc1 and 13.0. A quick
comparison of before-and-after checkouts confirms it.

After some digging around, I realize that that commit actually
resulted in a protocol break. libpqwalreceiver is expecting to
get an additional CommandComplete message after COPY OUT finishes,
per libpqrcv_endstreaming(), and it's no longer getting one.

(I have not read the protocol document to see if this is per spec;
but spec or no, that's what libpqwalreceiver is expecting.)

The question that this raises is how the heck did that get past
our test suites? It seems like the error should have been obvious
to even the most minimal testing.

regards, tom lane

#6 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#5)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

On 2020-Sep-30, Tom Lane wrote:

Henry Hinze <henry.hinze@gmail.com> writes:

I've made an important observation!
Since I had the impression this setup was already working with RC1 of PG
13, I re-installed RC1 and did the same test. And it's working fine!

Ugh. So that points the finger at commits 07082b08c/bfb12cd2b,
which are the only nearby change between rc1 and 13.0. A quick
comparison of before-and-after checkouts confirms it.

Oh dear.

After some digging around, I realize that that commit actually
resulted in a protocol break. libpqwalreceiver is expecting to
get an additional CommandComplete message after COPY OUT finishes,
per libpqrcv_endstreaming(), and it's no longer getting one.

(I have not read the protocol document to see if this is per spec;
but spec or no, that's what libpqwalreceiver is expecting.)

Yeah, definitely.

The minimal fix seems to be to add an EndReplicationCommand() call in
the T_StartReplicationCmd case. Testing this now ...

The question that this raises is how the heck did that get past
our test suites? It seems like the error should have been obvious
to even the most minimal testing.

... yeah, that's indeed an important question. I'm going to guess that
the TAP suites are too forgiving :-(

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#6)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On 2020-Sep-30, Tom Lane wrote:

The question that this raises is how the heck did that get past
our test suites? It seems like the error should have been obvious
to even the most minimal testing.

... yeah, that's indeed an important question. I'm going to guess that
the TAP suites are too forgiving :-(

One thing I noticed while trying to trace this down is that while the
initial table sync is happening, we have *both* a regular
walsender/walreceiver pair and a "sync" pair, eg

postgres 905650 0.0 0.0 186052 11888 ? Ss 17:12 0:00 postgres: logical replication worker for subscription 16398
postgres 905651 50.1 0.0 173704 13496 ? Ss 17:12 0:09 postgres: walsender postgres [local] idle
postgres 905652 104 0.4 186832 148608 ? Rs 17:12 0:19 postgres: logical replication worker for subscription 16398 sync 16393
postgres 905653 12.2 0.0 174380 15524 ? Ss 17:12 0:02 postgres: walsender postgres [local] COPY

Is it supposed to be like that? Notice also that the regular walsender
has consumed significant CPU time; it's not pinning a CPU like the sync
walreceiver is, but it's eating maybe 20% of a CPU according to "top".
I wonder whether in cases with only small tables (which is likely all
that our tests test), the regular walreceiver manages to complete the
table sync despite repeated(?) failures of the sync worker.

regards, tom lane

#8 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#5)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

On 2020-09-30 23:32, Tom Lane wrote:

The question that this raises is how the heck did that get past
our test suites? It seems like the error should have been obvious
to even the most minimal testing.

What's the difference between this case and what the test suite is
testing? Is it that it replicates between two databases on the same
instance?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#7)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

On 2020-09-30 23:52, Tom Lane wrote:

One thing I noticed while trying to trace this down is that while the
initial table sync is happening, we have *both* a regular
walsender/walreceiver pair and a "sync" pair, eg

postgres 905650 0.0 0.0 186052 11888 ? Ss 17:12 0:00 postgres: logical replication worker for subscription 16398
postgres 905651 50.1 0.0 173704 13496 ? Ss 17:12 0:09 postgres: walsender postgres [local] idle
postgres 905652 104 0.4 186832 148608 ? Rs 17:12 0:19 postgres: logical replication worker for subscription 16398 sync 16393
postgres 905653 12.2 0.0 174380 15524 ? Ss 17:12 0:02 postgres: walsender postgres [local] COPY

That's normal. You could also have even more if tables are syncing in
parallel.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Eisentraut (#8)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

On 2020-Oct-01, Peter Eisentraut wrote:

On 2020-09-30 23:32, Tom Lane wrote:

The question that this raises is how the heck did that get past
our test suites? It seems like the error should have been obvious
to even the most minimal testing.

What's the difference between this case and what the test suite is testing?
Is it that it replicates between two databases on the same instance?

I don't know why the tests pass, but the message
ERROR: error reading result of streaming command:
does appear in the logs after running src/test/subscription many times
(I see it in tests 001, 002, 013 and 014, apart from the new one in
100). It's indeed surprising that these tests all pass!

I turned Henry's reproducer into the attached TAP test, and it does
reproduce the problem; but if I reduce the number of rows from 5000 to
1000, then it no longer does. I don't quite see why this would be a
problem with a larger table only. Do you?

The fix is the commented-out line in walsender.c; the test reliably
passes for me if I uncomment that, and the error message disappears from
the server logs in all the other tests.

Attachments:

0001-Restore-double-command-completion-tags.patch (text/x-diff; +30/-3)
#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#10)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On 2020-Oct-01, Peter Eisentraut wrote:

What's the difference between this case and what the test suite is testing?
Is it that it replicates between two databases on the same instance?

I don't know why the tests pass, but the message
ERROR: error reading result of streaming command:
does appear in the logs after running src/test/subscription many times
(I see it in tests 001, 002, 013 and 014, apart from the new one in
100). It's indeed surprising that these tests all pass!

I turned Henry's reproducer into the attached TAP test, and it does
reproduce the problem; but if I reduce the number of rows from 5000 to
1000, then it no longer does. I don't quite see why this would be a
problem with a larger table only. Do you?

I think we really need to figure that out before concluding that this
problem is solved. Now that we've seen this, I'm wondering urgently
what other coverage gaps we've got there.

The fix is the commented-out line in walsender.c; the test reliably
passes for me if I uncomment that, and the error message disappears from
the server logs in all the other tests.

I agree that this is what we need to do code-wise; we can't let the
protocol break stand, or we'll break all sorts of cross-version
replication scenarios. However, we'd better also change the protocol
spec to say that this is what is supposed to happen.

regards, tom lane

#12 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#5)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

On 2020-Sep-30, Tom Lane wrote:

After some digging around, I realize that that commit actually
resulted in a protocol break. libpqwalreceiver is expecting to
get an additional CommandComplete message after COPY OUT finishes,
per libpqrcv_endstreaming(), and it's no longer getting one.

(I have not read the protocol document to see if this is per spec;
but spec or no, that's what libpqwalreceiver is expecting.)

The question that this raises is how the heck did that get past
our test suites? It seems like the error should have been obvious
to even the most minimal testing.

So I think I can answer this now. Petr, as intellectual author of this
code I would appreciate your thoughts on this -- you probably understand
this much better. Same goes for Peter as committer.

I think the logical replication feature is missing a detailed
architectural diagram.

What is happening is that the tablesync code, which is in charge of
initial syncing of tables, has two main code paths. The first is an
optimization (?) in LogicalRepSyncTableStart(): if the sync takes
little time, the process exits immediately (via finish_sync_worker())
without returning control to the main logical replication apply worker
code. In the TAP tests, all tables are pretty
small, so that optimization always fires.

So in that case we never get control back to walrcv_endstreaming (the
complainant in Henry's test case), which is why replication never fails
in the tests. You can verify this because the final "return slotname"
line in LogicalRepSyncTableStart has zero coverage.

The other code path returns the slot name to ApplyWorkerMain, which
does the normal walrcv_startstreaming / LogicalRepApplyLoop /
endstreaming dance. And whenever that code path is taken, all the
src/test/subscription tests fail with the missing command tag, and
they work correctly when the command tag is restored, which is what
we wanted.

My proposal at this stage is to remove the optimization in
LogicalRepSyncTableStart, per 0001, if only because the code becomes
simpler (0002 reindents code). With this, we get a coverage increase
for tablesync.c not only in the final "return slotname" but also for
process_syncing_tables_for_sync() which is currently uncovered:
https://coverage.postgresql.org/src/backend/replication/logical/tablesync.c.gcov.html#268
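Schematically, the two code paths can be modeled like this (a hypothetical Python sketch of the control flow described above, not the actual tablesync.c code; the function and slot names are taken from this thread or invented for illustration):

```python
# Hypothetical model of the two tablesync code paths. The fast path
# exits the worker without ever reaching walrcv_endstreaming, which is
# why the TAP tests (all small tables) never tripped over the missing
# CommandComplete. Simplified; not the real C implementation.

def logical_rep_sync_table_start(sync_finishes_immediately):
    if sync_finishes_immediately:
        # Fast path: exit via finish_sync_worker(); control never
        # returns to ApplyWorkerMain, so endstreaming is never called.
        return ("finish_sync_worker", None)
    # Slow path: hand the slot name back to ApplyWorkerMain, which
    # runs startstreaming / apply loop / endstreaming and hits the
    # protocol break.
    return ("return_slotname", "my_slot_sync")

# Small tables in the TAP tests always took the fast path:
assert logical_rep_sync_table_start(True) == ("finish_sync_worker", None)
# Henry's 10-million-row table took the slow path and exposed the bug:
assert logical_rep_sync_table_start(False)[0] == "return_slotname"
```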

However, and this is one reason why I'd welcome Petr/Peter thoughts on
this, I don't really understand what happens in LogicalRepApplyLoop
afterwards with a tablesync worker; are we actually doing anything
useful there, considering that the actual data copy seems to have
occurred in the CopyFrom() call in copy_table? In other words, by the
time we return control to ApplyWorkerMain with a slot name, isn't the
work all done, and the only thing we need is to synchronize protocol and
close the connection?

Or is it?

Another thing I noticed is that if I crank up the number of rows in the
new t/100_bugs.pl test case(*), the two sync workers run in sequence -- not
in parallel. What's going on with that? Shouldn't there be two
simultaneous workers?

(*) In the submitted patch, the test uses 10 million rows. I intend to
reduce that to, say, 5000 rows in the committed version. Strictly
speaking, there's no need for this test at all since with the
LogicalRepSyncTableStart changes I propose, even t/001_rep_changes.pl
fails. But since my new test case fails even without that change, I
prefer to keep it anyway.

Attachments:

0001-Restore-double-command-completion-tags.patch (text/x-diff; +61/-54)
0002-move-declarations-and-reindent.patch (text/x-diff; +68/-75)
#13 Petr Jelinek
petr@2ndquadrant.com
In reply to: Alvaro Herrera (#12)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Hi,

On 10/10/2020 21:06, Alvaro Herrera wrote:

On 2020-Sep-30, Tom Lane wrote:

After some digging around, I realize that that commit actually
resulted in a protocol break. libpqwalreceiver is expecting to
get an additional CommandComplete message after COPY OUT finishes,
per libpqrcv_endstreaming(), and it's no longer getting one.

(I have not read the protocol document to see if this is per spec;
but spec or no, that's what libpqwalreceiver is expecting.)

The question that this raises is how the heck did that get past
our test suites? It seems like the error should have been obvious
to even the most minimal testing.

So I think I can answer this now. Petr, as intellectual author of this
code I would appreciate your thoughts on this -- you probably understand
this much better. Same goes for Peter as committer.

I think the logical replication feature is missing a detailed
architectural diagram.

What is happening is that the tablesync code, which is in charge of
initial syncing of tables, has two main code paths: the first one is an
optimization (?) in LogicalRepSyncTableStart() that if the sync takes
little time, the process will exit immediately (via
finish_sync_worker()) without returning control to the main logical
replication apply worker code. In the TAP tests, all tables are pretty
small, so that optimization always fires.

It's not only about the size of the tables; it's mainly that there is no
write activity, so the main apply is not moving past the LSN at which
the table sync started. With a bigger table you get at least something
written (running xacts, autovacuum, or whatever) that moves the LSN
forward eventually.

So in that case we never get control back to walrcv_endstreaming (the
complainant in Henry's test case), which is why replication never fails
in the tests. You can verify this because the final "return slotname"
line in LogicalRepSyncTableStart has zero coverage.

Correct.

The other code path returns the slot name to ApplyWorkerMain, which
does the normal walrcv_startstreaming / LogicalRepApplyLoop /
endstreaming dance. And whenever that code path is taken, all the
src/test/subscription tests fail with the missing command tag, and
they work correctly when the command tag is restored, which is what
we wanted.

My proposal at this stage is to remove the optimization in
LogicalRepSyncTableStart, per 0001, if only because the code becomes
simpler (0002 reindents code). With this, we get a coverage increase
for tablesync.c not only in the final "return slotname" but also for
process_syncing_tables_for_sync() which is currently uncovered:
https://coverage.postgresql.org/src/backend/replication/logical/tablesync.c.gcov.html#268

However, and this is one reason why I'd welcome Petr/Peter thoughts on
this, I don't really understand what happens in LogicalRepApplyLoop
afterwards with a tablesync worker; are we actually doing anything
useful there, considering that the actual data copy seems to have
occurred in the CopyFrom() call in copy_table? In other words, by the
time we return control to ApplyWorkerMain with a slot name, isn't the
work all done, and the only thing we need is to synchronize protocol and
close the connection?

There are 2 possible states at that point, either tablesync is ahead
(when main apply lags or nothing is happening on publication side) or
it's behind the main apply. When tablesync is ahead we are indeed done
and just need to update the state of the table (which is what the code
you removed did, but LogicalRepApplyLoop should do it as well, just a
bit later). When it's behind we need to do catchup for that table only
which still happens in the tablesync worker. See the explanation at the
beginning of tablesync.c, it probably needs some small adjustments after
the changes in your first patch.
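The two states Petr describes can be sketched as a simple decision (a hypothetical Python model using made-up LSN integers; the real comparison and state names live in tablesync.c and are more involved):

```python
# Toy model of the hand-off decision when the tablesync worker returns
# control: if the sync slot is at or ahead of the main apply position,
# the work is done and only the table state needs updating; if it is
# behind, the tablesync worker must first catch up on that table only.

def tablesync_next_step(sync_lsn, main_apply_lsn):
    if sync_lsn >= main_apply_lsn:
        # Tablesync is ahead: just mark the table as synced.
        return "update_table_state"
    # Tablesync is behind: replay WAL for this table up to the
    # apply worker's position before handing the table over.
    return "catchup_this_table"

assert tablesync_next_step(sync_lsn=200, main_apply_lsn=100) == "update_table_state"
assert tablesync_next_step(sync_lsn=50, main_apply_lsn=100) == "catchup_this_table"
```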

Another thing I noticed is that if I crank up the number of rows in the
new t/100_bugs.pl test case(*), the two sync workers run in sequence -- not
in parallel. What's going on with that? Shouldn't there be two
simultaneous workers?

That's likely because the second replication slot is waiting for xid
created by the first one to commit or abort while building consistent
snapshot.

--
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/

#14 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Petr Jelinek (#13)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

On 2020-Oct-12, Petr Jelinek wrote:

It's not only about the size of the tables; it's mainly that there is no write
activity, so the main apply is not moving past the LSN at which the table
sync started. With a bigger table you get at least something written
(running xacts, autovacuum, or whatever) that moves the LSN forward eventually.

I see -- yeah, okay.

However, and this is one reason why I'd welcome Petr/Peter thoughts on
this, I don't really understand what happens in LogicalRepApplyLoop
afterwards with a tablesync worker; are we actually doing anything
useful there, considering that the actual data copy seems to have
occurred in the CopyFrom() call in copy_table? In other words, by the
time we return control to ApplyWorkerMain with a slot name, isn't the
work all done, and the only thing we need is to synchronize protocol and
close the connection?

There are 2 possible states at that point, either tablesync is ahead (when
main apply lags or nothing is happening on publication side) or it's behind
the main apply. When tablesync is ahead we are indeed done and just need to
update the state of the table (which is what the code you removed did, but
LogicalRepApplyLoop should do it as well, just a bit later). When it's
behind we need to do catchup for that table only which still happens in the
tablesync worker. See the explanation at the beginning of tablesync.c, it
probably needs some small adjustments after the changes in your first patch.

... Ooh, things start to make some sense now. So how about the
attached? There are some not really related cleanups. (Changes to
protocol.sgml are still pending.)

If I understand correctly, the early exit in tablesync.c is not saving *a
lot* of time (we don't actually skip replaying any WAL), even if it's
saving execution of a bunch of code. So I stand by my position that
removing the code is better because it's clearer about what is actually
happening.

Attachments:

0001-Restore-logical-replication-dupe-command-tags.patch (text/x-diff; +46/-3)
0002-Review-logical-replication-tablesync-code.patch (text/x-diff; +122/-156)
#15 Petr Jelinek
petr@2ndquadrant.com
In reply to: Alvaro Herrera (#14)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Hi,

On 14/10/2020 03:12, Alvaro Herrera wrote:

On 2020-Oct-12, Petr Jelinek wrote:

However, and this is one reason why I'd welcome Petr/Peter thoughts on
this, I don't really understand what happens in LogicalRepApplyLoop
afterwards with a tablesync worker; are we actually doing anything
useful there, considering that the actual data copy seems to have
occurred in the CopyFrom() call in copy_table? In other words, by the
time we return control to ApplyWorkerMain with a slot name, isn't the
work all done, and the only thing we need is to synchronize protocol and
close the connection?

There are 2 possible states at that point, either tablesync is ahead (when
main apply lags or nothing is happening on publication side) or it's behind
the main apply. When tablesync is ahead we are indeed done and just need to
update the state of the table (which is what the code you removed did, but
LogicalRepApplyLoop should do it as well, just a bit later). When it's
behind we need to do catchup for that table only which still happens in the
tablesync worker. See the explanation at the beginning of tablesync.c, it
probably needs some small adjustments after the changes in your first patch.

... Ooh, things start to make some sense now. So how about the
attached? There are some not really related cleanups. (Changes to
protocol.sgml are still pending.)

It would be nice if the new sentences at the beginning of tablesync.c
started with uppercase, but that's about as nitpicky as I can be :)

If I understand correctly, the early exit in tablesync.c is not saving *a
lot* of time (we don't actually skip replaying any WAL), even if it's
saving execution of a bunch of code. So I stand by my position that
removing the code is better because it's clearer about what is actually
happening.

I don't really have any problems with the simplification you propose.
The saved time is probably in order of hundreds of ms which for table
sync is insignificant.

--
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/

#16 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Petr Jelinek (#15)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

On 2020-Oct-14, Petr Jelinek wrote:

It would be nice if the new sentences at the beginning of tablesync.c
started with uppercase, but that's about as nitpicky as I can be :)

OK, fixed that :-) And pushed (to master only). There's one more
change I added at the last minute, which is to remove the 'missing_ok'
parameter of GetSubscriptionRelState. There are some other cosmetic
changes, but nothing of substance.

If I understand correctly, the early exit in tablesync.c is not saving *a
lot* of time (we don't actually skip replaying any WAL), even if it's
saving execution of a bunch of code. So I stand by my position that
removing the code is better because it's clearer about what is actually
happening.

I don't really have any problems with the simplification you propose. The
saved time is probably in order of hundreds of ms which for table sync is
insignificant.

Great, thanks.

If you think this is done ... I have taken a few notes in the process:

* STREAM COMMIT bug?
In apply_handle_stream_commit, we do CommitTransactionCommand, but
apparently in a tablesync worker we shouldn't do it.
* process_syncing_tables_for_apply belongs not in tablesync.c (maybe
worker.c?)
* DropSubscription bug / logicalrep_find_workers
DropSubscription believes that it's valid to get a list of workers, then
kill them one by one. But this is racy since one worker could end for other
reasons and be replaced by a worker for another subscription. This code
should use a loop that restarts after killing each worker.
* subscription_change_cb too coarse
It seems that we should only reset "subscription valid" when *our*
subscription is changed, not other subscriptions. Otherwise we read our own
sub data far too often.
* PostgresNode uses print where it should use Test::More's note()

#17 Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#16)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2020-Oct-14, Petr Jelinek wrote:

It would be nice if the new sentences at the beginning of tablesync.c
started with uppercase, but that's about as nitpicky as I can be :)

OK, fixed that :-) And pushed (to master only). There's one more
change I added at the last minute, which is to remove the 'missing_ok'
parameter of GetSubscriptionRelState. There are some other cosmetic
changes, but nothing of substance.

If I understand correctly, the early exit in tablesync.c is not saving *a
lot* of time (we don't actually skip replaying any WAL), even if it's
saving execution of a bunch of code. So I stand by my position that
removing the code is better because it's clearer about what is actually
happening.

I don't really have any problems with the simplification you propose. The
saved time is probably on the order of hundreds of ms, which is
insignificant for table sync.

Great, thanks.

If you think this is done ... I have taken a few notes in the process:

* STREAM COMMIT bug?
In apply_handle_stream_commit, we do CommitTransactionCommand, but
apparently in a tablesync worker we shouldn't do it.

In the tablesync stage, we don't allow streaming. See pgoutput_startup
where we disable streaming for the init phase. As far as I understand,
for tablesync we create the initial slot during which streaming will
be disabled then we will copy the table (here logical decoding won't
be used) and then allow the apply worker to get any other data which
is inserted in the meantime. Now, I might be missing something here
but if you can explain it a bit more or share some test to show how we
can reach here via tablesync worker then we can discuss the possible
solution.

Sorry for responding late.

--
With Regards,
Amit Kapila.

#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Kapila (#17)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

On 2020-Nov-04, Amit Kapila wrote:

On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

* STREAM COMMIT bug?
In apply_handle_stream_commit, we do CommitTransactionCommand, but
apparently in a tablesync worker we shouldn't do it.

In the tablesync stage, we don't allow streaming. See pgoutput_startup
where we disable streaming for the init phase. As far as I understand,
for tablesync we create the initial slot during which streaming will
be disabled then we will copy the table (here logical decoding won't
be used) and then allow the apply worker to get any other data which
is inserted in the meantime. Now, I might be missing something here
but if you can explain it a bit more or share some test to show how we
can reach here via tablesync worker then we can discuss the possible
solution.

Hmm, okay, that sounds like there would be no bug then. Maybe what we
need is just an assert in apply_handle_stream_commit that
!am_tablesync_worker(), as in the attached patch. Passes tests.

Attachments:

assert.patch (text/x-diff; charset=us-ascii, +3 −0)
#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#18)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

On Wed, Nov 4, 2020 at 7:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2020-Nov-04, Amit Kapila wrote:

On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

* STREAM COMMIT bug?
In apply_handle_stream_commit, we do CommitTransactionCommand, but
apparently in a tablesync worker we shouldn't do it.

In the tablesync stage, we don't allow streaming. See pgoutput_startup
where we disable streaming for the init phase. As far as I understand,
for tablesync we create the initial slot during which streaming will
be disabled then we will copy the table (here logical decoding won't
be used) and then allow the apply worker to get any other data which
is inserted in the meantime. Now, I might be missing something here
but if you can explain it a bit more or share some test to show how we
can reach here via tablesync worker then we can discuss the possible
solution.

Hmm, okay, that sounds like there would be no bug then. Maybe what we
need is just an assert in apply_handle_stream_commit that
!am_tablesync_worker(), as in the attached patch. Passes tests.

+1. But do we want to have this Assert only in the stream_commit API or
in all the stream APIs as well?

--
With Regards,
Amit Kapila.

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#19)
Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

On Thu, Nov 5, 2020 at 9:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 4, 2020 at 7:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2020-Nov-04, Amit Kapila wrote:

On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

* STREAM COMMIT bug?
In apply_handle_stream_commit, we do CommitTransactionCommand, but
apparently in a tablesync worker we shouldn't do it.

In the tablesync stage, we don't allow streaming. See pgoutput_startup
where we disable streaming for the init phase. As far as I understand,
for tablesync we create the initial slot during which streaming will
be disabled then we will copy the table (here logical decoding won't
be used) and then allow the apply worker to get any other data which
is inserted in the meantime. Now, I might be missing something here
but if you can explain it a bit more or share some test to show how we
can reach here via tablesync worker then we can discuss the possible
solution.

Hmm, okay, that sounds like there would be no bug then. Maybe what we
need is just an assert in apply_handle_stream_commit that
!am_tablesync_worker(), as in the attached patch. Passes tests.

+1. But do we want to have this Assert only in the stream_commit API or
in all the stream APIs as well?

One more point to look at here is at what point the tablesync worker
is involved in applying decoded transactions, if any. Basically, I
would like to ensure that if it uses the slot it has initially created
(before copy) then it is probably fine because we don't enable
streaming with it during the initial phase but if it uses the slot to
decode xacts after copy then we need to probably once check if it is
still true that streaming is not enabled at that point. I am not
completely sure if there are existing test cases to cover any such
scenarios so probably thinking a bit more about this might be helpful.

--
With Regards,
Amit Kapila.

#21Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Kapila (#19)
#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#21)
#23Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#22)
#24Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#17)
#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#24)
#26Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#25)
#27Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#26)
#28Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#25)
#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#28)
#30Peter Smith
smithpb2250@gmail.com
In reply to: Alvaro Herrera (#18)
#31Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#30)
#32Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#31)
#33Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#32)
#34Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#33)
#35Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#34)
#36Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#35)
#37Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#36)
#38Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#37)
#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#38)
#40Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#39)
#41Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#40)
#42Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#41)
#43Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#41)
#44Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#43)
#45Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#44)
#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#45)
#47Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#46)
#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#47)
#49Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#48)
#50Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#49)
#51Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#50)