src/test/subscription/t/005_encoding.pl is broken

Started by Tom Laneover 8 years ago8 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

To reproduce the subscription-startup hang that Thomas Munro observed,
I changed src/backend/replication/logical/launcher.c like this:

@@ -427,7 +427,8 @@ retry:
bgw.bgw_notify_pid = MyProcPid;
bgw.bgw_main_arg = Int32GetDatum(slot);

-	if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
+	if (random() < 1000000000 ||
+		!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
 	{
 		/* Failed to start worker, so clean up the worker slot. */
 		LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);

This causes about 50% of worker launch requests to fail.
With the fix I just committed, 002_types.pl gets through fine,
but 005_encoding.pl does not; it sometimes fails like this:

t/005_encoding.pl ..... 1/1
# Failed test 'data replicated to subscriber'
# at t/005_encoding.pl line 49.
# got: ''
# expected: '1'
# Looks like you failed 1 test of 1.
t/005_encoding.pl ..... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests

The reason seems to be that its method of waiting for replication
to happen is completely inapropos. It's watching for the master
to say that the slave has received all the WAL, but that does not
ensure that the logicalrep apply workers have caught up, does it?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#1)
Re: src/test/subscription/t/005_encoding.pl is broken

Hi,

On 2017-09-18 11:50:06 -0400, Tom Lane wrote:

The reason seems to be that its method of waiting for replication
to happen is completely inapropos. It's watching for the master
to say that the slave has received all the WAL, but that does not
ensure that the logicalrep apply workers have caught up, does it?

To my knowledge here's not really any difference between the two in
logical replication. Received changes are immediately applied, there's
no equivalent to a walreceiver queing up "logical wal" onto disk.

So I'm not sure that theory holds.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#2)
Re: src/test/subscription/t/005_encoding.pl is broken

Andres Freund <andres@anarazel.de> writes:

On 2017-09-18 11:50:06 -0400, Tom Lane wrote:

The reason seems to be that its method of waiting for replication
to happen is completely inapropos. It's watching for the master
to say that the slave has received all the WAL, but that does not
ensure that the logicalrep apply workers have caught up, does it?

To my knowledge here's not really any difference between the two in
logical replication. Received changes are immediately applied, there's
no equivalent to a walreceiver queing up "logical wal" onto disk.

So I'm not sure that theory holds.

Well, there's *something* wrong with this test's wait method.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#2)
Re: src/test/subscription/t/005_encoding.pl is broken

On Mon, Sep 18, 2017 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:

To my knowledge here's not really any difference between the two in
logical replication. Received changes are immediately applied, there's
no equivalent to a walreceiver queing up "logical wal" onto disk.

Huh? Decoding and applying the changes has to take some finite time
greater than 0.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#4)
Re: src/test/subscription/t/005_encoding.pl is broken

On Tue, Sep 19, 2017 at 8:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 18, 2017 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:

To my knowledge here's not really any difference between the two in
logical replication. Received changes are immediately applied, there's
no equivalent to a walreceiver queing up "logical wal" onto disk.

Huh? Decoding and applying the changes has to take some finite time
greater than 0.

FWIW, the wait method looks fine to me.

Now, I just had a look at the logs for a failure and a success, and
one difference can be seen in the subscriber's logs as follows:
-LOG:  logical replication table synchronization worker for
subscription "mysub", table "test1" has started
-LOG:  logical replication table synchronization worker for
subscription "mysub", table "test1" has finished
+WARNING:  out of background worker slots
+HINT:  You might need to increase max_worker_processes.
The "+" portion is for a failure, and I think that this causes the
subscription to not consume the changes from the publisher which
explains the failure in the test as the logical worker applying the
changes on the subscriber-side is not here.
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#5)
Re: src/test/subscription/t/005_encoding.pl is broken

Michael Paquier <michael.paquier@gmail.com> writes:

Now, I just had a look at the logs for a failure and a success, and
one difference can be seen in the subscriber's logs as follows:
-LOG:  logical replication table synchronization worker for
subscription "mysub", table "test1" has started
-LOG:  logical replication table synchronization worker for
subscription "mysub", table "test1" has finished
+WARNING:  out of background worker slots
+HINT:  You might need to increase max_worker_processes.
The "+" portion is for a failure, and I think that this causes the
subscription to not consume the changes from the publisher which
explains the failure in the test as the logical worker applying the
changes on the subscriber-side is not here.

That would indicate that something isn't ever retrying the worker
start; but if that's the case, how is it that we get through the
other subscription tests with my random-failure patch in place?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#6)
Re: src/test/subscription/t/005_encoding.pl is broken

On Wed, Sep 20, 2017 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

That would indicate that something isn't ever retrying the worker
start; but if that's the case, how is it that we get through the
other subscription tests with my random-failure patch in place?

I have been able to dig into this issue further, and the problem is
indeed in the wait logic of 005_encoding.pl. It is important to wait
for the initial sync of the subscriber to happen. There is no need to
incorporate the additional wait query in wait_for_caught_up() as well.
Please see the attached which fixes the stability problems for me even
after forcing failures in launcher.c.
--
Michael

Attachments:

tap-subs-encoding.patchapplication/octet-stream; name=tap-subs-encoding.patchDownload+6-0
#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#7)
Re: src/test/subscription/t/005_encoding.pl is broken

Michael Paquier <michael.paquier@gmail.com> writes:

On Wed, Sep 20, 2017 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

That would indicate that something isn't ever retrying the worker
start; but if that's the case, how is it that we get through the
other subscription tests with my random-failure patch in place?

I have been able to dig into this issue further, and the problem is
indeed in the wait logic of 005_encoding.pl. It is important to wait
for the initial sync of the subscriber to happen. There is no need to
incorporate the additional wait query in wait_for_caught_up() as well.
Please see the attached which fixes the stability problems for me even
after forcing failures in launcher.c.

Ah, that makes sense. Pushed.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers