logical replication seems broken

Started by Erik Rijkers, almost 5 years ago, 11 messages
#1 Erik Rijkers
er@xs4all.nl

Hello,

I am seeing errors in replication in a test program that I've been running for years with very little change (since 2017, really [1]).

The symptom:
HEAD-replication fails (most of the time) when cascading 3 instances (master+2 replicas).

HEAD-replication works with 2 instances (master+replica).

I have also compiled a server on top of 4ad31bb2ef25 (avoiding some recent changes) - and this server runs the same test program without failure; so I think the culprit might be somewhere in those changes. Or (always possible) there might be something wrong in my testing - but then again, I do this test a few times every week, and it never fails.
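A minimal sketch of building a second server at that commit for such a side-by-side comparison (the commit id is from this thread; the prefix path and build options are assumptions, not necessarily the setup used here):

# Build an older, known-good tree into its own prefix, alongside a HEAD build.
git checkout 4ad31bb2ef25
./configure --prefix=$HOME/pg_stuff/pg_installations/pgsql.known_good
make -j4 && make install
git checkout master    # then repeat with a different --prefix for the HEAD build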

This weekend I can dig into it some more (make a self-sufficient example) but I thought I'd mention it already. Perhaps one of you sees the light immediately...

Erik Rijkers

[1]: /messages/by-id/3897361c7010c4ac03f358173adbcd60@xs4all.nl

#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Erik Rijkers (#1)
Re: logical replication seems broken

On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers <er@xs4all.nl> wrote:

Hello,

I am seeing errors in replication in a test program that I've been running for years with very little change (since 2017, really [1]).

The symptom:
HEAD-replication fails (most of the time) when cascading 3 instances (master+2 replicas).

HEAD-replication works with 2 instances (master+replica).

I have also compiled a server on top of 4ad31bb2ef25 (avoiding some recent changes) - and this server runs the same test program without failure; so I think the culprit might be somewhere in those changes.

Oh, if you are running this on today's HEAD then the recent commit bff456d7a0 could be the culprit, but I am not sure without knowing the details.
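A quick sketch of checking whether the tree being tested actually contains that commit (generic git usage, not something mentioned in the thread):

# Exits with status 0 if bff456d7a0 is an ancestor of the checked-out HEAD.
git merge-base --is-ancestor bff456d7a0 HEAD && echo "this tree contains bff456d7a0"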

Or (always possible) there might be something wrong in my testing - but then again, I do this test a few times every week, and it never fails.

Do you expect anything in particular while running this test, any expected messages? What exactly is failing?

This weekend I can dig into it some more (make a self-sufficient example) but I thought I'd mention it already. Perhaps one of you sees the light immediately...

That would be really helpful.

--
With Regards,
Amit Kapila.

#3 Noname
er@xs4all.nl
In reply to: Amit Kapila (#2)
1 attachment(s)
Re: logical replication seems broken

On 02/12/2021 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers <er@xs4all.nl> wrote:

Hello,

I am seeing errors in replication in a test program that I've been running for years with very little change (since 2017, really [1]).

Hi,

Here is a test program. Careful, it deletes stuff. And it will need some changes:

I compile postgres server versions into directories like:
$HOME/pg_stuff/pg_installations/pgsql.$project where project is a name

The attached script (logrep_cascade_bug.sh) assumes that two such compiled versions are present (on my machine they are called HEAD and head0):
$HOME/pg_stuff/pg_installations/pgsql.HEAD --> git master as of today - Friday 12 February 2021
$HOME/pg_stuff/pg_installations/pgsql.head0 --> 3063eb17593c, i.e. from 11 February, before the replication changes

In the test script, the bash variables 'project' (and 'BIN') reflect my setup - so they should probably be changed.

The instance from today, 12 February ('HEAD'), has the bug:
it keeps endlessly waiting/looping with 'NOK' (= Not OK).
'Not OK' means: the primary is not identical to all replicas (replica1 seems ok, but replica2 remains empty).

The instance from yesterday, 11 February ('head0'), is ok:
it finishes in 20 s after waiting/looping just 2 or 3 times.
'ok' means: all replicas are identical to the primary (as proven by the md5s).
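For readers without the script, a hand-rolled sketch of the same cascaded shape (it assumes three instances are already running; the ports, object names, and database are placeholders, not the script's actual values):

# Node A (port 6515) publishes; node B (port 6516) subscribes to A and republishes;
# node C (port 6517) subscribes to B.
psql -p 6515 -d postgres -c "CREATE TABLE t (a int)"
psql -p 6515 -d postgres -c "CREATE PUBLICATION pub_a FOR TABLE t"
psql -p 6516 -d postgres -c "CREATE TABLE t (a int)"
psql -p 6516 -d postgres -c "CREATE PUBLICATION pub_b FOR TABLE t"
psql -p 6517 -d postgres -c "CREATE TABLE t (a int)"
# Subscribe C to B before B to A, so that the rows written by B's tablesync
# worker have to be decoded and cascaded on to C.
psql -p 6517 -d postgres -c "CREATE SUBSCRIPTION sub_b CONNECTION 'port=6516 dbname=postgres' PUBLICATION pub_b"
psql -p 6516 -d postgres -c "CREATE SUBSCRIPTION sub_a CONNECTION 'port=6515 dbname=postgres' PUBLICATION pub_a"
psql -p 6515 -d postgres -c "INSERT INTO t SELECT generate_series(1, 10)"
# With the bug, t on port 6516 fills up, but t on port 6517 can stay empty.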

That's all I have for now - I have no deeper idea about what exactly goes wrong.

I hope that helps; let me know if you cannot reproduce the problem.

Erik Rijkers

Attachments:

logrep_cascade_bug.sh (application/x-shellscript)
#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Noname (#3)
1 attachment(s)
Re: logical replication seems broken

On Fri, Feb 12, 2021 at 10:00 PM <er@xs4all.nl> wrote:

On 02/12/2021 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers <er@xs4all.nl> wrote:

Hello,

I am seeing errors in replication in a test program that I've been running for years with very little change (since 2017, really [1]).

Hi,

Here is a test program. Careful, it deletes stuff. And it will need some changes:

Thanks for sharing the test. I think I have found the problem. Actually, it was an existing code problem exposed by commit ce0fdbfe97. In pgoutput_begin_txn(), we were sometimes sending the prepare_write ('w') message but then the actual message was not being sent. This was the case when we didn't find the origin of a txn. This can happen after that commit because we have now started using origins for tablesync workers as well, and those origins are removed once the tablesync workers are finished. We might want to change the behavior related to the origin messages as indicated in the comments, but for now this just fixes the existing code.

Can you please test if the attached fixes the problem at your end as well?
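A sketch of how the attached fix would typically be applied for such a re-test (the patch file name is from this message; the build and install commands are the usual ones and are assumptions here):

# In a checkout of master, apply the proposed fix, rebuild, reinstall,
# then rerun the failing cascading test against the patched binaries.
git apply fix_origin_message_1.patch
make -j4 && make install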

--
With Regards,
Amit Kapila.

Attachments:

fix_origin_message_1.patch (application/octet-stream)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 79765f9696..1b993fb032 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -342,10 +342,6 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	{
 		char	   *origin;
 
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
 		/*----------
 		 * XXX: which behaviour do we want here?
 		 *
@@ -357,7 +353,13 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 		 *----------
 		 */
 		if (replorigin_by_oid(txn->origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
 			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
+		}
+
 	}
 
 	OutputPluginWrite(ctx, true);
@@ -780,12 +782,13 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	{
 		char	   *origin;
 
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
 		if (replorigin_by_oid(txn->origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
 			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+		}
 	}
 
 	OutputPluginWrite(ctx, true);
#5 Erik Rijkers
er@xs4all.nl
In reply to: Amit Kapila (#4)
Re: logical replication seems broken

On 02/13/2021 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 12, 2021 at 10:00 PM <er@xs4all.nl> wrote:

On 02/12/2021 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers <er@xs4all.nl> wrote:

I am seeing errors in replication in a test program that I've been running for years with very little change (since 2017, really [1]).

Hi,

Here is a test program. Careful, it deletes stuff. And it will need some changes:

Thanks for sharing the test. I think I have found the problem. Actually, it was an existing code problem exposed by commit ce0fdbfe97. In pgoutput_begin_txn(), we were sometimes sending the prepare_write ('w') message but then the actual message was not being sent. This was the case when we didn't find the origin of a txn. This can happen after that commit because we have now started using origins for tablesync workers as well, and those origins are removed once the tablesync workers are finished. We might want to change the behavior related to the origin messages as indicated in the comments, but for now this just fixes the existing code.

Can you please test if the attached fixes the problem at your end as well?

[fix_origin_message_1.patch]

I compiled just now a binary from HEAD, and a binary from HEAD+patch

HEAD is still broken; your patch rescues it, so yes, fixed.

Maybe a test (check or check-world) should be added to run a second replica? (Assuming that would have caught this bug)
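For reference, the logical replication TAP tests live under src/test/subscription; a sketch of running that suite, or a single file from it, where such a cascaded-replica test could go (this assumes a source tree configured with TAP tests enabled):

# Run the whole subscription TAP suite, or just one test file.
make -C src/test/subscription check
make -C src/test/subscription check PROVE_TESTS='t/100_bugs.pl'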

Thanks,

Erik Rijkers


#6 vignesh C
vignesh21@gmail.com
In reply to: Erik Rijkers (#5)
1 attachment(s)
Re: logical replication seems broken

On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:

On 02/13/2021 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 12, 2021 at 10:00 PM <er@xs4all.nl> wrote:

On 02/12/2021 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers <er@xs4all.nl> wrote:

I am seeing errors in replication in a test program that I've been running for years with very little change (since 2017, really [1]).

Hi,

Here is a test program. Careful, it deletes stuff. And it will need some changes:

Thanks for sharing the test. I think I have found the problem. Actually, it was an existing code problem exposed by commit ce0fdbfe97. In pgoutput_begin_txn(), we were sometimes sending the prepare_write ('w') message but then the actual message was not being sent. This was the case when we didn't find the origin of a txn. This can happen after that commit because we have now started using origins for tablesync workers as well, and those origins are removed once the tablesync workers are finished. We might want to change the behavior related to the origin messages as indicated in the comments, but for now this just fixes the existing code.

Can you please test if the attached fixes the problem at your end as well?

[fix_origin_message_1.patch]

I compiled just now a binary from HEAD, and a binary from HEAD+patch

HEAD is still broken; your patch rescues it, so yes, fixed.

Maybe a test (check or check-world) should be added to run a second replica? (Assuming that would have caught this bug)

+1 for the idea of having a test for this. I have written a test for this.
Thanks for the fix Amit, I could reproduce the issue without your fix
and verified that the issue gets fixed with the patch you shared.
Attached a patch for the same. Thoughts?

Regards,
Vignesh

Attachments:

v1-0001-Test-for-verifying-data-is-replicated-in-cascaded.patch (text/x-patch)
From eff13970ae34bcdc8590a9602eddce5eb6200195 Mon Sep 17 00:00:00 2001
From: vignesh <vignesh@localhost.localdomain>
Date: Mon, 15 Feb 2021 11:41:55 +0530
Subject: [PATCH v1] Test for verifying data is replicated in cascaded setup.

Test for verifying data is replicated in cascaded setup.
---
 src/test/subscription/t/100_bugs.pl | 65 +++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index d1e407a..afb2d08 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -153,3 +153,68 @@ is($node_twoways->safe_psql('d2', "SELECT count(f) FROM t"),
 	$rows * 2, "2x$rows rows in t");
 is($node_twoways->safe_psql('d2', "SELECT count(f) FROM t2"),
 	$rows * 2, "2x$rows rows in t2");
+
+# Verify table data is synced with cascaded replication setup.
+my $node_pub = get_new_node('testpublisher1');
+$node_pub->init(allows_streaming => 'logical');
+$node_pub->start;
+
+my $node_pub_sub = get_new_node('testpublisher_subscriber');
+$node_pub_sub->init(allows_streaming => 'logical');
+$node_pub_sub->start;
+
+my $node_sub = get_new_node('testsubscriber1');
+$node_sub->init(allows_streaming => 'logical');
+$node_sub->start;
+
+# Create the tables in all nodes.
+$node_pub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+$node_pub_sub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+$node_sub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+
+# Create a cascaded replication setup like:
+# N1 - Create publication testpub1.
+# N2 - Create publication testpub2 and also include subscriber which subscribes
+#      to testpub1.
+# N3 - Create subscription testsub2 subscribes to testpub2.
+$node_pub->safe_psql('postgres',
+	"CREATE PUBLICATION testpub1 FOR TABLE tab1");
+
+$node_pub_sub->safe_psql('postgres',
+	"CREATE PUBLICATION testpub2 FOR TABLE tab1");
+
+my $publisher1_connstr = $node_pub->connstr . ' dbname=postgres';
+my $publisher2_connstr = $node_pub_sub->connstr . ' dbname=postgres';
+
+$node_sub->safe_psql('postgres',
+	"CREATE SUBSCRIPTION testsub2 CONNECTION '$publisher2_connstr' PUBLICATION testpub2"
+);
+
+$node_pub_sub->safe_psql('postgres',
+	"CREATE SUBSCRIPTION testsub1 CONNECTION '$publisher1_connstr' PUBLICATION testpub1"
+);
+
+$node_pub->safe_psql('postgres',
+	"INSERT INTO tab1 values(generate_series(1,10))");
+
+# Verify that the data is cascaded from testpub1 to testsub1 and further from
+# testpub2 (which had testsub1) to testsub2.
+$node_pub->wait_for_catchup('testsub1');
+$node_pub_sub->wait_for_catchup('testsub2');
+
+# Drop subscriptions as we don't need them anymore
+$node_pub_sub->safe_psql('postgres', "DROP SUBSCRIPTION testsub1");
+$node_sub->safe_psql('postgres', "DROP SUBSCRIPTION testsub2");
+
+# Drop publications as we don't need them anymore
+$node_pub->safe_psql('postgres', "DROP PUBLICATION testpub1");
+$node_pub_sub->safe_psql('postgres', "DROP PUBLICATION testpub2");
+
+# Clean up the tables on both publisher and subscriber as we don't need them
+$node_pub->safe_psql('postgres', "DROP TABLE tab1");
+$node_pub_sub->safe_psql('postgres', "DROP TABLE tab1");
+$node_sub->safe_psql('postgres', "DROP TABLE tab1");
+
+$node_pub->stop('fast');
+$node_pub_sub->stop('fast');
+$node_sub->stop('fast');
-- 
1.8.3.1

#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#6)
1 attachment(s)
Re: logical replication seems broken

On Mon, Feb 15, 2021 at 11:53 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:

I compiled just now a binary from HEAD, and a binary from HEAD+patch

HEAD is still broken; your patch rescues it, so yes, fixed.

Maybe a test (check or check-world) should be added to run a second replica? (Assuming that would have caught this bug)

+1 for the idea of having a test for this. I have written a test for this.
Thanks for the fix Amit, I could reproduce the issue without your fix
and verified that the issue gets fixed with the patch you shared.
Attached a patch for the same. Thoughts?

I have slightly modified the comments in the test case to make things
clear. I am planning not to backpatch this because there is no way in
the core code to hit this prior to commit ce0fdbfe97 and we haven't
received any complaints so far. What do you think?

--
With Regards,
Amit Kapila.

Attachments:

fix_origin_message_2.patch (application/octet-stream)
From 8d55de7130d8b0b9587d1b1dc4adc4d1d6010c8c Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 15 Feb 2021 16:50:47 +0530
Subject: [PATCH] Remove the unnecessary PrepareWrite in pgoutput.

This issue exists from the inception of this code (PG-10) but got exposed
by the recent commit ce0fdbfe97 where we are using origins in tablesync
workers. The problem was that we were sometimes sending the prepare_write
('w') message but then the actual message was not being sent and on the
subscriber side, we always expect a message after prepare_write message
which led to this bug.

I refrained from backpatching this because there is no way in the core
code to hit this prior to commit ce0fdbfe97 and we haven't received any
complaints so far.

Reported-by: Erik Rijkers
Author: Amit Kapila and Vignesh C
Tested-by: Erik Rijkers
Discussion: https://postgr.es/m/1295168140.139428.1613133237154@webmailclassic.xs4all.nl
---
 src/backend/replication/pgoutput/pgoutput.c | 19 ++++----
 src/test/subscription/t/100_bugs.pl         | 69 +++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 79765f9..1b993fb 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -342,10 +342,6 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	{
 		char	   *origin;
 
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
 		/*----------
 		 * XXX: which behaviour do we want here?
 		 *
@@ -357,7 +353,13 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 		 *----------
 		 */
 		if (replorigin_by_oid(txn->origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
 			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
+		}
+
 	}
 
 	OutputPluginWrite(ctx, true);
@@ -780,12 +782,13 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	{
 		char	   *origin;
 
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
 		if (replorigin_by_oid(txn->origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
 			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
+		}
 	}
 
 	OutputPluginWrite(ctx, true);
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index d1e407a..b8f46f0 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -153,3 +153,72 @@ is($node_twoways->safe_psql('d2', "SELECT count(f) FROM t"),
 	$rows * 2, "2x$rows rows in t");
 is($node_twoways->safe_psql('d2', "SELECT count(f) FROM t2"),
 	$rows * 2, "2x$rows rows in t2");
+
+# Verify table data is synced with cascaded replication setup. This is mainly
+# to test whether the data written by tablesync worker gets replicated.
+my $node_pub = get_new_node('testpublisher1');
+$node_pub->init(allows_streaming => 'logical');
+$node_pub->start;
+
+my $node_pub_sub = get_new_node('testpublisher_subscriber');
+$node_pub_sub->init(allows_streaming => 'logical');
+$node_pub_sub->start;
+
+my $node_sub = get_new_node('testsubscriber1');
+$node_sub->init(allows_streaming => 'logical');
+$node_sub->start;
+
+# Create the tables in all nodes.
+$node_pub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+$node_pub_sub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+$node_sub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+
+# Create a cascaded replication setup like:
+# N1 - Create publication testpub1.
+# N2 - Create publication testpub2 and also include subscriber which subscribes
+#      to testpub1.
+# N3 - Create subscription testsub2 subscribes to testpub2.
+#
+# Note that subscription on N3 needs to be created before subscription on N2 to
+# test whether the data written by tablesync worker of N2 gets replicated.
+$node_pub->safe_psql('postgres',
+	"CREATE PUBLICATION testpub1 FOR TABLE tab1");
+
+$node_pub_sub->safe_psql('postgres',
+	"CREATE PUBLICATION testpub2 FOR TABLE tab1");
+
+my $publisher1_connstr = $node_pub->connstr . ' dbname=postgres';
+my $publisher2_connstr = $node_pub_sub->connstr . ' dbname=postgres';
+
+$node_sub->safe_psql('postgres',
+	"CREATE SUBSCRIPTION testsub2 CONNECTION '$publisher2_connstr' PUBLICATION testpub2"
+);
+
+$node_pub_sub->safe_psql('postgres',
+	"CREATE SUBSCRIPTION testsub1 CONNECTION '$publisher1_connstr' PUBLICATION testpub1"
+);
+
+$node_pub->safe_psql('postgres',
+	"INSERT INTO tab1 values(generate_series(1,10))");
+
+# Verify that the data is cascaded from testpub1 to testsub1 and further from
+# testpub2 (which had testsub1) to testsub2.
+$node_pub->wait_for_catchup('testsub1');
+$node_pub_sub->wait_for_catchup('testsub2');
+
+# Drop subscriptions as we don't need them anymore
+$node_pub_sub->safe_psql('postgres', "DROP SUBSCRIPTION testsub1");
+$node_sub->safe_psql('postgres', "DROP SUBSCRIPTION testsub2");
+
+# Drop publications as we don't need them anymore
+$node_pub->safe_psql('postgres', "DROP PUBLICATION testpub1");
+$node_pub_sub->safe_psql('postgres', "DROP PUBLICATION testpub2");
+
+# Clean up the tables on both publisher and subscriber as we don't need them
+$node_pub->safe_psql('postgres', "DROP TABLE tab1");
+$node_pub_sub->safe_psql('postgres', "DROP TABLE tab1");
+$node_sub->safe_psql('postgres', "DROP TABLE tab1");
+
+$node_pub->stop('fast');
+$node_pub_sub->stop('fast');
+$node_sub->stop('fast');
-- 
1.8.3.1

#8 vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#7)
Re: logical replication seems broken

On Mon, Feb 15, 2021 at 5:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Feb 15, 2021 at 11:53 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:

I compiled just now a binary from HEAD, and a binary from HEAD+patch

HEAD is still broken; your patch rescues it, so yes, fixed.

Maybe a test (check or check-world) should be added to run a second replica? (Assuming that would have caught this bug)

+1 for the idea of having a test for this. I have written a test for this.
Thanks for the fix Amit, I could reproduce the issue without your fix
and verified that the issue gets fixed with the patch you shared.
Attached a patch for the same. Thoughts?

I have slightly modified the comments in the test case to make things
clear. I am planning not to backpatch this because there is no way in
the core code to hit this prior to commit ce0fdbfe97 and we haven't
received any complaints so far. What do you think?

The changes look fine to me.

Regards,
Vignesh

#9 Noname
er@xs4all.nl
In reply to: Amit Kapila (#7)
Re: logical replication seems broken

On 2021.02.15. 12:31 Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Feb 15, 2021 at 11:53 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:

I compiled just now a binary from HEAD, and a binary from HEAD+patch
HEAD is still broken; your patch rescues it, so yes, fixed.
Maybe a test (check or check-world) should be added to run a second replica? (Assuming that would have caught this bug)

+1 for the idea of having a test for this. I have written a test for this.
Thanks for the fix Amit, I could reproduce the issue without your fix
and verified that the issue gets fixed with the patch you shared.
Attached a patch for the same. Thoughts?

I have slightly modified the comments in the test case to make things
clear. I am planning not to backpatch this because there is no way in
the core code to hit this prior to commit ce0fdbfe97 and we haven't
received any complaints so far. What do you think?

My tests indeed run OK with this.

(I haven't tested whether the newly added test actually tests for the problem that was there - I suppose one of you did that)

Thanks!

Erik Rijkers

#10 vignesh C
vignesh21@gmail.com
In reply to: Noname (#9)
Re: logical replication seems broken

On Mon, Feb 15, 2021 at 6:14 PM <er@xs4all.nl> wrote:

On 2021.02.15. 12:31 Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Feb 15, 2021 at 11:53 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:

I compiled just now a binary from HEAD, and a binary from HEAD+patch
HEAD is still broken; your patch rescues it, so yes, fixed.
Maybe a test (check or check-world) should be added to run a second replica? (Assuming that would have caught this bug)

+1 for the idea of having a test for this. I have written a test for this.
Thanks for the fix Amit, I could reproduce the issue without your fix
and verified that the issue gets fixed with the patch you shared.
Attached a patch for the same. Thoughts?

I have slightly modified the comments in the test case to make things
clear. I am planning not to backpatch this because there is no way in
the core code to hit this prior to commit ce0fdbfe97 and we haven't
received any complaints so far. What do you think?

My tests indeed run OK with this.

(I haven't tested whether the newly added test actually tests for the problem that was there - I suppose one of you did that)

I could re-create the scenario that you had faced with this test. The test case is a simplified version of your script: I have removed the use of pgbench, reduced the number of tables used, and simulated the same problem with a replication setup similar to the one you had used.

Regards,
Vignesh

#11 Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#10)
Re: logical replication seems broken

On Mon, Feb 15, 2021 at 7:50 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, Feb 15, 2021 at 6:14 PM <er@xs4all.nl> wrote:

On 2021.02.15. 12:31 Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Feb 15, 2021 at 11:53 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:

I compiled just now a binary from HEAD, and a binary from HEAD+patch
HEAD is still broken; your patch rescues it, so yes, fixed.
Maybe a test (check or check-world) should be added to run a second replica? (Assuming that would have caught this bug)

+1 for the idea of having a test for this. I have written a test for this.
Thanks for the fix Amit, I could reproduce the issue without your fix
and verified that the issue gets fixed with the patch you shared.
Attached a patch for the same. Thoughts?

I have slightly modified the comments in the test case to make things
clear. I am planning not to backpatch this because there is no way in
the core code to hit this prior to commit ce0fdbfe97 and we haven't
received any complaints so far. What do you think?

My tests indeed run OK with this.

(I haven't tested whether the newly added test actually tests for the problem that was there - I suppose one of you did that)

I could re-create the scenario that you had faced with this test. The test case is a simplified version of your script: I have removed the use of pgbench, reduced the number of tables used, and simulated the same problem with a replication setup similar to the one you had used.

Yeah, I was also able to see the issue with this test without the patch, and it got fixed after applying the patch. Pushed!
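A sketch of one way to repeat that verification using the two patch files from this thread (the commands themselves are assumptions, not quoted from the thread):

# Apply the combined patch (code fix + test), then revert only the code fix,
# rebuild, and confirm the new TAP test fails; re-apply the code-only fix
# from earlier in the thread and confirm the test passes.
git apply fix_origin_message_2.patch
git checkout -- src/backend/replication/pgoutput/pgoutput.c
make -j4 && make install
make -C src/test/subscription check PROVE_TESTS='t/100_bugs.pl'   # expected to fail: the cascaded node never catches up
git apply fix_origin_message_1.patch
make -j4 && make install
make -C src/test/subscription check PROVE_TESTS='t/100_bugs.pl'   # expected to pass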

--
With Regards,
Amit Kapila.