Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

Started by Ilyasov Ianover 1 year ago13 messages
#1Ilyasov Ian
ianilyasov@outlook.com
1 attachment(s)

Hello, hackers!

Recently I've been building postgres with different cflags and cppflags.
And suddenly on REL_15_STABLE, REL_16_STABLE and master
I faced a failure of a src/test/subscription/t/029_on_error.pl test when
      CPPFLAGS="-DWAL_DEBUG"
and
      printf "wal_debug = on\n" >> "${TEMP_CONFIG}"
(or when both publisher and subscriber or only subscriber are run with wal_debug=on)

So I propose a little fix to the test.

Kind regards,
Ian Ilyasov.

Junior Software Developer at Postgres Professional

Attachments:

0001-Test-subscription-029_on_error.pl-fix-when-wal_debug.patchtext/x-patch; name=0001-Test-subscription-029_on_error.pl-fix-when-wal_debug.patchDownload
From 74cc32fe47def995c7cd85d3685215db383d3630 Mon Sep 17 00:00:00 2001
From: Ian Ilyasov <ianilyasov@outlook.com>
Date: Tue, 14 May 2024 13:08:02 +0300
Subject: [PATCH] Fix subscription/029_on_error.pl test when wal_debug=on

wal_debug=on results in additional log entries. Previously we would look
for the first matching line to get LSN to SKIP up to, but, with
wal_debug in mind, we better look for the last matching line to make
sure we won't get an error like
ERROR:  skip WAL location (LSN x/y) must be greater than origin LSN x/y+delta
---
 src/test/subscription/t/029_on_error.pl | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/test/subscription/t/029_on_error.pl b/src/test/subscription/t/029_on_error.pl
index 7a8e424a22e..aedcd9a6213 100644
--- a/src/test/subscription/t/029_on_error.pl
+++ b/src/test/subscription/t/029_on_error.pl
@@ -28,8 +28,9 @@ sub test_skip_lsn
 
 	# Get the finish LSN of the error transaction.
 	my $contents = slurp_file($node_subscriber->logfile, $offset);
+	# We need LSN from the last matching line to avoid getting wrong LSN with wal_debug=on
 	$contents =~
-	  qr/processing remote data for replication origin \"pg_\d+\" during message type "INSERT" for replication target relation "public.tbl" in transaction \d+, finished at ([[:xdigit:]]+\/[[:xdigit:]]+)/
+	  qr/(?:.|\n)*processing remote data for replication origin \"pg_\d+\" during message type "INSERT" for replication target relation "public.tbl" in transaction \d+, finished at ([[:xdigit:]]+\/[[:xdigit:]]+)/
 	  or die "could not get error-LSN";
 	my $lsn = $1;
 
-- 
2.34.1

#2Michael Paquier
michael@paquier.xyz
In reply to: Ilyasov Ian (#1)
Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

On Tue, May 14, 2024 at 10:22:29AM +0000, Ilyasov Ian wrote:

Hello, hackers!

Recently I've been building postgres with different cflags and cppflags.
And suddenly on REL_15_STABLE, REL_16_STABLE and master
I faced a failure of a src/test/subscription/t/029_on_error.pl test when
      CPPFLAGS="-DWAL_DEBUG"
and
      printf "wal_debug = on\n" >> "${TEMP_CONFIG}"
(or when both publisher and subscriber or only subscriber are run with wal_debug=on)

So I propose a little fix to the test.

Rather than assuming that the last line is the one to check, wouldn't
it be better to grab the data from the CONTEXT line located just after
the ERROR reporting the primary key violation?

A multi-line regexp, grabbing the LSN with more matching context based
on the ERROR and the DETAIL strings generating the CONTEXT we want
seems like a more stable alternative to me than grabbing the last line
of the logs.
--
Michael

#3Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#2)
Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

On Wed, May 15, 2024 at 9:26 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, May 14, 2024 at 10:22:29AM +0000, Ilyasov Ian wrote:

Hello, hackers!

Recently I've been building postgres with different cflags and cppflags.
And suddenly on REL_15_STABLE, REL_16_STABLE and master
I faced a failure of a src/test/subscription/t/029_on_error.pl test when
CPPFLAGS="-DWAL_DEBUG"
and
printf "wal_debug = on\n" >> "${TEMP_CONFIG}"
(or when both publisher and subscriber or only subscriber are run with wal_debug=on)

So I propose a little fix to the test.

Rather than assuming that the last line is the one to check, wouldn't
it be better to grab the data from the CONTEXT line located just after
the ERROR reporting the primary key violation?

I guess it could be more work if we want to enhance the test for
ERRORs other than the primary key violation. One simple fix is to
update the log_offset to the location of the LOG after successful
replication of un-conflicted data. For example, the Log location after
we execute the below line in the test:

# Check replicated data
my $res =
$node_subscriber->safe_psql('postgres', "SELECT
count(*) FROM tbl");
is($res, $expected, $msg);

--
With Regards,
Amit Kapila.

#4Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#3)
1 attachment(s)
RE: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

Dear Amit, Ian,

I guess it could be more work if we want to enhance the test for
ERRORs other than the primary key violation. One simple fix is to
update the log_offset to the location of the LOG after successful
replication of un-conflicted data. For example, the Log location after
we execute the below line in the test:

# Check replicated data
my $res =
$node_subscriber->safe_psql('postgres', "SELECT
count(*) FROM tbl");
is($res, $expected, $msg);

I made a patch for confirmation purpose. This worked well on my environment.
Ian, how about you?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachments:

fix_029.diffapplication/octet-stream; name=fix_029.diffDownload
diff --git a/src/test/subscription/t/029_on_error.pl b/src/test/subscription/t/029_on_error.pl
index 5c29e5cc32..16603aa9a7 100644
--- a/src/test/subscription/t/029_on_error.pl
+++ b/src/test/subscription/t/029_on_error.pl
@@ -45,9 +45,8 @@ sub test_skip_lsn
 		"SELECT subskiplsn = '0/0' FROM pg_subscription WHERE subname = 'sub'"
 	);
 
-	# Check the log to ensure that the transaction is skipped, and advance the
-	# offset of the log file for the next test.
-	$offset = $node_subscriber->wait_for_log(
+	# Check the log to ensure that the transaction is skipped
+	$node_subscriber->wait_for_log(
 		qr/LOG: ( [A-Z0-9]+:)? logical replication completed skipping transaction at LSN $lsn/,
 		$offset);
 
@@ -61,6 +60,11 @@ sub test_skip_lsn
 	my $res =
 	  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tbl");
 	is($res, $expected, $msg);
+
+	# Advance the offset of the log file for the next test. The line must be
+	# after the non-conflict transaction because it may output additional log
+	# messages if some debug logs are enabled.
+	$offset = -s $node_subscriber->logfile;
 }
 
 # Create publisher node. Set a low value of logical_decoding_work_mem to test
#5Ilyasov Ian
ianilyasov@outlook.com
In reply to: Hayato Kuroda (Fujitsu) (#4)
RE: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

Dear Hayato,

I made a patch for confirmation purpose. This worked well on my environment.
Ian, how about you?

I checked this patch on my environment. It also works well.
I like this change, but as I see it makes a different approach from Michael's advice.
Honesly, I do not know what would be better for this test.

Kind regards,
Ian Ilyasov.

Junior Software Developer at Postgres Professional

#6Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#3)
Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

On Wed, May 15, 2024 at 05:58:18PM +0530, Amit Kapila wrote:

I guess it could be more work if we want to enhance the test for
ERRORs other than the primary key violation.

And? You could pass the ERROR string expected as argument of the
function if more flexibility is wanted at some point, no? It happens
that you don't require that now, which is fine for the three scenarios
calling test_skip_lsn.

One simple fix is to
update the log_offset to the location of the LOG after successful
replication of un-conflicted data. For example, the Log location after
we execute the below line in the test:

# Check replicated data
my $res =
$node_subscriber->safe_psql('postgres', "SELECT
count(*) FROM tbl");
is($res, $expected, $msg);

That still looks like a shortcut to me, weak to race conditions on
slow machines where more log entries could be generated in-between.
So it seems to me that you'd still want to make sure that the CONTEXT
from which the LSN is retrieved maps to the sole expected error. This
is not going to be stable unless there are stronger checks to avoid
log entries that can parasite the output, and a stronger matching
ensures that.
--
Michael

#7Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#6)
Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

On Thu, May 16, 2024 at 3:43 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, May 15, 2024 at 05:58:18PM +0530, Amit Kapila wrote:

I guess it could be more work if we want to enhance the test for
ERRORs other than the primary key violation.

And? You could pass the ERROR string expected as argument of the
function if more flexibility is wanted at some point, no?

Normally, we consider error_codes for comparison as they are less
prone to change but here it may be okay to use error_string as we can
change it, if required. But let's discuss a bit more on the other
solution being discussed below.

It happens
that you don't require that now, which is fine for the three scenarios
calling test_skip_lsn.

One simple fix is to
update the log_offset to the location of the LOG after successful
replication of un-conflicted data. For example, the Log location after
we execute the below line in the test:

# Check replicated data
my $res =
$node_subscriber->safe_psql('postgres', "SELECT
count(*) FROM tbl");
is($res, $expected, $msg);

That still looks like a shortcut to me, weak to race conditions on
slow machines where more log entries could be generated in-between.
So it seems to me that you'd still want to make sure that the CONTEXT
from which the LSN is retrieved maps to the sole expected error. This
is not going to be stable unless there are stronger checks to avoid
log entries that can parasite the output, and a stronger matching
ensures that.

This can only be a problem if the apply worker generates more LOG
entries with the required context but it won't do that unless there is
an operation on the publisher to replicate. If we see any such danger
then I agree this can break on some slow machines but as of now, I
don't see any such race condition.

--
With Regards,
Amit Kapila.

#8Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#7)
Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

On Thu, May 16, 2024 at 09:00:47AM +0530, Amit Kapila wrote:

This can only be a problem if the apply worker generates more LOG
entries with the required context but it won't do that unless there is
an operation on the publisher to replicate. If we see any such danger
then I agree this can break on some slow machines but as of now, I
don't see any such race condition.

Perhaps, still I'm not completely sure if this assumption is going to
always stand for all the possible configurations we support. So,
rather than coming back to this test again, I would choose to make the
test as stable as possible from the start. That's what mapping with
the error message would offer when grabbing the LSN from the CONTEXT
part of it, because there's a one-one mapping between the expected
ERROR and its CONTEXT from which the information used by the test is
retrieved.
--
Michael

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#8)
Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

On Fri, May 17, 2024 at 5:25 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, May 16, 2024 at 09:00:47AM +0530, Amit Kapila wrote:

This can only be a problem if the apply worker generates more LOG
entries with the required context but it won't do that unless there is
an operation on the publisher to replicate. If we see any such danger
then I agree this can break on some slow machines but as of now, I
don't see any such race condition.

Perhaps, still I'm not completely sure if this assumption is going to
always stand for all the possible configurations we support.

As per my understanding, this will primarily rely on the apply worker
design, not the other configurations, whether we see additional LOG.

So,
rather than coming back to this test again, I would choose to make the
test as stable as possible from the start. That's what mapping with
the error message would offer when grabbing the LSN from the CONTEXT
part of it, because there's a one-one mapping between the expected
ERROR and its CONTEXT from which the information used by the test is
retrieved.

I was slightly hesitant to do an ERROR string-based check because the
error string can change and it doesn't seem to bring additional
stability for this particular test. But if you and others are still
not convinced with the simple fix suggested by me then feel free to
proceed with an error-string based check.

--
With Regards,
Amit Kapila.

#10Ilyasov Ian
ianilyasov@outlook.com
In reply to: Amit Kapila (#9)
1 attachment(s)
RE: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

Dear Michael, Amit, Hayato

I corrected my patch according to what I think
Michael wanted. I attached the new patch to the letter.

--
Kind regards,
Ian Ilyasov.

Junior Software Developer at Postgres Professional

Attachments:

0002-Fix-subscription-029_on_error.pl-test-when-wal_debug.patchapplication/octet-stream; name=0002-Fix-subscription-029_on_error.pl-test-when-wal_debug.patchDownload
From 53972f4a250c9a8c0a89bc199ffe1e8f612e374d Mon Sep 17 00:00:00 2001
From: Ian Ilyasov <ianilyasov@outlook.com>
Date: Wed, 22 May 2024 17:10:42 +0300
Subject: [PATCH] Fix subscription/029_on_error.pl test when wal_debug=on

wal_debug=on results in additional log entries. Previously we would look
for the first matching line to get LSN to SKIP up to, but, with
wal_debug in mind, we better look for the last matching CONTEXT line
after the ERROR line to make sure we won't get an error like
ERROR:  skip WAL location (LSN x/y) must be greater than origin LSN x/y+delta
---
 src/test/subscription/t/029_on_error.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/test/subscription/t/029_on_error.pl b/src/test/subscription/t/029_on_error.pl
index 5c29e5cc32..d0df627b3b 100644
--- a/src/test/subscription/t/029_on_error.pl
+++ b/src/test/subscription/t/029_on_error.pl
@@ -29,7 +29,7 @@ sub test_skip_lsn
 	# Get the finish LSN of the error transaction.
 	my $contents = slurp_file($node_subscriber->logfile, $offset);
 	$contents =~
-	  qr/processing remote data for replication origin \"pg_\d+\" during message type "INSERT" for replication target relation "public.tbl" in transaction \d+, finished at ([[:xdigit:]]+\/[[:xdigit:]]+)/
+	  qr/ERROR:  duplicate key.*\n.*DETAIL:.*\n.*CONTEXT:.* finished at ([[:xdigit:]]+\/[[:xdigit:]]+)/m
 	  or die "could not get error-LSN";
 	my $lsn = $1;
 
-- 
2.34.1

#11Michael Paquier
michael@paquier.xyz
In reply to: Ilyasov Ian (#10)
Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

On Wed, May 22, 2024 at 02:24:37PM +0000, Ilyasov Ian wrote:

I corrected my patch according to what I think
Michael wanted. I attached the new patch to the letter.

Thanks for compiling this patch. Yes, that's the idea.

-	  qr/processing remote data for replication origin \"pg_\d+\" during message type "INSERT" for replication target relation "public.tbl" in transaction \d+, finished at ([[:xdigit:]]+\/[[:xdigit:]]+)/
+	  qr/ERROR:  duplicate key.*\n.*DETAIL:.*\n.*CONTEXT:.* finished at ([[:xdigit:]]+\/[[:xdigit:]]+)/m

There are three CONTEXT strings that could map to this context. It
seems to me that we should keep the 'for replication target relation
"public.tbl" in transaction \d+,', before the "finished at" so as it
is possible to make a difference with the context that has a column
name and the context where there is no target relation. That makes
the regexp longer, but it is not that bad.
--
Michael

#12Ilyasov Ian
ianilyasov@outlook.com
In reply to: Michael Paquier (#11)
1 attachment(s)
RE: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

It seems to me that we should keep the 'for replication target relation

"public.tbl" in transaction \d+,', before the "finished at" so as it
is possible to make a difference with the context that has a column
name and the context where there is no target relation.

I agree. Attached the updated patch.

--
Kind regards,
Ian Ilyasov.

Junior Software Developer at Postgres Professional

Attachments:

0003-Fix-subscription-029_on_error.pl-test-when-wal_debug.patchapplication/octet-stream; name=0003-Fix-subscription-029_on_error.pl-test-when-wal_debug.patchDownload
From e0c3dc8fc84bf557fad12a1b37c0d6c1a11f7f85 Mon Sep 17 00:00:00 2001
From: Ian Ilyasov <ianilyasov@outlook.com>
Date: Thu, 23 May 2024 11:06:57 +0300
Subject: [PATCH] Fix subscription/029_on_error.pl test when wal_debug=on

wal_debug=on results in additional log entries. Previously we would look
for the first matching line to get LSN to SKIP up to, but, with
wal_debug in mind, we better look for the last matching CONTEXT line
after the ERROR line to make sure we won't get an error like
ERROR:  skip WAL location (LSN x/y) must be greater than origin LSN x/y+delta
---
 src/test/subscription/t/029_on_error.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/test/subscription/t/029_on_error.pl b/src/test/subscription/t/029_on_error.pl
index 5c29e5cc32..c6dca18030 100644
--- a/src/test/subscription/t/029_on_error.pl
+++ b/src/test/subscription/t/029_on_error.pl
@@ -29,7 +29,7 @@ sub test_skip_lsn
 	# Get the finish LSN of the error transaction.
 	my $contents = slurp_file($node_subscriber->logfile, $offset);
 	$contents =~
-	  qr/processing remote data for replication origin \"pg_\d+\" during message type "INSERT" for replication target relation "public.tbl" in transaction \d+, finished at ([[:xdigit:]]+\/[[:xdigit:]]+)/
+	  qr/ERROR:  duplicate key.*\n.*DETAIL:.*\n.*CONTEXT:.* for replication target relation "public.tbl" in transaction \d+, finished at ([[:xdigit:]]+\/[[:xdigit:]]+)/m
 	  or die "could not get error-LSN";
 	my $lsn = $1;
 
-- 
2.34.1

#13Michael Paquier
michael@paquier.xyz
In reply to: Ilyasov Ian (#12)
Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

On Thu, May 23, 2024 at 08:12:07AM +0000, Ilyasov Ian wrote:

It seems to me that we should keep the 'for replication target relation

"public.tbl" in transaction \d+,', before the "finished at" so as it
is possible to make a difference with the context that has a column
name and the context where there is no target relation.

I agree. Attached the updated patch.

One issue that you have here is that the regexp detection would still
fail when setting log_error_verbosity = verbose because of the extra
error code added between the ERROR and the string. This caused the
LSN to not be fetched properly.

At the end, I've come up with a regexp that checks for a match of the
error string after the ERROR to not confuse the last part getting the
xdigits, and applied that down to 15. Perhaps I would have added the
first "ERROR" part in the check, but could not come down to it for the
readability of the thing.
--
Michael