039_end_of_wal: error in "xl_tot_len zero" test

Started by Anton Voloshinalmost 2 years ago20 messages

a.voloshin@postgrespro.ru

almost 2 years ago

3 attachment(s)

Hello, hackers,

I believe there is a small problem in the 039_end_of_wal.pl's
"xl_tot_len zero" test. It supposes that after immediate shutdown the
server, upon startup recovery, should always produce a message matching
"invalid record length at .*: wanted 24, got 0". However, if the
circumstances are just right and we happened to hit exactly on the edge
of WAL page, then the message on startup recovery would be "invalid
magic number 0000 in log segment .*, offset .*". The test does not take
that into account.

Now, reproducing this is somewhat tricky, because exact position in WAL
at the test time depends on what exactly initdb did, and that not only
differs in different major verisons, but also depends on e.g. username
length, locales available, and, perhaps, more. Even though originally
this problem was found "in the wild" on one particular system on one
particular code branch, I've written small helper patch to make
reproduction on master easier, see
0001-repro-for-039_end_of_wal-s-problem-with-page-end.patch.

This patch adds single emit_message of (hopefully) the right size to
make sure we hit end of WAL block right by the time we call
$node->stop('immediate') in "xl_tot_len zero" test. With this patch,
"xl_tot_len zero" test fails every time because the server writes
"invalid magic number 0000 in log segment" while the test still only
expects "invalid record length at .*: wanted 24, got 0". If course, this
0001 patch is *not* meant to be committed, but only as an issue
reproduction helper.

I can think of two possible fixes:

1. Update advance_out_of_record_splitting_zone to also avoid stopping at
exactly the block end:

      my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-    while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
+    while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold || 
$page_offset <= $SizeOfXLogShortPHD)
      {
see 0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patch

We need to compare with $SizeOfXLogShortPHD (and not with zero) because
at that point, even though we didn't actually write out new WAL page
yet, it's header is already in place in memory and taken in account
for LSN reporting.

2. Alternatively, amend "xl_tot_len zero" test to expect "invalid magic
number 0000 in WAL segment" message as well:

$node->start;
ok( $node->log_contains(
+ "invalid magic number 0000 in WAL segment|" .
"invalid record length at .*: expected at least 24, got 0",
$log_size
),
see 0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patch

I think it makes sense to backport whatever the final change would be to
all branches with 039_end_of_wal (REL_12+).

Any thoughts?

Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru

Attachments:

0001-repro-for-039_end_of_wal-s-problem-with-page-end.patchtext/x-patch; charset=UTF-8; name=0001-repro-for-039_end_of_wal-s-problem-with-page-end.patchDownload

From 5f12139816f6c1bc7d625ba8007aedb8f5d04a71 Mon Sep 17 00:00:00 2001
From: Anton Voloshin <a.voloshin@postgrespro.ru>
Date: Thu, 18 Jan 2024 12:45:12 +0300
Subject: [PATCH 1/3] repro for 039_end_of_wal's problem with page
 end

Tags: commitfest_hotfix
---
 src/test/recovery/t/039_end_of_wal.pl | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index f9acc83c7d0..ecf9b6089de 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -36,6 +36,8 @@ my $WAL_SEGMENT_SIZE;
 my $WAL_BLOCK_SIZE;
 my $TLI;
 
+my $SizeOfXLogShortPHD = 24; # rounded up to 8 bytes
+
 # Build path of a WAL segment.
 sub wal_segment_path
 {
@@ -258,9 +260,27 @@ my $prev_lsn;
 note "Single-page end-of-WAL detection";
 ###########################################################################
 
+my $lsn = get_insert_lsn($node);
+my $lsn_offset = $lsn % $WAL_BLOCK_SIZE;
+my $empty_msg_size = ( ( ($ENV{EMPTY_MSG_SIZE} || 51) + 7) / 8) * 8;
+my $commit_msg_size = ( (34 + 7) / 8) * 8;
+# empty logical msg and commit message take this many bytes of WAL:
+my $empty_msg_overhead = $empty_msg_size + $commit_msg_size;
+# cound overhead twice to account for two emit_message calls below:
+# we want to hit the page edge after the second call.
+my $target_lsn_offset = $WAL_BLOCK_SIZE * 2 - $empty_msg_overhead * 2;
+my $msg_size = ($target_lsn_offset - $lsn_offset) % $WAL_BLOCK_SIZE;
+# If we happen to switch to the next WAL block mid-message, reduce the message
+# by $SizeOfXLogShortPHD (minimal page header) to hit the same target.
+if ($lsn_offset + $msg_size >= $WAL_BLOCK_SIZE) { $msg_size -= $SizeOfXLogShortPHD; }
+print "QWE0: $lsn WAL_BLOCK_SIZE='$WAL_BLOCK_SIZE', lsn_offset='$lsn_offset' target_lsn_offset='$target_lsn_offset' msg_size='$msg_size'\n";
+emit_message($node, $msg_size);
+printf "QWE1: %s - after corrective msg\n", $node->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
 # xl_tot_len is 0 (a common case, we hit trailing zeroes).
 emit_message($node, 0);
+printf "QWE2: %s - after zero-length message\n", $node->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
 $end_lsn = advance_out_of_record_splitting_zone($node);
+printf "QWE3: %s - after advance_out_of_record_splitting_szone()\n", $node->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
 $node->stop('immediate');
 my $log_size = -s $node->logfile;
 $node->start;
-- 
2.43.0

0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patchtext/x-patch; charset=UTF-8; name=0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patchDownload

From eebbc755c880071194bbd02825b4c47f11f99213 Mon Sep 17 00:00:00 2001
From: Anton Voloshin <a.voloshin@postgrespro.ru>
Date: Thu, 18 Jan 2024 14:47:52 +0300
Subject: [PATCH 2/3] fix "xl_tot_len zero" test: amend
 advance_out_of_record_splitting_zone

Tags: commitfest_hotfix
---
 src/test/recovery/t/039_end_of_wal.pl | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index ecf9b6089de..3168d991fee 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -173,6 +173,8 @@ sub build_page_header
 # a couple of small records.  This inserts a few records of a fixed size, until
 # the threshold gets close enough to the end of the WAL page inserting records
 # to.
+# Also avoid the very beginning of a page: that would break a test for a
+# zero-length logical message.
 sub advance_out_of_record_splitting_zone
 {
 	my $node = shift;
@@ -180,7 +182,7 @@ sub advance_out_of_record_splitting_zone
 	my $page_threshold = $WAL_BLOCK_SIZE / 4;
 	my $end_lsn = get_insert_lsn($node);
 	my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-	while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
+	while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold || $page_offset <= $SizeOfXLogShortPHD)
 	{
 		emit_message($node, $page_threshold);
 		$end_lsn = get_insert_lsn($node);
-- 
2.43.0

0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patchtext/x-patch; charset=UTF-8; name=0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patchDownload

From 2419ec6780f1e6a003b553066f83d966fee0c15b Mon Sep 17 00:00:00 2001
From: Anton Voloshin <a.voloshin@postgrespro.ru>
Date: Thu, 18 Jan 2024 15:03:11 +0300
Subject: [PATCH 3/3] alt.fix for "xl_tot_len zero" test: accept
 "invalid magic number 0000 in WAL segment" as well

Tags: commitfest_hotfix
---
 src/test/recovery/t/039_end_of_wal.pl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index 3168d991fee..f7a6a722a78 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -287,6 +287,7 @@ $node->stop('immediate');
 my $log_size = -s $node->logfile;
 $node->start;
 ok( $node->log_contains(
+		"invalid magic number 0000 in WAL segment|" .
 		"invalid record length at .*: expected at least 24, got 0", $log_size
 	),
 	"xl_tot_len zero");
-- 
2.43.0

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Anton Voloshin (#1)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

On Fri, Jan 19, 2024 at 1:47 AM Anton Voloshin
<a.voloshin@postgrespro.ru> wrote:

I believe there is a small problem in the 039_end_of_wal.pl's
"xl_tot_len zero" test. It supposes that after immediate shutdown the
server, upon startup recovery, should always produce a message matching
"invalid record length at .*: wanted 24, got 0". However, if the
circumstances are just right and we happened to hit exactly on the edge
of WAL page, then the message on startup recovery would be "invalid
magic number 0000 in log segment .*, offset .*". The test does not take
that into account.

Hi Anton,

Thanks for figuring this out! Right, I see. I will look more closely
when I'm back from summer vacation in a few days, but first reaction:

Now, reproducing this is somewhat tricky, because exact position in WAL
at the test time depends on what exactly initdb did, and that not only
differs in different major verisons, but also depends on e.g. username
length, locales available, and, perhaps, more. Even though originally
this problem was found "in the wild" on one particular system on one
particular code branch, I've written small helper patch to make
reproduction on master easier, see
0001-repro-for-039_end_of_wal-s-problem-with-page-end.patch.

This patch adds single emit_message of (hopefully) the right size to
make sure we hit end of WAL block right by the time we call
$node->stop('immediate') in "xl_tot_len zero" test. With this patch,
"xl_tot_len zero" test fails every time because the server writes
"invalid magic number 0000 in log segment" while the test still only
expects "invalid record length at .*: wanted 24, got 0". If course, this
0001 patch is *not* meant to be committed, but only as an issue
reproduction helper.

I can think of two possible fixes:

1. Update advance_out_of_record_splitting_zone to also avoid stopping at
exactly the block end:
my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-    while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
+    while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold ||
$page_offset <= $SizeOfXLogShortPHD)
{
see 0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patch
We need to compare with $SizeOfXLogShortPHD (and not with zero) because
at that point, even though we didn't actually write out new WAL page
yet, it's header is already in place in memory and taken in account
for LSN reporting.

I like the fact that this preserves the same end-of-WAL case that
we're trying to test. I don't yet have an opinion on the best way to
do it though. Would it be enough to add emit_message($node, 0) after
advance_out_of_record_splitting_zone()? The thing about this one
specific test that is different from the later ones is that it doesn't
actually write a record header at all, it was relying purely on
pre-existing trailing zeroes, but it assumed the page header would be
valid. As you figured out, that isn't true if we were right on the
page boundary. Perhaps advance_out_of_record_splitting_zone()
followed by emit_message(0) would make that always true, even then?

2. Alternatively, amend "xl_tot_len zero" test to expect "invalid magic
number 0000 in WAL segment" message as well:

$node->start;
ok( $node->log_contains(
+ "invalid magic number 0000 in WAL segment|" .
"invalid record length at .*: expected at least 24, got 0",
$log_size
),
see 0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patch

Tolerating the two different messages would weaken the test.

I think it makes sense to backport whatever the final change would be to
all branches with 039_end_of_wal (REL_12+).

Michael Paquier

michael@paquier.xyz

almost 2 years ago

In reply to: Thomas Munro (#2)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

On Fri, Jan 19, 2024 at 11:35:30AM +1300, Thomas Munro wrote:

On Fri, Jan 19, 2024 at 1:47 AM Anton Voloshin
<a.voloshin@postgrespro.ru> wrote:

I believe there is a small problem in the 039_end_of_wal.pl's
"xl_tot_len zero" test. It supposes that after immediate shutdown the
server, upon startup recovery, should always produce a message matching
"invalid record length at .*: wanted 24, got 0". However, if the
circumstances are just right and we happened to hit exactly on the edge
of WAL page, then the message on startup recovery would be "invalid
magic number 0000 in log segment .*, offset .*". The test does not take
that into account.

Thanks for figuring this out! Right, I see. I will look more closely
when I'm back from summer vacation in a few days, but first reaction:

Thomas, are you planning to look at that?
--
Michael

Anton Voloshin

a.voloshin@postgrespro.ru

almost 2 years ago

In reply to: Thomas Munro (#2)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

Hello, Thomas,

On 19/01/2024 01:35, Thomas Munro wrote:

I don't yet have an opinion on the best way to
do it though. Would it be enough to add emit_message($node, 0) after
advance_out_of_record_splitting_zone()?

Yes, indeed that seems to be enough. At least I could not produce any
more "xl_tot_len zero" failures with that addition.

I like this solution the best.

Tolerating the two different messages would weaken the test.

I agree, I also don't really like this option.

--
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru

Thomas Munro

thomas.munro@gmail.com

over 1 year ago

In reply to: Anton Voloshin (#4)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

On Thu, Feb 15, 2024 at 10:40 PM Anton Voloshin
<a.voloshin@postgrespro.ru> wrote:

On 19/01/2024 01:35, Thomas Munro wrote:

I don't yet have an opinion on the best way to
do it though. Would it be enough to add emit_message($node, 0) after
advance_out_of_record_splitting_zone()?

Yes, indeed that seems to be enough. At least I could not produce any
more "xl_tot_len zero" failures with that addition.

I like this solution the best.

Oh, it looks like this new build farm animal "skimmer" might be
reminding us about this issue:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=skimmer&br=HEAD

I don't know why it changed, but since this is an LSN/page alignment
thing, it could be due to external things like an OS upgrade adding
more locales or something that affects initdb. Will look soon and
fix.

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Thomas Munro (#5)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

Thomas Munro <thomas.munro@gmail.com> writes:

Oh, it looks like this new build farm animal "skimmer" might be
reminding us about this issue:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=skimmer&br=HEAD
I don't know why it changed,

At this point it seems indisputable that 7d2c7f08d9 is what broke
skimmer, but that didn't go anywhere near WAL-related code, so how?

My best guess is that that changed the amount of WAL generated by
initdb just enough to make the problem reproduce on this animal.
However, why's it *only* happening on this animal? The amount of
WAL we generate isn't all that system-specific.

regards, tom lane

David Rowley

dgrowleyml@gmail.com

over 1 year ago

In reply to: Tom Lane (#6)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

On Mon, 6 May 2024 at 15:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

My best guess is that that changed the amount of WAL generated by
initdb just enough to make the problem reproduce on this animal.
However, why's it *only* happening on this animal? The amount of
WAL we generate isn't all that system-specific.

I'd say that's a good theory as it's now passing again [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skimmer&dt=2024-05-06%2017%3A43%3A38 after the
recent system_views.sql change done in 521a7156ab.

David

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skimmer&dt=2024-05-06%2017%3A43%3A38

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: David Rowley (#7)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

David Rowley <dgrowleyml@gmail.com> writes:

On Mon, 6 May 2024 at 15:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

My best guess is that that changed the amount of WAL generated by
initdb just enough to make the problem reproduce on this animal.
However, why's it *only* happening on this animal? The amount of
WAL we generate isn't all that system-specific.

I'd say that's a good theory as it's now passing again [1] after the
recent system_views.sql change done in 521a7156ab.

Hm. It occurs to me that there *is* a system-specific component to
the amount of WAL emitted during initdb: the number of locales
that "locale -a" prints translates directly to the number of
rows inserted into pg_collation. So maybe skimmer has a locale
set that's a bit different from anybody else's, and that's what
let it see this issue.

regards, tom lane

Anton Voloshin

a.voloshin@postgrespro.ru

over 1 year ago

In reply to: Tom Lane (#8)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

On 13/05/2024 00:39, Tom Lane wrote:

Hm. It occurs to me that there *is* a system-specific component to
the amount of WAL emitted during initdb: the number of locales
that "locale -a" prints translates directly to the number of
rows inserted into pg_collation. [...]

Yes. Another system-specific circumstance affecting WAL position is the
name length of the unix user doing initdb. I've seen 039_end_of_wal
failing consistently under user aaaaaaaa but working fine with aaaa,
both on the same machine at the same time.

To be more precise, on one particular machine under those particular
circumstances (including set of locales) it would work for any username
with length < 8 or >= 16, but would fail for length 8..15 (in bytes, not
characters, if non-ASCII usernames were used).

--
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru

#10

Nathan Bossart

nathandbossart@gmail.com

over 1 year ago

In reply to: Anton Voloshin (#9)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

I am seeing the exact problem described in this thread on my laptop since
commit 490f869. I have yet to do a thorough investigation, but what I've
seen thus far does seem to fit the subtle-differences-in-generated-WAL
theory. If no one is planning to pick up the fix soon, I will try to.

--
nathan

#11

Thomas Munro

thomas.munro@gmail.com

over 1 year ago

In reply to: Nathan Bossart (#10)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

On Sat, Aug 24, 2024 at 10:33 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

I am seeing the exact problem described in this thread on my laptop since
commit 490f869. I have yet to do a thorough investigation, but what I've
seen thus far does seem to fit the subtle-differences-in-generated-WAL
theory. If no one is planning to pick up the fix soon, I will try to.

Sorry for dropping that. It looks like we know approximately how to
stabilise it, and I'll look at it early next week if you don't beat me
to it, but please feel free if you would like to.

#12

Thomas Munro

thomas.munro@gmail.com

over 1 year ago

In reply to: Thomas Munro (#11)

1 attachment(s)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

On Sat, Aug 24, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, Aug 24, 2024 at 10:33 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

I am seeing the exact problem described in this thread on my laptop since
commit 490f869. I have yet to do a thorough investigation, but what I've
seen thus far does seem to fit the subtle-differences-in-generated-WAL
theory. If no one is planning to pick up the fix soon, I will try to.

Sorry for dropping that. It looks like we know approximately how to
stabilise it, and I'll look at it early next week if you don't beat me
to it, but please feel free if you would like to.

It fails reliably if you nail down the initial conditions like this:

$TLI = $node->safe_psql('postgres',
"SELECT timeline_id FROM pg_control_checkpoint();");

+$node->safe_psql('postgres', "SELECT pg_switch_wal();");
+emit_message($node, 7956);
+
 my $end_lsn;
 my $prev_lsn;

The fix I propose to commit shortly is just the first of those new
lines, to homogenise the initial state. See attached. The previous
idea works too, I think, but this bigger hammer is more obviously
removing variation.

Attachments:

0001-Stabilize-039_end_of_wal-test.patchtext/x-patch; charset=US-ASCII; name=0001-Stabilize-039_end_of_wal-test.patchDownload

From 9f9f16f6325fc12a83d62683503dab7a9bdfdc45 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 29 Aug 2024 17:15:24 +1200
Subject: [PATCH] Stabilize 039_end_of_wal test.

The first test was sensitive to the insert LSN after setting up the
catalogs, which depended on environmental things like the locales on the
OS and usernames.  Switch to a new WAL file before the first test, as a
simple way to put every computer into the same state.

Back-patch to all supported releases.

Reported-by: Anton Voloshin <a.voloshin@postgrespro.ru>
Reported-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/b26aeac2-cb6d-4633-a7ea-945baae83dcf%40postgrespro.ru
---
 src/test/recovery/t/039_end_of_wal.pl | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index f9acc83c7d0..8498e83f4a7 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -251,6 +251,8 @@ $WAL_BLOCK_SIZE = get_int_setting($node, 'wal_block_size');
 $TLI = $node->safe_psql('postgres',
 	"SELECT timeline_id FROM pg_control_checkpoint();");
 
+$node->safe_psql('postgres', "SELECT pg_switch_wal();");
+
 my $end_lsn;
 my $prev_lsn;
 
-- 
2.46.0

#13

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Thomas Munro (#12)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

Thomas Munro <thomas.munro@gmail.com> writes:

The fix I propose to commit shortly is just the first of those new
lines, to homogenise the initial state. See attached. The previous
idea works too, I think, but this bigger hammer is more obviously
removing variation.

+1, but a comment explaining the need for the pg_switch_wal call
seems in order.

regards, tom lane

#14

Nathan Bossart

nathandbossart@gmail.com

over 1 year ago

In reply to: Tom Lane (#13)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

On Thu, Aug 29, 2024 at 01:55:27AM -0400, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

The fix I propose to commit shortly is just the first of those new
lines, to homogenise the initial state. See attached. The previous
idea works too, I think, but this bigger hammer is more obviously
removing variation.

+1, but a comment explaining the need for the pg_switch_wal call
seems in order.

--
nathan

#15

Thomas Munro

thomas.munro@gmail.com

over 1 year ago

In reply to: Nathan Bossart (#14)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

Pushed. Thanks!

#16

Christoph Berg

myon@debian.org

about 1 year ago

In reply to: Thomas Munro (#15)

1 attachment(s)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

Re: Thomas Munro

Pushed. Thanks!

Well...

postgresql 13.17, Debian bullseye, amd64:

t/039_end_of_wal.pl .................. Dubious, test returned 2 (wstat 512, 0x200)
No subtests run

Test Summary Report
-------------------
t/039_end_of_wal.pl (Wstat: 512 Tests: 0 Failed: 0)
Non-zero exit status: 2
Parse errors: No plan found in TAP output
Files=26, Tests=257, 102 wallclock secs ( 0.11 usr 0.04 sys + 14.35 cusr 12.78 csys = 27.28 CPU)
Result: FAIL

Christoph

#17

Thomas Munro

thomas.munro@gmail.com

about 1 year ago

In reply to: Christoph Berg (#16)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

On Fri, Nov 15, 2024 at 4:54 AM Christoph Berg <myon@debian.org> wrote:

postgresql 13.17, Debian bullseye, amd64:

t/039_end_of_wal.pl .................. Dubious, test returned 2 (wstat 512, 0x200)

This seems to be the interesting bit:

******** build/src/test/recovery/tmp_check/log/regress_log_039_end_of_wal
********
No such file or directory at
/home/myon/projects/postgresql/debian/13/build/../src/test/perl/TestLib.pm
line 655.

I assume that must be coming from:

my @scan_result = scan_server_header('access/xlog_internal.h',

... which reaches:

open my $header_h, '<', "$stdout/$header_path" or die "$!"; <-- line 655

Not sure yet what is different in this environment or why you're
suddenly noticing on 13.17. The logic has been there since 13.13 (ie
it was backpatched then).

#18

Thomas Munro

thomas.munro@gmail.com

about 1 year ago

In reply to: Thomas Munro (#17)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

On Fri, Nov 15, 2024 at 11:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Not sure yet what is different in this environment or why you're
suddenly noticing on 13.17. The logic has been there since 13.13 (ie
it was backpatched then).

Hi Christoph,

Also why only this branch, when they all have it? Reproducible or
one-off? Do you have any more clues?

#19

Christoph Berg

myon@debian.org

about 1 year ago

In reply to: Thomas Munro (#18)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

Re: Thomas Munro

Also why only this branch, when they all have it? Reproducible or
one-off? Do you have any more clues?

It is (was?) weird. The PG13 build for bullseye on apt.pg.o worked
normally, but when I built for bullseye-security (Debian LTS), I had
to remove the patch that removed the 039_end_of_wal.pl file because
you had changed the file. The build then failed and I put the
remove-that-file-patch back. I did not try to repeat the build.

I'll try again with today's releases.

Christoph

#20

Christoph Berg

myon@debian.org

about 1 year ago

In reply to: Christoph Berg (#19)

Re: 039_end_of_wal: error in "xl_tot_len zero" test

Re: To Thomas Munro

It is (was?) weird. The PG13 build for bullseye on apt.pg.o worked
normally, but when I built for bullseye-security (Debian LTS), I had
to remove the patch that removed the 039_end_of_wal.pl file because
you had changed the file. The build then failed and I put the
remove-that-file-patch back. I did not try to repeat the build.

I'll try again with today's releases.

******** build/src/test/recovery/tmp_check/log/regress_log_039_end_of_wal ********
No such file or directory at /home/myon/projects/postgresql/debian/13/build/../src/test/perl/TestLib.pm line 655.

It's still happening, but it's Debian's fault. We used to lobotomize
the relocatability of the binaries a bit to cater for our directory
layout, so in the PG13 in bullseye, includedir-server is
/usr/include/postgresql/13/server even when called from a temp tree.

https://salsa.debian.org/postgresql/postgresql/-/blob/13-bullseye/debian/patches/50-per-version-dirs.patch?ref_type=heads

In the meantime, I had noticed that the logic is no longer required
(if it ever was), so the newer branches (including 13 for apt.pg.o)
have a much shorter version of that patch:

https://salsa.debian.org/postgresql/postgresql/-/blob/13/debian/patches/50-per-version-dirs.patch?ref_type=heads

so tl;dr: it's been fixed in the meantime, sorry for the noise.

Christoph