pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

Started by Michael Paquier · about 4 years ago · 20 messages
#1 Michael Paquier
michael@paquier.xyz

Add TAP test for archive_cleanup_command and recovery_end_command

This adds tests checking for the execution of both commands. The
recovery test 002_archiving.pl is nicely adapted to that, as promotion
is already triggered twice there, and even if either command fails it
does not affect recovery or promotion.

Command success is checked using a file generated by an "echo"
command, which should work in all the buildfarm environments, even
Msys (but we'll know soon about that). Command failure is tested with
an "echo" command that points to a path that does not exist, scanning
the backend logs to make sure that the failure happens. Both rely on
the backend running the commands from the root of the data directory,
making the test logic more robust.

Thanks to Neha Sharma for the extra tests on Windows.

Author: Amul Sul, Michael Paquier
Reviewed-by: Andres Freund, Euler Taveira
Discussion: /messages/by-id/CAAJ_b95R_c4T5moq30qsybSU=eDzDHm=4SPiAWaiMWc2OW7=1Q@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/46dea2419ee7895a4eb3d048317682e6f18a17e1

Modified Files
--------------
src/test/recovery/t/002_archiving.pl | 47 +++++++++++++++++++++++++++++++++++-
1 file changed, 46 insertions(+), 1 deletion(-)
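
For a concrete picture of the setup described above, the configuration could
look roughly like this in a TAP test. This is only an illustrative sketch with
made-up marker file names, not the committed 002_archiving.pl:

# Success case: an "echo" run by the backend from the root of the data
# directory writes a marker file that the test can later check for.
$node_standby->append_conf(
	'postgresql.conf', q{
archive_cleanup_command = 'echo archive_cleanup_done > archive_cleanup_command.done'
recovery_end_command = 'echo recovery_end_done > recovery_end_command.done'
});

# Failure case: the redirection target does not exist, so the command fails
# and the backend logs a WARNING that the test can scan for with slurp_file().
$node_standby2->append_conf(
	'postgresql.conf', q{
recovery_end_command = 'echo recovery_end_done > /this/path/does/not/exist/file'
});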

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#1)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

Michael Paquier <michael@paquier.xyz> writes:

Add TAP test for archive_cleanup_command and recovery_end_command

grassquit just showed a non-reproducible failure in this test [1]:

# Postmaster PID for node "standby" is 291160
ok 1 - check content from archives
not ok 2 - archive_cleanup_command executed on checkpoint

# Failed test 'archive_cleanup_command executed on checkpoint'
# at t/002_archiving.pl line 74.

This test is sending a CHECKPOINT command to the standby and
expecting it to run the archive_cleanup_command, but it looks
like the standby did not actually run any checkpoint:

2022-04-07 16:11:33.060 UTC [291806][not initialized][:0] LOG: connection received: host=[local]
2022-04-07 16:11:33.078 UTC [291806][client backend][2/15:0] LOG: connection authorized: user=bf database=postgres application_name=002_archiving.pl
2022-04-07 16:11:33.084 UTC [291806][client backend][2/16:0] LOG: statement: CHECKPOINT
2022-04-07 16:11:33.092 UTC [291806][client backend][:0] LOG: disconnection: session time: 0:00:00.032 user=bf database=postgres host=[local]

I am suspicious that the reason is that ProcessUtility does not
ask for a forced checkpoint when in recovery:

RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
(RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));

The trouble with this theory is that this test has been there for
nearly six months and this is the first such failure (I scraped the
buildfarm logs to be sure). Seems like failures should be a lot
more common than that. I wondered if the recent pg_stats changes
could have affected this, but I don't really see how.

Thoughts?

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2022-04-07%2015%3A45%3A48

#3 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#2)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

Hi,

On 2022-04-07 13:40:30 -0400, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

Add TAP test for archive_cleanup_command and recovery_end_command

grassquit just showed a non-reproducible failure in this test [1]:

I was just staring at that as well.

# Postmaster PID for node "standby" is 291160
ok 1 - check content from archives
not ok 2 - archive_cleanup_command executed on checkpoint

# Failed test 'archive_cleanup_command executed on checkpoint'
# at t/002_archiving.pl line 74.

This test is sending a CHECKPOINT command to the standby and
expecting it to run the archive_cleanup_command, but it looks
like the standby did not actually run any checkpoint:

2022-04-07 16:11:33.060 UTC [291806][not initialized][:0] LOG: connection received: host=[local]
2022-04-07 16:11:33.078 UTC [291806][client backend][2/15:0] LOG: connection authorized: user=bf database=postgres application_name=002_archiving.pl
2022-04-07 16:11:33.084 UTC [291806][client backend][2/16:0] LOG: statement: CHECKPOINT
2022-04-07 16:11:33.092 UTC [291806][client backend][:0] LOG: disconnection: session time: 0:00:00.032 user=bf database=postgres host=[local]

I am suspicious that the reason is that ProcessUtility does not
ask for a forced checkpoint when in recovery:

RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
(RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));

The trouble with this theory is that this test has been there for
nearly six months and this is the first such failure (I scraped the
buildfarm logs to be sure). Seems like failures should be a lot
more common than that.

I wondered if the recent pg_stats changes could have affected this, but I
don't really see how.

I don't really see either. It's a bit more conceivable that the recovery
prefetching changes could affect the timing sufficiently?

It's also possible that it requires an animal of a certain speed to happen -
we didn't have an -fsanitize=address animal until recently.

I guess we'll have to wait and see what the frequency of the problem is?

Greetings,

Andres Freund

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

Andres Freund <andres@anarazel.de> writes:

On 2022-04-07 13:40:30 -0400, Tom Lane wrote:

This test is sending a CHECKPOINT command to the standby and
expecting it to run the archive_cleanup_command, but it looks
like the standby did not actually run any checkpoint:
...
I wondered if the recent pg_stats changes could have affected this, but I
don't really see how.

I don't really see either. It's a bit more conceivable that the recovery
prefetching changes could affect the timing sufficiently?

Oh, that's at least a little plausible.

I guess we'll have to wait and see what the frequency of the problem is?

Yeah, with only one instance it could just be cosmic rays or something.
However, assuming it is real, I guess I wonder why we don't say
CHECKPOINT_FORCE in standby mode too.

regards, tom lane

#5 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

Hi,

On 2022-04-07 13:57:45 -0400, Tom Lane wrote:

Yeah, with only one instance it could just be cosmic rays or something.
However, assuming it is real, I guess I wonder why we don't say
CHECKPOINT_FORCE in standby mode too.

I guess it might partially be that restartpoints require a checkpoint to have
happened on the primary. If we used FORCE, we'd have to wait till the next
checkpoint on the primary, which'd be a problem if it's e.g. a manually issued
CHECKPOINT; before shutting the standby down.

Greetings,

Andres Freund

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#5)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

Andres Freund <andres@anarazel.de> writes:

On 2022-04-07 13:57:45 -0400, Tom Lane wrote:

Yeah, with only one instance it could just be cosmic rays or something.

Not cosmic rays: skink has shown the same symptom three times running.
Looks like maybe the archive_cleanup_command itself is doing something
it shouldn't?

regards, tom lane

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#5)
1 attachment(s)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

Andres Freund <andres@anarazel.de> writes:

On 2022-04-07 13:57:45 -0400, Tom Lane wrote:

Yeah, with only one instance it could just be cosmic rays or something.
However, assuming it is real, I guess I wonder why we don't say
CHECKPOINT_FORCE in standby mode too.

I guess it might partially be that restartpoints require a checkpoint to have
happened on the primary. If we used FORCE, we'd have to wait till the next
checkpoint on the primary, which'd be a problem if it's e.g. a manually issued
CHECKPOINT; before shutting the standby down.

After seeing skink's results, I tried running that test under valgrind
here, and it fails just like that every time. skink's history allows
us to bound the failure introduction between 79b716cfb7 and
d7ab2a9a3c, which I think makes it just about certain that it was
5dc0418fab (Prefetch data referenced by the WAL, take II), though I've
not bisected to be 100% sure.

Adding some debug printouts to ExecuteRecoveryCommand convinces me
that indeed the archive_cleanup_command is NOT getting called by the
problematic CHECKPOINT command. I surmise based on Andres' comment
above that the standby isn't making a restartpoint for lack of
an available primary checkpoint, which looks to me like it could be
a pre-existing bug in the test case: it's sure not doing anything to
guarantee that the primary's checkpoint record has reached the standby.

I tried adjusting the patch so it does guarantee that (as attached),
and in two out of two tries it got past the archive_cleanup_command
failure but then hung up waiting for standby2 to promote.

On the whole, I'm not sure that the WAL prefetch logic is noticeably
more stable than when we booted it out last year :-(. However, I also
wonder why it is that this test case wasn't occasionally failing already.

regards, tom lane

Attachments:

002_archiving-hack.patch (text/x-diff; charset=us-ascii)
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index c8f5ffbaf0..1032d8a388 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -44,13 +44,14 @@ $node_standby->start;
 # Create some content on primary
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $current_lsn =
-  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
 
 # Note the presence of this checkpoint for the archive_cleanup_command
 # check done below, before switching to a new segment.
 $node_primary->safe_psql('postgres', "CHECKPOINT");
 
+my $current_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
 # Force archiving of WAL file to make it present on primary
 $node_primary->safe_psql('postgres', "SELECT pg_switch_wal()");
 
#8 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#7)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

Hi,

On 2022-04-08 17:55:51 -0400, Tom Lane wrote:

After seeing skink's results, I tried running that test under valgrind
here, and it fails just like that every time. skink's history allows
us to bound the failure introduction between 79b716cfb7 and
d7ab2a9a3c, which I think makes it just about certain that it was
5dc0418fab (Prefetch data referenced by the WAL, take II), though I've
not bisected to be 100% sure.

I've tested it, it's 5dc0418fab that makes the difference. I reduced the cycle
time by making initdb not go through valgrind, but have normal postgres
instances go through it.

On the whole, I'm not sure that the WAL prefetch logic is noticeably
more stable than when we booted it out last year :-(.

IDK. Last year's issues seem to have largely been caused by a flaky
machine. And a bug, if it's that, in some archiving corner case that's not
normally reached during tests...

Greetings,

Andres Freund

#9 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#7)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

Hi,

On 2022-04-08 17:55:51 -0400, Tom Lane wrote:

I tried adjusting the patch so it does guarantee that (as attached),
and in two out of two tries it got past the archive_cleanup_command
failure but then hung up waiting for standby2 to promote.

Adding

$node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
just after
$node_standby2->start;

makes the tests pass here.

What is that second test really testing?

# Check the presence of temporary files specifically generated during
# archive recovery. To ensure the presence of the temporary history
# file, switch to a timeline large enough to allow a standby to recover
# a history file from an archive. As this requires at least two timeline
# switches, promote the existing standby first. Then create a second
# standby based on the promoted one. Finally, the second standby is
# promoted.

Note "Then create a second standby based on the promoted one." - but that's
not actually what's happening:

$node_standby2->init_from_backup($node_primary, $backup_name,
has_restoring => 1);

It's created from the original primary, not the first standby, as the
description says...

Both nodes get promoted independently, in a run without valgrind:

standby:
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG: INSERT @ 0/4000058: - XLOG/END_OF_RECOVERY: tli 2; prev tli 1; time 2022-04-08 17:23:42.96686-07
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG: xlog flush request 0/4000058; write 0/4000000; flush 0/4000000

standby2:
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG: INSERT @ 0/4000058: - XLOG/END_OF_RECOVERY: tli 3; prev tli 1; time 2022-04-08 17:23:43.307443->
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG: xlog flush request 0/4000058; write 0/4000000; flush 0/4000000

except that standby2 can't choose tli 2 because it finds it used.

Sure looks like something is funky with that test.

But I think there's also something funky in the prefetching logic. I think it
may attempt restoring during prefetching somehow, even though there's code
that appears to try to prevent that?

on standby2 I can see replay progress like the following:
2022-04-08 17:02:12.310 PDT [2441453][startup][1/0:0][] LOG: REDO @ 0/3024488; LSN 0/30244C8: prev 0/3024448; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 60 flags 0x00
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] DEBUG: record known xact 725 latestObservedXid 725
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] CONTEXT: WAL redo at 0/3024488 for Heap/INSERT: off 60 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:12.312 PDT [2441453][startup][1/0:0][] DEBUG: executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG: could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG: could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] LOG: REDO @ 0/30244C8; LSN 0/3024508: prev 0/3024488; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 61 flags 0x00
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] DEBUG: record known xact 725 latestObservedXid 725
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] CONTEXT: WAL redo at 0/30244C8 for Heap/INSERT: off 61 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:13.857 PDT [2441453][startup][1/0:0][] DEBUG: executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG: could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG: could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] LOG: REDO @ 0/3024508; LSN 0/3024548: prev 0/30244C8; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 62 flags 0x00
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] DEBUG: record known xact 725 latestObservedXid 725
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] CONTEXT: WAL redo at 0/3024508 for Heap/INSERT: off 62 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:15.415 PDT [2441453][startup][1/0:0][] DEBUG: executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""

note that we appear to wait between replaying of records, even though we
apparently have WAL for the next record!

And interestingly I'm not seeing the
"switched WAL source from stream to archive after failure"
lines I'd expect.

Greetings,

Andres Freund

#10 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#9)
1 attachment(s)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

On Sat, Apr 9, 2022 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-04-08 17:55:51 -0400, Tom Lane wrote:

I tried adjusting the patch so it does guarantee that (as attached),
and in two out of two tries it got past the archive_cleanup_command
failure but then hung up waiting for standby2 to promote.

Adding

$node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
just after
$node_standby2->start;

makes the tests pass here.

Sorry for the delay... I got a bit confused about the different things
going on in this thread but I hope I've got it now:

1. This test had some pre-existing bugs/races, which hadn't failed
before due to scheduling, even under Valgrind. The above changes
appear to fix those problems. To Michael for comment.

What is that second test really testing?

# Check the presence of temporary files specifically generated during
# archive recovery. To ensure the presence of the temporary history
# file, switch to a timeline large enough to allow a standby to recover
# a history file from an archive. As this requires at least two timeline
# switches, promote the existing standby first. Then create a second
# standby based on the promoted one. Finally, the second standby is
# promoted.

Note "Then create a second standby based on the promoted one." - but that's
not actually what's happening:

2. There may also be other problems with the test but those aren't
relevant to skink's failure, which starts on the 5th test. To Michael
for comment.

But I think there's also something funky in the prefetching logic. I think it
may attempt restoring during prefetching somehow, even though there's code
that appears to try to prevent that?

3. Urghl. Yeah. There is indeed code to report XLREAD_WOULDBLOCK
for the lastSourceFailed case, but we don't really want to do that
more often than every 5 seconds. So I think I need to make that
"sticky", so that we don't attempt to prefetch again (ie to read from
the next segment, which invokes restore_command) until we've replayed
all the records we already have, then hit the end and go to sleep.
The attached patch does that, and makes the offending test pass under
Valgrind for me, even without the other changes already mentioned. If
I understand correctly, this is due to a timing race in the tests
(though I didn't check where exactly), because all those extra
fork/exec calls are extremely slow under Valgrind.

And interestingly I'm not seeing the
"switched WAL source from stream to archive after failure"
lines I'd expect.

I see them now. It's because it gives up when it's reading ahead
(nonblocking), which may not be strictly necessary but I found it
simpler to think about. Then when it tries again in 5 seconds it's in
blocking mode so it doesn't give up so easily.

2022-04-11 18:15:08.220 NZST [524796] DEBUG: switched WAL source from
stream to archive after failure
cp: cannot stat '/tmp/archive/000000010000000000000017': No such file
or directory
2022-04-11 18:15:08.226 NZST [524796] DEBUG: could not restore file
"000000010000000000000017" from archive: child process exited with
exit code 1
2022-04-11 18:15:08.226 NZST [524796] DEBUG: could not open file
"pg_wal/000000010000000000000017": No such file or directory
2022-04-11 18:15:08.226 NZST [524796] DEBUG: switched WAL source from
archive to stream after failure

Attachments:

0001-Don-t-retry-restore_command-while-reading-ahead.patch (text/x-patch; charset=US-ASCII)
From e488e20f7c364ba1adaae8d1f6ea92a7f5d0bda6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 11 Apr 2022 17:16:24 +1200
Subject: [PATCH] Don't retry restore_command while reading ahead.

Suppress further attempts to read ahead in the WAL if we run out of
data, until records already decoded have been replayed.  When replaying
from archives, this avoids repeatedly retrying the restore_command until
after we've slept for 5 seconds.

When replaying from a network stream, this makes us less aggressive in
theory, but we probably can't benefit from being more aggressive yet
anyway as the flushedUpto variable doesn't advance until we run out of
data (though we could improve that in future).

While here, fix a potential off-by one confusion with the
no_readahead_until mechanism: we suppress readahead until after the
record that begins at that LSN has been replayed (ie we are replaying a
higher LSN), so change < to <=.

Defect in commit 5dc0418f.

Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20220409005910.alw46xqmmgny2sgr%40alap3.anarazel.de
---
 src/backend/access/transam/xlogprefetcher.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index f3428888d2..43e82b65c2 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -487,8 +487,8 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 			 */
 			nonblocking = XLogReaderHasQueuedRecordOrError(reader);
 
-			/* Certain records act as barriers for all readahead. */
-			if (nonblocking && replaying_lsn < prefetcher->no_readahead_until)
+			/* Readahead is disabled until we replay past a certain point. */
+			if (nonblocking && replaying_lsn <= prefetcher->no_readahead_until)
 				return LRQ_NEXT_AGAIN;
 
 			record = XLogReadAhead(prefetcher->reader, nonblocking);
@@ -496,8 +496,13 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 			{
 				/*
 				 * We can't read any more, due to an error or lack of data in
-				 * nonblocking mode.
+				 * nonblocking mode.  Don't try to read ahead again until we've
+				 * replayed everything already decoded.
 				 */
+				if (nonblocking && prefetcher->reader->decode_queue_tail)
+					prefetcher->no_readahead_until =
+						prefetcher->reader->decode_queue_tail->lsn;
+
 				return LRQ_NEXT_AGAIN;
 			}
 
-- 
2.30.2

#11 Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#10)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

On Mon, Apr 11, 2022 at 06:48:58PM +1200, Thomas Munro wrote:

Sorry for the delay... I got a bit confused about the different things
going on in this thread but I hope I've got it now:

1. This test had some pre-existing bugs/races, which hadn't failed
before due to scheduling, even under Valgrind. The above changes
appear to fix those problems. To Michael for comment.

I have seen the thread, and there is a lot in it. I will try to look
tomorrow at the parts I got involved in.
--
Michael

#12 Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#10)
1 attachment(s)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

On Mon, Apr 11, 2022 at 06:48:58PM +1200, Thomas Munro wrote:

1. This test had some pre-existing bugs/races, which hadn't failed
before due to scheduling, even under Valgrind. The above changes
appear to fix those problems. To Michael for comment.

Yeah, there are two problems here. From what I can see, ensuring the
execution of archive_cleanup_command on the standby needs the
checkpoint on the primary and the restart point on the standby. So
pg_current_wal_lsn() should be located after the primary's checkpoint
and not before it, so that we are sure the checkpoint record finds
its way to the standby. That's what Tom mentioned upthread.

The second problem is to make sure that $standby2 sees the promotion
of $standby and its history file, but we also want to recover
00000002.history from some archives to create a RECOVERYHISTORY at
recovery for the purpose of the test. Switching to a new segment as
proposed by Andres does not seem completely right to me because we are
not 100% sure of the order in which archiving is going to happen, no? I
think that the logic to create $standby2 from the initial backup of
the primary is right, because there is no 00000002.history in it, but
we also need to be sure that 00000002.history has been archived once
the promotion of $standby is done. This can be validated thanks to
the logs, actually.

What is that second test really testing?

# Check the presence of temporary files specifically generated during
# archive recovery. To ensure the presence of the temporary history
# file, switch to a timeline large enough to allow a standby to recover
# a history file from an archive. As this requires at least two timeline
# switches, promote the existing standby first. Then create a second
# standby based on the promoted one. Finally, the second standby is
# promoted.

Note "Then create a second standby based on the promoted one." - but that's
not actually what's happening:

2. There may also be other problems with the test but those aren't
relevant to skink's failure, which starts on the 5th test. To Michael
for comment.

This comes from df86e52, where we want to recover a history file that
would be created as RECOVERYHISTORY and make sure that the file gets
removed at the end of recovery. So $standby2 should choose a new
timeline different from the one chosen by $standby. Looking back
at what has been done, it seems to me that the comment is the
incorrect part:
/messages/by-id/20190930080340.GO2888@paquier.xyz

All that stuff leads me to the attached. Thoughts?
--
Michael

Attachments:

tap-archiving-michael.patch (text/x-diff; charset=us-ascii)
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index c8f5ffbaf0..45aafcb35c 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -24,6 +24,8 @@ $node_primary->backup($backup_name);
 
 # Initialize standby node from backup, fetching WAL from archives
 my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+# Note that this makes the standby archive its contents on the archives
+# of the primary.
 $node_standby->init_from_backup($node_primary, $backup_name,
 	has_restoring => 1);
 $node_standby->append_conf('postgresql.conf',
@@ -44,13 +46,16 @@ $node_standby->start;
 # Create some content on primary
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $current_lsn =
-  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
 
 # Note the presence of this checkpoint for the archive_cleanup_command
 # check done below, before switching to a new segment.
 $node_primary->safe_psql('postgres', "CHECKPOINT");
 
+# Done after the checkpoint to ensure that the checkpoint gets replayed
+# on the standby.
+my $current_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
 # Force archiving of WAL file to make it present on primary
 $node_primary->safe_psql('postgres', "SELECT pg_switch_wal()");
 
@@ -81,10 +86,34 @@ ok( !-f "$data_dir/$recovery_end_command_file",
 # file, switch to a timeline large enough to allow a standby to recover
 # a history file from an archive.  As this requires at least two timeline
 # switches, promote the existing standby first.  Then create a second
-# standby based on the promoted one.  Finally, the second standby is
-# promoted.
+# standby based on the primary, using its archives.  Finally, the second
+# standby is promoted.
 $node_standby->promote;
 
+# Wait until the history file has been archived on the archives of the
+# primary once the promotion of the standby completes.
+my $primary_archive = $node_primary->archive_dir;
+my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
+my $attempts     = 0;
+while ($attempts < $max_attempts)
+{
+	last if (-e "$primary_archive/00000002.history");
+
+	# Wait 0.1 second before retrying.
+	usleep(100_000);
+
+	$attempts++;
+}
+
+if ($attempts >= $max_attempts)
+{
+	die "timed out waiting for 00000002.history\n";
+}
+else
+{
+	note "found 00000002.history after $attempts attempts\n"
+}
+
 # recovery_end_command should have been triggered on promotion.
 ok( -f "$data_dir/$recovery_end_command_file",
 	'recovery_end_command executed after promotion');
@@ -108,14 +137,19 @@ my $log_location = -s $node_standby2->logfile;
 # Now promote standby2, and check that temporary files specifically
 # generated during archive recovery are removed by the end of recovery.
 $node_standby2->promote;
+
+# Check the logs of the standby to see that the commands have failed.
+my $log_contents = slurp_file($node_standby2->logfile, $log_location);
 my $node_standby2_data = $node_standby2->data_dir;
+
+like(
+	$log_contents,
+	qr/restored log file "00000002.history" from archive/s,
+	"00000002.history retrieved from the archives");
 ok( !-f "$node_standby2_data/pg_wal/RECOVERYHISTORY",
 	"RECOVERYHISTORY removed after promotion");
 ok( !-f "$node_standby2_data/pg_wal/RECOVERYXLOG",
 	"RECOVERYXLOG removed after promotion");
-
-# Check the logs of the standby to see that the commands have failed.
-my $log_contents = slurp_file($node_standby2->logfile, $log_location);
 like(
 	$log_contents,
 	qr/WARNING:.*recovery_end_command/s,
#13 Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#12)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

On Tue, Apr 12, 2022 at 3:49 PM Michael Paquier <michael@paquier.xyz> wrote:

All that stuff leads me to the attached. Thoughts?

Under valgrind I got "Undefined subroutine &main::usleep called at
t/002_archiving.pl line 103" so I added "use Time::HiRes qw(usleep);",
and now I get past the first 4 tests with your patch, but then
promotion times out, not sure why:

+++ tap check in src/test/recovery +++
t/002_archiving.pl ..
ok 1 - check content from archives
ok 2 - archive_cleanup_command executed on checkpoint
ok 3 - recovery_end_command not executed yet
# found 00000002.history after 14 attempts
ok 4 - recovery_end_command executed after promotion
Bailout called.  Further testing stopped:  command "pg_ctl -D
/home/tmunro/projects/postgresql/src/test/recovery/tmp_check/t_002_archiving_standby2_data/pgdata
-l /home/tmunro/projects/postgresql/src/test/recovery/tmp_check/log/002_archiving_standby2.log
promote" exited with value 1

Since it's quite painful to run TAP tests under valgrind, I found a
place to stick a plain old sleep to repro these problems:

--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1035,7 +1035,7 @@ sub enable_restoring
        my $copy_command =
          $PostgreSQL::Test::Utils::windows_os
          ? qq{copy "$path\\\\%f" "%p"}
-         : qq{cp "$path/%f" "%p"};
+         : qq{sleep 1 && cp "$path/%f" "%p"};

Soon I'll push the fix for the slowness that xlogprefetcher.c
accidentally introduced to continuous archive recovery, ie the problem
of calling a failing restore_command repeatedly as we approach the end
of a WAL segment instead of just once every 5 seconds after we run out
of data. After that you'll probably need to revert that fix locally
to repro this.

#14 Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#13)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

On Sun, Apr 17, 2022 at 08:56:33AM +1200, Thomas Munro wrote:

Under valgrind I got "Undefined subroutine &main::usleep called at
t/002_archiving.pl line 103" so I added "use Time::HiRes qw(usleep);",
and now I get past the first 4 tests with your patch, but then
promotion times out, not sure why:

+++ tap check in src/test/recovery +++
t/002_archiving.pl ..
ok 1 - check content from archives
ok 2 - archive_cleanup_command executed on checkpoint
ok 3 - recovery_end_command not executed yet
# found 00000002.history after 14 attempts
ok 4 - recovery_end_command executed after promotion
Bailout called.  Further testing stopped:  command "pg_ctl -D
/home/tmunro/projects/postgresql/src/test/recovery/tmp_check/t_002_archiving_standby2_data/pgdata
-l /home/tmunro/projects/postgresql/src/test/recovery/tmp_check/log/002_archiving_standby2.log
promote" exited with value 1

Hmm. As far as I can see, aren't you just hitting the 60s timeout of
pg_ctl here due to the slowness of valgrind?

Since it's quite painful to run TAP tests under valgrind, I found a
place to stick a plain old sleep to repro these problems:

Actually, I am wondering how you are patching Cluster.pm to do that.

Soon I'll push the fix for the slowness that xlogprefetcher.c
accidentally introduced to continuous archive recovery, ie the problem
of calling a failing restore_command repeatedly as we approach the end
of a WAL segment instead of just once every 5 seconds after we run out
of data. After that you'll probably need to revert that fix locally
to repro this.

Okay. Thanks. Anyway, I'll do something about that tomorrow (no
room to look at the buildfarm today). I was thinking about replacing
the while loop I had in the last version of the patch with a
poll_query_until() that does a pg_stat_file() with an absolute path to
the history file, to avoid the dependency on usleep() in the test, and
about splitting the fix into two commits as there is more than one
problem, each applying to different branches.
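
For illustration, such a check could look roughly like this; a sketch based
on the description above, not necessarily the exact committed code:

# Poll the primary until the standby's promotion history file shows up in
# the primary's archive directory. pg_stat_file() is called with
# missing_ok = true so the query returns false instead of erroring while
# the file does not exist yet.
my $primary_archive = $node_primary->archive_dir;
$node_primary->poll_query_until('postgres',
	qq{SELECT size IS NOT NULL
	   FROM pg_stat_file('$primary_archive/00000002.history', true)})
  or die "timed out while waiting for 00000002.history to be archived";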
--
Michael

#15 Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#14)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

On 2022-04-17 Su 00:17, Michael Paquier wrote:

Since it's quite painful to run TAP tests under valgrind, I found a
place to stick a plain old sleep to repro these problems:

Actually, I am wondering how you are patching Cluster.pm to do that.

I don't really think it's Cluster.pm's business to deal with that. It
takes an install path as given either explicitly or implicitly.

It shouldn't be too hard to get Makefile.global to install valgrind
wrappers into the tmp_install/bin directory.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#16 Michael Paquier
michael@paquier.xyz
In reply to: Andrew Dunstan (#15)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

On Sun, Apr 17, 2022 at 10:56:08AM -0400, Andrew Dunstan wrote:

I don't really think it's Cluster.pm's business to deal with that. It
takes an install path as given either explicitly or implicitly.

It shouldn't be too hard to get Makefile.global to install valgrind
wrappers into the tmp_install/bin directory.

Or could what gets used be just wrappers around the contents of bin/
that get enforced to be first in PATH?
--
Michael

#17 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#12)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

On Tue, Apr 12, 2022 at 12:49:48PM +0900, Michael Paquier wrote:

This comes from df86e52, where we want to recover a history file that
would be created as RECOVERYHISTORY and make sure that the file gets
removed at the end of recovery. So $standby2 should choose a new
timeline different from the one chosen by $standby. Looking back
at what has been done, it seems to me that the comment is the
incorrect part:
/messages/by-id/20190930080340.GO2888@paquier.xyz

acf1dd42 has taken care of the failures of this test with skink, and I
have just taken care of the two races in the tests with e61efaf and
1a8b110. I have left e61efaf out of REL_10_STABLE as the idea of
relying on a poll_query_until() with pg_stat_file() and an absolute
path would not work there, and the branch will be EOL'd soon while
there were no complains with this test for two years.
--
Michael

#18 Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#16)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

On 2022-04-17 Su 19:49, Michael Paquier wrote:

On Sun, Apr 17, 2022 at 10:56:08AM -0400, Andrew Dunstan wrote:

I don't really think it's Cluster.pm's business to deal with that. It
takes an install path as given either explicitly or implicitly.

It shouldn't be too hard to get Makefile.global to install valgrind
wrappers into the tmp_install/bin directory.

Or could what gets used be just wrappers around the contents of bin/
that get enforced to be first in PATH?

That seems likely to be difficult. For example pg_ctl might find its
colocated postgres rather than the valgrind wrapper. Ditto initdb.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#19 Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#16)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

On Mon, Apr 18, 2022 at 11:49 AM Michael Paquier <michael@paquier.xyz> wrote:

On Sun, Apr 17, 2022 at 10:56:08AM -0400, Andrew Dunstan wrote:

I don't really think it's Cluster.pm's business to deal with that. It
takes an install path as given either explicitly or implicitly.

It shouldn't be too hard to get Makefile.global to install valgrind
wrappers into the tmp_install/bin directory.

Or could what gets used be just wrappers around the contents of bin/
that get enforced to be first in PATH?

Delayed response to the question on how I did that, because it was a 4
day weekend down here and I got distracted by sunshine...

A horribly slow way to do it is to build with -DUSE_VALGRIND and then
just run the whole process tree (eg make, perl, psql, ... and all)
under valgrind with --trace-children=yes:

tmunro@x1:~/projects/postgresql$ valgrind --quiet
--suppressions=`pwd`/src/tools/valgrind.supp --trace-children=yes
--track-origins=yes --run-libc-freeres=no --vgdb=no
--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END make -C
src/test/recovery/ check PROVE_TESTS=t/002_* PROVE_FLAGS=-v

I think that sort of thing actually worked when I tried it on a
beefier workstation, but it sent my Thinkpad that "only" has 16GB of
RAM into some kind of death spiral. The way I succeeded was indeed
using a wrapper script, based on a suggestion from Andres, my
kludgy-hardcoded-path-assuming implementation of which looked like:

=== install-postgres-valgrind, to be run once after building ===
#!/bin/sh

SRC=$HOME/projects/postgresql

# move the real binary out of the way
mv $SRC/src/backend/postgres $SRC/src/backend/postgres.real

# install the wrapper, in the location it'll be copied from for tmp_install
cp postgres.valgrind $SRC/src/backend/postgres
===

=== postgres.valgrind wrapper script ===
#!/bin/sh
exec /usr/bin/valgrind \
--quiet \
--error-exitcode=128 \
--suppressions=$HOME/projects/postgresql/src/tools/valgrind.supp \
--trace-children=yes --track-origins=yes --read-var-info=no \
--leak-check=no \
--run-libc-freeres=no \
--vgdb=no \
--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
$HOME/projects/postgresql/src/backend/postgres.real \
"$@"
===

Then just:

make -C src/test/recovery/ check PROVE_TESTS=t/002_* PROVE_FLAGS=-v

Yeah, it might be quite neat to find a tool-supported way to do that.

Tangentially, I'd also like to look into making
PostgreSQL-under-Valgrind work on FreeBSD and macOS, which didn't work
last time I tried it for reasons that might, I hope, have been fixed
on the Valgrind side by now.

#20 Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#19)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

On Tue, Apr 19, 2022 at 09:45:11AM +1200, Thomas Munro wrote:

Delayed response to the question on how I did that, because it was a 4
day weekend down here and I got distracted by sunshine...

Happy Easter.

I think that sort of thing actually worked when I tried it on a
beefier workstation, but it sent my Thinkpad that "only" has 16GB of
RAM into some kind of death spiral. The way I succeeded was indeed
using a wrapper script, based on a suggestion from Andres, my
kludgy-hardcoded-path-assuming implementation of which looked like:

Yeah, it might be quite neat to find a tool-supported way to do that.

Thanks for the details. I feared that it was something like that for
the backend. At least that's better than having valgrind spawn all
the processes kicked by the make command. :/

Tangentially, I'd also like to look into making
PostgreSQL-under-Valgrind work on FreeBSD and macOS, which didn't work
last time I tried it for reasons that might, I hope, have been fixed
on the Valgrind side by now.

Okay.

As a side note, skink has cooled down since acf1dd4, and did not
complain either after the additions of e61efaf and 1a8b110.
--
Michael