pgsql: Add regression test for recovery pause.

Started by Fujii Masaoover 4 years ago6 messages
#1Fujii Masao
fujii@postgresql.org

Add regression test for recovery pause.

Previously there was no regression test for recovery pause feature.
This commit adds the test that checks

- recovery can be paused or resumed expectedly
- pg_get_wal_replay_pause_state() reports the correct pause state
- the paused state ends and promotion continues if a promotion
is triggered while recovery is paused

Suggested-by: Michael Paquier
Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Dilip Kumar
Discussion: /messages/by-id/YKNirzqM1HYyk5h4@paquier.xyz

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/6bbc5c5e96b08f6b8c7d28d10ed8dfe6c49dca30

Modified Files
--------------
src/test/recovery/t/005_replay_delay.pl | 59 +++++++++++++++++++++++++++++++--
1 file changed, 57 insertions(+), 2 deletions(-)

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#1)
Re: pgsql: Add regression test for recovery pause.

Fujii Masao <fujii@postgresql.org> writes:

Add regression test for recovery pause.

Buildfarm member jacana doesn't like this patch:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&amp;dt=2021-06-02%2012%3A00%3A44

the symptom being

Jun 02 09:05:17 t/005_replay_delay..................# poll_query_until timed out executing this query:
Jun 02 09:05:17 # SELECT '0/3002A20'::pg_lsn < pg_last_wal_receive_lsn()
Jun 02 09:05:17 # expecting this output:
Jun 02 09:05:17 # t
Jun 02 09:05:17 # last actual query output:
Jun 02 09:05:17 #
Jun 02 09:05:17 # with stderr:
Jun 02 09:05:17 # ERROR: syntax error at or near "pg_lsn"
Jun 02 09:05:17 # LINE 1: SELECT '0\\3002A20';pg_lsn < pg_last_wal_receive_lsn()
Jun 02 09:05:17 # ^

Checking the postmaster log confirms that what the backend is getting is

2021-06-02 08:58:01.073 EDT [60b78059.f84:4] 005_replay_delay.pl ERROR: syntax error at or near "pg_lsn" at character 20
2021-06-02 08:58:01.073 EDT [60b78059.f84:5] 005_replay_delay.pl STATEMENT: SELECT '0\\3002A20';pg_lsn < pg_last_wal_receive_lsn()

It sort of looks like something has decided that the pg_lsn constant
is a search path and made a lame attempt to convert it to Windows
style. I doubt our own code is doing that, so I'm inclined to blame
IPC::Run thinking it can mangle the command string it's given.
I wonder whether jacana has got a freshly-installed version of IPC::Run.

Another interesting question is how come we managed to get this far
in the tests. There is a nearly, but not quite, identical delay
query in 002_archiving.pl, which already ran successfully:

# Wait until necessary replay has been done on standby
my $caughtup_query =
"SELECT '$current_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
$node_standby->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for standby to catch up";

I wonder whether the fact that 002 uses '<=' not '<' could be
at all related. (I also wonder which one is correct as a means
of waiting for replay; they are not both correct.)

In any case, letting IPC::Run munge SQL commands seems completely
unacceptable. We can't plan on working around that every time.

regards, tom lane

#3Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#2)
Re: pgsql: Add regression test for recovery pause.

On 6/2/21 5:26 PM, Tom Lane wrote:

Fujii Masao <fujii@postgresql.org> writes:

Add regression test for recovery pause.

Buildfarm member jacana doesn't like this patch:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&amp;dt=2021-06-02%2012%3A00%3A44

the symptom being

Jun 02 09:05:17 t/005_replay_delay..................# poll_query_until timed out executing this query:
Jun 02 09:05:17 # SELECT '0/3002A20'::pg_lsn < pg_last_wal_receive_lsn()
Jun 02 09:05:17 # expecting this output:
Jun 02 09:05:17 # t
Jun 02 09:05:17 # last actual query output:
Jun 02 09:05:17 #
Jun 02 09:05:17 # with stderr:
Jun 02 09:05:17 # ERROR: syntax error at or near "pg_lsn"
Jun 02 09:05:17 # LINE 1: SELECT '0\\3002A20';pg_lsn < pg_last_wal_receive_lsn()
Jun 02 09:05:17 # ^

Checking the postmaster log confirms that what the backend is getting is

2021-06-02 08:58:01.073 EDT [60b78059.f84:4] 005_replay_delay.pl ERROR: syntax error at or near "pg_lsn" at character 20
2021-06-02 08:58:01.073 EDT [60b78059.f84:5] 005_replay_delay.pl STATEMENT: SELECT '0\\3002A20';pg_lsn < pg_last_wal_receive_lsn()

It sort of looks like something has decided that the pg_lsn constant
is a search path and made a lame attempt to convert it to Windows
style. I doubt our own code is doing that, so I'm inclined to blame
IPC::Run thinking it can mangle the command string it's given.
I wonder whether jacana has got a freshly-installed version of IPC::Run.

Another interesting question is how come we managed to get this far
in the tests. There is a nearly, but not quite, identical delay
query in 002_archiving.pl, which already ran successfully:

# Wait until necessary replay has been done on standby
my $caughtup_query =
"SELECT '$current_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
$node_standby->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for standby to catch up";

I wonder whether the fact that 002 uses '<=' not '<' could be
at all related. (I also wonder which one is correct as a means
of waiting for replay; they are not both correct.)

In any case, letting IPC::Run munge SQL commands seems completely
unacceptable. We can't plan on working around that every time.

Looks to me like we're getting munged by the msys shell, and unlike on
msys2 there isn't a way to disable it:
https://stackoverflow.com/questions/7250130/how-to-stop-mingw-and-msys-from-mangling-path-names-given-at-the-command-line

c.f. commit 73ff3a0abbb

Maybe a robust solution would be to have the query piped to psql on its
stdin rather than on the command line. poll_query_until looks on a quick
check like the only place in PostgresNode where we use "psql -c"

I'll experiment a bit tomorrow.

cheers

andrew

--

Andrew Dunstan
EDB: https://www.enterprisedb.com

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#3)
Re: pgsql: Add regression test for recovery pause.

On 6/2/21 6:25 PM, Andrew Dunstan wrote:

Looks to me like we're getting munged by the msys shell, and unlike on
msys2 there isn't a way to disable it:
https://stackoverflow.com/questions/7250130/how-to-stop-mingw-and-msys-from-mangling-path-names-given-at-the-command-line

c.f. commit 73ff3a0abbb

Maybe a robust solution would be to have the query piped to psql on its
stdin rather than on the command line. poll_query_until looks on a quick
check like the only place in PostgresNode where we use "psql -c"

I'll experiment a bit tomorrow.

My suspicion was correct. Fix pushed.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#4)
Re: pgsql: Add regression test for recovery pause.

Andrew Dunstan <andrew@dunslane.net> writes:

My suspicion was correct. Fix pushed.

Great, thanks.

Do we need to worry about back-patching that? It seems only
accidental if no existing back-branch test cases hit this.

regards, tom lane

#6Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#5)
Re: pgsql: Add regression test for recovery pause.

On 6/3/21 4:45 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

My suspicion was correct. Fix pushed.

Great, thanks.

Do we need to worry about back-patching that? It seems only
accidental if no existing back-branch test cases hit this.

Well, we haven't had breakage, but its also useful to keep things in
sync as much as possible. Ill do it shortly.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com