Address the bug in 041_checkpoint_at_promote.pl
While testing, I discovered an issue in 041_checkpoint_at_promote.pl.
# Wait until the previous restart point completes on the newly-promoted
# standby, checking the logs for that.
my $checkpoint_complete = 0;
foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
{
if ($node_standby->log_contains("restartpoint complete"), $logstart)
{
$checkpoint_complete = 1;
last;
}
usleep(100_000);
}
is($checkpoint_complete, 1, 'restart point has completed');
The code is intended to wait for the restart point to complete before
proceeding. However, it doesn't actually wait. Regardless of whether
the restart point completes, the loop exits after the first iteration
because the if condition always evaluates to true. This happens
because $logstart is not passed as an argument to log_contains() by
mistake. If the restart point operation is quick, this issue might not
be noticeable, which is often the case.
I've attached a patch to fix this issue. Please review and share your feedback.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Attachments:
0001-Fix-the-bug-in-041_checkpoint_at_promote.pl.patchapplication/octet-stream; name=0001-Fix-the-bug-in-041_checkpoint_at_promote.pl.patchDownload
From ac1fcc7031ef547245ba29b00e10bfe77afeb56f Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhav@microsoft.com>
Date: Wed, 12 Feb 2025 07:42:44 +0000
Subject: [PATCH] Fix the bug in 041_checkpoint_at_promote.pl
The code isn't functioning as intended. It should wait for the restart
point operation to complete, but it doesn't due to a small mistake
with the closing parenthesis. The logstart variable should be passed
as the second argument to log_contains(). This patch corrects that issue.
---
src/test/recovery/t/041_checkpoint_at_promote.pl | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/test/recovery/t/041_checkpoint_at_promote.pl b/src/test/recovery/t/041_checkpoint_at_promote.pl
index cb301d01427..cc5aa80af40 100644
--- a/src/test/recovery/t/041_checkpoint_at_promote.pl
+++ b/src/test/recovery/t/041_checkpoint_at_promote.pl
@@ -110,7 +110,7 @@ $node_standby->safe_psql('postgres',
my $checkpoint_complete = 0;
foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
{
- if ($node_standby->log_contains("restartpoint complete"), $logstart)
+ if ($node_standby->log_contains("restartpoint complete", $logstart))
{
$checkpoint_complete = 1;
last;
--
2.43.0
On Wed, Feb 12, 2025 at 01:28:55PM +0530, Nitin Jadhav wrote:
The code is intended to wait for the restart point to complete before
proceeding. However, it doesn't actually wait. Regardless of whether
the restart point completes, the loop exits after the first iteration
because the if condition always evaluates to true. This happens
because $logstart is not passed as an argument to log_contains() by
mistake. If the restart point operation is quick, this issue might not
be noticeable, which is often the case.I've attached a patch to fix this issue. Please review and share your feedback.
Oops, you're right. I am measuring 2ms or so in average between the wakeup
and the restartpoint complete. Removing the wakeup makes the test
complete, but it should wait in its lookup loop.
Will fix down to v17 where this error has been introduced.
--
Michael
Removing the wakeup makes the test
complete, but it should wait in its lookup loop.
Thank you for confirming. Besides fixing the if condition as done in
the patch, do you think any other changes are necessary?
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Show quoted text
On Wed, Feb 12, 2025 at 1:54 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Feb 12, 2025 at 01:28:55PM +0530, Nitin Jadhav wrote:
The code is intended to wait for the restart point to complete before
proceeding. However, it doesn't actually wait. Regardless of whether
the restart point completes, the loop exits after the first iteration
because the if condition always evaluates to true. This happens
because $logstart is not passed as an argument to log_contains() by
mistake. If the restart point operation is quick, this issue might not
be noticeable, which is often the case.I've attached a patch to fix this issue. Please review and share your feedback.
Oops, you're right. I am measuring 2ms or so in average between the wakeup
and the restartpoint complete. Removing the wakeup makes the test
complete, but it should wait in its lookup loop.Will fix down to v17 where this error has been introduced.
--
Michael
Removing the wakeup makes the test
complete, but it should wait in its lookup loop.Thank you for confirming. Besides fixing the if condition as done in
the patch, do you think any other changes are necessary?
I see that it's already been committed and understand that no other
changes are needed. Thank you!
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
On Wed, Feb 12, 2025 at 3:29 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
Show quoted text
Removing the wakeup makes the test
complete, but it should wait in its lookup loop.Thank you for confirming. Besides fixing the if condition as done in
the patch, do you think any other changes are necessary?Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
MicrosoftOn Wed, Feb 12, 2025 at 1:54 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Feb 12, 2025 at 01:28:55PM +0530, Nitin Jadhav wrote:
The code is intended to wait for the restart point to complete before
proceeding. However, it doesn't actually wait. Regardless of whether
the restart point completes, the loop exits after the first iteration
because the if condition always evaluates to true. This happens
because $logstart is not passed as an argument to log_contains() by
mistake. If the restart point operation is quick, this issue might not
be noticeable, which is often the case.I've attached a patch to fix this issue. Please review and share your feedback.
Oops, you're right. I am measuring 2ms or so in average between the wakeup
and the restartpoint complete. Removing the wakeup makes the test
complete, but it should wait in its lookup loop.Will fix down to v17 where this error has been introduced.
--
Michael
On Wed, Feb 12, 2025 at 04:50:34PM +0530, Nitin Jadhav wrote:
I see that it's already been committed and understand that no other
changes are needed. Thank you!
My apologies for the lack of updates here. I've looked at the whole
test again yesterday and the issue that you have reported was the only
one standing out.
Anyway, how did you find that? Did you see a pattern when running the
test on a very slow machine or just from a read? That was a good
catch.
--
Michael
On Thu, Feb 13, 2025 at 5:08 AM Michael Paquier <michael@paquier.xyz> wrote:
Anyway, how did you find that? Did you see a pattern when running the
test on a very slow machine or just from a read? That was a good
catch.
+1. I was wondering the same.
--
Best Wishes,
Ashutosh Bapat
Anyway, how did you find that? Did you see a pattern when running the
test on a very slow machine or just from a read? That was a good
catch.+1. I was wondering the same.
I was writing a TAP test to reproduce a crash recovery issue and used
parts of 041_checkpoint_at_promote.pl. Unfortunately, my test wasn't
waiting for the desired message to appear in the log. I then realized
there was a mistake in log_contains(), which I had copied from the
existing test. After testing 041_checkpoint_at_promote.pl multiple
times to see if it worked as expected, I noticed differences in some
iterations.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
On Thu, Feb 13, 2025 at 11:18 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Show quoted text
On Thu, Feb 13, 2025 at 5:08 AM Michael Paquier <michael@paquier.xyz> wrote:
Anyway, how did you find that? Did you see a pattern when running the
test on a very slow machine or just from a read? That was a good
catch.+1. I was wondering the same.
--
Best Wishes,
Ashutosh Bapat