Address the bug in 041_checkpoint_at_promote.pl

Started by Nitin Jadhav11 months ago7 messages
#1Nitin Jadhav
nitinjadhavpostgres@gmail.com
1 attachment(s)

While testing, I discovered an issue in 041_checkpoint_at_promote.pl.

# Wait until the previous restart point completes on the newly-promoted
# standby, checking the logs for that.
my $checkpoint_complete = 0;
foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
{
if ($node_standby->log_contains("restartpoint complete"), $logstart)
{
$checkpoint_complete = 1;
last;
}
usleep(100_000);
}
is($checkpoint_complete, 1, 'restart point has completed');

The code is intended to wait for the restart point to complete before
proceeding. However, it doesn't actually wait. Regardless of whether
the restart point completes, the loop exits after the first iteration
because the if condition always evaluates to true. This happens
because $logstart is not passed as an argument to log_contains() by
mistake. If the restart point operation is quick, this issue might not
be noticeable, which is often the case.

I've attached a patch to fix this issue. Please review and share your feedback.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

Attachments:

0001-Fix-the-bug-in-041_checkpoint_at_promote.pl.patchapplication/octet-stream; name=0001-Fix-the-bug-in-041_checkpoint_at_promote.pl.patchDownload
From ac1fcc7031ef547245ba29b00e10bfe77afeb56f Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhav@microsoft.com>
Date: Wed, 12 Feb 2025 07:42:44 +0000
Subject: [PATCH] Fix the bug in 041_checkpoint_at_promote.pl

The code isn't functioning as intended. It should wait for the restart
point operation to complete, but it doesn't due to a small mistake
with the closing parenthesis. The logstart variable should be passed
as the second argument to log_contains(). This patch corrects that issue.
---
 src/test/recovery/t/041_checkpoint_at_promote.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/test/recovery/t/041_checkpoint_at_promote.pl b/src/test/recovery/t/041_checkpoint_at_promote.pl
index cb301d01427..cc5aa80af40 100644
--- a/src/test/recovery/t/041_checkpoint_at_promote.pl
+++ b/src/test/recovery/t/041_checkpoint_at_promote.pl
@@ -110,7 +110,7 @@ $node_standby->safe_psql('postgres',
 my $checkpoint_complete = 0;
 foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
 {
-	if ($node_standby->log_contains("restartpoint complete"), $logstart)
+	if ($node_standby->log_contains("restartpoint complete", $logstart))
 	{
 		$checkpoint_complete = 1;
 		last;
-- 
2.43.0

#2Michael Paquier
michael@paquier.xyz
In reply to: Nitin Jadhav (#1)
Re: Address the bug in 041_checkpoint_at_promote.pl

On Wed, Feb 12, 2025 at 01:28:55PM +0530, Nitin Jadhav wrote:

The code is intended to wait for the restart point to complete before
proceeding. However, it doesn't actually wait. Regardless of whether
the restart point completes, the loop exits after the first iteration
because the if condition always evaluates to true. This happens
because $logstart is not passed as an argument to log_contains() by
mistake. If the restart point operation is quick, this issue might not
be noticeable, which is often the case.

I've attached a patch to fix this issue. Please review and share your feedback.

Oops, you're right. I am measuring 2ms or so in average between the wakeup
and the restartpoint complete. Removing the wakeup makes the test
complete, but it should wait in its lookup loop.

Will fix down to v17 where this error has been introduced.
--
Michael

#3Nitin Jadhav
nitinjadhavpostgres@gmail.com
In reply to: Michael Paquier (#2)
Re: Address the bug in 041_checkpoint_at_promote.pl

Removing the wakeup makes the test
complete, but it should wait in its lookup loop.

Thank you for confirming. Besides fixing the if condition as done in
the patch, do you think any other changes are necessary?

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

Show quoted text

On Wed, Feb 12, 2025 at 1:54 PM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Feb 12, 2025 at 01:28:55PM +0530, Nitin Jadhav wrote:

The code is intended to wait for the restart point to complete before
proceeding. However, it doesn't actually wait. Regardless of whether
the restart point completes, the loop exits after the first iteration
because the if condition always evaluates to true. This happens
because $logstart is not passed as an argument to log_contains() by
mistake. If the restart point operation is quick, this issue might not
be noticeable, which is often the case.

I've attached a patch to fix this issue. Please review and share your feedback.

Oops, you're right. I am measuring 2ms or so in average between the wakeup
and the restartpoint complete. Removing the wakeup makes the test
complete, but it should wait in its lookup loop.

Will fix down to v17 where this error has been introduced.
--
Michael

#4Nitin Jadhav
nitinjadhavpostgres@gmail.com
In reply to: Nitin Jadhav (#3)
Re: Address the bug in 041_checkpoint_at_promote.pl

Removing the wakeup makes the test
complete, but it should wait in its lookup loop.

Thank you for confirming. Besides fixing the if condition as done in
the patch, do you think any other changes are necessary?

I see that it's already been committed and understand that no other
changes are needed. Thank you!

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Wed, Feb 12, 2025 at 3:29 PM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

Show quoted text

Removing the wakeup makes the test
complete, but it should wait in its lookup loop.

Thank you for confirming. Besides fixing the if condition as done in
the patch, do you think any other changes are necessary?

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Wed, Feb 12, 2025 at 1:54 PM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Feb 12, 2025 at 01:28:55PM +0530, Nitin Jadhav wrote:

The code is intended to wait for the restart point to complete before
proceeding. However, it doesn't actually wait. Regardless of whether
the restart point completes, the loop exits after the first iteration
because the if condition always evaluates to true. This happens
because $logstart is not passed as an argument to log_contains() by
mistake. If the restart point operation is quick, this issue might not
be noticeable, which is often the case.

I've attached a patch to fix this issue. Please review and share your feedback.

Oops, you're right. I am measuring 2ms or so in average between the wakeup
and the restartpoint complete. Removing the wakeup makes the test
complete, but it should wait in its lookup loop.

Will fix down to v17 where this error has been introduced.
--
Michael

#5Michael Paquier
michael@paquier.xyz
In reply to: Nitin Jadhav (#4)
Re: Address the bug in 041_checkpoint_at_promote.pl

On Wed, Feb 12, 2025 at 04:50:34PM +0530, Nitin Jadhav wrote:

I see that it's already been committed and understand that no other
changes are needed. Thank you!

My apologies for the lack of updates here. I've looked at the whole
test again yesterday and the issue that you have reported was the only
one standing out.

Anyway, how did you find that? Did you see a pattern when running the
test on a very slow machine or just from a read? That was a good
catch.
--
Michael

#6Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Michael Paquier (#5)
Re: Address the bug in 041_checkpoint_at_promote.pl

On Thu, Feb 13, 2025 at 5:08 AM Michael Paquier <michael@paquier.xyz> wrote:

Anyway, how did you find that? Did you see a pattern when running the
test on a very slow machine or just from a read? That was a good
catch.

+1. I was wondering the same.

--
Best Wishes,
Ashutosh Bapat

#7Nitin Jadhav
nitinjadhavpostgres@gmail.com
In reply to: Ashutosh Bapat (#6)
Re: Address the bug in 041_checkpoint_at_promote.pl

Anyway, how did you find that? Did you see a pattern when running the
test on a very slow machine or just from a read? That was a good
catch.

+1. I was wondering the same.

I was writing a TAP test to reproduce a crash recovery issue and used
parts of 041_checkpoint_at_promote.pl. Unfortunately, my test wasn't
waiting for the desired message to appear in the log. I then realized
there was a mistake in log_contains(), which I had copied from the
existing test. After testing 041_checkpoint_at_promote.pl multiple
times to see if it worked as expected, I noticed differences in some
iterations.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Thu, Feb 13, 2025 at 11:18 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Show quoted text

On Thu, Feb 13, 2025 at 5:08 AM Michael Paquier <michael@paquier.xyz> wrote:

Anyway, how did you find that? Did you see a pattern when running the
test on a very slow machine or just from a read? That was a good
catch.

+1. I was wondering the same.

--
Best Wishes,
Ashutosh Bapat