Address the bug in 041_checkpoint_at_promote.pl

Started by Nitin Jadhav, over 1 year ago, 7 messages, hackers

nitinjadhavpostgres@gmail.com

over 1 year ago

While testing, I discovered an issue in 041_checkpoint_at_promote.pl.

# Wait until the previous restart point completes on the newly-promoted
# standby, checking the logs for that.
my $checkpoint_complete = 0;
foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
{
	if ($node_standby->log_contains("restartpoint complete"), $logstart)
	{
		$checkpoint_complete = 1;
		last;
	}
	usleep(100_000);
}
is($checkpoint_complete, 1, 'restart point has completed');

The code is intended to wait for the restart point to complete before
proceeding. However, it doesn't actually wait. Regardless of whether
the restart point completes, the loop exits after the first iteration
because the if condition always evaluates to true. This happens
because $logstart is not passed as an argument to log_contains() by
mistake. If the restart point operation is quick, this issue might not
be noticeable, which is often the case.
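
The effect described above can be reproduced outside the test suite. The following is a minimal sketch, not code from the patch: `log_contains_stub` is a hypothetical stand-in for `$node_standby->log_contains()` that never finds the pattern, and the value of `$logstart` is an assumed non-zero log offset. With the parenthesis closed too early, Perl's comma operator discards the call's result and the condition takes the value of `$logstart`.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $node_standby->log_contains(): it counts its
# calls and always reports "pattern not found".
my $calls = 0;

sub log_contains_stub
{
	my ($pattern, $offset) = @_;
	$calls++;
	return 0;
}

# Assumed non-zero byte offset into the log file.
my $logstart = 4096;

# Buggy form: the call is closed before $logstart, so the condition is
# "CALL, $logstart". The comma operator throws away the call's result and
# yields $logstart, which is true for any non-zero offset.
my $buggy = 0;
if (log_contains_stub("restartpoint complete"), $logstart)
{
	$buggy = 1;
}

# Fixed form: $logstart is the second argument, and the condition is the
# call's return value.
my $fixed = 0;
if (log_contains_stub("restartpoint complete", $logstart))
{
	$fixed = 1;
}

print "buggy=$buggy fixed=$fixed calls=$calls\n";
# prints "buggy=1 fixed=0 calls=2"
```

Perl emits no warning for the buggy form, because a subroutine or method call in void context is not flagged as useless.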

I've attached a patch to fix this issue. Please review and share your feedback.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Nitin Jadhav (#1)

Re: Address the bug in 041_checkpoint_at_promote.pl

On Wed, Feb 12, 2025 at 01:28:55PM +0530, Nitin Jadhav wrote:

> The code is intended to wait for the restart point to complete before
> proceeding. However, it doesn't actually wait. Regardless of whether
> the restart point completes, the loop exits after the first iteration
> because the if condition always evaluates to true. This happens
> because $logstart is not passed as an argument to log_contains() by
> mistake. If the restart point operation is quick, this issue might not
> be noticeable, which is often the case.
>
> I've attached a patch to fix this issue. Please review and share your feedback.

Oops, you're right. I am measuring 2ms or so on average between the wakeup
and the restartpoint complete. Removing the wakeup makes the test
complete, but it should wait in its lookup loop.
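
As an illustration of the waiting behaviour the lookup loop is meant to have, here is a self-contained sketch. It is not the committed fix: `FakeNode` is a hypothetical stand-in for the standby node object whose `log_contains()` only reports a match on its third call, and `$timeout` is an assumed substitute for `$PostgreSQL::Test::Utils::timeout_default`.

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# Hypothetical stand-in for the standby node: the pattern "appears" in the
# log only from the third poll onwards.
package FakeNode;

sub new
{
	my ($class) = @_;
	return bless { polls => 0 }, $class;
}

sub log_contains
{
	my ($self, $pattern, $offset) = @_;
	$self->{polls}++;
	return $self->{polls} >= 3 ? 1 : 0;
}

package main;

my $node = FakeNode->new;
my $logstart = 4096;    # assumed log offset
my $timeout = 5;        # seconds; assumed stand-in for timeout_default

# Poll every 100ms until the pattern shows up or the timeout is reached.
my $complete = 0;
foreach my $i (0 .. 10 * $timeout)
{
	if ($node->log_contains("restartpoint complete", $logstart))
	{
		$complete = 1;
		last;
	}
	usleep(100_000);
}

print "complete=$complete polls=$node->{polls}\n";
# prints "complete=1 polls=3"
```

With the condition written this way the loop keeps polling until the stub reports a match, instead of exiting on the first iteration.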

Will fix down to v17, where this error was introduced.
--
Michael

Nitin Jadhav

nitinjadhavpostgres@gmail.com

over 1 year ago

In reply to: Michael Paquier (#2)

Re: Address the bug in 041_checkpoint_at_promote.pl

> Removing the wakeup makes the test
> complete, but it should wait in its lookup loop.

Thank you for confirming. Besides fixing the if condition as done in
the patch, do you think any other changes are necessary?

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft


Nitin Jadhav

nitinjadhavpostgres@gmail.com

over 1 year ago

In reply to: Nitin Jadhav (#3)

Re: Address the bug in 041_checkpoint_at_promote.pl

>> Removing the wakeup makes the test
>> complete, but it should wait in its lookup loop.
>
> Thank you for confirming. Besides fixing the if condition as done in
> the patch, do you think any other changes are necessary?

I see that it's already been committed and understand that no other
changes are needed. Thank you!

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft


Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Nitin Jadhav (#4)

Re: Address the bug in 041_checkpoint_at_promote.pl

On Wed, Feb 12, 2025 at 04:50:34PM +0530, Nitin Jadhav wrote:

> I see that it's already been committed and understand that no other
> changes are needed. Thank you!

My apologies for the lack of updates here. I looked at the whole
test again yesterday and the issue that you have reported was the only
one standing out.

Anyway, how did you find that? Did you see a pattern when running the
test on a very slow machine or just from a read? That was a good
catch.
--
Michael

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 1 year ago

In reply to: Michael Paquier (#5)

Re: Address the bug in 041_checkpoint_at_promote.pl

On Thu, Feb 13, 2025 at 5:08 AM Michael Paquier <michael@paquier.xyz> wrote:

> Anyway, how did you find that? Did you see a pattern when running the
> test on a very slow machine or just from a read? That was a good
> catch.

+1. I was wondering the same.

--
Best Wishes,
Ashutosh Bapat

Nitin Jadhav

nitinjadhavpostgres@gmail.com

over 1 year ago

In reply to: Ashutosh Bapat (#6)

Re: Address the bug in 041_checkpoint_at_promote.pl

>> Anyway, how did you find that? Did you see a pattern when running the
>> test on a very slow machine or just from a read? That was a good
>> catch.
>
> +1. I was wondering the same.

I was writing a TAP test to reproduce a crash recovery issue and used
parts of 041_checkpoint_at_promote.pl. Unfortunately, my test wasn't
waiting for the desired message to appear in the log. I then realized
there was a mistake in log_contains(), which I had copied from the
existing test. After testing 041_checkpoint_at_promote.pl multiple
times to see if it worked as expected, I noticed differences in some
iterations.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

