Instability of pg_walsummary/002_blocks.pl due to timing

Started by Alexander Lakhin6 months ago4 messages
#1Alexander Lakhin
exclusion@gmail.com

Hello hackers,

A couple of the 002.blocks test's failures occurred during past three
months: [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2025-04-09%2007%3A36%3A05, [2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2025-07-01%2010%3A23%3A38 with the following diagnostics:
#   Failed test 'WAL summarizer generates statistics for WAL reads'
#   at /home/bf/bf-build/culicidae/REL_18_STABLE/pgsql/src/bin/pg_walsummary/t/002_blocks.pl line 54.
#          got: 'f'
#     expected: 't'
# Looks like you failed 1 test of 8.

pgsql.build/testrun/pg_walsummary/002_blocks/log/regress_log_002_blocks
[12:29:12.131](0.351s) ok 1 - WAL summarization caught up after insert
[12:29:12.196](0.065s) not ok 2 - WAL summarizer generates statistics for WAL reads
[12:29:12.198](0.002s) #   Failed test 'WAL summarizer generates statistics for WAL reads'
#   at /home/bf/bf-build/culicidae/REL_18_STABLE/pgsql/src/bin/pg_walsummary/t/002_blocks.pl line 54.
[12:29:12.198](0.000s) #          got: 'f'
#     expected: 't'
[12:29:12.267](0.069s) # after insert, summarized through 0/1821510
[12:29:12.507](0.240s) ok 3 - got new WAL summary after update

This test case is rather new, it was added by f4694e0f3 (from 2025-03-05).

I could reproduce this failure within 20 test runs with the following
modification:
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -1544,6 +1544,7 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
                                  * so we don't tight-loop.
                                  */
                                 ProcessWalSummarizerInterrupts();
+pg_usleep(1000000);
                                 summarizer_wait_for_wal();

Michael, as you added the test case, could you please have a look?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2025-04-09%2007%3A36%3A05
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2025-07-01%2010%3A23%3A38

Best regards,
Alexander

#2Michael Paquier
michael@paquier.xyz
In reply to: Alexander Lakhin (#1)
1 attachment(s)
Re: Instability of pg_walsummary/002_blocks.pl due to timing

On Sun, Jul 06, 2025 at 12:00:00PM +0300, Alexander Lakhin wrote:

This test case is rather new, it was added by f4694e0f3 (from 2025-03-05).

Thanks for the report.

I could reproduce this failure within 20 test runs with the following
modification:
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -1544,6 +1544,7 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
                                 * so we don't tight-loop.
                                 */
                                ProcessWalSummarizerInterrupts();
+pg_usleep(1000000);
                                summarizer_wait_for_wal();

Michael, as you added the test case, could you please have a look?

I'm failing to reproduce it, unfortunately. It looks like just a
timing issue with the reports, so the best option I can think of here
would be to switch the test to do a wait until the stats have been
generated, leading to the attached. Do you still see the problem with
that in place?
--
Michael

Attachments:

walsummarizer-test.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/bin/pg_walsummary/t/002_blocks.pl b/src/bin/pg_walsummary/t/002_blocks.pl
index 270332780a45..0f98c7df82e6 100644
--- a/src/bin/pg_walsummary/t/002_blocks.pl
+++ b/src/bin/pg_walsummary/t/002_blocks.pl
@@ -47,11 +47,12 @@ EOM
 ok($result, "WAL summarization caught up after insert");
 
 # The WAL summarizer should have generated some IO statistics.
-my $stats_reads = $node1->safe_psql(
+$node1->poll_query_until(
 	'postgres',
-	qq{SELECT sum(reads) > 0 FROM pg_stat_io
-   WHERE backend_type = 'walsummarizer' AND object = 'wal'});
-is($stats_reads, 't', "WAL summarizer generates statistics for WAL reads");
+	q{SELECT sum(reads) > 0 FROM pg_stat_io
+   WHERE backend_type = 'walsummarizer' AND object = 'wal'})
+  or die
+  "Timed out while waiting for WAL summarizer to generate statistics for WAL reads";
 
 # Find the highest LSN that is summarized on disk.
 my $summarized_lsn = $node1->safe_psql('postgres', <<EOM);
#3Alexander Lakhin
exclusion@gmail.com
In reply to: Michael Paquier (#2)
Re: Instability of pg_walsummary/002_blocks.pl due to timing

Hello Michail,

07.07.2025 03:18, Michael Paquier wrote:

I'm failing to reproduce it, unfortunately. It looks like just a
timing issue with the reports, so the best option I can think of here
would be to switch the test to do a wait until the stats have been
generated, leading to the attached. Do you still see the problem with
that in place?

I'm sorry for not being accurate enough -- I forgot to mention that I
replicated the config from culicidae, and now I see that "fsync=on"
is needed to reproduce the failure for me (though maybe).

With:
./configure -q --enable-cassert --enable-debug --enable-tap-tests && make -s -j8
echo "fsync=on" >/tmp/temp.config
for i in {1..100}; do echo "ITERATION $i"; TEMP_CONFIG=/tmp/temp.config PROVE_TESTS="t/002*" make -s check -C
src/bin/pg_walsummary/ || break; done
I got failures on iterations 3, 5, 1. With your patch applied, I got 100
iterations passed.

Thank you for the fix!

Best regards,
Alexander

#4Michael Paquier
michael@paquier.xyz
In reply to: Alexander Lakhin (#3)
Re: Instability of pg_walsummary/002_blocks.pl due to timing

On Mon, Jul 07, 2025 at 08:00:01AM +0300, Alexander Lakhin wrote:

I'm sorry for not being accurate enough -- I forgot to mention that I
replicated the config from culicidae, and now I see that "fsync=on"
is needed to reproduce the failure for me (though maybe).

Even with that, the situation was stable here with two hours running
the test in a loop.

I got failures on iterations 3, 5, 1. With your patch applied, I got 100
iterations passed.

This is good enough for me, so applied. Thanks for double-checking.
--
Michael