Instability of pg_walsummary/002_blocks.pl due to timing
Hello hackers,
A couple of the 002.blocks test's failures occurred during past three
months: [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2025-04-09%2007%3A36%3A05, [2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2025-07-01%2010%3A23%3A38 with the following diagnostics:
# Failed test 'WAL summarizer generates statistics for WAL reads'
# at /home/bf/bf-build/culicidae/REL_18_STABLE/pgsql/src/bin/pg_walsummary/t/002_blocks.pl line 54.
# got: 'f'
# expected: 't'
# Looks like you failed 1 test of 8.
pgsql.build/testrun/pg_walsummary/002_blocks/log/regress_log_002_blocks
[12:29:12.131](0.351s) ok 1 - WAL summarization caught up after insert
[12:29:12.196](0.065s) not ok 2 - WAL summarizer generates statistics for WAL reads
[12:29:12.198](0.002s) # Failed test 'WAL summarizer generates statistics for WAL reads'
# at /home/bf/bf-build/culicidae/REL_18_STABLE/pgsql/src/bin/pg_walsummary/t/002_blocks.pl line 54.
[12:29:12.198](0.000s) # got: 'f'
# expected: 't'
[12:29:12.267](0.069s) # after insert, summarized through 0/1821510
[12:29:12.507](0.240s) ok 3 - got new WAL summary after update
This test case is rather new, it was added by f4694e0f3 (from 2025-03-05).
I could reproduce this failure within 20 test runs with the following
modification:
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -1544,6 +1544,7 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
* so we don't tight-loop.
*/
ProcessWalSummarizerInterrupts();
+pg_usleep(1000000);
summarizer_wait_for_wal();
Michael, as you added the test case, could you please have a look?
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2025-04-09%2007%3A36%3A05
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2025-07-01%2010%3A23%3A38
Best regards,
Alexander
On Sun, Jul 06, 2025 at 12:00:00PM +0300, Alexander Lakhin wrote:
This test case is rather new, it was added by f4694e0f3 (from 2025-03-05).
Thanks for the report.
I could reproduce this failure within 20 test runs with the following modification: --- a/src/backend/postmaster/walsummarizer.c +++ b/src/backend/postmaster/walsummarizer.c @@ -1544,6 +1544,7 @@ summarizer_read_local_xlog_page(XLogReaderState *state, * so we don't tight-loop. */ ProcessWalSummarizerInterrupts(); +pg_usleep(1000000); summarizer_wait_for_wal();Michael, as you added the test case, could you please have a look?
I'm failing to reproduce it, unfortunately. It looks like just a
timing issue with the reports, so the best option I can think of here
would be to switch the test to do a wait until the stats have been
generated, leading to the attached. Do you still see the problem with
that in place?
--
Michael
Attachments:
walsummarizer-test.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/bin/pg_walsummary/t/002_blocks.pl b/src/bin/pg_walsummary/t/002_blocks.pl
index 270332780a45..0f98c7df82e6 100644
--- a/src/bin/pg_walsummary/t/002_blocks.pl
+++ b/src/bin/pg_walsummary/t/002_blocks.pl
@@ -47,11 +47,12 @@ EOM
ok($result, "WAL summarization caught up after insert");
# The WAL summarizer should have generated some IO statistics.
-my $stats_reads = $node1->safe_psql(
+$node1->poll_query_until(
'postgres',
- qq{SELECT sum(reads) > 0 FROM pg_stat_io
- WHERE backend_type = 'walsummarizer' AND object = 'wal'});
-is($stats_reads, 't', "WAL summarizer generates statistics for WAL reads");
+ q{SELECT sum(reads) > 0 FROM pg_stat_io
+ WHERE backend_type = 'walsummarizer' AND object = 'wal'})
+ or die
+ "Timed out while waiting for WAL summarizer to generate statistics for WAL reads";
# Find the highest LSN that is summarized on disk.
my $summarized_lsn = $node1->safe_psql('postgres', <<EOM);
Hello Michail,
07.07.2025 03:18, Michael Paquier wrote:
I'm failing to reproduce it, unfortunately. It looks like just a
timing issue with the reports, so the best option I can think of here
would be to switch the test to do a wait until the stats have been
generated, leading to the attached. Do you still see the problem with
that in place?
I'm sorry for not being accurate enough -- I forgot to mention that I
replicated the config from culicidae, and now I see that "fsync=on"
is needed to reproduce the failure for me (though maybe).
With:
./configure -q --enable-cassert --enable-debug --enable-tap-tests && make -s -j8
echo "fsync=on" >/tmp/temp.config
for i in {1..100}; do echo "ITERATION $i"; TEMP_CONFIG=/tmp/temp.config PROVE_TESTS="t/002*" make -s check -C
src/bin/pg_walsummary/ || break; done
I got failures on iterations 3, 5, 1. With your patch applied, I got 100
iterations passed.
Thank you for the fix!
Best regards,
Alexander
On Mon, Jul 07, 2025 at 08:00:01AM +0300, Alexander Lakhin wrote:
I'm sorry for not being accurate enough -- I forgot to mention that I
replicated the config from culicidae, and now I see that "fsync=on"
is needed to reproduce the failure for me (though maybe).
Even with that, the situation was stable here with two hours running
the test in a loop.
I got failures on iterations 3, 5, 1. With your patch applied, I got 100
iterations passed.
This is good enough for me, so applied. Thanks for double-checking.
--
Michael