stats test intermittent failure
Hi,
Jeff pointed out that one of the pg_stat_io tests has failed a few times
over the past months (here on morepork [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2023-06-16%2018%3A30%3A32 and more recently here on
francolin [2]https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=francolin&dt=2023-07-07%2018%3A43%3A57&stg=recovery-check).
Failing test diff for those who prefer not to scroll:
+++ /home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out
2023-07-07 18:48:25.976313231 +0000
@@ -1415,7 +1415,7 @@
:io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
?column? | ?column?
----------+----------
- t | t
+ t | f
My theory about the test failure is that, when there is enough demand
for shared buffers, the flapping test fails because it expects buffer
access strategy *reuses* and concurrent queries already flushed those
buffers before they could be reused. Attached is a patch which I think
will fix the test while keeping some code coverage. If we count
evictions and reuses together, those should have increased.
- Melanie
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2023-06-16%2018%3A30%3A32
[2]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=francolin&dt=2023-07-07%2018%3A43%3A57&stg=recovery-check
Attachments:
v1-0001-Fix-pg_stat_io-buffer-reuse-test-instability.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Fix-pg_stat_io-buffer-reuse-test-instability.patchDownload+32-11
Hi Melanie,
10.07.2023 21:35, Melanie Plageman wrote:
Hi,
Jeff pointed out that one of the pg_stat_io tests has failed a few times
over the past months (here on morepork [1] and more recently here on
francolin [2]).Failing test diff for those who prefer not to scroll:
+++ /home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out 2023-07-07 18:48:25.976313231 +0000 @@ -1415,7 +1415,7 @@ :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses; ?column? | ?column? ----------+---------- - t | t + t | fMy theory about the test failure is that, when there is enough demand
for shared buffers, the flapping test fails because it expects buffer
access strategy *reuses* and concurrent queries already flushed those
buffers before they could be reused. Attached is a patch which I think
will fix the test while keeping some code coverage. If we count
evictions and reuses together, those should have increased.
I managed to reproduce that failure with the attached patch applied
(on master) and with the following script (that effectively multiplies
probability of the failure by 360):
CPPFLAGS="-O0" ./configure -q --enable-debug --enable-cassert --enable-tap-tests && make -s -j`nproc` && make -s check
-C src/test/recovery
mkdir -p src/test/recovery00/t
cp src/test/recovery/t/027_stream_regress.pl src/test/recovery00/t/
cp src/test/recovery/Makefile src/test/recovery00/
for ((i=1;i<=9;i++)); do cp -r src/test/recovery00/ src/test/recovery$i; done
for ((i=1;i<=10;i++)); do echo "iteration $i"; NO_TEMP_INSTALL=1 parallel --halt now,fail=1 -j9 --linebuffer --tag make
-s check -C src/test/{} ::: recovery1 recovery2 recovery3 recovery4 recovery5 recovery6 recovery7 recovery8 recovery9 ||
break; done
Without your patch, I get:
iteration 2
...
recovery5 # Failed test 'regression tests pass'
recovery5 # at t/027_stream_regress.pl line 92.
recovery5 # got: '256'
recovery5 # expected: '0'
...
src/test/recovery5/tmp_check/log/regress_log_027_stream_regress contains:
--- .../src/test/regress/expected/stats.out 2023-07-11 20:05:10.536059706 +0300
+++ .../src/test/recovery5/tmp_check/results/stats.out 2023-07-11 20:30:46.790551305 +0300
@@ -1418,7 +1418,7 @@
:io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
?column? | ?column?
----------+----------
- t | t
+ t | f
(1 row)
With your patch applied, 10 iterations performed successfully for me.
So it looks like your theory and your fix are correct.
Best regards,
Alexander
Attachments:
regress-repeat-stats-line.patchtext/x-patch; charset=UTF-8; name=regress-repeat-stats-line.patchDownload+108-55
Hi,
On Tue, Jul 11, 2023 at 3:35 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
Hi,
Jeff pointed out that one of the pg_stat_io tests has failed a few times
over the past months (here on morepork [1] and more recently here on
francolin [2]).Failing test diff for those who prefer not to scroll:
+++ /home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out 2023-07-07 18:48:25.976313231 +0000 @@ -1415,7 +1415,7 @@ :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses; ?column? | ?column? ----------+---------- - t | t + t | fMy theory about the test failure is that, when there is enough demand
for shared buffers, the flapping test fails because it expects buffer
access strategy *reuses* and concurrent queries already flushed those
buffers before they could be reused. Attached is a patch which I think
will fix the test while keeping some code coverage. If we count
evictions and reuses together, those should have increased.
Yeah, I've not reproduced this issue but it's possible. IIUC if we get
the buffer from the ring, we count an I/O as "reuse" even if the
buffer has already been flushed/replaced. However, if the buffer in
the ring is pinned by other backends, we end up evicting a buffer from
outside of the ring and adding it to the buffer, which is counted as
"eviction".
Regarding the patch, I have a comment:
-- Test that reuse of strategy buffers and reads of blocks into these reused
--- buffers while VACUUMing are tracked in pg_stat_io.
+-- buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient
+-- demand for shared buffers from concurrent queries, some blocks may be
+-- evicted from the strategy ring before they can be reused. In such cases
+-- this, the backend will evict a block from a shared buffer outside of the
+-- ring and add it to the ring. This is considered an eviction and not a reuse.
The new comment seems not to be accurate if my understanding is
correct. How about the following?
Test that reuse of strategy buffers and reads of blocks into these
reused buffers while VACUUMing are tracked in pg_stat_io. If there is
sufficient demand for shared buffers from concurrent queries, some
buffers may be pinned by other backends before they can be reused. In
such cases, the backend will evict a buffer from a shared buffer
outside of the ring and add it to the ring. This is considered an
eviction and not a reuse.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,
On 2023-07-31 21:03:07 +0900, Masahiko Sawada wrote:
Regarding the patch, I have a comment:
-- Test that reuse of strategy buffers and reads of blocks into these reused --- buffers while VACUUMing are tracked in pg_stat_io. +-- buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient +-- demand for shared buffers from concurrent queries, some blocks may be +-- evicted from the strategy ring before they can be reused. In such cases +-- this, the backend will evict a block from a shared buffer outside of the +-- ring and add it to the ring. This is considered an eviction and not a reuse.The new comment seems not to be accurate if my understanding is correct. How
about the following?Test that reuse of strategy buffers and reads of blocks into these
reused buffers while VACUUMing are tracked in pg_stat_io. If there is
sufficient demand for shared buffers from concurrent queries, some
buffers may be pinned by other backends before they can be reused. In
such cases, the backend will evict a buffer from a shared buffer
outside of the ring and add it to the ring. This is considered an
eviction and not a reuse.
I integrated the suggested change of the comment and tweaked it a bit
more. And finally pushed the fix.
Sorry that it took so long.
Greetings,
Andres Freund