Odd, intermittent failure in contrib/pageinspect

Started by Tom Laneabout 5 years ago11 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

whelk failed today [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2021-01-18%2020%3A42%3A13 with this surprising symptom:

--- snip ---
diff -w -U3 C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/expected/page.out C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/results/page.out
--- C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/expected/page.out	2020-03-08 09:00:35.036254700 +0100
+++ C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/results/page.out	2021-01-18 22:10:10.889655500 +0100
@@ -90,8 +90,8 @@
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
  t_infomask | t_infomask2 |                         raw_flags                         |   combined_flags   
-------------+-------------+-----------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
+------------+-------------+-----------------------------------------+----------------
+       2304 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (1 row)
 -- output the decoded flag HEAP_XMIN_FROZEN instead
@@ -99,8 +99,8 @@
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
  t_infomask | t_infomask2 |                         raw_flags                         |   combined_flags   
-------------+-------------+-----------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
+------------+-------------+-----------------------------------------+----------------
+       2304 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (1 row)
 -- tests for decoding of combined flags
--- snip ---

Searching the buildfarm logs turned up exactly one previous occurrence,
also on whelk [2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-04-17%2023%3A42%3A10. So I'm not sure what to make of it. Could the
immediately preceding VACUUM FREEZE command have silently skipped this
page for some reason? That'd be a bug I should think.

Also, not really a bug, but why is this test script running exactly
the same query twice in a row? If that's of value, and not just a
copy-and-paste error, the comments sure don't explain why. But what
it looks like is that these queries were different when first added,
and then 58b4cb30a5b made them the same when it probably should have
removed one.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2021-01-18%2020%3A42%3A13
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-04-17%2023%3A42%3A10

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#1)
Re: Odd, intermittent failure in contrib/pageinspect

On 2021-Jan-18, Tom Lane wrote:

Searching the buildfarm logs turned up exactly one previous occurrence,
also on whelk [2]. So I'm not sure what to make of it. Could the
immediately preceding VACUUM FREEZE command have silently skipped this
page for some reason? That'd be a bug I should think.

Hmm, doesn't vacuum skip pages when they are pinned? I don't think
VACUUM FREEZE would be treated especially -- only "aggressive"
wraparound would be an exception, IIRC. This would reflect in the
relfrozenxid for the table after vacuum, but I'm not sure if there's a
decent way to make the regression tests reflect that.

Also, not really a bug, but why is this test script running exactly
the same query twice in a row? If that's of value, and not just a
copy-and-paste error, the comments sure don't explain why. But what
it looks like is that these queries were different when first added,
and then 58b4cb30a5b made them the same when it probably should have
removed one.

Agreed.

--
�lvaro Herrera 39�49'30"S 73�17'W

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: Odd, intermittent failure in contrib/pageinspect

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On 2021-Jan-18, Tom Lane wrote:

Searching the buildfarm logs turned up exactly one previous occurrence,
also on whelk [2]. So I'm not sure what to make of it. Could the
immediately preceding VACUUM FREEZE command have silently skipped this
page for some reason? That'd be a bug I should think.

Hmm, doesn't vacuum skip pages when they are pinned? I don't think
VACUUM FREEZE would be treated especially -- only "aggressive"
wraparound would be an exception, IIRC.

Right. If that's the explanation, then adding DISABLE_PAGE_SKIPPING
to the test's VACUUM options should fix it. However, to believe that
theory you have to have some reason to think that some other process
might have the page pinned. What would that be? test1 only has one
small tuple in it, so it doesn't seem credible that autovacuum or
autoanalyze would have fired on it.

[ thinks for a bit... ] Does the checkpointer pin pages it's writing
out? I guess it'd have to ...

regards, tom lane

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: Odd, intermittent failure in contrib/pageinspect

On 2021-Jan-18, Tom Lane wrote:

Right. If that's the explanation, then adding DISABLE_PAGE_SKIPPING
to the test's VACUUM options should fix it. However, to believe that
theory you have to have some reason to think that some other process
might have the page pinned. What would that be? test1 only has one
small tuple in it, so it doesn't seem credible that autovacuum or
autoanalyze would have fired on it.

I guess the machine would have to be pretty constrained. (It takes
almost seven minutes to go through the pg_upgrade test, so it does seems
small.)

[ thinks for a bit... ] Does the checkpointer pin pages it's writing
out? I guess it'd have to ...

It does, per SyncOneBuffer(), called from BufferSync(), called from
CheckPointBuffers().

--
�lvaro Herrera 39�49'30"S 73�17'W

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#4)
Re: Odd, intermittent failure in contrib/pageinspect

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On 2021-Jan-18, Tom Lane wrote:

[ thinks for a bit... ] Does the checkpointer pin pages it's writing
out? I guess it'd have to ...

It does, per SyncOneBuffer(), called from BufferSync(), called from
CheckPointBuffers().

Right, then we don't need any strange theories about autovacuum,
just bad timing luck. whelk does seem pretty slow, so it's not
much of a stretch to imagine that it's more susceptible to this
corner case than faster machines.

So, do we have any other tests that are invoking a manual vacuum
and assuming it won't skip any pages? By this theory, they'd
all be failures waiting to happen.

regards, tom lane

#6Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#5)
Re: Odd, intermittent failure in contrib/pageinspect

On Mon, Jan 18, 2021 at 05:47:40PM -0500, Tom Lane wrote:

Right, then we don't need any strange theories about autovacuum,
just bad timing luck. whelk does seem pretty slow, so it's not
much of a stretch to imagine that it's more susceptible to this
corner case than faster machines.

So, do we have any other tests that are invoking a manual vacuum
and assuming it won't skip any pages? By this theory, they'd
all be failures waiting to happen.

That looks possible by looking at the code around lazy_scan_heap(),
but that's narrow.

check_heap.sql and heap_surgery.sql have one VACUUM FREEZE each and it
seems to me that we had better be sure that no pages are skipped for
their cases?

The duplicated query result looks to be an oversight from 58b4cb3 when
the thing got rewritten, so it can just go away. Good catch.
--
Michael

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#6)
Re: Odd, intermittent failure in contrib/pageinspect

Michael Paquier <michael@paquier.xyz> writes:

On Mon, Jan 18, 2021 at 05:47:40PM -0500, Tom Lane wrote:

So, do we have any other tests that are invoking a manual vacuum
and assuming it won't skip any pages? By this theory, they'd
all be failures waiting to happen.

check_heap.sql and heap_surgery.sql have one VACUUM FREEZE each and it
seems to me that we had better be sure that no pages are skipped for
their cases?

It looks to me like heap_surgery ought to be okay, because it's operating
on a temp table; if there are any page access conflicts on that, we've
got BIG trouble ;-)

Poking around, I found a few other places where it looked like a skipped
page could produce diffs in the expected output:
contrib/amcheck/t/001_verify_heapam.pl
contrib/pg_visibility/sql/pg_visibility.sql

There are lots of other vacuums of course, but they don't look like
a missed page would have any effect on the visible results, so I think
we should leave them alone.

In short I propose the attached patch, which also gets rid of
that duplicate query.

regards, tom lane

Attachments:

prevent-vacuum-page-skips-where-important.patchtext/x-diff; charset=us-ascii; name=prevent-vacuum-page-skips-where-important.patchDownload+15-24
#8Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#4)
Re: Odd, intermittent failure in contrib/pageinspect

Hi,

On 2021-01-18 19:40:05 -0300, Alvaro Herrera wrote:

[ thinks for a bit... ] Does the checkpointer pin pages it's writing
out? I guess it'd have to ...

It does, per SyncOneBuffer(), called from BufferSync(), called from
CheckPointBuffers().

I think you don't event need checkpointer to be involved, normal buffer
replacement would do the trick. We briefly pin the page in BufferAlloc()
even if the page is clean. Longer when it's dirty, of course.

Greetings,

Andres Freund

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#8)
Re: Odd, intermittent failure in contrib/pageinspect

Andres Freund <andres@anarazel.de> writes:

I think you don't event need checkpointer to be involved, normal buffer
replacement would do the trick. We briefly pin the page in BufferAlloc()
even if the page is clean. Longer when it's dirty, of course.

True, but it seems unlikely that the pages in question here would be
chosen as replacement victims. These are non-parallel tests, so
there's little competitive pressure. I could believe that a background
autovacuum is active, but not that it's dirtied so many pages that
tables the test script just created need to get swapped out.

The checkpointer theory seems good because it requires no assumptions
at all about competing demand for buffers. If the clock sweep gets
to the table page (which we know is recently dirtied) at just the right
time, we'll see a failure.

regards, tom lane

#10Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#7)
Re: Odd, intermittent failure in contrib/pageinspect

On Tue, Jan 19, 2021 at 05:03:49PM -0500, Tom Lane wrote:

It looks to me like heap_surgery ought to be okay, because it's operating
on a temp table; if there are any page access conflicts on that, we've
got BIG trouble ;-)

Bah, of course. I managed to miss this part.

Poking around, I found a few other places where it looked like a skipped
page could produce diffs in the expected output:
contrib/amcheck/t/001_verify_heapam.pl
contrib/pg_visibility/sql/pg_visibility.sql

There are lots of other vacuums of course, but they don't look like
a missed page would have any effect on the visible results, so I think
we should leave them alone.

Yeah, I got to wonder a bit about check_btree.sql on a second look,
but that's no big deal to leave it alone either.

In short I propose the attached patch, which also gets rid of
that duplicate query.

Agreed, +1.
--
Michael

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#10)
Re: Odd, intermittent failure in contrib/pageinspect

Michael Paquier <michael@paquier.xyz> writes:

On Tue, Jan 19, 2021 at 05:03:49PM -0500, Tom Lane wrote:

In short I propose the attached patch, which also gets rid of
that duplicate query.

Agreed, +1.

Pushed, thanks for looking at it.

regards, tom lane