All-zero page in GIN index causes assertion failure
This is a continuation of the discussion at
/messages/by-id/CAMkU=1zUc=h0oCZntaJaqqW7gxxVxCWsYq8DD2t7oHgsgVEsgA@mail.gmail.com,
I'm starting a new thread as this is a separate issue than the original
LWLock bug.
On Thu, Jul 16, 2015 at 12:03 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jul 15, 2015 at 8:44 AM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:I don't see how this is related to the LWLock issue, but I didn't see it
without your patch. Perhaps the system just didn't survive long enough to
uncover it without the patch (although it shows up pretty quickly). It
could just be an overzealous Assert, since the casserts off didn't show
problems.bt and bt full are shown below.
Cheers,
Jeff
#0 0x0000003dcb632625 in raise () from /lib64/libc.so.6
#1 0x0000003dcb633e05 in abort () from /lib64/libc.so.6
#2 0x0000000000930b7a in ExceptionalCondition (
conditionName=0x9a1440 "!(((PageHeader) (page))->pd_special >=
(__builtin_offsetof (PageHeaderData, pd_linp)))", errorType=0x9a12bc
"FailedAssertion",
fileName=0x9a12b0 "ginvacuum.c", lineNumber=713) at assert.c:54
#3 0x00000000004947cf in ginvacuumcleanup (fcinfo=0x7fffee073a90) at
ginvacuum.c:713It now looks like this *is* unrelated to the LWLock issue. The assert that
it is tripping over was added just recently (302ac7f27197855afa8c) and so I
had not been testing under its presence until now. It looks like it is
finding all-zero pages (index extended but then a crash before initializing
the page?) and it doesn't like them.(gdb) f 3
(gdb) p *(char[8192]*)(page)
$11 = '\000' <repeats 8191 times>Presumably before this assert, such pages would just be permanently
orphaned.
Yeah, so it seems. It's normal to have all-zero pages in the index, if
you crash immediately after the relation has been extended, but before
the new page has been WAL-logged. What is your test case like; did you
do crash-testing?
ISTM ginvacuumcleanup should check for PageIsNew, and put the page to
the FSM. That's what btvacuumpage() gistvacuumcleanup() do.
spgvacuumpage() seems to also check for PageIsNew(), but it seems broken
in a different way: it initializes the page and marks the page as dirty,
but it is not WAL-logged. That is a problem at least if checksums are
enabled: if you crash you might have a torn page on disk, with invalid
checksum.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 07/20/2015 11:14 AM, Heikki Linnakangas wrote:
ISTM ginvacuumcleanup should check for PageIsNew, and put the page to
the FSM. That's what btvacuumpage() gistvacuumcleanup() do.
spgvacuumpage() seems to also check for PageIsNew(), but it seems broken
in a different way: it initializes the page and marks the page as dirty,
but it is not WAL-logged. That is a problem at least if checksums are
enabled: if you crash you might have a torn page on disk, with invalid
checksum.
Looking closer, heap vacuum does a similar thing, but there are two
mitigating factors that make it safe in practice, I think:
1. An empty heap page is all-zeroes except for the small page header in
the beginning of the page. For a torn page to matter, the page would
need to be torn in the header, but we rely elsewhere (control file)
anyway that a 512-byte sector update is atomic, so that shouldn't
happen. Note that this hinges on the fact that there is no special area
on heap pages, so you cannot rely on this for index pages.
2. The redo of the first insert/update on a heap page will always
re-initialize the page, even when full-page-writes are disabled. This is
the XLOG_HEAP_INIT_PAGE optimization. So it's not just an optimization,
it's required for correctness.
Heap update can also leave behind a page in the buffer cache that's been
initialized by RelationGetBufferForTuple but not yet WAL-logged.
However, it doesn't mark the buffer dirty, so the torn-page problem
cannot happen because the page will not be flushed to disk if nothing
else touches it. The XLOG_HEAP_INIT_PAGE optimization is needed in that
case too, however.
B-tree, GiST, and SP-GiST's relation extension work similarly, but they
have other mitigating factors. If a newly-initialized B-tree page is
left behind in the relation, it won't be reused for anything, and vacuum
will ignore it (by accident, I think; there is no explicit comment on
what will happen to such pages, but it will be treated like an internal
page and ignored). Eventually the buffer will be evicted from cache, and
because it's not marked as dirty, it will not be flushed to disk, and
will later be read back as all-zeros and vacuum will recycle it.
BRIN update is not quite right, however. brin_getinsertbuffer() can
initialize a page, but the caller might bail out without using the page
and WAL-logging the change. If that happens, the next update that uses
the same page will WAL-log the change but it will not use the
XLOG_BRIN_INIT_PAGE option, and redo will not initialize the page. Redo
will fail.
BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN
pages? If an empty page is missing from the FSM for any reason, there's
nothing to add it there.
This is all very subtle. The whole business of leaving behind an
already-initialized page in the buffer cache, without marking the buffer
as dirty, is pretty ugly. I wish we had a more robust pattern to handle
all-zero pages and relation extension.
Thoughts? As a minimal backpatchable fix, I think we should add the
check in ginvacuumpage() to initialize any all-zeros pages it
encounters. That needs to be WAL-logged, and WAL-logging needs to be
added to the page initialization in spgvacuumpage too.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Heikki Linnakangas wrote:
Looking closer, heap vacuum does a similar thing, but there are two
mitigating factors that make it safe in practice, I think:1. An empty heap page is all-zeroes except for the small page header in the
beginning of the page. For a torn page to matter, the page would need to be
torn in the header, but we rely elsewhere (control file) anyway that a
512-byte sector update is atomic, so that shouldn't happen. Note that this
hinges on the fact that there is no special area on heap pages, so you
cannot rely on this for index pages.2. The redo of the first insert/update on a heap page will always
re-initialize the page, even when full-page-writes are disabled. This is the
XLOG_HEAP_INIT_PAGE optimization. So it's not just an optimization, it's
required for correctness.
Hmm, I guess this is a case of an optimization hiding a bug :-( I mean,
if we didn't have that, we would have noticed this a lot sooner, I
imagine.
BRIN update is not quite right, however. brin_getinsertbuffer() can
initialize a page, but the caller might bail out without using the page and
WAL-logging the change. If that happens, the next update that uses the same
page will WAL-log the change but it will not use the XLOG_BRIN_INIT_PAGE
option, and redo will not initialize the page. Redo will fail.
Hmm, interesting.
BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN pages?
If an empty page is missing from the FSM for any reason, there's nothing to
add it there.
Probably, yeah, makes sense. If you recall, the original design I
proposed had a scan of the whole index (though for other reasons I am
thankful we got rid of that).
This is all very subtle.
No kidding ...
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 07/20/2015 07:11 PM, Alvaro Herrera wrote:
Heikki Linnakangas wrote:
Looking closer, heap vacuum does a similar thing, but there are two
mitigating factors that make it safe in practice, I think:1. An empty heap page is all-zeroes except for the small page header in the
beginning of the page. For a torn page to matter, the page would need to be
torn in the header, but we rely elsewhere (control file) anyway that a
512-byte sector update is atomic, so that shouldn't happen. Note that this
hinges on the fact that there is no special area on heap pages, so you
cannot rely on this for index pages.2. The redo of the first insert/update on a heap page will always
re-initialize the page, even when full-page-writes are disabled. This is the
XLOG_HEAP_INIT_PAGE optimization. So it's not just an optimization, it's
required for correctness.Hmm, I guess this is a case of an optimization hiding a bug :-( I mean,
if we didn't have that, we would have noticed this a lot sooner, I
imagine.
Yeah, probably.
I came up with the attached, for GIN and SP-GiST. Instead of WAL-logging
the page initialization in SP-GiST vacuum, I changed it so that it
simply leaves the page as all-zeroes, and adds it to the FSM. The code
that grabs a target page from the FSM handles that, and initializes the
page anyway, so that was simpler. This made the SP-GiST is-deleted flag
obsolete, it's no longer set, although the code still checks for it for
backwards-compatibility. (even that may actually be unnecessary, as a
page that's marked as deleted must also be empty, and wherever we check
for the deleted-flag, we also check for PageIsEmpty()))
- Heikki
Attachments:
fix-gin-spgist-vacuum-1.patchapplication/x-patch; name=fix-gin-spgist-vacuum-1.patchDownload+9-20
On Mon, Jul 20, 2015 at 1:23 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I came up with the attached, for GIN and SP-GiST. Instead of WAL-logging
the page initialization in SP-GiST vacuum, I changed it so that it simply
leaves the page as all-zeroes, and adds it to the FSM. The code that grabs
a target page from the FSM handles that, and initializes the page anyway,
so that was simpler. This made the SP-GiST is-deleted flag obsolete, it's
no longer set, although the code still checks for it for
backwards-compatibility. (even that may actually be unnecessary, as a page
that's marked as deleted must also be empty, and wherever we check for the
deleted-flag, we also check for PageIsEmpty()))
This patch, in conjunction with the LWLock deadlock patch, fixes all the
issues I was having with GIN indexes in 9.5.
I haven't tried SP-GiST.
Cheers,
Jeff
Heikki Linnakangas wrote:
BRIN update is not quite right, however. brin_getinsertbuffer() can
initialize a page, but the caller might bail out without using the page and
WAL-logging the change. If that happens, the next update that uses the same
page will WAL-log the change but it will not use the XLOG_BRIN_INIT_PAGE
option, and redo will not initialize the page. Redo will fail.
There's even a worse case. In brin_getinsertbuffer, if an old buffer is
passed, and the function extends the index, and then
brin_getinsertbuffer finds that the old buffer has been used to extend
the revmap, then we don't even return the newly extended page. We do
enter it into the FSM, though, and the FSM is vacuumed.
So next time around somebody requests a page from FSM, it might return
that page; but since it wasn't initialized, the insertion will fail.
Everybody expects the page to have been initialized previously, but this
will not be the case here.
I guess that function needs some restructuring, but it's pretty hairy
already ...
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 07/23/2015 07:43 PM, Jeff Janes wrote:
On Mon, Jul 20, 2015 at 1:23 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I came up with the attached, for GIN and SP-GiST. Instead of WAL-logging
the page initialization in SP-GiST vacuum, I changed it so that it simply
leaves the page as all-zeroes, and adds it to the FSM. The code that grabs
a target page from the FSM handles that, and initializes the page anyway,
so that was simpler. This made the SP-GiST is-deleted flag obsolete, it's
no longer set, although the code still checks for it for
backwards-compatibility. (even that may actually be unnecessary, as a page
that's marked as deleted must also be empty, and wherever we check for the
deleted-flag, we also check for PageIsEmpty()))This patch, in conjunction with the LWLock deadlock patch, fixes all the
issues I was having with GIN indexes in 9.5.
Thanks, I've pushed this, as well a fix to similar failure from B-tree
vacuum that Amit Langote reported in the other thread.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Jul 28, 2015 at 12:21 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Thanks, I've pushed this, as well a fix to similar failure from B-tree
vacuum that Amit Langote reported in the other thread.
Thanks Heikki and sorry I didn't notice this new thread.
Regards,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Heikki Linnakangas wrote:
BRIN update is not quite right, however. brin_getinsertbuffer() can
initialize a page, but the caller might bail out without using the page and
WAL-logging the change. If that happens, the next update that uses the same
page will WAL-log the change but it will not use the XLOG_BRIN_INIT_PAGE
option, and redo will not initialize the page. Redo will fail.
Here's a fix for BRIN: when brin_getinsertbuffer extends the relation
and doesn't return the page, it initializes and logs the page as an
empty regular page before returning, and also records it into the FSM.
That way, some future process that gets a page from FSM will use it,
preventing this type of problem from bloating the index forever. Also,
this function no longer initializes the page when it returns it to the
caller: instead the caller (brin_doinsert or brin_doupdate) must do
that. (Previously, the code was initializing the page outside the
critical section that WAL-logs this action).
BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN pages?
If an empty page is missing from the FSM for any reason, there's nothing to
add it there.
Probably. I didn't change this part yet. There are two things to fix:
1. since we use log_newpage_buffer(), we log the initialization but not
the recording into FSM, so the page would be forgotten about. This can
be tested with PageIsEmpty(). An alternative to the vacuum scan is to
use our own WAL record that not only logs the initialization itself but
also the FSM update. Not sure this is worth the trouble.
2. additionally, if brin_getinsertbuffer extends the relation but we
crash before the caller initializes it, the page would be detected by
PageIsNew instead and would also need initialization.
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
brin-page-init.patchtext/x-diff; charset=us-asciiDownload+140-23
Alvaro Herrera wrote:
BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN pages?
If an empty page is missing from the FSM for any reason, there's nothing to
add it there.Probably. I didn't change this part yet. There are two things to fix:
1. since we use log_newpage_buffer(), we log the initialization but not
the recording into FSM, so the page would be forgotten about. This can
be tested with PageIsEmpty(). An alternative to the vacuum scan is to
use our own WAL record that not only logs the initialization itself but
also the FSM update. Not sure this is worth the trouble.2. additionally, if brin_getinsertbuffer extends the relation but we
crash before the caller initializes it, the page would be detected by
PageIsNew instead and would also need initialization.
Added this part. It's using log_newpage_buffer too. The vacuum scan
fixes the whole FSM, though, so after vacuum the FSM is up to date.
I think we could shave off a few bytes by using a separate WAL record,
but I'm not sure it's worth the trouble.
I intend to push this tomorrow.
I now think the free space calculations are broken, but I'll leave that
for later.
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services