_hash_alloc_buckets() safety

Started by Amit Kapila · over 9 years ago · 3 messages
#1 Amit Kapila
amit.kapila16@gmail.com

While working on write-ahead-logging of hash indexes, I noticed that
this function allocates buckets in batches, and the mechanism it uses
is to initialize only the last page of the batch with zeros, expecting
the filesystem to ensure that the intervening pages read as zeroes
too.
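
For context, the allocation logic is roughly the following (a
paraphrased sketch of _hash_alloc_buckets(), not verbatim):

static bool
_hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
{
    BlockNumber lastblock;
    char        zerobuf[BLCKSZ];

    lastblock = firstblock + nblocks - 1;

    /* bail out if the block number calculation would overflow */
    if (lastblock < firstblock || lastblock == InvalidBlockNumber)
        return false;

    MemSet(zerobuf, 0, sizeof(zerobuf));

    /*
     * Extend the relation directly to the last block of the batch; the
     * intervening blocks are never written, so reading them back relies
     * on the filesystem returning zeroes for the hole.
     */
    RelationOpenSmgr(rel);
    smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);

    return true;
}

Because only the last block is physically written, the intervening
blocks exist only as a filesystem hole until some later split actually
initializes and uses them.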

I think to make it WAL-enabled, we need to initialize the page header
(using PageInit() or equivalent) instead of initializing it with
zeroes, as some part of our WAL replay machinery expects that the page
should not be new, as indicated by me in the other thread [1]. I think
the WAL consistency check tool [2] also uses the same part of the replay
functions and will report this as a problem if we don't initialize the
page header.

The point which is not clear to me is whether it is okay as-is, or
whether we should try to initialize each page of the batch in
_hash_alloc_buckets(), considering that we are now trying to make hash
indexes WAL-enabled. Offhand, I don't see any problem with initializing
just the last page and writing the WAL for it with log_newpage();
however, if we try to initialize all pages, there could be some
performance penalty on the split operation.
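
Roughly, what I have in mind for the last page is something like this
(names such as lastblock/pagebuf are illustrative, following the
allocation sketch above):

    char        pagebuf[BLCKSZ];
    Page        page = (Page) pagebuf;

    /* initialize a standard page header instead of an all-zeroes page */
    PageInit(page, BLCKSZ, 0);

    /* WAL first (log_newpage sets the page LSN), then write the data */
    if (RelationNeedsWAL(rel))
        log_newpage(&rel->rd_node, MAIN_FORKNUM, lastblock, page, true);

    PageSetChecksumInplace(page, lastblock);
    smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, pagebuf, false);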

Thoughts?

[1]: /messages/by-id/CAA4eK1JS+SiRSQBzEFpnsSmxZKingrRH7WNyWULJeEJSj1-=0w@mail.gmail.com
[2]: https://commitfest.postgresql.org/10/741/

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#1)
Re: _hash_alloc_buckets() safety

Amit Kapila <amit.kapila16@gmail.com> writes:

> While working on write-ahead-logging of hash indexes, I noticed that
> this function allocates buckets in batches, and the mechanism it uses
> is to initialize only the last page of the batch with zeros, expecting
> the filesystem to ensure that the intervening pages read as zeroes
> too.

Yes. AFAIK that filesystem behavior is required by POSIX.

> I think to make it WAL-enabled, we need to initialize the page header
> (using PageInit() or equivalent) instead of initializing it with
> zeroes, as some part of our WAL replay machinery expects that the page
> should not be new, as indicated by me in the other thread [1].

I don't really see why that's a problem. The only way one of the fill
pages would get to be not-zero is if there is a WAL action later in the
stream that overwrites it. So how would things become inconsistent?

> Offhand, I don't see any problem with initializing just the last page
> and writing the WAL for it with log_newpage(); however, if we try to
> initialize all pages, there could be some performance penalty on the
> split operation.

"Some" seems like rather an understatement. And it's not just the
added I/O, it's the fact that you'd need to lock each bucket as you
went through them to avoid clobbering concurrently-inserted data.
If you weren't talking about such an enormous penalty, I might be okay
with zeroing the intervening pages explicitly rather than depending on
the filesystem to do it. But since you are, I think you need a clearer
explanation of why this is necessary.

regards, tom lane


#3 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#2)
Re: _hash_alloc_buckets() safety

On Tue, Sep 13, 2016 at 6:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Amit Kapila <amit.kapila16@gmail.com> writes:

>> While working on write-ahead-logging of hash indexes, I noticed that
>> this function allocates buckets in batches, and the mechanism it uses
>> is to initialize only the last page of the batch with zeros, expecting
>> the filesystem to ensure that the intervening pages read as zeroes
>> too.

> Yes. AFAIK that filesystem behavior is required by POSIX.

>> I think to make it WAL-enabled, we need to initialize the page header
>> (using PageInit() or equivalent) instead of initializing it with
>> zeroes, as some part of our WAL replay machinery expects that the page
>> should not be new, as indicated by me in the other thread [1].

> I don't really see why that's a problem. The only way one of the fill
> pages would get to be not-zero is if there is a WAL action later in the
> stream that overwrites it. So how would things become inconsistent?

It could lead to an error/log message indicating "unexpected zero
page". Refer to the code below in XLogReadBufferExtended():

if (mode == RBM_NORMAL)
{
    /* check that page has been initialized */
    Page    page = (Page) BufferGetPage(buffer);
    ..
    if (PageIsNew(page))
    {
        ..
        log_invalid_page(rnode, forknum, blkno, true);
    }
}

Now, during replay of XLOG_FPI (the WAL record written for a new page by
log_newpage), it won't hit that condition, because the caller ensures
that if the block has an image, we use the RBM_ZERO_AND_CLEANUP_LOCK or
RBM_ZERO_AND_LOCK buffer mode. However, a generic reader of blocks
like pgstattuple
(pgstattuple->pgstat_hash_page->_hash_getbuf_with_strategy->_hash_checkpage())
or the new consistency check tool [1] being developed by Kuntal would
lead to an error/log: ".. contains unexpected zero page at block ..". I
think the consistency check tool might use some checks to avoid it,
but I'm not sure.
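
For reference, the check in _hash_checkpage() that produces that
message looks roughly like this (excerpted from hashutil.c):

    if (PageIsNew(page))
        ereport(ERROR,
                (errcode(ERRCODE_INDEX_CORRUPTED),
                 errmsg("index \"%s\" contains unexpected zero page at block %u",
                        RelationGetRelationName(rel),
                        BufferGetBlockNumber(buf)),
                 errhint("Please REINDEX it.")));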

>> Offhand, I don't see any problem with initializing just the last page
>> and writing the WAL for it with log_newpage(); however, if we try to
>> initialize all pages, there could be some performance penalty on the
>> split operation.

"Some" seems like rather an understatement. And it's not just the
added I/O, it's the fact that you'd need to lock each bucket as you
went through them to avoid clobbering concurrently-inserted data.

The question of concurrent data insertion can only come into the picture
when we update the metapage information (maxbucket, lowmask, highmask, ..).
Without that, I don't think concurrent users would ever be aware of
the new buckets. However, as this work is done with the metapage locked,
it is not advisable to do more work here unless it is required.
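
To illustrate the ordering, here is a simplified (not verbatim) outline
of the relevant part of _hash_expandtable(), with the metapage
write-locked throughout; "starting_new_splitpoint" is just a placeholder
for the real splitpoint test:

    new_bucket = metap->hashm_maxbucket + 1;

    /* crossing a splitpoint boundary: pre-allocate the whole batch */
    if (starting_new_splitpoint)
    {
        if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
            goto fail;
    }

    /*
     * Only after the allocation succeeds do we update the fields that
     * make the new bucket visible to other sessions.
     */
    metap->hashm_maxbucket = new_bucket;
    if (new_bucket > metap->hashm_highmask)
    {
        /* starting a new doubling */
        metap->hashm_lowmask = metap->hashm_highmask;
        metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
    }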

> If you weren't talking about such an enormous penalty, I might be okay
> with zeroing the intervening pages explicitly rather than depending on
> the filesystem to do it. But since you are, I think you need a clearer
> explanation of why this is necessary.

I am also not in favor of doing the extra work unless it is required.
The point of this mail is to check if anybody sees that it is
required, because I can't find a reason to initialize all pages.

[1]: /messages/by-id/CAGz5QCLC=0jbmGag-BaRPYAAVu9CN0b-scOEPeCGBe6Rhh8YVg@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
