[PATCH] Assert that the correct locks are held when calling PageGetLSN()

robertmhaas@gmail.com

over 8 years ago

In reply to: Jacob Champion (#1)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Thu, Aug 31, 2017 at 2:53 PM, Jacob Champion <pchampion@pivotal.io> wrote:

While working on checksum support for GPDB, we noticed that several
callers of PageGetLSN() didn't follow the correct locking procedure.
To try to help ferret out present and future mistakes, we added an
assertion to PageGetLSN() that checks whether those locks were being
held correctly. This patch is a first-draft attempt to port that
assertion back up to postgres master, based on work by Asim Praveen,
Ashwin Agrawal, and myself.

The patch is really two pieces: add the assertion, and fix the callers
that would trip it. The latter part is still in progress, because I'm
running into some places where I'm not sure what the correct way
forward is.

(I'm a newbie to this list and this code base, so please don't
hesitate to correct me on anything, whether that's code- or
culture-related!)

It's a good idea to add patches to commitfest.postgresql.org

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

jacob.champion@enterprisedb.com

over 8 years ago

In reply to: Robert Haas (#2)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Fri, Sep 1, 2017 at 9:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:

It's a good idea to add patches to commitfest.postgresql.org

Hi Robert, I added it yesterday as
https://commitfest.postgresql.org/14/1279/ . Let me know if I need to
touch anything up there.

Thanks!
--Jacob

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

robertmhaas@gmail.com

over 8 years ago

In reply to: Jacob Champion (#3)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Fri, Sep 1, 2017 at 12:24 PM, Jacob Champion <pchampion@pivotal.io> wrote:

On Fri, Sep 1, 2017 at 9:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:

It's a good idea to add patches to commitfest.postgresql.org

Hi Robert, I added it yesterday as
https://commitfest.postgresql.org/14/1279/ . Let me know if I need to
touch anything up there.

Nah, looks fine, I just somehow missed it when I looked the first time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

michael@paquier.xyz

over 8 years ago

In reply to: Jacob Champion (#1)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Fri, Sep 1, 2017 at 3:53 AM, Jacob Champion <pchampion@pivotal.io> wrote:

The patch is really two pieces: add the assertion, and fix the callers
that would trip it. The latter part is still in progress, because I'm
running into some places where I'm not sure what the correct way
forward is.

I looked at your patch.

While working on checksum support for GPDB, we noticed that several
callers of PageGetLSN() didn't follow the correct locking procedure.

Well, PageGetLSN can be used in some hot code paths, xloginsert.c
being one, so it does not seem wise to me to switch it to something
more complicated than a macro, and also it is not bounded to any
locking contracts now. For example, pageinspect works on a copy of the
buffer, so that's one case where we just don't care. So we should it
be tightened in Postgres? That's the portion of the proposal I don't
get. You could introduce a custom API for this purpose, isn't the
origin of your problems with GPDB for checksums that data is
replicated across many nodes so things need to be locked uniquely?

- This change introduces a subtle source incompatibility: whereas
previously both Pages and PageHeaders could be passed to PageGetLSN(),
now callers must explicitly cast to Page if they don't already have
one.

That would produce warnings for extensions, so people would likely
catch that when compiling with a newer version, if they use -Werror...

- Is my use of FRONTEND here correct? The documentation made it seem
like some compilers could try to link against the
AssertPageIsLockedForLSN function, even if PageGetLSN was never
called.

+#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND)
+void
+AssertPageIsLockedForLSN(Page page)
+{
From a design point of view, bufmgr.c is an API interface for buffer
management on the backend-size, so just using FRONTEND there looks
wrong to me (bufmgr.h is already bad enough IMO now).

- Use of PageGetLSN() in heapam.c seems to be incorrect, but I don't
really know what the best way is to fix it. It explicitly unlocks the
buffer, with the comment that visibilitymap_set() needs to handle
locking itself, but then calls PageGetLSN() to determine whether
visibilitymap_set() needs to be called.

This bit in src/backend/access/transam/README may interest you:
Note that we must only use PageSetLSN/PageGetLSN() when we know the action
is serialised. Only Startup process may modify data blocks during recovery,
so Startup process may execute PageGetLSN() without fear of serialisation
problems. All other processes must only call PageSet/GetLSN when holding
either an exclusive buffer lock or a shared lock plus buffer header lock,
or be writing the data block directly rather than through shared buffers
while holding AccessExclusiveLock on the relation.

So I see no problem with the exiting caller.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

jacob.champion@enterprisedb.com

over 8 years ago

In reply to: Michael Paquier (#5)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

Hi Michael, thanks for the review!

On Sun, Sep 3, 2017 at 11:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

While working on checksum support for GPDB, we noticed that several
callers of PageGetLSN() didn't follow the correct locking procedure.

Well, PageGetLSN can be used in some hot code paths, xloginsert.c
being one, so it does not seem wise to me to switch it to something
more complicated than a macro,

If raw perf is an issue -- now that I've refactored the assertion into
its own function, I could probably push the implementation back into a
macro. Something like

#define PageGetLSN(page) (AssertPageIsLockedForLSN((Page) page),
PageXLogRecPtrGet(((PageHeader) page)->pd_lsn))

Callers would need to watch out for the double-evaluation, but that's
already documented with this group of macros.

and also it is not bounded to any
locking contracts now. For example, pageinspect works on a copy of the
buffer, so that's one case where we just don't care. So we should it
be tightened in Postgres?

The proposed patch doesn't tighten the contract in this case -- if the
buffer that's passed to PageGetLSN() isn't a shared buffer (i.e. if
it's not part of the BufferBlocks array), the assertion passes
unconditionally.

That's the portion of the proposal I don't
get. You could introduce a custom API for this purpose, isn't the
origin of your problems with GPDB for checksums that data is
replicated across many nodes so things need to be locked uniquely?

GPDB introduces additional problems, but I think this issue is
independent of anything GPDB-specific. One way to look at it: are the
changes in the proposed patch, from PageGetLSN to BufferGetLSNAtomic,
correct? If not, well, that's a good discussion to have. But if they
are correct, then why were they missed? and is there a way to help
future contributors out in the future? That's the goal of this patch;
I don't think it's something that a custom API solves.

- This change introduces a subtle source incompatibility: whereas
previously both Pages and PageHeaders could be passed to PageGetLSN(),
now callers must explicitly cast to Page if they don't already have
one.

That would produce warnings for extensions, so people would likely
catch that when compiling with a newer version, if they use -Werror...

- Is my use of FRONTEND here correct? The documentation made it seem
like some compilers could try to link against the
AssertPageIsLockedForLSN function, even if PageGetLSN was never
called.
+#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND)
+void
+AssertPageIsLockedForLSN(Page page)
+{
From a design point of view, bufmgr.c is an API interface for buffer
management on the backend-size, so just using FRONTEND there looks
wrong to me (bufmgr.h is already bad enough IMO now).

The comments suggested that bufmgr.h could be incidentally pulled into
frontend code through other headers. Or do you mean that I don't need
to check for FRONTEND in the C source file (since, I assume, it's only
ever being compiled for the backend)? I'm not sure what you mean by
"bufmgr.h is already bad enough".

- Use of PageGetLSN() in heapam.c seems to be incorrect, but I don't
really know what the best way is to fix it. It explicitly unlocks the
buffer, with the comment that visibilitymap_set() needs to handle
locking itself, but then calls PageGetLSN() to determine whether
visibilitymap_set() needs to be called.

This bit in src/backend/access/transam/README may interest you:
Note that we must only use PageSetLSN/PageGetLSN() when we know the action
is serialised. Only Startup process may modify data blocks during recovery,
so Startup process may execute PageGetLSN() without fear of serialisation
problems. All other processes must only call PageSet/GetLSN when holding
either an exclusive buffer lock or a shared lock plus buffer header lock,
or be writing the data block directly rather than through shared buffers
while holding AccessExclusiveLock on the relation.

So I see no problem with the exiting caller.

Is heap_xlog_visible only called during startup? If so, is there a
general rule for which locks are required during startup and which
aren't?

Currently there is no exception in the assertion made for startup
processes, so that's something that would need to be changed. Is there
a way to determine whether the current process considers itself to be
a startup process?

Thanks again!
--Jacob

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

michael@paquier.xyz

over 8 years ago

In reply to: Jacob Champion (#6)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Wed, Sep 6, 2017 at 1:15 AM, Jacob Champion <pchampion@pivotal.io> wrote:

On Sun, Sep 3, 2017 at 11:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
+#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND)
+void
+AssertPageIsLockedForLSN(Page page)
+{
From a design point of view, bufmgr.c is an API interface for buffer
management on the backend-size, so just using FRONTEND there looks
wrong to me (bufmgr.h is already bad enough IMO now).
The comments suggested that bufmgr.h could be incidentally pulled into
frontend code through other headers. Or do you mean that I don't need
to check for FRONTEND in the C source file (since, I assume, it's only
ever being compiled for the backend)? I'm not sure what you mean by
"bufmgr.h is already bad enough".

I mean that this should not become worse without a better reason. This
patch makes it so.

This bit in src/backend/access/transam/README may interest you:
Note that we must only use PageSetLSN/PageGetLSN() when we know the action
is serialised. Only Startup process may modify data blocks during recovery,
so Startup process may execute PageGetLSN() without fear of serialisation
problems. All other processes must only call PageSet/GetLSN when holding
either an exclusive buffer lock or a shared lock plus buffer header lock,
or be writing the data block directly rather than through shared buffers
while holding AccessExclusiveLock on the relation.

So I see no problem with the exiting caller.

Is heap_xlog_visible only called during startup?

Recovery is more exact. When replaying a XLOG_HEAP2_VISIBLE record.

If so, is there a
general rule for which locks are required during startup and which
aren't?

You can surely say that a process is fine to read a variable without a
lock if it is the only one updating it, other processes would need a
lock to read a variable. In this case the startup process is the only
one updating blocks, so that's at least one code path where the is no
need to take a lock when looking at a page LSN with PageGetLSN.
Another example is pageinspect which works on a copy of a page and
uses PageGetLSN, so no direct locks on the buffer page is needed.

In short, it seems to me that this patch should be rejected in its
current shape. There is no need to put more constraints on a API which
does not need more constraints and which is used widely by extensions.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

jacob.champion@enterprisedb.com

over 8 years ago

In reply to: Michael Paquier (#7)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Tue, Sep 5, 2017 at 10:49 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Sep 6, 2017 at 1:15 AM, Jacob Champion <pchampion@pivotal.io> wrote:
On Sun, Sep 3, 2017 at 11:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
+#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND)
+void
+AssertPageIsLockedForLSN(Page page)
+{
From a design point of view, bufmgr.c is an API interface for buffer
management on the backend-size, so just using FRONTEND there looks
wrong to me (bufmgr.h is already bad enough IMO now).
The comments suggested that bufmgr.h could be incidentally pulled into
frontend code through other headers. Or do you mean that I don't need
to check for FRONTEND in the C source file (since, I assume, it's only
ever being compiled for the backend)? I'm not sure what you mean by
"bufmgr.h is already bad enough".
I mean that this should not become worse without a better reason. This
patch makes it so.

This bit in src/backend/access/transam/README may interest you:
Note that we must only use PageSetLSN/PageGetLSN() when we know the action
is serialised. Only Startup process may modify data blocks during recovery,
so Startup process may execute PageGetLSN() without fear of serialisation
problems. All other processes must only call PageSet/GetLSN when holding
either an exclusive buffer lock or a shared lock plus buffer header lock,
or be writing the data block directly rather than through shared buffers
while holding AccessExclusiveLock on the relation.

So I see no problem with the exiting caller.

Is heap_xlog_visible only called during startup?

Recovery is more exact. When replaying a XLOG_HEAP2_VISIBLE record.

If so, is there a
general rule for which locks are required during startup and which
aren't?

You can surely say that a process is fine to read a variable without a
lock if it is the only one updating it, other processes would need a
lock to read a variable. In this case the startup process is the only
one updating blocks, so that's at least one code path where the is no
need to take a lock when looking at a page LSN with PageGetLSN.

Is there a way to programmatically determine whether or not we're in
such a situation? That could be added to the assertion.

The code here is pretty inconsistent on whether or not PageGetLSN is
called inside or outside a lock -- see XLogReadBufferForRedoExtended
for instance.

Another example is pageinspect which works on a copy of a page and
uses PageGetLSN, so no direct locks on the buffer page is needed.

And again -- this case should be correctly handled by the new
assertion. Copies of pages are not checked for locks; we can't recover
a buffer header from a page that isn't part of the BufferBlocks array.

In short, it seems to me that this patch should be rejected in its
current shape.

Is the half of the patch that switches PageGetLSN to
BufferGetLSNAtomic correct, at least?

There is no need to put more constraints on a API which
does not need more constraints and which is used widely by extensions.

The goal here is not to add more constraints, but to verify that the
existing constraints are followed. Perhaps this patch doesn't do that
well/correctly, but I want to make sure we're on the same page
regarding the end goal.

--Jacob

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

jacob.champion@enterprisedb.com

over 8 years ago

In reply to: Jacob Champion (#8)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Wed, Sep 6, 2017 at 8:37 AM, Jacob Champion <pchampion@pivotal.io> wrote:

On Tue, Sep 5, 2017 at 10:49 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

In short, it seems to me that this patch should be rejected in its
current shape.

Is the half of the patch that switches PageGetLSN to
BufferGetLSNAtomic correct, at least?

Any further thoughts on this? If the BufferGetLSNAtomic fixes made
here are not correct to begin with, then the rest of the patch is
probably moot; I just want to double-check that that is the case.

--Jacob

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Daniel Gustafsson

daniel@yesql.se

over 8 years ago

In reply to: Jacob Champion (#9)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On 20 Sep 2017, at 00:29, Jacob Champion <pchampion@pivotal.io> wrote:

On Wed, Sep 6, 2017 at 8:37 AM, Jacob Champion <pchampion@pivotal.io> wrote:

On Tue, Sep 5, 2017 at 10:49 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

In short, it seems to me that this patch should be rejected in its
current shape.

Is the half of the patch that switches PageGetLSN to
BufferGetLSNAtomic correct, at least?

Any further thoughts on this? If the BufferGetLSNAtomic fixes made
here are not correct to begin with, then the rest of the patch is
probably moot; I just want to double-check that that is the case.

Based on the discussions in this thread, I’m marking this patch Returned with
feedback. Please re-submit a new version in an upcoming commitfest when ready.

cheers ./daniel

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

robertmhaas@gmail.com

over 8 years ago

In reply to: Michael Paquier (#5)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Mon, Sep 4, 2017 at 2:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
A>> that would trip it. The latter part is still in progress, because I'm

Well, PageGetLSN can be used in some hot code paths, xloginsert.c
being one, so it does not seem wise to me to switch it to something
more complicated than a macro, and also it is not bounded to any
locking contracts now.

I don't see how turning it into a static inline function is worse.
We've been doing a fair amount of that lately and it generally makes
things better, not worse, often avoiding multiple evaluation hazards
at the same time it generates better machine code.

I find this patch sort of messy; in particular, the definition of
AssertPageIsLockedForLSN tries to reverse-engineer a buffer ID from a
Page, and that's sort of ugly. But I think the concept of trying to
make sure that our code is adhering to necessary locking rules is a
pretty good one.

I think the first question we ought to be asking ourselves is whether
any of the PageGetLSN -> BufferGetLSNAtomic changes the patch
introduces are live bugs. If they are, then we ought to fix those
separately (and probably back-patch). If they are not, then we need
to think about how to adjust the patch so that it doesn't complain
about things that are in fact OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

michael@paquier.xyz

over 8 years ago

In reply to: Robert Haas (#11)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Mon, Oct 2, 2017 at 9:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think the first question we ought to be asking ourselves is whether
any of the PageGetLSN -> BufferGetLSNAtomic changes the patch
introduces are live bugs. If they are, then we ought to fix those
separately (and probably back-patch). If they are not, then we need
to think about how to adjust the patch so that it doesn't complain
about things that are in fact OK.

If you look at each item one by one that the patch touches based on
the contract defined in transam/README...

--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -640,7 +640,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size
freespace, GISTSTATE *giststate)
        }

        stack->page = (Page) BufferGetPage(stack->buffer);
-       stack->lsn = PageGetLSN(stack->page);
+       stack->lsn = BufferGetLSNAtomic(stack->buffer);
There is an incorrect use of PageGetLSN here. A shared lock can be
taken on the page but there is no buffer header lock used when using
PageGetLSN.

@@ -890,7 +890,7 @@ gistFindPath(Relation r, BlockNumber child,
OffsetNumber *downlinkoffnum)
break;
}

-       top->lsn = PageGetLSN(page);
+       top->lsn = BufferGetLSNAtomic(buffer);
Here as well only a shared lock is taken on the page.

@@ -61,7 +61,7 @@ gistkillitems(IndexScanDesc scan)
     * read. killedItems could be not valid so LP_DEAD hints applying is not
     * safe.
     */
-   if (PageGetLSN(page) != so->curPageLSN)
+   if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
    {
        UnlockReleaseBuffer(buffer);
        so->numKilled = 0;      /* reset counter */
@@ -384,7 +384,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem
*pageItem, double *myDistances,
     * safe to apply LP_DEAD hints to the page later. This allows us to drop
     * the pin for MVCC scans, which allows vacuum to avoid blocking.
     */
-   so->curPageLSN = PageGetLSN(page);
+   so->curPageLSN = BufferGetLSNAtomic(buffer);
Those two as well only use a shared lock.

@@ -249,7 +249,7 @@ gistbulkdelete(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats,

                ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
                ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-               ptr->parentlsn = PageGetLSN(page);
+               ptr->parentlsn = BufferGetLSNAtomic(buffer);
                ptr->next = stack->next;
                stack->next = ptr;
Here also a shared lock is only taken, that's a VACUUM code path for
Gist indexes.

+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1224,7 +1224,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection
dir, OffsetNumber offnum)
     * safe to apply LP_DEAD hints to the page later.  This allows us to drop
     * the pin for MVCC scans, which allows vacuum to avoid blocking.
     */
-   so->currPos.lsn = PageGetLSN(page);
+   so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
Here the caller of _bt_readpage should have taken a lock, but the page
is not necessarily pinned. Still, _bt_getbuf makes sure that the pin
is done.

+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1772,7 +1772,7 @@ _bt_killitems(IndexScanDesc scan)
            return;

        page = BufferGetPage(buf);
-       if (PageGetLSN(page) == so->currPos.lsn)
+       if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
            so->currPos.buf = buf;
Same here thanks to _bt_getbuf.

So those bits could be considered for integration.

Also, if I read the gist code correctly, there is one other case in
gistFindCorrectParent. And in btree, there is one occurence in
_bt_split. In XLogRecordAssemble, there could be more checks regarding
the locks taken on the buffer registered.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

robertmhaas@gmail.com

over 8 years ago

In reply to: Michael Paquier (#12)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

So those bits could be considered for integration.

Yes, and they also tend to suggest that the rest of the patch has value.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

michael@paquier.xyz

over 8 years ago

In reply to: Robert Haas (#13)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Mon, Oct 2, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

So those bits could be considered for integration.

Yes, and they also tend to suggest that the rest of the patch has value.

Well, there are cases where you don't need any locking checks, and the
proposed patch ignores that. Take for example pageinspect which works
on a copy of a page, or just WAL replay which serializes everything,
and in both cases PageGetLSN can be used safely. So for compatibility
reasons I think that PageGetLSN should be kept alone to not surprise
anybody using it, or at least an equivalent should be introduced. It
would be interesting to make BufferGetLSNAtomic hold tighter checks,
like something that makes use of LWLockHeldByMe for example.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Alvaro Herrera

alvherre@2ndquadrant.com

over 8 years ago

In reply to: Michael Paquier (#14)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

Michael Paquier wrote:

On Mon, Oct 2, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

So those bits could be considered for integration.

Yes, and they also tend to suggest that the rest of the patch has value.

Well, there are cases where you don't need any locking checks, and the
proposed patch ignores that. Take for example pageinspect which works
on a copy of a page, or just WAL replay which serializes everything,
and in both cases PageGetLSN can be used safely. So for compatibility
reasons I think that PageGetLSN should be kept alone to not surprise
anybody using it, or at least an equivalent should be introduced. It
would be interesting to make BufferGetLSNAtomic hold tighter checks,
like something that makes use of LWLockHeldByMe for example.

I'd argue about this in the same direction I argued about
BufferGetPage() needing an LSN check that's applied separately: if it's
too easy for a developer to do the wrong thing (i.e. fail to apply the
separate LSN check after reading the page) then the routine should be
patched to do the check itself, and another routine should be offered
for the cases that explicitly can do without the check.

I was eventually outvoted in the BufferGetPage() saga, though, so I'm
not sure that that discussion sets precedent.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

robertmhaas@gmail.com

over 8 years ago

In reply to: Michael Paquier (#14)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Mon, Oct 2, 2017 at 11:05 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Well, there are cases where you don't need any locking checks, and the
proposed patch ignores that.

I understand that, but shouldn't we then look for a way to adjust the
patch so that it doesn't have that issue any longer, rather than just
kicking it to the curb? I mean, just saying "patch suxxor, next"
doesn't seem like the right approach to something that has apparently
already found real problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

michael@paquier.xyz

over 8 years ago

In reply to: Alvaro Herrera (#15)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

On Tue, Oct 3, 2017 at 12:54 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Michael Paquier wrote:

On Mon, Oct 2, 2017 at 11:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

So those bits could be considered for integration.

Yes, and they also tend to suggest that the rest of the patch has value.

Well, there are cases where you don't need any locking checks, and the
proposed patch ignores that. Take for example pageinspect which works
on a copy of a page, or just WAL replay which serializes everything,
and in both cases PageGetLSN can be used safely. So for compatibility
reasons I think that PageGetLSN should be kept alone to not surprise
anybody using it, or at least an equivalent should be introduced. It
would be interesting to make BufferGetLSNAtomic hold tighter checks,
like something that makes use of LWLockHeldByMe for example.

Jacob, here are some ideas to make this thread move on. I would
suggest to produce a set of patches that do things incrementally:
1) One patch that changes the calls of PageGetLSN to
BufferGetLSNAtomic which are now not appropriate. You have spotted
some of them in the btree and gist code, but not all based on my first
lookup. There is still one in gistFindCorrectParent and one in btree
_bt_split. The monitoring of the other calls (sequence.c and
vacuumlazy.c) looked safe. There is another one in XLogRecordAssemble
that should be fixed I think.
2) A second patch that strengthens a bit checks around
BufferGetLSNAtomic. One idea would be to use LWLockHeldByMe, as you
are originally suggesting.
A comment could be as well added in bufpage.h for PageGetLSN to let
users know that it should be used carefully, in the vein of what is
mentioned in src/backend/access/transam/README.

I'd argue about this in the same direction I argued about
BufferGetPage() needing an LSN check that's applied separately: if it's
too easy for a developer to do the wrong thing (i.e. fail to apply the
separate LSN check after reading the page) then the routine should be
patched to do the check itself, and another routine should be offered
for the cases that explicitly can do without the check.

I was eventually outvoted in the BufferGetPage() saga, though, so I'm
not sure that that discussion sets precedent.

Oh... I don't recall this discussion. A quick lookup at the archives
does not show me a specific thread either.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Alvaro Herrera

alvherre@2ndquadrant.com

over 8 years ago

In reply to: Michael Paquier (#17)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

Michael Paquier wrote:

On Tue, Oct 3, 2017 at 12:54 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I'd argue about this in the same direction I argued about
BufferGetPage() needing an LSN check that's applied separately: if it's
too easy for a developer to do the wrong thing (i.e. fail to apply the
separate LSN check after reading the page) then the routine should be
patched to do the check itself, and another routine should be offered
for the cases that explicitly can do without the check.

I was eventually outvoted in the BufferGetPage() saga, though, so I'm
not sure that that discussion sets precedent.

Oh... I don't recall this discussion. A quick lookup at the archives
does not show me a specific thread either.

I mean Kevin patch for snapshot-too-old:
/messages/by-id/CACjxUsPPCbov6DDPnuGpR=FmxHsjSn_MRi3rJYgvbRMCRfFz+A@mail.gmail.com

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Alvaro Herrera

alvherre@2ndquadrant.com

over 8 years ago

In reply to: Michael Paquier (#17)

Re: [PATCH] Assert that the correct locks are held when calling PageGetLSN()

Michael Paquier wrote:

On Tue, Oct 3, 2017 at 12:54 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I'd argue about this in the same direction I argued about
BufferGetPage() needing an LSN check that's applied separately: if it's
too easy for a developer to do the wrong thing (i.e. fail to apply the
separate LSN check after reading the page) then the routine should be
patched to do the check itself, and another routine should be offered
for the cases that explicitly can do without the check.

I was eventually outvoted in the BufferGetPage() saga, though, so I'm
not sure that that discussion sets precedent.

Oh... I don't recall this discussion. A quick lookup at the archives
does not show me a specific thread either.

Just search for "ï¿½sop" in the archives:
/messages/by-id/CACjxUsPPCbov6DDPnuGpR=FmxHsjSn_MRi3rJYgvbRMCRfFz+A@mail.gmail.com

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20