missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?
Hi,
while looking at trigger.c from the POV of foreign key locks I noticed
that GetTupleForTrigger assumes, without any comment, that it can just
look at a page's contents without a
LockBuffer(buffer, BUFFER_LOCK_SHARE);
That code path is only reached for AFTER ... FOR EACH ROW ... triggers,
so it's fine that it's not locking the tuple itself; that already needs
to have happened before.
The code in question:
    if (newslot_which_is_passed_by_before_triggers)
        ...
    else
    {
        Page        page;
        ItemId      lp;

        buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));

        page = BufferGetPage(buffer);
        lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));

        Assert(ItemIdIsNormal(lp));

        tuple.t_data = (HeapTupleHeader) PageGetItem(page, lp);
        tuple.t_len = ItemIdGetLength(lp);
        tuple.t_self = *tid;
        tuple.t_tableOid = RelationGetRelid(relation);
    }

    result = heap_copytuple(&tuple);

    ReleaseBuffer(buffer);
As can be seen in the excerpt above, this is basically a very
stripped-down heap_fetch(), but without any locking on the buffer we
just read.
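
For comparison, heap_fetch() brackets the same page accesses with a
share lock, roughly like this (abbreviated from memory, error handling
elided):

    buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
    LockBuffer(buffer, BUFFER_LOCK_SHARE);

    page = BufferGetPage(buffer);
    lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));

    tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
    tuple->t_len = ItemIdGetLength(lp);

    /* check visibility while still holding the content lock */
    valid = HeapTupleSatisfiesVisibility(tuple, snapshot, buffer);

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);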
I can't believe this is safe? Just about anything but eviction could
happen to that buffer at that moment?
Am I missing something?
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 30, 2012 at 3:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> while looking at trigger.c from the POV of foreign key locks I noticed
> that GetTupleForTrigger assumes, without any comment, that it can just
> look at a page's contents without a
>
> LockBuffer(buffer, BUFFER_LOCK_SHARE);
> [...]
> I can't believe this is safe? Just about anything but eviction could
> happen to that buffer at that moment?
>
> Am I missing something?
That seems to be safe to me. Anything that's been read above can't really
change. The tuple is already locked, so a concurrent update/delete has to
wait on us. We have a pin on the buffer, so VACUUM or HOT-prune can't
happen either. I can't see any other operation that can really change
those fields.
heap_fetch(), though it looks very similar, has an important difference.
It might be reading the tuple without a lock on it, and we need the
buffer lock to protect against concurrent update/delete to the tuple.
AFAIK once the tuple has passed through qualification checks etc., even
callers of heap_fetch() can read the tuple data without any lock on the
buffer itself.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 2012-11-30 08:52:18 +0530, Pavan Deolasee wrote:
> On Fri, Nov 30, 2012 at 3:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > while looking at trigger.c from the POV of foreign key locks I noticed
> > that GetTupleForTrigger assumes, without any comment, that it can just
> > look at a page's contents without a
> >
> > LockBuffer(buffer, BUFFER_LOCK_SHARE);
> > [...]
> > I can't believe this is safe? Just about anything but eviction could
> > happen to that buffer at that moment? Am I missing something?
> That seems to be safe to me. Anything that's been read above can't really
> change. The tuple is already locked, so a concurrent update/delete has to
> wait on us. We have a pin on the buffer, so VACUUM or HOT-prune can't
> happen either. I can't see any other operation that can really change
> those fields.
We only get the pin right there, I don't see any preexisting pin. Which
means we might have just opened a page that's in the process of being
pruned/vacuumed by another backend.

I think a concurrent heap_(insert|update)/PageAddItem might actually be
already dangerous because it might move line pointers around.
> heap_fetch(), though it looks very similar, has an important difference.
> It might be reading the tuple without a lock on it, and we need the
> buffer lock to protect against concurrent update/delete to the tuple.
> AFAIK once the tuple has passed through qualification checks etc., even
> callers of heap_fetch() can read the tuple data without any lock on the
> buffer itself.
Sure, but the page header isn't accessed there, just the tuple data
itself which always stays at the same place on the page.
(A normal heap_fetch shouldn't be worried about updates/deletions to its
tuple, MVCC to the rescue.)
Even if it turns out to be safe, I think this deserves at least a
comment...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
> On 2012-11-30 08:52:18 +0530, Pavan Deolasee wrote:
> > On Fri, Nov 30, 2012 at 3:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > I can't believe this is safe? Just about anything but eviction could
> > > happen to that buffer at that moment?
Yeah, it does seem fishy to me as well.
> Even if it turns out to be safe, I think this deserves at least a
> comment...
No doubt that trigger.c is hugely underdocumented in some places.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 30, 2012 at 5:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > That seems to be safe to me. Anything that's been read above can't
> > really change. The tuple is already locked, so a concurrent
> > update/delete has to wait on us. We have a pin on the buffer, so
> > VACUUM or HOT-prune can't happen either. I can't see any other
> > operation that can really change those fields.
>
> We only get the pin right there, I don't see any preexisting pin. Which
> means we might have just opened a page that's in the process of being
> pruned/vacuumed by another backend.
Hmm. Yeah, you're right. That is a possible risky scenario. Even though
the cleanup lock waits for all pins to go away, that only works if every
reader takes at least a SHARE lock, unless it was continuously holding a
pin on the buffer (in which case it's OK to drop the lock and read the
tuple body without reacquiring it). Otherwise, as you rightly pointed
out, we could actually be reading a page that is being actively cleaned
up, with tuples being moved around.
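
(For reference, this is roughly how the pruning side obtains its cleanup
lock, a sketch in the spirit of heap_page_prune_opt(), with the actual
pruning call elided:)

    if (ConditionalLockBufferForCleanup(buffer))
    {
        /*
         * Exclusive content lock acquired, and we are guaranteed to hold
         * the only pin on the page.  Only under these conditions may the
         * page be defragmented and tuple bodies physically moved.
         */
        ... prune and repair fragmentation ...

        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    }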
> I think a concurrent heap_(insert|update)/PageAddItem might actually be
> already dangerous because it might move line pointers around.
I don't think we ever move the line pointers around, because the indexes
will be pointing to them. But vacuum/prune is dangerous enough to require
a SHARE lock here in any case.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 30 November 2012 11:58, Andres Freund <andres@2ndquadrant.com> wrote:
> We only get the pin right there, I don't see any preexisting pin.
Seems easy enough to test with an Assert patch.
If the Assert doesn't fail, we apply it as "documentation" of the
requirement for a pin.
If it fails, we fix the bug.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-30 18:09:49 +0530, Pavan Deolasee wrote:
> On Fri, Nov 30, 2012 at 5:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > We only get the pin right there, I don't see any preexisting pin. Which
> > means we might have just opened a page that's in the process of being
> > pruned/vacuumed by another backend.
>
> Hmm. Yeah, you're right. That is a possible risky scenario. Even though
> the cleanup lock waits for all pins to go away, that only works if every
> reader takes at least a SHARE lock, unless it was continuously holding a
> pin on the buffer (in which case it's OK to drop the lock and read the
> tuple body without reacquiring it). Otherwise, as you rightly pointed
> out, we could actually be reading a page that is being actively cleaned
> up, with tuples being moved around.
Well, live (or recently dead) tuples can't just be moved around unless
I'm missing something. That would cause problems with HeapTuples
directly pointing into the page, as you've noticed.
> > I think a concurrent heap_(insert|update)/PageAddItem might actually be
> > already dangerous because it might move line pointers around.
>
> I don't think we ever move the line pointers around, because the indexes
> will be pointing to them.
The indexes point to a tid, not to a line pointer. So reshuffling the
line pointer array - while keeping the tids valid - is entirely
fine. Right?
Also, PageAddItem does that all the time, so I think we would have
noticed if not ;)
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-30 12:50:06 +0000, Simon Riggs wrote:
> On 30 November 2012 11:58, Andres Freund <andres@2ndquadrant.com> wrote:
> > We only get the pin right there, I don't see any preexisting pin.
>
> Seems easy enough to test with an Assert patch.
>
> If the Assert doesn't fail, we apply it as "documentation" of the
> requirement for a pin. If it fails, we fix the bug.
I think it's wrong even if we were holding a pin all the time, due to
the aforementioned PageAddItem reshuffling of line pointers. So that
Assert wouldn't be proof enough.

I can try to prove corruption there, but I would rather see somebody
come along and tell me why it's safe and that I am dumb for not
realizing it.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 30, 2012 at 6:21 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-11-30 18:09:49 +0530, Pavan Deolasee wrote:
> > Hmm. Yeah, you're right. That is a possible risky scenario. Even though
> > the cleanup lock waits for all pins to go away, that only works if
> > every reader takes at least a SHARE lock, unless it was continuously
> > holding a pin on the buffer (in which case it's OK to drop the lock
> > and read the tuple body without reacquiring it). Otherwise, as you
> > rightly pointed out, we could actually be reading a page that is being
> > actively cleaned up, with tuples being moved around.
>
> Well, live (or recently dead) tuples can't just be moved around unless
> I'm missing something. That would cause problems with HeapTuples
> directly pointing into the page, as you've noticed.
I think we are confusing terminology. The tuple headers and bodies
(live or dead) can be moved around freely as long as you have a cleanup
lock on the page. That's how HOT-prune frees up fragmented space.
> > I don't think we ever move the line pointers around, because the
> > indexes will be pointing to them.
>
> The indexes point to a tid, not to a line pointer. So reshuffling the
> line pointer array - while keeping the tids valid - is entirely
> fine. Right?
I think it's again terminology confusion :-) I thought TID and Line
Pointers were the same and used interchangeably. Or at least I've done
so for long. But I think you are reading line pointer as the thing
stored in the ItemId structure and not the ItemId itself.
Also, PageAddItem() doesn't move things around normally. It does that
only during recovery, at least for heap pages. Elsewhere it will either
reuse an UNUSED pointer or add at the end. Isn't that how it is?
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Fri, Nov 30, 2012 at 5:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > heap_fetch(), though it looks very similar, has an important
> > difference. It might be reading the tuple without a lock on it, and
> > we need the buffer lock to protect against concurrent update/delete
> > to the tuple. AFAIK once the tuple has passed through qualification
> > checks etc., even callers of heap_fetch() can read the tuple data
> > without any lock on the buffer itself.
>
> Sure, but the page header isn't accessed there, just the tuple data
> itself which always stays at the same place on the page.
>
> (A normal heap_fetch shouldn't be worried about updates/deletions to its
> tuple, MVCC to the rescue.)
heap_fetch() reads the tuple header, though, to apply snapshot visibility
rules. So a SHARE lock is a must for that purpose, because a concurrent
update/delete may set XMAX or other visibility-related fields. On the
other hand, you can read the tuple body without a page lock if you have
been holding a pin on the buffer continuously since you last dropped the
lock.

In any case, a lock is needed in GetTupleForTrigger unless someone can
argue that a pin has been held continuously on the buffer since the lock
was last dropped, something I don't see happening in this case.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 2012-11-30 18:35:05 +0530, Pavan Deolasee wrote:
> On Fri, Nov 30, 2012 at 6:21 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Well, live (or recently dead) tuples can't just be moved around unless
> > I'm missing something. That would cause problems with HeapTuples
> > directly pointing into the page, as you've noticed.
>
> I think we are confusing terminology. The tuple headers and bodies
> (live or dead) can be moved around freely as long as you have a cleanup
> lock on the page. That's how HOT-prune frees up fragmented space.
Need to read more code here. This is nothing I have really looked into
before. You're probably right.

Up to now I thought HOT pruning only removed intermediate & surely dead
tuples but didn't move stuff around, and that page fragmentation was
repaired separately. But it looks like I am wrong.
> > The indexes point to a tid, not to a line pointer. So reshuffling the
> > line pointer array - while keeping the tids valid - is entirely
> > fine. Right?
>
> I think it's again terminology confusion :-) I thought TID and Line
> Pointers were the same and used interchangeably. Or at least I've done
> so for long. But I think you are reading line pointer as the thing
> stored in the ItemId structure and not the ItemId itself.
I always understood ItemIds to be line pointers and ItemPointers to be
tids. Line pointers are only meaningful within a single page, store the
actual offset within the page and so on. ItemPointers (tids) are longer
lasting and store (page, offset_number)…
> Also, PageAddItem() doesn't move things around normally. It does that
> only during recovery, at least for heap pages. Elsewhere it will either
> reuse an UNUSED pointer or add at the end. Isn't that how it is?
Hm? It doesn't move the page contents around but it moves the ItemId
array during completely normal operation (c.f. needshuffle in
PageAddItem). Or am I missing something?
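
The relevant part of PageAddItem() in bufpage.c, abbreviated:

    /*
     * If the requested slot is before the end of the existing line
     * pointer array, the entries after it have to be shifted up.
     */
    if (offsetNumber < limit)
        needshuffle = true;     /* need to move existing linp's */
    ...
    if (needshuffle)
        memmove(itemId + 1, itemId,
                (limit - offsetNumber) * sizeof(ItemIdData));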
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 30, 2012 at 6:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hm? It doesn't move the page contents around but it moves the ItemId
> array during completely normal operation (c.f. needshuffle in
> PageAddItem). Or am I missing something?
I think that's probably only used for non-heap pages. For heap pages, it
just doesn't make sense to shuffle the ItemId array; that would defeat
the entire purpose of having them in the first place.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 2012-11-30 13:57:46 +0100, Andres Freund wrote:
> > Seems easy enough to test with an Assert patch.
> >
> > If the Assert doesn't fail, we apply it as "documentation" of the
> > requirement for a pin. If it fails, we fix the bug.
>
> I think it's wrong even if we were holding a pin all the time, due to
> the aforementioned PageAddItem reshuffling of line pointers. So that
> Assert wouldn't be proof enough.
But a failing Assert obviously proves something. Stupid me. So here we
go:
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 98b8207..3b61d06 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2662,6 +2662,16 @@ ltrmark:;
 
 		buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 
+#ifdef USE_ASSERT_CHECKING
+		if (!BufferIsLocal(buffer))
+		{
+			/* Only pinned by the above ReadBuffer */
+			if (PrivateRefCount[buffer - 1] <= 1)
+				elog(ERROR, "too low local pin count: %d",
+					 PrivateRefCount[buffer - 1]);
+		}
+#endif
+
 		page = BufferGetPage(buffer);
 		lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
CREATE OR REPLACE FUNCTION raise_notice_id()
RETURNS trigger
LANGUAGE plpgsql AS $body$
BEGIN
    RAISE NOTICE 'id: %', OLD.id;
    RETURN NULL;
END
$body$;

CREATE TABLE crashdummy(id serial primary key, data int);

CREATE TRIGGER crashdummy_after_delete
    AFTER DELETE ON crashdummy
    FOR EACH ROW EXECUTE PROCEDURE raise_notice_id();

INSERT INTO crashdummy(data) SELECT * FROM generate_series(1, 1000);

DELETE FROM crashdummy WHERE ctid IN (SELECT ctid FROM crashdummy WHERE data < 1000);
ERROR:  too low local pin count: 1
Time: 4.515 ms
A plain DELETE without the subselect doesn't trigger the problem though;
that's probably the reason we haven't seen any problems so far.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> But a failing Assert obviously proves something.
It doesn't prove that the code is unsafe though.
I am suspicious that it is safe, because it's pretty darn hard to
believe we'd not have seen field reports of problems with triggers
otherwise. It's not like this is a seldom-executed code path.
But I concede that I don't see *why* it's safe --- it looks like
it ought to be possible for a VACUUM to move the tuple while the
trigger code is fetching it. You'd need the timing to be just
so...
regards, tom lane
On 2012-11-30 10:42:21 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > But a failing Assert obviously proves something.
>
> It doesn't prove that the code is unsafe though.
Hehe.
> I am suspicious that it is safe, because it's pretty darn hard to
> believe we'd not have seen field reports of problems with triggers
> otherwise. It's not like this is a seldom-executed code path.
That's true, but I think due to the fact that plain DELETEs do *not*
trigger the Assert we might just not have noticed it. A make check with
the error checking in place doesn't error out, in fact.

Also, I think the wrong data caused by it isn't that likely to be
noticed. It's just the OLD tuple in triggers, so it's not going to be
looked at all that closely, and we might return data that looks somewhat
valid.
I think most executor paths actually hold a separate pin "by accident"
while this is executing. I think the reason my example hits the case is
the ReleaseBuffer() after ExecStoreTuple() in TidNext(). Seqscans hold
another pin via scan->rs_cbuf, indexscans one via ->xs_cbuf. Many of the
other nodes that are likely to sit below ExecUpdate/Delete probably work
similarly.
I think most cases with high throughput (which you probably need to
actually hit the relatively small window) won't use plans that are
complex enough to trigger the bug.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2012-11-30 10:42:21 -0500, Tom Lane wrote:
> > I am suspicious that it is safe, because it's pretty darn hard to
> > believe we'd not have seen field reports of problems with triggers
> > otherwise. It's not like this is a seldom-executed code path.
>
> That's true, but I think due to the fact that plain DELETEs do *not*
> trigger the Assert we might just not have noticed it.
OK, I've managed to reproduce an actual failure, ie VACUUM moving the
tuple underneath the fetch. It's not easy to do though, and the timing
has to be *really* tight.
The reason that simple update/delete queries don't show the issue is
that the tuple being fetched is the "old" one that was just
updated/deleted, and in a simple query that one was just returned by a
relation-scan plan node that's underneath the ModifyTable node, and
*that scan node still has a pin* on the table page; or more accurately,
the TupleTableSlot it has returned the scan tuple in holds the pin. This pin's
been held since we did a LockBuffer to examine the tuple's liveness, so
it's sufficient to prevent anyone from getting a cleanup lock. However,
in a more complex plan such as a join, the ModifyTable node could be
receiving tuples that aren't the current tuple of a scan anymore.
(Your example case passes the tuples through a Sort, so none of them
are pinned by the time the ModifyTable node gets them.)
Another thing that reduces the odds of seeing this, in recent versions,
is that when we first scanned the page containing the "old" tuple,
we'll have (tried to) do a page prune. So even if a VACUUM comes along
now, the odds are that it will not find any newly-reclaimable space.
To reproduce the problem, I had to arrange for another tuple on the
same page to become reclaimable between the relation scan and the
GetTupleForTrigger fetch (which I did by having another, serializable
transaction scan the table, then deleting the other tuple, then allowing
the serializable transaction to commit after the page prune step).
Lastly, the only way you see a problem is if VACUUM acquires cleanup
lock before GetTupleForTrigger pins the page, finds some reclaimable
space, and moves the target tuple on the page in order to compact the
free space, and that data movement happens in the narrow time window
between where GetTupleForTrigger does PageGetItem() and where it
finishes the heap_copytuple() call just below that. Even then, you
might escape seeing a problem, unless the data movement also shifts
some *other* tuple into the space formerly occupied by the target.
So that's why nobody's reported the problem --- it's not happening
often enough for anyone to see it as a repeatable issue.
I'll go insert a LockBuffer call. Good catch!
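
(Concretely, the idea is just to hold share lock across the page
accesses, roughly:)

    buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));

    /*
     * Lock the buffer so nobody can get a cleanup lock and move the
     * tuple while we copy it; the pin alone is not enough, since we
     * only just acquired it.
     */
    LockBuffer(buffer, BUFFER_LOCK_SHARE);

    page = BufferGetPage(buffer);
    lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));

    Assert(ItemIdIsNormal(lp));

    tuple.t_data = (HeapTupleHeader) PageGetItem(page, lp);
    tuple.t_len = ItemIdGetLength(lp);
    tuple.t_self = *tid;
    tuple.t_tableOid = RelationGetRelid(relation);

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);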
regards, tom lane
On 2012-11-30 12:53:27 -0500, Tom Lane wrote:
> OK, I've managed to reproduce an actual failure, ie VACUUM moving the
> tuple underneath the fetch. It's not easy to do though, and the timing
> has to be *really* tight.
> [...]
> So that's why nobody's reported the problem --- it's not happening
> often enough for anyone to see it as a repeatable issue.
Yea. I have been playing with trying to reproduce the issue as well and
I needed 3 sessions and two gdb's to cause any problem... And even then
it took me several tries to get all conditions right.
We call heap_prune* surprisingly often...
> I'll go insert a LockBuffer call. Good catch!
Thanks.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
BTW, I went looking for other places that might have a similar mistake.
I found that contrib/pageinspect/btreefuncs.c pokes around in btree
pages without any buffer lock, which seems pretty broken --- don't we
want it to take share lock?
regards, tom lane
On 2012-11-30 14:08:05 -0500, Tom Lane wrote:
> BTW, I went looking for other places that might have a similar mistake.
> I found that contrib/pageinspect/btreefuncs.c pokes around in btree
> pages without any buffer lock, which seems pretty broken --- don't we
> want it to take share lock?
I seem to remember comments somewhere indicating that pageinspect (?)
intentionally doesn't take locks, to make debugging of locking problems
easier. Not sure whether that's really realistic, but ...
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2012-11-30 14:08:05 -0500, Tom Lane wrote:
> > BTW, I went looking for other places that might have a similar mistake.
> > I found that contrib/pageinspect/btreefuncs.c pokes around in btree
> > pages without any buffer lock, which seems pretty broken --- don't we
> > want it to take share lock?
>
> I seem to remember comments somewhere indicating that pageinspect (?)
> intentionally doesn't take locks, to make debugging of locking problems
> easier. Not sure whether that's really realistic, but ...
Dunno, that seems like a pretty lame argument when compared to the very
real possibility of crashing due to processing corrupt data. Besides,
the heap-page examination functions in the same module take buffer lock,
so why shouldn't these?
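
For comparison, get_raw_page() in the same module just holds the lock
long enough to copy the page out, roughly (raw_page_data points at a
palloc'd BLCKSZ buffer):

    buf = ReadBuffer(rel, blkno);
    LockBuffer(buf, BUFFER_LOCK_SHARE);

    /* copy the page under share lock, then examine the copy at leisure */
    memcpy(raw_page_data, BufferGetPage(buf), BLCKSZ);

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);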
regards, tom lane
BTW, on further inspection, there's yet another reason why we've not
heard about this from the field: even if all the wrong things happen and
GetTupleForTrigger manages to copy bogus data for the old tuple, that
data *is not passed to the trigger function*. The only thing we do with
it is decide whether to queue an event for the trigger. So if you've
got a WHEN condition for the trigger, that expression would see the bad
data, or if it's an RI trigger the bad data is passed to
RI_FKey_pk_upd_check_required or RI_FKey_fk_upd_check_required. But
unless the data is corrupt enough to cause a crash, the worst outcome
would be a wrong decision about whether the trigger needs to be run.
It's possible this explains some of the reports we've seen about RI
constraints being violated.
regards, tom lane