Small locking bugs in hs

Started by Andres Freundabout 16 years ago11 messages

Andres Freund

andres@anarazel.de

about 16 years ago

1 attachment(s)

Hi,

While unlikely to cause issues two new LWLockAcquire calls use the wrong
locking mode.
Patch attached.

Andres

Attachments:

0001-Use-correct-locking-at-two-functions-used-by-Hot-Sta.patchtext/x-patch; charset=UTF-8; name=0001-Use-correct-locking-at-two-functions-used-by-Hot-Sta.patchDownload

From e70524baa7399972e51345d81b50377d7f15196d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 27 Dec 2009 17:28:39 +0100
Subject: [PATCH 1/2] Use correct locking at two functions used by Hot Standby - while not
 likely to cause issues in reality its better to be correct.

---
 src/backend/storage/ipc/procarray.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 85f14f6..8e2de35 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1648,7 +1648,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid,
 					 errmsg("out of memory")));
 	}
 
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/*
 	 * If we don't know the TransactionId that created the conflict, set
@@ -1710,7 +1710,7 @@ CancelVirtualTransaction(VirtualTransactionId vxid, int cancel_mode)
 	int			index;
 	pid_t		pid = 0;
 
-	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-- 
1.6.5.12.gd65df24

Simon Riggs

simon@2ndQuadrant.com

about 16 years ago

In reply to: Andres Freund (#1)

Re: Small locking bugs in hs

On Sun, 2009-12-27 at 20:12 +0100, Andres Freund wrote:

While unlikely to cause issues two new LWLockAcquire calls use the wrong
locking mode.
Patch attached.

It's important to explain why you think something is a bug, rather than
make that claim on its own.

The locking mode was intentional, though not documented either way.

However, I think I see a different issue with conflict handling, so
looking at this again was worthwhile, thanks. Will come back with more
over next few days.

--
Simon Riggs www.2ndQuadrant.com

Andres Freund

andres@anarazel.de

about 16 years ago

In reply to: Simon Riggs (#2)

Re: Small locking bugs in hs

On Sunday 27 December 2009 23:10:09 Simon Riggs wrote:

On Sun, 2009-12-27 at 20:12 +0100, Andres Freund wrote:

While unlikely to cause issues two new LWLockAcquire calls use the wrong
locking mode.
Patch attached.

It's important to explain why you think something is a bug, rather than
make that claim on its own.

Youre right.

My idea was that another SIGINT handler (e.g. normal client query cancel) is
running while CONFLICT_MODE_FATAL is issued that might get ignored. Thus the
fatal error might get ignored.
But actually that was mostly my gut feeling - I am not really understanding
the whole query cancellation process yet.
(Its not exactly unlikely that my patch does not fix that though)

I think that if under protection of a shared lock the protected objects gets
altered, that definitely needs to be commented from my pov...
In the reverse case its not as important but still a good idea.

However, I think I see a different issue with conflict handling, so
looking at this again was worthwhile, thanks. Will come back with more
over next few days.

Cool.

Andres

Andres Freund

andres@anarazel.de

about 16 years ago

In reply to: Simon Riggs (#2)

Re: Small locking bugs in hs

Hi Simon,

Btw, dont understand my questions as criticism or such. I am really impressed
by the quality of the HS patch - many thanks to you, heikki and all the
others.

Andres

Simon Riggs

simon@2ndQuadrant.com

about 16 years ago

In reply to: Andres Freund (#4)

Re: Small locking bugs in hs

On Mon, 2009-12-28 at 01:22 +0100, Andres Freund wrote:

Btw, dont understand my questions as criticism or such.

I didn't take them that way. Your questions and bug reports are welcome.

It was important that HS was released in Alpha so that we can shake out
bugs, issues and concerns early enough to get as many of them fixed in
this release as possible. It is also important that I fix remaining
issues in priority order.

--
Simon Riggs www.2ndQuadrant.com

Robert Haas

robertmhaas@gmail.com

about 16 years ago

In reply to: Andres Freund (#1)

Re: Small locking bugs in hs

On Sun, Dec 27, 2009 at 2:12 PM, Andres Freund <andres@anarazel.de> wrote:

While unlikely to cause issues two new LWLockAcquire calls use the wrong
locking mode.
Patch attached.

Does it make sense to add this to the 2010-01 CommitFest so we don't
lose track of it?

...Robert

Simon Riggs

simon@2ndQuadrant.com

almost 16 years ago

In reply to: Robert Haas (#6)

Re: Small locking bugs in hs

On Thu, 2010-01-07 at 13:16 -0500, Robert Haas wrote:

On Sun, Dec 27, 2009 at 2:12 PM, Andres Freund <andres@anarazel.de> wrote:

While unlikely to cause issues two new LWLockAcquire calls use the wrong
locking mode.
Patch attached.

Does it make sense to add this to the 2010-01 CommitFest so we don't
lose track of it?

No. As mentioned upthread, this is not a bug.

I also mentioned having a separate, related thought about this, which
has already been handled. There are no further actions required on this.

--
Simon Riggs www.2ndQuadrant.com

Andres Freund

andres@anarazel.de

almost 16 years ago

In reply to: Simon Riggs (#7)

Re: Small locking bugs in hs

On Saturday 16 January 2010 12:32:35 Simon Riggs wrote:

On Thu, 2010-01-07 at 13:16 -0500, Robert Haas wrote:

On Sun, Dec 27, 2009 at 2:12 PM, Andres Freund <andres@anarazel.de> wrote:

While unlikely to cause issues two new LWLockAcquire calls use the
wrong locking mode.
Patch attached.

Does it make sense to add this to the 2010-01 CommitFest so we don't
lose track of it?

No. As mentioned upthread, this is not a bug.

Could you also mention in a littlebit more detail why not?

Thanks,

Andres

Simon Riggs

simon@2ndQuadrant.com

almost 16 years ago

In reply to: Andres Freund (#8)

Re: Small locking bugs in hs

On Wed, 2010-01-20 at 04:47 +0100, Andres Freund wrote:

On Saturday 16 January 2010 12:32:35 Simon Riggs wrote:

No. As mentioned upthread, this is not a bug.

Could you also mention in a little bit more detail why not?

When a cleanup record arrives without a latestRemovedXid we are forced
to assume that the xid could be as late as latestCompletedXid.
Regrettably we aren't certain which of the xids are still there since it
is possible that earlier xids in KnownAssignedXids are actually FATAL
errors that did not write abort records. So we need to conflict with all
current snapshots whose xmin is less than latestCompletedXid to be safe.
This can cause false positives in our assessment of which vxids
conflict.

By using exclusive lock we prevent new snapshots from being taken while
we work out which snapshots to conflict with. This protects those new
snapshots from also being included in our conflict list.

After the lock is released, we allow snapshots again. It is possible
that we arrive at a snapshot that is identical to one that we just
decided we should conflict with. This a case of false positives, not an
actual problem.

There are two cases: (1) if we were correct in using latestCompletedXid
then that means that all xids in the snapshot lower than that are FATAL
errors, so not xids that ever commit. We can make no visibility errors
if we allow such xids into the snapshot. (2) if we erred on the side of
caution and in fact the latestRemovedXid should have been earlier than
latestCompletedXid then we conflicted with a snapshot needlessly. Taking
another identical snapshot is OK, because the earlier conflicted
snapshot was a false positive.

In either case, a snapshot taken after conflict assessment will still be
valid and non-conflicting even if an identical snapshot that existed
before conflict assessment was assessed as conflicting.

If we allowed concurrent snapshots while we were deciding who to
conflict with we would need to include all concurrent snapshotters in
the conflict list as well. We'd have difficulty in working out exactly
who that was, so it is happier for all concerned if we take an exclusive
lock.

It also means that users waiting for a snapshot is a good thing, since
it is more likely that they will live longer after having waited. So its
not a bug for us to use exclusive lock and is actually desirable.

We could reduce false positives by having the master calculate the exact
xmin each time it issues an XLOG_BTREE_DELETE record. That would
introduce more contention since that happens during btree split
operations, so might be counter productive.

--
Simon Riggs www.2ndQuadrant.com

#10

Andres Freund

andres@anarazel.de

almost 16 years ago

In reply to: Simon Riggs (#9)

Re: Small locking bugs in hs

On Wednesday 20 January 2010 12:59:40 Simon Riggs wrote:

On Wed, 2010-01-20 at 04:47 +0100, Andres Freund wrote:

On Saturday 16 January 2010 12:32:35 Simon Riggs wrote:

No. As mentioned upthread, this is not a bug.

Could you also mention in a little bit more detail why not?

When a cleanup record arrives without a latestRemovedXid we are forced
to assume that the xid could be as late as latestCompletedXid.
Regrettably we aren't certain which of the xids are still there since it
is possible that earlier xids in KnownAssignedXids are actually FATAL
errors that did not write abort records. So we need to conflict with all
current snapshots whose xmin is less than latestCompletedXid to be safe.
This can cause false positives in our assessment of which vxids
conflict.
By using exclusive lock we prevent new snapshots from being taken while
we work out which snapshots to conflict with. This protects those new
snapshots from also being included in our conflict list.

After the lock is released, we allow snapshots again. It is possible
that we arrive at a snapshot that is identical to one that we just
decided we should conflict with. This a case of false positives, not an
actual problem.

There are two cases: (1) if we were correct in using latestCompletedXid
then that means that all xids in the snapshot lower than that are FATAL
errors, so not xids that ever commit. We can make no visibility errors
if we allow such xids into the snapshot. (2) if we erred on the side of
caution and in fact the latestRemovedXid should have been earlier than
latestCompletedXid then we conflicted with a snapshot needlessly. Taking
another identical snapshot is OK, because the earlier conflicted
snapshot was a false positive.

In either case, a snapshot taken after conflict assessment will still be
valid and non-conflicting even if an identical snapshot that existed
before conflict assessment was assessed as conflicting.

If we allowed concurrent snapshots while we were deciding who to
conflict with we would need to include all concurrent snapshotters in
the conflict list as well. We'd have difficulty in working out exactly
who that was, so it is happier for all concerned if we take an exclusive
lock.

It also means that users waiting for a snapshot is a good thing, since
it is more likely that they will live longer after having waited. So its
not a bug for us to use exclusive lock and is actually desirable.

We could reduce false positives by having the master calculate the exact
xmin each time it issues an XLOG_BTREE_DELETE record. That would
introduce more contention since that happens during btree split
operations, so might be counter productive.

Wow. Thanks for the extensive explanation!

I do understand it correctly that in CancelVirtualTransaction LW_SHARED is
taken only so that another transaction can finish during that time?

Andres

#11

Simon Riggs

simon@2ndQuadrant.com

almost 16 years ago

In reply to: Andres Freund (#10)

Re: Small locking bugs in hs

On Wed, 2010-01-20 at 14:13 +0100, Andres Freund wrote:

I do understand it correctly that in CancelVirtualTransaction
LW_SHARED is
taken only so that another transaction can finish during that time?

We're canceling one specific vxid, so no need to block other snapshots
from being taken.

Read only queries don't take ProcArrayLock when they complete and the
Startup process is the only process to record xact completion during
recovery.

--
Simon Riggs www.2ndQuadrant.com