Improve heavyweight locks instead of building new lock managers?
Hi,
I started this as a reply to
/messages/by-id/CAA4eK1JMgUfiTdAgr9k3nA4cdKvvruousKBg7FWTDNzQgBpOZA@mail.gmail.com
but the email seemed to morph into a distinct topic that I thought it
best to separate out.
We've had a number of cases where heavyweight locks turned out to be
too, well, heavy. And I don't mean cases where a fixed number of locks
can do the job. The last case of this is the above thread, where a
separate lock manager just for relation extension was implemented.
My problem with that patch is that it just seems like the wrong
direction architecturally to me. There's two main aspects to this:
1) It basically builds another, more lightweight but less capable,
lock manager that can lock more objects than we can have distinct
locks for. It is faster because it uses *one* hashtable without
conflict handling, because it has fewer lock modes, and because it
doesn't support detecting deadlocks. And probably some other things.
2) A lot of the contention around file extension comes from us doing
multiple expensive things under one lock (determining current
relation size, searching victim buffer, extending file), and in tiny
increments (growing a 1TB table by 8kb). This patch doesn't address
that at all.
I'm only planning to address 1) in this thread, and write a separate
email about 2) as part of the above thread. So more on that later.
With regard to 1):
To me this seems to go in the direction of having multiple bespoke lock
managers with slightly different feature sets, different debugging /
logging / monitoring support, with separate locking code each. That's
bad for maintainability.
I think one crucial piece of analysis that I've not seen fully done, is
why a separate lock manager is actually faster. An early (quite
different) version of this patch yielded the following micro-benchmark:
/messages/by-id/CAD21AoD1NenjTmD=5ypOBo9=FRtAtWVxUcHqHxY3wNos_5Bb5w@mail.gmail.com
On 2017-11-21 07:19:30 +0900, Masahiko Sawada wrote:
Also I've done a micro-benchmark; calling LockRelationForExtension and
UnlockRelationForExtension tightly in order to measure the number of
lock/unlock cycles per second. The result is,
PATCHED = 3.95892e+06 (cycles/sec)
HEAD = 1.15284e+06 (cycles/sec)
The patched version is 3 times faster than current HEAD.
but I've not actually seen an analysis as to *why* precisely there's a
threefold difference in cost, and whether the problem could instead be
addressed by making the difference much smaller.
My guess would be that the difference, to a very large degree, comes from
avoiding dynahash lookups. We currently have quite a few hashtable
lookups:
* LockRelationForExtension
** LockAcquireExtended does HASH_ENTERs in LockMethodLocalHash
** SetupLockInTable does HASH_ENTER_NULL in LockMethodLockHash
** SetupLockInTable does HASH_ENTER_NULL in LockMethodProcLockHash
* UnlockRelationForExtension
** LockRelease does HASH_FIND in LockMethodLocalHash
** CleanUpLock does HASH_REMOVE in LockMethodProcLockHash
** CleanUpLock does HASH_REMOVE in LockMethodLockHash
** RemoveLocalLock does HASH_REMOVE in LockMethodLocalHash
it's pretty easy to believe that a lock mapping like:
+ relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
with
+typedef struct RelExtLockTag
+{
+ Oid dbid; /* InvalidOid for a shared relation */
+ Oid relid;
+} RelExtLockTag;
...
+static inline uint32
+RelExtLockTargetTagToIndex(RelExtLockTag *locktag)
+{
+ return tag_hash(locktag, sizeof(RelExtLockTag)) % N_RELEXTLOCK_ENTS;
+}
and then just waiting till that specific hash entry becomes free, is
cheaper than *7* dynahash lookups. Especially if doing so doesn't
require any mapping locks.
but it's not at all clear to me whether we cannot sometimes / always
avoid some of this overhead. Improving lock.c would be a much bigger
win, without building separate infrastructure.
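For illustration, the single-slot mapping quoted above can be modeled in a few lines of standalone C. The types and hash here are toy stand-ins (not the patch's actual tag_hash() or N_RELEXTLOCK_ENTS), just enough to show the "exactly one slot per tag, no chaining" property:

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for N_RELEXTLOCK_ENTS */
#define N_RELEXTLOCK_ENTS 1024

typedef struct RelExtLockTag
{
	uint32_t	dbid;		/* InvalidOid for a shared relation */
	uint32_t	relid;
} RelExtLockTag;

/* Toy mixing hash standing in for tag_hash() */
static uint32_t
toy_tag_hash(const RelExtLockTag *tag)
{
	uint32_t	h = tag->dbid * 0x9E3779B1u ^ tag->relid * 0x85EBCA77u;

	h ^= h >> 16;
	return h;
}

/*
 * Every (dbid, relid) maps to exactly one of N_RELEXTLOCK_ENTS slots; two
 * different relations may collide and then simply wait on the same slot --
 * no chaining, no mapping lock, no dynahash lookups.
 */
static uint32_t
rel_ext_lock_index(const RelExtLockTag *tag)
{
	return toy_tag_hash(tag) % N_RELEXTLOCK_ENTS;
}
```

The point of the sketch is that the whole "lock table" is one array index computation, which is what the seven dynahash lookups are being compared against.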
E.g. we could:
- Have an extended LockAcquire API where the caller provides a stack
variable for caching information until the LockRelease call, avoiding
separate LockMethodLocalHash lookup for release.
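A minimal model of what such an extended API could look like. All names here are invented for illustration; this is not actual lock.c code, just the shape of "acquire fills a caller-provided handle, release uses it instead of a hash lookup":

```c
#include <assert.h>
#include <stddef.h>

/* Invented names; stands in for a LOCALLOCK entry */
typedef struct ToyLocalLock
{
	int			lockid;
	int			nlocal;			/* local hold count */
} ToyLocalLock;

/* Stack variable the caller provides, valid from acquire to release */
typedef struct ToyLockHandle
{
	ToyLocalLock *locallock;	/* cached pointer into the local table */
} ToyLockHandle;

#define MAX_LOCKS 16
static ToyLocalLock local_hash[MAX_LOCKS];	/* stands in for LockMethodLocalHash */

/* "Expensive" lookup standing in for a dynahash HASH_ENTER */
static ToyLocalLock *
local_hash_enter(int lockid)
{
	ToyLocalLock *ll = &local_hash[lockid % MAX_LOCKS];

	ll->lockid = lockid;
	return ll;
}

static void
toy_lock_acquire(int lockid, ToyLockHandle *handle)
{
	ToyLocalLock *ll = local_hash_enter(lockid);	/* one lookup, at acquire */

	ll->nlocal++;
	handle->locallock = ll;		/* cache for release */
}

static void
toy_lock_release(ToyLockHandle *handle)
{
	/* No HASH_FIND here: the handle already points at the entry. */
	assert(handle->locallock != NULL);
	handle->locallock->nlocal--;
	handle->locallock = NULL;
}
```

The release-side HASH_FIND in LockMethodLocalHash disappears entirely; the cost moves to the caller keeping a handle alive across the critical section.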
- Instead of always implementing HASH_REMOVE as a completely fresh
lookup in dynahash, we should provide an API for removing an object we
*know* is in the hashtable at a specific pointer. We're obviously
already relying on that address to stay constant, so we'd not lose
flexibility. With this separate API we can avoid the bucket lookup,
walking through each element in that bucket, and having to compare the
hash key every time (which is quite expensive for a full LOCKTAG).
For the biggest benefit, we'd have to make the bucket list doubly
linked, but even if we were to just go through from start, and just
compare entries by pointer, we'd win big.
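The shape of that "remove at a known pointer" operation, modeled on a toy doubly linked bucket. This is a sketch of the proposed direction, not current dynahash code (whose buckets are singly linked today): once the bucket chain is doubly linked, removal by element pointer is a constant-time unlink, with no bucket walk and no key comparison.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Entry
{
	struct Entry *prev;
	struct Entry *next;
	int			key;			/* stands in for a full LOCKTAG */
} Entry;

typedef struct Bucket
{
	Entry		head;			/* circular sentinel */
} Bucket;

static void
bucket_init(Bucket *b)
{
	b->head.prev = b->head.next = &b->head;
}

static void
bucket_insert(Bucket *b, Entry *e)
{
	e->next = b->head.next;
	e->prev = &b->head;
	b->head.next->prev = e;
	b->head.next = e;
}

/* The proposed API: unlink an entry we *know* is in the table, O(1). */
static void
entry_unlink(Entry *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static int
bucket_len(Bucket *b)
{
	int			n = 0;

	for (Entry *e = b->head.next; e != &b->head; e = e->next)
		n++;
	return n;
}
```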
- We should try to figure out whether we can avoid needing the separate
lock table for PROCLOCKs. As far as I can tell there's not a single
reason for it to be a hashtable, rather than something like
NUM_LOCK_PARTITIONS lists of free PROCLOCKs.
- Whenever we do a relation extension, we better already hold a normal
relation lock. We don't actually need to have an entirely distinct
type of lock for extensions, that theoretically could support N
extension locks for each relation. The fact that these are associated
could be utilized in different ways:
- We could define relation extension as a separate lock level only
conflicting with itself (and perhaps treat AELs as including
it). That'd avoid needing separate shared memory states in many
cases.
This might be too problematic, because extension locks don't have
the same desired behaviour around group locking (and a small army of
other issues).
- We could keep a separate extension lock cached inside the relation
lock. The first time a transaction needs to extend, it does the
normal work, and after that stores the PROCLOCK in the LOCALLOCK (or
something like that). Once extension is done, don't release the lock
entirely, but instead just reduce the lock level to a new
non-conflicting lock level
Alternatively we could implement something very similar outside of
lock.c, e.g. by caching the LOCALLOCK* (possibly identified by an
integer or such) in RelationData. That'd require enough support
from lock.c to be able to make that really cheap.
The second big difference between lock.c and the proposed relation
extension lock is that it doesn't have a "mapping" lock. It does instead
solve the mapping without a lock by having exactly one potential
location for each lock, using atomic ops to manipulate the lock state,
and deals with conflicts by waiting for the bucket to become free.
I don't think it's realistic to not use locks to protect the lock
mapping hashtable, nor does it seem likely we can make the manipulation
of individual lock states entirely atomic. But it very well seems
possible to reduce the likelihood of contention:
We could e.g. split the maintenance of the hashtable(s) from protecting
the state of individual locks. The locks protecting the hashtable would
just be held long enough to change a "pin count" of each lock, and then
a per LOCK lwlock would protect each heavyweight lock's state. We'd not
need to lock the partition locks for quite a few cases, because there's
many locks in a loaded system that always have lockers. There'd be cost
to that, needing more atomic ops in some cases, but I think it'd reduce
contention tremendously, and it'd open a lot more optimization
potential. It seems plausible that we could even, as a followup, avoid
needing the partition locks for some lock releases (e.g. when there are
other holders), and we might even be able to avoid them for
acquisitions, by caching the LOCK inside LOCALLOCK and re-verifying its
identity.
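As a toy, single-threaded illustration of that split -- partition lock held only to bump a pin count, per-lock lock protecting the state -- with all names invented:

```c
#include <assert.h>
#include <stdatomic.h>

/* Invented structure; not the real LOCK */
typedef struct ToyLock
{
	atomic_int	pincount;		/* bumped under the (brief) partition lock */
	int			nGranted;		/* changed under the per-lock lwlock only */
} ToyLock;

static void
pin_lock(ToyLock *lock)
{
	/* ... partition lock would be held here, just long enough ... */
	atomic_fetch_add(&lock->pincount, 1);
	/* ... partition lock released ... */
}

static void
unpin_lock(ToyLock *lock)
{
	atomic_fetch_sub(&lock->pincount, 1);
}

static void
grant_lock(ToyLock *lock)
{
	/*
	 * ... per-lock lwlock would be held here instead of the partition
	 * lock, so unrelated locks in the same partition don't contend ...
	 */
	lock->nGranted++;
}
```

The pin count keeps the LOCK from being removed from the hashtable while its state is being manipulated under its own lock, which is what lets the partition lock be dropped early.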
Thoughts?
Greetings,
Andres Freund
Hi,
Some of the discussions about improving the locking code, in particular
the group locking / deadlock detector discussion with Robert, again made
me look at lock.c. While looking at how much work it'd be to a) remove
the PROCLOCK hashtable b) move more of the group locking logic into
lock.c, rather than smarts in deadlock.c, I got sidetracked by all the
verbose and hard to read SHM_QUEUE code.
Here's a *draft* patch series for removing all use of SHM_QUEUE, and
subsequently removing SHM_QUEUE. There's a fair bit of polish needed,
but I do think it makes the code considerably easier to read
(particularly for predicate.c). The diffstat is nice too:
src/include/lib/ilist.h | 132 +++++++++++++++++----
src/include/replication/walsender_private.h | 3 +-
src/include/storage/lock.h | 10 +-
src/include/storage/predicate_internals.h | 49 +++-----
src/include/storage/proc.h | 18 +--
src/include/storage/shmem.h | 22 ----
src/backend/access/transam/twophase.c | 4 +-
src/backend/lib/ilist.c | 8 +-
src/backend/replication/syncrep.c | 89 ++++++--------
src/backend/replication/walsender.c | 2 +-
src/backend/storage/ipc/Makefile | 1 -
src/backend/storage/ipc/shmqueue.c | 190 ------------------------------
src/backend/storage/lmgr/deadlock.c | 76 +++++-------
src/backend/storage/lmgr/lock.c | 129 ++++++++------------
src/backend/storage/lmgr/predicate.c | 692 +++++++++++++++++++++++++++++++++++------------------------------------------------------------------------
src/backend/storage/lmgr/proc.c | 197 +++++++++++++------------------
16 files changed, 569 insertions(+), 1053 deletions(-)
I don't want to invest a lot of time into this if there's not some
agreement that this is a good direction to go into. So I'd appreciate a
few cursory looks before spending more time.
Overview:
0001: Add additional prev/next & detached node functions to ilist.
I think the additional prev/next accessors are nice. I am less
convinced by the 'detached' stuff, but it's used by some SHM_QUEUE
users. I don't want to make the plain dlist_delete reset the node's
prev/next pointers; it's not needed in the vast majority of cases...
0002: Use dlists instead of SHM_QUEUE for heavyweight locks.
I mostly removed the reliance on PGPROC.links needing to be the
first struct member - that seemed odd.
I think we should rename PROC_QUEUE.links; elsewhere that name is used
for list membership nodes, so it's imo confusing/odd.
0003: Use dlist for syncrep queue.
This seems fairly simple to me.
0004: Use dlists for predicate locking.
Unfortunately pretty large. I think it's a huge improvement, but it's
also subtle code. Wonder if there's something better to do here wrt
OnConflict_CheckForSerializationFailure?
0005: Remove now unused SHMQueue*.
0006: Remove PROC_QUEUE.size.
I'm not sure whether this is a good idea. I was looking primarily at
that because I thought it'd allow us to remove PROC_QUEUE as a whole
if we wanted to. But as PROC_QUEUE.size doesn't really seem to buy us
much, we should perhaps just do something roughly like in the patch?
Greetings,
Andres Freund
Attachments:
v1-0001-Add-additional-prev-next-detached-node-functions-.patch (text/x-diff; charset=us-ascii, +113 -28)
v1-0002-Use-dlists-instead-of-SHM_QUEUE-for-heavyweight-l.patch (text/x-diff; charset=us-ascii, +161 -235)
v1-0003-Use-dlist-for-syncrep-queue.patch (text/x-diff; charset=us-ascii, +41 -58)
v1-0004-Use-dlists-for-predicate-locking.patch (text/x-diff; charset=us-ascii, +241 -501)
v1-0005-Remove-now-unused-SHMQueue.patch (text/x-diff; charset=us-ascii, +0 -214)
v1-0006-Remove-PROC_QUEUE.size.patch (text/x-diff; charset=us-ascii, +23 -35)
Hi Andres,
On Thu, Feb 20, 2020 at 1:14 PM Andres Freund <andres@anarazel.de> wrote:
Here's a *draft* patch series for removing all use of SHM_QUEUE, and
subsequently removing SHM_QUEUE. There's a fair bit of polish needed,
but I do think it makes the code considerably easier to read
(particularly for predicate.c). The diffstat is nice too:
0005: Remove now unused SHMQueue*.
0006: Remove PROC_QUEUE.size.
Maybe you're aware, but there still seem to be places using it. In
LOCK_DEBUG build:
lock.c: In function ‘LOCK_PRINT’:
lock.c:320:20: error: ‘PROC_QUEUE’ {aka ‘const struct PROC_QUEUE’} has
no member named ‘size’
lock->waitProcs.size,
lock.c: In function ‘DumpLocks’:
lock.c:3906:2: error: unknown type name ‘SHM_QUEUE’; did you mean ‘SI_QUEUE’?
Fwiw, I for one am all for removing specialized data structures when
more widely used data structures will do, especially if that
specialization is no longer relevant.
Thanks,
Amit
On Thu, Feb 20, 2020 at 5:14 PM Andres Freund <andres@anarazel.de> wrote:
16 files changed, 569 insertions(+), 1053 deletions(-)
Nice!
Some comments on 0001, 0003, 0004:
Subject: [PATCH v1 1/6] Add additional prev/next & detached node functions to
+extern void dlist_check(const dlist_head *head);
+extern void slist_check(const slist_head *head);
I approve of the incidental constification in this patch.
+/*
+ * Like dlist_delete(), but also sets next/prev to NULL to signal not being in
+ * list.
+ */
+static inline void
+dlist_delete_thoroughly(dlist_node *node)
+{
+ node->prev->next = node->next;
+ node->next->prev = node->prev;
+ node->next = NULL;
+ node->prev = NULL;
+}
Instead of introducing this strange terminology, why not just have the
callers do ...
dlist_delete(node);
dlist_node_init(node);
..., or perhaps supply dlist_delete_and_reinit(node) that does exactly
that? That is, reuse the code and terminology.
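A toy model of the two operations being discussed (not the real ilist.h code), showing that "delete thoroughly" is just delete followed by resetting the node's pointers, and that the NULLed pointers are what a detached-node check keys off:

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_dlist_node
{
	struct toy_dlist_node *prev;
	struct toy_dlist_node *next;
} toy_dlist_node;

/* Plain delete: unlink, but leave the node's own pointers dangling */
static void
toy_dlist_delete(toy_dlist_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
}

/* Reset the node so a later is-detached check can work */
static void
toy_dlist_node_init(toy_dlist_node *node)
{
	node->prev = node->next = NULL;
}

static int
toy_dlist_node_is_detached(const toy_dlist_node *node)
{
	assert((node->next == NULL && node->prev == NULL) ||
		   (node->next != NULL && node->prev != NULL));
	return node->next == NULL;
}
```

So dlist_delete_and_reinit(node) would simply be toy_dlist_delete() followed by toy_dlist_node_init(), which is the "reuse the code and terminology" point.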
+/*
+ * Check if node is detached. A node is only detached if it either has been
+ * initialized with dlist_init_node(), or deleted with
+ * dlist_delete_thoroughly().
+ */
+static inline bool
+dlist_node_is_detached(const dlist_node *node)
+{
+ Assert((node->next == NULL && node->prev == NULL) ||
+ (node->next != NULL && node->prev != NULL));
+
+ return node->next == NULL;
+}
How about dlist_node_is_linked()? I don't like introducing random new
verbs when we already have 'linked' in various places, and these
things are, y'know, linked lists.
Subject: [PATCH v1 3/6] Use dlist for syncrep queue.
LGTM.
Subject: [PATCH v1 4/6] Use dlists for predicate locking.
+ dlist_foreach(iter, &unconstify(SERIALIZABLEXACT *, reader)->outConflicts)
Yuck... I suppose you could do this:
- dlist_foreach(iter, &unconstify(SERIALIZABLEXACT *, reader)->outConflicts)
+ dlist_foreach_const(iter, &reader->outConflicts)
... given:
+/* Variant for when you have a pointer to const dlist_head. */
+#define dlist_foreach_const(iter, lhead) \
+ for (AssertVariableIsOfTypeMacro(iter, dlist_iter), \
+ AssertVariableIsOfTypeMacro(lhead, const dlist_head *), \
+ (iter).end = (dlist_node *) &(lhead)->head, \
+ (iter).cur = (iter).end->next ? (iter).end->next : (iter).end; \
+ (iter).cur != (iter).end; \
+ (iter).cur = (iter).cur->next)
+
... or find a way to make dlist_foreach() handle that itself, which
seems pretty reasonable given its remit to traverse lists without
modifying them, though perhaps that would require a different iterator
type...
Otherwise looks OK to me and passes various tests I threw at it.
Hi,
On 2020-02-21 12:40:06 +1300, Thomas Munro wrote:
On Thu, Feb 20, 2020 at 5:14 PM Andres Freund <andres@anarazel.de> wrote:
16 files changed, 569 insertions(+), 1053 deletions(-)
Nice!
Thanks!
Some comments on 0001, 0003, 0004:
Subject: [PATCH v1 1/6] Add additional prev/next & detached node functions to
+extern void dlist_check(const dlist_head *head);
+extern void slist_check(const slist_head *head);
I approve of the incidental constification in this patch.
It was just necessary fallout :)
+/*
+ * Like dlist_delete(), but also sets next/prev to NULL to signal not being in
+ * list.
+ */
+static inline void
+dlist_delete_thoroughly(dlist_node *node)
+{
+ node->prev->next = node->next;
+ node->next->prev = node->prev;
+ node->next = NULL;
+ node->prev = NULL;
+}
Instead of introducing this strange terminology, why not just have the
callers do ...
dlist_delete(node);
dlist_node_init(node);
There's quite a few callers in predicate.c - I actually did that first.
..., or perhaps supply dlist_delete_and_reinit(node) that does exactly
that? That is, reuse the code and terminology.
Yea, that might be better, but see the paragraph below. I quite dislike
adding any "empty node" state.
+/*
+ * Check if node is detached. A node is only detached if it either has been
+ * initialized with dlist_init_node(), or deleted with
+ * dlist_delete_thoroughly().
+ */
+static inline bool
+dlist_node_is_detached(const dlist_node *node)
+{
+ Assert((node->next == NULL && node->prev == NULL) ||
+        (node->next != NULL && node->prev != NULL));
+
+ return node->next == NULL;
+}
How about dlist_node_is_linked()? I don't like introducing random new
verbs when we already have 'linked' in various places, and these
things are, y'know, linked lists.
Well, but that doesn't signal that you can't just delete and have
dlist_node_is_linked() work. I *want* it to sound "different". We could
of course make delete always do this, but I don't want to add that
overhead unnecessarily.
Subject: [PATCH v1 4/6] Use dlists for predicate locking.
+ dlist_foreach(iter, &unconstify(SERIALIZABLEXACT *, reader)->outConflicts)
Yuck...
It doesn't seem *that* bad to me; at least it signals properly what
we're doing, and only does so in one place.
I suppose you could do this:
- dlist_foreach(iter, &unconstify(SERIALIZABLEXACT *, reader)->outConflicts)
+ dlist_foreach_const(iter, &reader->outConflicts)
We'd need a different iterator type too, I think? Because the iterator
itself can't be constant, but we'd want the elements themselves to be
pointers to constants.
const just isn't a very granular thing in C :(.
... given:
+/* Variant for when you have a pointer to const dlist_head. */
+#define dlist_foreach_const(iter, lhead) \
+ for (AssertVariableIsOfTypeMacro(iter, dlist_iter), \
+      AssertVariableIsOfTypeMacro(lhead, const dlist_head *), \
+      (iter).end = (dlist_node *) &(lhead)->head, \
+      (iter).cur = (iter).end->next ? (iter).end->next : (iter).end; \
+      (iter).cur != (iter).end; \
+      (iter).cur = (iter).cur->next)
+
... or find a way to make dlist_foreach() handle that itself, which
seems pretty reasonable given its remit to traverse lists without
modifying them, though perhaps that would require a different iterator
type...
I was trying that first, but I don't easily see how we can do
so. Iterating over a non-constant list with dlist_foreach obviously
still allows you to manipulate the list members. Thus dlist_iter.cur
can't be a 'pointer to const'. Whereas that's arguably what'd be needed
for a correct dlist_foreach() of a constant list?
We could just accept const pointers for dlist_foreach(), but then we'd
*always* accept them, and we'd thus unconditionally have iter.cur as
non-const. Would that be better?
Otherwise looks OK to me and passes various tests I threw at it.
Thanks!
Greetings,
Andres Freund
On Mon, Feb 10, 2020 at 11:22 PM Andres Freund <andres@anarazel.de> wrote:
To me this seems to go in the direction of having multiple bespoke lock
managers with slightly different feature sets, different debugging /
logging / monitoring support, with separate locking code each. That's
bad for maintainability.
I think it would help a lot if the lock manager were not such a
monolithic thing. For instance, suppose that instead of having the
deadlock detector be part of the lock manager, it's a separate thing
that integrates with the lock manager. Instead of only knowing about
waits for heavyweight locks, it knows about whatever you want to tell
it about. Then, for instance, you could possibly tell it that process
A is waiting for a cleanup lock while process B holds a pin, possibly
detecting a deadlock we can't notice today. There are likely other
cases as well.
E.g. we could:
- Have an extended LockAcquire API where the caller provides a stack
variable for caching information until the LockRelease call, avoiding
separate LockMethodLocalHash lookup for release.
Maybe. Seems like it might make things a little clunky for the caller,
though. If you just kept a stack of locks acquired and checked the
lock being released against the top of the stack, you'd probably catch
a large percentage of cases. Or, the caller could pass a flag
indicating whether they intend to release the lock prior to the end of
the transaction (which could be cross-checked by assertions). Any that
you intend to release early go on a stack.
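That stack-check idea could look roughly like this toy sketch (all names invented; the slow path would be today's LockMethodLocalHash lookup):

```c
#include <assert.h>
#include <stdbool.h>

#define STACK_DEPTH 32

static int	lock_stack[STACK_DEPTH];
static int	lock_stack_top = 0;		/* number of entries on the stack */

/* Caller indicated it intends to release this lock early */
static void
remember_early_release(int lockid)
{
	assert(lock_stack_top < STACK_DEPTH);
	lock_stack[lock_stack_top++] = lockid;
}

/* Returns true if the release hit the fast path (top of stack). */
static bool
release_lock(int lockid)
{
	if (lock_stack_top > 0 && lock_stack[lock_stack_top - 1] == lockid)
	{
		lock_stack_top--;
		return true;			/* fast path: no hashtable lookup */
	}

	/* slow path: would fall back to the LockMethodLocalHash lookup */
	return false;
}
```

Since early releases in practice tend to be LIFO (the most recently taken short-term lock is released first), checking only the top of the stack catches most cases while keeping the bookkeeping trivial.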
- Instead of always implementing HASH_REMOVE as a completely fresh
lookup in dynahash, we should provide an API for removing an object we
*know* is in the hashtable at a specific pointer. We're obviously
already relying on that address to stay constant, so we'd not lose
flexibility. With this separate API we can avoid the bucket lookup,
walking through each element in that bucket, and having to compare the
hash key every time (which is quite expensive for a full LOCKTAG).
That makes a lot of sense, although I wonder if we should go further
and replace dynahash entirely.
- We should try to figure out whether we can avoid needing the separate
lock table for PROCLOCKs. As far as I can tell there's not a single
reason for it to be a hashtable, rather than something like
NUM_LOCK_PARTITIONS lists of free PROCLOCKs.
I agree that the PROCLOCK thing seems ugly and inefficient. I'm not
sure that a bunch of lists is the best answer, though it might be. One
case to consider is when a lock is initially acquired via the
fast-path and then, because of a conflicting lock acquisition,
transferred by *some other process* to the main lock table. If this
occurs, the original locker must be able to find its PROCLOCK. It
doesn't have to be crazy efficient because it shouldn't happen very
often, but it shouldn't suck too much.
- Whenever we do a relation extension, we better already hold a normal
relation lock. We don't actually need to have an entirely distinct
type of lock for extensions, that theoretically could support N
extension locks for each relation. The fact that these are associated
could be utilized in different ways:
- We could define relation extension as a separate lock level only
conflicting with itself (and perhaps treat AELs as including
it). That'd avoid needing separate shared memory states in many
cases.
This might be too problematic, because extension locks don't have
the same desired behaviour around group locking (and a small army of
other issues).
Yeah, I don't think that's likely to work out very nicely.
- We could keep a separate extension lock cached inside the relation
lock. The first time a transaction needs to extend, it does the
normal work, and after that stores the PROCLOCK in the LOCALLOCK (or
something like that). Once extension is done, don't release the lock
entirely, but instead just reduce the lock level to a new
non-conflicting lock level
Alternatively we could implement something very similar outside of
lock.c, e.g. by caching the LOCALLOCK* (possibly identified by an
integer or such) in RelationData. That'd require enough support
from lock.c to be able to make that really cheap.
Not sure I quite see what you mean here.
We could e.g. split the maintenance of the hashtable(s) from protecting
the state of individual locks. The locks protecting the hashtable would
just be held long enough to change a "pin count" of each lock, and then
a per LOCK lwlock would protect each heavyweight lock's state. We'd not
need to lock the partition locks for quite a few cases, because there's
many locks in a loaded system that always have lockers. There'd be cost
to that, needing more atomic ops in some cases, but I think it'd reduce
contention tremendously, and it'd open a lot more optimization
potential. It seems plausible that we could even, as a followup, avoid
needing the partition locks for some lock releases (e.g. when there are
other holders), and we might even be able to avoid them for
acquisitions, by caching the LOCK inside LOCALLOCK and re-verifying its
identity.
I agree. This seems worth exploring. The idea of caching the probable
location of the lock and re-pinning it to check whether it's the one
you expected seems like it would avoid false sharing in a lot of
practical cases.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 19, 2020 at 11:14 PM Andres Freund <andres@anarazel.de> wrote:
Here's a *draft* patch series for removing all use of SHM_QUEUE, and
subsequently removing SHM_QUEUE.
+1 for that idea.
But color me skeptical of what Thomas described as the "incidental
constification".
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company