LwLocks contention

Started by Michael Lewis · almost 4 years ago · 4 messages · general
#1 Michael Lewis
lewis.michaelr@gmail.com

We are occasionally seeing heavy CPU contention with hundreds of processes
active but waiting on a lightweight lock - usually lock manager or buffer
mapping, it seems. This is happening on VMs configured with about 64 CPUs
and 350 GB of RAM. While we would typically have only 30-100 concurrent
processes, there will suddenly be ~300, many showing active with an LWLock
wait, and they take much longer than usual. Any suggested options to monitor
for such issues, or logging to set up so the next incident can be debugged
properly?
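One low-overhead way to collect data for the next incident is to periodically sample pg_stat_activity for LWLock waits; a minimal sketch, where the table name `lwlock_samples` is made up for illustration:

```sql
-- Create an empty sample table shaped like the columns we care about:
CREATE TABLE IF NOT EXISTS lwlock_samples AS
  SELECT now() AS sampled_at, pid, state, wait_event_type,
         wait_event, query
    FROM pg_stat_activity
  WITH NO DATA;

-- Run this every few seconds (e.g. from cron or a background job)
-- to record who is waiting on which LWLock:
INSERT INTO lwlock_samples
SELECT now(), pid, state, wait_event_type, wait_event, query
  FROM pg_stat_activity
 WHERE wait_event_type = 'LWLock';
```

Querying the samples after an incident shows which wait_event values (e.g. lock_manager, buffer_mapping) spiked and which queries were involved. Note that log_lock_waits covers only heavyweight locks, so sampling like this is one of the few ways to see LWLock waits over time.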

It has seemed to me that this occurs when there are more than the usual
number of a particular process type, combined with some process that is
fairly heavy in its memory/disk usage. It has happened on various tenant
instances and with different application processes as well.

How might the use of huge pages (or transparent huge pages, or turning
them off) play into this scenario?

#2 Chris Bisnett
cbisnett@gmail.com
In reply to: Michael Lewis (#1)
Re: LwLocks contention

> We are occasionally seeing heavy CPU contention with hundreds of processes active but waiting on a lightweight lock - usually lock manager or buffer mapping it seems. This is happening with VMs configured with about 64 CPUs, 350GBs ram, and while we would typically only have 30-100 concurrent processes, there will suddenly be ~300 and many show active with LwLock and they take much longer than usual. Any suggested options to monitor for such issues or logging to setup so the next issue can be debugged properly?
>
> It has seemed to me that this occurs when there are more than the usual number of a particular process type and also something that is a bit heavy in usage of memory/disk. It has happened on various tenant instances and different application processes as well.
>
> Would/how might the use of huge pages (or transparent huge pages, or OFF) play into this scenario?

I've also been contending with a good bit of lightweight lock
contention that causes performance issues. Most often we see this with
the WAL write lock, but when we get too many parallel queries running
we end up in a "thundering herd" type of issue where the contention for
the lock manager lock consumes significant CPU resources, causing the
number of parallel queries to increase as more clients back up behind
the lock contention, leading to even more lock contention. When this
happens we have to pause our background workers long enough to allow
the lock contention to subside, and then we can resume the background
workers. When we hit the lock contention it's not a gradual
degradation; it goes immediately from nothing to 100% CPU
usage. The same is true when the lock contention clears - it goes
from 100% to nothing.

I've been working under the assumption that this has to do with our
native partitioning scheme and the fact that some queries cannot take
advantage of partition pruning because they don't contain the
partition column. My understanding is that when this happens ACCESS
SHARE locks have to be taken on all partitions as well as all associated
resources (indexes, sequences, etc.) and the act of taking and
releasing all of those locks will increase the lock contention
significantly. We're working to update our application so that we can
take advantage of the pruning. Are you also using native partitioning?
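The lock growth described above is easy to observe directly; this is a sketch assuming a hypothetical range-partitioned table `events` partitioned on `created_at` (all names here are made up):

```sql
-- Without a filter on the partition key, the planner must take an
-- ACCESS SHARE lock on every partition (plus its indexes) at plan time:
BEGIN;
EXPLAIN SELECT * FROM events WHERE tenant_id = 42;
SELECT count(*) FROM pg_locks WHERE pid = pg_backend_pid();
ROLLBACK;

-- With the partition key in the WHERE clause, plan-time pruning limits
-- the locks to the matching partition(s):
BEGIN;
EXPLAIN SELECT * FROM events
 WHERE created_at >= now() - interval '1 day'
   AND tenant_id = 42;
SELECT count(*) FROM pg_locks WHERE pid = pg_backend_pid();
ROLLBACK;
```

If the first count is far larger than the second, each such query is pushing per-partition lock traffic through the lock manager, which compounds under concurrency.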

- Chris

#3 Michael Lewis
lewis.michaelr@gmail.com
In reply to: Chris Bisnett (#2)
Re: LwLocks contention

On Thu, Apr 21, 2022 at 6:17 AM Chris Bisnett <cbisnett@gmail.com> wrote:

> We're working to update our application so that we can
> take advantage of the pruning. Are you also using native partitioning?

No partitioned tables at all, but we do have 1800 tables and some very
complex functions, some trigger insanity, huge number of indexes, etc etc.

There are lots of things to fix, but I just do not yet have a good sense of
the most important thing to address right now to reduce the odds of this
type of traffic jam occurring again. I very much appreciate you sharing
your experience. If I could reliably reproduce the issue or knew what data
points to start collecting going forward, that would at least give me
something to go on, but it feels like I am just waiting for it to happen
again and hope that some bit of information makes itself known that time.

Perhaps I should have posted this to the performance list instead of
general.

#4 Robert Treat
xzilla@users.sourceforge.net
In reply to: Michael Lewis (#3)
Re: LwLocks contention

On Mon, Apr 25, 2022 at 10:33 AM Michael Lewis <lewis.michaelr@gmail.com> wrote:

> On Thu, Apr 21, 2022 at 6:17 AM Chris Bisnett <cbisnett@gmail.com> wrote:
>
>> We're working to update our application so that we can
>> take advantage of the pruning. Are you also using native partitioning?
>
> No partitioned tables at all, but we do have 1800 tables and some very complex functions, some trigger insanity, huge number of indexes, etc etc.
>
> There are lots of things to fix, but I just do not yet have a good sense of the most important thing to address right now to reduce the odds of this type of traffic jam occurring again. I very much appreciate you sharing your experience. If I could reliably reproduce the issue or knew what data points to start collecting going forward, that would at least give me something to go on, but it feels like I am just waiting for it to happen again and hope that some bit of information makes itself known that time.
>
> Perhaps I should have posted this to the performance list instead of general.

In my experience lwlock contention (especially around buffer_mapping)
is more about concurrent write activity than any particular number of
tables/partitions. The first recommendation I would have is to install
pg_buffercache and see if you can capture some snapshots of what the
buffer cache looks like, especially looking for pinning_backends. I'd
also spend some time capturing pg_stat_activity output to see what
relations are in play for the queries that are waiting on said lwlocks
(especially trying to map write queries to tables/indexes).
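A snapshot along the lines suggested above might look like this; a minimal sketch, assuming the pg_buffercache extension can be installed (the join is simplified and ignores database/tablespace matching):

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Which relations occupy the most shared buffers, and how many of
-- those buffers are currently pinned by backends:
SELECT c.relname,
       count(*)                AS buffers,
       sum(b.usagecount)       AS total_usage,
       sum(b.pinning_backends) AS pinned
  FROM pg_buffercache b
  JOIN pg_class c ON c.relfilenode = b.relfilenode
 GROUP BY c.relname
 ORDER BY buffers DESC
 LIMIT 20;
```

Capturing this periodically alongside the pg_stat_activity output makes it possible to correlate buffer_mapping waits with the specific relations being churned through the buffer cache.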

Robert Treat
https://xzilla.net