Issues Scaling Postgres Concurrency

Started by Harrison Borges · about 3 years ago · 6 messages · general
#1 Harrison Borges
harrison@rlly.com

Hello everyone.

I’m running into severe performance problems with Postgres as I increase
the number of concurrent requests against my backend. I’ve identified that
the bottleneck is Postgres, and to simplify the test case, I created an
endpoint that only does a count query on a table with ~500k rows. At 5
concurrent users, the response time was 33ms, at 10 users it was 60ms, and
at 20 users it was 120ms.

As the number of concurrent users increases, the response time for the
count query also increases significantly, indicating that Postgres may not
be scaling well to handle the increasing load.

This manifests in essentially a server meltdown on production. As the
concurrent requests stack up, our server is stuck waiting for more and more
queries. Eventually requests begin timing out as they start taking over 30
seconds to respond.

Am I doing something obviously wrong? Does this sound like normal behavior?
I'm not very experienced at DB ops so I'm not 100% sure what to expect
here, but I have worked as a Software Engineer for over 10 years and I've
not encountered problems like this before.

I would appreciate any insights or advice on how to optimize Postgres for
high concurrency scenarios. Thank you in advance for your help!

#2 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Harrison Borges (#1)
Re: Issues Scaling Postgres Concurrency

On 3/13/23 9:24 AM, Harrison Borges wrote:

> Hello everyone.
>
> I’m running into severe performance problems with Postgres as I increase
> the number of concurrent requests against my backend. I’ve identified
> that the bottleneck is Postgres, and to simplify the test case, I
> created an endpoint that only does a count query on a table with ~500k
> rows. At 5 concurrent users, the response time was 33ms, at 10 users it
> was 60ms, and at 20 users it was 120ms.
>
> As the number of concurrent users increases, the response time for the
> count query also increases significantly, indicating that Postgres may
> not be scaling well to handle the increasing load.
>
> This manifests in essentially a server meltdown on production. As the
> concurrent requests stack up, our server is stuck waiting for more and
> more queries. Eventually requests begin timing out as they start taking
> over 30 seconds to respond.
>
> Am I doing something obviously wrong? Does this sound like normal
> behavior? I'm not very experienced at DB ops so I'm not 100% sure what
> to expect here, but I have worked as a Software Engineer for over 10
> years and I've not encountered problems like this before.

1) https://wiki.postgresql.org/wiki/Slow_Counting

2) Are you using connection pooling?
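The Slow_Counting wiki page linked above explains that an exact count(*) must scan the table, and suggests reading the planner's statistics when an estimate is good enough. A minimal sketch of that approach (the table name is a placeholder; the estimate is only as fresh as the last autovacuum/ANALYZE):

```sql
-- Fast, approximate row count from the planner's statistics,
-- instead of a full-table scan with count(*).
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'mytable';
```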

> I would appreciate any insights or advice on how to optimize Postgres
> for high concurrency scenarios. Thank you in advance for your help!

--
Adrian Klaver
adrian.klaver@aklaver.com

#3 Alan Hodgson
ahodgson@lists.simkin.ca
In reply to: Harrison Borges (#1)
Re: Issues Scaling Postgres Concurrency

On Mon, 2023-03-13 at 12:24 -0400, Harrison Borges wrote:

> Hello everyone.
>
> I’m running into severe performance problems with Postgres as I
> increase the number of concurrent requests against my backend. I’ve
> identified that the bottleneck is Postgres, and to simplify the
> test case, I created an endpoint that only does a count query on a
> table with ~500k rows. At 5 concurrent users, the response time was
> 33ms, at 10 users it was 60ms, and at 20 users it was 120ms.

I'm no expert on high concurrency, but for something this simple I'd
expect that you're just CPU bottlenecked. Count in PostgreSQL
actually has to read all the rows in the table. And yeah you can't do
too many of them at the same time.

#4 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Harrison Borges (#1)
Re: Issues Scaling Postgres Concurrency

On Mon, 2023-03-13 at 12:24 -0400, Harrison Borges wrote:

> I’m running into severe performance problems with Postgres as I increase the number
> of concurrent requests against my backend. I’ve identified that the bottleneck is
> Postgres, and to simplify the test case, I created an endpoint that only does a
> count query on a table with ~500k rows. At 5 concurrent users, the response time
> was 33ms, at 10 users it was 60ms, and at 20 users it was 120ms.
>
> As the number of concurrent users increases, the response time for the count query
> also increases significantly, indicating that Postgres may not be scaling well to
> handle the increasing load.
>
> This manifests in essentially a server meltdown on production. As the concurrent
> requests stack up, our server is stuck waiting for more and more queries.
> Eventually requests begin timing out as they start taking over 30 seconds to respond.
>
> Am I doing something obviously wrong? Does this sound like normal behavior?

That sounds like quite normal and expected behavior.

A query that counts the number of rows in a table of half a million rows is
quite expensive and keeps a CPU core busy for a while (provided everything is
cached). At some degree of parallelism, your CPU is overloaded, which leads
to non-linear slowdown.

The thing you are doing wrong is that you are putting too much load on this
system.

Yours,
Laurenz Albe

#5 Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Laurenz Albe (#4)
Re: Issues Scaling Postgres Concurrency

On 2023-03-14 22:47:43 +0100, Laurenz Albe wrote:

> On Mon, 2023-03-13 at 12:24 -0400, Harrison Borges wrote:
>
>> I’m running into severe performance problems with Postgres as I
>> increase the number of concurrent requests against my backend. I’ve
>> identified that the bottleneck is Postgres, and to simplify the test
>> case, I created an endpoint that only does a count query on a table
>> with ~500k rows. At 5 concurrent users, the response time was 33ms,
>> at 10 users it was 60ms, and at 20 users it was 120ms.
>>
>> [...]
>>
>> This manifests in essentially a server meltdown on production. As
>> the concurrent requests stack up, our server is stuck waiting for
>> more and more queries. Eventually requests begin timing out as they
>> start taking over 30 seconds to respond.
>>
>> Am I doing something obviously wrong? Does this sound like normal
>> behavior?

> That sounds like quite normal and expected behavior.
>
> A query that counts the number of rows in a table of half a million
> rows is quite expensive and keeps a CPU core busy for a while
> (provided everything is cached). At some degree of parallelism, your
> CPU is overloaded, which leads to non-linear slowdown.

The slowdown looks pretty linear to me (6ms per user).

The interesting thing is that on my laptop even two concurrent accesses
cause a 100% slowdown. I think this is because the task is actually
memory-bound: The cores may do the counting in parallel, but they have
to read the data from the same RAM (since it's too large to fit in the
CPU cache) and they have to take turns accessing it.

> The thing you are doing wrong is that you are putting too much load on
> this system.

Or possibly counting stuff far more often than necessary. If an exact
count is necessary more frequently than it changes it is probably a good
idea to store that somewhere and update it in a trigger.
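That trigger-maintained counter could look something like the sketch below; all table and function names are hypothetical. One caveat: a single counter row is a hot spot, since every insert or delete must lock it, so concurrent writers serialize on it.

```sql
-- Hypothetical sketch: keep an exact row count in a side table,
-- maintained by an AFTER trigger on the counted table.
CREATE TABLE item_count (n bigint NOT NULL);
INSERT INTO item_count SELECT count(*) FROM items;

CREATE FUNCTION items_count_trg() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE item_count SET n = n + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE item_count SET n = n - 1;
    END IF;
    RETURN NULL;  -- result ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_count
AFTER INSERT OR DELETE ON items
FOR EACH ROW EXECUTE FUNCTION items_count_trg();

-- Readers then do a cheap single-row lookup instead of a scan:
-- SELECT n FROM item_count;
```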

(If the count doesn't have to be totally up-to-date, caching it in the
application may be even better.)
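A minimal sketch of that application-side caching, in Python with illustrative names; the query callable and TTL are assumptions, and the TTL bounds how stale the served count can get:

```python
import time

class CachedCount:
    """Cache a row count in the application, recomputing it at most
    once per TTL window so concurrent requests mostly hit the cache."""

    def __init__(self, fetch, ttl_seconds=30):
        self.fetch = fetch          # callable that runs SELECT count(*)
        self.ttl = ttl_seconds
        self.value = None
        self.expires_at = 0.0

    def get(self):
        now = time.monotonic()
        if self.value is None or now >= self.expires_at:
            self.value = self.fetch()       # hit the database
            self.expires_at = now + self.ttl
        return self.value                    # serve cached value
```

A request handler would construct one shared instance, e.g. `counter = CachedCount(run_count_query)`, and call `counter.get()` instead of issuing the count query per request.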

hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         | -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter J. Holzer (#5)
Re: Issues Scaling Postgres Concurrency

"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:

> On 2023-03-14 22:47:43 +0100, Laurenz Albe wrote:
>
>> A query that counts the number of rows in a table of half a million
>> rows is quite expensive and keeps a CPU core busy for a while
>> (provided everything is cached). At some degree of parallelism, your
>> CPU is overloaded, which leads to non-linear slowdown.
>
> The interesting thing is that on my laptop even two concurrent accesses
> cause a 100% slowdown. I think this is because the task is actually
> memory-bound: The cores may do the counting in parallel, but they have
> to read the data from the same RAM (since it's too large to fit in the
> CPU cache) and they have to take turns accessing it.

I wondered if synchronize_seqscans might exacerbate this behavior by
encouraging more cores to be touching the same buffers at the same time.
I couldn't measure much difference between having it on vs. off in a
pgbench test with all the clients counting the same table ... but maybe
on other hardware the effect would show up.

regards, tom lane