Huge backend memory footprint

Started by Konstantin Knizhnik about 8 years ago, 5 messages

k.knizhnik@postgrespro.ru

about 8 years ago

During my experiments with a pthreads version of Postgres I found that I
cannot create more than 100k backends, even on a system with 4 TB of RAM.
I do not want to discuss here whether creating so many backends is a good
idea. Yes, most real production systems use pgbouncer or a similar
connection pooling tool to restrict the number of connections to the
database. But this system has 144 cores, and if we want to utilize all
system resources, the optimal number of backends will be several hundred
(especially taking into account that Postgres backends are usually not
CPU-bound and have to read data from disk, so the number of backends
should be much larger than the number of cores).

There are several per-backend arrays in Postgres whose size depends on
the maximum number of backends.
For max_connections=100000 Postgres allocates 26 MB for each snapshot:

CurrentRunningXacts->xids = (TransactionId *)
malloc(TOTAL_MAX_CACHED_SUBXIDS * sizeof(TransactionId));

This seems to be a large overestimate, because TOTAL_MAX_CACHED_SUBXIDS
is defined as:

    /*
     * During Hot Standby processing we have a data structure called
     * KnownAssignedXids, created in shared memory. Local data structures are
     * also created in various backends during GetSnapshotData(),
     * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
     * main structures created in those functions must be identically sized,
     * since we may at times copy the whole of the data structures around. We
     * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
     *
     * Ideally we'd only create this structure if we were actually doing hot
     * standby in the current run, but we don't know that yet at the time
     * shared memory is being set up.
     */
#define TOTAL_MAX_CACHED_SUBXIDS \
    ((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
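
The 26 MB figure can be reproduced with a small sketch. It assumes that PGPROC_MAX_CACHED_SUBXIDS is 64 and that TransactionId is a 32-bit integer, as in the PostgreSQL headers of that period; `snapshot_xids_bytes` and its `max_procs` parameter are hypothetical names standing in for the macro and for PROCARRAY_MAXPROCS.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's TransactionId (assumed to be 32 bits wide). */
typedef uint32_t TransactionId;

/* Assumed value of PGPROC_MAX_CACHED_SUBXIDS from PostgreSQL's proc.h. */
#define PGPROC_MAX_CACHED_SUBXIDS 64

/*
 * Bytes needed for one xids array sized by TOTAL_MAX_CACHED_SUBXIDS.
 * max_procs plays the role of PROCARRAY_MAXPROCS (hypothetical parameter).
 */
size_t
snapshot_xids_bytes(size_t max_procs)
{
    size_t total_max_cached_subxids =
        (PGPROC_MAX_CACHED_SUBXIDS + 1) * max_procs;

    return total_max_cached_subxids * sizeof(TransactionId);
}
```

Under these assumptions, `snapshot_xids_bytes(100000)` gives 65 * 100000 * 4 = 26,000,000 bytes, which matches the number quoted above.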

Another 12 MB array is used for deadlock detection:

#2 0x00000000008ac397 in InitDeadLockChecking () at deadlock.c:196
196            (EDGE *) palloc(maxPossibleConstraints * sizeof(EDGE));
(gdb) list
191        * last MaxBackends entries in possibleConstraints[] are reserved as
192        * output workspace for FindLockCycle.
193        */
194        maxPossibleConstraints = MaxBackends * 4;
195        possibleConstraints =
196            (EDGE *) palloc(maxPossibleConstraints * sizeof(EDGE));
197
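
The size of this allocation can be estimated in the same way. The EDGE layout below is an assumption based on my reading of deadlock.c (three pointers and two ints); PGPROC and LOCK are left as opaque types because only pointers to them are stored, and `possible_constraints_bytes` is a hypothetical helper.

```c
#include <stddef.h>

/* Opaque stand-ins for PostgreSQL's PGPROC and LOCK structures. */
struct PGPROC;
struct LOCK;

/* Assumed layout of EDGE in deadlock.c; not copied from the source tree. */
typedef struct
{
    struct PGPROC *waiter;   /* the waiting process */
    struct PGPROC *blocker;  /* the process it is waiting for */
    struct LOCK   *lock;     /* the lock being waited for */
    int            pred;     /* workspace for topological sort */
    int            link;     /* workspace for topological sort */
} EDGE;

/* Bytes palloc'd for possibleConstraints[] given MaxBackends. */
size_t
possible_constraints_bytes(size_t max_backends)
{
    size_t max_possible_constraints = max_backends * 4;

    return max_possible_constraints * sizeof(EDGE);
}
```

On a typical 64-bit platform sizeof(EDGE) should come to 32 bytes, so 100000 backends give 400000 * 32 = 12.8 MB, in line with the roughly 12 MB reported.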

As a result, the amount of dynamic memory allocated for each backend exceeds
50 MB, so 100k backends cannot be launched even on a system with 4 TB!
I think we should use a more accurate allocation policy in these
places and not waste memory in this manner (even if it is virtual).
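
A rough budget check for these figures, assuming binary units (4 TiB of RAM, 50 MiB per backend) and ignoring thread stacks, shared memory, and the fact that untouched virtual memory may never become resident. `max_backends` is a hypothetical helper, not a Postgres function.

```c
#include <stdint.h>

/*
 * Upper bound on the number of backends that fit in ram_bytes when each
 * one allocates per_backend_bytes. Returns 0 if per_backend_bytes is 0.
 */
uint64_t
max_backends(uint64_t ram_bytes, uint64_t per_backend_bytes)
{
    if (per_backend_bytes == 0)
        return 0;

    return ram_bytes / per_backend_bytes;
}
```

Calling `max_backends(4ULL << 40, 50ULL << 20)` yields 83886, which is short of 100k.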

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Claudio Freire

klaussfreire@gmail.com

about 8 years ago

In reply to: Konstantin Knizhnik (#1)

Re: Huge backend memory footprint

Don't forget each thread also has its own stack. I don't think you can
expect 100k threads to ever work.

If you get to that point, you really need to consider async query
execution. There was a lot of work related to that in other threads, you
may want to take a look.

Andres Freund

andres@anarazel.de

about 8 years ago

In reply to: Konstantin Knizhnik (#1)

Re: Huge backend memory footprint

Hi,

On 2017-12-22 16:07:23 +0300, Konstantin Knizhnik wrote:

> During my experiments with a pthreads version of Postgres I found that I
> cannot create more than 100k backends, even on a system with 4 TB of RAM.

I don't think this is a problem we need to address at this point. Would
you care to argue otherwise?

For now we have so many more relevant limitations, e.g. the O(N)
complexity of computing a snapshot, that I think addressing these
concerns is a waste of time at this point.
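
To illustrate the O(N) cost mentioned here, the following is a deliberately simplified model of collecting running transaction ids. It is not PostgreSQL's GetSnapshotData(): `ProcSlot` and `collect_running_xids` are hypothetical names, and locking, subtransactions, and xmin/xmax tracking are all left out.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical, much-simplified per-backend slot; not PostgreSQL's PGPROC. */
typedef struct
{
    TransactionId xid;  /* InvalidTransactionId when no transaction runs */
} ProcSlot;

/*
 * Copy the xids of all running transactions into xip[], which must have
 * room for nprocs entries. Returns the number of xids copied.
 * Every slot is visited once, so the cost is O(nprocs) per snapshot.
 */
size_t
collect_running_xids(const ProcSlot *procs, size_t nprocs, TransactionId *xip)
{
    size_t count = 0;

    for (size_t i = 0; i < nprocs; i++)
    {
        if (procs[i].xid != InvalidTransactionId)
            xip[count++] = procs[i].xid;
    }
    return count;
}
```

In this model the loop length is set by the number of slots rather than by the number of active transactions, so a larger max_connections makes every snapshot more expensive even when most slots are idle.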

Greetings,

Andres Freund

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

about 8 years ago

In reply to: Claudio Freire (#2)

Re: Huge backend memory footprint

On 22.12.2017 16:13, Claudio Freire wrote:

> Don't forget each thread also has its own stack. I don't think you can
> expect 100k threads to ever work.

Yes, Postgres requires a large stack. Although the minimum pthread stack size
is 16 kB, Postgres requires at least 512 kB, and even that is not enough to
pass the regression tests.
But even with a 1 MB thread stack, 100k connections require just (!)
100 GB. 50 MB per backend, however, is too much.
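
For reference, a minimal sketch of how a per-thread stack size can be set with the POSIX threads API, together with the stack arithmetic. `backend_main`, `run_backend_thread` and `total_stack_bytes` are hypothetical names; this is not code from the pthreads Postgres prototype. It may need to be built with -pthread, depending on the platform.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical thread body standing in for a backend's main loop. */
static void *
backend_main(void *arg)
{
    return arg;
}

/*
 * Start one thread with an explicit stack size and wait for it to finish.
 * Returns 0 on success, otherwise the pthread error code.
 */
int
run_backend_thread(size_t stack_bytes)
{
    pthread_attr_t attr;
    pthread_t      tid;
    int            rc;

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* Fails with EINVAL if stack_bytes is below PTHREAD_STACK_MIN. */
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, backend_main, NULL);

    pthread_attr_destroy(&attr);
    if (rc != 0)
        return rc;

    return pthread_join(tid, NULL);
}

/* Total address space reserved for the stacks of nthreads threads. */
size_t
total_stack_bytes(size_t nthreads, size_t stack_bytes)
{
    return nthreads * stack_bytes;
}
```

With a 1 MiB stack, `total_stack_bytes(100000, 1024 * 1024)` is 104,857,600,000 bytes, i.e. roughly the 100 GB mentioned above.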

> If you get to that point, you really need to consider async query
> execution. There was a lot of work related to that in other threads,
> you may want to take a look.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

about 8 years ago

In reply to: Andres Freund (#3)

Re: Huge backend memory footprint

On 22.12.2017 16:21, Andres Freund wrote:

> Hi,
>
> On 2017-12-22 16:07:23 +0300, Konstantin Knizhnik wrote:
>
> > During my experiments with a pthreads version of Postgres I found that I
> > cannot create more than 100k backends, even on a system with 4 TB of RAM.
>
> I don't think this is a problem we need to address at this point. Would
> you care to argue otherwise?

I am not sure that being able to support 100k sessions is really the most
pressing goal.
Some modern systems have hundreds of cores, and to utilize them
we need thousands of backends, but not yet 100k.

If Postgres could efficiently support an arbitrary number of backends
(although managing more than 100k network connections is problematic in
any case), then we would not need connection pooling. That has
some clear advantages: session variables, prepared statements, and so on.
Also, any intermediate proxy can only reduce speed.

But certainly there are many problems besides these few arrays, which
cause a quadratic increase in the backends' memory footprint as
max_connections grows.
Postgres backends are heavyweight: each of them maintains its own private
relation and catalog caches, prepared statement cache, and so on.
I expect that for really large databases these will consume most of the
backend's memory.
So maybe the right approach is to replace these private caches with
shared caches.
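
The quadratic growth can be made concrete with a short sketch: if each of n backends privately allocates an array with one slot per possible backend, the aggregate is proportional to n squared. `aggregate_bytes` is a hypothetical helper, and the 260 bytes per slot used in the note below assumes 65 cached xids of 4 bytes each, as for the snapshot array discussed earlier in the thread.

```c
#include <stdint.h>

/*
 * Aggregate bytes across all backends when each of n backends privately
 * allocates bytes_per_slot * n bytes. The result grows as n squared.
 */
uint64_t
aggregate_bytes(uint64_t n, uint64_t bytes_per_slot)
{
    uint64_t per_backend = bytes_per_slot * n;

    return per_backend * n;
}
```

With 260 bytes per slot this gives 260 MB in total for 1000 backends, 26 GB for 10000, and 2.6 TB for 100000: ten times the backends costs a hundred times the memory.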

> For now we have so many more relevant limitations, e.g. the O(N)
> complexity of computing a snapshot, that I think addressing these
> concerns is a waste of time at this point.

Well, we have the CSN patch, which is now stable enough and shows results
comparable with mainstream Postgres. So maybe snapshot calculation is
not the most critical problem...

> Greetings,
>
> Andres Freund

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company